support prefill cache mode use fia op #3652
Conversation
👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:
If CI fails, you can run linting and testing checks locally according to Contributing and Testing.
Code Review
This pull request adds support for the Fused Infer Attention (FIA) operator for prefill with cache hits on CANN 8.3. The changes introduce a new code path in the attention implementation and adjust the attention mask creation accordingly. While the changes are generally in the right direction, I have identified a critical issue regarding the incorrect sequence lengths being passed to the new operator, which could lead to incorrect attention results. Additionally, I've noted the use of a hardcoded value for an attention mask dimension, which could cause issues if model configurations change, and I've suggested a more robust alternative.
```
    input_layout="TND",
    block_size=block_size,
    actual_seq_lengths=attn_metadata.actual_seq_lengths_q,
    actual_seq_lengths_kv=attn_metadata.actual_seq_lengths_q,
```
The actual_seq_lengths_kv parameter is incorrectly set to attn_metadata.actual_seq_lengths_q, which represents the query lengths. In prefill scenarios, the key/value sequence lengths should correspond to the total context lengths, not just the new query tokens. This will likely lead to incorrect attention calculations because the model will attend to the wrong token range in the KV cache. The correct value, representing the context lengths, appears to be available in attn_metadata.seq_lens_list, which is used in other calls to this same operator within this file.
Suggested change:
```diff
-    actual_seq_lengths_kv=attn_metadata.actual_seq_lengths_q,
+    actual_seq_lengths_kv=attn_metadata.seq_lens_list,
```
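For reference, here is a minimal PyTorch sketch of why the query-side and KV-side length lists differ when part of the prompt is already in the KV cache (the numbers and variable names are illustrative, not taken from the PR):

```python
import torch

# Toy example: two requests whose prompt prefixes are already in the KV cache.
cached_lens = torch.tensor([96, 512])   # tokens already cached per request
new_q_lens = torch.tensor([32, 64])     # new prompt tokens prefilled this step

q_lens = new_q_lens                     # what the query-side lengths describe
kv_lens = cached_lens + new_q_lens      # full context the kernel must attend over

# Passing the query lengths on the KV side would restrict attention to only the
# newly prefilled tokens and silently ignore the cached prefix.
assert torch.all(kv_lens >= q_lens)
print(q_lens.tolist(), kv_lens.tolist())  # [32, 64] [128, 576]
```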
```
return ~torch.tril(torch.ones((2048, 2048),
                              dtype=torch.bool,
                              device=self.device)
                   )
```
The dimension 2048 for the attention mask is hardcoded. This can lead to runtime errors if the model's max_model_len is configured to be larger than 2048. To make the code more robust and maintainable, it's better to use a value derived from the model configuration, such as self.model_config.max_model_len.
If 2048 is a strict limitation of the underlying kernel for CANN 8.3, it should be defined as a named constant (e.g., _CANN_8_3_FIA_MAX_LEN = 2048) and an assertion should be added during initialization to ensure self.model_config.max_model_len does not exceed this limit.
Suggested change:
```diff
-    return ~torch.tril(torch.ones((2048, 2048),
-                                  dtype=torch.bool,
-                                  device=self.device)
-                       )
+    return ~torch.tril(torch.ones((self.model_config.max_model_len, self.model_config.max_model_len),
+                                  dtype=torch.bool,
+                                  device=self.device)
+                       )
```
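If 2048 really is a hard kernel limit on CANN 8.3, the named-constant alternative described above might look roughly like this sketch (the constant, class, and method names are assumptions, not code from the PR):

```python
import torch

# Hypothetical constant capturing the assumed CANN 8.3 FIA mask limit.
_CANN_8_3_FIA_MAX_LEN = 2048


class AttentionMaskSketch:

    def __init__(self, max_model_len: int, device: torch.device):
        # Fail fast at init time instead of hitting a kernel error at runtime.
        assert max_model_len <= _CANN_8_3_FIA_MAX_LEN, (
            f"max_model_len={max_model_len} exceeds the assumed CANN 8.3 "
            f"FIA mask limit of {_CANN_8_3_FIA_MAX_LEN}")
        self.device = device

    def build_mask(self) -> torch.Tensor:
        # True entries above the diagonal mark positions to be masked out.
        n = _CANN_8_3_FIA_MAX_LEN
        return ~torch.tril(
            torch.ones((n, n), dtype=torch.bool, device=self.device))
```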
Signed-off-by: shiyuan680 <[email protected]>
```
return self.attn_mask_builder.get_attn_mask(
    128, self.dtype, self.device)
if torch.version.cann.startswith("8.3"):
    return self.attn_mask_builder.get_attn_mask(
```
The chunked-prefill attn_mask function can be reused here instead.
done
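For illustration, here is a rough sketch of the idea behind reusing a shared chunked-prefill-style mask rather than building a fixed 2048x2048 mask in this code path (the actual helper in vllm-ascend may differ; function and variable names below are assumptions):

```python
import torch


def build_causal_mask(max_len: int, device: torch.device) -> torch.Tensor:
    # Shared boolean mask: True marks positions that must not be attended to.
    return ~torch.tril(
        torch.ones((max_len, max_len), dtype=torch.bool, device=device))


# Hypothetical reuse for a prefill step with a cached prefix: slice the mask
# rows for the new query tokens against all context columns instead of
# rebuilding a mask for every call.
full_mask = build_causal_mask(2048, torch.device("cpu"))
context_len, num_new_tokens = 512, 32
step_mask = full_mask[context_len:context_len + num_new_tokens,
                      :context_len + num_new_tokens]
print(step_mask.shape)  # torch.Size([32, 544])
```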
What this PR does / why we need it?
Adds support for using the Fused Infer Attention (FIA) operator in the prefill cache-hit path under full graph mode.
Does this PR introduce any user-facing change?
How was this patch tested?