@MengqingCao (Collaborator) commented Sep 22, 2025

What this PR does / why we need it?

Refactor kv cache tensor initialization logic.

  1. Unify the kv cache tensor initialization logic of DeepSeek and normal models
  2. Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors` and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM (see the sketch below)
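
A minimal sketch of what this allocate/reshape split can look like, modeled on the GPU model runner in vLLM. The helper names (`KVCacheTensorSpec`, `layer_shapes`) and exact signatures here are illustrative assumptions, not the actual vllm-ascend code:

```python
from typing import Dict, List

import torch


class KVCacheTensorSpec:
    """Illustrative per-buffer spec: size in bytes plus the layers reusing it."""

    def __init__(self, size: int, shared_by: List[str]):
        self.size = size
        self.shared_by = shared_by


def _allocate_kv_cache_tensors(
        kv_cache_tensor_specs: List[KVCacheTensorSpec],
        device: torch.device) -> Dict[str, torch.Tensor]:
    """Step 1: allocate one flat byte buffer per spec, mapped to every sharing layer."""
    kv_cache_raw_tensors: Dict[str, torch.Tensor] = {}
    for spec in kv_cache_tensor_specs:
        buffer = torch.zeros(spec.size, dtype=torch.int8, device=device)
        for layer_name in spec.shared_by:
            # All layers listed in `shared_by` point at the same storage,
            # so the sharing layers cost no extra device memory.
            kv_cache_raw_tensors[layer_name] = buffer
    return kv_cache_raw_tensors


def _reshape_kv_cache_tensors(
        kv_cache_raw_tensors: Dict[str, torch.Tensor],
        layer_shapes: Dict[str, torch.Size],
        dtype: torch.dtype) -> Dict[str, torch.Tensor]:
    """Step 2: view each flat buffer with the attention-specific KV layout of its layer."""
    kv_caches: Dict[str, torch.Tensor] = {}
    for layer_name, raw in kv_cache_raw_tensors.items():
        kv_caches[layer_name] = raw.view(dtype).view(layer_shapes[layer_name])
    return kv_caches


# Toy usage: two layers share one 1 KiB buffer, later viewed as fp16 KV blocks.
specs = [KVCacheTensorSpec(size=1024, shared_by=["layer.0", "layer.1"])]
raw = _allocate_kv_cache_tensors(specs, device=torch.device("cpu"))
shapes = {name: torch.Size([2, 4, 8, 8]) for name in raw}  # 512 fp16 elements = 1024 bytes
kv_caches = _reshape_kv_cache_tensors(raw, shapes, torch.float16)
```

The point of the split is that the flat buffers are created once, and model-specific differences (normal attention, MLA, linear attention) only show up in how step 2 views them.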

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

CI passed with existing tests. Scenarios covered:

  1. Prefill disaggregation
  2. DeepSeek with aclgraph/eager mode
  3. Qwen3-Next

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by future PRs.
  • Write the commit message by filling out the PR description to help reviewers and future developers understand.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

@MengqingCao MengqingCao marked this pull request as ready for review September 23, 2025 07:26
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.


yiz-liu pushed a commit that referenced this pull request Oct 29, 2025
…ype (#3760)

### What this PR does / why we need it?
Part of #3106
Fix a hybrid KV cache sharing bug for layers with the same attention type.
Change the `shared_by` logic so that the same attention spec can share
the same buffer instead of allocating more HBM (a toy sketch of the idea
follows the screenshot below).
After this PR, KV cache memory is reduced by 50% on Qwen3-Next compared
with before (`self_attn:linear_attn=1:3` in an `attn_group`), and
`gpu_memory_utilization` can be raised to `0.8` on Qwen3-Next when
running on A2 64G/card with TP4.

<img width="2833" height="1540" alt="image" src="https://github.com/user-attachments/assets/2a91fa99-fb0f-447c-9e8b-acd587890fbe" />
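
To make the `shared_by` change concrete, below is a toy sketch of planning one allocation per distinct attention spec and letting every further layer with the same spec join that entry's `shared_by` list instead of getting its own tensor. The class and field names are illustrative assumptions, and the one-buffer-per-spec grouping is a deliberate simplification of the real group construction:

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass(frozen=True)
class AttentionSpec:
    """Illustrative stand-in for a per-layer KV cache spec."""
    attn_type: str          # e.g. "self_attn" or "linear_attn"
    page_size_bytes: int


@dataclass
class KVCacheTensorEntry:
    """One planned HBM allocation plus the layers that will reuse it."""
    size: int
    shared_by: List[str] = field(default_factory=list)


def plan_kv_cache_tensors(
        layer_specs: Dict[str, AttentionSpec],
        num_blocks: int) -> List[KVCacheTensorEntry]:
    """Plan one buffer per distinct spec; later layers with an identical spec
    only join `shared_by`, so they add no extra HBM."""
    entries: Dict[AttentionSpec, KVCacheTensorEntry] = {}
    for layer_name, spec in layer_specs.items():
        entry = entries.get(spec)
        if entry is None:
            # First layer with this spec: plan a real allocation.
            entry = KVCacheTensorEntry(size=num_blocks * spec.page_size_bytes)
            entries[spec] = entry
        entry.shared_by.append(layer_name)
    return list(entries.values())


# Toy usage: a 1:3 self_attn:linear_attn group collapses to two planned buffers.
layers = {
    "layer.0": AttentionSpec("self_attn", 4096),
    "layer.1": AttentionSpec("linear_attn", 1024),
    "layer.2": AttentionSpec("linear_attn", 1024),
    "layer.3": AttentionSpec("linear_attn", 1024),
}
print([e.shared_by for e in plan_kv_cache_tensors(layers, num_blocks=16)])
```

In this toy model the 1:3 group above goes from four per-layer buffers to two shared ones; the actual saving reported in the description comes from the real grouping in the PR, which this sketch only approximates.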

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
Tests pass with the latest e2e test case on Qwen3-Next.

- vLLM version: v0.11.0rc3
- vLLM main:
vllm-project/vllm@c9461e0

---------

Signed-off-by: MengqingCao <[email protected]>
@github-actions

This pull request has conflicts, please resolve those before we can evaluate the pull request.

  * Unify the kv cache tensor initialization logic of DeepSeek and normal models
  * Split `initialize_kv_cache_tensors` into `_allocate_kv_cache_tensors` and `_reshape_kv_cache_tensors`, following the GPU model runner in vLLM
  * Fix the `shared_by` logic so that the same attention spec can share the same buffer instead of allocating more HBM

Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
Signed-off-by: MengqingCao <[email protected]>
@MengqingCao added the ready (read for review) and ready-for-test (start test by label for PR) labels Oct 31, 2025