Conversation


@coder-fny coder-fny commented Aug 27, 2025

Purpose

The Eagle3 draft weights from https://www.modelscope.cn/models/AngelSlim/Qwen3-32B_eagle3/ can be used for Eagle3 inference, but their head dimension (80) differs from that of the target model (128), which causes an error; an adaptation is therefore required.

Test Plan

Test Result


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or end-to-end results.
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify mergify bot added qwen Related to Qwen models v1 labels Aug 27, 2025
@coder-fny coder-fny changed the title from "Adapting Qwen3-32B to Eagle3 mode to resolve head dimension (headdim)…" to "Adapting Qwen3-32B to Eagle3 mode to resolve head dimension (headdim) different" Aug 27, 2025

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces changes to handle KV cache configurations for models with variable head dimensions, specifically for Qwen3-32B in Eagle3 mode. It adds a new path for creating KV cache configurations when all layers use FullAttentionSpec but may have different head_size values. While the memory allocation logic correctly uses the maximum page size, the logic for creating the KVCacheGroupSpec is flawed. It uses the spec of the first layer to represent the entire group, which can lead to incorrect concurrency calculations and potential runtime errors if other layers have larger memory requirements. My review includes a high-severity fix for this issue to ensure the group's specification reflects the most demanding layer, preventing potential out-of-memory errors.
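The allocation behavior the review describes can be illustrated with a minimal sketch. The `FullAttentionSpec` class and `num_blocks` helper below are hypothetical, stripped-down stand-ins (not vLLM's actual API); they only show the idea that when per-layer `head_size` varies, the memory pool must be sized by the largest per-layer page:

```python
from dataclasses import dataclass

@dataclass
class FullAttentionSpec:
    """Hypothetical, simplified stand-in for vLLM's FullAttentionSpec."""
    head_size: int
    num_kv_heads: int = 8
    block_size: int = 16
    dtype_bytes: int = 2  # e.g. fp16/bf16

    @property
    def page_size_bytes(self) -> int:
        # One page stores both K and V for block_size tokens.
        return 2 * self.block_size * self.num_kv_heads * self.head_size * self.dtype_bytes

def num_blocks(kv_cache_spec: dict, available_memory: int) -> int:
    # Budget each block for the layer with the largest page, so that
    # every layer's page fits in one block slot.
    max_page = max(s.page_size_bytes for s in kv_cache_spec.values())
    return available_memory // (max_page * len(kv_cache_spec))

specs = {
    "target.0": FullAttentionSpec(head_size=128),
    "draft.0": FullAttentionSpec(head_size=80),  # Eagle3 draft head
}
print(num_blocks(specs, 1 << 30))  # blocks that fit in 1 GiB of KV cache
```

Sizing by the maximum page wastes a little space on the smaller-headed draft layer, but guarantees no layer overflows its slot.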

layer_specs = [
    kv_cache_spec[layer_name] for layer_name in layer_names_one_group
]
merged_layer_spec = layer_specs[0]
Severity: high

Using the spec of the first layer (layer_specs[0]) to represent the entire group can be problematic when layers have different head_size values, as indicated by the PR's purpose. This leads to an incorrect KVCacheGroupSpec for the group.

For instance, get_max_concurrency_for_kv_cache_config uses group.kv_cache_spec.page_size_bytes for its calculation. If the first layer has a smaller head_size than other layers in the group, its page_size_bytes will be smaller than the maximum page size used for allocation in _get_kv_cache_config_allFullAttentionSpec_type. This will cause max_concurrency to be overestimated, which could lead to out-of-memory errors at runtime.

To ensure correctness, the merged_layer_spec should represent the most demanding configuration within the group, which corresponds to the layer with the largest page_size_bytes.

Suggested change:
- merged_layer_spec = layer_specs[0]
+ merged_layer_spec = max(layer_specs, key=lambda spec: spec.page_size_bytes)
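A small runnable sketch of the overestimation the reviewer warns about. The `FullAttentionSpec` class here is a hypothetical, simplified stand-in (not vLLM's real class); the point is that dividing the memory budget by the first layer's smaller page size yields more "fitting" pages than the max-page-size spec actually used for allocation:

```python
from dataclasses import dataclass

@dataclass
class FullAttentionSpec:
    """Hypothetical stand-in with just enough fields to compute a page size."""
    head_size: int
    num_kv_heads: int = 8
    block_size: int = 16
    dtype_bytes: int = 2  # fp16/bf16

    @property
    def page_size_bytes(self) -> int:
        # 2x for the K and V tensors stored per page.
        return 2 * self.block_size * self.num_kv_heads * self.head_size * self.dtype_bytes

# Draft layer with head_size 80, target layer with head_size 128.
layer_specs = [FullAttentionSpec(head_size=80), FullAttentionSpec(head_size=128)]

first = layer_specs[0]                                        # current code
merged = max(layer_specs, key=lambda s: s.page_size_bytes)    # suggested fix

available_bytes = 1 << 30  # pretend 1 GiB is reserved for the KV cache

# Concurrency derived from the first layer's smaller page size is inflated
# relative to the max-page-size spec actually used for allocation.
print(available_bytes // first.page_size_bytes)
print(available_bytes // merged.page_size_bytes)
```

With these numbers, the first-layer estimate admits noticeably more concurrent sequences than the allocation can actually back, which is exactly the out-of-memory risk flagged above.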


👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of CI tests to catch errors quickly.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@coder-fny coder-fny changed the title from "Adapting Qwen3-32B to Eagle3 mode to resolve head dimension (headdim) different" to "Adapting Qwen3-32B to Eagle3 mode to resolve head dimension mismatch issues" Aug 27, 2025