
Conversation

@MatthewBonanni (Contributor) commented Oct 3, 2025

Purpose

MTP models crash with embedding lookup errors when using padded speculation (PR #24539) in large batches with variable-length inputs.
Root Cause: Eagle's padded speculation uses -1 as a sentinel value to mark discarded/invalid tokens in batched requests. MTP models call embed_tokens() directly and cannot handle these invalid indices; other speculators are unaffected because they don't perform embedding lookups.
Solution: Filter out the -1 sentinel values in the Eagle proposer before they reach MTP models by clamping input_ids to the valid vocabulary range [0, vocab_size - 1]. This preserves the efficiency benefits of padded speculation while ensuring MTP models receive valid token indices.
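
For illustration, here is a minimal standalone sketch of the failure mode and the clamping approach. The tensor names and the toy vocabulary size are hypothetical; in the actual change the clamp is applied to input_ids inside the Eagle proposer, as shown in the review comment below.

import torch
import torch.nn as nn

# Toy embedding table standing in for the MTP model's embed_tokens layer.
embed_tokens = nn.Embedding(num_embeddings=8, embedding_dim=4)

# Padded speculation marks discarded/invalid slots with -1; feeding these
# straight into the embedding lookup fails (IndexError on CPU, a
# device-side assert on CUDA).
input_ids = torch.tensor([3, 5, -1, -1])
try:
    embed_tokens(input_ids)
except (IndexError, RuntimeError) as exc:
    print(f"embedding lookup failed: {exc}")

# Clamping to [0, vocab_size - 1] keeps the lookup valid; the clamped
# positions correspond to discarded tokens, so their embeddings never
# influence accepted output.
vocab_size = embed_tokens.num_embeddings
safe_ids = torch.clamp(input_ids, min=0, max=vocab_size - 1)
print(embed_tokens(safe_ids).shape)  # torch.Size([4, 4])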

Test Plan

Basic functionality

VLLM_ATTENTION_BACKEND=FLASH_ATTN_MLA \
vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 8 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \
    --enforce-eager \
    --hf-overrides.num_hidden_layers=4 \
    --load-format=dummy
vllm bench serve --base-url http://0.0.0.0:8000 --model deepseek-ai/DeepSeek-R1 --dataset-name random --random-input-len 1024 --random-output-len 1 --random-range-ratio 0.1 --num-prompts 1024

Correctness

VLLM_ATTENTION_BACKEND=FLASH_ATTN_MLA \
vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 8 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \
    --enforce-eager
lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-R1,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False

Test Result

Basic functionality

Doesn't crash

Correctness


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <[email protected]>
@gemini-code-assist bot left a comment


Code Review

This pull request addresses a crash in MTP models when using padded speculation by clamping invalid token IDs. The change in propose_tree is a good step towards fixing the issue. However, the fix is incomplete as the propose method is not patched, leaving it vulnerable to the same crash. Additionally, I've noticed another critical issue in propose_tree where the model's return value is not handled correctly for MTP models, which will lead to a TypeError. While the second issue is outside the scope of the current diff, addressing both is crucial for a robust solution. I have added a comment on the diff to detail the incomplete fix.

Comment on lines +709 to +713
if self.method == "mtp":
    # Filter out -1 sentinel values that mark discarded/invalid
    # tokens
    vocab_size = self.model.model.embed_tokens.weight.size(0)
    input_ids = torch.clamp(input_ids, min=0, max=vocab_size - 1)


critical

While this logic correctly handles sentinel values for MTP models within propose_tree, the fix is incomplete. The propose method (around line 210) also constructs input_ids from target_token_ids and next_token_ids, which can contain -1 from padded speculation or rejection sampling. This will lead to the same embedding lookup error that this PR aims to fix.

To fully resolve the issue, a similar clamping mechanism should be implemented in the propose method as well. You can add the following code block after self.input_ids[last_token_indices] = next_token_ids:

if self.method == "mtp":
    # Handle -1 sentinel values from padded speculation for MTP models
    # which call embed_tokens() and can't handle invalid indices.
    vocab_size = self.model.model.embed_tokens.weight.size(0)
    clamped_input_ids = torch.clamp(self.input_ids[:num_tokens], min=0, max=vocab_size - 1)
    self.input_ids[:num_tokens] = clamped_input_ids

@benchislett (Collaborator)

Same question as the bot: why only add the fix in propose_tree?

@seven-mile (Contributor)

Hello~ Could you also check out my fix #26231 to determine if we encountered the same issue?

@MatthewBonanni (Contributor, Author)

Hello~ Could you also check out my fix #26231 to determine if we encountered the same issue?

@seven-mile This does appear to be the same issue and your fix addresses my issue as well. I believe your fix is better so I'll close my PR. Thanks!

@benchislett Could you review #26231?

