
Conversation

@MatthewBonanni (Contributor) commented Oct 3, 2025

Purpose

MTP models crash with embedding lookup errors when using padded speculation (PR #24539) in large batches with variable-length inputs.
Root Cause: Eagle's padded speculation uses -1 as a sentinel value to mark discarded/invalid tokens in batched requests. MTP models call embed_tokens() directly and cannot handle these invalid indices; other speculators are unaffected because they don't perform embedding lookups.
Solution: Filter out the -1 sentinel values in the Eagle proposer before they reach MTP models by clamping input_ids to the valid vocabulary range [0, vocab_size - 1]. This preserves the efficiency benefits of padded speculation while ensuring MTP models receive valid token indices.
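
For illustration, here is a minimal standalone sketch of the failure mode and the clamping approach. The tensor names and the toy vocabulary size are hypothetical; in the actual change the clamp is applied to input_ids inside the Eagle proposer, as shown in the review comment below.

import torch
import torch.nn as nn

# Toy embedding table standing in for the MTP model's embed_tokens layer.
embed_tokens = nn.Embedding(num_embeddings=8, embedding_dim=4)

# Padded speculation marks discarded/invalid slots with -1; feeding these
# straight into the embedding lookup fails (IndexError on CPU, a
# device-side assert on CUDA).
input_ids = torch.tensor([3, 5, -1, -1])
try:
    embed_tokens(input_ids)
except (IndexError, RuntimeError) as exc:
    print(f"embedding lookup failed: {exc}")

# Clamping to [0, vocab_size - 1] keeps the lookup valid; the clamped
# positions correspond to discarded tokens, so their embeddings never
# influence accepted output.
vocab_size = embed_tokens.num_embeddings
safe_ids = torch.clamp(input_ids, min=0, max=vocab_size - 1)
print(embed_tokens(safe_ids).shape)  # torch.Size([4, 4])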

Test Plan

Basic functionality

VLLM_ATTENTION_BACKEND=FLASH_ATTN_MLA \
vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 8 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \
    --enforce-eager \
    --hf-overrides.num_hidden_layers=4 \
    --load-format=dummy
vllm bench serve --base-url http://0.0.0.0:8000 --model deepseek-ai/DeepSeek-R1 --dataset-name random --random-input-len 1024 --random-output-len 1 --random-range-ratio 0.1 --num-prompts 1024

Correctness

VLLM_ATTENTION_BACKEND=FLASH_ATTN_MLA \
vllm serve deepseek-ai/DeepSeek-R1 \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size 8 \
    --no-enable-prefix-caching \
    --trust-remote-code \
    --speculative-config '{"method": "deepseek_mtp", "num_speculative_tokens": 1}' \
    --enforce-eager
lm_eval --model local-completions --tasks gsm8k --model_args model=deepseek-ai/DeepSeek-R1,base_url=http://0.0.0.0:8000/v1/completions,num_concurrent=1,max_retries=3,tokenized_requests=False

Test Result

Basic functionality

Doesn't crash

Correctness


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Signed-off-by: Matthew Bonanni <[email protected]>
@gemini-code-assist bot left a comment


Code Review

This pull request addresses a crash in MTP models when using padded speculation by clamping invalid token IDs. The change in propose_tree is a good step towards fixing the issue. However, the fix is incomplete as the propose method is not patched, leaving it vulnerable to the same crash. Additionally, I've noticed another critical issue in propose_tree where the model's return value is not handled correctly for MTP models, which will lead to a TypeError. While the second issue is outside the scope of the current diff, addressing both is crucial for a robust solution. I have added a comment on the diff to detail the incomplete fix.

Comment on lines +709 to +713
if self.method == "mtp":
    # Filter out -1 sentinel values that mark discarded/invalid
    # tokens
    vocab_size = self.model.model.embed_tokens.weight.size(0)
    input_ids = torch.clamp(input_ids, min=0, max=vocab_size - 1)


critical

While this logic correctly handles sentinel values for MTP models within propose_tree, the fix is incomplete. The propose method (around line 210) also constructs input_ids from target_token_ids and next_token_ids, which can contain -1 from padded speculation or rejection sampling. This will lead to the same embedding lookup error that this PR aims to fix.

To fully resolve the issue, a similar clamping mechanism should be implemented in the propose method as well. You can add the following code block after self.input_ids[last_token_indices] = next_token_ids:

if self.method == "mtp":
    # Handle -1 sentinel values from padded speculation for MTP models
    # which call embed_tokens() and can't handle invalid indices.
    vocab_size = self.model.model.embed_tokens.weight.size(0)
    clamped_input_ids = torch.clamp(self.input_ids[:num_tokens], min=0, max=vocab_size - 1)
    self.input_ids[:num_tokens] = clamped_input_ids

@benchislett (Collaborator)

Same question as the bot: why only add the fix in propose_tree?

@seven-mile (Contributor)

Hello~ Could you also check out my fix #26231 to determine if we encountered the same issue?

@MatthewBonanni (Contributor, Author)

Hello~ Could you also check out my fix #26231 to determine if we encountered the same issue?

@seven-mile This does appear to be the same issue and your fix addresses my issue as well. I believe your fix is better so I'll close my PR. Thanks!

@benchislett Could you review #26231?

