[Core] Speculative Early-Exit for LRMs implemented on Eagle3 #27192
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces an early-exit mechanism for speculative decoding in Large Reasoning Models (LRMs), a significant performance optimization. The implementation is well-structured, touching upon model configurations, the GPU model runner, and the speculative decoding proposer. My review focused on the correctness and maintainability of this new, complex feature. I've identified a couple of high-severity issues: a misleading error message in both llama.py and qwen3.py due to a copy-paste error, and an incorrect type hint in eagle.py which also reveals some dead code in the caller. Addressing these will improve code clarity and ease of debugging. The overall logic for early stopping and the accompanying tests appear robust.
```python
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"
```
The f-string in this assertion message has a typo. It shows </think> for both start_think_ids and stop_think_ids, which can be misleading during debugging. The first one should be <think>.
Suggested change:
```diff
 assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
     f"Invalid think IDs: " \
-    f"</think> {start_think_ids}, " \
+    f"<think> {start_think_ids}, " \
     f"</think> {stop_think_ids}"
```
```python
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"
```
The f-string in this assertion message has a typo. It shows </think> for both start_think_ids and stop_think_ids, which can be misleading during debugging. The first one should be <think>.
Suggested change:
```diff
 assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
     f"Invalid think IDs: " \
-    f"</think> {start_think_ids}, " \
+    f"<think> {start_think_ids}, " \
     f"</think> {stop_think_ids}"
```
```python
    input_batch: Optional[InputBatch] = None,
    input_requests: Optional[dict[str, CachedRequestState]] = None,
    mm_embeds: Optional[list[torch.Tensor]] = None,
) -> torch.Tensor:
```
The return type hint for this method is torch.Tensor, but the implementation at line 333 returns a tuple (draft_token_ids, req_early_stop_signal). This mismatch can mislead static analysis tools and developers.
Additionally, the second element of the returned tuple, req_early_stop_signal, is always None as it's never assigned a value within the function. This makes the corresponding handling logic in the caller (GPUModelRunner.propose_draft_token_ids) dead code. The actual update of the early stop signal happens in-place within batch_to_req_early_stop_signal.
To improve code clarity and correctness, the return type hint should be updated to reflect the actual return type.
Suggested change:
```diff
-) -> torch.Tensor:
+) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
```
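For illustration, a minimal stand-in for the return contract the review describes. The names and types below are simplified stubs, not the actual vLLM signatures: the second tuple element is never assigned inside the proposer, so a caller branch that expects a non-None value from it can never execute.

```python
# Simplified stand-in for the proposer's return contract described above.
# Names and types here are illustrative, not the actual vLLM definitions.
from typing import Optional

import torch


def propose_stub(
    target_token_ids: torch.Tensor,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    draft_token_ids = target_token_ids.clone()  # placeholder for real drafting
    # The early-stop signal is written in place elsewhere
    # (batch_to_req_early_stop_signal), so this slot is never assigned here.
    req_early_stop_signal: Optional[torch.Tensor] = None
    return draft_token_ids, req_early_stop_signal


drafts, early_stop = propose_stub(torch.tensor([[1, 2, 3]]))
assert early_stop is None  # a caller branch on a non-None value can never run
```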
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
tokenizer = AutoTokenizer.from_pretrained(
    vllm_config.model_config.model)
start_think_ids = tokenizer.encode("<think>", add_special_tokens=False)
stop_think_ids = tokenizer.encode("</think>", add_special_tokens=False)
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

self.think_settings = ThinkSettings(
    start_think_id=start_think_ids[0],
    stop_think_id=stop_think_ids[0],
)

for text in self.think_settings.step_split_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.step_split_token_ids.add(encoded_tokens[0])
for text in self.think_settings.discourse_marker_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.discourse_marker_token_ids.add(
            encoded_tokens[0])
```
Guard thinking-token setup behind configuration
The model now unconditionally constructs an AutoTokenizer and asserts that both <think> and </think> map to a single token. Many Llama checkpoints (especially ones without reasoning-specific tokens) do not satisfy this assumption, so model initialization will raise the assertion even when speculative early exit is disabled. This effectively makes ordinary Llama models unusable in vLLM unless they happen to contain those tokens. Consider performing this setup only when early_stop_thinking is requested and handling the absence of dedicated tokens gracefully.
Useful? React with 👍 / 👎.
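For illustration, here is a minimal sketch of the guarded setup this suggestion asks for. It is not the PR's code: the placement of the `early_stop_thinking` flag, the simplified `ThinkSettings`, and the helper `maybe_build_think_settings` are all assumptions for the sake of the example.

```python
# Sketch only: build thinking-token settings just when early stopping is
# requested, and degrade gracefully when the checkpoint lacks the tokens.
from dataclasses import dataclass, field
from typing import Optional

from transformers import AutoTokenizer


@dataclass
class ThinkSettings:  # simplified stand-in for the PR's ThinkSettings
    start_think_id: int
    stop_think_id: int
    step_split_token_ids: set[int] = field(default_factory=set)
    discourse_marker_token_ids: set[int] = field(default_factory=set)


def maybe_build_think_settings(
    model_name: str, early_stop_thinking: bool
) -> Optional[ThinkSettings]:
    """Return ThinkSettings only when early stopping is enabled and the
    tokenizer really encodes <think>/</think> as single tokens."""
    if not early_stop_thinking:
        return None
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    start = tokenizer.encode("<think>", add_special_tokens=False)
    stop = tokenizer.encode("</think>", add_special_tokens=False)
    if len(start) != 1 or len(stop) != 1:
        # No dedicated thinking tokens: disable the feature instead of
        # aborting model construction with an assertion.
        return None
    return ThinkSettings(start_think_id=start[0], stop_think_id=stop[0])
```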
(The same excerpt, identical to the `llama.py` block above, was added in `qwen3.py`.)
Avoid hard failure when <think> tokens are missing
The same unconditional tokenizer and assertion was added to the Qwen3 model. If a checkpoint does not encode <think>/</think> as single tokens (which is true for many non-reasoning Qwen variants), the assertion will trigger during model construction, preventing the model from loading even when early-stopping is not in use. This should be optional and should not abort initialization when the tokens are absent.
Useful? React with 👍 / 👎.
The branch base is very old and so every file has conflicts. You can follow the instructions here to get past the ruff reformat.
This PR is the implementation of the paper: SpecExit: Accelerating Large Reasoning Model via Speculative Exit
This PR introduces early stopping for LRMs (Large Reasoning Models) through a speculative exit mechanism. It allows LRMs to dynamically terminate the thinking process when all early-exit signals reach their corresponding thresholds, significantly reducing inference latency for long-form reasoning tasks without compromising output quality.
Purpose
LRMs (e.g., Qwen3) generate lengthy `<think>...</think>` sequences. Many reasoning steps are redundant: the model often "knows" the answer before consuming all allocated thinking tokens. Early stopping lets the model exit the thinking phase as soon as its exit signals cross their thresholds, cutting latency without compromising output quality.

Implementation Details
Key additions:
- Early stopping support in `GPUModelRunner`

Modified files:
- `vllm/v1/worker/gpu_model_runner.py`: Core early stopping logic
- `vllm/v1/worker/gpu_input_batch.py`: EWMA score computation (sketched below)
- `vllm/v1/spec_decode/eagle.py`: EAGLE3 integration
- `vllm/model_executor/utils.py`: Score evaluation
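To make the EWMA scoring concrete, here is a toy sketch under stated assumptions: `ewma_update`, `should_stop_thinking`, `alpha`, and `threshold` are hypothetical names, and the real logic in `gpu_input_batch.py` presumably operates on batched per-request state rather than Python floats. It only illustrates the general idea of smoothing a per-step exit signal and comparing it against a threshold.

```python
# Toy illustration of EWMA-smoothed early-exit scoring; all names are
# hypothetical and not taken from the PR.
def ewma_update(prev: float, score: float, alpha: float = 0.5) -> float:
    """Exponentially weighted moving average of the per-step exit signal."""
    return alpha * score + (1.0 - alpha) * prev


def should_stop_thinking(scores: list[float], threshold: float = 0.9) -> bool:
    """Return True once the smoothed exit signal crosses the threshold."""
    ewma = 0.0
    for s in scores:
        ewma = ewma_update(ewma, s)
        if ewma >= threshold:
            return True
    return False


# The signal ramps up as the model converges on an answer; with these
# values the smoothed score crosses 0.9 on the sixth step.
print(should_stop_thinking([0.2, 0.5, 0.8, 0.95, 0.99, 0.99]))  # True
```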
Test Plan
- Tests covering early stopping around `</think>` tokens

Test Result
Essential Elements Checklist