[Core] Speculative Early-Exit for LRMs implemented on Eagle3 #27192
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs would not trigger a full CI run by default. You can ask your reviewers to trigger select CI tests. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. If you have any questions, please reach out to us on Slack at https://slack.vllm.ai. 🚀
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces an early-exit mechanism for speculative decoding in Large Reasoning Models (LRMs), a significant performance optimization. The implementation is well-structured, touching upon model configurations, the GPU model runner, and the speculative decoding proposer. My review focused on the correctness and maintainability of this new, complex feature. I've identified a couple of high-severity issues: a misleading error message in both llama.py and qwen3.py due to a copy-paste error, and an incorrect type hint in eagle.py which also reveals some dead code in the caller. Addressing these will improve code clarity and ease of debugging. The overall logic for early stopping and the accompanying tests appear robust.
```python
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"
```
The f-string in this assertion message has a typo. It shows </think> for both start_think_ids and stop_think_ids, which can be misleading during debugging. The first one should be <think>.
Suggested change:
```diff
 assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
     f"Invalid think IDs: " \
-    f"</think> {start_think_ids}, " \
+    f"<think> {start_think_ids}, " \
     f"</think> {stop_think_ids}"
```
```python
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"
```
The f-string in this assertion message has a typo. It shows </think> for both start_think_ids and stop_think_ids, which can be misleading during debugging. The first one should be <think>.
Suggested change:
```diff
 assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
     f"Invalid think IDs: " \
-    f"</think> {start_think_ids}, " \
+    f"<think> {start_think_ids}, " \
     f"</think> {stop_think_ids}"
```
```python
    input_batch: Optional[InputBatch] = None,
    input_requests: Optional[dict[str, CachedRequestState]] = None,
    mm_embeds: Optional[list[torch.Tensor]] = None,
) -> torch.Tensor:
```
The return type hint for this method is torch.Tensor, but the implementation at line 333 returns a tuple (draft_token_ids, req_early_stop_signal). This mismatch can mislead static analysis tools and developers.
Additionally, the second element of the returned tuple, req_early_stop_signal, is always None as it's never assigned a value within the function. This makes the corresponding handling logic in the caller (GPUModelRunner.propose_draft_token_ids) dead code. The actual update of the early stop signal happens in-place within batch_to_req_early_stop_signal.
To improve code clarity and correctness, the return type hint should be updated to reflect the actual return type.
Suggested change:
```diff
-) -> torch.Tensor:
+) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
```
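For illustration, a minimal stand-in for the return contract the review describes. The names and types below are simplified stubs, not the actual vLLM signatures: the second tuple element is never assigned inside the proposer, so a caller branch that expects a non-None value from it can never execute.

```python
# Simplified stand-in for the proposer's return contract described above.
# Names and types here are illustrative, not the actual vLLM definitions.
from typing import Optional

import torch


def propose_stub(
    target_token_ids: torch.Tensor,
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:
    draft_token_ids = target_token_ids.clone()  # placeholder for real drafting
    # The early-stop signal is written in place elsewhere
    # (batch_to_req_early_stop_signal), so this slot is never assigned here.
    req_early_stop_signal: Optional[torch.Tensor] = None
    return draft_token_ids, req_early_stop_signal


drafts, early_stop = propose_stub(torch.tensor([[1, 2, 3]]))
assert early_stop is None  # a caller branch on a non-None value can never run
```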
💡 Codex Review
Here are some automated review suggestions for this pull request.
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
```python
tokenizer = AutoTokenizer.from_pretrained(
    vllm_config.model_config.model)
start_think_ids = tokenizer.encode("<think>", add_special_tokens=False)
stop_think_ids = tokenizer.encode("</think>", add_special_tokens=False)
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

self.think_settings = ThinkSettings(
    start_think_id=start_think_ids[0],
    stop_think_id=stop_think_ids[0],
)

for text in self.think_settings.step_split_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.step_split_token_ids.add(encoded_tokens[0])
for text in self.think_settings.discourse_marker_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.discourse_marker_token_ids.add(
            encoded_tokens[0])
```
Guard thinking-token setup behind configuration
The model now unconditionally constructs an AutoTokenizer and asserts that both <think> and </think> map to a single token. Many Llama checkpoints (especially ones without reasoning-specific tokens) do not satisfy this assumption, so model initialization will raise the assertion even when speculative early exit is disabled. This effectively makes ordinary Llama models unusable in vLLM unless they happen to contain those tokens. Consider performing this setup only when early_stop_thinking is requested and handling the absence of dedicated tokens gracefully.
Useful? React with 👍 / 👎.
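For illustration, here is a minimal sketch of the guarded setup this suggestion asks for. It is not the PR's code: the placement of the `early_stop_thinking` flag, the simplified `ThinkSettings`, and the helper `maybe_build_think_settings` are all assumptions for the sake of the example.

```python
# Sketch only: build thinking-token settings just when early stopping is
# requested, and degrade gracefully when the checkpoint lacks the tokens.
from dataclasses import dataclass, field
from typing import Optional

from transformers import AutoTokenizer


@dataclass
class ThinkSettings:  # simplified stand-in for the PR's ThinkSettings
    start_think_id: int
    stop_think_id: int
    step_split_token_ids: set[int] = field(default_factory=set)
    discourse_marker_token_ids: set[int] = field(default_factory=set)


def maybe_build_think_settings(
    model_name: str, early_stop_thinking: bool
) -> Optional[ThinkSettings]:
    """Return ThinkSettings only when early stopping is enabled and the
    tokenizer really encodes <think>/</think> as single tokens."""
    if not early_stop_thinking:
        return None
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    start = tokenizer.encode("<think>", add_special_tokens=False)
    stop = tokenizer.encode("</think>", add_special_tokens=False)
    if len(start) != 1 or len(stop) != 1:
        # No dedicated thinking tokens: disable the feature instead of
        # aborting model construction with an assertion.
        return None
    return ThinkSettings(start_think_id=start[0], stop_think_id=stop[0])
```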
(The same excerpt, identical to the `llama.py` block above, was added in `qwen3.py`.)
Avoid hard failure when <think> tokens are missing
The same unconditional tokenizer and assertion was added to the Qwen3 model. If a checkpoint does not encode <think>/</think> as single tokens (which is true for many non-reasoning Qwen variants), the assertion will trigger during model construction, preventing the model from loading even when early-stopping is not in use. This should be optional and should not abort initialization when the tokens are absent.
Useful? React with 👍 / 👎.
The branch base is very old and so every file has conflicts. You can follow the instructions here to get past the ruff reformat.
This PR is the implementation of the paper: SpecExit: Accelerating Large Reasoning Model via Speculative Exit
This PR introduces early stopping for LRMs (Large Reasoning Models) through a speculative exit mechanism. It allows LRMs to dynamically terminate the thinking process when all early-exit signals reach their corresponding thresholds, significantly reducing inference latency for long-form reasoning tasks without compromising output quality.
Purpose
LRMs (e.g., Qwen3) generate lengthy `<think>...</think>` sequences. Many reasoning steps are redundant: the model often "knows" the answer before consuming all allocated thinking tokens. Early stopping lets the model exit the thinking phase as soon as its exit signals cross their thresholds, cutting latency without compromising output quality.

Implementation Details
Key additions:
- Early stopping support in `GPUModelRunner`

Modified files:
- `vllm/v1/worker/gpu_model_runner.py`: Core early stopping logic
- `vllm/v1/worker/gpu_input_batch.py`: EWMA score computation (sketched below)
- `vllm/v1/spec_decode/eagle.py`: EAGLE3 integration
- `vllm/model_executor/utils.py`: Score evaluation
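To make the EWMA scoring concrete, here is a toy sketch under stated assumptions: `ewma_update`, `should_stop_thinking`, `alpha`, and `threshold` are hypothetical names, and the real logic in `gpu_input_batch.py` presumably operates on batched per-request state rather than Python floats. It only illustrates the general idea of smoothing a per-step exit signal and comparing it against a threshold.

```python
# Toy illustration of EWMA-smoothed early-exit scoring; all names are
# hypothetical and not taken from the PR.
def ewma_update(prev: float, score: float, alpha: float = 0.5) -> float:
    """Exponentially weighted moving average of the per-step exit signal."""
    return alpha * score + (1.0 - alpha) * prev


def should_stop_thinking(scores: list[float], threshold: float = 0.9) -> bool:
    """Return True once the smoothed exit signal crosses the threshold."""
    ewma = 0.0
    for s in scores:
        ewma = ewma_update(ewma, s)
        if ewma >= threshold:
            return True
    return False


# The signal ramps up as the model converges on an answer; with these
# values the smoothed score crosses 0.9 on the sixth step.
print(should_stop_thinking([0.2, 0.5, 0.8, 0.95, 0.99, 0.99]))  # True
```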
Test Plan
- Tests covering early stopping around `</think>` tokens

Test Result
Essential Elements Checklist