
Conversation


@RuBing-Yang RuBing-Yang commented Oct 20, 2025

This PR implements the paper SpecExit: Accelerating Large Reasoning Model via Speculative Exit.

This PR introduces early stopping for LRMs (Large Reasoning Models) through a speculative exit mechanism. It allows LRMs to dynamically terminate the thinking process once all early-stop signals reach their corresponding thresholds, significantly reducing inference latency for long-form reasoning tasks without compromising output quality.

Purpose

LRMs (e.g., Qwen3) generate lengthy <think>...</think> sequences. Many reasoning steps are redundant: the model often "knows" the answer before consuming all allocated thinking tokens. Early stopping enables:

  • Latency reduction: Skip unnecessary reasoning steps
  • Better throughput: Process more requests in parallel
  • Maintained quality: Configurable thresholds preserve reasoning quality
  • Flexible strategies: Support multiple stopping criteria (confidence, progress, remaining tokens)

Implementation Details

Key additions:

  • Early stop signal detection in GPUModelRunner
  • EWMA-based scoring for stable stopping decisions (see the sketch after this list)
  • Multi-criterion scoring methods (confidence, progress, remaining tokens)
  • Integration with EAGLE3 speculative decoding
  • Batch-level early exit optimization
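To make the stopping rule concrete, the sketch below shows one way an EWMA-smoothed, multi-criterion check could work. It is only illustrative: the class name, signal names, thresholds, and smoothing factor are assumptions, not the values used in this PR.

```python
from dataclasses import dataclass, field


@dataclass
class EarlyStopScorer:
    """Illustrative EWMA-based, multi-criterion early-stop scorer."""

    alpha: float = 0.7  # weight given to the newest observation
    thresholds: dict[str, float] = field(default_factory=lambda: {
        "confidence": 0.8,        # how sure the model already is of its answer
        "progress": 0.8,          # estimated fraction of reasoning completed
        "tokens_remaining": 0.8,  # fraction of the thinking budget consumed
    })
    ewma: dict[str, float] = field(default_factory=dict)

    def update(self, raw_scores: dict[str, float]) -> bool:
        """Smooth each raw signal and return True once all pass their thresholds."""
        for name, value in raw_scores.items():
            prev = self.ewma.get(name, value)
            self.ewma[name] = self.alpha * value + (1.0 - self.alpha) * prev
        return all(
            self.ewma.get(name, 0.0) >= threshold
            for name, threshold in self.thresholds.items()
        )


# Feed per-step signals; emit </think> once every smoothed signal passes its threshold.
scorer = EarlyStopScorer()
steps = [
    {"confidence": 0.70, "progress": 0.60, "tokens_remaining": 0.50},
    {"confidence": 0.95, "progress": 0.97, "tokens_remaining": 0.98},
]
for i, raw in enumerate(steps):
    if scorer.update(raw):
        print(f"step {i}: all signals past threshold, insert </think> and answer")
```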

Modified files:

  • vllm/v1/worker/gpu_model_runner.py: Core early stopping logic
  • vllm/v1/worker/gpu_input_batch.py: EWMA score computation
  • vllm/v1/spec_decode/eagle.py: EAGLE3 integration
  • vllm/model_executor/utils.py: Score evaluation

Test Plan

  • Tested with Llama and similar architectures
  • Validated early stop decision logic with multiple threshold combinations
  • Verified correct insertion of </think> tokens

Test Result

  • All pre-commit checks passed
  • No type errors after fixes
  • Early stop insertion verified to work correctly

Essential Elements Checklist
  • Purpose: Early stopping for reasoning models via speculative exit
  • Test plan provided
  • Test results included
  • Documentation updated (if user-facing)
  • Release notes updated (if user-facing - optional for infrastructure changes)

mgoin and others added 5 commits August 18, 2025 15:27
Signed-off-by: breno.skuk <[email protected]>
Signed-off-by: Breno Baldas Skuk <[email protected]>
Signed-off-by: mgoin <[email protected]>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Michael Goin <[email protected]>
@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run fastcheck CI, which runs a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@mergify mergify bot added the frontend, llama (Related to Llama models), and qwen (Related to Qwen models) labels Oct 20, 2025

mergify bot commented Oct 20, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @RuBing-Yang.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 20, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an early-exit mechanism for speculative decoding in Large Reasoning Models (LRMs), a significant performance optimization. The implementation is well-structured, touching upon model configurations, the GPU model runner, and the speculative decoding proposer. My review focused on the correctness and maintainability of this new, complex feature. I've identified a couple of high-severity issues: a misleading error message in both llama.py and qwen3.py due to a copy-paste error, and an incorrect type hint in eagle.py which also reveals some dead code in the caller. Addressing these will improve code clarity and ease of debugging. The overall logic for early stopping and the accompanying tests appear robust.

Comment on lines +557 to +560
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

Severity: high

The f-string in this assertion message has a typo. It shows </think> for both start_think_ids and stop_think_ids, which can be misleading during debugging. The first one should be <think>.

Suggested change
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"<think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

Comment on lines +295 to +298
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

Severity: high

The f-string in this assertion message has a typo. It shows </think> for both start_think_ids and stop_think_ids, which can be misleading during debugging. The first one should be <think>.

Suggested change
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"<think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

input_batch: Optional[InputBatch] = None,
input_requests: Optional[dict[str, CachedRequestState]] = None,
mm_embeds: Optional[list[torch.Tensor]] = None,
) -> torch.Tensor:

high

The return type hint for this method is torch.Tensor, but the implementation at line 333 returns a tuple (draft_token_ids, req_early_stop_signal). This mismatch can mislead static analysis tools and developers.

Additionally, the second element of the returned tuple, req_early_stop_signal, is always None as it's never assigned a value within the function. This makes the corresponding handling logic in the caller (GPUModelRunner.propose_draft_token_ids) dead code. The actual update of the early stop signal happens in-place within batch_to_req_early_stop_signal.

To improve code clarity and correctness, the return type hint should be updated to reflect the actual return type.

Suggested change
) -> torch.Tensor:
) -> tuple[torch.Tensor, Optional[torch.Tensor]]:


@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +553 to +575
tokenizer = AutoTokenizer.from_pretrained(
    vllm_config.model_config.model)
start_think_ids = tokenizer.encode("<think>", add_special_tokens=False)
stop_think_ids = tokenizer.encode("</think>", add_special_tokens=False)
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

self.think_settings = ThinkSettings(
    start_think_id=start_think_ids[0],
    stop_think_id=stop_think_ids[0],
)

for text in self.think_settings.step_split_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.step_split_token_ids.add(encoded_tokens[0])
for text in self.think_settings.discourse_marker_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.discourse_marker_token_ids.add(
            encoded_tokens[0])


P1: Guard thinking-token setup behind configuration

The model now unconditionally constructs an AutoTokenizer and asserts that both <think> and </think> map to a single token. Many Llama checkpoints (especially ones without reasoning-specific tokens) do not satisfy this assumption, so model initialization will raise the assertion even when speculative early exit is disabled. This effectively makes ordinary Llama models unusable in vLLM unless they happen to contain those tokens. Consider performing this setup only when early_stop_thinking is requested and handling the absence of dedicated tokens gracefully.
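One way to act on this suggestion is sketched below: skip the setup entirely unless early exit is requested, and disable the feature instead of asserting when the tokens are not single IDs. The `early_stop_thinking` flag, the `maybe_init_think_settings` helper, and the module-level `logger` are hypothetical names used for illustration; only `ThinkSettings` comes from the quoted diff.

```python
def maybe_init_think_settings(self, vllm_config, tokenizer) -> None:
    # Hypothetical guard: only set up thinking tokens when early exit is enabled.
    spec_config = getattr(vllm_config, "speculative_config", None)
    if not getattr(spec_config, "early_stop_thinking", False):
        self.think_settings = None
        return

    start_ids = tokenizer.encode("<think>", add_special_tokens=False)
    stop_ids = tokenizer.encode("</think>", add_special_tokens=False)
    if len(start_ids) != 1 or len(stop_ids) != 1:
        # Degrade gracefully instead of asserting, so checkpoints without
        # dedicated thinking tokens still load (with early exit disabled).
        logger.warning(
            "Early-stop thinking disabled: <think>/</think> do not map to "
            "single tokens (%s, %s).", start_ids, stop_ids)
        self.think_settings = None
        return

    self.think_settings = ThinkSettings(
        start_think_id=start_ids[0],
        stop_think_id=stop_ids[0],
    )
```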

Useful? React with 👍 / 👎.

Comment on lines +291 to +313
tokenizer = AutoTokenizer.from_pretrained(
    vllm_config.model_config.model)
start_think_ids = tokenizer.encode("<think>", add_special_tokens=False)
stop_think_ids = tokenizer.encode("</think>", add_special_tokens=False)
assert len(start_think_ids) == 1 and len(stop_think_ids) == 1, \
    f"Invalid think IDs: " \
    f"</think> {start_think_ids}, " \
    f"</think> {stop_think_ids}"

self.think_settings = ThinkSettings(
    start_think_id=start_think_ids[0],
    stop_think_id=stop_think_ids[0],
)

for text in self.think_settings.step_split_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.step_split_token_ids.add(encoded_tokens[0])
for text in self.think_settings.discourse_marker_tokens:
    encoded_tokens = tokenizer.encode(text, add_special_tokens=False)
    if len(encoded_tokens) == 1:
        self.think_settings.discourse_marker_token_ids.add(
            encoded_tokens[0])


P1: Avoid hard failure when <think> tokens are missing

The same unconditional tokenizer and assertion was added to the Qwen3 model. If a checkpoint does not encode <think>/</think> as single tokens (which is true for many non-reasoning Qwen variants), the assertion will trigger during model construction, preventing the model from loading even when early-stopping is not in use. This should be optional and should not abort initialization when the tokens are absent.

Useful? React with 👍 / 👎.

Member

@hmellor hmellor left a comment


The branch base is very old, so every file has conflicts. You can follow the instructions here to get past the ruff reformat.


Labels

frontend, llama (Related to Llama models), needs-rebase, qwen (Related to Qwen models), speculative-decoding, v1
