
Conversation

xuechendi (Collaborator) commented Aug 15, 2025

Design concept:

With spec decode, each decode request may have more than one token:
* this makes the decode num_tokens dynamic, ranging over [batch_size, batch_size * (num_draft_tokens + 1)]
* temporary solution: always assume the worst case, i.e. use max_num_tokens
* change HPU_model_runner's prepare_decodes_input so that all shapes become [padded_batch_size, (num_draft_tokens + 1)] (a sketch follows this list)
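
A minimal sketch of the worst-case padding idea, assuming a per-request list of draft tokens; `pad_decode_inputs` and its signature are illustrative placeholders, not the actual prepare_decodes_input code:

```python
import torch

def pad_decode_inputs(input_tokens: torch.Tensor,  # [num_decodes], last sampled token per request
                      draft_tokens: list,          # per-request tensors, 0..num_draft_tokens each
                      num_draft_tokens: int,
                      pad_id: int = 0) -> torch.Tensor:
    """Pad every decode request to the worst-case (num_draft_tokens + 1) slots
    so the compiled HPU graph always sees one static shape."""
    num_decodes = input_tokens.shape[0]
    padded = torch.full((num_decodes, num_draft_tokens + 1), pad_id,
                        dtype=input_tokens.dtype)
    padded[:, 0] = input_tokens                 # each request starts with its input token
    for i, drafts in enumerate(draft_tokens):   # then its (possibly empty) draft tokens
        padded[i, 1:1 + drafts.numel()] = drafts
    return padded.view(-1, 1)                   # [num_decodes * (num_draft_tokens + 1), 1]

tokens = pad_decode_inputs(torch.tensor([11, 22]),
                           [torch.tensor([5]), torch.tensor([], dtype=torch.long)],
                           num_draft_tokens=1)
assert tokens.shape == (4, 1)
```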

workflow:

prompt path:
 => [draft_model] => draft_tokens

decode path:
=> (combine input_tokens, draft_tokens) => [prepare_inputs] => (padded_input_tokens) => [target_model] 
    => (target_tokens, bonus_tokens) => [reject_sampler] 
    => output_tokens (combination of draft_token + bonus_tokens) => [draft_model] 
    => (new_draft_tokens, output_tokens) => update to input_batch data structure

# input_tokens: shape is [num_decodes, 1]
# draft_tokens: shape is [num_decodes * num_draft_tokens, 1]
# combining input_tokens and draft_tokens yields a dynamic shape, ranging from num_decodes to num_decodes * (num_draft_tokens + 1)
# padded_input_tokens: shape is [num_decodes * (num_draft_tokens + 1), 1] => same for positions, slot_mapping, block_groups
# output_tokens: shape is [num_decodes * (num_draft_tokens + 1), 1]
# new_draft_tokens: shape is [num_decodes, num_draft_tokens]
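
To make the shape bookkeeping above concrete, here is a toy instance with num_decodes = 2 and num_draft_tokens = 1 (the values are illustrative):

```python
import torch

num_decodes, num_draft_tokens = 2, 1

input_tokens = torch.zeros(num_decodes, 1, dtype=torch.long)
draft_tokens = torch.zeros(num_decodes * num_draft_tokens, 1, dtype=torch.long)

# After worst-case padding, the target model's inputs and outputs share one static
# shape; positions, slot_mapping, and block_groups are padded to match.
padded_input_tokens = torch.zeros(num_decodes * (num_draft_tokens + 1), 1, dtype=torch.long)
output_tokens = torch.zeros(num_decodes * (num_draft_tokens + 1), 1, dtype=torch.long)
new_draft_tokens = torch.zeros(num_decodes, num_draft_tokens, dtype=torch.long)

assert padded_input_tokens.shape == (4, 1)
assert output_tokens.shape == (4, 1)
```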

Design Doc: [diagram attached to the PR description]

Jira: SW-234434

Updated on WW35.2:

This PR currently works for Eagle and NGRAM.
For Eagle, only num_spec_decode_token = 1 is supported.
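
For reference, enabling Eagle through vLLM's speculative_config would look roughly like the snippet below; the model names and the exact dict keys are assumptions based on upstream vLLM examples, not taken from this PR.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical target model (gated, needs HF_TOKEN)
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # hypothetical Eagle draft model
        "num_speculative_tokens": 1,  # this PR supports only 1 for Eagle
    },
)
```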

PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task eagle --osl 512
================ spec_eagle =================
acc_counts: [3331, 0]
acc_rate: 0.3735142408611796
num_draft_tokens: 8918
num_drafts: 8918
---
Prompt: Hello, my name is
Generated text:  [Name]. I am a [Your Profession/Student] and I am here to learn more about [Topic/Industry]. I am excited to be a part of this [Event/Community] and I am looking forward to connecting with others who'...'
---
Prompt: The president of the United States is
Generated text:  the head of state and government of the United States, and is the highest-ranking official in the country. The president is responsible for executing the laws of the United States, and is also the co'...'
---
Prompt: The capital of France is
Generated text:  Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. '...'
---
Prompt: The future of AI is
Generated text:  bright, but it's not without its challenges. Here are some of the key challenges that AI faces in the future:
1. Explainability and Transparency: As AI systems become more complex and autonomous, it''...'
---
Prompt: San Francisco is know for its
Generated text:  vibrant arts and culture scene, and the city is home to a wide range of museums, galleries, and performance venues. Here are some of the top arts and culture attractions in San Francisco:
1. de Young'...'
---
Prompt: Facebook was created in 2004 by
Generated text:  Mark Zuckerberg, along with his college roommates and fellow Harvard University students Eduardo Saverin, Andrew McCollum, Dustin Moskovitz, and Chris Hughes. Initially, the platform was called "Thef'...'
---
Prompt: Curious George is a
Generated text:  beloved children's book series created by H.A. and Margret Rey. The series follows the adventures of a curious and mischievous monkey named George, who lives with his friend the Man in the Yellow Hat'...'
---
Prompt: Python 3.11 brings improvements to its
Generated text:  type hinting system, including support for type hints in lambda functions and improvements to the type checker. Here are some of the key changes:

1. **Type hints in lambda functions**: You can now a'...'
=========================================
PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task ngram --osl 512
================= spec_ngram =================
acc_counts: [1452, 0]
acc_rate: 0.18558282208588958
num_draft_tokens: 7824
num_drafts: 7824
---
Prompt: Hello, my name is
Generated text:  Xiaoyu, and I'm a student at the University of Science and Technology of China. I'm currently working on a research project about the application of machine learning in the field of materials science'...'
---
Prompt: The president of the United States is
Generated text:  the head of state and government of the United States. The president is the head of the executive branch of the U.S. government, and is the commander-in-chief of the United States Armed Forces. The p'...'
---
Prompt: The capital of France is
Generated text:  Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of Portugal is Lisbon. The capital of Greece is Athens. The capital of Belgium is Br'...'
---
Prompt: The future of AI is
Generated text:  a topic that has been the subject of much speculation and debate. As the technology continues to evolve, it's clear that AI is going to have a significant impact on society, the economy, and the way '...'
---
Prompt: San Francisco is know for its
Generated text:  fog, but the fog is not the only thing that is fog-like. The city is also known for its fog-like "fog" in the form of a fog-like substance that is not actually fog. What is this substance? Also, what'...'
---
Prompt: Facebook was created in 2004 by
Generated text:  Mark Zuckerberg, and it has grown into a global social media platform with over 2.8 billion monthly active users. The platform allows users to create profiles, connect with friends, share content, an'...'
---
Prompt: Curious George is a
Generated text:  2015 American 3D computer-animated comedy film directed by Tom McCamus and written by David W. Zucker, and starring the titular character, Curious George, who is a monkey. The film is the first in th'...'
---
Prompt: Python 3.11 brings improvements to its
Generated text:  standard library, including the `typing` module. One of the notable changes is the introduction of the `TypeAlias` feature, which allows for the creation of type aliases in a more readable and concis'...'
=========================================
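
In both runs, acc_rate is acc_counts[0] divided by num_draft_tokens; a quick check of the reported numbers:

```python
print(3331 / 8918)  # 0.3735142408611796  -> Eagle acc_rate above
print(1452 / 7824)  # 0.18558282208588958 -> ngram acc_rate above
```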

@xuechendi xuechendi force-pushed the dev/spec_decode branch 5 times, most recently from 9d80e60 to 4a3e75d on August 26, 2025 20:49
@xuechendi xuechendi marked this pull request as ready for review August 26, 2025 20:50
@xuechendi xuechendi changed the title from "[DRAFT] Enable Spec Decode for HPU v1" to "Enable Spec Decode for HPU v1 - Part1 (basic workflow + eagle)" Aug 26, 2025
xuechendi (Collaborator, Author) commented Aug 26, 2025

Since the spec decode Eagle model needs HF_TOKEN, we will not add the test to the public CI at this moment.

xuechendi (Collaborator, Author) commented:

Not included in this PR:

  1. Medusa enabling
  2. Eagle 3
  3. MTP
  4. num_spec_decode > 1
  5. perf improvement

Status
1. Eagle and NGRAM are working

TODO
1. add prefill support to the draft model
2. performance improvements

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
@mswiniarsk mswiniarsk requested a review from Copilot August 28, 2025 13:03
Copilot AI left a comment

Pull Request Overview

This pull request enables speculative decoding for HPU (Habana Processing Unit) v1, implementing support for Eagle and NGRAM proposers. The PR introduces the fundamental workflow for speculative decoding where each decode request can generate multiple tokens, requiring dynamic batching and specialized sampling logic.

Key changes:

  • Implements Eagle and NGRAM speculative decoding proposers for HPU
  • Adds spec decode metadata handling and rejection sampling
  • Modifies input preparation to handle variable token counts per request

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.

Summary per file:

* vllm_gaudi/v1/worker/hpu_worker.py: adds a draft-token handling method to the HPU worker
* vllm_gaudi/v1/worker/hpu_model_runner.py: main implementation of the spec decode workflow, proposers, and input preparation
* vllm_gaudi/v1/worker/hpu_input_batch.py: adds tracking of requests that do not support spec decode
* vllm_gaudi/v1/sample/hpu_rejection_sampler.py: implements PyTorch-based rejection sampling for HPU
* vllm_gaudi/v1/attention/backends/hpu_attn.py: adds a query_start_loc parameter for spec decode support
* vllm_gaudi/platform.py: updates the attention backend configuration
* vllm_gaudi/ops/hpu_rotary_embedding.py: fixes tensor size calculation for variable shapes
* vllm_gaudi/__init__.py: registers the new rejection sampler module
* tests/full_tests/spec_decode.py: comprehensive test suite for spec decode functionality
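
As a side note on the idea behind hpu_rejection_sampler.py, here is a minimal sketch of greedy-acceptance rejection sampling: a draft token is kept only while it matches the target model's token, and the target's next token serves as the correction or bonus token. Function and variable names are illustrative, not the PR's actual code.

```python
import torch

def greedy_rejection_sample(draft_tokens: torch.Tensor,   # [num_draft_tokens]
                            target_tokens: torch.Tensor   # [num_draft_tokens + 1]
                            ) -> torch.Tensor:
    """Accept the longest prefix of draft tokens the target model agrees with,
    then append one target token (a correction, or the bonus token if every
    draft was accepted)."""
    num_draft = draft_tokens.numel()
    matches = (draft_tokens == target_tokens[:num_draft]).long()
    accepted = int(matches.cumprod(dim=0).sum().item())  # length of matching prefix
    return torch.cat([draft_tokens[:accepted],
                      target_tokens[accepted:accepted + 1]])

out = greedy_rejection_sample(torch.tensor([7, 8, 9]),
                              torch.tensor([7, 8, 4, 2]))
# drafts 7 and 8 accepted; 9 rejected and replaced by the target's 4
assert out.tolist() == [7, 8, 4]
```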


Signed-off-by: Chendi.Xue <[email protected]>
xuechendi (Collaborator, Author) commented:

@mswiniarsk, could you please help review? I have resolved the Copilot comments.
