
Conversation

xuechendi (Collaborator) commented Aug 15, 2025

Design concept:

With spec decode, each decode request may have more than one token:
* this makes the decode num_tokens dynamic, ranging over [batch_size, batch_size * (num_draft_tokens + 1)]
* temporary solution: always assume the worst case, i.e. use max_num_tokens
* change HPU_model_runner's prepare_decodes_input so that all shapes become [padded_batch_size, (num_draft_tokens + 1)] (a sketch follows this list)
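
A minimal sketch of the worst-case padding idea, assuming a per-request list of draft tokens; `pad_decode_inputs` and its signature are illustrative placeholders, not the actual prepare_decodes_input code:

```python
import torch

def pad_decode_inputs(input_tokens: torch.Tensor,  # [num_decodes], last sampled token per request
                      draft_tokens: list,          # per-request tensors, 0..num_draft_tokens each
                      num_draft_tokens: int,
                      pad_id: int = 0) -> torch.Tensor:
    """Pad every decode request to the worst-case (num_draft_tokens + 1) slots
    so the compiled HPU graph always sees one static shape."""
    num_decodes = input_tokens.shape[0]
    padded = torch.full((num_decodes, num_draft_tokens + 1), pad_id,
                        dtype=input_tokens.dtype)
    padded[:, 0] = input_tokens                 # each request starts with its input token
    for i, drafts in enumerate(draft_tokens):   # then its (possibly empty) draft tokens
        padded[i, 1:1 + drafts.numel()] = drafts
    return padded.view(-1, 1)                   # [num_decodes * (num_draft_tokens + 1), 1]

tokens = pad_decode_inputs(torch.tensor([11, 22]),
                           [torch.tensor([5]), torch.tensor([], dtype=torch.long)],
                           num_draft_tokens=1)
assert tokens.shape == (4, 1)
```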

workflow:

prompt path:
 => [draft_model] => draft_tokens

decode path:
=> (combine input_tokens, draft_tokens) => [prepare_inputs] => (padded_input_tokens) => [target_model] 
    => (target_tokens, bonus_tokens) => [reject_sampler] 
    => output_tokens (combination of draft_token + bonus_tokens) => [draft_model] 
    => (new_draft_tokens, output_tokens) => update to input_batch data structure

# input_tokens: shape is [num_decodes, 1]
# draft_tokens: shape is [num_decodes * num_draft_tokens, 1]
# combining input_tokens and draft_tokens yields a dynamic shape, ranging from num_decodes to num_decodes * (num_draft_tokens + 1)
# padded_input_tokens: shape is [num_decodes * (num_draft_tokens + 1), 1] => same for positions, slot_mapping, block_groups
# output_tokens: shape is [num_decodes * (num_draft_tokens + 1), 1]
# new_draft_tokens: shape is [num_decodes, num_draft_tokens]
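
To make the shape bookkeeping above concrete, here is a toy instance with num_decodes = 2 and num_draft_tokens = 1 (the values are illustrative):

```python
import torch

num_decodes, num_draft_tokens = 2, 1

input_tokens = torch.zeros(num_decodes, 1, dtype=torch.long)
draft_tokens = torch.zeros(num_decodes * num_draft_tokens, 1, dtype=torch.long)

# After worst-case padding, the target model's inputs and outputs share one static
# shape; positions, slot_mapping, and block_groups are padded to match.
padded_input_tokens = torch.zeros(num_decodes * (num_draft_tokens + 1), 1, dtype=torch.long)
output_tokens = torch.zeros(num_decodes * (num_draft_tokens + 1), 1, dtype=torch.long)
new_draft_tokens = torch.zeros(num_decodes, num_draft_tokens, dtype=torch.long)

assert padded_input_tokens.shape == (4, 1)
assert output_tokens.shape == (4, 1)
```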

Design Doc: [diagram attached to the PR description]

Jira: SW-234434

Updated on WW35.2:

This PR currently works for Eagle and NGRAM.
For Eagle, only num_spec_decode_token = 1 is supported.
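
For reference, enabling Eagle through vLLM's speculative_config would look roughly like the snippet below; the model names and the exact dict keys are assumptions based on upstream vLLM examples, not taken from this PR.

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # hypothetical target model (gated, needs HF_TOKEN)
    speculative_config={
        "method": "eagle",
        "model": "yuhuili/EAGLE-LLaMA3.1-Instruct-8B",  # hypothetical Eagle draft model
        "num_speculative_tokens": 1,  # this PR supports only 1 for Eagle
    },
)
```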

PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task eagle --osl 512
================ spec_eagle =================
acc_counts: [3331, 0]
acc_rate: 0.3735142408611796
num_draft_tokens: 8918
num_drafts: 8918
---
Prompt: Hello, my name is
Generated text:  [Name]. I am a [Your Profession/Student] and I am here to learn more about [Topic/Industry]. I am excited to be a part of this [Event/Community] and I am looking forward to connecting with others who'...'
---
Prompt: The president of the United States is
Generated text:  the head of state and government of the United States, and is the highest-ranking official in the country. The president is responsible for executing the laws of the United States, and is also the co'...'
---
Prompt: The capital of France is
Generated text:  Paris, which is located in the north-central part of the country. Paris is the most populous city in France and is known for its stunning architecture, art museums, fashion, and romantic atmosphere. '...'
---
Prompt: The future of AI is
Generated text:  bright, but it's not without its challenges. Here are some of the key challenges that AI faces in the future:
1. Explainability and Transparency: As AI systems become more complex and autonomous, it''...'
---
Prompt: San Francisco is know for its
Generated text:  vibrant arts and culture scene, and the city is home to a wide range of museums, galleries, and performance venues. Here are some of the top arts and culture attractions in San Francisco:
1. de Young'...'
---
Prompt: Facebook was created in 2004 by
Generated text:  Mark Zuckerberg, along with his college roommates and fellow Harvard University students Eduardo Saverin, Andrew McCollum, Dustin Moskovitz, and Chris Hughes. Initially, the platform was called "Thef'...'
---
Prompt: Curious George is a
Generated text:  beloved children's book series created by H.A. and Margret Rey. The series follows the adventures of a curious and mischievous monkey named George, who lives with his friend the Man in the Yellow Hat'...'
---
Prompt: Python 3.11 brings improvements to its
Generated text:  type hinting system, including support for type hints in lambda functions and improvements to the type checker. Here are some of the key changes:

1. **Type hints in lambda functions**: You can now a'...'
=========================================
PT_HPU_LAZY_MODE=1 python tests/full_tests/spec_decode.py --task ngram --osl 512
================= spec_ngram =================
acc_counts: [1452, 0]
acc_rate: 0.18558282208588958
num_draft_tokens: 7824
num_drafts: 7824
---
Prompt: Hello, my name is
Generated text:  Xiaoyu, and I'm a student at the University of Science and Technology of China. I'm currently working on a research project about the application of machine learning in the field of materials science'...'
---
Prompt: The president of the United States is
Generated text:  the head of state and government of the United States. The president is the head of the executive branch of the U.S. government, and is the commander-in-chief of the United States Armed Forces. The p'...'
---
Prompt: The capital of France is
Generated text:  Paris. The capital of Germany is Berlin. The capital of Italy is Rome. The capital of Spain is Madrid. The capital of Portugal is Lisbon. The capital of Greece is Athens. The capital of Belgium is Br'...'
---
Prompt: The future of AI is
Generated text:  a topic that has been the subject of much speculation and debate. As the technology continues to evolve, it's clear that AI is going to have a significant impact on society, the economy, and the way '...'
---
Prompt: San Francisco is know for its
Generated text:  fog, but the fog is not the only thing that is fog-like. The city is also known for its fog-like "fog" in the form of a fog-like substance that is not actually fog. What is this substance? Also, what'...'
---
Prompt: Facebook was created in 2004 by
Generated text:  Mark Zuckerberg, and it has grown into a global social media platform with over 2.8 billion monthly active users. The platform allows users to create profiles, connect with friends, share content, an'...'
---
Prompt: Curious George is a
Generated text:  2015 American 3D computer-animated comedy film directed by Tom McCamus and written by David W. Zucker, and starring the titular character, Curious George, who is a monkey. The film is the first in th'...'
---
Prompt: Python 3.11 brings improvements to its
Generated text:  standard library, including the `typing` module. One of the notable changes is the introduction of the `TypeAlias` feature, which allows for the creation of type aliases in a more readable and concis'...'
=========================================
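
In both runs, acc_rate is acc_counts[0] divided by num_draft_tokens; a quick check of the reported numbers:

```python
print(3331 / 8918)  # 0.3735142408611796  -> Eagle acc_rate above
print(1452 / 7824)  # 0.18558282208588958 -> ngram acc_rate above
```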

@xuechendi xuechendi force-pushed the dev/spec_decode branch 5 times, most recently from 9d80e60 to 4a3e75d on August 26, 2025 20:49
@xuechendi xuechendi marked this pull request as ready for review August 26, 2025 20:50
@xuechendi xuechendi changed the title from "[DRAFT] Enable Spec Decode for HPU v1" to "Enable Spec Decode for HPU v1 - Part1 (basic workflow + eagle)" Aug 26, 2025
xuechendi (Collaborator, Author) commented Aug 26, 2025

Since the spec decode Eagle model needs HF_TOKEN, we will not add the test to the public CI at this moment.

xuechendi (Collaborator, Author) commented:

Not included in this PR:

  1. Medusa enabling
  2. Eagle 3
  3. MTP
  4. num_spec_decode > 1
  5. perf improvement

Status
1. Eagle and NGRAM are working

TODO
1. add prefill support to the draft model
2. performance improvements

Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
Signed-off-by: Chendi.Xue <[email protected]>
@mswiniarsk mswiniarsk requested a review from Copilot August 28, 2025 13:03
Copilot AI left a comment

Pull Request Overview

This pull request enables speculative decoding for HPU (Habana Processing Unit) v1, implementing support for Eagle and NGRAM proposers. The PR introduces the fundamental workflow for speculative decoding where each decode request can generate multiple tokens, requiring dynamic batching and specialized sampling logic.

Key changes:

  • Implements Eagle and NGRAM speculative decoding proposers for HPU
  • Adds spec decode metadata handling and rejection sampling
  • Modifies input preparation to handle variable token counts per request

Reviewed Changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.

Summary per file:

* vllm_gaudi/v1/worker/hpu_worker.py: adds a draft-token handling method to the HPU worker
* vllm_gaudi/v1/worker/hpu_model_runner.py: main implementation of the spec decode workflow, proposers, and input preparation
* vllm_gaudi/v1/worker/hpu_input_batch.py: adds tracking of requests that do not support spec decode
* vllm_gaudi/v1/sample/hpu_rejection_sampler.py: implements PyTorch-based rejection sampling for HPU
* vllm_gaudi/v1/attention/backends/hpu_attn.py: adds a query_start_loc parameter for spec decode support
* vllm_gaudi/platform.py: updates the attention backend configuration
* vllm_gaudi/ops/hpu_rotary_embedding.py: fixes tensor size calculation for variable shapes
* vllm_gaudi/__init__.py: registers the new rejection sampler module
* tests/full_tests/spec_decode.py: comprehensive test suite for spec decode functionality
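
As a side note on the idea behind hpu_rejection_sampler.py, here is a minimal sketch of greedy-acceptance rejection sampling: a draft token is kept only while it matches the target model's token, and the target's next token serves as the correction or bonus token. Function and variable names are illustrative, not the PR's actual code.

```python
import torch

def greedy_rejection_sample(draft_tokens: torch.Tensor,   # [num_draft_tokens]
                            target_tokens: torch.Tensor   # [num_draft_tokens + 1]
                            ) -> torch.Tensor:
    """Accept the longest prefix of draft tokens the target model agrees with,
    then append one target token (a correction, or the bonus token if every
    draft was accepted)."""
    num_draft = draft_tokens.numel()
    matches = (draft_tokens == target_tokens[:num_draft]).long()
    accepted = int(matches.cumprod(dim=0).sum().item())  # length of matching prefix
    return torch.cat([draft_tokens[:accepted],
                      target_tokens[accepted:accepted + 1]])

out = greedy_rejection_sample(torch.tensor([7, 8, 9]),
                              torch.tensor([7, 8, 4, 2]))
# drafts 7 and 8 accepted; 9 rejected and replaced by the target's 4
assert out.tolist() == [7, 8, 4]
```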


Signed-off-by: Chendi.Xue <[email protected]>
xuechendi (Collaborator, Author) commented:

@mswiniarsk, could you please help review? I have resolved the Copilot comments.
