Enable Spec Decode for HPU v1 - Part1(basic workflow + eagle) #81
Conversation
Force-pushed from 9d80e60 to 4a3e75d.
Since the spec decode Eagle model needs HF_TOKEN, the test will not be added to the public CI at this moment.
Force-pushed from 70920a6 to c938f4e.
Not included in this PR:
Status: 1. Eagle and NGRAM are working. TODO: add prefill to draft model performance. Signed-off-by: Chendi.Xue <[email protected]>
Force-pushed from c938f4e to cfe3361.
Force-pushed from cfe3361 to 26bed54.
Pull Request Overview
This pull request enables speculative decoding for HPU (Habana Processing Unit) v1, implementing support for Eagle and NGRAM proposers. The PR introduces the fundamental workflow for speculative decoding where each decode request can generate multiple tokens, requiring dynamic batching and specialized sampling logic.
Key changes:
- Implements Eagle and NGRAM speculative decoding proposers for HPU
- Adds spec decode metadata handling and rejection sampling
- Modifies input preparation to handle variable token counts per request
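For context, the overview above describes the draft-and-verify loop that spec decode adds to each decode step. The sketch below is a minimal, hypothetical illustration of that loop; the `proposer.propose` call and the greedy verification are illustrative stand-ins, not the actual classes or sampling logic in this PR.

```python
# Illustrative sketch of a generic speculative decoding step (not code from this PR).
import torch

def speculative_step(target_model, proposer, input_ids: torch.Tensor, k: int):
    # 1. Propose k draft tokens per sequence (shape: [batch, k]).
    draft_tokens = proposer.propose(input_ids, num_tokens=k)

    # 2. One target-model forward pass over prompt + drafts; with spec decode
    #    each decode request now contributes (k + 1) query tokens instead of 1.
    logits = target_model(torch.cat([input_ids, draft_tokens], dim=-1))

    # 3. Greedy verification: accept the longest prefix of drafts that matches
    #    what the target model would have produced (the real sampler is
    #    probabilistic; see the rejection sampler sketch further below).
    target_tokens = logits[:, -(k + 1):-1].argmax(dim=-1)
    num_accepted = (draft_tokens == target_tokens).int().cumprod(dim=-1).sum(dim=-1)
    return draft_tokens, num_accepted
```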
Reviewed Changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 6 comments.
File | Description
---|---
vllm_gaudi/v1/worker/hpu_worker.py | Adds draft token handling method to HPU worker
vllm_gaudi/v1/worker/hpu_model_runner.py | Major implementation of spec decode workflow, proposers, and input preparation
vllm_gaudi/v1/worker/hpu_input_batch.py | Adds spec decode unsupported request tracking
vllm_gaudi/v1/sample/hpu_rejection_sampler.py | Implements PyTorch-based rejection sampling for HPU
vllm_gaudi/v1/attention/backends/hpu_attn.py | Adds query_start_loc parameter for spec decode support
vllm_gaudi/platform.py | Updates attention backend configuration
vllm_gaudi/ops/hpu_rotary_embedding.py | Fixes tensor size calculation for variable shapes
vllm_gaudi/__init__.py | Registers new rejection sampler module
tests/full_tests/spec_decode.py | Comprehensive test suite for spec decode functionality
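As a rough sketch of what the PyTorch-based rejection sampler in `vllm_gaudi/v1/sample/hpu_rejection_sampler.py` does conceptually, here is the textbook token-level rejection test for a single request; this is an illustration of the technique, not code taken from the file.

```python
import torch

def rejection_sample(draft_probs: torch.Tensor,
                     target_probs: torch.Tensor,
                     draft_tokens: torch.Tensor) -> int:
    """Token-level rejection test for one request.

    draft_probs, target_probs: [k, vocab] probabilities from the draft and
    target models at the k draft positions; draft_tokens: [k] proposed ids.
    Returns the number of accepted draft tokens (the accepted prefix length).
    """
    k = draft_tokens.shape[0]
    idx = torch.arange(k)
    p_draft = draft_probs[idx, draft_tokens]
    p_target = target_probs[idx, draft_tokens]
    # Accept token t with probability min(1, p_target(t) / p_draft(t)).
    accept = torch.rand(k) < (p_target / p_draft).clamp(max=1.0)
    # Keep only the prefix before the first rejection.
    return int(accept.int().cumprod(dim=0).sum())
```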
@mswiniarsk, could you help review? I have resolved the Copilot comments.
Design concept:
With spec decode, each decode request may carry more than one token:
* the number of decode tokens therefore ranges over [batch_size, batch_size * (num_decode_tokens + 1)]
* temporary solution: always assume the worst case, so we use max_num_tokens
* change to HPU_model_runner's prepare_decodes_input: all shapes become [padded_batch_size, (num_draft_tokens + 1)] (a minimal padding sketch is shown below)
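A minimal sketch of the worst-case padding idea, assuming hypothetical helper and argument names (`pad_decode_tokens`, `pad_id`) that are not taken from the PR:

```python
import torch

def pad_decode_tokens(per_request_tokens: list[list[int]],
                      padded_batch_size: int,
                      num_draft_tokens: int,
                      pad_id: int = 0) -> torch.Tensor:
    # Every decode request is padded to (num_draft_tokens + 1) slots, so the
    # input tensor keeps a static [padded_batch_size, num_draft_tokens + 1]
    # shape regardless of how many draft tokens each request actually carries.
    width = num_draft_tokens + 1
    out = torch.full((padded_batch_size, width), pad_id, dtype=torch.long)
    for i, toks in enumerate(per_request_tokens):
        toks = toks[:width]  # never exceed the padded width
        out[i, :len(toks)] = torch.tensor(toks, dtype=torch.long)
    return out

# Example: two requests (one with a draft token, one without) padded to batch 4:
# pad_decode_tokens([[11, 12], [21]], padded_batch_size=4, num_draft_tokens=1)
# -> tensor([[11, 12], [21, 0], [0, 0], [0, 0]])
```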
workflow:
Design Doc:

Jira: SW-234434
Updated on WW35.2:
This PR covers Eagle and NGRAM at this moment.
For Eagle, only num_spec_decode_token = 1 is supported.
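For reference, launching with Eagle spec decode would typically look like the snippet below; the target and draft model paths are placeholders, and the `speculative_config` keys reflect my assumption about the standard vLLM interface rather than anything introduced by this PR.

```python
from vllm import LLM, SamplingParams

# Hypothetical launch configuration; model paths are placeholders and
# num_speculative_tokens is kept at 1, matching the current Eagle limitation.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    speculative_config={
        "method": "eagle",
        "model": "path/to/eagle-draft-model",
        "num_speculative_tokens": 1,
    },
)
outputs = llm.generate(["Speculative decoding lets the model"],
                       SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```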