[NVIDIA] Support Flashinfer TRTLLM FP8-q/kv/out Attention Kernel #21716
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Code Review
This pull request adds support for the Flashinfer TRT-LLM FP8-query/KV/output attention kernel. The changes span benchmarks, tests, and the core attention backend logic. The main changes update the Flashinfer API calls to accept an out parameter for in-place writes and add logic to handle FP8 quantization of queries and outputs. The PR also includes a significant refactoring of CUDA graph support for attention backends.
My review identifies two main issues. First, a critical bug in vllm/attention/layer.py where the query_scale for FP8 quantization is not correctly propagated to the attention implementation. Second, a high-severity issue in vllm/v1/attention/backends/flashinfer.py where the use of the TRT-LLM attention kernel is hardcoded, which limits flexibility.
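For context, here is a minimal PyTorch sketch of the pattern the review describes: the query is statically quantized to FP8 before the call, and the kernel writes its FP8 result into a preallocated buffer passed via an `out`-style argument. The `attn_kernel`, `q_scale`, and `out_scale` names are hypothetical placeholders for illustration, not the actual FlashInfer API.

```python
import torch

def quantize_to_fp8(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Static FP8 quantization: divide by the scale, clamp to the e4m3 range, cast."""
    finfo = torch.finfo(torch.float8_e4m3fn)
    return (x / scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

def fp8_attention_forward(query, kv_cache, q_scale, out_scale, attn_kernel):
    # Quantize the query so the kernel consumes FP8 q alongside the FP8 KV cache.
    q_fp8 = quantize_to_fp8(query, q_scale)

    # Preallocate the output and pass it via `out=` so the kernel writes in place,
    # mirroring the out-parameter change described above.
    out = torch.empty_like(query, dtype=torch.float8_e4m3fn)
    attn_kernel(q_fp8, kv_cache, out=out, o_scale=out_scale)
    return out
```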
This pull request has merge conflicts that must be resolved before it can be merged.
vllm/utils/flashinfer.py (Outdated)
not related to this PR, but I think has_nvidia_artifactory() can be removed because FlashInfer now supports downloading all cubins at once.
can do this in another PR
Got it.
This PR is looking really good! Thanks for all your hard work
This pull request has merge conflicts that must be resolved before it can be merged.
…m-project#21716) Signed-off-by: elvischenv <[email protected]> Co-authored-by: Michael Goin <[email protected]> Co-authored-by: Luka Govedič <[email protected]>
May I ask why TTFT increases when QKV FP8 is enabled? I assume FP8 Tensor Cores should accelerate the QK and PV matmuls when Q, K, and V are quantized to FP8.
Using an FP8 KV cache introduces an additional FP8 quantization kernel for the query tensor, so performance can drop slightly when the attention speedup is too small to offset that extra kernel. Ideally, that quantization should be fused with RoPE; that work is tracked in #24678.
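To make the trade-off concrete, a back-of-the-envelope sketch of the overhead the answer refers to: the standalone query-quant kernel performs one extra read and write of the query tensor per forward pass. The shapes below are illustrative assumptions, not measurements from this PR.

```python
# Extra memory traffic of the standalone query-quant kernel, per forward pass
# over N tokens with H query heads of head_dim D (assumed example shapes).
N, H, D = 8192, 40, 128
read_bytes = N * H * D * 2    # read the bf16/fp16 query
write_bytes = N * H * D * 1   # write the fp8 query
print(f"extra traffic: {(read_bytes + write_bytes) / 1e6:.1f} MB per forward pass")
```

When the attention kernel's FP8 speedup is smaller than this added launch and memory-traffic cost, TTFT can regress slightly, which is why fusing the quantization into RoPE is the longer-term fix.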
Essential Elements of an Effective PR Description Checklist
- (Optional) Documentation update, e.g. supported_models.md and examples for a new model.

Purpose
- Add `AttentionStaticQuantPattern` in the fusion_attn pass, which will fuse the attn+fp8_quant pattern for using the TRTLLM FP8-in FP8-out kernel (see the sketch after this description).

Test Plan && Test Result
Functional:
- tests/kernels/attention/test_flashinfer_trtllm_attention.py
- tests/compile/test_fusion_attn.py::test_attention_quant_pattern

E2E Performance:
- Model: nvidia/Llama-4-Scout-17B-16E-Instruct-FP8
- Compared: main vs. PR

(Optional) Documentation Update
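To illustrate the fusion described in the Purpose section (a simplified sketch, not the actual vLLM pass implementation): the pattern pass replaces an attention op followed by a static FP8 quantization of its output with a single call to an FP8-output kernel. `fp8_out_attn` and `o_scale` are placeholder names, not real API identifiers.

```python
import torch

# Unfused graph: attention runs in bf16/fp16, then a separate static FP8 quant
# of the output (two kernels, one extra pass over the attention output).
def attn_then_quant(q, k, v, out_scale):
    o = torch.nn.functional.scaled_dot_product_attention(q, k, v)
    finfo = torch.finfo(torch.float8_e4m3fn)
    return (o / out_scale).clamp(finfo.min, finfo.max).to(torch.float8_e4m3fn)

# Fused graph: the pattern pass rewrites attn + fp8_quant into a single call to
# an FP8-output attention kernel that takes the output scale directly.
def fused_attn_quant(q, k, v, out_scale, fp8_out_attn):
    return fp8_out_attn(q, k, v, o_scale=out_scale)
```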