Commit 7965842

kaiyuxz and bpatel authored

[doc] Update Perf-Overview.MD with V0.20 Release Data (cherry-pick #5176) (#5324)

Signed-off-by: Kaiyu Xie <[email protected]>
Co-authored-by: zpatel <[email protected]>

1 parent: 109f28e

1 file changed

docs/source/performance/perf-overview.md (95 additions & 75 deletions)
@@ -28,101 +28,119 @@ nvidia/Llama-3.1-405B-Instruct-FP4
```

#### Llama 3.3 70B FP4
+
| | GPU | B200 | | | |
-|:-----------------------------|:---|:----------|:----------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 |
-| ISL, OSL| | | | | |
-| | | | | | |
-| 128, 128 | | 11,253.28 | 17,867.66 | 24,944.50 | 27,471.49 |
-| 128, 2048 | | 9,925.00 | 15,459.71 | 23,608.58 | 30,742.86 |
-| 128, 4096 | | 6,318.92 | 8,711.88 | 17,659.74 | 24,947.05 |
-| 500, 2000 | | 7,559.88 | 10,602.27 | 20,910.23 | 28,182.34 |
-| 1000, 1000 | | 6,866.96 | 10,838.01 | 16,567.86 | 19,991.64 |
-| 1000, 2000 | | 6,736.88 | 9,132.08 | 15,737.02 | 20,518.04 |
-| 1024, 2048 | | 6,580.56 | 8,767.45 | 15,722.55 | 20,437.96 |
-| 2048, 128 | | 1,375.49 | 1,610.69 | 2,707.58 | 3,717.82 |
-| 2048, 2048 | | 4,544.73 | 6,956.14 | 12,292.23 | 15,661.22 |
-| 5000, 500 | | 1,488.19 | 2,379.73 | 3,588.45 | 4,810.21 |
-| 20000, 2000 | | 580.96 | 1,043.58 | 1,957.84 | 3,167.30 |
+|:------------------------|:--------|:----------|:----------|:----------|:----------|
+| | TP Size | 1 | 2 | 4 | 8 |
+| ISL, OSL | | | | | |
+| | | | | | |
+| 128, 128 | | 10,994.48 | 17,542.11 | 24,667.31 | 27,272.27 |
+| 128, 2048 | | 9,580.46 | 15,432.35 | 23,568.12 | 31,174.31 |
+| 128, 4096 | | 6,418.39 | 9,841.53 | 17,808.76 | 25,229.25 |
+| 500, 2000 | | 7,343.32 | 11,850.57 | 20,709.67 | 28,038.78 |
+| 1000, 1000 | | 6,752.53 | 10,815.88 | 16,413.04 | 20,060.66 |
+| 1000, 2000 | | 6,670.07 | 9,830.73 | 15,597.49 | 20,672.37 |
+| 1024, 2048 | | 6,636.75 | 9,807.13 | 15,519.23 | 20,617.28 |
+| 2048, 128 | | 1,342.17 | 1,989.41 | 3,033.14 | 4,035.64 |
+| 5000, 500 | | 1,429.67 | 2,419.67 | 3,686.84 | 5,182.96 |
+| 20000, 2000 | | 629.77 | 1,177.01 | 2,120.66 | 3,429.03 |

#### Llama 3.1 405B FP4
-| | GPU | B200 |
-|:-----------------------------|:---|:----------|
-| | TP Size | 8 |
-| ISL, OSL| | |
-| | | |
-| 128, 128 | | 9,184.83 |
-| 128, 2048 | | 10,387.23 |
-| 128, 4096 | | 8,741.80 |
-| 500, 2000 | | 9,242.34 |
-| 1000, 1000 | | 7,565.50 |
-| 1000, 2000 | | 7,696.76 |
-| 1024, 2048 | | 7,568.93 |
-| 2048, 128 | | 953.57 |
-| 2048, 2048 | | 6,092.32 |
-| 5000, 500 | | 1,332.22 |
-| 20000, 2000 | | 961.58 |
+
+| | GPU | B200 | |
+|:------------------------|:------- |:---------|:----------|
+| | TP Size | 4 | 8 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 128 | | 6,163.81 | 9,002.90 |
+| 128, 2048 | | 7,081.21 | 10,288.28 |
+| 128, 4096 | | 6,028.37 | 8,713.77 |
+| 500, 2000 | | 5,858.75 | 9,125.86 |
+| 1000, 1000 | | 4,848.00 | 7,582.97 |
+| 1000, 2000 | | 5,375.25 | 7,626.28 |
+| 1024, 2048 | | 5,345.70 | 7,464.03 |
+| 2048, 128 | | 693.55 | 1,086.56 |
+| 5000, 500 | | 947.49 | 1,532.45 |
+| 20000, 2000 | | 641.11 | 1,097.84 |

### FP8 Models:
```
nvidia/Llama-3.1-8B-Instruct-FP8
-nvidia/Llama-3.1-70B-Instruct-FP8
+nvidia/Llama-3.3-70B-Instruct-FP8
nvidia/Llama-3.1-405B-Instruct-FP8
+nvidia/Llama-4-Maverick-17B-128E-Instruct-FP8
```

#### Llama 3.1 8B FP8
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
+
+| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 1 | 1 |
+| | TP Size | 1 | 1 |
| ISL, OSL | | | |
| | | | |
-| 128, 128 | | 28,447.38 | 27,568.68 |
-| 128, 2048 | | 23,294.74 | 22,003.62 |
-| 128, 4096 | | 17,481.48 | 13,640.35 |
-| 500, 2000 | | 21,462.57 | 17,794.39 |
-| 1000, 1000 | | 17,590.60 | 15,270.02 |
-| 1000, 2000 | | 17,139.51 | 13,850.22 |
-| 1024, 2048 | | 16,970.63 | 13,374.15 |
-| 2048, 128 | | 3,531.33 | 3,495.05 |
-| 2048, 2048 | | 12,022.38 | 9,653.67 |
-| 5000, 500 | | 3,851.65 | 3,371.16 |
-| 20000, 2000 | | 1,706.06 | 1,340.92 |
-
-#### Llama 3.1 70B FP8
-| | GPU | H200 141GB HBM3 | | | | H100 80GB HBM3 | | | |
+| 128, 128 | | 27,970.14 | 27,688.36 |
+| 128, 2048 | | 23,326.38 | 21,841.15 |
+| 128, 4096 | | 17,508.51 | 13,730.89 |
+| 500, 2000 | | 21,390.41 | 17,833.34 |
+| 1000, 1000 | | 17,366.89 | 15,270.62 |
+| 1000, 2000 | | 16,831.31 | 13,798.08 |
+| 1024, 2048 | | 16,737.03 | 13,385.50 |
+| 2048, 128 | | 3,488.03 | 3,414.67 |
+| 5000, 500 | | 3,813.69 | 3,394.54 |
+| 20000, 2000 | | 1,696.66 | 1,345.42 |
+
+#### Llama 3.3 70B FP8
+
+| | GPU | H200 141GB HBM3 | | | | H100 80GB HBM3 | | | |
|:-----------------------------|:---|:------------------|:---------|:----------|:----------|:-----------------|:---------|:----------|:----------|
-| | TP Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
-| ISL, OSL| | | | | | | | | |
+| | TP Size | 1 | 2 | 4 | 8 | 1 | 2 | 4 | 8 |
+| ISL, OSL | | | | | | | | | |
| | | | | | | | | | |
-| 128, 128 | | 3,657.58 | 6,477.50 | 10,466.04 | 15,554.57 | 3,191.27 | 6,183.41 | 10,260.68 | 14,686.01 |
-| 128, 2048 | | 4,351.07 | 8,450.31 | 13,438.71 | 20,750.58 | 745.19 | 5,822.02 | 11,442.01 | 17,463.99 |
-| 128, 4096 | | 2,696.61 | 5,598.92 | 11,524.93 | 16,634.90 | | 3,714.87 | 8,209.91 | 12,598.55 |
-| 500, 2000 | | 3,475.58 | 6,712.35 | 12,332.32 | 17,311.28 | | 4,704.31 | 10,278.02 | 14,630.41 |
-| 1000, 1000 | | 2,727.42 | 5,097.36 | 8,698.15 | 12,794.92 | 734.67 | 4,191.26 | 7,427.35 | 11,082.48 |
-| 1000, 2000 | | 2,913.54 | 5,841.15 | 9,016.49 | 13,174.68 | 526.31 | 3,920.44 | 7,590.35 | 11,108.11 |
-| 1024, 2048 | | 2,893.02 | 5,565.28 | 9,017.72 | 13,117.34 | 525.43 | 3,896.14 | 7,557.32 | 11,028.32 |
-| 2048, 128 | | 433.30 | 772.97 | 1,278.26 | 1,947.33 | 315.90 | 747.51 | 1,240.12 | 1,840.12 |
-| 2048, 2048 | | 1,990.25 | 3,822.83 | 7,068.68 | 10,529.06 | 357.98 | 2,732.86 | 5,640.31 | 8,772.88 |
-| 5000, 500 | | 543.88 | 1,005.81 | 1,714.77 | 2,683.22 | 203.27 | 866.77 | 1,571.92 | 2,399.78 |
-| 20000, 2000 | | 276.99 | 618.01 | 1,175.35 | 2,021.08 | | 408.43 | 910.77 | 1,568.84 |
+| 128, 128 | | 3,605.47 | 6,427.69 | 10,407.42 | 15,434.37 | 3,128.33 | 6,216.91 | | |
+| 128, 2048 | | 4,315.80 | 8,464.03 | 13,508.59 | 20,759.72 | 756.42 | 5,782.57 | 11,464.94 | 17,424.32 |
+| 128, 4096 | | 2,701.17 | 5,573.55 | 11,458.56 | 16,668.75 | | 3,868.37 | 8,206.39 | 12,624.61 |
+| 500, 2000 | | 3,478.76 | 6,740.06 | 12,200.18 | | | 4,684.06 | 9,903.53 | 14,553.93 |
+| 1000, 1000 | | 2,744.32 | 5,119.72 | 8,685.44 | 12,744.51 | 742.14 | 4,247.19 | 7,435.65 | 11,018.81 |
+| 1000, 2000 | | 2,896.44 | 5,847.26 | 9,031.21 | 13,141.17 | 533.74 | 3,866.53 | 7,611.12 | 11,139.22 |
+| 1024, 2048 | | 2,874.18 | 5,568.61 | 8,946.71 | 13,082.62 | 530.16 | 3,796.68 | 7,575.24 | 11,004.31 |
+| 2048, 128 | | 435.90 | 772.67 | 1,264.76 | | | 736.89 | 1,213.33 | 1,839.22 |
+| 2048, 2048 | | | | | 10,412.85 | | | | |
+| 5000, 500 | | 545.96 | 997.15 | 1,698.22 | 2,655.28 | 204.94 | 862.91 | 1,552.68 | 2,369.84 |
+| 20000, 2000 | | 276.66 | 620.33 | 1,161.29 | 1,985.85 | | 416.13 | 903.66 | 1,554.10 |

#### Llama 3.1 405B FP8
-| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
+
+| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
|:-----------------------------|:---|:------------------|:-----------------|
-| | TP Size | 8 | 8 |
+| | TP Size | 8 | 8 |
| ISL, OSL | | | |
| | | | |
-| 128, 128 | | 3,800.11 | 3,732.40 |
-| 128, 2048 | | 5,661.13 | 4,572.23 |
-| 128, 4096 | | 5,167.18 | 2,911.42 |
-| 500, 2000 | | 4,854.29 | 3,661.85 |
-| 1000, 1000 | | 3,332.15 | 2,963.36 |
-| 1000, 2000 | | 3,682.15 | 3,253.17 |
-| 1024, 2048 | | 3,685.56 | 3,089.16 |
-| 2048, 128 | | 453.42 | 448.89 |
-| 2048, 2048 | | 3,055.73 | 2,139.94 |
-| 5000, 500 | | 656.11 | 579.14 |
-| 20000, 2000 | | 514.02 | 370.26 |
+| 128, 2048 | | 5,567.87 | |
+| 128, 4096 | | 5,136.85 | |
+| 500, 2000 | | 4,787.61 | 3,673.91 |
+| 1000, 1000 | | 3,286.30 | 3,012.22 |
+| 1000, 2000 | | 3,636.76 | 3,262.20 |
+| 1024, 2048 | | 3,618.66 | 3,109.70 |
+| 2048, 128 | | 443.10 | 449.02 |
+| 5000, 500 | | 645.46 | |
+| 20000, 2000 | | | 372.12 |
+
+#### Llama 4 Maverick FP8
+
+| | GPU | H200 141GB HBM3 | H100 80GB HBM3 |
+|:-----------------------------|:---|:------------------|:-----------------|
+| | TP Size | 8 | 8 |
+| ISL, OSL | | | |
+| | | | |
+| 128, 2048 | | 27,543.87 | |
+| 128, 4096 | | 18,541.01 | 11,163.12 |
+| 500, 2000 | | 21,117.34 | |
+| 1000, 2000 | | | 10,556.00 |
+| 1024, 2048 | | 16,859.45 | 11,584.33 |
+| 2048, 128 | | 4,364.06 | 3,832.38 |
+| 2048, 2048 | | 12,800.89 | |
+| 5000, 500 | | 5,128.60 | |
+| 20000, 2000 | | 1,764.27 | 1,400.79 |

## Reproducing Benchmarked Results

@@ -198,6 +216,8 @@ a model name (HuggingFace reference or path to a local model), a [generated data
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```

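For concreteness, the command above might be filled in as in the following sketch. The model name is taken from the FP8 list in this diff; the dataset and options-file paths are hypothetical placeholders, not values from the commit.

```
# Sketch only: model name from the tables above; both paths are illustrative.
model_name=nvidia/Llama-3.3-70B-Instruct-FP8
dataset_file=./synthetic_dataset.txt   # a generated dataset, as described above
llm_options=./llm_options.yml          # the extra options file shown below

trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch --extra_llm_api_options $llm_options
```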
+The data collected for the v0.20 benchmarks was run with the following file:
+
`llm_options.yml`
```yaml

@@ -222,7 +242,7 @@ trtllm-bench --model $model_name throughput --dataset $dataset_file --backend py
- 8192
```

-In majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
+In a majority of cases, we also use a higher KV cache percentage by setting `--kv_cache_free_gpu_mem_fraction 0.95` in the benchmark command. This allows us to obtain better performance than the default setting of `0.90`. We fall back to `0.90` if we hit an out of memory issue.
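Concretely, that flag might be appended to the same benchmark invocation, as in the sketch below; we assume it is passed to the `throughput` subcommand alongside the options shown earlier, with the same placeholder variables as above.

```
# Sketch: same invocation as above, with a higher KV-cache memory fraction.
trtllm-bench --model $model_name throughput --dataset $dataset_file --backend pytorch \
  --extra_llm_api_options $llm_options --kv_cache_free_gpu_mem_fraction 0.95
```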
The results will be printed to the terminal upon benchmark completion. For example,