Commit 05316d3

TensorRT-LLM v0.11 Update (#1969)
1 parent 9bd15f1 commit 05316d3

File tree

1,024 files changed: +2,084,834 / -868,880 lines changed


.gitignore

Lines changed: 1 addition & 1 deletion
@@ -6,9 +6,9 @@ __pycache__/
 *.nsys-rep
 .VSCodeCounter
 build*/
+!builders/
 *.egg-info/
 .coverage
-*.csv
 *.onnx
 tmp/
 venv/

.pre-commit-config.yaml

Lines changed: 1 addition & 1 deletion
@@ -46,5 +46,5 @@ repos:
 args:
 - --skip=".git,3rdparty"
 - --exclude-file=examples/whisper/tokenizer.py
-- --ignore-words-list=rouge,inout,atleast,strat,nd
+- --ignore-words-list=rouge,inout,atleast,strat,nd,subtile
 exclude: 'tests/llm-test-defs/turtle/test_input_files'

README.md

Lines changed: 46 additions & 4 deletions
@@ -6,9 +6,9 @@ TensorRT-LLM

 [![Documentation](https://img.shields.io/badge/docs-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-LLM/)
 [![python](https://img.shields.io/badge/python-3.10.12-green)](https://www.python.org/downloads/release/python-31012/)
-[![cuda](https://img.shields.io/badge/cuda-12.4.0-green)](https://developer.nvidia.com/cuda-downloads)
-[![trt](https://img.shields.io/badge/TRT-10.0.1-green)](https://developer.nvidia.com/tensorrt)
-[![version](https://img.shields.io/badge/release-0.10.0.dev-green)](./setup.py)
+[![cuda](https://img.shields.io/badge/cuda-12.4.1-green)](https://developer.nvidia.com/cuda-downloads)
+[![trt](https://img.shields.io/badge/TRT-10.1.0-green)](https://developer.nvidia.com/tensorrt)
+[![version](https://img.shields.io/badge/release-0.11.0-green)](./tensorrt_llm/version.py)
 [![license](https://img.shields.io/badge/license-Apache%202-blue)](./LICENSE)

 [Architecture](./docs/source/architecture/overview.md)   |   [Results](./docs/source/performance/perf-overview.md)   |   [Examples](./examples/)   |   [Documentation](./docs/source/)
@@ -17,7 +17,44 @@ TensorRT-LLM
 <div align="left">

 ## Latest News
-* [*Weekly*] Check out **[@NVIDIAAIDev](https://twitter.com/nvidiaaidev?lang=en)** & **[NVIDIA AI](https://www.linkedin.com/showcase/nvidia-ai/)** LinkedIn for the latest updates!
+* [2024/07/09] Checklist to maximize multi-language performance of @meta #Llama3 with #TensorRT #LLM inference:
+✅ MultiLingual
+✅ NIM
+✅ LoRA tuned adaptors[➡️ Tech blog](https://developer.nvidia.com/blog/deploy-multilingual-llms-with-nvidia-nim/)
+<div align="center">
+<img src="docs/source/media/picture-07-09-2024.png" width="45%">
+<div align="left">
+
+* [2024/07/02] Let the @MistralAI MoE tokens fly 📈 🚀 #Mixtral 8x7B with NVIDIA #TensorRT #LLM on #H100.
+[➡️ Tech blog](https://developer.nvidia.com/blog/achieving-high-mixtral-8x7b-performance-with-nvidia-h100-tensor-core-gpus-and-tensorrt-llm?ncid=so-twit-928467)
+
+* [2024/06/24] Enhanced with NVIDIA #TensorRT #LLM, @upstage.ai’s solar-10.7B-instruct is ready to power your developer projects through our API catalog 🏎️. ✨[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try )
+
+* [2024/06/18] CYMI: 🤩 Stable Diffusion 3 dropped last week 🎊 🏎️ Speed up your SD3 with #TensorRT INT8 Quantization[➡️ link](https://build.nvidia.com/upstage/solar-10_7b-instruct?snippet_tab=Try )
+
+* [2024/06/18] 🧰Deploying ComfyUI with TensorRT? Here’s your setup guide [➡️ link](https://github.com/comfyanonymous/ComfyUI_TensorRT)
+
+* [2024/06/11] ✨#TensorRT Weight-Stripped Engines ✨
+Technical Deep Dive for serious coders ✅+99% compression ✅1 set of weights → ** GPUs ✅0 performance loss ✅** models…LLM, CNN, etc.[➡️ link](https://developer.nvidia.com/blog/maximum-performance-and-minimum-footprint-for-ai-apps-with-nvidia-tensorrt-weight-stripped-engines/)
+
+* [2024/06/04] ✨ #TensorRT and GeForce #RTX unlock ComfyUI SD superhero powers 🦸⚡ 🎥 Demo: [➡️ link](https://youtu.be/64QEVfbPHyg)
+📗 DIY notebook: [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&name=ComfyUI_TensorRT&instance=L4%40g2-standard-4%3Anvidia-l4%3A1&diskStorage=500&cloudID=GCP&baseImage=docker.io%2Fpytorch%2Fpytorch%3A2.2.0-cuda12.1-cudnn8-runtime&ports=ComfUI%3A8188&file=https%3A%2F%2Fgithub.com%2Fbrevdev%2Fnotebooks%2Fblob%2Fmain%2Ftensorrt-comfyui.ipynb&launchableID=env-2hQX3n7ae5mq3NjNZ32DfAG0tJf)
+
+* [2024/05/28] ✨#TensorRT weight stripping for ResNet-50 ✨ ✅+99% compression
+✅1 set of weights → ** GPUs\ ✅0 performance loss ✅** models…LLM, CNN, etc
+👀 📚 DIY [➡️ link](https://console.brev.dev/launchable/deploy?userID=2x2sil999&orgID=ktj33l4xj&launchableID=env-2h6bym7h5GFNho3vpWQQeUYMwTM&instance=L4%40g6.xlarge&diskStorage=500&cloudID=devplane-brev-1&baseImage=nvcr.io%2Fnvidia%2Ftensorrt%3A24.05-py3&file=https%3A%2F%2Fgithub.com%2FNVIDIA%2FTensorRT%2Fblob%2Frelease%2F10.0%2Fsamples%2Fpython%2Fsample_weight_stripping%2Fnotebooks%2Fweight_stripping.ipynb&name=tensorrt_weight_stripping_resnet50)
+
+* [2024/05/21] ✨@modal_labs has the codes for serverless @AIatMeta Llama 3 on #TensorRT #LLM ✨👀 📚 Marvelous Modal Manual:
+Serverless TensorRT-LLM (LLaMA 3 8B) | Modal Docs [➡️ link](https://modal.com/docs/examples/trtllm_llama)
+
+* [2024/05/08] NVIDIA TensorRT Model Optimizer -- the newest member of the #TensorRT ecosystem is a library of post-training and training-in-the-loop model optimization techniques ✅quantization ✅sparsity ✅QAT [➡️ blog](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
+
+
+* [2024/05/07] 🦙🦙🦙 24,000 tokens per second 🛫Meta Llama 3 takes off with #TensorRT #LLM 📚[➡️ link](https://blogs.nvidia.com/blog/meta-llama3-inference-acceleration/)
+
+<details close>
+<summary>Previous News</summary>
+
 * [2024/02/06] [🚀 Speed up inference with SOTA quantization techniques in TRT-LLM](./docs/source/blogs/quantization-in-TRT-LLM.md)
 * [2024/01/30] [ New XQA-kernel provides 2.4x more Llama-70B throughput within the same latency budget](./docs/source/blogs/XQA-kernel.md)
 * [2023/12/04] [Falcon-180B on a single H200 GPU with INT4 AWQ, and 6.7x faster Llama-70B over A100](./docs/source/blogs/Falcon180B-H200.md)
@@ -29,6 +66,8 @@ TensorRT-LLM
 * [2023/10/17] [Large Language Models up to 4x Faster on RTX With TensorRT-LLM for Windows
 ](https://blogs.nvidia.com/blog/2023/10/17/tensorrt-llm-windows-stable-diffusion-rtx/)

+</details>
+
 ## TensorRT-LLM Overview

 TensorRT-LLM is an easy-to-use Python API to define Large
@@ -75,3 +114,6 @@ To get started with TensorRT-LLM, visit our documentation:
 - [Installation Guide for Linux](https://nvidia.github.io/TensorRT-LLM/installation/linux.html)
 - [Installation Guide for Windows](https://nvidia.github.io/TensorRT-LLM/installation/windows.html)
 - [Supported Hardware, Models, and other Software](https://nvidia.github.io/TensorRT-LLM/reference/support-matrix.html)
+
+## Community
+- [Model zoo](https://huggingface.co/TheFloat16) (generated by TRT-LLM rel 0.9 a9356d4b7610330e89c1010f342a9ac644215c52)

benchmarks/README.md

Lines changed: 11 additions & 0 deletions
@@ -0,0 +1,11 @@
+# TensorRT-LLM Benchmarks
+
+## Overview
+
+There are currently three workflows to benchmark TensorRT-LLM:
+* [C++ benchmarks](./cpp)
+  - The recommended workflow that uses TensorRT-LLM C++ API and can take advantage of the latest features of TensorRT-LLM.
+* [Python benchmarks](./python)
+  - The Python benchmarking scripts can only benchmark the Python runtime, which do not support the latest features, such as in-flight batching.
+* [The Python benchmarking suite](./suite)
+  - This benchmarking suite is a current work in progress and is prone to large changes.

benchmarks/cpp/README.md

Lines changed: 68 additions & 80 deletions
@@ -1,7 +1,7 @@
-# Benchmark for C++ Runtime
+# Benchmark C++ Runtime

 This document explains how to benchmark the models supported by TensorRT-LLM on a single GPU, a single node with
-multiple GPUs or multiple nodes with multiple GPUs.
+multiple GPUs or multiple nodes with multiple GPUs using the C++ runtime.

 ## Usage

@@ -16,58 +16,11 @@ Windows users: Follow the
 instead, and be sure to set DLL paths as specified in
 [Extra Steps for C++ Runtime Usage](../../windows/README.md#extra-steps-for-c-runtime-usage).

-### 2. Launch C++ benchmarking (Fixed BatchSize/InputLen/OutputLen)
-
-#### Prepare TensorRT-LLM engine(s)
-
-Before you launch C++ benchmarking, please make sure that you have already built engine(s) using TensorRT-LLM API, C++ benchmarking code cannot generate engine(s) for you.
-
-Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
-
-#### Launch benchmarking
-
-For detailed usage, you can do the following
-```
-cd cpp/build
-
-# You can directly execute the binary for help information
-./benchmarks/gptSessionBenchmark --help
-./benchmarks/bertBenchmark --help
-```
-
-Take GPT-350M as an example for single GPU
-
-```
-./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_350m/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
-```
-Take GPT-175B as an example for multiple GPUs
-```
-mpirun -n 8 ./benchmarks/gptSessionBenchmark \
-    --engine_dir "../../benchmarks/gpt_175b/" \
-    --batch_size "1" \
-    --input_output_len "60,20"
-
-# Expected output:
-# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
-```
-
-If you want to obtain context and generation logits, you could build an enigne with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enable `--gather_all_token_logits` will enable both of them.
-
-If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
-
-*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
-
-### 3. Launch Batch Manager benchmarking (Inflight/V1 batching)
+### 2. Launch C++ benchmarking (Inflight/V1 batching)

 #### Prepare dataset

-Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*
+Run a preprocessing script to prepare/generate dataset into a json that gptManagerBenchmark can consume later. The processed output json has *input tokens length, input token ids and output tokens length*.

 This tool can be used in 2 different modes of traffic generation.

@@ -127,7 +80,8 @@ For `tokenizer`, specifying the path to the local tokenizer that have already be


 #### Prepare TensorRT-LLM engines
-Please make sure that the engines are built with argument `--use_inflight_batching` and `--remove_input_padding` if you'd like to benchmark inflight batching, for more details, please see the document in TensorRT-LLM examples.
+
+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using `trtllm-build` command. For more details on building engine(s), please refer to the [Quick Start Guide](../../docs/source/quick-start-guide.md).

 #### Launch benchmarking

@@ -139,34 +93,24 @@ cd cpp/build
 ./benchmarks/gptManagerBenchmark --help
 ```

-Take GPT-350M as an example for single GPU V1 batching
-```
-./benchmarks/gptManagerBenchmark \
-    --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type V1 \
-    --request_rate 10 \
-    --dataset ../../benchmarks/cpp/preprocessed_dataset.json
-    --max_num_samples 500
-```
-
 Take GPT-350M as an example for 2-GPU inflight batching
 ```
 mpirun -n 2 ./benchmarks/gptManagerBenchmark \
     --engine_dir ../../examples/gpt/trt_engine/gpt2-ib/fp16/2-gpu/ \
-    --type IFB \
     --request_rate 10 \
     --dataset ../../benchmarks/cpp/preprocessed_dataset.json
     --max_num_samples 500
 ```

-`gptManagerBenchmark` can also be used with the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`). This can be done by passing the argument `--api executor`. Note that the Executor class is still under development and currently does not support models with tp or pp > 1.
+`gptManagerBenchmark` by default uses the high-level C++ API defined by the `executor::Executor` class (see `cpp/include/tensorrt_llm/executor/executor.h`).

 #### Emulated static batching

-To emulate `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
+To emulate the deprecated `gptSessionBenchmark` static batching, you can use `gptManagerBenchmark` with the `--static_emulated_batch_size` and `--static_emulated-timeout` arguments.
+
 Given a `static_emulated_batch_size` of `n` the server will wait for `n` requests to arrive before submitting them to the batch manager at once. If the `static_emulated_timeout` (in ms) is reached before `n` requests are collected, the batch will be submitted prematurely with the current request count. New batches will only be submitted once the previous batch has been processed comepletely.

-`gptSessionBenchmark` uses fixed input/output lengths for benchmarking. A similar dataset for `gptManagerBenchmark` can be generated with the preprocessing script, e.g.
+Datasets with fixed input/output lengths for benchmarking can be generated with the preprocessing script, e.g.
 ```
 python prepare_dataset.py \
     --output tokens-fixed-lengths.json \
@@ -181,7 +125,6 @@ Take GPT-350M as an example for single GPU with static batching
 ```
 ./benchmarks/gptManagerBenchmark \
     --engine_dir ../../examples/gpt/trt_engine/gpt2/fp16/1-gpu/ \
-    --type IFB \
     --request-rate -1 \
     --static_emulated_batch_size 32 \
     --static_emulated_timeout 100 \
@@ -210,8 +153,10 @@ TP=2
 PP=1
 MAX_LEN=1024
 MAX_BATCH=32
-MAX_LORA_RANK=32
+NUM_LAYERS=40
+MAX_LORA_RANK=64
 NUM_LORA_MODS=7
+EOS_ID=2

 SOURCE_LORA=chinese-llama-2-lora-13b
 CPP_LORA=chinese-llama-2-lora-13b-cpp
@@ -230,14 +175,14 @@ ${HOME}/.local/bin/trtllm-build \
     --output_dir ${LORA_ENGINE} \
     --max_batch_size ${MAX_BATCH} \
     --max_input_len $MAX_LEN \
-    --max_output_len $MAX_LEN \
+    --max_seq_len $((2*${MAX_LEN})) \
     --gemm_plugin float16 \
    --lora_plugin float16 \
     --use_paged_context_fmha enable \
-    --lora_target_modules attn_qkv \
+    --lora_target_modules attn_q attn_k attn_v attn_dense mlp_h_to_4h mlp_4h_to_h mlp_gate \
     --max_lora_rank ${MAX_LORA_RANK}

-NUM_LORAS=(8 16 24 32 64 128 256)
+NUM_LORAS=(8 16)
 NUM_REQUESTS=1024

 # Convert LoRA to cpp format
@@ -252,8 +197,6 @@ mkdir -p $EG_DIR/data
 # Prepare dataset without lora_task_id
 python benchmarks/cpp/prepare_dataset.py \
     --output "${EG_DIR}/data/token-norm-dist.json" \
-    --request-rate -1 \
-    --time-delay-dist constant \
     --tokenizer $TOKENIZER \
     token-norm-dist \
     --num-requests $NUM_REQUESTS \
@@ -263,8 +206,6 @@ python benchmarks/cpp/prepare_dataset.py \
 for nloras in ${NUM_LORAS[@]}; do
     python benchmarks/cpp/prepare_dataset.py \
         --output "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
-        --request-rate -1 \
-        --time-delay-dist constant \
         --rand-task-id 0 $(( $nloras - 1 )) \
         --tokenizer $TOKENIZER \
         token-norm-dist \
@@ -273,7 +214,7 @@ for nloras in ${NUM_LORAS[@]}; do
 done

 # Generate random lora weights for 256 adapters
-python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 256
+python benchmarks/cpp/utils/generate_rand_loras.py ${CPP_LORA} ${EG_DIR}/loras 16

 # perform benchmarking

@@ -286,13 +227,13 @@ mpirun -n ${TP} --output-filename ${EG_DIR}/log-base-lora \
    --dataset "${EG_DIR}/data/token-norm-dist.json" \
    --lora_host_cache_bytes 8589934592 \
    --lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-    --kv_cache_free_gpu_mem_fraction 0.80 \
+    --kv_cache_free_gpu_mem_fraction 0.70 \
    --log_level info \
    --eos_id ${EOS_ID}

 # Now run inference with various numbers or loras
 # The host cache is set large enough to hold all the LoRAs in lora_dir
-# GPU cache is set to hold 32 LoRAs
+# GPU cache is set to hold 16 LoRAs
 # This benchmark will preload all the LoRAs into the host cache
 # We run inference on a range of active LoRAs exercising different cache miss rates.
 for nloras in ${NUM_LORAS[@]}; do
@@ -303,10 +244,57 @@ for nloras in ${NUM_LORAS[@]}; do
    --type IFB \
    --dataset "${EG_DIR}/data/token-norm-dist-lora-${nloras}.json" \
    --lora_host_cache_bytes 8589934592 \
-    --lora_num_device_mod_layers $(( 32 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
-    --kv_cache_free_gpu_mem_fraction 0.80 \
+    --lora_num_device_mod_layers $(( 16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK )) \
+    --kv_cache_free_gpu_mem_fraction 0.70 \
    --log_level info \
    --eos_id ${EOS_ID} \
    --lora_dir ${EG_DIR}/loras
 done
 ```
+
+### 3. [DEPRECATED] Launch C++ static batching benchmarking (Fixed BatchSize/InputLen/OutputLen)
+
+#### Prepare TensorRT-LLM engine(s)
+
+Before you launch C++ benchmarking, please make sure that you have already built engine(s) using TensorRT-LLM API, C++ benchmarking code cannot generate engine(s) for you.
+
+Use `trtllm-build` to build the TRT-LLM engine. Alternatively, if you have already benchmarked Python Runtime, you can reuse the engine(s) built previously, please see that [`document`](../python/README.md).
+
+#### Launch benchmarking
+
+For detailed usage, you can do the following
+```
+cd cpp/build
+
+# You can directly execute the binary for help information
+./benchmarks/gptSessionBenchmark --help
+./benchmarks/bertBenchmark --help
+```
+
+Take GPT-350M as an example for single GPU
+
+```
+./benchmarks/gptSessionBenchmark \
+    --engine_dir "../../benchmarks/gpt_350m/" \
+    --batch_size "1" \
+    --input_output_len "60,20"
+
+# Expected output:
+# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 40.81
+```
+Take GPT-175B as an example for multiple GPUs
+```
+mpirun -n 8 ./benchmarks/gptSessionBenchmark \
+    --engine_dir "../../benchmarks/gpt_175b/" \
+    --batch_size "1" \
+    --input_output_len "60,20"
+
+# Expected output:
+# [BENCHMARK] batch_size 1 input_length 60 output_length 20 latency(ms) 792.14
+```
+
+If you want to obtain context and generation logits, you could build an enigne with `--gather_context_logits` and `--gather_generation_logits`, respectively. Enable `--gather_all_token_logits` will enable both of them.
+
+If you want to get the logits, you could run gptSessionBenchmark with `--print_all_logits`. This will print a large number of logit values and has a certain impact on performance.
+
+*Please note that the expected outputs in that document are only for reference, specific performance numbers depend on the GPU you're using.*
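
Two illustrative notes on the `benchmarks/cpp/README.md` content above.

First, the GPU LoRA cache sizing in the LoRA benchmark script scales with the number of resident adapters: in the per-LoRA runs, `--lora_num_device_mod_layers` is set to `16 * $NUM_LAYERS * $NUM_LORA_MODS * $MAX_LORA_RANK`, and with the values defined in the script (`NUM_LAYERS=40`, `NUM_LORA_MODS=7`, `MAX_LORA_RANK=64`) that works out to 16 × 40 × 7 × 64 = 286,720, the leading factor of 16 matching the inline comment that the GPU cache is sized to hold 16 LoRAs.

Second, the emulated static batching behaviour described for `gptManagerBenchmark` (collect `n` requests, flush early if the timeout expires, and only form the next batch once the previous one has completed) can be pictured with a minimal standalone sketch. This is not TensorRT-LLM code; `Request` and `submitBatch` are hypothetical placeholders, and only the batching policy itself is illustrated.

```
// Minimal sketch of the emulated static batching policy (assumed types, not TRT-LLM APIs).
#include <chrono>
#include <cstddef>
#include <queue>
#include <vector>

struct Request { int id; };                       // placeholder request type
void submitBatch(std::vector<Request> const&) {}  // placeholder for handing a batch to the batch manager

void emulatedStaticBatching(std::queue<Request>& incoming, std::size_t batchSize,
    std::chrono::milliseconds timeout)
{
    std::vector<Request> pending;
    auto windowStart = std::chrono::steady_clock::now();
    while (!incoming.empty() || !pending.empty())
    {
        // Collect requests that have arrived, up to the emulated batch size.
        while (!incoming.empty() && pending.size() < batchSize)
        {
            pending.push_back(incoming.front());
            incoming.pop();
        }
        bool const full = pending.size() >= batchSize;
        bool const timedOut = std::chrono::steady_clock::now() - windowStart >= timeout;
        if (!pending.empty() && (full || timedOut))
        {
            // Submit the batch; a new window only starts after this batch is processed.
            submitBatch(pending);
            pending.clear();
            windowStart = std::chrono::steady_clock::now();
        }
    }
}
```

With `--static_emulated_batch_size 32 --static_emulated_timeout 100`, as in the example above, `batchSize` would be 32 and `timeout` 100 ms.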

benchmarks/cpp/bertBenchmark.cpp

Lines changed: 2 additions & 2 deletions
@@ -17,6 +17,7 @@
 #include "tensorrt_llm/common/memoryUtils.h"
 #include "tensorrt_llm/plugins/api/tllmPlugin.h"
 #include "tensorrt_llm/runtime/iTensor.h"
+#include "tensorrt_llm/runtime/rawEngine.h"
 #include "tensorrt_llm/runtime/tllmLogger.h"
 #include "tensorrt_llm/runtime/tllmRuntime.h"
 #include "tensorrt_llm/runtime/worldConfig.h"
@@ -78,11 +79,10 @@ void benchmarkBert(std::string const& modelName, std::filesystem::path const& da
 {
     auto const worldConfig = WorldConfig::mpi();
     auto const enginePath = dataPath / engineFilename(dataPath, worldConfig, modelName);
-    auto engineBlob = loadEngine(enginePath.string());

     for (float gpuWeightsPercent : gpuWeightsPercents)
     {
-        auto rt = std::make_shared<TllmRuntime>(engineBlob.data(), engineBlob.size(), gpuWeightsPercent, *logger);
+        auto rt = std::make_shared<TllmRuntime>(RawEngine(enginePath), logger.get(), gpuWeightsPercent);
         rt->addContext(0);
         for (auto inLen : inLens)
         {
