Description
System Info
Using the latest 1.1.0rc1, there seems to be a segfault when going to higher concurrency (not sure exactly where, but higher than 10 requests in flight). This does not happen when speculative decoding is off, so my guess, looking through recent commits, is that it is due to some of the recent refactors there and changes to CUDA graph handling.
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Config to reproduce:
low_latency.yaml
cuda_graph_config:
  enable_padding: true
  max_batch_size: 128
enable_attention_dp: false
moe_config:
  backend: TRTLLM
#############
# This is what you can comment out for the error to go away
speculative_config:
  decoding_type: AUTO
#############
enable_chunked_prefill: true
On a B200 GPU:
trtllm-serve openai/gpt-oss-120b --host 0.0.0.0 --port 23333 --backend pytorch --tp_size 1 --ep_size 1 --trust_remote_code --extra_llm_api_options /tmp/low_latency.yaml --kv_cache_free_gpu_memory_fraction 0.75
Then send many requests. trtllm-bench would likely do this, but we have been testing with other OpenAI client calls. A concurrency of 1 works fine and 10 seems fine, then somewhere between 10 and 50 it segfaults.
This does not happen in the gpt-oss Docker image that was provided at launch.
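For the "send many requests" step, a minimal client sketch along these lines is roughly what we use. It is illustrative only (the prompt, request count, and max_tokens are placeholders) and assumes the server started by the command above is listening on localhost:23333.

import asyncio
from openai import AsyncOpenAI

async def run(concurrency: int, num_requests: int = 200) -> None:
    # One client per event loop, pointed at the trtllm-serve endpoint above.
    client = AsyncOpenAI(base_url="http://localhost:23333/v1", api_key="dummy")
    sem = asyncio.Semaphore(concurrency)  # caps the number of in-flight requests

    async def one_request() -> None:
        async with sem:
            await client.chat.completions.create(
                model="openai/gpt-oss-120b",
                messages=[{"role": "user", "content": "Write a short story about a robot."}],
                max_tokens=256,
            )

    await asyncio.gather(*(one_request() for _ in range(num_requests)))
    await client.close()

# Concurrency 1 and 10 complete; somewhere between 10 and 50 the server segfaults.
for concurrency in (1, 10, 50):
    asyncio.run(run(concurrency))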
Expected behavior
It doesn't segfault
Actual behavior
It does segfault:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1471, in _forward_step
outputs = forward(scheduled_requests, self.resource_manager,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1459, in forward
return self.model_engine.forward(
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/utils.py", line 72, in wrapper
return func(self, *args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2233, in forward
outputs = maybe_graph.run(inputs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py", line 114, in run
self.input_ids[:seqlen].copy_(input_ids)
RuntimeError: The size of tensor a (56) must match the size of tensor b (77) at non-singleton dimension 0
[08/27/2025-18:19:32] [TRT-LLM] [E] Encountered an error in forward function: The size of tensor a (56) must match the size of tensor b (77) at non-singleton dimension 0
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:536 :0:977] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x60114d790)
==== backtrace (tid: 977) ====
0 /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc8f0aa2774]
1 /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x3796a) [0x7fc8f0aa296a]
2 /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x37ba8) [0x7fc8f0aa2ba8]
3 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager11storeBlocksERKSt6vectorINS1_8BlockKeyESaIS4_EERKS3_IiSaIiEE+0x158) [0x7fc50f773248]
4 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager19storeBlocksForReuseERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xcc) [0x7fc50f77458c]
5 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager12BlockManager13releaseBlocksERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xb5) [0x7fc50f7747d5]
6 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManager14removeSequenceEmNS_6common11OptionalRefIKNS0_10LlmRequestEEE+0x156) [0x7fc50f774976]
7 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x4be0ec) [0x7fc5740e30ec]
8 /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x474a0f) [0x7fc574099a0f]
9 /usr/bin/python() [0x58208f]
10 /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
11 /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
12 /usr/bin/python() [0x54cd32]
13 /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
14 /usr/bin/python() [0x54cd32]
15 /usr/bin/python() [0x6f826c]
16 /usr/bin/python() [0x6b917c]
17 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc8f8e6daa4]
18 /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44) [0x7fc8f8efaa34]
=================================
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] *** Process received signal ***
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Signal: Segmentation fault (11)
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Signal code: (-6)
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Failing at address: 0xfffc00000218
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7fc8f8e16330]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 1] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager11storeBlocksERKSt6vectorINS1_8BlockKeyESaIS4_EERKS3_IiSaIiEE+0x158)[0x7fc50f773248]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 2] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager19storeBlocksForReuseERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xcc)[0x7fc50f77458c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 3] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager12BlockManager13releaseBlocksERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xb5)[0x7fc50f7747d5]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 4] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManager14removeSequenceEmNS_6common11OptionalRefIKNS0_10LlmRequestEEE+0x156)[0x7fc50f774976]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 5] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x4be0ec)[0x7fc5740e30ec]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 6] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x474a0f)[0x7fc574099a0f]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 7] /usr/bin/python[0x58208f]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 8] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 9] /usr/bin/python(_PyEval_EvalFrameDefault+0xa89)[0x5d73c9]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [10] /usr/bin/python[0x54cd32]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [11] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b)[0x5db55b]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [12] /usr/bin/python[0x54cd32]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [13] /usr/bin/python[0x6f826c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [14] /usr/bin/python[0x6b917c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7fc8f8e6daa4]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7fc8f8efaa34]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Additional notes
Recent commits around CUDA graph handling, batch padding, and speculative decoding changes seem like the most likely causes.
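For what it's worth, the RuntimeError at the top of the trace is the plain torch.Tensor.copy_ size check: the CUDA graph runner slices its pre-captured input_ids buffer to 56 positions while the (padded) batch it is handed actually carries 77 tokens. A stand-alone illustration of just the tensor semantics, not the TRT-LLM code path:

import torch

# Stand-alone illustration of the RuntimeError above, not the TRT-LLM code path.
static_input_ids = torch.zeros(128, dtype=torch.int64)  # persistent buffer captured with the graph
batch_input_ids = torch.arange(77, dtype=torch.int64)   # tokens actually scheduled this step

seqlen = 56  # what the runner believes the batch occupies
static_input_ids[:seqlen].copy_(batch_input_ids)
# -> RuntimeError: The size of tensor a (56) must match the size of tensor b (77) ...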