
[Bug]: segfault when using speculative decoding on 1.1.0rc1 #7310

@zeiler

Description

System Info

With the latest 1.1.0rc1 there seems to be a segfault at higher concurrency (not sure exactly where, but somewhere above 10 requests in flight). This does not happen when speculative decoding is off, so my guess, looking through recent commits, is that it comes from the recent refactors there and the changes to CUDA graph handling.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Config to reproduce:

low_latency.yaml

cuda_graph_config:
  enable_padding: true
  max_batch_size: 128

enable_attention_dp: false

moe_config:
  backend: TRTLLM

#############
# Commenting out this block makes the error go away
speculative_config:
  decoding_type: AUTO
#############

enable_chunked_prefill: true

On a B200 GPU:

trtllm-serve openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --port 23333 \
  --backend pytorch \
  --tp_size 1 \
  --ep_size 1 \
  --trust_remote_code \
  --extra_llm_api_options /tmp/low_latency.yaml \
  --kv_cache_free_gpu_memory_fraction 0.75

Then send many concurrent requests. trtllm-bench can likely do that, but we've been testing with other OpenAI-compatible clients. Concurrency of 1 works fine, concurrency of 10 seems fine, then somewhere between 10 and 50 it segfaults.
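
For reference, a minimal sketch of such a concurrency sweep against the OpenAI-compatible endpoint (the port and model name are taken from the trtllm-serve command above; the prompt, token counts, and sweep values are just illustrative):

import concurrent.futures

from openai import OpenAI

# Points at the trtllm-serve endpoint launched above; the API key is unused.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="dummy")

def one_request(i: int) -> str:
    # Arbitrary prompt and max_tokens, just to keep requests in flight.
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": f"Write a short story #{i}."}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Concurrency of 1 and 10 are fine; the segfault shows up somewhere between 10 and 50.
for concurrency in (1, 10, 50):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, range(concurrency * 4)))
    print(f"concurrency={concurrency} completed")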

This does not happen in the gpt-oss docker image that was provided at launch.

Expected behavior

It doesn't segfault.

Actual behavior

It does segfault:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1471, in _forward_step
    outputs = forward(scheduled_requests, self.resource_manager,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1459, in forward
    return self.model_engine.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/utils.py", line 72, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2233, in forward
    outputs = maybe_graph.run(inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py", line 114, in run
    self.input_ids[:seqlen].copy_(input_ids)
RuntimeError: The size of tensor a (56) must match the size of tensor b (77) at non-singleton dimension 0
[08/27/2025-18:19:32] [TRT-LLM] [E] Encountered an error in forward function: The size of tensor a (56) must match the size of tensor b (77) at non-singleton dimension 0
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:536  :0:977] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x60114d790)
==== backtrace (tid:    977) ====
 0  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc8f0aa2774]
 1  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x3796a) [0x7fc8f0aa296a]
 2  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x37ba8) [0x7fc8f0aa2ba8]
 3  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager11storeBlocksERKSt6vectorINS1_8BlockKeyESaIS4_EERKS3_IiSaIiEE+0x158) [0x7fc50f773248]
 4  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager19storeBlocksForReuseERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xcc) [0x7fc50f77458c]
 5  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager12BlockManager13releaseBlocksERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xb5) [0x7fc50f7747d5]
 6  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManager14removeSequenceEmNS_6common11OptionalRefIKNS0_10LlmRequestEEE+0x156) [0x7fc50f774976]
 7  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x4be0ec) [0x7fc5740e30ec]
 8  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x474a0f) [0x7fc574099a0f]
 9  /usr/bin/python() [0x58208f]
10  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
11  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
12  /usr/bin/python() [0x54cd32]
13  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
14  /usr/bin/python() [0x54cd32]
15  /usr/bin/python() [0x6f826c]
16  /usr/bin/python() [0x6b917c]
17  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc8f8e6daa4]
18  /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44) [0x7fc8f8efaa34]
=================================
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] *** Process received signal ***
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Signal: Segmentation fault (11)
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Signal code:  (-6)
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Failing at address: 0xfffc00000218
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7fc8f8e16330]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 1] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager11storeBlocksERKSt6vectorINS1_8BlockKeyESaIS4_EERKS3_IiSaIiEE+0x158)[0x7fc50f773248]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 2] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager19storeBlocksForReuseERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xcc)[0x7fc50f77458c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 3] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager12BlockManager13releaseBlocksERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xb5)[0x7fc50f7747d5]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 4] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManager14removeSequenceEmNS_6common11OptionalRefIKNS0_10LlmRequestEEE+0x156)[0x7fc50f774976]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 5] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x4be0ec)[0x7fc5740e30ec]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 6] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x474a0f)[0x7fc574099a0f]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 7] /usr/bin/python[0x58208f]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 8] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 9] /usr/bin/python(_PyEval_EvalFrameDefault+0xa89)[0x5d73c9]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [10] /usr/bin/python[0x54cd32]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [11] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b)[0x5db55b]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [12] /usr/bin/python[0x54cd32]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [13] /usr/bin/python[0x6f826c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [14] /usr/bin/python[0x6b917c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7fc8f8e6daa4]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7fc8f8efaa34]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Additional notes

Recent commits around CUDA graph handling, batch padding, and speculative decoding seem like the most likely causes.

Labels

KV-Cache Management, Speculative Decoding<NV>, bug
