
[Bug]: segfault when using speculative decoding on 1.1.0rc1 #7310

@zeiler

Description

System Info

With the latest 1.1.0rc1 there seems to be a segfault at higher concurrency (not sure exactly where, but somewhere above 10 requests in flight). This does not happen when speculative decoding is off, so my guess, looking through recent commits, is that it comes from the recent refactors there and the changes to CUDA graph handling.

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Config to reproduce:

low_latency.yaml

cuda_graph_config:
  enable_padding: true
  max_batch_size: 128

enable_attention_dp: false

moe_config:
  backend: TRTLLM

#############
# Commenting out this block makes the error go away
speculative_config:
  decoding_type: AUTO
#############

enable_chunked_prefill: true

On a B200 GPU:

trtllm-serve openai/gpt-oss-120b \
  --host 0.0.0.0 \
  --port 23333 \
  --backend pytorch \
  --tp_size 1 \
  --ep_size 1 \
  --trust_remote_code \
  --extra_llm_api_options /tmp/low_latency.yaml \
  --kv_cache_free_gpu_memory_fraction 0.75

Then send many concurrent requests. trtllm-bench can likely do that, but we've been testing with other OpenAI-compatible clients. Concurrency of 1 works fine, concurrency of 10 seems fine, then somewhere between 10 and 50 it segfaults.
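
For reference, a minimal sketch of such a concurrency sweep against the OpenAI-compatible endpoint (the port and model name are taken from the trtllm-serve command above; the prompt, token counts, and sweep values are just illustrative):

import concurrent.futures

from openai import OpenAI

# Points at the trtllm-serve endpoint launched above; the API key is unused.
client = OpenAI(base_url="http://localhost:23333/v1", api_key="dummy")

def one_request(i: int) -> str:
    # Arbitrary prompt and max_tokens, just to keep requests in flight.
    resp = client.chat.completions.create(
        model="openai/gpt-oss-120b",
        messages=[{"role": "user", "content": f"Write a short story #{i}."}],
        max_tokens=256,
    )
    return resp.choices[0].message.content

# Concurrency of 1 and 10 are fine; the segfault shows up somewhere between 10 and 50.
for concurrency in (1, 10, 50):
    with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
        list(pool.map(one_request, range(concurrency * 4)))
    print(f"concurrency={concurrency} completed")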

This does not happen in the gpt-oss docker image that was provided at launch.

Expected behavior

It doesn't segfault.

Actual behavior

It does segfault:

Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1471, in _forward_step
    outputs = forward(scheduled_requests, self.resource_manager,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
    result = func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1459, in forward
    return self.model_engine.forward(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/utils.py", line 72, in wrapper
    return func(self, *args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 2233, in forward
    outputs = maybe_graph.run(inputs)
              ^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/cuda_graph_runner.py", line 114, in run
    self.input_ids[:seqlen].copy_(input_ids)
RuntimeError: The size of tensor a (56) must match the size of tensor b (77) at non-singleton dimension 0
[08/27/2025-18:19:32] [TRT-LLM] [E] Encountered an error in forward function: The size of tensor a (56) must match the size of tensor b (77) at non-singleton dimension 0
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:536  :0:977] Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0x60114d790)
==== backtrace (tid:    977) ====
 0  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(ucs_handle_error+0x2e4) [0x7fc8f0aa2774]
 1  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x3796a) [0x7fc8f0aa296a]
 2  /opt/hpcx/ompi/lib/openmpi/../../../ucx/lib/libucs.so.0(+0x37ba8) [0x7fc8f0aa2ba8]
 3  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager11storeBlocksERKSt6vectorINS1_8BlockKeyESaIS4_EERKS3_IiSaIiEE+0x158) [0x7fc50f773248]
 4  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager19storeBlocksForReuseERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xcc) [0x7fc50f77458c]
 5  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager12BlockManager13releaseBlocksERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xb5) [0x7fc50f7747d5]
 6  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManager14removeSequenceEmNS_6common11OptionalRefIKNS0_10LlmRequestEEE+0x156) [0x7fc50f774976]
 7  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x4be0ec) [0x7fc5740e30ec]
 8  /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x474a0f) [0x7fc574099a0f]
 9  /usr/bin/python() [0x58208f]
10  /usr/bin/python(_PyObject_MakeTpCall+0x75) [0x549185]
11  /usr/bin/python(_PyEval_EvalFrameDefault+0xa89) [0x5d73c9]
12  /usr/bin/python() [0x54cd32]
13  /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b) [0x5db55b]
14  /usr/bin/python() [0x54cd32]
15  /usr/bin/python() [0x6f826c]
16  /usr/bin/python() [0x6b917c]
17  /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fc8f8e6daa4]
18  /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44) [0x7fc8f8efaa34]
=================================
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] *** Process received signal ***
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Signal: Segmentation fault (11)
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Signal code:  (-6)
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] Failing at address: 0xfffc00000218
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 0] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x45330)[0x7fc8f8e16330]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 1] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager11storeBlocksERKSt6vectorINS1_8BlockKeyESaIS4_EERKS3_IiSaIiEE+0x158)[0x7fc50f773248]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 2] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager18WindowBlockManager19storeBlocksForReuseERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xcc)[0x7fc50f77458c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 3] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager12BlockManager13releaseBlocksERNS1_17GenerationRequestENS_6common11OptionalRefIKNS0_10LlmRequestEEE+0xb5)[0x7fc50f7747d5]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 4] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/libs/libtensorrt_llm.so(_ZN12tensorrt_llm13batch_manager16kv_cache_manager14KVCacheManager14removeSequenceEmNS_6common11OptionalRefIKNS0_10LlmRequestEEE+0x156)[0x7fc50f774976]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 5] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x4be0ec)[0x7fc5740e30ec]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 6] /usr/local/lib/python3.12/dist-packages/tensorrt_llm/bindings.cpython-312-x86_64-linux-gnu.so(+0x474a0f)[0x7fc574099a0f]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 7] /usr/bin/python[0x58208f]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 8] /usr/bin/python(_PyObject_MakeTpCall+0x75)[0x549185]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [ 9] /usr/bin/python(_PyEval_EvalFrameDefault+0xa89)[0x5d73c9]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [10] /usr/bin/python[0x54cd32]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [11] /usr/bin/python(_PyEval_EvalFrameDefault+0x4c1b)[0x5db55b]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [12] /usr/bin/python[0x54cd32]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [13] /usr/bin/python[0x6f826c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [14] /usr/bin/python[0x6b917c]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [15] /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4)[0x7fc8f8e6daa4]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] [16] /usr/lib/x86_64-linux-gnu/libc.so.6(__clone+0x44)[0x7fc8f8efaa34]
[runner-cl-864212209b8693ac64d4009c5474dcb8-76c9b9cd4-gqc7w:00536] *** End of error message ***
--------------------------------------------------------------------------
Child job 2 terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------

Additional notes

Recent commits around CUDA graph handling, batch padding, and speculative decoding seem like the most likely causes.

Labels

KV-Cache Management, Speculative Decoding<NV>, bug
