Open
Labels
- KV-Cache Management: kv-cache management for efficient LLM inference
- Pytorch: <NV>Pytorch backend related issues
- bug: Something isn't working
- waiting for feedback
Description
System Info
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 575.57.08 Driver Version: 575.57.08 CUDA Version: 12.9 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA B200 Off | 00000000:03:00.0 Off | 0 |
| N/A 33C P0 146W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 1 NVIDIA B200 Off | 00000000:13:00.0 Off | 0 |
| N/A 39C P0 142W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 2 NVIDIA B200 Off | 00000000:63:00.0 Off | 0 |
| N/A 33C P0 142W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 3 NVIDIA B200 Off | 00000000:73:00.0 Off | 0 |
| N/A 40C P0 145W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 4 NVIDIA B200 Off | 00000000:83:00.0 Off | 0 |
| N/A 34C P0 152W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 5 NVIDIA B200 Off | 00000000:93:00.0 Off | 0 |
| N/A 40C P0 144W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 6 NVIDIA B200 Off | 00000000:E3:00.0 Off | 0 |
| N/A 34C P0 142W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
| 7 NVIDIA B200 Off | 00000000:F3:00.0 Off | 0 |
| N/A 40C P0 146W / 1000W | 0MiB / 183359MiB | 0% Default |
| | | Disabled |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Built on Tue_May_27_02:21:03_PDT_2025
Cuda compilation tools, release 12.9, V12.9.86
Build cuda_12.9.r12.9/compiler.36037853_0
Python 3.12.3
Name: tensorrt_llm
Version: 1.1.0rc2
Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
Home-page: https://github.com/NVIDIA/TensorRT-LLM
Author: NVIDIA Corporation
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, aenum, backoff, blake3, blobfile, build, click, click_option_group, colored, cuda-python, cytoolz, datasets, diffusers, einops, etcd3, evaluate, fastapi, flashinfer-python, h5py, jsonschema, lark, llguidance, matplotlib, meson, mpi4py, mpmath, msgspec, ninja, numpy, nvidia-cuda-nvrtc-cu12, nvidia-ml-py, nvidia-modelopt, nvidia-nccl-cu12, nvtx, omegaconf, onnx, onnx_graphsurgeon, openai, opencv-python-headless, optimum, ordered-set, pandas, peft, pillow, polygraphy, prometheus_client, prometheus_fastapi_instrumentator, protobuf, psutil, pulp, pydantic, pydantic-settings, pynvml, pyzmq, sentencepiece, setuptools, soundfile, StrEnum, tensorrt, tiktoken, torch, torchvision, transformers, triton, uvicorn, wheel, xgrammar
Required-by:
---
Name: tensorrt
Version: 10.11.0.33
Summary: A high performance deep learning inference library
Home-page: https://github.com/nvidia/tensorrt
Author: NVIDIA Corporation
Author-email:
License: Proprietary
Location: /usr/local/lib/python3.12/dist-packages
Requires:
Required-by: tensorrt_llm
---
Name: torch
Version: 2.8.0a0+5228986c39.nv25.6
Summary: Tensors and Dynamic neural networks in Python with strong GPU acceleration
Home-page: https://pytorch.org/
Author: PyTorch Team
Author-email: [email protected]
License: BSD-3-Clause
Location: /usr/local/lib/python3.12/dist-packages
Requires: filelock, fsspec, jinja2, networkx, setuptools, sympy, typing-extensions
Required-by: accelerate, flash_attn, flashinfer-python, lightning-thunder, nvidia-modelopt, nvidia-resiliency-ext, optimum, peft, tensorrt_llm, torchprofile, torchvision, transformer_engine, xgrammar
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
I am running an intranode 1P1D setup (one prefill server and one decode server on the same node) for DeepSeek V3 FP4 with MTP. Once the servers are warmed up, I run a GPQA benchmark at batch size 1. After some time (more than 50 iterations), the server crashes.
Prefill:
cat >./prefill-extra-llm-api-config.yml<<EOF
enable_iter_perf_stats: true
print_iter_log: false
cuda_graph_config:
  max_batch_size: 32
  enable_padding: false
moe_config:
  backend: TRTLLM
  max_num_tokens: 32768
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  #use_relaxed_acceptance_for_thinking: true
  # relaxed_topk: 3
  # relaxed_delta: 0.6
disable_overlap_scheduler: true
enable_autotuner: false
kv_cache_config:
  free_gpu_memory_fraction: 0.4
  enable_block_reuse: true
  enable_partial_reuse: false
enable_chunked_prefill: true
scheduler_config:
  context_chunking_policy: EQUAL_PROGRESS
cache_transceiver_config:
  backend: UCX
  max_tokens_in_buffer: 32768
EOF
export TORCHDYNAMO_DISABLE=1
trtllm-serve "${MODEL_NAME}" \
  --host 0.0.0.0 \
  --port "$PORT" \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 32768 \
  --max_seq_len 162000 \
  --tp_size 4 --ep_size 1 \
  --extra_llm_api_options ./prefill-extra-llm-api-config.yml \
  --log_level info
Decode:
cat >./decode-extra-llm-api-config.yml<<EOF
enable_iter_perf_stats: true
print_iter_log: false
cuda_graph_config:
  max_batch_size: 32
  enable_padding: false
moe_config:
  backend: TRTLLM
  max_num_tokens: 32768
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  #use_relaxed_acceptance_for_thinking: true
  # relaxed_topk: 3
  # relaxed_delta: 0.6
disable_overlap_scheduler: false
enable_autotuner: false
kv_cache_config:
  free_gpu_memory_fraction: 0.5
  enable_block_reuse: true
  enable_partial_reuse: false
enable_chunked_prefill: true
cache_transceiver_config:
  backend: UCX
  max_tokens_in_buffer: 32768
EOF
export TORCHDYNAMO_DISABLE=1
trtllm-serve "${MODEL_NAME}" \
  --host 0.0.0.0 \
  --port "$PORT" \
  --backend pytorch \
  --max_batch_size 32 \
  --max_num_tokens 32768 \
  --max_seq_len 162000 \
  --tp_size 4 --ep_size 1 \
  --extra_llm_api_options ./decode-extra-llm-api-config.yml \
  --log_level info
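The prefill and decode configs above are nearly identical, so it can help to see exactly where they diverge. The following is a minimal sketch that compares the two, with the values transcribed by hand from the heredocs. The nesting of keys (for example `max_num_tokens` under `moe_config`) is an assumption, because the paste lost its indentation, and `flatten` and `diff_configs` are hypothetical helper names, not part of TensorRT-LLM.

```python
# Values transcribed from the two heredocs above. The nesting is ASSUMED,
# because the pasted YAML lost its indentation.
PREFILL = {
    "enable_iter_perf_stats": True,
    "print_iter_log": False,
    "cuda_graph_config": {"max_batch_size": 32, "enable_padding": False},
    "moe_config": {"backend": "TRTLLM", "max_num_tokens": 32768},
    "speculative_config": {"decoding_type": "MTP", "num_nextn_predict_layers": 3},
    "disable_overlap_scheduler": True,
    "enable_autotuner": False,
    "kv_cache_config": {
        "free_gpu_memory_fraction": 0.4,
        "enable_block_reuse": True,
        "enable_partial_reuse": False,
    },
    "enable_chunked_prefill": True,
    "scheduler_config": {"context_chunking_policy": "EQUAL_PROGRESS"},
    "cache_transceiver_config": {"backend": "UCX", "max_tokens_in_buffer": 32768},
}

DECODE = {
    "enable_iter_perf_stats": True,
    "print_iter_log": False,
    "cuda_graph_config": {"max_batch_size": 32, "enable_padding": False},
    "moe_config": {"backend": "TRTLLM", "max_num_tokens": 32768},
    "speculative_config": {"decoding_type": "MTP", "num_nextn_predict_layers": 3},
    "disable_overlap_scheduler": False,
    "enable_autotuner": False,
    "kv_cache_config": {
        "free_gpu_memory_fraction": 0.5,
        "enable_block_reuse": True,
        "enable_partial_reuse": False,
    },
    "enable_chunked_prefill": True,
    "cache_transceiver_config": {"backend": "UCX", "max_tokens_in_buffer": 32768},
}


def flatten(cfg, prefix=""):
    """Turn nested dicts into a flat {"section.key": value} mapping."""
    flat = {}
    for key, value in cfg.items():
        path = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, path + "."))
        else:
            flat[path] = value
    return flat


def diff_configs(left, right):
    """Return {path: (left_value, right_value)} for keys that differ.

    A key missing on one side is reported as None on that side.
    """
    a, b = flatten(left), flatten(right)
    return {
        path: (a.get(path), b.get(path))
        for path in sorted(a.keys() | b.keys())
        if a.get(path) != b.get(path)
    }


for path, (prefill_value, decode_value) in diff_configs(PREFILL, DECODE).items():
    print(f"{path}: prefill={prefill_value} decode={decode_value}")
```

Under these assumptions the two servers differ in three settings only: `disable_overlap_scheduler`, `kv_cache_config.free_gpu_memory_fraction`, and the prefill-only `scheduler_config.context_chunking_policy`.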
Orchestrator:
cat >./orchestrator-config.yml<<EOF
hostname: localhost
port: 8000
backend: pytorch
context_servers:
  num_instances: ${PREFILL_COUNT}
  urls:
${PREFILL_URL_LINES}
generation_servers:
  num_instances: ${DECODE_COUNT}
  urls:
${DECODE_URL_LINES}
EOF
trtllm-serve disaggregated -c orchestrator-config.yml
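For anyone who wants to approximate the load pattern without the GPQA harness, here is a rough sketch of a client that sends requests strictly one at a time (batch size 1) to the orchestrator. It is not the benchmark used in this report. It assumes the orchestrator listens on `localhost:8000` as configured above and serves the OpenAI-compatible `/v1/completions` route. The prompts, the `max_tokens` value, and the helper names `build_payload`, `send`, and `run_sequential` are all placeholders chosen for illustration.

```python
import json
import urllib.request

# ASSUMED: orchestrator address from orchestrator-config.yml (localhost:8000)
# and an OpenAI-compatible completions route.
BASE_URL = "http://localhost:8000"


def build_payload(model, prompt, max_tokens=256):
    """Build one completion request body (max_tokens is an arbitrary choice)."""
    return {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": False,
    }


def send(payload, base_url=BASE_URL, timeout=600):
    """POST a single request and return the decoded JSON response."""
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def run_sequential(prompts, model, sender=send):
    """Send prompts one after another and return how many completed.

    Stops at the first transport error, which is what a crashed server
    looks like from the client side.
    """
    completed = 0
    for index, prompt in enumerate(prompts):
        try:
            sender(build_payload(model, prompt))
        except OSError as exc:  # urllib's URLError and timeouts derive from OSError
            print(f"request {index} failed: {exc}")
            break
        completed += 1
    return completed
```

A call such as `run_sequential([f"Placeholder question {i}" for i in range(100)], model_name)`, with `model_name` set to the same value as `MODEL_NAME`, would then report how many requests finished before the first failure. Whether placeholder prompts trigger the same crash as the GPQA prompts is untested.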
Expected behavior
The server should complete the benchmark without crashing.
Actual behavior
[TensorRT-LLM][ERROR] Exception in DataResponder response: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :0 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7faa1c51ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fa9f5e735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fa9f5e6c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fa9f5e4db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7faed6829ed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7faed6829ed3]
6 0x7fa9f5e6dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fac4af2bdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fac4af2bdb4]
8 0x7faed6824aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7faed6824aa4]
9 0x7faed68b1c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7faed68b1c3c]
[TensorRT-LLM][ERROR] Exception in DataResponder response: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :1 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f2474d1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f244e6735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f244e66c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f244e64db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f2926ccced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f2926ccced3]
6 0x7f244e66dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f26a3be0db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f26a3be0db4]
8 0x7f2926cc7aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f2926cc7aa4]
9 0x7f2926d54c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f2926d54c3c]
[TensorRT-LLM][ERROR] Exception in DataResponder response: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :3 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f990fa1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f98e93735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f98e936c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f98e934db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f9dc198ced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f9dc198ced3]
6 0x7f98e936dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f9b3e4a8db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f9b3e4a8db4]
8 0x7f9dc1987aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f9dc1987aa4]
9 0x7f9dc1a14c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f9dc1a14c3c]
[TensorRT-LLM][ERROR] Exception in DataResponder response: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :2 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7fe2eb91ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fe2c52735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fe2c526c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fe2c524db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7fe79e03aed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7fe79e03aed3]
6 0x7fe2c526dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fe51a31fdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fe51a31fdb4]
8 0x7fe79e035aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fe79e035aa4]
9 0x7fe79e0c2c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fe79e0c2c3c]
[10/08/2025-03:45:31] [TRT-LLM] [RANK 0] [I] statistics={request_id="265", timestamps={received=1759895131.8380768, scheduled=1759895131.8419557, prefill=1759895131.8925006, first_token=0.0, response=1759895131.892659}, counts={prompt_tokens=192, total_tokens=193, completion_tokens=1, cached_tokens=160, prefill={chunks=[{tokens={context=32, decode=0}, requests={context=1, decode=0}, latency=50.32086372375488}], iterations=1}, blocks={context=6, decode=0}, accepted_tokens=0, drafted_tokens=0, acceptance_histogram=[0, 0, 0, 0]}, system={page_size=32, maximum_capacity=32, maximum_blocks=22226, inflight_requests=1, inflight_blocks=6, utilization=0.03125}, preemption={count=0, latency=0.0, tokens=0}, total_time=54.58211898803711, prefill_latency=50.54497718811035, scheduling_latency=3.8788318634033203, ttft=-1759895131838.077, time_per_token=0.1583099365234375, decode_throughput=6316.722891566265, cache_hit_rate=0.8333333333333334, health=false, finish_reason="length"}
[10/08/2025-03:45:31] [TRT-LLM] [RANK 2] [E] Error in event loop: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :2 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7fe2eb91ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fe2c52735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fe2c526c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fe2c524db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7fe79e03aed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7fe79e03aed3]
6 0x7fe2c526dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fe51a31fdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fe51a31fdb4]
8 0x7fe79e035aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fe79e035aa4]
9 0x7fe79e0c2c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fe79e0c2c3c]
[10/08/2025-03:45:31] [TRT-LLM] [RANK 1] [E] Error in event loop: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :1 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f2474d1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f244e6735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f244e66c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f244e64db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f2926ccced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f2926ccced3]
6 0x7f244e66dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f26a3be0db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f26a3be0db4]
8 0x7f2926cc7aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f2926cc7aa4]
9 0x7f2926d54c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f2926d54c3c]
[10/08/2025-03:45:31] [TRT-LLM] [RANK 3] [E] Error in event loop: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :3 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f990fa1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f98e93735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f98e936c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f98e934db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f9dc198ced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f9dc198ced3]
6 0x7f98e936dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f9b3e4a8db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f9b3e4a8db4]
8 0x7f9dc1987aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f9dc1987aa4]
9 0x7f9dc1a14c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f9dc1a14c3c]
[10/08/2025-03:45:31] [TRT-LLM] [RANK 2] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :2 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7fe2eb91ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fe2c52735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fe2c526c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fe2c524db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7fe79e03aed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7fe79e03aed3]
6 0x7fe2c526dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fe51a31fdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fe51a31fdb4]
8 0x7fe79e035aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fe79e035aa4]
9 0x7fe79e0c2c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fe79e0c2c3c]
Exception in thread Thread-7 (_event_loop_wrapper):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
[10/08/2025-03:45:31] [TRT-LLM] [RANK 1] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :1 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f2474d1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f244e6735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f244e66c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f244e64db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f2926ccced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f2926ccced3]
6 0x7f244e66dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f26a3be0db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f26a3be0db4]
8 0x7f2926cc7aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f2926cc7aa4]
9 0x7f2926d54c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f2926d54c3c]
Exception in thread Thread-7 (_event_loop_wrapper):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
self.run()
File "/usr/lib/python3.12/threading.py", line 1010, in run
[10/08/2025-03:45:31] [TRT-LLM] [RANK 0] [E] Error in event loop: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :0 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7faa1c51ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fa9f5e735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fa9f5e6c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fa9f5e4db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7faed6829ed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7faed6829ed3]
6 0x7fa9f5e6dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fac4af2bdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fac4af2bdb4]
8 0x7faed6824aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7faed6824aa4]
9 0x7faed68b1c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7faed68b1c3c]
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 318, in _event_loop_wrapper
raise e
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
[10/08/2025-03:45:31] [TRT-LLM] [RANK 3] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :3 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f990fa1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f98e93735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f98e936c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f98e934db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f9dc198ced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f9dc198ced3]
6 0x7f98e936dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f9b3e4a8db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f9b3e4a8db4]
8 0x7f9dc1987aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f9dc1987aa4]
9 0x7f9dc1a14c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f9dc1a14c3c]
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
Exception in thread Thread-7 (_event_loop_wrapper):
self.run()
File "/usr/lib/python3.12/threading.py", line 1010, in run
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 318, in _event_loop_wrapper
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :2 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7fe2eb91ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fe2c52735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fe2c526c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fe2c524db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7fe79e03aed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7fe79e03aed3]
6 0x7fe2c526dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fe51a31fdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fe51a31fdb4]
8 0x7fe79e035aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7fe79e035aa4]
9 0x7fe79e0c2c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7fe79e0c2c3c]
raise e
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.run()
File "/usr/lib/python3.12/threading.py", line 1010, in run
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 318, in _event_loop_wrapper
[10/08/2025-03:45:31] [TRT-LLM] [RANK 0] [E] Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :0 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7faa1c51ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fa9f5e735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fa9f5e6c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fa9f5e4db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7faed6829ed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7faed6829ed3]
6 0x7fa9f5e6dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fac4af2bdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fac4af2bdb4]
8 0x7faed6824aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7faed6824aa4]
9 0x7faed68b1c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7faed68b1c3c]
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :1 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f2474d1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f244e6735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f244e66c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f244e64db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f2926ccced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f2926ccced3]
6 0x7f244e66dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f26a3be0db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f26a3be0db4]
8 0x7f2926cc7aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f2926cc7aa4]
9 0x7f2926d54c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f2926d54c3c]
Exception in thread Thread-7 (_event_loop_wrapper):
Traceback (most recent call last):
File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
raise e
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :3 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7f990fa1ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7f98e93735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7f98e936c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7f98e934db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7f9dc198ced3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7f9dc198ced3]
6 0x7f98e936dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7f9b3e4a8db4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7f9b3e4a8db4]
8 0x7f9dc1987aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7f9dc1987aa4]
9 0x7f9dc1a14c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7f9dc1a14c3c]
self.run()
File "/usr/lib/python3.12/threading.py", line 1010, in run
self._target(*self._args, **self._kwargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 318, in _event_loop_wrapper
raise e
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 311, in _event_loop_wrapper
self.event_loop()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1112, in _executor_loop
ctx_transmission_reqs = self._send_disagg_ctx_cache(scheduled_batch.context_requests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/nvtx/nvtx.py", line 122, in inner
result = func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor.py", line 1558, in _send_disagg_ctx_cache
self.kv_cache_transceiver.check_context_transfer_status(0)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/kv_cache_transceiver.py", line 120, in check_context_transfer_status
return self.impl.check_context_transfer_status(at_least_request_num)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: [TensorRT-LLM][ERROR] Assertion failed: This executor does not have a prepared KV cache for request ID: 41, and the mReadyResponses size is: 1. mpi rank :0 (/src/tensorrt_llm/cpp/tensorrt_llm/batch_manager/dataTransceiver.cpp:261)
1 0x7faa1c51ff66 tensorrt_llm::common::throwRuntimeError(char const*, int, char const*) + 97
2 0x7fa9f5e735b9 tensorrt_llm::batch_manager::DataResponder::Impl::response() + 3353
3 0x7fa9f5e6c81d std::_Function_handler<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> (), std::__future_base::_Task_setter<std::unique_ptr<std::__future_base::_Result<void>, std::__future_base::_Result_base::_Deleter>, std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void> >::_M_invoke(std::_Any_data const&) + 45
4 0x7fa9f5e4db5d std::__future_base::_State_baseV2::_M_do_set(std::function<std::unique_ptr<std::__future_base::_Result_base, std::__future_base::_Result_base::_Deleter> ()>*, bool*) + 45
5 0x7faed6829ed3 /usr/lib/x86_64-linux-gnu/libc.so.6(+0xa1ed3) [0x7faed6829ed3]
6 0x7fa9f5e6dba8 std::__future_base::_Async_state_impl<std::thread::_Invoker<std::tuple<void (tensorrt_llm::batch_manager::DataResponder::Impl::*)() noexcept, tensorrt_llm::batch_manager::DataResponder::Impl*> >, void>::_M_run() + 248
7 0x7fac4af2bdb4 /usr/lib/x86_64-linux-gnu/libstdc++.so.6(+0xecdb4) [0x7fac4af2bdb4]
8 0x7faed6824aa4 /usr/lib/x86_64-linux-gnu/libc.so.6(+0x9caa4) [0x7faed6824aa4]
9 0x7faed68b1c3c /usr/lib/x86_64-linux-gnu/libc.so.6(+0x129c3c) [0x7faed68b1c3c]
additional notes
I also tried running with the overlap scheduler disabled for decode, but still encountered the same issue.
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.