
[Bug]: Deadly Request: Model Crashes After Specific Chat Completion Call #7692

@Alireza3242

Description


System Info

gpu: A100

docker image:
nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc4

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

I served the model as follows:

trtllm-serve /app/data/gemma-3-4b-it/model --max_batch_size 128 --host 0.0.0.0 \
--kv_cache_free_gpu_memory_fraction 0.3

Then, I sent the following request:

import requests

messages = [{"role": "user", "content": "سلام"}]  # "سلام" is Persian for "Hello"

data = {
    "model": "model",
    "messages": messages,
    "temperature": 0.1,
    "top_k": 1,
    "max_tokens": 0,
    "stream": False
}

url = 'http://127.0.0.1:8000/v1/chat/completions'
response = requests.post(url, json=data, headers={"Content-Type": "application/json"})

The first time, I got an error saying that max_tokens must not be zero.
The second time I sent the exact same request:

response = requests.post(url, json=data, headers={"Content-Type": "application/json"})

this time it brought the server down completely.
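Until this is fixed server-side, the invalid value can be caught before it ever reaches the server. The sketch below is a hypothetical client-side guard (the helper name `validate_chat_request` is mine, not part of any library); it mirrors the payload above and refuses to send a request with a non-positive `max_tokens`:

```python
def validate_chat_request(data: dict) -> dict:
    """Hypothetical client-side guard: reject max_tokens <= 0 before POSTing."""
    max_tokens = data.get("max_tokens")
    if max_tokens is not None and max_tokens <= 0:
        raise ValueError(f"max_tokens must be a positive integer, got {max_tokens}")
    return data

data = {
    "model": "model",
    "messages": [{"role": "user", "content": "سلام"}],
    "temperature": 0.1,
    "top_k": 1,
    "max_tokens": 0,   # invalid: caught locally instead of reaching the server
    "stream": False,
}

try:
    validate_chat_request(data)
    # Only reached for valid payloads:
    # requests.post(url, json=data, headers={"Content-Type": "application/json"})
except ValueError as e:
    print(e)  # max_tokens must be a positive integer, got 0
```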

Here are the logs:

[2025-09-11 17:02:59] ERROR base_events.py:1821: Task exception was never retrieved
future: <Task finished name='Task-10' coro=<OpenAIServer.await_disconnected() done, defined at /usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_server.py:161> exception=AttributeError("'NoneType' object has no attribute 'send'")>
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_server.py", line 167, in await_disconnected
    promise.abort()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/result.py", line 532, in abort
    self._executor().abort_request(self.request_id)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 157, in abort_request
    self.request_queue.put(CancellingRequest(request_id))
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/ipc.py", line 123, in put
    self.socket.send(signed_data)
    ^^^^^^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'send'
[09/11/2025-17:03:07] [TRT-LLM] [E] Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_server.py", line 467, in openai_chat
    promise = self.llm.generate_async(
              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/contextlib.py", line 81, in inner
    return func(*args, **kwds)
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 351, in generate_async
    raise RuntimeError("LLM is shutting down")
RuntimeError: LLM is shutting down
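Reading the two tracebacks together, it looks like the first invalid request leaves the executor shutting down, and the disconnect watcher then tries to abort the request over an IPC socket that has already been torn down, so `self.socket` is `None` when `put` calls `send`. The minimal sketch below (class and method names are illustrative, not the actual `tensorrt_llm` code) reproduces that failure mode and shows the kind of guard the abort path appears to lack:

```python
class FakeIpcQueue:
    """Stand-in for the IPC queue in the traceback above.

    After close(), self.socket is None; an unguarded put() then raises
    AttributeError: 'NoneType' object has no attribute 'send' -- the same
    error shown in the log. Names here are illustrative only.
    """

    def __init__(self, socket=None):
        self.socket = socket

    def close(self):
        self.socket = None  # what teardown effectively does

    def put_unguarded(self, data):
        self.socket.send(data)  # crashes once the socket is gone

    def put_guarded(self, data):
        if self.socket is None:     # defensive check before sending
            return False            # drop the cancel request quietly
        self.socket.send(data)
        return True
```

With the guard in place, a late `CancellingRequest` would be dropped instead of killing the event-loop task.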

Expected behavior

The server should reject the invalid request (max_tokens = 0) with a validation error and keep serving subsequent requests.

Actual behavior

The second identical request crashes the server; every request after that fails with RuntimeError: LLM is shutting down.

Additional notes

None.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Assignees

No one assigned

Labels

Inference runtime<NV> (General operational aspects of TRTLLM execution not in other categories), bug (Something isn't working)
