@BaoLocPham commented Aug 14, 2025

Hi maintainers,

When a vLLM backend enables a reasoning parser, the chat completion response carries an additional field. For example, with this command:

vllm serve /network-volume/local_models/DeepSeek-R1-Distill-Qwen-32B \
    --gpu-memory-utilization 0.99 \
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 32000 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --enable-prefix-caching \
    --reasoning-parser deepseek_r1 \ <---- This reasoning parser
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v3 \
    --chat-template ./tool_chat_template_deepseekr1.jinja \
    --load-format runai_streamer \
    --model-loader-extra-config '{"concurrency":16}' 

Behaviour

Reference: https://docs.vllm.ai/en/v0.9.1/features/reasoning_outputs.html
Depending on the --reasoning-parser flag:

  • disabled: the model's thinking output and the final answer are placed together in the content field, as in the example below, where the thinking process is wrapped in model-specific special tokens such as <think>...</think>
{
  "id": "chatcmpl-8e0d5a86d87f4e3cbc78591b49a023aa",
  "object": "chat.completion",
  "created": 1755146543,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, so I'm trying to understand this passage from Henry V. It's a bit dense, but I'll take it step by step. First, the characters are Canterbury and Ely, talking about King Henry. They seem to be praising him, saying he's a sudden scholar and that reformation came quickly. ....\n</think>\n\nThe passage from Henry V revolves around the justification of King Henry V's claim to the French throne, using both legal and historical arguments. Here's a structured summary of the key points:\n\n1. **Character Discussion and Metaphors**:\n   - The Archbishop of Canterbury and Bishop Ely praise King Henry, noting his transformation from a wild youth to a wise ruler. They use the metaphor of a strawberry growing under a nettle",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      }
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1027,
    "total_tokens": 1539,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

  • enabled: the model's thinking output is parsed into a separate field named reasoning_content instead of staying in content, as in the example below (a token-counting sketch follows the JSON)
{
  "id": "chatcmpl-4b2f9adcfbdb42fc9b0467135225e0f0",
  "object": "chat.completion",
  "created": 1755143743,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        **"content"**: "\n\nThe passage from \"Coriolanus\" depicts a pivotal moment of conflict between the patricians and plebeians, highlighting themes of class struggle and governance. Coriolanus, a",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        **"reasoning_content"**: "\nOkay, so I'm trying to understand this passage from \"Coriolanus.\" It's a bit intense with all the shouting and accusations. Let me break it down.\n\nFirst, there's Coriolanus speaking about the people not deserving the corn they were given. .....\n",
        "logprobs": null,
        "finish_reason": "length",
        "stop_reason": null
      }
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1027,
    "total_tokens": 1539,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
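
In other words, the full model output is split across reasoning_content and content, so a benchmarking client has to reconstruct the generated text from both fields before counting tokens. A minimal sketch of that counting logic, assuming a HuggingFace tokenizer and a hypothetical output_token_count helper (not GenAI-Perf's actual code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/app/local_models/DeepSeek-R1-Distill-Qwen-32B"  # assumed local tokenizer path
)

def output_token_count(response_json: dict) -> int:
    """Count output tokens of a non-streaming chat completion,
    including reasoning_content when the reasoning parser is enabled."""
    message = response_json["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""
    # Concatenate both fields; with the parser disabled, reasoning_content
    # is null and this reduces to counting content alone.
    return len(tokenizer.encode(reasoning + content, add_special_tokens=False))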

Bug

As mentioned above, because the current code does not handle the reasoning_content field, the reported metrics can be wrong.

GenAI-Perf config:

-m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
--verbose
--warmup-request-count 2
--endpoint-type chat
--request-count 20
--random-seed 2102
--synthetic-input-tokens-mean 1024 <- Expected Input
--synthetic-input-tokens-stddev 0
--output-tokens-mean 512 <- Expected Output
--output-tokens-stddev 0
--tokenizer /app/local_models/DeepSeek-R1-Distill-Qwen-32B
--concurrency 2
--extra-inputs ignore_eos:true
  • Wrong number of output tokens when --reasoning-parser is enabled (genai-perf streaming OFF)
    Roughly 512 output tokens were expected, but far fewer are counted:
                                        NVIDIA GenAI-Perf | LLM Metrics                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                           Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│                Request Latency (ms) │ 13,123.11 │ 13,123.11 │ 13,123.11 │ 13,123.11 │ 13,123.11 │ 13,123.11 │
│     Output Sequence Length (tokens) │   -->38.00 │     38.00 │     38.00 │     38.00 │     38.00 │     38.00 │
│      Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│             Output Token Throughput │      2.90 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                        (tokens/sec) │           │           │           │           │           │           │
│        Request Throughput (per sec) │      0.76 │       N/A │       N/A │       N/A │       N/A │       N/A │
│               Request Count (count) │      1.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└─────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
  • Wrong time to first token (TTFT) when --reasoning-parser is enabled (genai-perf streaming ON)
    The TTFT is nearly equal to the Request Latency because reasoning_content was skipped and only content, which is generated after reasoning_content, was counted (a streaming sketch follows the table below).
                                        NVIDIA GenAI-Perf | LLM Metrics                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                            Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│             Time To First Token (ms) │ -->10,991.97 │ 10,991.97 │ 10,991.97 │ 10,991.97 │ 10,991.97 │ 10,991.97 │
│            Time To Second Token (ms) │      0.00 │      0.00 │      0.00 │      0.00 │      0.00 │      0.00 │
│                 Request Latency (ms) │ 13,048.48 │ 13,048.48 │ 13,048.48 │ 13,048.48 │ 13,048.48 │ 13,048.48 │
│             Inter Token Latency (ms) │     18.04 │     18.04 │     18.04 │     18.04 │     18.04 │     18.04 │
│     Output Token Throughput Per User │     55.43 │     55.43 │     55.43 │     55.43 │     55.43 │     55.43 │
│                    (tokens/sec/user) │           │           │           │           │           │           │
│      Output Sequence Length (tokens) │    115.00 │    115.00 │    115.00 │    115.00 │    115.00 │    115.00 │
│       Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│ Output Token Throughput (tokens/sec) │      8.81 │       N/A │       N/A │       N/A │       N/A │       N/A │
│         Request Throughput (per sec) │      0.77 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                Request Count (count) │      1.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
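
In streaming mode the same idea applies per SSE chunk: a delta that carries either reasoning_content or content marks the first output token, and both fields contribute to the output sequence length. A rough sketch, assuming the chunks iterable yields parsed chat.completion.chunk payloads as they arrive (hypothetical helper, not the actual GenAI-Perf implementation):

import time

def measure_streaming(chunks):
    """Return (TTFT in seconds, full generated text) from an iterable of
    parsed chat.completion.chunk payloads consumed as they arrive."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        # reasoning_content arrives first when --reasoning-parser is enabled;
        # both fields count as model output.
        text = (delta.get("reasoning_content") or "") + (delta.get("content") or "")
        if text:
            if ttft is None:
                ttft = time.perf_counter() - start
            pieces.append(text)
    return ttft, "".join(pieces)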

After the update

  • Correct number of output tokens when --reasoning-parser is enabled (genai-perf streaming OFF)
    Roughly the requested 512 tokens are now reported:
                                        NVIDIA GenAI-Perf | LLM Metrics                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                           Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│                Request Latency (ms) │ 13,144.64 │ 13,120.82 │ 13,334.80 │ 13,316.09 │ 13,147.70 │ 13,125.32 │
│     Output Sequence Length (tokens) │ ---> 511.00 │    511.00 │    511.00 │    511.00 │    511.00 │    511.00 │
│      Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│             Output Token Throughput │     38.87 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                        (tokens/sec) │           │           │           │           │           │           │
│        Request Throughput (per sec) │      0.08 │       N/A │       N/A │       N/A │       N/A │       N/A │
│               Request Count (count) │     10.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└─────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
  • Correct time to first token (TTFT) when --reasoning-parser is enabled (genai-perf streaming ON)
    The TTFT now takes reasoning_content into account, with content generated afterwards:
                                        NVIDIA GenAI-Perf | LLM Metrics                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                            Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│             Time To First Token (ms) │--->  1,764.61 │  1,732.66 │  1,885.98 │  1,874.74 │  1,773.52 │  1,757.67 │
│            Time To Second Token (ms) │      0.00 │      0.00 │      0.01 │      0.01 │      0.01 │      0.00 │
│                 Request Latency (ms) │ 13,060.68 │ 13,042.42 │ 13,173.60 │ 13,163.06 │ 13,068.19 │ 13,050.66 │
│             Inter Token Latency (ms) │     22.17 │     22.13 │     22.23 │     22.23 │     22.19 │     22.18 │
│     Output Token Throughput Per User │     45.11 │     44.97 │     45.18 │     45.18 │     45.17 │     45.17 │
│                    (tokens/sec/user) │           │           │           │           │           │           │
│      Output Sequence Length (tokens) │    510.60 │    509.00 │    511.00 │    511.00 │    511.00 │    511.00 │
│       Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│ Output Token Throughput (tokens/sec) │     39.09 │       N/A │       N/A │       N/A │       N/A │       N/A │
│         Request Throughput (per sec) │      0.08 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                Request Count (count) │     10.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
