@BaoLocPham commented Aug 14, 2025

Hi maintainers,

When a vLLM backend enables a reasoning parser, the chat completion response carries an additional field. For example, with this command:

vllm serve /network-volume/local_models/DeepSeek-R1-Distill-Qwen-32B \
    --gpu-memory-utilization 0.99 \
    --served-model-name deepseek-ai/DeepSeek-R1-Distill-Qwen-32B \
    --tensor-parallel-size 1 \
    --dtype bfloat16 \
    --max-model-len 32000 \
    --max-num-seqs 256 \
    --trust-remote-code \
    --enable-prefix-caching \
    --reasoning-parser deepseek_r1 \ <---- This reasoning parser
    --enable-auto-tool-choice \
    --tool-call-parser deepseek_v3 \
    --chat-template ./tool_chat_template_deepseekr1.jinja \
    --load-format runai_streamer \
    --model-loader-extra-config '{"concurrency":16}' 

Behaviour

Reference: https://docs.vllm.ai/en/v0.9.1/features/reasoning_outputs.html
Depending on the --reasoning-parser flag:

  • disabled: the model's thinking output and the final answer are placed together in the content field, as in the example below, where the thinking process is wrapped in model-specific special tokens such as <think>...</think>
{
  "id": "chatcmpl-8e0d5a86d87f4e3cbc78591b49a023aa",
  "object": "chat.completion",
  "created": 1755146543,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "<think>\nOkay, so I'm trying to understand this passage from Henry V. It's a bit dense, but I'll take it step by step. First, the characters are Canterbury and Ely, talking about King Henry. They seem to be praising him, saying he's a sudden scholar and that reformation came quickly. ....\n</think>\n\nThe passage from Henry V revolves around the justification of King Henry V's claim to the French throne, using both legal and historical arguments. Here's a structured summary of the key points:\n\n1. **Character Discussion and Metaphors**:\n   - The Archbishop of Canterbury and Bishop Ely praise King Henry, noting his transformation from a wild youth to a wise ruler. They use the metaphor of a strawberry growing under a nettle",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        "reasoning_content": null
      }
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1027,
    "total_tokens": 1539,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}

  • enabled: the model's thinking output is parsed into a separate field named reasoning_content instead of staying in content, as in the example below (a token-counting sketch follows the JSON)
{
  "id": "chatcmpl-4b2f9adcfbdb42fc9b0467135225e0f0",
  "object": "chat.completion",
  "created": 1755143743,
  "model": "deepseek-ai/DeepSeek-R1-Distill-Qwen-32B",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        **"content"**: "\n\nThe passage from \"Coriolanus\" depicts a pivotal moment of conflict between the patricians and plebeians, highlighting themes of class struggle and governance. Coriolanus, a",
        "refusal": null,
        "annotations": null,
        "audio": null,
        "function_call": null,
        "tool_calls": [],
        **"reasoning_content"**: "\nOkay, so I'm trying to understand this passage from \"Coriolanus.\" It's a bit intense with all the shouting and accusations. Let me break it down.\n\nFirst, there's Coriolanus speaking about the people not deserving the corn they were given. .....\n",
        "logprobs": null,
        "finish_reason": "length",
        "stop_reason": null
      }
    }
  ],
  "service_tier": null,
  "system_fingerprint": null,
  "usage": {
    "prompt_tokens": 1027,
    "total_tokens": 1539,
    "completion_tokens": 512,
    "prompt_tokens_details": null
  },
  "prompt_logprobs": null,
  "kv_transfer_params": null
}
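
In other words, the full model output is split across reasoning_content and content, so a benchmarking client has to reconstruct the generated text from both fields before counting tokens. A minimal sketch of that counting logic, assuming a HuggingFace tokenizer and a hypothetical output_token_count helper (not GenAI-Perf's actual code):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(
    "/app/local_models/DeepSeek-R1-Distill-Qwen-32B"  # assumed local tokenizer path
)

def output_token_count(response_json: dict) -> int:
    """Count output tokens of a non-streaming chat completion,
    including reasoning_content when the reasoning parser is enabled."""
    message = response_json["choices"][0]["message"]
    reasoning = message.get("reasoning_content") or ""
    content = message.get("content") or ""
    # Concatenate both fields; with the parser disabled, reasoning_content
    # is null and this reduces to counting content alone.
    return len(tokenizer.encode(reasoning + content, add_special_tokens=False))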

Bug

As mentioned above, because the current code does not handle the reasoning_content field, the reported metrics can be wrong.

GenAI-Perf config:

-m deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
--verbose
--warmup-request-count 2
--endpoint-type chat
--request-count 20
--random-seed 2102
--synthetic-input-tokens-mean 1024 <- Expected Input
--synthetic-input-tokens-stddev 0
--output-tokens-mean 512 <- Expected Output
--output-tokens-stddev 0
--tokenizer /app/local_models/DeepSeek-R1-Distill-Qwen-32B
--concurrency 2
--extra-inputs ignore_eos:true
  • Wrong number of output tokens when --reasoning-parser is enabled (genai-perf streaming OFF)
    Roughly 512 output tokens were expected, but far fewer are counted:
                                        NVIDIA GenAI-Perf | LLM Metrics                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                           Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│                Request Latency (ms) │ 13,123.11 │ 13,123.11 │ 13,123.11 │ 13,123.11 │ 13,123.11 │ 13,123.11 │
│     Output Sequence Length (tokens) │   -->38.00 │     38.00 │     38.00 │     38.00 │     38.00 │     38.00 │
│      Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│             Output Token Throughput │      2.90 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                        (tokens/sec) │           │           │           │           │           │           │
│        Request Throughput (per sec) │      0.76 │       N/A │       N/A │       N/A │       N/A │       N/A │
│               Request Count (count) │      1.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└─────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
  • Wrong time to first token (TTFT) when --reasoning-parser is enabled (genai-perf streaming ON)
    The TTFT is nearly equal to the Request Latency because reasoning_content was skipped and only content, which is generated after reasoning_content, was counted (a streaming sketch follows the table below).
                                        NVIDIA GenAI-Perf | LLM Metrics                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                            Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│             Time To First Token (ms) │ -->10,991.97 │ 10,991.97 │ 10,991.97 │ 10,991.97 │ 10,991.97 │ 10,991.97 │
│            Time To Second Token (ms) │      0.00 │      0.00 │      0.00 │      0.00 │      0.00 │      0.00 │
│                 Request Latency (ms) │ 13,048.48 │ 13,048.48 │ 13,048.48 │ 13,048.48 │ 13,048.48 │ 13,048.48 │
│             Inter Token Latency (ms) │     18.04 │     18.04 │     18.04 │     18.04 │     18.04 │     18.04 │
│     Output Token Throughput Per User │     55.43 │     55.43 │     55.43 │     55.43 │     55.43 │     55.43 │
│                    (tokens/sec/user) │           │           │           │           │           │           │
│      Output Sequence Length (tokens) │    115.00 │    115.00 │    115.00 │    115.00 │    115.00 │    115.00 │
│       Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│ Output Token Throughput (tokens/sec) │      8.81 │       N/A │       N/A │       N/A │       N/A │       N/A │
│         Request Throughput (per sec) │      0.77 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                Request Count (count) │      1.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
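
In streaming mode the same idea applies per SSE chunk: a delta that carries either reasoning_content or content marks the first output token, and both fields contribute to the output sequence length. A rough sketch, assuming the chunks iterable yields parsed chat.completion.chunk payloads as they arrive (hypothetical helper, not the actual GenAI-Perf implementation):

import time

def measure_streaming(chunks):
    """Return (TTFT in seconds, full generated text) from an iterable of
    parsed chat.completion.chunk payloads consumed as they arrive."""
    start = time.perf_counter()
    ttft = None
    pieces = []
    for chunk in chunks:
        delta = chunk["choices"][0].get("delta", {})
        # reasoning_content arrives first when --reasoning-parser is enabled;
        # both fields count as model output.
        text = (delta.get("reasoning_content") or "") + (delta.get("content") or "")
        if text:
            if ttft is None:
                ttft = time.perf_counter() - start
            pieces.append(text)
    return ttft, "".join(pieces)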

After the update

  • Correct number of output tokens when --reasoning-parser is enabled (genai-perf streaming OFF)
    Roughly the requested 512 tokens are now reported:
                                        NVIDIA GenAI-Perf | LLM Metrics                                        
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                           Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│                Request Latency (ms) │ 13,144.64 │ 13,120.82 │ 13,334.80 │ 13,316.09 │ 13,147.70 │ 13,125.32 │
│     Output Sequence Length (tokens) │ ---> 511.00 │    511.00 │    511.00 │    511.00 │    511.00 │    511.00 │
│      Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│             Output Token Throughput │     38.87 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                        (tokens/sec) │           │           │           │           │           │           │
│        Request Throughput (per sec) │      0.08 │       N/A │       N/A │       N/A │       N/A │       N/A │
│               Request Count (count) │     10.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└─────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
  • Correct time to first token (TTFT) when --reasoning-parser is enabled (genai-perf streaming ON)
    The TTFT now takes reasoning_content into account, with content generated afterwards:
                                        NVIDIA GenAI-Perf | LLM Metrics                                         
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━┓
┃                            Statistic ┃       avg ┃       min ┃       max ┃       p99 ┃       p90 ┃       p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━┩
│             Time To First Token (ms) │--->  1,764.61 │  1,732.66 │  1,885.98 │  1,874.74 │  1,773.52 │  1,757.67 │
│            Time To Second Token (ms) │      0.00 │      0.00 │      0.01 │      0.01 │      0.01 │      0.00 │
│                 Request Latency (ms) │ 13,060.68 │ 13,042.42 │ 13,173.60 │ 13,163.06 │ 13,068.19 │ 13,050.66 │
│             Inter Token Latency (ms) │     22.17 │     22.13 │     22.23 │     22.23 │     22.19 │     22.18 │
│     Output Token Throughput Per User │     45.11 │     44.97 │     45.18 │     45.18 │     45.17 │     45.17 │
│                    (tokens/sec/user) │           │           │           │           │           │           │
│      Output Sequence Length (tokens) │    510.60 │    509.00 │    511.00 │    511.00 │    511.00 │    511.00 │
│       Input Sequence Length (tokens) │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │  1,024.00 │
│ Output Token Throughput (tokens/sec) │     39.09 │       N/A │       N/A │       N/A │       N/A │       N/A │
│         Request Throughput (per sec) │      0.08 │       N/A │       N/A │       N/A │       N/A │       N/A │
│                Request Count (count) │     10.00 │       N/A │       N/A │       N/A │       N/A │       N/A │
└──────────────────────────────────────┴───────────┴───────────┴───────────┴───────────┴───────────┴───────────┘
