[UPDATE] vLLM chat completion reasoning content parser #433
Hi maintainers,

There is an additional field in the chat completion response when the vLLM backend enables a reasoning parser, such as with this command:

Behaviour

Reference: https://docs.vllm.ai/en/v0.9.1/features/reasoning_outputs.html

If the `--reasoning-parser` flag is disabled, then the model's thinking output and its content output are put together inside the `content` field, as in this example, where the thinking process is wrapped in special tokens such as `<think>...</think>`, depending on the model.

If the flag is enabled, the thinking output is returned in a separate `reasoning_content` field instead of inside `content`, as in this example.

Bug
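The bug below involves the two response shapes described above. A minimal sketch, using hypothetical payloads (the message texts and `<think>` markers are illustrative, not actual vLLM output):

```python
import re

# Reasoning parser disabled: thinking stays inside `content`, wrapped in
# model-specific special tokens such as <think>...</think>.
msg_disabled = {"content": "<think>Let me reason...</think>The answer is 4."}

# Reasoning parser enabled: thinking is split out into `reasoning_content`.
msg_enabled = {
    "reasoning_content": "Let me reason...",
    "content": "The answer is 4.",
}

def full_text(message: dict) -> str:
    """Concatenate reasoning and answer text regardless of response shape."""
    return (message.get("reasoning_content") or "") + (message.get("content") or "")

# Stripping the <think> markers makes both shapes yield the same generated text.
assert re.sub(r"</?think>", "", full_text(msg_disabled)) == full_text(msg_enabled)
```

A parser that only reads `content` silently drops everything the model generated before the final answer in the second shape.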
As I mentioned, because the current code doesn't handle the logits for the `reasoning_content`, the results could be wrong.

GenAI-Perf config:

When `--reasoning-parser` is enabled (genai-perf streaming OFF): the output was expected to be nearly the required `512` tokens.

When `--reasoning-parser` is enabled (genai-perf streaming ON): the TTFT is nearly the same as the Request Latency, because the parser skipped the `reasoning_content` and only picked up the `content`, which is generated after the `reasoning_content`.
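The streaming TTFT issue can be sketched as follows. The deltas and timestamps here are hypothetical (field names follow vLLM's OpenAI-compatible chat completion chunks); with a reasoning model, the early chunks carry only `reasoning_content`:

```python
# Hypothetical streamed deltas: timestamps in seconds are illustrative.
deltas = [
    {"t": 0.05, "reasoning_content": "Let me", "content": None},
    {"t": 0.10, "reasoning_content": " think...", "content": None},
    {"t": 2.00, "reasoning_content": None, "content": "The answer"},
    {"t": 2.05, "reasoning_content": None, "content": " is 4."},
]

def ttft(chunks, include_reasoning):
    """Return the timestamp of the first chunk carrying any generated text."""
    for c in chunks:
        if c["content"] or (include_reasoning and c["reasoning_content"]):
            return c["t"]
    raise ValueError("no text chunks received")

print(ttft(deltas, include_reasoning=False))  # 2.0  (inflated, near request latency)
print(ttft(deltas, include_reasoning=True))   # 0.05 (first reasoning token counts)
```

Ignoring `reasoning_content` pushes the measured first token out to whenever the final answer starts, which is why TTFT converges toward the full request latency.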
After the update:

When `--reasoning-parser` is enabled (genai-perf streaming OFF): the output is expected to be nearly the required `512` tokens -> now it is nearly correct.

When `--reasoning-parser` is enabled (genai-perf streaming ON): the TTFT now takes the `reasoning_content` into account, followed by the `content`, which is generated after the `reasoning_content`.
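For the non-streaming token count, the fix amounts to counting both fields. A sketch with a hypothetical payload (whitespace splitting stands in for the real tokenizer):

```python
message = {
    "reasoning_content": "step one step two",  # 4 whitespace "tokens"
    "content": "final answer",                 # 2 whitespace "tokens"
}

def output_token_count(msg):
    """Count tokens across both fields; either may be absent or None."""
    reasoning = (msg.get("reasoning_content") or "").split()
    content = (msg.get("content") or "").split()
    return len(reasoning) + len(content)

print(output_token_count(message))  # 6; counting only `content` would report 2
```

With only `content` counted, a run that requested 512 output tokens would report far fewer, since most of the generation budget is spent inside `reasoning_content`.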