
Conversation

@AshishKingdom
Contributor

Description

The streaming response did not include usage info for AzureOpenAI when using {"stream_options": {"include_usage": True}}. OpenAI works fine, but AzureOpenAI had this issue. AzureOpenAI depends on the OpenAI base class, which previously had the following logic:

                    if isinstance(client, AzureOpenAI):
                        continue
                    else:
                        delta = ChoiceDelta()

I'm not sure why we skipped the chunk for AzureOpenAI. We can handle an empty-choices message the same way it is handled for OpenAI. Now we simply set delta = ChoiceDelta() when the choices list is empty, as sketched below.
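
In spirit, the empty-choices chunk is now handled the same way as for OpenAI. The sketch below is only illustrative (a hypothetical helper; the real change lives inside the streaming generators, and the names here are assumptions, not the exact diff):

from openai.types.chat import ChatCompletionChunk
from openai.types.chat.chat_completion_chunk import ChoiceDelta


def delta_and_usage(chunk: ChatCompletionChunk):
    """Illustrative sketch: mirror the OpenAI empty-choices handling for AzureOpenAI."""
    if len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
    else:
        # Previously AzureOpenAI chunks hit `continue` here, so the usage-only
        # chunk was dropped; now it gets an empty ChoiceDelta like OpenAI does.
        delta = ChoiceDelta()

    # The usage-only chunk has an empty choices list but carries token counts,
    # which end up in the yielded response's additional_kwargs.
    usage_kwargs = {}
    if chunk.usage is not None:
        usage_kwargs = {
            "prompt_tokens": chunk.usage.prompt_tokens,
            "completion_tokens": chunk.usage.completion_tokens,
            "total_tokens": chunk.usage.total_tokens,
        }
    return delta, usage_kwargs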

Before:

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The capital of France is Paris.')]), raw=ChatCompletionChunk(id='chatcmpl-C30i09ujEngR2IZvNCdRAypYFpcjO', choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1754833744, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=None), delta='', logprobs=None, additional_kwargs={})

CompletionResponse(text='The capital of France is Paris.', additional_kwargs={}, raw=ChatCompletionChunk(id='chatcmpl-C30i1DtvO7ZJZdR1mYfRcDp9QucUA', choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1754833745, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=None), logprobs=None, delta='')

After:

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The capital of France is Paris.')]), raw=ChatCompletionChunk(id='chatcmpl-C30j35gFbhGq9feqY5Q2E1mlxUKlp', choices=[], created=1754833809, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=CompletionUsage(completion_tokens=8, prompt_tokens=24, total_tokens=32, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), delta='', logprobs=None, additional_kwargs={'prompt_tokens': 24, 'completion_tokens': 8, 'total_tokens': 32})

CompletionResponse(text='The capital of France is Paris.', additional_kwargs={'prompt_tokens': 14, 'completion_tokens': 8, 'total_tokens': 22}, raw=ChatCompletionChunk(id='chatcmpl-C30j3lVpCBjsxxCHD19urzSUYmJvH', choices=[], created=1754833809, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=CompletionUsage(completion_tokens=8, prompt_tokens=14, total_tokens=22, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), logprobs=None, delta='')
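
With this change, the final streamed chunk (the one whose choices list is empty) carries the token counts, and they surface on the last yielded response's additional_kwargs. A minimal illustration, assuming an llm configured as in the test snippet further below:

# Illustrative only: read the token counts that the final streamed response now carries.
final = None
for chunk in llm.stream_complete("What is the capital of France?"):
    final = chunk

# e.g. {'prompt_tokens': 14, 'completion_tokens': 8, 'total_tokens': 22}
print(final.additional_kwargs)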

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I tested using the following snippet:

from llama_index.llms.openai import OpenAI
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.llms import ChatMessage, MessageRole
import asyncio
import os

# llm = OpenAI(
#     model="gpt-4.1-mini",
#     additional_kwargs={"stream_options": {"include_usage": True}},
# )
llm = AzureOpenAI(
    deployment_name=os.getenv("AZURE_DEPLOYEMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    model=os.getenv("AZURE_MODEL_NAME"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_VERSION"),
    additional_kwargs={"stream_options": {"include_usage": True}},
)

def test_streaming_usage_chat():
    chat_history = [
        ChatMessage(role=MessageRole.SYSTEM, content="You are a helpful assistant."),
        ChatMessage(role=MessageRole.USER, content="What is the capital of France?"),
    ]
    response_gen = llm.stream_chat(chat_history)
    current_response = None
    for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())

def test_streaming_usage_completion():
    response_gen = llm.stream_complete("What is the capital of France?")
    current_response = None
    for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())

async def test_streaming_usage_chat_async():
    chat_history = [
        ChatMessage(role=MessageRole.SYSTEM, content="You are a helpful assistant."),
        ChatMessage(role=MessageRole.USER, content="What is the capital of France?"),
    ]
    response_gen = await llm.astream_chat(chat_history)
    current_response = None
    async for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())


async def test_streaming_usage_completion_async():
    response_gen = await llm.astream_complete("What is the capital of France?")
    current_response = None
    async for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())

if __name__ == "__main__":
    test_streaming_usage_chat()
    test_streaming_usage_completion()
    asyncio.run(test_streaming_usage_chat_async())
    asyncio.run(test_streaming_usage_completion_async())

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

@dosubot added the size:S label (This PR changes 10-29 lines, ignoring generated files) on Aug 10, 2025
@logan-markewich
Collaborator

I was curious why this check was there, so I went digging
#9890

Basically, the old agent classes would detect whether a response was a final answer or a tool call by checking whether text came back first or a tool call did. Yielding an empty delta was breaking that check.

Since those old agent classes are deprecated and no longer supported, this should be ok to merge? Although I worry a little bit about making a change like this. I feel like we should not be yielding empty deltas in the first place. I might add a check before the yield?

@AshishKingdom
Contributor Author

Hey @logan-markewich, agreed with your concern. I added a check for this: if choices are empty and usage is also None, we simply skip yielding that chunk. The usage chunk is the last chunk in the streaming response, but let's be extra careful.
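
Roughly, the added guard looks like this (a sketch; the exact placement inside the streaming generators may differ):

# Sketch of the guard: a chunk with no choices and no usage carries nothing
# useful, so skip it instead of yielding an empty delta.
if len(response.choices) == 0 and response.usage is None:
    continue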

@dosubot added the lgtm label (This PR has been approved by a maintainer) on Aug 12, 2025
@AshishKingdom
Contributor Author

Will add a test.

@dosubot added the size:M label (This PR changes 30-99 lines, ignoring generated files) and removed the size:S label on Aug 12, 2025
@AshishKingdom
Contributor Author

One thing to note: I have not written a test for the async equivalent; I could not find any existing tests for async functions (I even checked some other integrations). Both paths have exactly the same changes, so this should be ok?

Collaborator

@logan-markewich left a comment


Ok, I changed my mind; I think avoiding yielding is more breaking.

Going to go back to the first iteration.

@logan-markewich merged commit 9547dcc into run-llama:main on Aug 12, 2025
10 of 11 checks passed
