
Conversation

@AshishKingdom
Contributor

Description

The streaming response did not include usage info for AzureOpenAI when using {"stream_options": {"include_usage": True}}. OpenAI works fine, but AzureOpenAI had this issue. AzureOpenAI depends on the OpenAI base class, which previously had the following logic:

                    if isinstance(client, AzureOpenAI):
                        continue
                    else:
                        delta = ChoiceDelta()

I'm not sure why we skipped the chunk for AzureOpenAI. We can handle an empty-choices message the same way it is handled for OpenAI. Now we simply set delta = ChoiceDelta() when the choices list is empty, as sketched below.
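
In spirit, the empty-choices chunk is now handled the same way as for OpenAI. The sketch below is only illustrative (a hypothetical helper; the real change lives inside the streaming generators, and the names here are assumptions, not the exact diff):

from openai.types.chat import ChatCompletionChunk
from openai.types.chat.chat_completion_chunk import ChoiceDelta


def delta_and_usage(chunk: ChatCompletionChunk):
    """Illustrative sketch: mirror the OpenAI empty-choices handling for AzureOpenAI."""
    if len(chunk.choices) > 0:
        delta = chunk.choices[0].delta
    else:
        # Previously AzureOpenAI chunks hit `continue` here, so the usage-only
        # chunk was dropped; now it gets an empty ChoiceDelta like OpenAI does.
        delta = ChoiceDelta()

    # The usage-only chunk has an empty choices list but carries token counts,
    # which end up in the yielded response's additional_kwargs.
    usage_kwargs = {}
    if chunk.usage is not None:
        usage_kwargs = {
            "prompt_tokens": chunk.usage.prompt_tokens,
            "completion_tokens": chunk.usage.completion_tokens,
            "total_tokens": chunk.usage.total_tokens,
        }
    return delta, usage_kwargs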

Before:

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The capital of France is Paris.')]), raw=ChatCompletionChunk(id='chatcmpl-C30i09ujEngR2IZvNCdRAypYFpcjO', choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1754833744, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=None), delta='', logprobs=None, additional_kwargs={})

CompletionResponse(text='The capital of France is Paris.', additional_kwargs={}, raw=ChatCompletionChunk(id='chatcmpl-C30i1DtvO7ZJZdR1mYfRcDp9QucUA', choices=[Choice(delta=ChoiceDelta(content=None, function_call=None, refusal=None, role=None, tool_calls=None), finish_reason='stop', index=0, logprobs=None)], created=1754833745, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=None), logprobs=None, delta='')

After:

ChatResponse(message=ChatMessage(role=<MessageRole.ASSISTANT: 'assistant'>, additional_kwargs={}, blocks=[TextBlock(block_type='text', text='The capital of France is Paris.')]), raw=ChatCompletionChunk(id='chatcmpl-C30j35gFbhGq9feqY5Q2E1mlxUKlp', choices=[], created=1754833809, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=CompletionUsage(completion_tokens=8, prompt_tokens=24, total_tokens=32, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), delta='', logprobs=None, additional_kwargs={'prompt_tokens': 24, 'completion_tokens': 8, 'total_tokens': 32})

CompletionResponse(text='The capital of France is Paris.', additional_kwargs={'prompt_tokens': 14, 'completion_tokens': 8, 'total_tokens': 22}, raw=ChatCompletionChunk(id='chatcmpl-C30j3lVpCBjsxxCHD19urzSUYmJvH', choices=[], created=1754833809, model='gpt-4o-mini-2024-07-18', object='chat.completion.chunk', service_tier=None, system_fingerprint='fp_efad92c60b', usage=CompletionUsage(completion_tokens=8, prompt_tokens=14, total_tokens=22, completion_tokens_details=CompletionTokensDetails(accepted_prediction_tokens=0, audio_tokens=0, reasoning_tokens=0, rejected_prediction_tokens=0), prompt_tokens_details=PromptTokensDetails(audio_tokens=0, cached_tokens=0))), logprobs=None, delta='')
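
With this change, the final streamed chunk (the one whose choices list is empty) carries the token counts, and they surface on the last yielded response's additional_kwargs. A minimal illustration, assuming an llm configured as in the test snippet further below:

# Illustrative only: read the token counts that the final streamed response now carries.
final = None
for chunk in llm.stream_complete("What is the capital of France?"):
    final = chunk

# e.g. {'prompt_tokens': 14, 'completion_tokens': 8, 'total_tokens': 22}
print(final.additional_kwargs)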

Version Bump?

Did I bump the version in the pyproject.toml file of the package I am updating? (Except for the llama-index-core package)

  • Yes
  • No

Type of Change

Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)

How Has This Been Tested?

I tested using the following snippet:

from llama_index.llms.openai import OpenAI
from llama_index.llms.azure_openai import AzureOpenAI
from llama_index.core.llms import ChatMessage, MessageRole
import asyncio
import os

# llm = OpenAI(
#     model="gpt-4.1-mini",
#     additional_kwargs={"stream_options": {"include_usage": True}},
# )
llm = AzureOpenAI(
    deployment_name=os.getenv("AZURE_DEPLOYEMENT_NAME"),
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    model=os.getenv("AZURE_MODEL_NAME"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version=os.getenv("AZURE_OPENAI_VERSION"),
    additional_kwargs={"stream_options": {"include_usage": True}},
)

def test_streaming_usage_chat():
    chat_history = [
        ChatMessage(role=MessageRole.SYSTEM, content="You are a helpful assistant."),
        ChatMessage(role=MessageRole.USER, content="What is the capital of France?"),
    ]
    response_gen = llm.stream_chat(chat_history)
    current_response = None
    for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())

def test_streaming_usage_completion():
    response_gen = llm.stream_complete("What is the capital of France?")
    current_response = None
    for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())

async def test_streaming_usage_chat_async():
    chat_history = [
        ChatMessage(role=MessageRole.SYSTEM, content="You are a helpful assistant."),
        ChatMessage(role=MessageRole.USER, content="What is the capital of France?"),
    ]
    response_gen = await llm.astream_chat(chat_history)
    current_response = None
    async for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())


async def test_streaming_usage_completion_async():
    response_gen = await llm.astream_complete("What is the capital of France?")
    current_response = None
    async for response in response_gen:
        current_response = response

    print("Final response:", current_response.__repr__())

if __name__ == "__main__":
    test_streaming_usage_chat()
    test_streaming_usage_completion()
    asyncio.run(test_streaming_usage_chat_async())
    asyncio.run(test_streaming_usage_completion_async())

Suggested Checklist:

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • My changes generate no new warnings
  • New and existing unit tests pass locally with my changes
  • I ran uv run make format; uv run make lint to appease the lint gods

@dosubot added the size:S label (This PR changes 10-29 lines, ignoring generated files) on Aug 10, 2025
@logan-markewich
Collaborator

I was curious why this check was there, so I went digging
#9890

Basically, the old agent classes would detect whether a response was a final answer or a tool call by checking whether text came back first or a tool call did. Yielding an empty delta was breaking that check.

Since those old agent classes are deprecated and no longer supported, this should be ok to merge? Although I worry a little bit about making a change like this. I feel like we should not be yielding empty deltas in the first place. I might add a check before the yield?

@AshishKingdom
Contributor Author

Hey @logan-markewich, agreed with your concern. I added a check for this: if choices are empty and usage is also None, we simply skip yielding that chunk. The usage chunk is the last chunk in the streaming response, but let's be extra careful.
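
Roughly, the added guard looks like this (a sketch; the exact placement inside the streaming generators may differ):

# Sketch of the guard: a chunk with no choices and no usage carries nothing
# useful, so skip it instead of yielding an empty delta.
if len(response.choices) == 0 and response.usage is None:
    continue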

@dosubot added the lgtm label (This PR has been approved by a maintainer) on Aug 12, 2025
@AshishKingdom
Contributor Author

Will add a test.

@dosubot added the size:M label (This PR changes 30-99 lines, ignoring generated files) and removed the size:S label on Aug 12, 2025
@AshishKingdom
Contributor Author

One thing to note: I have not written a test for the async equivalent; I could not find any existing tests for async functions (I even checked some other integrations). Both paths have exactly the same changes, so this should be ok?

Collaborator

@logan-markewich left a comment


Ok, I changed my mind; I think avoiding yielding is more breaking.

Going to go back to the first iteration.

@logan-markewich merged commit 9547dcc into run-llama:main on Aug 12, 2025
10 of 11 checks passed
