diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb index f2ae9bd7a9..a3a8ac8080 100644 --- a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb +++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb @@ -6,9 +6,9 @@ "source": [ "# Evals API: Audio Inputs\n", "\n", - "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **string match grading** to score the model responses against the output audio transcript and reference answer. Note that grading will be on text outputs from the sampled response. Graders that can grade audio input are not currently supported.\n", + "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** to score the model responses against the output audio and reference answer. Note that grading will be on audio outputs from the sampled response.\n", "\n", - "Before audio support was added, in order to evaluate audio conversations, they needed to be first transcribed to text. Now you can use the original audio and get samples from the model in audio as well. This will more accurately represent workflows such as a customer support agent where both the user and agent are using audio. For grading, we use the text transcript from the sampled audio so that we can leverage the existing suite of text graders. \n", + "Before audio support was added, to evaluate audio conversations, they first needed to be transcribed to text. Now you can use the original audio and get samples from the model in audio as well. 
This more accurately represents workflows such as a customer support scenario where both the user and the agent are using audio. For grading, we will use an audio model to grade the audio response with a model grader. We could alternatively, or in combination, use the text transcript from the sampled audio and leverage the existing suite of text graders.\n", "\n", "In this example, we will evaluate how well our model can:\n", "1. **Generate appropriate responses** to user prompts about an audio message\n", @@ -29,7 +29,7 @@ "outputs": [], "source": [ "# Install required packages\n", - "%pip install openai datasets pandas soundfile torch torchcodec --quiet" + "%pip install openai datasets pandas soundfile torch torchcodec pydub jiwer --quiet" ] }, { @@ -57,7 +57,7 @@ "source": [ "## Dataset Preparation\n", "\n", - "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. It contains an audio clip describing a logic problem, a category and an official answer." + "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that is hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. It contains an audio clip describing a logic problem, a category, and an official answer." ] }, { @@ -75,7 +75,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We extract the relevant fields and put it in a json-like format to pass in as a data source in the Evals API. Input audio data must be in the form of a base64 encoded string. So we process the data in the audio file and convert it to base64." 
+ "We extract the relevant fields and put them in a JSON-like format to pass in as a data source in the Evals API. Input audio data must be in the form of a base64-encoded string. We process the data in the audio file and convert it to base64.\n", + "\n", + "Note: Audio models currently support WAV, MP3, FLAC, Opus, or PCM16 formats. See [audio inputs](https://platform.openai.com/docs/api-reference/chat/create#chat_create-audio) for details." ] }, { @@ -214,7 +216,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Since audio inputs are large, we need to save the examples to a file and upload it to the API" + "Since audio inputs are large, we need to save the examples to a file and upload it to the API." ] }, { @@ -240,16 +242,16 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader)." + "Evals have two parts: the \"Eval\" and the \"Run\". In the \"Eval\" we define the expected structure of the data and the testing criteria (grader)." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "### Data Source Config\n", + "### Data Source Configuration\n", "\n", - "Based on the data that we have compiled, our data source config is as follows:" + "Based on the data that we have compiled, our data source configuration is as follows:" ] }, { @@ -285,9 +287,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based if the model response contains the reference answer. The response contains both audio and the text transcript of the audio. We will use the text transcript in the grader. 
For more info on graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders). \n", + "For our testing criteria, we set up our grader configuration. In this example, we use a score_model grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score of 0 or 1 based on whether the model response matches the official answer. The response contains both audio and the text transcript of the audio. We will use the audio in the grader. For more information on graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders).\n", "\n", - "Getting both the data and the grader right are key for an effective evaluation. While this example uses a simple string check grader, a more powerful model grader could be used instead and you will likely want to iteratively refine the prompts for your graders. " + "Getting both the data and the grader right is key for an effective evaluation. You will likely want to iteratively refine the prompts for your graders." ] }, { @@ -296,13 +298,52 @@ "metadata": {}, "outputs": [], "source": [ + "grader_config = {\n", + " \"type\": \"score_model\",\n", + " \"name\": \"Reference answer audio model grader\",\n", + " \"model\": \"gpt-4o-audio-preview\",\n", + " \"input\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": 'You are a helpful assistant that evaluates audio clips to judge whether they match a provided reference answer. The audio clip is the model response to the question. Respond ONLY with a single JSON object matching: {\"steps\":[{\"description\":\"string\",\"conclusion\":\"string\"}],\"result\":number}. Do not include any extra text. result must be a float in [0.0, 1.0].'\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": [\n", + " {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"Evaluate this audio clip to see if it reaches the same conclusion as the reference answer. 
Reference answer: {{item.official_answer}}\",\n", + " },\n", + " {\n", + " \"type\": \"input_audio\",\n", + " \"input_audio\": {\n", + " \"data\": \"{{ sample.output_audio.data }}\",\n", + " \"format\": \"wav\",\n", + " },\n", + " },\n", + " ],\n", + " },\n", + " ],\n", + " \"range\": [0, 1],\n", + " \"pass_threshold\": 0.6,\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Alternatively, we could use a string_check grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score of 0 or 1 based on whether the model response contains the reference answer. The response contains both audio and the text transcript of the audio. This alternative grader uses the text transcript.\n", + "\n", + "```python\n", "grader_config = {\n", " \"type\": \"string_check\",\n", " \"name\": \"String check grader\",\n", " \"input\": \"{{sample.output_text}}\",\n", " \"reference\": \"{{item.official_answer}}\",\n", " \"operation\": \"ilike\"\n", - "}" + "}\n", + "```" ] }, { @@ -350,14 +391,14 @@ "sampling_messages = [\n", " {\n", " \"role\": \"system\",\n", - " \"content\": \"You are a helpful and obedient assistant that can answer questions with audio input. You will be given an audio input containing a question and instructions on exactly how to answer. For example, if the user asks for a single word response, then you should only reply with a single word answer.\"\n", + " \"content\": \"You are a helpful and obedient assistant that can answer questions with audio input. 
You will be given an audio input containing a question to answer.\"\n", " },\n", " {\n", " \"role\": \"user\",\n", " \"type\": \"message\",\n", " \"content\": {\n", " \"type\": \"input_text\",\n", - " \"text\": \"Answer the following question by replying with a single word answer: 'valid' or 'invalid'.\"\n", + " \"text\": \"Answer the following question by replying with brief reasoning statements and a conclusion with a single word answer: 'valid' or 'invalid'.\"\n", " }\n", " },\n", " {\n", @@ -377,7 +418,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "We now kickoff an eval run." + "We now kick off an eval run." ] }, { @@ -401,8 +442,9 @@ " },\n", " \"input_messages\": {\n", " \"type\": \"template\", \n", - " \"template\": sampling_messages}\n", - " }\n", + " \"template\": sampling_messages},\n", + " \"modalities\": [\"audio\", \"text\"],\n", + " },\n", " )" ] }, @@ -417,7 +459,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results. " + "When the run finishes, we can take a look at the result. You can also check your organization's OpenAI Evals dashboard to see the progress and results." ] }, { @@ -479,11 +521,11 @@ "source": [ "## Conclusion\n", "\n", - "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals APIs. We demonstrated using a simple text grader to grade the text transcript of the audio response.\n", + "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API. We demonstrated using a score model grader to grade the audio response.\n", "### Next steps\n", - "- Convert this example to your use case. 
\n", - "- Try using model based graders for additional flexibility in grading.\n", - "- If you have large audio clips, try using the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create) for support up to 8 GB.\n" + "- Convert this example to your own use case.\n", + "- If you have large audio clips, try using the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create) for support up to 8 GB.\n", + "- Navigate to the [Evals dashboard](https://platform.openai.com/evaluations) to visualize the outputs and get insights into the performance of the eval.\n" ] } ],