From 1498a9567ce22aecebb9f05b17dc20152811cc1f Mon Sep 17 00:00:00 2001 From: Todd Hendry Date: Wed, 13 Aug 2025 10:25:05 -0700 Subject: [PATCH 1/6] Add cookbook for audio evals --- .../use-cases/EvalsAPI_Audio_Inputs.ipynb | 497 ++++++++++++++++++ 1 file changed, 497 insertions(+) create mode 100644 examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb new file mode 100644 index 0000000000..d702ce56d0 --- /dev/null +++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb @@ -0,0 +1,497 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evals API: Audio Inputs\n", + "\n", + "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer.\n", + "\n", + "In this example, we will evaluate how well our model can:\n", + "1. **Generate appropriate responses** to user prompts about an audio message\n", + "3. 
**Align with reference answers** that represent high-quality responses" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Installing Dependencies + Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Install required packages\n", + "!pip install openai datasets pandas soundfile torch torchcodec --quiet" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Import libraries\n", + "from datasets import load_dataset, Audio\n", + "from openai import OpenAI\n", + "import base64\n", + "import os\n", + "import json\n", + "import time\n", + "import io\n", + "import soundfile as sf\n", + "import numpy as np\n", + "import pandas as pd" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dataset Preparation\n", + "\n", + "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "dataset = load_dataset(\"ArtificialAnalysis/big_bench_audio\")\n", + "# Ensure audio column is decoded into a dict with 'array' and 'sampling_rate'\n", + "dataset = dataset.cast_column(\"audio\", Audio(decode=True))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We extract the relevant fields and put them in a JSON-like format to pass in as a data source in the Evals API. Input audio data must be a base64-encoded string, so we read each audio clip and convert it to base64."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Audio helpers: file/array to base64\n", + "def get_base64(audio_path_or_datauri: str) -> str:\n", + " if audio_path_or_datauri.startswith(\"data:\"):\n", + " # Already base64, just strip prefix\n", + " print(\"Already base64, just strip prefix\")\n", + " return audio_path_or_datauri.split(\",\", 1)[1]\n", + " else:\n", + " # It's a real file path\n", + " print(\"It's a real file path\")\n", + " with open(audio_path_or_datauri, \"rb\") as f:\n", + " return base64.b64encode(f.read()).decode(\"ascii\")\n", + "\n", + "\n", + "def audio_to_base64(audio_val) -> str:\n", + " \"\"\"\n", + " Accepts various Hugging Face audio representations and returns base64-encoded WAV bytes (no data: prefix).\n", + " Handles:\n", + " - dict or mapping-like with 'path'\n", + " - decoded dict with 'array' and 'sampling_rate'\n", + " - torchcodec AudioDecoder (mapping-like access via ['path'] or ['array'])\n", + " - raw bytes\n", + " \"\"\"\n", + " # Try to get a file path first\n", + " try:\n", + " path = None\n", + " if isinstance(audio_val, dict) and \"path\" in audio_val:\n", + " path = audio_val[\"path\"]\n", + " else:\n", + " # Mapping-like access\n", + " try:\n", + " path = audio_val[\"path\"] # works for many decoder objects\n", + " except Exception:\n", + " path = getattr(audio_val, \"path\", None)\n", + " if isinstance(path, str) and os.path.exists(path):\n", + " with open(path, \"rb\") as f:\n", + " return base64.b64encode(f.read()).decode(\"ascii\")\n", + " except Exception:\n", + " pass\n", + "\n", + " # Fallback: use array + sampling_rate and render to WAV in-memory\n", + " try:\n", + " array = None\n", + " sampling_rate = None\n", + " try:\n", + " array = audio_val[\"array\"]\n", + " sampling_rate = audio_val[\"sampling_rate\"]\n", + " except Exception:\n", + " array = getattr(audio_val, \"array\", None)\n", + " sampling_rate = getattr(audio_val, 
\"sampling_rate\", None)\n", + " if array is not None and sampling_rate is not None:\n", + " audio_np = np.array(array)\n", + " buf = io.BytesIO()\n", + " sf.write(buf, audio_np, int(sampling_rate), format=\"WAV\")\n", + " return base64.b64encode(buf.getvalue()).decode(\"ascii\")\n", + " except Exception:\n", + " pass\n", + "\n", + " if isinstance(audio_val, (bytes, bytearray)):\n", + " return base64.b64encode(audio_val).decode(\"ascii\")\n", + "\n", + " raise ValueError(\"Unsupported audio value; could not convert to base64\")\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "evals_data_source = []\n", + "audio_base64 = None\n", + "\n", + "for example in dataset[\"train\"].select(range(3)):\n", + " audio_val = example[\"audio\"]\n", + " try:\n", + " audio_base64 = audio_to_base64(audio_val)\n", + " except Exception as e:\n", + " print(f\"Warning: could not encode audio for id={example['id']}: {e}\")\n", + " audio_base64 = None\n", + " evals_data_source.append({\n", + " \"item\": {\n", + " \"id\": example[\"id\"],\n", + " \"category\": example[\"category\"],\n", + " \"official_answer\": example[\"official_answer\"],\n", + " \"audio_base64\": audio_base64\n", + " }\n", + " })\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you print the data source list, each item should be of a similar form to:\n", + "\n", + "```python\n", + "{\n", + " \"item\": {\n", + " \"id\": 0\n", + " \"category\": \"formal_fallacies\"\n", + " \"official_answer\": \"invalid\"\n", + " \"audio_base64\": \"UklGRjrODwBXQVZFZm10IBAAAAABAAEAIlYAAESsA...\"\n", + " }\n", + "}\n", + "```" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eval Configuration\n", + "\n", + "Now that we have our data source and task, we will create our evals. 
For the OpenAI Evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "client = OpenAI(\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since audio inputs are large, we save the examples to a file (one JSON object per line) and upload that file to the API." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "# Save the examples to a file, one JSON object per line\n", + "file_name = \"evals_data_source.json\"\n", + "with open(file_name, \"w\", encoding=\"utf-8\") as f:\n", + " for obj in evals_data_source:\n", + " f.write(json.dumps(obj, ensure_ascii=False) + \"\\n\")\n", + "\n", + "# Upload the file to the API\n", + "file = client.files.create(\n", + " file=open(file_name, \"rb\"),\n", + " purpose=\"evals\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Evals have two parts, the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader)."
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Data Source Config\n", + "\n", + "Based on the data that we have compiled, our data source config is as follows:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_source_config = {\n", + " \"type\": \"custom\",\n", + " \"item_schema\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"id\": { \"type\": \"integer\" },\n", + " \"category\": { \"type\": \"string\" },\n", + " \"official_answer\": { \"type\": \"string\" },\n", + " \"audio_base64\": { \"type\": \"string\" }\n", + " },\n", + " \"required\": [\"id\", \"category\", \"official_answer\", \"audio_base64\"]\n", + " },\n", + " \"include_sample_schema\": True, # enables sampling\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Testing Criteria" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based if the model response matches the reference answer. For more info on graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders). \n", + "\n", + "Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "grader_config = {\n", + " \"type\": \"string_check\",\n", + " \"name\": \"String check grader\",\n", + " \"input\": \"{{sample.output_text}}\",\n", + " \"reference\": \"{{item.official_answer}}\",\n", + " \"operation\": \"like\"\n", + "}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, we create the eval object." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eval_object = client.evals.create(\n", + " name=\"Audio Grading Cookbook\",\n", + " data_source_config=data_source_config,\n", + " testing_criteria=[grader_config],\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Eval Run" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response.\n", + "\n", + "Here's the sampling message input we'll use for this example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "sampling_messages = [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful assistant that can answer questions with the audio input. You will be given an audio input and a question. You will need to answer the question based on the audio input.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"type\": \"message\",\n", + " \"content\": {\n", + " \"type\": \"input_audio\",\n", + " \"input_audio\": {\n", + " \"data\": \"{{ item.audio_base64 }}\",\n", + " \"format\": \"wav\"\n", + " }\n", + " }\n", + " }]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We now kickoff an eval run." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "eval_run = client.evals.runs.create(\n", + " name=\"Audio Input Eval Run\",\n", + " eval_id=eval_object.id,\n", + " data_source={\n", + " \"type\": \"completions\", # sample using completions API\n", + " \"source\": {\n", + " \"type\": \"file_id\",\n", + " \"id\": file.id\n", + " },\n", + " \"model\": \"gpt-4o-audio-preview\", # model used to generate the response; check that the model you use supports audio inputs\n", + " \"input_messages\": {\n", + " \"type\": \"template\", \n", + " \"template\": sampling_messages}\n", + " }\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Poll and Display the Results" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "When the run finishes, we can take a look at the result. You can also check in your org's OpenAI evals dashboard to see the progress and results. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "while True:\n", + " run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)\n", + " if run.status == \"completed\":\n", + " output_items = list(client.evals.runs.output_items.list(\n", + " run_id=run.id, eval_id=eval_object.id\n", + " ))\n", + " df = pd.DataFrame({\n", + " \"id\": [item.datasource_item[\"id\"]for item in output_items],\n", + " \"category\": [item.datasource_item[\"category\"] for item in output_items],\n", + " \"official_answer\": [item.datasource_item[\"official_answer\"] for item in output_items],\n", + " \"model_response\": [item.sample.output[0].content for item in output_items],\n", + " \"grading_results\": [\"passed\" if item.results[0][\"passed\"] else \"failed\"\n", + " for item in output_items]\n", + " })\n", + " display(df)\n", + " break\n", + " if run.status == \"failed\":\n", + " print(run.error)\n", + " break\n", + " time.sleep(5)" 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Viewing Individual Output Items" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "To see a full output item, we can do the following. The structure of an output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "first_item = output_items[0]\n", + "\n", + "print(json.dumps(dict(first_item), indent=2, default=str))" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Conclusion\n", + "\n", + "In this cookbook, we covered a workflow for evaluating an audio-based task using the OpenAI Evals API's. By using the audio input functionality, we were able to streamline our evals process for the task. It could also be useful to use the audio transcript as input to a model grader for additional flexibility in grading the response.\n", + "\n", + "We're excited to see you extend this to your own audio-based use cases!" 
+ ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} From 113db03ba3006621babf1fa8706ed692373df7b3 Mon Sep 17 00:00:00 2001 From: Todd Hendry Date: Wed, 13 Aug 2025 10:27:27 -0700 Subject: [PATCH 2/6] add comment --- examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb index d702ce56d0..30b281146b 100644 --- a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb +++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb @@ -381,7 +381,7 @@ " name=\"Audio Input Eval Run\",\n", " eval_id=eval_object.id,\n", " data_source={\n", - " \"type\": \"completions\", # sample using completions API\n", + " \"type\": \"completions\", # sample using completions API; responses API is not supported for audio inputs\n", " \"source\": {\n", " \"type\": \"file_id\",\n", " \"id\": file.id\n", From c01cd5e29cd9a920c0410db76e5d52f5bed99a60 Mon Sep 17 00:00:00 2001 From: Todd Hendry Date: Wed, 13 Aug 2025 10:45:02 -0700 Subject: [PATCH 3/6] content and author setup --- authors.yaml | 5 +++++ registry.yaml | 9 +++++++++ 2 files changed, 14 insertions(+) diff --git a/authors.yaml b/authors.yaml index 46d2a95fbe..7375f200d4 100644 --- a/authors.yaml +++ b/authors.yaml @@ -456,3 +456,8 @@ yagil: name: "Yagil Burowski" website: "https://x.com/yagilb" avatar: "https://avatars.lmstudio.com/profile-images/yagil" + +hendrytl: + name: "Todd Hendry" + website: "https://www.linkedin.com/in/todd-hendry-962aa577/" + avatar: 
"https://avatars.githubusercontent.com/u/36863669" diff --git a/registry.yaml b/registry.yaml index 55006ccd53..23c91b9ec8 100644 --- a/registry.yaml +++ b/registry.yaml @@ -157,6 +157,15 @@ - evals - images +- title: Using Evals API on Audio Inputs + path: examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb + date: 2025-08-13 + authors: + - hendrytl + tags: + - evals + - audio + - title: Optimize Prompts path: examples/Optimize_Prompts.ipynb date: 2025-07-14 From 7cfb312172e7fe1d526e2ef46abad43924a8b7ca Mon Sep 17 00:00:00 2001 From: Todd Hendry Date: Wed, 13 Aug 2025 11:27:01 -0700 Subject: [PATCH 4/6] address review comments --- .../evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb index 30b281146b..89b68a5fcf 100644 --- a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb +++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb @@ -6,11 +6,11 @@ "source": [ "# Evals API: Audio Inputs\n", "\n", - "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer.\n", + "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer. Note that grading will be on text outputs from the sampled response. 
Graders that can grade audio input are not currently supported.\n", "\n", "In this example, we will evaluate how well our model can:\n", "1. **Generate appropriate responses** to user prompts about an audio message\n", - "3. **Align with reference answers** that represent high-quality responses" + "2. **Align with reference answers** that represent high-quality responses" ] }, { @@ -284,9 +284,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based if the model response matches the reference answer. For more info on graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders). \n", + "For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and sampled model response (in the `sample` namespace), and then outputs a score between 0 and 1 based if the model response contains the reference answer. The response contains both audio and the text transcript of the audio. We will use the text transcript in the grader. For more info on graders, visit [API Grader docs](https://platform.openai.com/docs/api-reference/graders). \n", "\n", - "Getting the both the data and the grader right are key for an effective evaluation. So, you will likely want to iteratively refine the prompts for your graders. " + "Getting both the data and the grader right are key for an effective evaluation. While this example uses a simple string check grader, a more powerful model grader could be used instead and you will likely want to iteratively refine the prompts for your graders. 
" ] }, { @@ -467,9 +467,7 @@ "source": [ "## Conclusion\n", "\n", - "In this cookbook, we covered a workflow for evaluating an audio-based task using the OpenAI Evals API's. By using the audio input functionality, we were able to streamline our evals process for the task. It could also be useful to use the audio transcript as input to a model grader for additional flexibility in grading the response.\n", - "\n", - "We're excited to see you extend this to your own audio-based use cases!" + "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API's. We could additionally add model based graders for additional flexibility in grading in future.\n" ] } ], From a4bbcd7e3dcea19cb2e701d5f6d1b68620947e87 Mon Sep 17 00:00:00 2001 From: Todd Hendry Date: Wed, 13 Aug 2025 13:45:23 -0700 Subject: [PATCH 5/6] added more details to the descriptions --- .../use-cases/EvalsAPI_Audio_Inputs.ipynb | 34 ++++++++++++++----- 1 file changed, 25 insertions(+), 9 deletions(-) diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb index 89b68a5fcf..20127d8572 100644 --- a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb +++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb @@ -6,7 +6,9 @@ "source": [ "# Evals API: Audio Inputs\n", "\n", - "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **model grading** (LLM as a Judge) to score the model responses against the output audio transcript, prompt, and reference answer. Note that grading will be on text outputs from the sampled response. Graders that can grade audio input are not currently supported.\n", + "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. 
Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **string match grading** to score the model responses against the output audio transcript and reference answer. Note that grading will be on text outputs from the sampled response. Graders that can grade audio input are not currently supported.\n", + "\n", + "Before audio support was added, in order to evaluate audio conversations, they needed to be first transcribed to text. Now you can use the original audio and get samples from the model in audio as well. This will more accurately repesent workflows such as a customer suppor agent where both the user and agent are using aduio. For grading, we use the text transcript from the sampled audio so that we can leverage the existig suite of text graders. \n", "\n", "In this example, we will evaluate how well our model can:\n", "1. **Generate appropriate responses** to user prompts about an audio message\n", @@ -27,7 +29,7 @@ "outputs": [], "source": [ "# Install required packages\n", - "!pip install openai datasets pandas soundfile torch torchcodec --quiet" + "%pip install openai datasets pandas soundfile torch torchcodec --quiet" ] }, { @@ -55,7 +57,7 @@ "source": [ "## Dataset Preparation\n", "\n", - "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input." + "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. 
It contains an audio clip describing a logic problem, a category and an offical answer." ] }, { @@ -86,11 +88,9 @@ "def get_base64(audio_path_or_datauri: str) -> str:\n", " if audio_path_or_datauri.startswith(\"data:\"):\n", " # Already base64, just strip prefix\n", - " print(\"Already base64, just strip prefix\")\n", " return audio_path_or_datauri.split(\",\", 1)[1]\n", " else:\n", " # It's a real file path\n", - " print(\"It's a real file path\")\n", " with open(audio_path_or_datauri, \"rb\") as f:\n", " return base64.b64encode(f.read()).decode(\"ascii\")\n", "\n", @@ -154,6 +154,7 @@ "evals_data_source = []\n", "audio_base64 = None\n", "\n", + "# Will use the first 3 examples for testing\n", "for example in dataset[\"train\"].select(range(3)):\n", " audio_val = example[\"audio\"]\n", " try:\n", @@ -205,7 +206,7 @@ "outputs": [], "source": [ "client = OpenAI(\n", - " api_key=os.getenv(\"OPENAI_API_KEY\")\n", + " api_key=os.getenv(\"OPENAI_API_KEY_DISTILLATION\")\n", ")" ] }, @@ -300,7 +301,7 @@ " \"name\": \"String check grader\",\n", " \"input\": \"{{sample.output_text}}\",\n", " \"reference\": \"{{item.official_answer}}\",\n", - " \"operation\": \"like\"\n", + " \"operation\": \"ilike\"\n", "}" ] }, @@ -349,7 +350,15 @@ "sampling_messages = [\n", " {\n", " \"role\": \"system\",\n", - " \"content\": \"You are a helpful assistant that can answer questions with the audio input. You will be given an audio input and a question. You will need to answer the question based on the audio input.\"\n", + " \"content\": \"You are a helpful and obedient assistant that can answer questions with audio input. You will be given an audio input containing a question and instructions on exactly how to answer. 
For example, if the user asks for a single word response, then you should only reply with a single word answer.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"type\": \"message\",\n", + " \"content\": {\n", + " \"type\": \"input_text\",\n", + " \"text\": \"Answer the following question by replying with a single word answer: 'valid' or 'invalid'.\"\n", + " }\n", " },\n", " {\n", " \"role\": \"user\",\n", @@ -387,6 +396,9 @@ " \"id\": file.id\n", " },\n", " \"model\": \"gpt-4o-audio-preview\", # model used to generate the response; check that the model you use supports audio inputs\n", + " \"sampling_params\": {\n", + " \"temperature\": 0.0,\n", + " },\n", " \"input_messages\": {\n", " \"type\": \"template\", \n", " \"template\": sampling_messages}\n", @@ -467,7 +479,11 @@ "source": [ "## Conclusion\n", "\n", - "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API's. We could additionally add model based graders for additional flexibility in grading in future.\n" + "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API's. We demonstrated using a simple text grader to grade the text transcript of the audio response.\n", + "### Next steps\n", + "- Convert this example to your use case. 
\n", + "- Try using model based graders for additional flexibility in grading.\n", + "- If you have large audio clips, try using the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create) for support up to 8 GB.\n" ] } ], From e809dc2d66867dcfa9a10cf0de5751046fe164ce Mon Sep 17 00:00:00 2001 From: Todd Hendry Date: Wed, 13 Aug 2025 15:16:06 -0700 Subject: [PATCH 6/6] typos --- examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb index 20127d8572..f2ae9bd7a9 100644 --- a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb +++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb @@ -8,7 +8,7 @@ "\n", "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate model responses and **string match grading** to score the model responses against the output audio transcript and reference answer. Note that grading will be on text outputs from the sampled response. Graders that can grade audio input are not currently supported.\n", "\n", - "Before audio support was added, in order to evaluate audio conversations, they needed to be first transcribed to text. Now you can use the original audio and get samples from the model in audio as well. This will more accurately repesent workflows such as a customer suppor agent where both the user and agent are using aduio. For grading, we use the text transcript from the sampled audio so that we can leverage the existig suite of text graders. \n", + "Before audio support was added, in order to evaluate audio conversations, they needed to be first transcribed to text. Now you can use the original audio and get samples from the model in audio as well. 
This will more accurately represent workflows such as a customer support agent where both the user and agent are using audio. For grading, we use the text transcript from the sampled audio so that we can leverage the existing suite of text graders. \n", "\n", "In this example, we will evaluate how well our model can:\n", "1. **Generate appropriate responses** to user prompts about an audio message\n", @@ -57,7 +57,7 @@ "source": [ "## Dataset Preparation\n", "\n", - "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. It contains an audio clip describing a logic problem, a category and an offical answer." + "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset that's hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions. The dataset can be used for evaluating the reasoning capabilities of models that support audio input. It contains an audio clip describing a logic problem, a category and an official answer." ] }, { @@ -206,7 +206,7 @@ "outputs": [], "source": [ "client = OpenAI(\n", - " api_key=os.getenv(\"OPENAI_API_KEY_DISTILLATION\")\n", + " api_key=os.getenv(\"OPENAI_API_KEY\")\n", ")" ] }, @@ -479,7 +479,7 @@ "source": [ "## Conclusion\n", "\n", - "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API's. We demonstrated using a simple text grader to grade the text transcript of the audio response.\n", + "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals APIs. 
We demonstrated using a simple text grader to grade the text transcript of the audio response.\n", "### Next steps\n", "- Convert this example to your use case. \n", "- Try using model based graders for additional flexibility in grading.\n",