diff --git a/authors.yaml b/authors.yaml
index 46d2a95fbe..7375f200d4 100644
--- a/authors.yaml
+++ b/authors.yaml
@@ -456,3 +456,8 @@ yagil:
   name: "Yagil Burowski"
   website: "https://x.com/yagilb"
   avatar: "https://avatars.lmstudio.com/profile-images/yagil"
+
+hendrytl:
+  name: "Todd Hendry"
+  website: "https://www.linkedin.com/in/todd-hendry-962aa577/"
+  avatar: "https://avatars.githubusercontent.com/u/36863669"
diff --git a/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb
new file mode 100644
index 0000000000..f2ae9bd7a9
--- /dev/null
+++ b/examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb
@@ -0,0 +1,511 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Evals API: Audio Inputs\n",
+    "\n",
+    "This cookbook demonstrates how to use OpenAI's Evals framework for audio-based tasks. Leveraging the Evals API, we will grade model-generated responses to an audio message and prompt by using **sampling** to generate the model responses and **string match grading** to score the transcript of each sampled response against a reference answer. Note that grading is done on the text output of the sampled response; graders that grade audio input directly are not currently supported.\n",
+    "\n",
+    "Before audio support was added, audio conversations had to be transcribed to text before they could be evaluated. Now you can use the original audio and get samples from the model in audio as well. This more accurately represents workflows such as a customer support agent, where both the user and the agent communicate in audio. For grading, we use the text transcript of the sampled audio so that we can leverage the existing suite of text graders.\n",
+    "\n",
+    "In this example, we will evaluate how well our model can:\n",
+    "1. **Generate appropriate responses** to user prompts about an audio message\n",
+    "2. **Align with reference answers** that represent high-quality responses"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Installing Dependencies + Setup"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Install required packages\n",
+    "%pip install openai datasets pandas soundfile torch torchcodec --quiet"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import libraries\n",
+    "from datasets import load_dataset, Audio\n",
+    "from openai import OpenAI\n",
+    "import base64\n",
+    "import os\n",
+    "import json\n",
+    "import time\n",
+    "import io\n",
+    "import soundfile as sf\n",
+    "import numpy as np\n",
+    "import pandas as pd"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Dataset Preparation\n",
+    "\n",
+    "We use the [big_bench_audio](https://huggingface.co/datasets/ArtificialAnalysis/big_bench_audio) dataset hosted on Hugging Face. Big Bench Audio is an audio version of a subset of Big Bench Hard questions and can be used to evaluate the reasoning capabilities of models that support audio input. Each example contains an audio clip describing a logic problem, a category, and an official answer."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "dataset = load_dataset(\"ArtificialAnalysis/big_bench_audio\")\n",
+    "# Ensure audio column is decoded into a dict with 'array' and 'sampling_rate'\n",
+    "dataset = dataset.cast_column(\"audio\", Audio(decode=True))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We extract the relevant fields and put them in a JSON-like format to pass in as a data source to the Evals API. Input audio data must be in the form of a base64-encoded string, so we read the data from each audio file and convert it to base64."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Audio helpers: file/array to base64\n",
+    "def get_base64(audio_path_or_datauri: str) -> str:\n",
+    "    if audio_path_or_datauri.startswith(\"data:\"):\n",
+    "        # Already base64, just strip prefix\n",
+    "        return audio_path_or_datauri.split(\",\", 1)[1]\n",
+    "    else:\n",
+    "        # It's a real file path\n",
+    "        with open(audio_path_or_datauri, \"rb\") as f:\n",
+    "            return base64.b64encode(f.read()).decode(\"ascii\")\n",
+    "\n",
+    "\n",
+    "def audio_to_base64(audio_val) -> str:\n",
+    "    \"\"\"\n",
+    "    Accepts various Hugging Face audio representations and returns base64-encoded WAV bytes (no data: prefix).\n",
+    "    Handles:\n",
+    "    - dict or mapping-like with 'path'\n",
+    "    - decoded dict with 'array' and 'sampling_rate'\n",
+    "    - torchcodec AudioDecoder (mapping-like access via ['path'] or ['array'])\n",
+    "    - raw bytes\n",
+    "    \"\"\"\n",
+    "    # Try to get a file path first\n",
+    "    try:\n",
+    "        path = None\n",
+    "        if isinstance(audio_val, dict) and \"path\" in audio_val:\n",
+    "            path = audio_val[\"path\"]\n",
+    "        else:\n",
+    "            # Mapping-like access\n",
+    "            try:\n",
+    "                path = audio_val[\"path\"]  # works for many decoder objects\n",
+    "            except Exception:\n",
+    "                path = getattr(audio_val, \"path\", None)\n",
+    "        if isinstance(path, str) and os.path.exists(path):\n",
+    "            with open(path, \"rb\") as f:\n",
+    "                return base64.b64encode(f.read()).decode(\"ascii\")\n",
+    "    except Exception:\n",
+    "        pass\n",
+    "\n",
+    "    # Fallback: use array + sampling_rate and render to WAV in-memory\n",
+    "    try:\n",
+    "        array = None\n",
+    "        sampling_rate = None\n",
+    "        try:\n",
+    "            array = audio_val[\"array\"]\n",
+    "            sampling_rate = audio_val[\"sampling_rate\"]\n",
+    "        except Exception:\n",
+    "            array = getattr(audio_val, \"array\", None)\n",
+    "            sampling_rate = getattr(audio_val, \"sampling_rate\", None)\n",
+    "        if array is not None and sampling_rate is not None:\n",
+    "            audio_np = np.array(array)\n",
+    "            buf = io.BytesIO()\n",
+    "            sf.write(buf, audio_np, int(sampling_rate), format=\"WAV\")\n",
+    "            return base64.b64encode(buf.getvalue()).decode(\"ascii\")\n",
+    "    except Exception:\n",
+    "        pass\n",
+    "\n",
+    "    if isinstance(audio_val, (bytes, bytearray)):\n",
+    "        return base64.b64encode(audio_val).decode(\"ascii\")\n",
+    "\n",
+    "    raise ValueError(\"Unsupported audio value; could not convert to base64\")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "evals_data_source = []\n",
+    "audio_base64 = None\n",
+    "\n",
+    "# Will use the first 3 examples for testing\n",
+    "for example in dataset[\"train\"].select(range(3)):\n",
+    "    audio_val = example[\"audio\"]\n",
+    "    try:\n",
+    "        audio_base64 = audio_to_base64(audio_val)\n",
+    "    except Exception as e:\n",
+    "        print(f\"Warning: could not encode audio for id={example['id']}: {e}\")\n",
+    "        audio_base64 = None\n",
+    "    evals_data_source.append({\n",
+    "        \"item\": {\n",
+    "            \"id\": example[\"id\"],\n",
+    "            \"category\": example[\"category\"],\n",
+    "            \"official_answer\": example[\"official_answer\"],\n",
+    "            \"audio_base64\": audio_base64\n",
+    "        }\n",
+    "    })\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you print the data source list, each item should be of a similar form to:\n",
+    "\n",
+    "```python\n",
+    "{\n",
+    "    \"item\": {\n",
+    "        \"id\": 0,\n",
+    "        \"category\": \"formal_fallacies\",\n",
+    "        \"official_answer\": \"invalid\",\n",
+    "        \"audio_base64\": \"UklGRjrODwBXQVZFZm10IBAAAAABAAEAIlYAAESsA...\"\n",
+    "    }\n",
+    "}\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Eval Configuration\n",
+    "\n",
+    "Now that we have our data source and task, we will create our eval. For the OpenAI Evals API docs, visit [API docs](https://platform.openai.com/docs/evals/overview).\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "client = OpenAI(\n",
+    "    api_key=os.getenv(\"OPENAI_API_KEY\")\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Since audio inputs are large, we need to save the examples to a file and upload that file to the API."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Save the examples to a file\n",
+    "file_name = \"evals_data_source.json\"\n",
+    "with open(file_name, \"w\", encoding=\"utf-8\") as f:\n",
+    "    for obj in evals_data_source:\n",
+    "        f.write(json.dumps(obj, ensure_ascii=False) + \"\\n\")\n",
+    "\n",
+    "# Upload the file to the API\n",
+    "file = client.files.create(\n",
+    "    file=open(file_name, \"rb\"),\n",
+    "    purpose=\"evals\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Evals have two parts: the \"Eval\" and the \"Run\". In the \"Eval\", we define the expected structure of the data and the testing criteria (grader)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Data Source Config\n",
+    "\n",
+    "Based on the data that we have compiled, our data source config is as follows:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_source_config = {\n",
+    "    \"type\": \"custom\",\n",
+    "    \"item_schema\": {\n",
+    "        \"type\": \"object\",\n",
+    "        \"properties\": {\n",
+    "            \"id\": { \"type\": \"integer\" },\n",
+    "            \"category\": { \"type\": \"string\" },\n",
+    "            \"official_answer\": { \"type\": \"string\" },\n",
+    "            \"audio_base64\": { \"type\": \"string\" }\n",
+    "        },\n",
+    "        \"required\": [\"id\", \"category\", \"official_answer\", \"audio_base64\"]\n",
+    "    },\n",
+    "    \"include_sample_schema\": True,  # enables sampling\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Testing Criteria"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "For our testing criteria, we set up our grader config. In this example, it is a simple string_check grader that takes in the official answer and the sampled model response (in the `sample` namespace), and then outputs a score of 0 or 1 based on whether the model response matches the reference answer. The response contains both audio and the text transcript of the audio; we use the text transcript in the grader. For more info on graders, visit the [API grader docs](https://platform.openai.com/docs/api-reference/graders).\n",
+    "\n",
+    "Getting both the data and the grader right is key for an effective evaluation. While this example uses a simple string check grader, a more powerful model-based grader could be used instead, and you will likely want to iteratively refine the prompts for your graders."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "grader_config = {\n",
+    "    \"type\": \"string_check\",\n",
+    "    \"name\": \"String check grader\",\n",
+    "    \"input\": \"{{sample.output_text}}\",\n",
+    "    \"reference\": \"{{item.official_answer}}\",\n",
+    "    \"operation\": \"ilike\"\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Now, we create the eval object."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "eval_object = client.evals.create(\n",
+    "    name=\"Audio Grading Cookbook\",\n",
+    "    data_source_config=data_source_config,\n",
+    "    testing_criteria=[grader_config],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Eval Run"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To create the run, we pass in the eval object id, the data source (i.e., the data we compiled earlier), and the chat message input we will use for sampling to generate the model response.\n",
+    "\n",
+    "Here's the sampling message input we'll use for this example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "sampling_messages = [\n",
+    "    {\n",
+    "        \"role\": \"system\",\n",
+    "        \"content\": \"You are a helpful and obedient assistant that can answer questions with audio input. You will be given an audio input containing a question and instructions on exactly how to answer. For example, if the user asks for a single word response, then you should only reply with a single word answer.\"\n",
+    "    },\n",
+    "    {\n",
+    "        \"role\": \"user\",\n",
+    "        \"type\": \"message\",\n",
+    "        \"content\": {\n",
+    "            \"type\": \"input_text\",\n",
+    "            \"text\": \"Answer the following question by replying with a single word answer: 'valid' or 'invalid'.\"\n",
+    "        }\n",
+    "    },\n",
+    "    {\n",
+    "        \"role\": \"user\",\n",
+    "        \"type\": \"message\",\n",
+    "        \"content\": {\n",
+    "            \"type\": \"input_audio\",\n",
+    "            \"input_audio\": {\n",
+    "                \"data\": \"{{ item.audio_base64 }}\",\n",
+    "                \"format\": \"wav\"\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We now kick off an eval run."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "eval_run = client.evals.runs.create(\n",
+    "    name=\"Audio Input Eval Run\",\n",
+    "    eval_id=eval_object.id,\n",
+    "    data_source={\n",
+    "        \"type\": \"completions\",  # sample using the completions API; the responses API is not supported for audio inputs\n",
+    "        \"source\": {\n",
+    "            \"type\": \"file_id\",\n",
+    "            \"id\": file.id\n",
+    "        },\n",
+    "        \"model\": \"gpt-4o-audio-preview\",  # model used to generate the response; check that the model you use supports audio inputs\n",
+    "        \"sampling_params\": {\n",
+    "            \"temperature\": 0.0,\n",
+    "        },\n",
+    "        \"input_messages\": {\n",
+    "            \"type\": \"template\",\n",
+    "            \"template\": sampling_messages\n",
+    "        }\n",
+    "    }\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Poll and Display the Results"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "When the run finishes, we can take a look at the results. You can also check your org's OpenAI evals dashboard to see progress and results."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "while True:\n",
+    "    run = client.evals.runs.retrieve(run_id=eval_run.id, eval_id=eval_object.id)\n",
+    "    if run.status == \"completed\":\n",
+    "        output_items = list(client.evals.runs.output_items.list(\n",
+    "            run_id=run.id, eval_id=eval_object.id\n",
+    "        ))\n",
+    "        df = pd.DataFrame({\n",
+    "            \"id\": [item.datasource_item[\"id\"] for item in output_items],\n",
+    "            \"category\": [item.datasource_item[\"category\"] for item in output_items],\n",
+    "            \"official_answer\": [item.datasource_item[\"official_answer\"] for item in output_items],\n",
+    "            \"model_response\": [item.sample.output[0].content for item in output_items],\n",
+    "            \"grading_results\": [\"passed\" if item.results[0][\"passed\"] else \"failed\"\n",
+    "                                for item in output_items]\n",
+    "        })\n",
+    "        display(df)\n",
+    "        break\n",
+    "    if run.status == \"failed\":\n",
+    "        print(run.error)\n",
+    "        break\n",
+    "    time.sleep(5)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Viewing Individual Output Items"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To see a full output item, we can do the following. The structure of an output item is specified in the API docs [here](https://platform.openai.com/docs/api-reference/evals/run-output-item-object)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "first_item = output_items[0]\n",
+    "\n",
+    "print(json.dumps(dict(first_item), indent=2, default=str))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "In this cookbook, we covered a workflow for evaluating native audio inputs to a model using the OpenAI Evals API. We demonstrated using a simple text grader to grade the text transcript of the audio response.\n",
+    "\n",
+    "### Next steps\n",
+    "- Adapt this example to your own use case.\n",
+    "- Try using model-based graders for additional flexibility in grading.\n",
+    "- If you have large audio clips, try using the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create), which supports files up to 8 GB (see the sketch in the appendix below).\n"
+   ]
+  },
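+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Appendix: uploading large data files with the Uploads API\n",
+    "\n",
+    "Because each item embeds base64-encoded audio, the data source file can grow large quickly. The cell below is a minimal, illustrative sketch (not run as part of this cookbook) of how the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create) could be used to send a large data source file in parts and obtain a file id for the eval run. It assumes the `evals` purpose and the MIME type shown are accepted by that endpoint; check the uploads API reference before relying on it."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Illustrative sketch only: upload a large data source file in parts.\n",
+    "# Assumptions: purpose=\"evals\" and the MIME type below are accepted; see the uploads API docs.\n",
+    "upload = client.uploads.create(\n",
+    "    purpose=\"evals\",\n",
+    "    filename=file_name,\n",
+    "    bytes=os.path.getsize(file_name),\n",
+    "    mime_type=\"text/jsonl\",  # assumed MIME type for JSONL data\n",
+    ")\n",
+    "\n",
+    "# Send the file in chunks of up to 64 MB each\n",
+    "part_ids = []\n",
+    "with open(file_name, \"rb\") as f:\n",
+    "    while chunk := f.read(64 * 1024 * 1024):\n",
+    "        part = client.uploads.parts.create(upload_id=upload.id, data=chunk)\n",
+    "        part_ids.append(part.id)\n",
+    "\n",
+    "completed = client.uploads.complete(upload_id=upload.id, part_ids=part_ids)\n",
+    "print(completed.file.id)  # this file id can then be used as the eval run's source file"
+   ]
+  }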
\n", + "- Try using model based graders for additional flexibility in grading.\n", + "- If you have large audio clips, try using the [uploads API](https://platform.openai.com/docs/api-reference/uploads/create) for support up to 8 GB.\n" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": ".venv", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.10" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/registry.yaml b/registry.yaml index 55006ccd53..23c91b9ec8 100644 --- a/registry.yaml +++ b/registry.yaml @@ -157,6 +157,15 @@ - evals - images +- title: Using Evals API on Audio Inputs + path: examples/evaluation/use-cases/EvalsAPI_Audio_Inputs.ipynb + date: 2025-08-13 + authors: + - hendrytl + tags: + - evals + - audio + - title: Optimize Prompts path: examples/Optimize_Prompts.ipynb date: 2025-07-14