An automated evaluation system that uses an OpenAI model (gpt-4o by default) as an LLM judge to assess the quality of audio transcription outputs through pairwise comparisons, golden master evaluation, and hallucination detection.
- Hallucination Detection: Identifies repetitive patterns and invented content
- Pairwise Comparisons: Head-to-head evaluation of different transcript models
- Golden Master Evaluation: Compares transcripts against a reference standard
- Comprehensive Scoring: Rates faithfulness, readability, proper nouns, and speaker consistency
- Automated Reports: Console output and detailed JSON results
qual-evals/
├── inputs/ # Input transcript files
│ ├── *_*_vad.txt # Transcript files (format: speaker_model_vad.txt)
│ └── generated_golden_copy.txt # Golden reference transcript
├── prompts/ # LLM judge prompts
│ ├── hallucination_detection.txt
│ ├── pairwise_comparison.txt
│ └── golden_master_comparison.txt
├── outputs/ # Generated evaluation reports
├── llm_judge.py # Main evaluation script
└── .env # OpenAI API key configuration
- Python 3.11+
- OpenAI API key
- UV package manager (or pip)
cd experiments/qual-evals
# Install dependencies with uv
uv add openai python-dotenv
# Or with pip
pip install openai python-dotenv
Set up your OpenAI API key in .env:
echo "OPENAI_API_KEY=your_api_key_here" > .env
Place your transcript files in the inputs/ folder:
- Transcript files: Named <speaker>_<model>_vad.txt (e.g., john_whisper_base_vad.txt)
- Golden reference: Named generated_golden_copy.txt
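For reference, here is a minimal sketch of how the inputs/ folder could be scanned for these files; the variable names and discovery logic are illustrative, not taken from llm_judge.py.
# Illustrative only: discover transcripts and the golden reference in inputs/
from pathlib import Path

input_dir = Path("inputs")
golden_path = input_dir / "generated_golden_copy.txt"
transcript_paths = sorted(input_dir.glob("*_vad.txt"))  # the golden copy does not match this pattern

for path in transcript_paths:
    print(f"Found transcript: {path.name}")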
# Run all evaluations
python llm_judge.py
# Skip specific evaluation types
python llm_judge.py --skip hallucination pairwise
# List available evaluation types
python llm_judge.py --list-evaluations
# Use a different model
python llm_judge.py --model gpt-4o-mini
Example console output:
🔍 Evaluating 3 transcripts...
📊 Running hallucination detection...
Analyzing base...
Analyzing small...
Analyzing medium...
⚖️ Running pairwise comparisons...
Comparing base vs small...
Comparing base vs medium...
Comparing small vs medium...
🏆 Running golden master comparisons...
Comparing base to golden master...
Comparing small to golden master...
Comparing medium to golden master...
============================================================
🎯 TRANSCRIPT EVALUATION REPORT
============================================================
📅 Generated: 2025-01-15T10:30:45.123456
🤖 LLM Model: gpt-4o
📄 Files Evaluated: 3
👻 HALLUCINATION ANALYSIS
------------------------------
small | Score: 3/10
medium | Score: 2/10
base | Score: 4/10
🏆 GOLDEN MASTER COMPARISON
------------------------------
small | Score: 8/10
medium | Score: 9/10
base | Score: 7/10
⚖️ PAIRWISE COMPARISON WINS
------------------------------
medium | Wins: 2
small | Wins: 1
🥇 Best Performer: medium
🥉 Worst Performer: base
============================================================
💾 Full results saved to: outputs/evaluation_results_20250115_103045.json
Hallucination scores use a 1-10 scale:
- 1: No hallucinations detected
- 10: Severe hallucinations throughout
- Identifies repetitive patterns like "No. No. No. No."
- Detects looping sentences and fabricated content
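As a rough illustration, a hallucination check can be a single chat completion built from the prompt in prompts/hallucination_detection.txt; the prompt/transcript concatenation below is an assumption, not the exact code in llm_judge.py.
# Illustrative sketch; the real prompt format in llm_judge.py may differ
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment (.env)

prompt = Path("prompts/hallucination_detection.txt").read_text()
transcript = Path("inputs/john_whisper_base_vad.txt").read_text()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{prompt}\n\nTranscript:\n{transcript}"}],
)
print(response.choices[0].message.content)  # expected to include a 1-10 score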
Golden master comparisons score each transcript against the reference in four categories:
- Faithfulness: Content accuracy vs reference
- Readability: Punctuation and flow quality
- Proper Nouns: Names and numerical accuracy
- Speakers: Speaker labeling consistency
Pairwise comparisons work as follows:
- Direct head-to-head evaluation between transcript pairs
- Winner determination based on overall quality
- Same scoring categories as golden master evaluation
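If each pairwise verdict is reduced to a winner label, the win counts shown in the report can be tallied as below; the verdict structure here is assumed for illustration, and the example values mirror the sample report above.
# Illustrative tally of pairwise wins; the verdict structure is an assumption
from collections import Counter

verdicts = [
    {"pair": ("base", "small"), "winner": "small"},
    {"pair": ("base", "medium"), "winner": "medium"},
    {"pair": ("small", "medium"), "winner": "medium"},
]

wins = Counter(v["winner"] for v in verdicts)
for model, count in wins.most_common():
    print(f"{model} | Wins: {count}")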
You can skip specific evaluation types to save time or focus on particular metrics:
- hallucination: Hallucination detection analysis
- pairwise: Pairwise transcript comparisons
- golden: Golden master comparisons
# Skip hallucination detection only
python llm_judge.py --skip hallucination
# Skip multiple evaluation types
python llm_judge.py --skip hallucination pairwise
# Run only golden master comparisons
python llm_judge.py --skip hallucination pairwise
Command-line options:
- --skip/-s: Skip specific evaluation types
- --list-evaluations/-l: Show available evaluation types
- --model/-m: Change the OpenAI model (default: gpt-4o)
- --help/-h: Show help message
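These flags map naturally onto argparse; the sketch below shows how they might be declared. Only the flag names and defaults come from the help text above; the rest is illustrative rather than the actual implementation.
# Illustrative argparse wiring for the flags listed above
import argparse

parser = argparse.ArgumentParser(description="LLM-based transcript evaluation")
parser.add_argument("--skip", "-s", nargs="+", default=[],
                    choices=["hallucination", "pairwise", "golden"],
                    help="Skip specific evaluation types")
parser.add_argument("--list-evaluations", "-l", action="store_true",
                    help="Show available evaluation types")
parser.add_argument("--model", "-m", default="gpt-4o",
                    help="OpenAI model to use")
args = parser.parse_args()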
Edit files in the prompts/ directory to customize evaluation criteria:
- hallucination_detection.txt: Adjust hallucination detection patterns
- pairwise_comparison.txt: Modify comparison criteria
- golden_master_comparison.txt: Change golden reference evaluation
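One way the templates could be loaded and filled at runtime is sketched below; the placeholder names are hypothetical, so check the actual prompt files before relying on this.
# Hypothetical prompt loading; the real templates may use a different placeholder style
from pathlib import Path

def load_prompt(name: str, **values: str) -> str:
    template = (Path("prompts") / f"{name}.txt").read_text()
    return template.format(**values)

# e.g. if golden_master_comparison.txt contained {golden} and {candidate} placeholders
prompt = load_prompt(
    "golden_master_comparison",
    golden=Path("inputs/generated_golden_copy.txt").read_text(),
    candidate=Path("inputs/john_whisper_base_vad.txt").read_text(),
)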
Edit llm_judge.py to use a different OpenAI model:
self.model = "gpt-4o"  # Change to your preferred model
Results are saved as JSON in outputs/ with detailed breakdowns:
{
"timestamp": "2025-01-15T10:30:45.123456",
"model_used": "gpt-4o",
"transcript_files": ["speaker_base_vad.txt", "speaker_small_vad.txt"],
"hallucination_analysis": {...},
"pairwise_comparisons": {...},
"golden_master_comparisons": {...},
"summary": {...}
}
Common issues:
- API Key Error: Ensure your OpenAI API key is correctly set in .env
- File Not Found: Check that transcript files follow the naming convention
- JSON Parse Error: LLM responses occasionally need manual review
- Rate Limits: Add delays between API calls if hitting OpenAI rate limits
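For rate limits, a simple retry with exponential backoff around each API call is usually enough; this is a sketch, not part of llm_judge.py.
# Illustrative retry wrapper for rate-limited OpenAI calls
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_retry(messages, model="gpt-4o", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    raise RuntimeError("Exceeded retry budget for OpenAI API call")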
Transcript files must follow the pattern: <identifier>_<model>_vad.txt
Examples:
- ✅ john_whisper_base_vad.txt
- ✅ episode1_openai_large_vad.txt
- ❌ transcript.txt
- ❌ base_model.txt
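A small check like the one below can flag misnamed files before an evaluation run; the regex is illustrative and simply requires at least two underscore-separated parts before _vad.txt.
# Illustrative filename check for the <identifier>_<model>_vad.txt pattern
import re

VALID = re.compile(r"^[^_]+(_[^_]+)+_vad\.txt$")

for name in ["john_whisper_base_vad.txt", "episode1_openai_large_vad.txt",
             "transcript.txt", "base_model.txt"]:
    status = "OK" if VALID.match(name) else "does not match the expected pattern"
    print(f"{name}: {status}")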
This project is part of the DND Podcast Transcriber repository.