An automated evaluation system that uses an OpenAI model (gpt-4o by default) as an LLM judge to assess the quality of audio transcription outputs through pairwise comparisons, golden master evaluation, and hallucination detection.
- Hallucination Detection: Identifies repetitive patterns and invented content
- Pairwise Comparisons: Head-to-head evaluation of different transcript models
- Golden Master Evaluation: Compares transcripts against a reference standard
- Comprehensive Scoring: Rates faithfulness, readability, proper nouns, and speaker consistency
- Automated Reports: Console output and detailed JSON results
qual-evals/
├── inputs/ # Input transcript files
│ ├── *_*_vad.txt # Transcript files (format: speaker_model_vad.txt)
│ └── generated_golden_copy.txt # Golden reference transcript
├── prompts/ # LLM judge prompts
│ ├── hallucination_detection.txt
│ ├── pairwise_comparison.txt
│ └── golden_master_comparison.txt
├── outputs/ # Generated evaluation reports
├── llm_judge.py # Main evaluation script
└── .env # OpenAI API key configuration
- Python 3.11+
- OpenAI API key
- UV package manager (or pip)
cd experiments/qual-evals
# Install dependencies with uv
uv add openai python-dotenv
# Or with pip
pip install openai python-dotenv
Set up your OpenAI API key in .env:
echo "OPENAI_API_KEY=your_api_key_here" > .env
Place your transcript files in the inputs/ folder:
- Transcript files: Named <speaker>_<model>_vad.txt (e.g., john_whisper_base_vad.txt)
- Golden reference: Named generated_golden_copy.txt
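For reference, here is a minimal sketch of how the inputs/ folder could be scanned for these files; the variable names and discovery logic are illustrative, not taken from llm_judge.py.
# Illustrative only: discover transcripts and the golden reference in inputs/
from pathlib import Path

input_dir = Path("inputs")
golden_path = input_dir / "generated_golden_copy.txt"
transcript_paths = sorted(input_dir.glob("*_vad.txt"))  # the golden copy does not match this pattern

for path in transcript_paths:
    print(f"Found transcript: {path.name}")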
# Run all evaluations
python llm_judge.py
# Skip specific evaluation types
python llm_judge.py --skip hallucination pairwise
# List available evaluation types
python llm_judge.py --list-evaluations
# Use a different model
python llm_judge.py --model gpt-4o-mini
Example console output:
🔍 Evaluating 3 transcripts...
📊 Running hallucination detection...
Analyzing base...
Analyzing small...
Analyzing medium...
⚖️ Running pairwise comparisons...
Comparing base vs small...
Comparing base vs medium...
Comparing small vs medium...
🏆 Running golden master comparisons...
Comparing base to golden master...
Comparing small to golden master...
Comparing medium to golden master...
============================================================
🎯 TRANSCRIPT EVALUATION REPORT
============================================================
📅 Generated: 2025-01-15T10:30:45.123456
🤖 LLM Model: gpt-4o
📄 Files Evaluated: 3
👻 HALLUCINATION ANALYSIS
------------------------------
small | Score: 3/10
medium | Score: 2/10
base | Score: 4/10
🏆 GOLDEN MASTER COMPARISON
------------------------------
small | Score: 8/10
medium | Score: 9/10
base | Score: 7/10
⚖️ PAIRWISE COMPARISON WINS
------------------------------
medium | Wins: 2
small | Wins: 1
🥇 Best Performer: medium
🥉 Worst Performer: base
============================================================
💾 Full results saved to: outputs/evaluation_results_20250115_103045.json
Hallucination scores use a 1-10 scale:
- 1: No hallucinations detected
- 10: Severe hallucinations throughout
- Identifies repetitive patterns like "No. No. No. No."
- Detects looping sentences and fabricated content
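As a rough illustration, a hallucination check can be a single chat completion built from the prompt in prompts/hallucination_detection.txt; the prompt/transcript concatenation below is an assumption, not the exact code in llm_judge.py.
# Illustrative sketch; the real prompt format in llm_judge.py may differ
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment (.env)

prompt = Path("prompts/hallucination_detection.txt").read_text()
transcript = Path("inputs/john_whisper_base_vad.txt").read_text()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": f"{prompt}\n\nTranscript:\n{transcript}"}],
)
print(response.choices[0].message.content)  # expected to include a 1-10 score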
Golden master comparisons score each transcript against the reference in four categories:
- Faithfulness: Content accuracy vs reference
- Readability: Punctuation and flow quality
- Proper Nouns: Names and numerical accuracy
- Speakers: Speaker labeling consistency
Pairwise comparisons work as follows:
- Direct head-to-head evaluation between transcript pairs
- Winner determination based on overall quality
- Same scoring categories as golden master evaluation
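If each pairwise verdict is reduced to a winner label, the win counts shown in the report can be tallied as below; the verdict structure here is assumed for illustration, and the example values mirror the sample report above.
# Illustrative tally of pairwise wins; the verdict structure is an assumption
from collections import Counter

verdicts = [
    {"pair": ("base", "small"), "winner": "small"},
    {"pair": ("base", "medium"), "winner": "medium"},
    {"pair": ("small", "medium"), "winner": "medium"},
]

wins = Counter(v["winner"] for v in verdicts)
for model, count in wins.most_common():
    print(f"{model} | Wins: {count}")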
You can skip specific evaluation types to save time or focus on particular metrics:
- hallucination: Hallucination detection analysis
- pairwise: Pairwise transcript comparisons
- golden: Golden master comparisons
# Skip hallucination detection only
python llm_judge.py --skip hallucination
# Skip multiple evaluation types
python llm_judge.py --skip hallucination pairwise
# Run only golden master comparisons
python llm_judge.py --skip hallucination pairwise
Command-line options:
- --skip/-s: Skip specific evaluation types
- --list-evaluations/-l: Show available evaluation types
- --model/-m: Change the OpenAI model (default: gpt-4o)
- --help/-h: Show help message
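These flags map naturally onto argparse; the sketch below shows how they might be declared. Only the flag names and defaults come from the help text above; the rest is illustrative rather than the actual implementation.
# Illustrative argparse wiring for the flags listed above
import argparse

parser = argparse.ArgumentParser(description="LLM-based transcript evaluation")
parser.add_argument("--skip", "-s", nargs="+", default=[],
                    choices=["hallucination", "pairwise", "golden"],
                    help="Skip specific evaluation types")
parser.add_argument("--list-evaluations", "-l", action="store_true",
                    help="Show available evaluation types")
parser.add_argument("--model", "-m", default="gpt-4o",
                    help="OpenAI model to use")
args = parser.parse_args()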
Edit files in the prompts/ directory to customize evaluation criteria:
- hallucination_detection.txt: Adjust hallucination detection patterns
- pairwise_comparison.txt: Modify comparison criteria
- golden_master_comparison.txt: Change golden reference evaluation
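One way the templates could be loaded and filled at runtime is sketched below; the placeholder names are hypothetical, so check the actual prompt files before relying on this.
# Hypothetical prompt loading; the real templates may use a different placeholder style
from pathlib import Path

def load_prompt(name: str, **values: str) -> str:
    template = (Path("prompts") / f"{name}.txt").read_text()
    return template.format(**values)

# e.g. if golden_master_comparison.txt contained {golden} and {candidate} placeholders
prompt = load_prompt(
    "golden_master_comparison",
    golden=Path("inputs/generated_golden_copy.txt").read_text(),
    candidate=Path("inputs/john_whisper_base_vad.txt").read_text(),
)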
Edit llm_judge.py to use a different OpenAI model:
self.model = "gpt-4o"  # Change to your preferred model
Results are saved as JSON in outputs/ with detailed breakdowns:
{
"timestamp": "2025-01-15T10:30:45.123456",
"model_used": "gpt-4o",
"transcript_files": ["speaker_base_vad.txt", "speaker_small_vad.txt"],
"hallucination_analysis": {...},
"pairwise_comparisons": {...},
"golden_master_comparisons": {...},
"summary": {...}
}
Common issues:
- API Key Error: Ensure your OpenAI API key is correctly set in .env
- File Not Found: Check that transcript files follow the naming convention
- JSON Parse Error: LLM responses occasionally need manual review
- Rate Limits: Add delays between API calls if hitting OpenAI rate limits
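For rate limits, a simple retry with exponential backoff around each API call is usually enough; this is a sketch, not part of llm_judge.py.
# Illustrative retry wrapper for rate-limited OpenAI calls
import time
from openai import OpenAI, RateLimitError

client = OpenAI()

def chat_with_retry(messages, model="gpt-4o", max_attempts=5):
    for attempt in range(max_attempts):
        try:
            return client.chat.completions.create(model=model, messages=messages)
        except RateLimitError:
            time.sleep(2 ** attempt)  # back off 1s, 2s, 4s, ...
    raise RuntimeError("Exceeded retry budget for OpenAI API call")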
Transcript files must follow the pattern: <identifier>_<model>_vad.txt
Examples:
- ✅ john_whisper_base_vad.txt
- ✅ episode1_openai_large_vad.txt
- ❌ transcript.txt
- ❌ base_model.txt
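A small check like the one below can flag misnamed files before an evaluation run; the regex is illustrative and simply requires at least two underscore-separated parts before _vad.txt.
# Illustrative filename check for the <identifier>_<model>_vad.txt pattern
import re

VALID = re.compile(r"^[^_]+(_[^_]+)+_vad\.txt$")

for name in ["john_whisper_base_vad.txt", "episode1_openai_large_vad.txt",
             "transcript.txt", "base_model.txt"]:
    status = "OK" if VALID.match(name) else "does not match the expected pattern"
    print(f"{name}: {status}")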
This project is part of the DND Podcast Transcriber repository.