
LLM Judge for Transcript Quality Evaluation

An automated evaluation system that uses an OpenAI model (gpt-4o by default) as a judge to assess the quality of audio transcription outputs through pairwise comparisons, golden master evaluation, and hallucination detection.

🎯 Features

  • Hallucination Detection: Identifies repetitive patterns and invented content
  • Pairwise Comparisons: Head-to-head evaluation of different transcript models
  • Golden Master Evaluation: Compares transcripts against a reference standard
  • Comprehensive Scoring: Rates faithfulness, readability, proper nouns, and speaker consistency
  • Automated Reports: Console output and detailed JSON results

📁 Directory Structure

qual-evals/
├── inputs/                     # Input transcript files
│   ├── *_*_vad.txt            # Transcript files (format: speaker_model_vad.txt)
│   └── generated_golden_copy.txt  # Golden reference transcript
├── prompts/                   # LLM judge prompts
│   ├── hallucination_detection.txt
│   ├── pairwise_comparison.txt
│   └── golden_master_comparison.txt
├── outputs/                   # Generated evaluation reports
├── llm_judge.py              # Main evaluation script
└── .env                      # OpenAI API key configuration

🚀 Quick Start

1. Prerequisites

  • Python 3.11+
  • OpenAI API key
  • UV package manager (or pip)

2. Installation

cd experiments/qual-evals

# Install dependencies with uv
uv add openai python-dotenv

# Or with pip
pip install openai python-dotenv

3. Configuration

Set up your OpenAI API key in .env:

echo "OPENAI_API_KEY=your_api_key_here" > .env

4. Prepare Input Files

Place your transcript files in the inputs/ folder:

  • Transcript files: Named as <speaker>_<model>_vad.txt (e.g., john_whisper_base_vad.txt)
  • Golden reference: Named generated_golden_copy.txt
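
If you want to check a filename programmatically, a small helper like the sketch below can split it into its parts (illustrative only; parse_transcript_name is not part of llm_judge.py):

from pathlib import Path

def parse_transcript_name(path: str) -> tuple[str, str]:
    """Split '<speaker>_<model>_vad.txt' into (speaker, model)."""
    stem = Path(path).stem                       # e.g. "john_whisper_base_vad"
    if not stem.endswith("_vad"):
        raise ValueError(f"unexpected transcript name: {path}")
    speaker, _, model = stem[: -len("_vad")].partition("_")
    return speaker, model

print(parse_transcript_name("inputs/john_whisper_base_vad.txt"))  # ('john', 'whisper_base')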

5. Run Evaluation

# Run all evaluations
python llm_judge.py

# Skip specific evaluation types
python llm_judge.py --skip hallucination pairwise

# List available evaluation types
python llm_judge.py --list-evaluations

# Use a different model
python llm_judge.py --model gpt-4o-mini

📊 Sample Output

🔍 Evaluating 3 transcripts...

📊 Running hallucination detection...
  Analyzing base...
  Analyzing small...
  Analyzing medium...

⚖️  Running pairwise comparisons...
  Comparing base vs small...
  Comparing base vs medium...
  Comparing small vs medium...

🏆 Running golden master comparisons...
  Comparing base to golden master...
  Comparing small to golden master...
  Comparing medium to golden master...

============================================================
🎯 TRANSCRIPT EVALUATION REPORT
============================================================

📅 Generated: 2025-01-15T10:30:45.123456
🤖 LLM Model: gpt-4o
📄 Files Evaluated: 3

👻 HALLUCINATION ANALYSIS
------------------------------
  small           | Score: 3/10
  medium          | Score: 2/10
  base            | Score: 4/10

🏆 GOLDEN MASTER COMPARISON
------------------------------
  small           | Score: 8/10
  medium          | Score: 9/10
  base            | Score: 7/10

⚖️  PAIRWISE COMPARISON WINS
------------------------------
  medium          | Wins: 2
  small           | Wins: 1

🥇 Best Performer: medium
🥉 Worst Performer: base
============================================================

💾 Full results saved to: outputs/evaluation_results_20250115_103045.json

📈 Evaluation Metrics

Hallucination Detection (1-10 scale)

  • 1: No hallucinations detected
  • 10: Severe hallucinations throughout
  • Identifies repetitive patterns like "No. No. No. No."
  • Detects looping sentences and fabricated content

Golden Master Comparison (1-10 scale)

  • Faithfulness: Content accuracy vs reference
  • Readability: Punctuation and flow quality
  • Proper Nouns: Names and numerical accuracy
  • Speakers: Speaker labeling consistency

Pairwise Comparisons (1-5 scale per category)

  • Direct head-to-head evaluation between transcript pairs
  • Winner determination based on overall quality
  • Same scoring categories as golden master evaluation
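
To illustrate the winner determination step, the sketch below tallies pairwise winners into the "Wins" ranking shown in the sample report (an assumed aggregation, not code from llm_judge.py):

from collections import Counter

# One entry per pairwise comparison: the name of the winning transcript.
pairwise_winners = ["medium", "medium", "small"]

wins = Counter(pairwise_winners)
for model, count in wins.most_common():
    print(f"{model:<15} | Wins: {count}")
# medium          | Wins: 2
# small           | Wins: 1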

⏭️ Selective Evaluation

You can skip specific evaluation types to save time or focus on particular metrics:

Available Evaluation Types

  • hallucination: Hallucination detection analysis
  • pairwise: Pairwise transcript comparisons
  • golden: Golden master comparisons

Skip Examples

# Skip hallucination detection only
python llm_judge.py --skip hallucination

# Skip multiple evaluation types (this runs only the golden master comparisons)
python llm_judge.py --skip hallucination pairwise

CLI Options

  • --skip / -s: Skip specific evaluation types
  • --list-evaluations / -l: Show available evaluation types
  • --model / -m: Change the OpenAI model (default: gpt-4o)
  • --help / -h: Show help message
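
For reference, here is a minimal argparse sketch that matches these options (a reconstruction for illustration; the real llm_judge.py may wire them differently):

import argparse

EVALUATION_TYPES = ["hallucination", "pairwise", "golden"]

parser = argparse.ArgumentParser(description="LLM judge for transcript quality")
parser.add_argument("--skip", "-s", nargs="*", choices=EVALUATION_TYPES, default=[],
                    help="skip specific evaluation types")
parser.add_argument("--list-evaluations", "-l", action="store_true",
                    help="show available evaluation types")
parser.add_argument("--model", "-m", default="gpt-4o",
                    help="OpenAI model to use")
args = parser.parse_args()  # --help / -h is added automatically by argparse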

🔧 Customization

Modify Evaluation Prompts

Edit files in the prompts/ directory to customize evaluation criteria:

  • hallucination_detection.txt: Adjust hallucination detection patterns
  • pairwise_comparison.txt: Modify comparison criteria
  • golden_master_comparison.txt: Change golden reference evaluation
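
The prompts are plain text files, so customizing them does not require touching the script. Loading probably looks something like the sketch below (an assumption, including how the transcript text is attached to the prompt):

from pathlib import Path

def build_prompt(template_name: str, transcript: str) -> str:
    template = Path("prompts", template_name).read_text(encoding="utf-8")
    # How the transcript is spliced into the prompt is an assumption;
    # here it is simply appended after the template text.
    return f"{template}\n\nTranscript:\n{transcript}"

prompt = build_prompt("hallucination_detection.txt", "Example transcript text...")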

Change LLM Model

The --model flag (see CLI Options above) switches models for a single run; to change the default, edit llm_judge.py:

self.model = "gpt-4o"  # Change to your preferred model

Output Format

Results are saved as JSON in outputs/ with detailed breakdowns:

{
  "timestamp": "2025-01-15T10:30:45.123456",
  "model_used": "gpt-4o",
  "transcript_files": ["speaker_base_vad.txt", "speaker_small_vad.txt"],
  "hallucination_analysis": {...},
  "pairwise_comparisons": {...},
  "golden_master_comparisons": {...},
  "summary": {...}
}
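
The top-level keys shown above can be read back with a few lines of Python, for example to inspect the summary of the most recent run (a usage sketch, not part of the tool):

import json
from pathlib import Path

# Timestamped filenames sort chronologically, so max() picks the latest run.
latest = max(Path("outputs").glob("evaluation_results_*.json"))

with latest.open() as f:
    results = json.load(f)

print(results["model_used"])
print(results["transcript_files"])
print(results["summary"])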

🔍 Troubleshooting

Common Issues

  1. API Key Error: Ensure your OpenAI API key is correctly set in .env
  2. File Not Found: Check that transcript files follow the naming convention
  3. JSON Parse Error: LLM responses occasionally need manual review
  4. Rate Limits: Add delays between API calls if hitting OpenAI rate limits
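
For the rate-limit case, wrapping each API call in a simple exponential backoff is usually enough; a hedged sketch using the openai v1 client (not part of llm_judge.py):

import time

import openai

def call_with_backoff(make_request, max_retries: int = 5):
    """Retry an OpenAI call with exponential backoff on rate-limit errors."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except openai.RateLimitError:
            wait = 2 ** attempt  # 1s, 2s, 4s, ...
            print(f"Rate limited; retrying in {wait}s...")
            time.sleep(wait)
    raise RuntimeError("OpenAI API still rate limited after retries")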

File Naming Convention

Transcript files must follow the pattern: <identifier>_<model>_vad.txt

Examples:

  • john_whisper_base_vad.txt (matches the pattern)
  • episode1_openai_large_vad.txt (matches the pattern)
  • transcript.txt (does not match the pattern)
  • base_model.txt (does not match the pattern)

📄 License

This project is part of the DND Podcast Transcriber repository.
