Gemma Benchmarking Framework #237
JRV7903 started this conversation in Show and tell
Hello Gemma Project Team and Open-Source Community! I'm excited to share my implementation of a benchmarking framework for Gemma models, my initial take on the Benchmark Gemma Models project for GSoC 2025. It is designed to make evaluating and comparing different language models on the MMLU dataset straightforward.
Overview:
This framework lets users benchmark Gemma models (currently Gemma 2B and 7B against Mistral, with room for other models) on the MMLU dataset, and provides detailed performance analysis and visualization of the results across categories.
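To make the per-category comparison concrete, below is a minimal plotting sketch. The function name, results layout, and use of matplotlib are my assumptions for illustration, not the framework's actual visualization module.

```python
# Sketch of a grouped bar chart comparing per-category accuracy across models.
import matplotlib.pyplot as plt
import numpy as np

def plot_category_accuracy(results: dict[str, dict[str, float]],
                           out_path: str = "mmlu_comparison.png") -> None:
    """results maps model name -> {category name -> accuracy} (assumed layout)."""
    categories = list(next(iter(results.values())).keys())
    x = np.arange(len(categories))
    width = 0.8 / len(results)  # fit all models side by side within each category slot

    fig, ax = plt.subplots(figsize=(8, 4))
    for i, (model, scores) in enumerate(results.items()):
        ax.bar(x + i * width, [scores[c] for c in categories], width, label=model)

    ax.set_xticks(x + width * (len(results) - 1) / 2)
    ax.set_xticklabels(categories, rotation=30, ha="right")
    ax.set_ylabel("Accuracy")
    ax.set_title("MMLU accuracy by category")
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_path)
```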
Key Components:
• Model Configuration: Models are loaded dynamically from JSON files, which makes them easy to update and extend (see the loading sketch after this list).
• Data Loader: Loads the MMLU dataset efficiently, with support for category-based and customizable sampling.
• Benchmark Runner: Executes the benchmarks, collects the results, and saves them for analysis.
• Evaluation Metrics: Calculates accuracy and F1 score to evaluate model performance.
• Visualization Support: Tools for visualizing benchmark results, making it easier to interpret and compare the performance of different models.
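To illustrate the dynamic model loading, here is a minimal sketch assuming a hypothetical models.json layout with name and hf_id fields and a Hugging Face transformers backend; the repository's actual schema and loading path may differ.

```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_models(config_path: str = "models.json") -> dict:
    # Each entry is assumed to look like {"name": "gemma-2b", "hf_id": "google/gemma-2b"}.
    with open(config_path) as f:
        configs = json.load(f)

    models = {}
    for cfg in configs:
        tokenizer = AutoTokenizer.from_pretrained(cfg["hf_id"])
        model = AutoModelForCausalLM.from_pretrained(cfg["hf_id"], device_map="auto")
        models[cfg["name"]] = (model, tokenizer)
    return models
```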
Implementation Details:
• Configuration-driven usage: The framework uses YAML and JSON files to tune the benchmarking without changing the code (a minimal configuration sketch follows this list).
• Reproducibility: Seeds are set so results can be reproduced across runs.
• Error Handling: Benchmarking continues even if an individual model fails.
• Batch Processing: The framework supports batched inference for efficient model evaluation.
• Flexible Output: CSV output with timestamped results for tracking runs over time.
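Below is a minimal sketch tying several of these together: YAML-driven settings, seeding for reproducibility, per-model error handling, and a timestamped CSV. The benchmark.yaml keys and the evaluate() stub are hypothetical placeholders, not the framework's actual interface.

```python
import csv
import random
from datetime import datetime

import numpy as np
import torch
import yaml

def set_seed(seed: int) -> None:
    # Seed all RNGs used during inference so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def evaluate(model_name: str, cfg: dict) -> float:
    """Stand-in for the real benchmark runner; would return an accuracy score."""
    raise NotImplementedError

def run(config_path: str = "benchmark.yaml") -> None:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)  # assumed keys: seed, batch_size, categories, models
    set_seed(cfg.get("seed", 42))

    rows = []
    for model_name in cfg["models"]:
        try:
            rows.append({"model": model_name, "accuracy": evaluate(model_name, cfg)})
        except Exception as exc:
            # Keep benchmarking even if one model fails to load or run.
            print(f"Skipping {model_name}: {exc}")

    out_path = f"results_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "accuracy"])
        writer.writeheader()
        writer.writerows(rows)
```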
What this code does:
The code will:
1. Load and prepare MMLU dataset examples with configurable few-shot prompting (prompt construction is sketched after this list)
2. Initialize models according to specifications in models.json
3. Run inference on each model across all configured MMLU categories and subjects
4. Extract predicted answers from model responses
5. Calculate performance metrics (accuracy, F1 score, and potentially others)
6. Generate comprehensive reports including raw results, metrics, and a human-readable summary
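To sketch steps 1, 4, and 5, here is how few-shot prompt construction, answer-letter extraction, and accuracy scoring might look. The prompt template and the question/choices/answer field names follow the common Hugging Face MMLU layout and are assumptions, not necessarily the exact format used in the repository.

```python
import re

CHOICES = ["A", "B", "C", "D"]

def format_question(example: dict, include_answer: bool = False) -> str:
    # example is assumed to have "question", "choices" (list of 4), and "answer" (index).
    lines = [example["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, example["choices"])]
    lines.append("Answer:" + (f" {CHOICES[example['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(test_example: dict, few_shot: list[dict]) -> str:
    # Prepend solved examples, then the unanswered test question.
    shots = [format_question(ex, include_answer=True) for ex in few_shot]
    return "\n\n".join(shots + [format_question(test_example)])

def extract_answer(completion: str) -> str | None:
    # Take the first standalone A/B/C/D in the model's completion, if any.
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

def accuracy(predictions: list[str | None], references: list[str]) -> float:
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0
```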
Repository: https://github.com/JRV7903/gemma-benchmarking
I welcome your feedback on this implementation and would love to discuss potential improvements and future developments.
Looking forward to working on this project this summer!