Gemma Benchmarking Framework #237
JRV7903 started this conversation in Show and tell
Hello Gemma Project Team and Open-Source Community! I'm excited to share my implementation of a benchmarking framework for Gemma models, my initial take on the Benchmark Gemma Models project for GSoC 2025. It is designed to make evaluating and comparing different language models on the MMLU dataset straightforward.
Overview:
This framework lets users benchmark Gemma models (currently Gemma 2B and 7B against Mistral, with room for other models) on the MMLU dataset, and provides detailed performance analysis and visualization of the results across categories.
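To make the per-category comparison concrete, below is a minimal plotting sketch. The function name, results layout, and use of matplotlib are my assumptions for illustration, not the framework's actual visualization module.

```python
# Sketch of a grouped bar chart comparing per-category accuracy across models.
import matplotlib.pyplot as plt
import numpy as np

def plot_category_accuracy(results: dict[str, dict[str, float]],
                           out_path: str = "mmlu_comparison.png") -> None:
    """results maps model name -> {category name -> accuracy} (assumed layout)."""
    categories = list(next(iter(results.values())).keys())
    x = np.arange(len(categories))
    width = 0.8 / len(results)  # fit all models side by side within each category slot

    fig, ax = plt.subplots(figsize=(8, 4))
    for i, (model, scores) in enumerate(results.items()):
        ax.bar(x + i * width, [scores[c] for c in categories], width, label=model)

    ax.set_xticks(x + width * (len(results) - 1) / 2)
    ax.set_xticklabels(categories, rotation=30, ha="right")
    ax.set_ylabel("Accuracy")
    ax.set_title("MMLU accuracy by category")
    ax.legend()
    fig.tight_layout()
    fig.savefig(out_path)
```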
Key Components:
• Model Configuration: Models are loaded dynamically from JSON files, which makes them easy to update and extend (see the loading sketch after this list).
• Data Loader: Loads the MMLU dataset efficiently, with support for category-based and customizable sampling.
• Benchmark Runner: Executes the benchmarks, collects the results, and saves them for analysis.
• Evaluation Metrics: Calculates accuracy and F1 score to evaluate model performance.
• Visualization Support: Tools for visualizing benchmark results, making it easier to interpret and compare the performance of different models.
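To illustrate the dynamic model loading, here is a minimal sketch assuming a hypothetical models.json layout with name and hf_id fields and a Hugging Face transformers backend; the repository's actual schema and loading path may differ.

```python
import json

from transformers import AutoModelForCausalLM, AutoTokenizer

def load_models(config_path: str = "models.json") -> dict:
    # Each entry is assumed to look like {"name": "gemma-2b", "hf_id": "google/gemma-2b"}.
    with open(config_path) as f:
        configs = json.load(f)

    models = {}
    for cfg in configs:
        tokenizer = AutoTokenizer.from_pretrained(cfg["hf_id"])
        model = AutoModelForCausalLM.from_pretrained(cfg["hf_id"], device_map="auto")
        models[cfg["name"]] = (model, tokenizer)
    return models
```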
Implementation Details:
• Configuration-driven usage: The framework uses YAML and JSON files to tune the benchmarking without changing the code (a minimal configuration sketch follows this list).
• Reproducibility: Seeds are set so results can be reproduced across runs.
• Error Handling: Benchmarking continues even if an individual model fails.
• Batch Processing: The framework supports batched inference for efficient model evaluation.
• Flexible Output: CSV output with timestamped results for tracking runs over time.
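Below is a minimal sketch tying several of these together: YAML-driven settings, seeding for reproducibility, per-model error handling, and a timestamped CSV. The benchmark.yaml keys and the evaluate() stub are hypothetical placeholders, not the framework's actual interface.

```python
import csv
import random
from datetime import datetime

import numpy as np
import torch
import yaml

def set_seed(seed: int) -> None:
    # Seed all RNGs used during inference so runs are reproducible.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)

def evaluate(model_name: str, cfg: dict) -> float:
    """Stand-in for the real benchmark runner; would return an accuracy score."""
    raise NotImplementedError

def run(config_path: str = "benchmark.yaml") -> None:
    with open(config_path) as f:
        cfg = yaml.safe_load(f)  # assumed keys: seed, batch_size, categories, models
    set_seed(cfg.get("seed", 42))

    rows = []
    for model_name in cfg["models"]:
        try:
            rows.append({"model": model_name, "accuracy": evaluate(model_name, cfg)})
        except Exception as exc:
            # Keep benchmarking even if one model fails to load or run.
            print(f"Skipping {model_name}: {exc}")

    out_path = f"results_{datetime.now():%Y%m%d_%H%M%S}.csv"
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["model", "accuracy"])
        writer.writeheader()
        writer.writerows(rows)
```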
What this code does:
The code will:
1. Load and prepare MMLU dataset examples with configurable few-shot prompting (prompt construction is sketched after this list)
2. Initialize models according to specifications in models.json
3. Run inference on each model across all configured MMLU categories and subjects
4. Extract predicted answers from model responses
5. Calculate performance metrics (accuracy, F1 score, and potentially others)
6. Generate comprehensive reports including raw results, metrics, and a human-readable summary
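To sketch steps 1, 4, and 5, here is how few-shot prompt construction, answer-letter extraction, and accuracy scoring might look. The prompt template and the question/choices/answer field names follow the common Hugging Face MMLU layout and are assumptions, not necessarily the exact format used in the repository.

```python
import re

CHOICES = ["A", "B", "C", "D"]

def format_question(example: dict, include_answer: bool = False) -> str:
    # example is assumed to have "question", "choices" (list of 4), and "answer" (index).
    lines = [example["question"]]
    lines += [f"{letter}. {text}" for letter, text in zip(CHOICES, example["choices"])]
    lines.append("Answer:" + (f" {CHOICES[example['answer']]}" if include_answer else ""))
    return "\n".join(lines)

def build_prompt(test_example: dict, few_shot: list[dict]) -> str:
    # Prepend solved examples, then the unanswered test question.
    shots = [format_question(ex, include_answer=True) for ex in few_shot]
    return "\n\n".join(shots + [format_question(test_example)])

def extract_answer(completion: str) -> str | None:
    # Take the first standalone A/B/C/D in the model's completion, if any.
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

def accuracy(predictions: list[str | None], references: list[str]) -> float:
    correct = sum(p == r for p, r in zip(predictions, references))
    return correct / len(references) if references else 0.0
```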
Repository: https://github.com/JRV7903/gemma-benchmarking
I welcome your feedback on this implementation and would love to discuss potential improvements and future developments.
Looking forward to working on this project this summer!