Open LLM Leaderboard MMLU Benchmark Inconsistencies
AI Impact Summary
The Open LLM Leaderboard’s MMLU benchmark results are inconsistent because multiple evaluation implementations are in use. The original implementation comes from the benchmark’s authors at UC Berkeley, while Stanford’s CRFM HELM and the EleutherAI LM Evaluation Harness use different codebases; the same models receive different scores under each, producing ranking discrepancies. This highlights the importance of understanding the underlying evaluation methodology when interpreting benchmark results and comparing model capabilities.
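To make the discrepancy concrete, the sketch below contrasts three answer-scoring strategies that roughly correspond in spirit to the original MMLU code, older versions of the EleutherAI harness, and HELM. This is a minimal illustration, not any harness’s actual code: the `loglikelihood` and `generate` callables are hypothetical stand-ins for a model’s scoring and generation APIs.

```python
# Minimal sketch of why MMLU scores diverge across evaluation harnesses.
# `loglikelihood` and `generate` are hypothetical stand-ins for a model API;
# the three scorers mirror (in spirit) the original MMLU code, older
# LM Evaluation Harness behavior, and HELM's generation-based evaluation.

from typing import Callable, Sequence

# (prompt, continuation) -> log-probability of the continuation
LogLik = Callable[[str, str], float]


def score_letter_tokens(loglikelihood: LogLik, prompt: str,
                        choices: Sequence[str]) -> int:
    """Original-style: compare log-probs of the answer letters only."""
    letters = ["A", "B", "C", "D"][: len(choices)]
    scores = [loglikelihood(prompt, f" {letter}") for letter in letters]
    return max(range(len(scores)), key=scores.__getitem__)


def score_full_answers(loglikelihood: LogLik, prompt: str,
                       choices: Sequence[str]) -> int:
    """Harness-style (older versions): compare log-probs of the full
    answer texts, typically with some length normalization."""
    scores = [loglikelihood(prompt, f" {choice}") / max(len(choice), 1)
              for choice in choices]
    return max(range(len(scores)), key=scores.__getitem__)


def score_generated_letter(generate: Callable[[str], str], prompt: str,
                           choices: Sequence[str]) -> int:
    """HELM-style: generate free text and read off the first answer letter."""
    letters = ["A", "B", "C", "D"][: len(choices)]
    output = generate(prompt).strip()
    for i, letter in enumerate(letters):
        if output.startswith(letter):
            return i
    return -1  # unparseable output counts as incorrect
```

Because the letter-token, full-answer, and generation-based scorers can rank the same choices differently, an identical model can earn noticeably different MMLU accuracies under each methodology.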
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info