Open LLM Leaderboard: MMLU scores differ across Harness, Original, and HELM implementations
AI Impact Summary
The Open LLM Leaderboard currently aggregates results from several evaluation implementations (EleutherAI LM Evaluation Harness, the original UC Berkeley MMLU implementation, Stanford HELM) to rate open LLMs; this setup reveals that MMLU scores for the same model can diverge substantially depending on which implementation is used. The post details how differences in prompting, answer formatting, and prediction extraction across the three implementations drive these discrepancies, which in turn can flip model rankings. For engineers and product teams, this means leaderboard numbers are not directly comparable across implementations and should be treated as dependent on the chosen evaluation harness. To maintain decision quality, teams should standardize on a single, clearly documented evaluation path and communicate any discrepancies or version changes alongside the leaderboard numbers.
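To make the prediction-extraction point concrete, here is a minimal sketch with purely hypothetical numbers (the values and the strategy labels are illustrative assumptions, not taken from the post or from any of the three implementations' code). It shows how three common extraction strategies, ranking the choice letters by log-likelihood, ranking full answer texts, and ranking length-normalized answer texts, can select three different predictions from the same model outputs for a single MMLU question:

```python
# Hypothetical per-candidate log-likelihoods a model might assign for one
# multiple-choice question. None of these numbers come from a real run.
letter_loglik = {"A": -1.2, "B": -0.9, "C": -1.5, "D": -2.0}       # letter-only continuations ("A".."D")
full_text_loglik = {"A": -7.4, "B": -9.1, "C": -6.8, "D": -11.3}   # full answer strings as continuations
full_text_len = {"A": 6, "B": 8, "C": 5, "D": 10}                  # token counts of each answer string


def argmax(scores: dict) -> str:
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)


# Strategy 1: pick the most likely choice letter.
pred_letter = argmax(letter_loglik)                                 # -> "B"

# Strategy 2: pick the most likely full answer text
# (longer answers accumulate more negative log-likelihood).
pred_full = argmax(full_text_loglik)                                # -> "C"

# Strategy 3: pick the most likely full answer text after length normalization.
pred_full_norm = argmax(
    {k: v / full_text_len[k] for k, v in full_text_loglik.items()}
)                                                                   # -> "D"

print(pred_letter, pred_full, pred_full_norm)  # three strategies, three different predictions
```

Under these (hypothetical) outputs, only one strategy can match the gold answer, so accuracy on the same model and the same question differs purely because of how the prediction is read off the model.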
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info