Open LLM Leaderboard: MMLU scores differ across Harness, Original, and HELM implementations
AI Impact Summary
The Open LLM Leaderboard currently aggregates results from several evaluation implementations (EleutherAI LM Evaluation Harness, the original UC Berkeley MMLU implementation, Stanford HELM) to rate open LLMs; this setup reveals that MMLU scores for the same model can diverge substantially depending on which implementation is used. The post details how differences in prompting, answer formatting, and prediction extraction across the three implementations drive these discrepancies, which in turn can flip model rankings. For engineers and product teams, this means leaderboard numbers are not directly comparable across implementations and should be treated as dependent on the chosen evaluation harness. To maintain decision quality, teams should standardize on a single, clearly documented evaluation path and communicate any discrepancies or version changes alongside the leaderboard numbers.
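To make the prediction-extraction point concrete, here is a minimal sketch with purely hypothetical numbers (the values and the strategy labels are illustrative assumptions, not taken from the post or from any of the three implementations' code). It shows how three common extraction strategies, ranking the choice letters by log-likelihood, ranking full answer texts, and ranking length-normalized answer texts, can select three different predictions from the same model outputs for a single MMLU question:

```python
# Hypothetical per-candidate log-likelihoods a model might assign for one
# multiple-choice question. None of these numbers come from a real run.
letter_loglik = {"A": -1.2, "B": -0.9, "C": -1.5, "D": -2.0}       # letter-only continuations ("A".."D")
full_text_loglik = {"A": -7.4, "B": -9.1, "C": -6.8, "D": -11.3}   # full answer strings as continuations
full_text_len = {"A": 6, "B": 8, "C": 5, "D": 10}                  # token counts of each answer string


def argmax(scores: dict) -> str:
    """Return the candidate with the highest score."""
    return max(scores, key=scores.get)


# Strategy 1: pick the most likely choice letter.
pred_letter = argmax(letter_loglik)                                 # -> "B"

# Strategy 2: pick the most likely full answer text
# (longer answers accumulate more negative log-likelihood).
pred_full = argmax(full_text_loglik)                                # -> "C"

# Strategy 3: pick the most likely full answer text after length normalization.
pred_full_norm = argmax(
    {k: v / full_text_len[k] for k, v in full_text_loglik.items()}
)                                                                   # -> "D"

print(pred_letter, pred_full, pred_full_norm)  # three strategies, three different predictions
```

Under these (hypothetical) outputs, only one strategy can match the gold answer, so accuracy on the same model and the same question differs purely because of how the prediction is read off the model.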
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info