Open LLM Leaderboard: MMLU scores vary across Harness, HELM, and Original implementations
AI Impact Summary
The Open LLM Leaderboard compares open models using established evaluation frameworks (EleutherAI's LM Evaluation Harness, Stanford's HELM), but MMLU results for LLaMA vary dramatically across implementations. The original MMLU implementation (Hendrycks et al., UC Berkeley), HELM, and the Eleuther harness each use different prompt formatting and scoring logic, which shifts model predictions and final scores. The result is ranking changes that undermine trust in the leaderboard; teams should treat any single reported score as non-portable unless the harness version and prompt details accompany it. Reliable benchmarking requires a standardized evaluation path with explicit versioning and published prompt templates.
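To make the mechanism concrete, below is a minimal, hypothetical sketch (pure Python, with a deterministic toy stand-in for model log-probabilities) of how the three evaluation paths can disagree on the same MMLU item: the original implementation scores only the bare answer letters, the Eleuther harness (at the time of the dispute) scored the full answer strings, and HELM checks the generated letter under its own prompt format. The `logprob` stub, the example question, and all prompt templates here are illustrative assumptions, not the exact templates used by any framework.

```python
import hashlib

QUESTION = "Which gas makes up most of Earth's atmosphere?"
CHOICES = {"A": "Nitrogen", "B": "Oxygen", "C": "Argon", "D": "Carbon dioxide"}

def logprob(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for log P(continuation | prompt).
    A real harness would query a language model; this returns a
    deterministic toy value so the sketch runs without one."""
    digest = hashlib.md5((prompt + "|" + continuation).encode()).digest()
    return -(digest[0] % 100) / 10.0

def score_original(question, choices):
    # 'Original' (Hendrycks et al.) style: compare log-probs of the
    # bare answer letters after a prompt ending in "Answer:".
    body = "\n".join(f"{k}. {v}" for k, v in choices.items())
    prompt = f"{question}\n{body}\nAnswer:"
    return max(choices, key=lambda k: logprob(prompt, f" {k}"))

def score_harness(question, choices):
    # Eleuther-harness style (as of the leaderboard dispute): compare
    # log-likelihoods of the full answer strings, not just the letters.
    prompt = f"{question}\nAnswer:"
    return max(choices, key=lambda k: logprob(prompt, f" {choices[k]}"))

def score_helm(question, choices):
    # HELM style: generate and check the emitted letter; sketched here
    # as an argmax over letters under a differently shaped prompt.
    body = "\n".join(f"{k}. {v}" for k, v in choices.items())
    prompt = f"Question: {question}\n{body}\nAnswer:"
    return max(choices, key=lambda k: logprob(prompt, f" {k}"))

for name, fn in [("original", score_original),
                 ("harness", score_harness),
                 ("helm", score_helm)]:
    print(f"{name}: picks {fn(QUESTION, CHOICES)}")
```

Because each path conditions on a different prompt and scores a different continuation, the argmax can land on a different choice for the same model and question; per-item disagreements like this accumulate into the divergent MMLU totals the leaderboard reported.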
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info