Open LLM Leaderboard: MMLU scores vary across Harness, HELM, and Original implementations
AI Impact Summary
The Open LLM Leaderboard compares open models using established evaluation frameworks (EleutherAI's LM Evaluation Harness, Stanford's HELM), but MMLU results for LLaMA vary dramatically across implementations. The original MMLU implementation (Hendrycks et al., UC Berkeley), HELM, and the Eleuther harness each use different prompt formatting and scoring logic, which shifts model predictions and final scores. The result is ranking changes that undermine trust in the leaderboard; teams should treat any single reported score as non-portable unless the harness version and prompt details accompany it. Reliable benchmarking requires a standardized evaluation path with explicit versioning and published prompt templates.
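To make the mechanism concrete, below is a minimal, hypothetical sketch (pure Python, with a deterministic toy stand-in for model log-probabilities) of how the three evaluation paths can disagree on the same MMLU item: the original implementation scores only the bare answer letters, the Eleuther harness (at the time of the dispute) scored the full answer strings, and HELM checks the generated letter under its own prompt format. The `logprob` stub, the example question, and all prompt templates here are illustrative assumptions, not the exact templates used by any framework.

```python
import hashlib

QUESTION = "Which gas makes up most of Earth's atmosphere?"
CHOICES = {"A": "Nitrogen", "B": "Oxygen", "C": "Argon", "D": "Carbon dioxide"}

def logprob(prompt: str, continuation: str) -> float:
    """Hypothetical stand-in for log P(continuation | prompt).
    A real harness would query a language model; this returns a
    deterministic toy value so the sketch runs without one."""
    digest = hashlib.md5((prompt + "|" + continuation).encode()).digest()
    return -(digest[0] % 100) / 10.0

def score_original(question, choices):
    # 'Original' (Hendrycks et al.) style: compare log-probs of the
    # bare answer letters after a prompt ending in "Answer:".
    body = "\n".join(f"{k}. {v}" for k, v in choices.items())
    prompt = f"{question}\n{body}\nAnswer:"
    return max(choices, key=lambda k: logprob(prompt, f" {k}"))

def score_harness(question, choices):
    # Eleuther-harness style (as of the leaderboard dispute): compare
    # log-likelihoods of the full answer strings, not just the letters.
    prompt = f"{question}\nAnswer:"
    return max(choices, key=lambda k: logprob(prompt, f" {choices[k]}"))

def score_helm(question, choices):
    # HELM style: generate and check the emitted letter; sketched here
    # as an argmax over letters under a differently shaped prompt.
    body = "\n".join(f"{k}. {v}" for k, v in choices.items())
    prompt = f"Question: {question}\n{body}\nAnswer:"
    return max(choices, key=lambda k: logprob(prompt, f" {k}"))

for name, fn in [("original", score_original),
                 ("harness", score_harness),
                 ("helm", score_helm)]:
    print(f"{name}: picks {fn(QUESTION, CHOICES)}")
```

Because each path conditions on a different prompt and scores a different continuation, the argmax can land on a different choice for the same model and question; per-item disagreements like this accumulate into the divergent MMLU totals the leaderboard reported.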
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info