Open LLM Leaderboard MMLU Benchmark Inconsistencies
AI Impact Summary
The Open LLM Leaderboard’s MMLU benchmark results are inconsistent because multiple evaluation implementations are in use. The original implementation comes from the benchmark’s authors at UC Berkeley, while Stanford’s CRFM HELM and the EleutherAI LM Evaluation Harness use different codebases; the same models receive different scores under each, producing ranking discrepancies. This highlights the importance of understanding the underlying evaluation methodology when interpreting benchmark results and comparing model capabilities.
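To make the discrepancy concrete, the sketch below contrasts three answer-scoring strategies that roughly correspond in spirit to the original MMLU code, older versions of the EleutherAI harness, and HELM. This is a minimal illustration, not any harness’s actual code: the `loglikelihood` and `generate` callables are hypothetical stand-ins for a model’s scoring and generation APIs.

```python
# Minimal sketch of why MMLU scores diverge across evaluation harnesses.
# `loglikelihood` and `generate` are hypothetical stand-ins for a model API;
# the three scorers mirror (in spirit) the original MMLU code, older
# LM Evaluation Harness behavior, and HELM's generation-based evaluation.

from typing import Callable, Sequence

# (prompt, continuation) -> log-probability of the continuation
LogLik = Callable[[str, str], float]


def score_letter_tokens(loglikelihood: LogLik, prompt: str,
                        choices: Sequence[str]) -> int:
    """Original-style: compare log-probs of the answer letters only."""
    letters = ["A", "B", "C", "D"][: len(choices)]
    scores = [loglikelihood(prompt, f" {letter}") for letter in letters]
    return max(range(len(scores)), key=scores.__getitem__)


def score_full_answers(loglikelihood: LogLik, prompt: str,
                       choices: Sequence[str]) -> int:
    """Harness-style (older versions): compare log-probs of the full
    answer texts, typically with some length normalization."""
    scores = [loglikelihood(prompt, f" {choice}") / max(len(choice), 1)
              for choice in choices]
    return max(range(len(scores)), key=scores.__getitem__)


def score_generated_letter(generate: Callable[[str], str], prompt: str,
                           choices: Sequence[str]) -> int:
    """HELM-style: generate free text and read off the first answer letter."""
    letters = ["A", "B", "C", "D"][: len(choices)]
    output = generate(prompt).strip()
    for i, letter in enumerate(letters):
        if output.startswith(letter):
            return i
    return -1  # unparseable output counts as incorrect
```

Because the letter-token, full-answer, and generation-based scorers can rank the same choices differently, an identical model can earn noticeably different MMLU accuracies under each methodology.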
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info