NPHardEval Leaderboard benchmarks LLM reasoning with NP-hard problems, updated monthly
AI Impact Summary
NPHardEval introduces a dynamic, monthly-updated benchmark that targets LLM reasoning on problems up to and including NP-hard complexity, presenting 900 algorithmic questions spanning nine tasks across three complexity classes (P, NP-complete, and NP-hard). The evaluation relies on automated question generation and two metrics, Weighted Accuracy and Failure Rate, grounded in the computational complexity hierarchy, enabling quantitative comparisons beyond standard QA benchmarks. Early results show GPT-4 Turbo achieving the strongest overall performance, Claude 2 excelling on NP-complete tasks, and several open-source models (Yi-34b, Qwen-14b, Phi-2, Mistral-7b) displaying distinct strengths on particular tasks. For deployment planning, the benchmark adds a robust signal for selecting models for reasoning-heavy workloads and underscores the need for ongoing re-evaluation as the benchmark evolves each month.
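Weighted Accuracy weights each question by its difficulty level so that misses on harder instances cost more, while Failure Rate tracks responses that never yield an answer in the required format. The sketch below shows one plausible way to compute both; the `Result` schema, the function names, and the choice of integer difficulty levels as weights are assumptions for illustration, not the NPHardEval implementation.

```python
from dataclasses import dataclass

@dataclass
class Result:
    """One graded model answer (illustrative schema, not NPHardEval's)."""
    difficulty: int   # question difficulty level, e.g. 1..10
    correct: bool     # answer matched the reference solution
    parsed: bool      # an answer could be extracted in the required format

def weighted_accuracy(results: list[Result]) -> float:
    """Accuracy with each question weighted by its difficulty,
    so errors on harder instances lower the score more."""
    total = sum(r.difficulty for r in results)
    earned = sum(r.difficulty for r in results if r.correct)
    return earned / total if total else 0.0

def failure_rate(results: list[Result]) -> float:
    """Fraction of responses with no parseable answer."""
    return sum(not r.parsed for r in results) / len(results) if results else 0.0

# Toy run: three questions of increasing difficulty, one miss on level 2.
run = [Result(1, True, True), Result(2, False, True), Result(3, True, True)]
print(weighted_accuracy(run))  # (1 + 3) / (1 + 2 + 3) ≈ 0.667
print(failure_rate(run))       # 0.0
```

Weighting by difficulty is what lets a single score reflect the complexity hierarchy: two models with equal raw accuracy are separated by where their errors fall.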
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info