NPHardEval Leaderboard benchmarks LLM reasoning with NP-hard problems, updated monthly
AI Impact Summary
NPHardEval introduces a dynamic, monthly-updated benchmark that targets LLM reasoning on problems up to and including NP-hard complexity, presenting 900 algorithmic questions spanning nine tasks across three complexity classes (P, NP-complete, and NP-hard). The evaluation relies on automated question generation and two metrics, Weighted Accuracy and Failure Rate, grounded in the computational complexity hierarchy, enabling quantitative comparisons beyond standard QA benchmarks. Early results show GPT-4 Turbo achieving the strongest overall performance, Claude 2 excelling on NP-complete tasks, and several open-source models (Yi-34b, Qwen-14b, Phi-2, Mistral-7b) displaying distinct strengths on particular tasks. For deployment planning, the benchmark adds a robust signal for selecting models for reasoning-heavy workloads and underscores the need for ongoing re-evaluation as the benchmark evolves each month.
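Weighted Accuracy weights each question by its difficulty level so that misses on harder instances cost more, while Failure Rate tracks responses that never yield an answer in the required format. The sketch below shows one plausible way to compute both; the `Result` schema, the function names, and the choice of integer difficulty levels as weights are assumptions for illustration, not the NPHardEval implementation.

```python
from dataclasses import dataclass

@dataclass
class Result:
    """One graded model answer (illustrative schema, not NPHardEval's)."""
    difficulty: int   # question difficulty level, e.g. 1..10
    correct: bool     # answer matched the reference solution
    parsed: bool      # an answer could be extracted in the required format

def weighted_accuracy(results: list[Result]) -> float:
    """Accuracy with each question weighted by its difficulty,
    so errors on harder instances lower the score more."""
    total = sum(r.difficulty for r in results)
    earned = sum(r.difficulty for r in results if r.correct)
    return earned / total if total else 0.0

def failure_rate(results: list[Result]) -> float:
    """Fraction of responses with no parseable answer."""
    return sum(not r.parsed for r in results) / len(results) if results else 0.0

# Toy run: three questions of increasing difficulty, one miss on level 2.
run = [Result(1, True, True), Result(2, False, True), Result(3, True, True)]
print(weighted_accuracy(run))  # (1 + 3) / (1 + 2 + 3) ≈ 0.667
print(failure_rate(run))       # 0.0
```

Weighting by difficulty is what lets a single score reflect the complexity hierarchy: two models with equal raw accuracy are separated by where their errors fall.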
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info