NPHardEval Leaderboard: Benchmark for LLM reasoning across NP-hard problems with monthly updates
AI Impact Summary
NPHardEval introduces a monthly-updated benchmark that tests LLM reasoning on 900 questions across nine algorithmic tasks spanning the P, NP-complete, and NP-hard complexity classes. The framework automates question generation and answer verification, and scores models with two metrics, Weighted Accuracy and Failure Rate, to quantify reasoning quality and reliability. Results show strong performance from closed-source models such as GPT-4 Turbo, while certain open-source models stand out on specific problem types. For technical teams, this means folding a dynamic, complexity-graded evaluation into model selection and ongoing benchmarking to prevent overfitting to a static test set and to track progress over time.
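For context, here is a minimal sketch of how difficulty-weighted scores of this kind can be computed, assuming ten difficulty levels with linear weights (level i gets weight i); the function names and the exact weighting scheme are illustrative assumptions, not code from the NPHardEval repository.

```python
# Hypothetical sketch: difficulty-weighted accuracy and failure rate.
# Assumes 10 difficulty levels weighted linearly (w_i = i); the actual
# NPHardEval weighting may differ.
from typing import Sequence


def weighted_accuracy(per_level_accuracy: Sequence[float]) -> float:
    """Average per-level accuracy, weighting harder levels more heavily."""
    weights = range(1, len(per_level_accuracy) + 1)
    return sum(w * a for w, a in zip(weights, per_level_accuracy)) / sum(weights)


def failure_rate(per_level_failures: Sequence[float]) -> float:
    """Same weighting applied to the fraction of unparseable or invalid answers."""
    weights = range(1, len(per_level_failures) + 1)
    return sum(w * f for w, f in zip(weights, per_level_failures)) / sum(weights)


if __name__ == "__main__":
    # Example inputs: accuracy falls and failures rise as difficulty increases.
    acc = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    fail = [0.0, 0.0, 0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4]
    print(f"Weighted Accuracy: {weighted_accuracy(acc):.3f}")
    print(f"Failure Rate:      {failure_rate(fail):.3f}")
```

Weighting by difficulty rewards models that hold up on harder instances rather than inflating scores on easy ones, which is the point of grading across complexity classes.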
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info