NPHardEval Leaderboard: Benchmark for LLM reasoning across NP-hard problems with monthly updates
AI Impact Summary
NPHardEval introduces a monthly-updated benchmark that tests LLM reasoning on 900 questions across nine algorithmic tasks spanning the P, NP-complete, and NP-hard complexity classes. The framework automates question generation and answer verification, and scores models with two metrics, Weighted Accuracy and Failure Rate, to quantify reasoning quality and reliability. Results show strong performance from closed-source models such as GPT-4 Turbo, while certain open-source models stand out on specific problem types. For technical teams, this means folding a dynamic, complexity-graded evaluation into model selection and ongoing benchmarking to prevent overfitting to a static test set and to track progress over time.
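For context, here is a minimal sketch of how difficulty-weighted scores of this kind can be computed, assuming ten difficulty levels with linear weights (level i gets weight i); the function names and the exact weighting scheme are illustrative assumptions, not code from the NPHardEval repository.

```python
# Hypothetical sketch: difficulty-weighted accuracy and failure rate.
# Assumes 10 difficulty levels weighted linearly (w_i = i); the actual
# NPHardEval weighting may differ.
from typing import Sequence


def weighted_accuracy(per_level_accuracy: Sequence[float]) -> float:
    """Average per-level accuracy, weighting harder levels more heavily."""
    weights = range(1, len(per_level_accuracy) + 1)
    return sum(w * a for w, a in zip(weights, per_level_accuracy)) / sum(weights)


def failure_rate(per_level_failures: Sequence[float]) -> float:
    """Same weighting applied to the fraction of unparseable or invalid answers."""
    weights = range(1, len(per_level_failures) + 1)
    return sum(w * f for w, f in zip(weights, per_level_failures)) / sum(weights)


if __name__ == "__main__":
    # Example inputs: accuracy falls and failures rise as difficulty increases.
    acc = [0.9, 0.85, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
    fail = [0.0, 0.0, 0.05, 0.05, 0.1, 0.1, 0.2, 0.2, 0.3, 0.4]
    print(f"Weighted Accuracy: {weighted_accuracy(acc):.3f}")
    print(f"Failure Rate:      {failure_rate(fail):.3f}")
```

Weighting by difficulty rewards models that hold up on harder instances rather than inflating scores on easy ones, which is the point of grading across complexity classes.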
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info