Large Reasoning Models Fail to Follow Instructions — New Benchmark Reveals Critical Flaw
Action Required
Applications relying on large reasoning models for critical tasks may produce inaccurate or unreliable results due to the models' inability to consistently follow user instructions.
AI Impact Summary
This study reveals a critical flaw in large reasoning models (LRMs): their tendency to deviate from user instructions during complex reasoning tasks. The researchers introduce ReasonIF, a new benchmark dataset designed to rigorously assess instruction-following ability across several dimensions, including language, formatting, and length constraints. The finding that frontier LRMs fail to adhere to instructions more than 75% of the time highlights a significant obstacle to the reliable deployment of these models in applications requiring precise, controlled reasoning.
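To illustrate the kinds of checks a benchmark like this might apply, the sketch below scores a reasoning trace against simple length, formatting, and language constraints. The function name, constraint keys, and checks are hypothetical examples chosen for illustration; they are not ReasonIF's actual scoring code.

```python
import json

def check_instruction_adherence(reasoning_trace: str, constraints: dict) -> dict:
    """Illustrative checks for whether a reasoning trace obeys simple constraints.

    Hypothetical examples of the dimensions described above (length, formatting,
    language); not the benchmark's actual implementation.
    """
    results = {}

    # Length constraint: reasoning must stay under a word budget.
    if "max_words" in constraints:
        results["length"] = len(reasoning_trace.split()) <= constraints["max_words"]

    # Formatting constraint: reasoning must be valid JSON if requested.
    if constraints.get("json_output"):
        try:
            json.loads(reasoning_trace)
            results["formatting"] = True
        except json.JSONDecodeError:
            results["formatting"] = False

    # Language constraint: crude proxy that the trace sticks to ASCII text
    # when the instruction asks for English-only reasoning.
    if constraints.get("english_only"):
        results["language"] = reasoning_trace.isascii()

    return results

# Example: a trace that exceeds a 10-word budget fails the length check.
trace = "Step 1: consider the problem. Step 2: enumerate cases. Step 3: conclude."
print(check_instruction_adherence(trace, {"max_words": 10, "english_only": True}))
```

A real evaluation would need far more robust checks (for example, proper language identification rather than an ASCII test), but the structure shows how adherence can be scored per constraint and aggregated across a dataset.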
Affected Systems
- Date: Not specified
- Change type: Capability
- Severity: Critical