Large Reasoning Models Fail to Follow Instructions — New Benchmark Reveals Critical Flaw
Action Required
Applications relying on large reasoning models for critical tasks may produce inaccurate or unreliable results due to the models' inability to consistently follow user instructions.
AI Impact Summary
This study reveals a critical flaw in large reasoning models (LRMs): their tendency to deviate from user instructions during complex reasoning tasks. The researchers introduce ReasonIF, a new benchmark dataset designed to rigorously assess instruction-following ability across several dimensions, including language, formatting, and length constraints. The finding that frontier LRMs fail to adhere to instructions more than 75% of the time highlights a significant obstacle to the reliable deployment of these models in applications requiring precise, controlled reasoning.
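To illustrate the kinds of checks a benchmark like this might apply, the sketch below scores a reasoning trace against simple length, formatting, and language constraints. The function name, constraint keys, and checks are hypothetical examples chosen for illustration; they are not ReasonIF's actual scoring code.

```python
import json

def check_instruction_adherence(reasoning_trace: str, constraints: dict) -> dict:
    """Illustrative checks for whether a reasoning trace obeys simple constraints.

    Hypothetical examples of the dimensions described above (length, formatting,
    language); not the benchmark's actual implementation.
    """
    results = {}

    # Length constraint: reasoning must stay under a word budget.
    if "max_words" in constraints:
        results["length"] = len(reasoning_trace.split()) <= constraints["max_words"]

    # Formatting constraint: reasoning must be valid JSON if requested.
    if constraints.get("json_output"):
        try:
            json.loads(reasoning_trace)
            results["formatting"] = True
        except json.JSONDecodeError:
            results["formatting"] = False

    # Language constraint: crude proxy that the trace sticks to ASCII text
    # when the instruction asks for English-only reasoning.
    if constraints.get("english_only"):
        results["language"] = reasoning_trace.isascii()

    return results

# Example: a trace that exceeds a 10-word budget fails the length check.
trace = "Step 1: consider the problem. Step 2: enumerate cases. Step 3: conclude."
print(check_instruction_adherence(trace, {"max_words": 10, "english_only": True}))
```

A real evaluation would need far more robust checks (for example, proper language identification rather than an ASCII test), but the structure shows how adherence can be scored per constraint and aggregated across a dataset.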
Affected Systems
- Date: Not specified
- Change type: Capability
- Severity: Critical