OpenAI releases VAKRA benchmark for AI agent evaluation
Action Required
Organizations that still rely on GPT-3.5 Turbo will need to migrate to GPT-4o-mini to avoid service disruption from the deprecation of GPT-3.5 Turbo.
AI Impact Summary
This release introduces VAKRA, an executable benchmark for evaluating how well AI agents reason and act in enterprise-like environments. The benchmark uses tool-grounded execution and records full traces, measuring compositional reasoning across APIs and documents. Initial results show that current models struggle with VAKRA's complex, multi-step workflows, and the observed failure modes point to areas for improvement in agent design and training. Researchers and developers can use the benchmark to assess agent capabilities and track progress.
Affected Systems
- Date: 15 Apr 2026
- Change type: capability
- Severity: high