High | Capability

OpenAI releasing VAKRA benchmark for AI agent evaluation

Action Required

Organizations that still rely on GPT-3.5 Turbo will need to migrate to GPT-4o-mini to avoid service disruption when GPT-3.5 Turbo is deprecated.

AI Impact Summary

This release introduces VAKRA, an executable benchmark that evaluates how well AI agents reason and act in enterprise-like environments. The benchmark uses tool-grounded execution and records full traces, measuring compositional reasoning across APIs and documents. Initial results show that current models struggle with VAKRA's complex, multi-step workflows, and the observed failure modes point to areas for improvement in agent design and training. Researchers and developers can use the benchmark to assess and advance the capabilities of AI agents.
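To make "tool-grounded execution with full traces" concrete, here is a minimal Python sketch of the general idea. It is not VAKRA's actual harness or API: the names `TraceStep`, `TracedToolbox`, `lookup_customer`, `read_policy`, and the sample data are all hypothetical and chosen only for illustration. The sketch runs a two-step task in which the output of an API-style tool feeds a document-style tool, and it records every call so that both the final answer and the path taken can be inspected.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional


@dataclass
class TraceStep:
    """One recorded tool call: what was called, with what, and what happened."""
    tool: str
    args: Dict[str, Any]
    result: Any = None
    error: Optional[str] = None


@dataclass
class TracedToolbox:
    """Wraps plain callables so that every invocation is appended to a trace."""
    tools: Dict[str, Callable[..., Any]]
    trace: List[TraceStep] = field(default_factory=list)

    def call(self, name: str, **args: Any) -> Any:
        step = TraceStep(tool=name, args=args)
        self.trace.append(step)
        try:
            step.result = self.tools[name](**args)
        except Exception as exc:
            # Failures are recorded in the trace instead of being raised,
            # so a failed run can still be analyzed step by step.
            step.error = f"{type(exc).__name__}: {exc}"
        return step.result


# Hypothetical stand-ins for an enterprise API and a document store.
CUSTOMERS = {"c-17": {"name": "Acme", "region": "EU"}}
POLICIES = {"EU": "30-day refund", "US": "14-day refund"}


def make_toolbox() -> TracedToolbox:
    return TracedToolbox(tools={
        "lookup_customer": lambda customer_id: CUSTOMERS[customer_id],
        "read_policy": lambda region: POLICIES[region],
    })


def run_task(box: TracedToolbox, customer_id: str) -> Any:
    """Compositional task: the API result determines which document to read."""
    customer = box.call("lookup_customer", customer_id=customer_id)
    if customer is None:
        return None  # first step failed; the trace shows why
    return box.call("read_policy", region=customer["region"])


def score(box: TracedToolbox, answer: Any, expected: Any) -> bool:
    """Pass only if the answer is right and no step in the trace failed."""
    return answer == expected and all(s.error is None for s in box.trace)
```

Scoring on the trace as well as the answer is one possible design choice: it separates runs that reach the right answer through valid tool calls from runs that fail partway through, which is the kind of failure-mode analysis the summary describes.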

Affected Systems

GPT-4o-mini

Date: 15 Apr 2026
Change type: capability
Severity: high
