IBM and UC Berkeley diagnose ITBench agent failures using MAST
AI Impact Summary
IBM Research and UC Berkeley used the MAST failure taxonomy to diagnose ITBench-driven agent runs across Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. They find that frontier models fail at isolated verification bottlenecks, while open-weight models exhibit cascading failures once a small mismatch propagates through subsequent steps. The study argues for externalizing verification, requiring hard tool-based evidence before an agent exits, and placing termination and loop control outside the LLM (e.g., explicit stop conditions or finite state machines), along with clarifying-question or read-only branches for ambiguous requests. The implication for IT automation pipelines: collect structured failure vectors via MAST during evaluation and bake targeted mitigations into orchestrators, reducing outages and false claims of success.
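The recommendation to keep termination outside the LLM can be sketched as a small finite state machine that wraps the agent loop. This is a minimal illustration, not the study's implementation; the class and method names (`LoopController`, `record_step`) are hypothetical, and the evidence check stands in for whatever tool-based verification an orchestrator actually runs.

```python
from enum import Enum, auto

class AgentState(Enum):
    RUNNING = auto()
    NEEDS_EVIDENCE = auto()   # model claimed success without tool-backed proof
    DONE = auto()
    ABORTED = auto()

class LoopController:
    """Hypothetical external termination controller: the model never
    decides on its own when to stop; this FSM does."""

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps = 0
        self.state = AgentState.RUNNING

    def record_step(self, claims_done: bool, tool_evidence: bool) -> AgentState:
        """Advance the FSM after each agent step.

        claims_done:   the model asserts the task is finished
        tool_evidence: a tool call (not model text) confirms the outcome
        """
        self.steps += 1
        if self.steps > self.max_steps:
            self.state = AgentState.ABORTED          # hard stop, outside the model
        elif claims_done and tool_evidence:
            self.state = AgentState.DONE             # exit only with tool-backed proof
        elif claims_done:
            self.state = AgentState.NEEDS_EVIDENCE   # success claim rejected; loop continues
        else:
            self.state = AgentState.RUNNING
        return self.state

# Example: a success claim without evidence does not terminate the run.
ctrl = LoopController(max_steps=5)
ctrl.record_step(claims_done=False, tool_evidence=False)  # AgentState.RUNNING
ctrl.record_step(claims_done=True, tool_evidence=False)   # AgentState.NEEDS_EVIDENCE
ctrl.record_step(claims_done=True, tool_evidence=True)    # AgentState.DONE
```

The point of the pattern is that both failure modes the study highlights, looping without progress and declaring success without verification, are handled by deterministic orchestrator code rather than by prompting the model to police itself.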
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info