IBM and UC Berkeley diagnose ITBench agent failures using MAST
AI Impact Summary
IBM Research and UC Berkeley used the MAST failure taxonomy to diagnose ITBench-driven agent runs across Gemini-3-Flash, Kimi-K2, and GPT-OSS-120B. They find that frontier models fail at isolated verification bottlenecks, while open-weight models exhibit cascading failures once a small mismatch propagates through subsequent steps. The study argues for externalizing verification, requiring hard tool-based evidence before an agent exits, and placing termination and loop control outside the LLM (e.g., explicit stop conditions or finite state machines), along with clarifying-question or read-only branches for ambiguous requests. The implication for IT automation pipelines: collect structured failure vectors via MAST during evaluation and bake targeted mitigations into orchestrators, reducing outages and false claims of success.
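The recommendation to keep termination outside the LLM can be sketched as a small finite state machine that wraps the agent loop. This is a minimal illustration, not the study's implementation; the class and method names (`LoopController`, `record_step`) are hypothetical, and the evidence check stands in for whatever tool-based verification an orchestrator actually runs.

```python
from enum import Enum, auto

class AgentState(Enum):
    RUNNING = auto()
    NEEDS_EVIDENCE = auto()   # model claimed success without tool-backed proof
    DONE = auto()
    ABORTED = auto()

class LoopController:
    """Hypothetical external termination controller: the model never
    decides on its own when to stop; this FSM does."""

    def __init__(self, max_steps: int = 10):
        self.max_steps = max_steps
        self.steps = 0
        self.state = AgentState.RUNNING

    def record_step(self, claims_done: bool, tool_evidence: bool) -> AgentState:
        """Advance the FSM after each agent step.

        claims_done:   the model asserts the task is finished
        tool_evidence: a tool call (not model text) confirms the outcome
        """
        self.steps += 1
        if self.steps > self.max_steps:
            self.state = AgentState.ABORTED          # hard stop, outside the model
        elif claims_done and tool_evidence:
            self.state = AgentState.DONE             # exit only with tool-backed proof
        elif claims_done:
            self.state = AgentState.NEEDS_EVIDENCE   # success claim rejected; loop continues
        else:
            self.state = AgentState.RUNNING
        return self.state

# Example: a success claim without evidence does not terminate the run.
ctrl = LoopController(max_steps=5)
ctrl.record_step(claims_done=False, tool_evidence=False)  # AgentState.RUNNING
ctrl.record_step(claims_done=True, tool_evidence=False)   # AgentState.NEEDS_EVIDENCE
ctrl.record_step(claims_done=True, tool_evidence=True)    # AgentState.DONE
```

The point of the pattern is that both failure modes the study highlights, looping without progress and declaring success without verification, are handled by deterministic orchestrator code rather than by prompting the model to police itself.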
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info