TextQuests benchmark: LLMs as autonomous agents in text-based games reveal long-context reasoning challenges
AI Impact Summary
TextQuests benchmarks LLMs' ability to act as autonomous agents across 25 classic Infocom text-adventure games, emphasizing long-context reasoning, learning through exploration, and management of a growing action history. The report notes that models hallucinate about past interactions, repeat actions as context grows, and fail at spatial navigation, and that robust performance carries high compute and latency costs. These findings imply that deploying AI agents in dynamic, open-ended environments will require advanced memory, world modeling, and adaptive planning beyond vanilla prompt-based reasoning, along with careful cost/performance tradeoffs.
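The action-history management problem described above can be sketched as a bounded sliding window over past turns: keeping only recent (observation, action) pairs caps prompt growth, at the cost of discarding older context. This is a minimal illustrative sketch, not the TextQuests harness; all names here (`ActionHistory`, `record`, `as_prompt`) are hypothetical.

```python
from collections import deque


class ActionHistory:
    """Bounded sliding window over (observation, action) turns.

    Truncating to the most recent turns bounds prompt size, but drops
    older context -- the tradeoff behind the hallucination and
    repeated-action failures the report describes.
    """

    def __init__(self, max_turns: int = 50):
        # deque with maxlen evicts the oldest turn automatically
        self.turns = deque(maxlen=max_turns)

    def record(self, observation: str, action: str) -> None:
        self.turns.append((observation, action))

    def as_prompt(self) -> str:
        # Render retained turns in a transcript-like format
        return "\n".join(f"> {a}\n{o}" for o, a in self.turns)


# Hypothetical usage with a tiny window to show eviction
history = ActionHistory(max_turns=2)
history.record("West of House. There is a mailbox here.", "open mailbox")
history.record("Opening the mailbox reveals a leaflet.", "take leaflet")
history.record("Taken.", "go north")
print(history.as_prompt())  # only the two most recent turns remain
```

A fixed window like this is exactly the kind of naive strategy the benchmark stresses: once a landmark observation is evicted, the agent can no longer ground its spatial reasoning in it.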
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info