ConTextual: Benchmark for context-sensitive text-rich visual reasoning in multimodal models
AI Impact Summary
ConTextual introduces a benchmark and leaderboard for evaluating context-sensitive, text-rich visual reasoning in multimodal models, and finds that current models struggle to reason jointly over textual cues and visual content. The benchmark comprises 506 instructions, split into a validation set and a test set, spanning 8 real-world scenarios: Time Reading, Shopping, Navigation, Abstract Scenes, Mobile App use, Webpages, Infographics, and Miscellaneous Natural Scenes. Evaluation uses GPT-4 as the judge and reports results for models including GPT-4V, Gemini-Vision-Pro, LLaVA-v1.5-13B, ShareGPT4V-7B, Instruct-Blip-Vicuna-7B, mPlugOwl-v2-7B, Bliva-Vicuna-7B, Qwen-VL-7B, and Idefics-9B. The results point to concrete areas for investment (stronger image encoders, finer-grained vision-language alignment, and richer image descriptions) and highlight underperformance of open-source models across several domains.
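As a rough illustration of the GPT-4-as-judge protocol described above, the sketch below scores a file of model responses with the OpenAI Python client. The prompt template, the helper names (judge_response, evaluate), and the JSONL field names are assumptions made for illustration; this is not the authors' released evaluation harness.

```python
# Minimal sketch of a GPT-4-as-judge scoring loop for ConTextual-style
# predictions. Prompt wording, helper names, and file schema are assumed
# for illustration only.
import json
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

JUDGE_PROMPT = (
    "You are given an instruction about an image, a reference answer, and a "
    "model response. Decide whether the response correctly follows the "
    "instruction. Reply with exactly 'accept' or 'reject'.\n\n"
    "Instruction: {instruction}\nReference answer: {reference}\n"
    "Model response: {response}"
)

def judge_response(instruction: str, reference: str, response: str) -> bool:
    """Ask GPT-4 to accept or reject a single model response."""
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            instruction=instruction, reference=reference, response=response)}],
        temperature=0,
    )
    verdict = completion.choices[0].message.content.strip().lower()
    return verdict.startswith("accept")

def evaluate(predictions_path: str) -> float:
    """Compute the acceptance rate over a JSONL file whose records carry
    'instruction', 'reference', and 'response' fields (assumed schema)."""
    accepted, total = 0, 0
    with open(predictions_path) as f:
        for line in f:
            record = json.loads(line)
            accepted += judge_response(record["instruction"],
                                       record["reference"],
                                       record["response"])
            total += 1
    return accepted / total if total else 0.0

if __name__ == "__main__":
    print(f"acceptance rate: {evaluate('predictions.jsonl'):.3f}")
```

In practice, an acceptance rate computed this way per scenario is what lets a leaderboard compare closed and open-source models domain by domain.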
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info