Smolagents adds vision-language model (VLM) support for vision-enabled agents
AI Impact Summary
Smolagents now supports vision-language models, enabling agents to perceive images within agentic pipelines and unlocking vision-based decision making for tasks like autonomous web browsing. The feature allows images to be passed at task startup or added dynamically via a per-step callback, stores them in TaskStep.task_images and step_log.observation_images, and integrates with Helium/Selenium for browser automation. To use VLMs reliably, teams should initialize TransformersModel with flatten_messages_as_text=False when loading a VLM such as HuggingFaceTB/SmolVLM-Instruct, and adjust image-data workflows accordingly.
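A minimal sketch of the two paths described above. The TransformersModel/CodeAgent construction follows the smolagents API named in the summary (flatten_messages_as_text=False with a VLM); the TaskStep/ActionStep classes and save_screenshot_callback below are simplified stand-ins written here to illustrate where images land, not the library's actual definitions.

```python
from dataclasses import dataclass, field


def build_vlm_agent():
    """Sketch only: construct a vision-enabled agent (requires smolagents
    and model weights, so this function is defined but not called here).
    flatten_messages_as_text=False keeps image content in chat messages
    instead of flattening everything to plain text, as the summary advises."""
    from smolagents import CodeAgent, TransformersModel

    model = TransformersModel(
        model_id="HuggingFaceTB/SmolVLM-Instruct",
        flatten_messages_as_text=False,  # required when the model is a VLM
    )
    return CodeAgent(tools=[], model=model)


# Simplified stand-ins for the image-carrying step objects mentioned above
# (the real TaskStep and step logs live inside smolagents).
@dataclass
class TaskStep:
    task: str
    task_images: list = field(default_factory=list)  # images passed at startup


@dataclass
class ActionStep:
    observation_images: list = field(default_factory=list)  # per-step images


def save_screenshot_callback(step_log, screenshot):
    """Dynamic path: a callback run after each step attaches the latest
    browser screenshot (e.g. captured via Helium/Selenium) to the step log."""
    step_log.observation_images = [screenshot]


# Startup path: images supplied with the task itself.
start = TaskStep(task="Browse to the docs page", task_images=["homepage.png"])

# Dynamic path: a screenshot captured mid-run is stored on the step log.
step = ActionStep()
save_screenshot_callback(step, "after_click.png")
```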
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info