Smolagents adds vision-language model (VLM) support for vision-enabled agents
AI Impact Summary
Smolagents now supports vision-language models, enabling agents to perceive images within agentic pipelines and unlocking vision-based decision making for tasks like autonomous web browsing. The feature allows images to be passed at task startup or added dynamically via a per-step callback, stores them in TaskStep.task_images and step_log.observation_images, and integrates with Helium/Selenium for browser automation. To use VLMs reliably, teams should initialize TransformersModel with flatten_messages_as_text=False when loading a VLM such as HuggingFaceTB/SmolVLM-Instruct, and adjust image-data workflows accordingly.
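A minimal sketch of the two paths described above. The TransformersModel/CodeAgent construction follows the smolagents API named in the summary (flatten_messages_as_text=False with a VLM); the TaskStep/ActionStep classes and save_screenshot_callback below are simplified stand-ins written here to illustrate where images land, not the library's actual definitions.

```python
from dataclasses import dataclass, field


def build_vlm_agent():
    """Sketch only: construct a vision-enabled agent (requires smolagents
    and model weights, so this function is defined but not called here).
    flatten_messages_as_text=False keeps image content in chat messages
    instead of flattening everything to plain text, as the summary advises."""
    from smolagents import CodeAgent, TransformersModel

    model = TransformersModel(
        model_id="HuggingFaceTB/SmolVLM-Instruct",
        flatten_messages_as_text=False,  # required when the model is a VLM
    )
    return CodeAgent(tools=[], model=model)


# Simplified stand-ins for the image-carrying step objects mentioned above
# (the real TaskStep and step logs live inside smolagents).
@dataclass
class TaskStep:
    task: str
    task_images: list = field(default_factory=list)  # images passed at startup


@dataclass
class ActionStep:
    observation_images: list = field(default_factory=list)  # per-step images


def save_screenshot_callback(step_log, screenshot):
    """Dynamic path: a callback run after each step attaches the latest
    browser screenshot (e.g. captured via Helium/Selenium) to the step log."""
    step_log.observation_images = [screenshot]


# Startup path: images supplied with the task itself.
start = TaskStep(task="Browse to the docs page", task_images=["homepage.png"])

# Dynamic path: a screenshot captured mid-run is stored on the step log.
step = ActionStep()
save_screenshot_callback(step, "after_click.png")
```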
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info