Language Technologies Lab releases Visual Salamandra — multimodal LLM with SigLIP encoder
AI Impact Summary
The Language Technologies Lab has released Visual Salamandra, a 7B-parameter LLM extended to process both images and video. The model uses a SigLIP encoder and late-fusion techniques to align the visual and textual modalities, so it can respond in context to mixed inputs. Training runs in four phases and incorporates data from AI2D, Cambrian, and LLaVA Next, with an emphasis on multilingual coverage, particularly for European languages, and on robust multimodal performance.
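As a rough illustration of what late fusion usually means in this kind of model, the sketch below projects visual features into the LLM embedding width and places them in the same sequence as the text token embeddings. It is not Visual Salamandra's actual code: the single linear projector, the tiny dimensions, the random stand-in features, and every name (`project`, `fuse`, `D_VIS`, `D_LLM`) are assumptions chosen for illustration.

```python
import random


def make_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random values (stand-in data)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def project(features, weights):
    """Map each visual feature vector (width d_vis) to the LLM width (d_llm).

    `weights` has shape d_vis x d_llm; this is a plain matrix product.
    """
    columns = list(zip(*weights))  # d_llm columns, each of length d_vis
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in columns]
        for vec in features
    ]


def fuse(visual_tokens, text_tokens):
    """Late fusion: put projected visual tokens ahead of the text embeddings."""
    return visual_tokens + text_tokens


rng = random.Random(0)
D_VIS, D_LLM = 8, 16        # hypothetical widths, far smaller than a real model
N_PATCHES, N_TEXT = 4, 3    # hypothetical number of image patches and text tokens

patch_features = make_matrix(N_PATCHES, D_VIS, rng)  # stand-in for vision encoder output
text_embeddings = make_matrix(N_TEXT, D_LLM, rng)    # stand-in for LLM token embeddings
projector = make_matrix(D_VIS, D_LLM, rng)           # assumed linear projector weights

visual_tokens = project(patch_features, projector)
sequence = fuse(visual_tokens, text_embeddings)

# The fused sequence holds one row per visual or text token, all at LLM width.
print(len(sequence), len(sequence[0]))  # prints "7 16"
```

The point of the sketch is the ordering of operations: the image is encoded separately, only its output features are mapped into the language model's embedding space, and the language model then attends over one combined sequence.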
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info