Google launches PaliGemma 2 vision-language models with SigLIP encoder and Gemma 2 decoder
AI Impact Summary
Google has launched PaliGemma 2, a vision-language model family that pairs the SigLIP image encoder with the Gemma 2 text decoder, available in 3B, 10B, and 28B parameter variants at 224x224, 448x448, and 896x896 input resolutions. The release includes DOCCI-tuned models (3B and 10B), a full set of pre-trained checkpoints, model cards, fine-tuning scripts, and a VQAv2 demo, all under the Gemma license, which permits redistribution, commercial use, and fine-tuning. This expands production-ready options for captioning and visual question answering: teams can trade quality against compute while drawing on a broad training-data ecosystem (WebLI, CC3M-35L, OpenImages, WIT, DOCCI) and existing transformers integrations.
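The nine released checkpoints follow from combining the three parameter sizes with the three resolutions. A minimal sketch of selecting one, assuming the Hugging Face Hub naming pattern `google/paligemma2-{size}-pt-{resolution}` for the pre-trained checkpoints (the helper name `checkpoint_id` is illustrative, not part of any official API):

```python
# Hypothetical helper for picking a pre-trained (pt) PaliGemma 2 checkpoint.
# The Hub id pattern "google/paligemma2-{size}-pt-{resolution}" is an
# assumption based on the published variants, not an official API.

VARIANTS = ("3b", "10b", "28b")    # parameter counts from the release
RESOLUTIONS = (224, 448, 896)      # square input resolutions from the release

def checkpoint_id(size: str, resolution: int) -> str:
    """Return the assumed Hub id for a pre-trained PaliGemma 2 checkpoint."""
    if size not in VARIANTS:
        raise ValueError(f"size must be one of {VARIANTS}, got {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}, got {resolution}")
    return f"google/paligemma2-{size}-pt-{resolution}"

# Smallest and largest configurations:
print(checkpoint_id("3b", 224))   # google/paligemma2-3b-pt-224
print(checkpoint_id("28b", 896))  # google/paligemma2-28b-pt-896
```

Higher resolutions and larger variants raise both quality and compute cost, which is the trade-off the release is designed to let teams make explicitly.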
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info