Google launches PaliGemma 2 vision-language models with SigLIP encoder and Gemma 2 decoder
AI Impact Summary
Google has launched PaliGemma 2, a vision-language model family that pairs the SigLIP image encoder with the Gemma 2 text decoder, available in 3B, 10B, and 28B parameter variants at 224x224, 448x448, and 896x896 input resolutions. The release includes DOCCI-tuned models (3B and 10B), a full set of pre-trained checkpoints, model cards, fine-tuning scripts, and a VQAv2 demo, all under the Gemma license, which permits redistribution, commercial use, and fine-tuning. This expands production-ready options for captioning and visual question answering: teams can trade quality against compute while drawing on a broad training-data ecosystem (WebLI, CC3M-35L, OpenImages, WIT, DOCCI) and existing transformers integrations.
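The nine released checkpoints follow from combining the three parameter sizes with the three resolutions. A minimal sketch of selecting one, assuming the Hugging Face Hub naming pattern `google/paligemma2-{size}-pt-{resolution}` for the pre-trained checkpoints (the helper name `checkpoint_id` is illustrative, not part of any official API):

```python
# Hypothetical helper for picking a pre-trained (pt) PaliGemma 2 checkpoint.
# The Hub id pattern "google/paligemma2-{size}-pt-{resolution}" is an
# assumption based on the published variants, not an official API.

VARIANTS = ("3b", "10b", "28b")    # parameter counts from the release
RESOLUTIONS = (224, 448, 896)      # square input resolutions from the release

def checkpoint_id(size: str, resolution: int) -> str:
    """Return the assumed Hub id for a pre-trained PaliGemma 2 checkpoint."""
    if size not in VARIANTS:
        raise ValueError(f"size must be one of {VARIANTS}, got {size!r}")
    if resolution not in RESOLUTIONS:
        raise ValueError(f"resolution must be one of {RESOLUTIONS}, got {resolution}")
    return f"google/paligemma2-{size}-pt-{resolution}"

# Smallest and largest configurations:
print(checkpoint_id("3b", 224))   # google/paligemma2-3b-pt-224
print(checkpoint_id("28b", 896))  # google/paligemma2-28b-pt-896
```

Higher resolutions and larger variants raise both quality and compute cost, which is the trade-off the release is designed to let teams make explicitly.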
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info