nanoVLM: Pure PyTorch Vision-Language Model training toolkit with Colab support
AI Impact Summary
nanoVLM is a minimal, pure-PyTorch toolkit for training a Vision-Language Model (VLM) that pairs a SigLIP vision encoder with a SmolLM2-135M language backbone. It provides end-to-end training and inference with a Colab-friendly setup and Hugging Face datasets integration, plus optional wandb logging and model sharing to the Hugging Face Hub. The dual-learning-rate optimizer (a higher rate for the newly initialized modality projector than for the pretrained backbones) and the lightweight projector design aim to reduce training cost while preserving backbone knowledge, enabling quick VLM prototyping on free-tier GPUs. Teams moving beyond prototypes should plan for Colab's runtime limits, data-handling needs, and reproducibility before scaling to dedicated infrastructure.
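As a rough illustration of the design, the sketch below shows how a modality projector can map vision-encoder patch embeddings into the language model's embedding space, and how two optimizer parameter groups give the freshly initialized projector a higher learning rate than the pretrained backbones. The class and variable names (VisionProjector, vision_encoder, language_model), the dimensions, and the learning rates are illustrative assumptions, not nanoVLM's actual API.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Hypothetical modality projector: maps vision-encoder patch
    embeddings into the language model's token-embedding space."""
    def __init__(self, vision_dim: int = 768, lm_dim: int = 576):
        super().__init__()
        # A plain two-layer MLP as a simplified stand-in; the real
        # projector design may differ.
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, patch_embeds: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, lm_dim)
        return self.proj(patch_embeds)

# Placeholder modules standing in for the pretrained backbones.
vision_encoder = nn.Linear(3, 768)    # stand-in for a SigLIP encoder
language_model = nn.Linear(576, 576)  # stand-in for SmolLM2-135M
projector = VisionProjector()

# Dual-learning-rate setup: the randomly initialized projector trains
# fast, while the pretrained backbones move conservatively. The rates
# shown are placeholder values, not nanoVLM's configured ones.
optimizer = torch.optim.AdamW([
    {"params": projector.parameters(), "lr": 2e-3},   # modality projector
    {"params": list(vision_encoder.parameters())
             + list(language_model.parameters()), "lr": 1e-4},  # backbones
])
```

Splitting parameter groups this way is standard PyTorch; nanoVLM's actual learning rates and module layout live in its training config and model code.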
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info