Fine-tuning Florence-2 enables DocVQA capabilities with Florence-2-base-ft
AI Impact Summary
Microsoft's Florence-2 can be adapted to document visual question answering (DocVQA) by fine-tuning the Florence-2-base-ft checkpoint. In experiments, training on DocVQA data with a <DocVQA> task prefix yielded a validation similarity score of 57.0 after seven epochs, indicating that viable VQA capability emerges from targeted fine-tuning rather than architectural changes. The model combines a DaViT vision encoder, BERT-based prompt/text embeddings, and an encoder-decoder transformer, and the vision encoder can optionally be unfrozen during fine-tuning. Loading the model requires trust_remote_code because its modeling code is custom and not natively integrated into the transformers library. Enterprises aiming for document-focused QA should provision GPUs (A100, T4, or H100), plan for validation, and consider further fine-tuning on The Cauldron for additional gains.
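The setup described above can be sketched roughly as follows. This is a minimal illustration, not the original training code: the helper names (`build_prompt`, `set_vision_encoder_trainable`, `load_florence2`) are hypothetical, the Hub ID `microsoft/Florence-2-base-ft` and the `vision_tower` attribute name are assumptions about how the checkpoint is published and structured, and the heavy imports are deferred so the helpers can be used without downloading anything.

```python
DOCVQA_PREFIX = "<DocVQA>"  # task prefix named in the summary


def build_prompt(question: str) -> str:
    """Prepend the DocVQA task prefix to a question (hypothetical helper)."""
    return DOCVQA_PREFIX + question.strip()


def set_vision_encoder_trainable(model, trainable: bool) -> int:
    """Freeze or unfreeze the vision encoder; return the number of tensors touched.

    Assumes the model exposes its DaViT encoder as `model.vision_tower`.
    """
    count = 0
    for param in model.vision_tower.parameters():
        param.requires_grad = trainable
        count += 1
    return count


def load_florence2(checkpoint: str = "microsoft/Florence-2-base-ft"):
    """Load model and processor; requires `transformers` and network access.

    trust_remote_code=True is needed because the modeling code ships with
    the checkpoint rather than with the library.
    """
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
    return model, processor
```

A typical use would be to call `load_florence2()`, then `set_vision_encoder_trainable(model, False)` to keep the vision encoder frozen on smaller GPUs, and pass `build_prompt(question)` together with the document image to the processor. Enabling remote code executes code from the model repository, so the checkpoint revision should be reviewed or pinned before use.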
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info