Fine-tuning Florence-2 enables DocVQA capabilities with Florence-2-base-ft

AI Impact Summary

Microsoft's Florence-2 can be adapted to document visual question answering (DocVQA) by fine-tuning the Florence-2-base-ft checkpoint. In the reported experiments, training on DocVQA data with a <DocVQA> task prefix reached a similarity score of 57.0 on the validation set after seven epochs, which suggests that usable VQA capability comes from targeted fine-tuning rather than from architectural changes. The model combines a DaViT vision encoder, BERT-based prompt/text embeddings, and an encoder-decoder transformer, and the fine-tuning pipeline can optionally unfreeze the vision encoder. Loading the model requires trust_remote_code because its modeling code is not natively part of the Transformers library. Enterprises aiming for document-focused QA should provision GPUs (A100, T4, or H100), plan for validation, and consider deeper fine-tuning via The Cauldron for further gains.
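The setup described above can be sketched roughly as follows. This is a minimal illustration, not the original training code: the Hub id microsoft/Florence-2-base-ft, the helper names, and the vision_tower attribute used for freezing are assumptions made for the example, and the model-loading function is defined but never called here, so nothing is downloaded.

```python
# Minimal sketch of the pieces named in the summary. Helper names are
# hypothetical; the Hub id and the `vision_tower` attribute are assumptions.

DOCVQA_PREFIX = "<DocVQA>"                 # task prefix named in the summary
MODEL_ID = "microsoft/Florence-2-base-ft"  # assumed Hugging Face Hub id


def build_docvqa_prompt(question: str) -> str:
    """Prepend the DocVQA task prefix to a whitespace-trimmed question."""
    return DOCVQA_PREFIX + question.strip()


def load_florence2(model_id: str = MODEL_ID, device: str = "cpu"):
    """Load model and processor; downloads weights, so call it explicitly.

    trust_remote_code=True is needed because the modeling code ships with
    the checkpoint rather than with the library itself.
    """
    # Imported lazily so the rest of this sketch runs without transformers.
    from transformers import AutoModelForCausalLM, AutoProcessor

    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True
    ).to(device)
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    return model, processor


def set_vision_encoder_trainable(model, trainable: bool) -> int:
    """Freeze (False) or unfreeze (True) the vision encoder parameters.

    Assumes the encoder is exposed as `model.vision_tower`. Returns the
    number of parameter tensors whose flag was set.
    """
    count = 0
    for param in model.vision_tower.parameters():
        param.requires_grad = trainable
        count += 1
    return count


if __name__ == "__main__":
    print(build_docvqa_prompt("What is the invoice total?"))
```

A typical flow would call load_florence2() once, then set_vision_encoder_trainable(model, False) to keep training cheap, or pass True to unfreeze the vision encoder when the GPU budget allows.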

Affected Systems

Florence-2-base-ft, Florence-2

Date: Date not specified
Change type: capability
Severity: info
