Docmatix DocVQA dataset (2.4M images) yields ~20% gain for Florence-2 fine-tuning
AI Impact Summary
Docmatix greatly enlarges the data available for DocVQA: 2.4 million images and 9.5 million Q/A pairs derived from 1.3 million PDFs, roughly 240x larger than previous datasets. Early ablations show a ~20% performance gain when Florence-2 is fine-tuned on Docmatix-derived data. The Q/A pairs were generated with Phi-3-small, and 15% of the generated output was filtered out as hallucinations. Images are hosted on the Hugging Face Hub for easy access. The dataset preserves provenance by linking back to PDFA, and its processing pipeline (PDF-to-image conversion at 150 dpi) is reproducible. Together these give stronger DocVQA baselines and help narrow the gap between open-source and closed models, though teams should assess licensing, data quality, and the compute cost of reproducing the pipeline.
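To put the headline figures in perspective, the following is a minimal sketch that derives a few ratios from the numbers quoted in the summary and shows what a 150 dpi render means in pixels. The function names are hypothetical, and the US Letter page size (8.5 x 11 inches) is an assumption made only for illustration; the summary does not state the page dimensions of the source PDFs.

```python
from __future__ import annotations

# Figures quoted in the summary above.
NUM_IMAGES = 2_400_000
NUM_QA_PAIRS = 9_500_000
NUM_PDFS = 1_300_000
RENDER_DPI = 150


def qa_pairs_per_image() -> float:
    """Average number of Q/A pairs attached to each image."""
    return NUM_QA_PAIRS / NUM_IMAGES


def images_per_pdf() -> float:
    """Average number of page images produced per source PDF."""
    return NUM_IMAGES / NUM_PDFS


def page_pixels(width_in: float, height_in: float, dpi: int = RENDER_DPI) -> tuple[int, int]:
    """Pixel dimensions of a page of the given size (in inches) rendered at `dpi`."""
    return round(width_in * dpi), round(height_in * dpi)


if __name__ == "__main__":
    print(f"Q/A pairs per image: {qa_pairs_per_image():.2f}")
    print(f"Images per PDF:      {images_per_pdf():.2f}")
    # Assumption: a US Letter page, used only as an example page size.
    print(f"US Letter at {RENDER_DPI} dpi: {page_pixels(8.5, 11.0)} pixels")
```

On these figures, each image carries about four Q/A pairs on average and each PDF contributes just under two page images, which gives a rough sense of the storage and compute involved in reproducing the pipeline.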
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info