LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs — reconsider fine-tuning impact
AI Impact Summary
The study shows that fine-tuning Florence-2 on Docmatix can boost DocVQA performance while degrading alignment with benchmark metrics; based on human feedback, the model was released with Docmatix training only. It highlights the limitations of traditional metrics (CIDEr, ANLS, BLEU) for zero-shot VQA and introduces LAVE, an LLM-based evaluation approach that may align better with human judgment, evidenced by roughly a 50% accuracy gain when answers are scored with LLMs. For teams building VQA pipelines, this suggests that benchmark-driven fine-tuning may misrepresent real-world usefulness, and that adopting LAVE or similar metrics could reshape model selection and release criteria on synthetic, out-of-distribution (OOD) datasets.
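As a minimal sketch of how an LLM-based metric like LAVE can work (not the authors' implementation), the idea is to prompt an LLM to rate a candidate answer against the question and reference answers on a small discrete scale, then map the rating to a score in [0, 1]. The `ask_llm` callable below is a hypothetical stand-in for any LLM API, and the 1-3 scale with the (rating - 1) / 2 mapping is an assumption about the scoring scheme:

```python
def build_judge_prompt(question: str, references: list[str], candidate: str) -> str:
    """Build a LAVE-style judging prompt: the LLM is asked to rate the
    candidate answer from 1 (incorrect) to 3 (correct)."""
    refs = " | ".join(references)
    return (
        "Rate the candidate answer to the question on a scale of 1-3, "
        "where 1 = incorrect, 2 = partially correct, 3 = correct.\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "Rating:"
    )

def rating_to_score(rating: int) -> float:
    """Map a 1-3 rating to a score in [0, 1]."""
    return (rating - 1) / 2

def evaluate(question: str, references: list[str], candidate: str, ask_llm) -> float:
    # ask_llm: hypothetical callable taking a prompt string and
    # returning an integer rating in {1, 2, 3}.
    prompt = build_judge_prompt(question, references, candidate)
    return rating_to_score(ask_llm(prompt))
```

Because the score comes from an LLM's holistic judgment rather than string overlap, paraphrased but correct answers (which CIDEr, ANLS, or BLEU would penalize) can still receive full credit.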
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info