LAVE: Zero-shot VQA Evaluation on Docmatix with LLMs — reconsider fine-tuning impact
AI Impact Summary
The study shows that fine-tuning Florence-2 on Docmatix can boost DocVQA performance while degrading alignment with benchmark metrics; based on human feedback, the model was released with Docmatix training only. It highlights the limitations of traditional metrics (CIDEr, ANLS, BLEU) for zero-shot VQA and introduces LAVE, an LLM-based evaluation approach that may align better with human judgment, evidenced by roughly a 50% accuracy gain when answers are scored with LLMs. For teams building VQA pipelines, this suggests that benchmark-driven fine-tuning may misrepresent real-world usefulness, and that adopting LAVE or similar metrics could reshape model selection and release criteria on synthetic, out-of-distribution (OOD) datasets.
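As a minimal sketch of how an LLM-based metric like LAVE can work (not the authors' implementation), the idea is to prompt an LLM to rate a candidate answer against the question and reference answers on a small discrete scale, then map the rating to a score in [0, 1]. The `ask_llm` callable below is a hypothetical stand-in for any LLM API, and the 1-3 scale with the (rating - 1) / 2 mapping is an assumption about the scoring scheme:

```python
def build_judge_prompt(question: str, references: list[str], candidate: str) -> str:
    """Build a LAVE-style judging prompt: the LLM is asked to rate the
    candidate answer from 1 (incorrect) to 3 (correct)."""
    refs = " | ".join(references)
    return (
        "Rate the candidate answer to the question on a scale of 1-3, "
        "where 1 = incorrect, 2 = partially correct, 3 = correct.\n"
        f"Question: {question}\n"
        f"Reference answers: {refs}\n"
        f"Candidate answer: {candidate}\n"
        "Rating:"
    )

def rating_to_score(rating: int) -> float:
    """Map a 1-3 rating to a score in [0, 1]."""
    return (rating - 1) / 2

def evaluate(question: str, references: list[str], candidate: str, ask_llm) -> float:
    # ask_llm: hypothetical callable taking a prompt string and
    # returning an integer rating in {1, 2, 3}.
    prompt = build_judge_prompt(question, references, candidate)
    return rating_to_score(ask_llm(prompt))
```

Because the score comes from an LLM's holistic judgment rather than string overlap, paraphrased but correct answers (which CIDEr, ANLS, or BLEU would penalize) can still receive full credit.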
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info