Medium | Capability

Finetune Sentence Transformers for Visual Document Retrieval with Qwen/Qwen3-VL-Embedding-2B

Action Required

Finetuning a Sentence Transformers multimodal model on domain-specific data can significantly improve accuracy on Visual Document Retrieval (VDR) tasks, leading to better search results over document images.

AI Impact Summary

This blog post details how to finetune Sentence Transformers' multimodal embedding and reranker models, using Qwen/Qwen3-VL-Embedding-2B for Visual Document Retrieval (VDR), the task of matching text queries to document screenshots. The key result is that finetuning on domain-specific VDR data raises NDCG@10 from 0.888 for the base model to 0.947 for the finetuned model, tomaarsen/Qwen3-VL-Embedding-2B-vdr. This demonstrates the value of specialized models, particularly when documents contain visual content such as images and charts.
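For readers unfamiliar with the metric, the following is a minimal sketch of how NDCG@10 is commonly computed for a single query. It is not the blog post's evaluation code. The function names `dcg_at_k` and `ndcg_at_k` are hypothetical, the sketch uses the linear-gain form of DCG, and it assumes the input list holds a relevance label for every ranked document, so that the ideal ordering can be derived from the same list.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k ranked results.

    `relevances` is ordered by the model's ranking (best first).
    Position i (0-based) is discounted by log2(i + 2).
    """
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=10):
    """DCG@k divided by the DCG@k of the ideal (sorted) ordering.

    Assumption: `relevances` covers all ranked documents, so sorting it
    gives the best achievable ordering. Returns 0.0 if nothing is relevant.
    """
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0
    return dcg_at_k(relevances, k) / ideal_dcg


# One relevant document per query, as is typical when a text query
# is paired with a single matching document screenshot.
top_hit = ndcg_at_k([1, 0, 0, 0])    # relevant document ranked first
third_hit = ndcg_at_k([0, 0, 1, 0])  # relevant document ranked third
print(top_hit, third_hit)
```

With a single relevant document per query, the score depends only on the rank at which that document appears, so the reported value is the average of these per-query scores over the evaluation set.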

Affected Systems

Sentence Transformers

Date: 16 Apr 2026
Change type: capability
Severity: medium
