Build a Domain-Specific Embedding Model in Under a Day: Synthetic Data & Hard Negatives
AI Impact Summary
Building a domain-specific embedding model in under a day is achievable with a synthetic data generation (SDG) pipeline built on NVIDIA's NeMo models. The pipeline generates high-quality question-answer pairs directly from domain documents, which avoids the slow and potentially biased process of manual labeling. Hard negative mining then adds passages that look relevant to a query but do not answer it, which trains the model to separate near-miss passages from correct ones and improves retrieval performance, as demonstrated by Atlassian's 26% Recall@60 improvement on JIRA data.
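To make hard negative mining and Recall@k concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the pipeline described in the article: the function names, the toy vectors, and the `margin` filter for skipping likely false negatives are all hypothetical, and a real setup would score passages with an actual embedding model rather than hand-written vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mine_hard_negatives(query_vec, positive_id, passage_vecs,
                        num_negatives=2, margin=0.95):
    """Pick the non-positive passages most similar to the query.

    Assumption (not from the source): candidates scoring at or above
    margin * positive_score are skipped, since they may be unlabeled
    correct answers rather than true negatives.
    """
    pos_score = cosine(query_vec, passage_vecs[positive_id])
    scored = [
        (cosine(query_vec, vec), pid)
        for pid, vec in passage_vecs.items()
        if pid != positive_id
    ]
    scored.sort(reverse=True)  # most similar first
    kept = [pid for score, pid in scored if score < margin * pos_score]
    return kept[:num_negatives]


def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of relevant passages that appear in the top-k results."""
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)
```

With a query vector, the id of its known positive passage, and a dict mapping passage ids to vectors, `mine_hard_negatives` returns the ids to pair with that query as negatives during fine-tuning. `recall_at_k` with `k=60` corresponds to the Recall@60 metric quoted above.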
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info