Huggy Lingo: ML-based language metadata enrichment for Hugging Face Hub datasets
AI Impact Summary
Huggy Lingo is deploying a capability that infers dataset languages with machine learning: it samples text rows through the dataset viewer API and classifies them with the facebook/fasttext-language-identification model. Predictions are aggregated per dataset and filtered with thresholds (e.g., a language is kept if it accounts for at least 20% of predictions and its average score is above 80%), then mapped to ISO 639-1 codes and proposed as metadata updates through librarian-bots pull requests. The approach aims to make multilingual datasets on the Hugging Face Hub easier to discover by filling gaps in language metadata, reducing manual curation, and improving UI filtering. Accuracy will depend on the quality of each dataset's text and on the sampling strategy.
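The aggregation and filtering step can be illustrated with a minimal sketch. It assumes per-row predictions are already available as (label, score) pairs with fastText-style labels such as `__label__eng_Latn`. The function name `aggregate_languages`, the tiny `LABEL_TO_ISO_639_1` table, and the use of inclusive thresholds are all hypothetical choices made for illustration, not Huggy Lingo's actual implementation.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny mapping from fastText-style labels
# to ISO 639-1 codes. A real mapping would cover far more languages.
LABEL_TO_ISO_639_1 = {
    "__label__eng_Latn": "en",
    "__label__fra_Latn": "fr",
    "__label__deu_Latn": "de",
    "__label__spa_Latn": "es",
}


def aggregate_languages(predictions, min_share=0.2, min_mean_score=0.8):
    """Return sorted ISO 639-1 codes that pass both thresholds.

    predictions: list of (label, score) pairs, one per sampled row.
    min_share: minimum fraction of rows predicted as a language (assumed inclusive).
    min_mean_score: minimum average score for that language (assumed inclusive).
    """
    if not predictions:
        return []

    # Group the scores by predicted label.
    scores_by_label = defaultdict(list)
    for label, score in predictions:
        scores_by_label[label].append(score)

    total = len(predictions)
    kept = []
    for label, scores in scores_by_label.items():
        share = len(scores) / total
        mean_score = sum(scores) / len(scores)
        # Labels without an ISO 639-1 equivalent in the table are skipped.
        if (
            share >= min_share
            and mean_score >= min_mean_score
            and label in LABEL_TO_ISO_639_1
        ):
            kept.append(LABEL_TO_ISO_639_1[label])
    return sorted(kept)


# Made-up predictions for ten sampled rows.
sample = [
    ("__label__eng_Latn", 0.98),
    ("__label__eng_Latn", 0.95),
    ("__label__eng_Latn", 0.91),
    ("__label__eng_Latn", 0.97),
    ("__label__fra_Latn", 0.93),
    ("__label__fra_Latn", 0.90),
    ("__label__fra_Latn", 0.88),
    ("__label__deu_Latn", 0.99),  # confident, but only 10% of rows
    ("__label__spa_Latn", 0.50),  # 20% of rows, but low mean score
    ("__label__spa_Latn", 0.60),
]

print(aggregate_languages(sample))
```

In this made-up sample, German is dropped because it covers too few rows and Spanish because its average score is too low, which shows why both thresholds are needed: a share threshold alone would admit frequent but uncertain predictions, and a score threshold alone would admit confident one-off rows.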
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info