mmBERT: ModernBERT goes multilingual with 1,833 languages and three-phase training
AI Impact Summary
mmBERT is a massively multilingual encoder trained on over 3T tokens across 1,833 languages, using a three-phase training schedule and new techniques to bolster low-resource language learning, with measurable gains over XLM-R on XTREME benchmarks. It builds on ModernBERT-base, switches to the Gemma 2 tokenizer, and pairs a progressive language-inclusion strategy (60 → 110 → 1,833 languages) with an inverse masking-ratio schedule and annealed language sampling to improve multilingual representations. While results are strong across multilingual understanding and retrieval, some structured prediction tasks (e.g., NER, POS) show limitations likely tied to tokenizer boundaries, which indicates where additional tokenization tweaks or fine-tuning may be needed in production use.
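To make the schedule concrete, here is a minimal sketch of temperature-scaled language sampling combined with a per-phase masking ratio, under the assumption that sampling probabilities are proportional to corpus token counts raised to a temperature τ. The function name, the per-phase τ values, and the masking ratios are illustrative assumptions, not values confirmed by the mmBERT release.

```python
import numpy as np

def language_probs(token_counts: dict[str, int], tau: float) -> dict[str, float]:
    """Temperature-scaled sampling distribution over languages.

    tau = 1.0 reproduces the raw corpus proportions; lowering tau
    flattens the distribution, upweighting low-resource languages.
    """
    langs = list(token_counts)
    counts = np.array([token_counts[l] for l in langs], dtype=np.float64)
    probs = counts ** tau          # p_i ∝ count_i^tau
    probs /= probs.sum()           # renormalize to a valid distribution
    return dict(zip(langs, probs))

# Hypothetical three-phase schedule: the language pool grows while the
# sampling temperature anneals toward uniform and the MLM masking ratio
# decays. Exact values here are illustrative assumptions.
PHASES = [
    {"name": "pre-training", "n_langs": 60,   "tau": 0.7, "mask_ratio": 0.30},
    {"name": "mid-training", "n_langs": 110,  "tau": 0.5, "mask_ratio": 0.15},
    {"name": "decay",        "n_langs": 1833, "tau": 0.3, "mask_ratio": 0.05},
]

if __name__ == "__main__":
    corpus = {"en": 1_000_000, "fr": 200_000, "sw": 5_000}  # toy token counts
    for phase in PHASES:
        probs = language_probs(corpus, phase["tau"])
        print(phase["name"], {k: round(v, 3) for k, v in probs.items()})
```

Running the toy example shows the effect of annealing: as τ drops across phases, the low-resource language (sw) receives a progressively larger share of the sampling budget relative to its raw corpus proportion.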
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info