mmBERT: ModernBERT goes multilingual with 1,833 languages and three-phase training
AI Impact Summary
mmBERT is a massively multilingual encoder trained on over 3T tokens across 1,833 languages, using a three-phase training schedule and new techniques to bolster low-resource language learning, with measurable gains over XLM-R on XTREME benchmarks. It builds on ModernBERT-base, switches to the Gemma 2 tokenizer, and pairs a progressive language-inclusion strategy (60 → 110 → 1,833 languages) with an inverse masking-ratio schedule and annealed language sampling to improve multilingual representations. While results are strong across multilingual understanding and retrieval, some structured prediction tasks (e.g., NER, POS) show limitations likely tied to tokenizer boundaries, which indicates where additional tokenization tweaks or fine-tuning may be needed in production use.
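To make the schedule concrete, here is a minimal sketch of temperature-scaled language sampling combined with a per-phase masking ratio, under the assumption that sampling probabilities are proportional to corpus token counts raised to a temperature τ. The function name, the per-phase τ values, and the masking ratios are illustrative assumptions, not values confirmed by the mmBERT release.

```python
import numpy as np

def language_probs(token_counts: dict[str, int], tau: float) -> dict[str, float]:
    """Temperature-scaled sampling distribution over languages.

    tau = 1.0 reproduces the raw corpus proportions; lowering tau
    flattens the distribution, upweighting low-resource languages.
    """
    langs = list(token_counts)
    counts = np.array([token_counts[l] for l in langs], dtype=np.float64)
    probs = counts ** tau          # p_i ∝ count_i^tau
    probs /= probs.sum()           # renormalize to a valid distribution
    return dict(zip(langs, probs))

# Hypothetical three-phase schedule: the language pool grows while the
# sampling temperature anneals toward uniform and the MLM masking ratio
# decays. Exact values here are illustrative assumptions.
PHASES = [
    {"name": "pre-training", "n_langs": 60,   "tau": 0.7, "mask_ratio": 0.30},
    {"name": "mid-training", "n_langs": 110,  "tau": 0.5, "mask_ratio": 0.15},
    {"name": "decay",        "n_langs": 1833, "tau": 0.3, "mask_ratio": 0.05},
]

if __name__ == "__main__":
    corpus = {"en": 1_000_000, "fr": 200_000, "sw": 5_000}  # toy token counts
    for phase in PHASES:
        probs = language_probs(corpus, phase["tau"])
        print(phase["name"], {k: round(v, 3) for k, v in probs.items()})
```

Running the toy example shows the effect of annealing: as τ drops across phases, the low-resource language (sw) receives a progressively larger share of the sampling budget relative to its raw corpus proportion.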
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info