Train EsperBERTo: RoBERTa-like LM from scratch using Transformers and Tokenizers
AI Impact Summary
The post demonstrates end-to-end training of a RoBERTa-like language model from scratch for Esperanto, using the Transformers and Tokenizers libraries. It combines the Esperanto portion of OSCAR with the Leipzig corpus to build a roughly 3 GB pretraining corpus, trains an 84M-parameter model (EsperBERTo-small) with a 52,000-token ByteLevelBPETokenizer vocabulary, and runs masked language modeling via the run_language_modeling.py script, with a custom EsperantoDataset feeding the data. The resulting tokenizer encodes Esperanto's diacritic characters natively and follows RoBERTa-style preprocessing, enabling rapid prototyping for a low-resource language. This can help teams quickly spin up small, testable LMs for downstream tasks such as POS tagging, but reaching production-quality performance will require larger datasets, longer training, and more compute.
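The summary references training a byte-level BPE tokenizer with a 52k vocabulary before pretraining. Below is a minimal sketch of that step using the tokenizers library; the corpus path (./eo_data/) and output prefix (esperberto) are illustrative assumptions, not values taken from the original post. Byte-level BPE is what lets the diacritic characters be encoded natively rather than falling back to unknown tokens.

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Gather the Esperanto text files (path is an assumption for illustration).
paths = [str(p) for p in Path("./eo_data/").glob("**/*.txt")]

# Byte-level BPE tokenizer, as used for RoBERTa-style models.
tokenizer = ByteLevelBPETokenizer()

# Train a 52,000-token vocabulary with RoBERTa's special tokens.
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt for later use during MLM pretraining.
tokenizer.save_model(".", "esperberto")
```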
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info