Train EsperBERTo: RoBERTa-like LM from scratch using Transformers and Tokenizers
AI Impact Summary
The post demonstrates end-to-end training of a RoBERTa-like language model from scratch for Esperanto, using the Transformers and Tokenizers libraries. It combines the Esperanto portion of OSCAR with the Leipzig corpus to build a roughly 3 GB pretraining corpus, trains an 84M-parameter model (EsperBERTo-small) with a 52,000-token ByteLevelBPETokenizer vocabulary, and runs masked language modeling via the run_language_modeling.py script, with a custom EsperantoDataset feeding the data. The resulting tokenizer encodes Esperanto's diacritic characters natively and follows RoBERTa-style preprocessing, enabling rapid prototyping for a low-resource language. This can help teams quickly spin up small, testable LMs for downstream tasks such as POS tagging, but reaching production-quality performance will require larger datasets, longer training, and more compute.
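The summary references training a byte-level BPE tokenizer with a 52k vocabulary before pretraining. Below is a minimal sketch of that step using the tokenizers library; the corpus path (./eo_data/) and output prefix (esperberto) are illustrative assumptions, not values taken from the original post. Byte-level BPE is what lets the diacritic characters be encoded natively rather than falling back to unknown tokens.

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Gather the Esperanto text files (path is an assumption for illustration).
paths = [str(p) for p in Path("./eo_data/").glob("**/*.txt")]

# Byte-level BPE tokenizer, as used for RoBERTa-style models.
tokenizer = ByteLevelBPETokenizer()

# Train a 52,000-token vocabulary with RoBERTa's special tokens.
tokenizer.train(
    files=paths,
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# Write vocab.json and merges.txt for later use during MLM pretraining.
tokenizer.save_model(".", "esperberto")
```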
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info