Port fairseq WMT19 translation system to Transformers for en-ru/ru-en
AI Impact Summary
This change documents the port of a high-quality WMT19 translation system from fairseq to Hugging Face Transformers, including the dual-vocabulary en-ru/ru-en pairs and the later merged-vocabulary case. The workflow relies on a conversion script (src/transformers/convert_fsmt_original_pytorch_checkpoint_to_pytorch.py) that converts a fairseq checkpoint into a PyTorch state_dict compatible with Transformers, together with a local setup that uses torch.hub to fetch the fairseq models. The process requires installing fairseq, mosesdecoder, and fastBPE, and supplying the dictionary and BPE files (dict.en.txt, dict.ru.txt, bpecodes) along with the model4.pt checkpoints. Anyone repeating the port should note that the en-ru pairs use separate source and target vocabularies, while the later de-en/en-de pairs use a merged vocabulary. This difference affects tokenization compatibility and vocabulary alignment during conversion.
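The vocabulary-alignment step can be illustrated with a minimal sketch. It is not the conversion script itself. It assumes that fairseq dictionary files hold one `<token> <count>` pair per line, that the four special tokens take the first four ids, and that BPE continuation markers (`@@`) on the fairseq side are rewritten to end-of-word markers (`</w>`) on the Transformers side. The helper names `load_fairseq_dict`, `rewrite_keys`, and `is_merged` are hypothetical, and the sample lines stand in for real dict.en.txt and dict.ru.txt content.

```python
# Assumption: fairseq reserves ids 0-3 for these special tokens, in this order.
SPECIAL_TOKENS = ["<s>", "<pad>", "</s>", "<unk>"]


def load_fairseq_dict(lines):
    """Build a token -> id mapping from '<token> <count>' lines (hypothetical helper)."""
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The count is only used by fairseq for ordering; ids follow file order.
        token, _count = line.rsplit(" ", 1)
        vocab[token] = len(vocab)
    return vocab


def rewrite_keys(vocab):
    """Turn '@@' continuation markers into '</w>' end-of-word markers (assumed convention)."""
    out = {}
    for token, idx in vocab.items():
        if token in SPECIAL_TOKENS:
            out[token] = idx              # special tokens are kept unchanged
        elif token.endswith("@@"):
            out[token[:-2]] = idx         # word-internal piece: drop the marker
        else:
            out[token + "</w>"] = idx     # word-final piece: add end-of-word marker
    return out


def is_merged(src_vocab, tgt_vocab):
    """True when source and target share one vocabulary (the merged case)."""
    return src_vocab == tgt_vocab


# Toy stand-ins for dict.en.txt and dict.ru.txt.
en_lines = ["the 100", "le@@ 50", "tt@@ 40", "er 30"]
ru_lines = ["и 90", "пись@@ 20", "мо 10"]

src = rewrite_keys(load_fairseq_dict(en_lines))
tgt = rewrite_keys(load_fairseq_dict(ru_lines))

print(src)
print("merged vocabulary:", is_merged(src, tgt))
```

In the separate-vocabulary case the two mappings differ, so source and target vocabularies have to be stored and looked up independently. In the merged case a single mapping would serve both sides.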
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info