Training CodeParrot from scratch with GPT-2 large and Hugging Face tooling
AI Impact Summary
CodeParrot is presented as a GPT-2-large-based code-completion model trained from scratch on a Python corpus derived from a GitHub dump, filtered on Google BigQuery and deduplicated to roughly 50 GB. The pipeline demonstrates tokenizer training from scratch with train_new_from_iterator, streaming dataset ingestion, a GPT-2-large-style config (vocab_size adjusted to the new tokenizer; scale_attn_by_inverse_layer_idx; reorder_and_upcast_attn), and training orchestrated with Hugging Face Accelerate, with artifacts pushed to the Hugging Face Hub. This provides a repeatable blueprint for building in-house code assistants, but it requires substantial compute and storage, plus governance around data provenance and licensing.
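A minimal sketch of the pipeline described above, using current datasets/transformers/accelerate APIs. The dataset repo id ("transformersbook/codeparrot-train"), the "content" column name, the vocabulary size, the learning rate, and the Hub repo id are illustrative assumptions, not values from the source; a real run would iterate the full corpus and add gradient accumulation, LR scheduling, and checkpointing.

```python
from itertools import islice

import torch
from accelerate import Accelerator
from datasets import load_dataset
from transformers import AutoConfig, AutoTokenizer, GPT2LMHeadModel

# 1. Stream the deduplicated Python corpus rather than downloading ~50 GB.
raw_dataset = load_dataset(
    "transformersbook/codeparrot-train",  # assumed repo id
    split="train",
    streaming=True,
)

# 2. Train a code-specific tokenizer from scratch, reusing GPT-2's
#    tokenization pipeline via train_new_from_iterator.
def batch_iterator(num_examples=10_000, batch_size=1_000):
    # A real run feeds the full corpus; the cap keeps this sketch cheap.
    batch = []
    for example in islice(raw_dataset, num_examples):
        batch.append(example["content"])  # column name assumed
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32_768  # vocab size is illustrative
)

# 3. GPT-2-large-sized config with the adjusted vocabulary and the two
#    attention-stability flags named in the summary.
config = AutoConfig.from_pretrained(
    "gpt2-large",
    vocab_size=len(tokenizer),
    scale_attn_by_inverse_layer_idx=True,
    reorder_and_upcast_attn=True,
)
model = GPT2LMHeadModel(config)  # randomly initialized: trained from scratch

# 4. A bare-bones causal-LM loop orchestrated with Accelerate.
accelerator = Accelerator()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
model, optimizer = accelerator.prepare(model, optimizer)

for example in islice(raw_dataset, 4):  # a few smoke-test steps
    tokens = tokenizer(
        example["content"], truncation=True, max_length=1024, return_tensors="pt"
    ).input_ids.to(accelerator.device)
    if tokens.size(1) < 2:
        continue  # need at least two tokens for a shifted LM loss
    loss = model(tokens, labels=tokens).loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

# 5. Push artifacts to the Hub (repo id illustrative):
# model.push_to_hub("my-org/codeparrot")
# tokenizer.push_to_hub("my-org/codeparrot")
```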
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info