Training CodeParrot from scratch with GPT-2 large and Hugging Face tooling
AI Impact Summary
CodeParrot is presented as a GPT-2-large-based code-completion model trained from scratch on a Python corpus derived from a GitHub dump, filtered on Google BigQuery and deduplicated to roughly 50 GB. The pipeline demonstrates tokenizer training from scratch with train_new_from_iterator, streaming dataset ingestion, a GPT-2-large-style config (vocab_size adjusted to the new tokenizer; scale_attn_by_inverse_layer_idx; reorder_and_upcast_attn), and training orchestrated with Hugging Face Accelerate, with artifacts pushed to the Hugging Face Hub. This provides a repeatable blueprint for building in-house code assistants, but it requires substantial compute and storage, plus governance around data provenance and licensing.
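A minimal sketch of the pipeline described above, using current datasets/transformers/accelerate APIs. The dataset repo id ("transformersbook/codeparrot-train"), the "content" column name, the vocabulary size, the learning rate, and the Hub repo id are illustrative assumptions, not values from the source; a real run would iterate the full corpus and add gradient accumulation, LR scheduling, and checkpointing.

```python
from itertools import islice

import torch
from accelerate import Accelerator
from datasets import load_dataset
from transformers import AutoConfig, AutoTokenizer, GPT2LMHeadModel

# 1. Stream the deduplicated Python corpus rather than downloading ~50 GB.
raw_dataset = load_dataset(
    "transformersbook/codeparrot-train",  # assumed repo id
    split="train",
    streaming=True,
)

# 2. Train a code-specific tokenizer from scratch, reusing GPT-2's
#    tokenization pipeline via train_new_from_iterator.
def batch_iterator(num_examples=10_000, batch_size=1_000):
    # A real run feeds the full corpus; the cap keeps this sketch cheap.
    batch = []
    for example in islice(raw_dataset, num_examples):
        batch.append(example["content"])  # column name assumed
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

base_tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer = base_tokenizer.train_new_from_iterator(
    batch_iterator(), vocab_size=32_768  # vocab size is illustrative
)

# 3. GPT-2-large-sized config with the adjusted vocabulary and the two
#    attention-stability flags named in the summary.
config = AutoConfig.from_pretrained(
    "gpt2-large",
    vocab_size=len(tokenizer),
    scale_attn_by_inverse_layer_idx=True,
    reorder_and_upcast_attn=True,
)
model = GPT2LMHeadModel(config)  # randomly initialized: trained from scratch

# 4. A bare-bones causal-LM loop orchestrated with Accelerate.
accelerator = Accelerator()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4)
model, optimizer = accelerator.prepare(model, optimizer)

for example in islice(raw_dataset, 4):  # a few smoke-test steps
    tokens = tokenizer(
        example["content"], truncation=True, max_length=1024, return_tensors="pt"
    ).input_ids.to(accelerator.device)
    if tokens.size(1) < 2:
        continue  # need at least two tokens for a shifted LM loss
    loss = model(tokens, labels=tokens).loss
    accelerator.backward(loss)
    optimizer.step()
    optimizer.zero_grad()

# 5. Push artifacts to the Hub (repo id illustrative):
# model.push_to_hub("my-org/codeparrot")
# tokenizer.push_to_hub("my-org/codeparrot")
```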
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info