Cosmopedia open-sources a large-scale synthetic data pipeline for LLM pre-training (Cosmo-1B, Mixtral-8x7B-Instruct-v0.1)
AI Impact Summary
Cosmopedia outlines an end-to-end, open pipeline for generating billions of tokens of synthetic LLM pre-training data, combining curated sources with web data to reach 25B tokens generated from 30M+ prompts. The approach underscores the heavy compute and prompt-engineering effort required at this scale (hundreds of GPUs on H100-class hardware) and provides a reproducible baseline (Cosmo-1B) for benchmarking pre-training on synthetic data. By releasing the code, the dataset, and a 1B-parameter model, it lowers the barrier to experimentation while introducing considerations around data provenance, licensing, and potential duplication when synthetic prompts are reused at scale.
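For orientation, the sketch below condenses the core generation step: a seed snippet (curated or web-sourced) is wrapped in an instruction prompt and sent to Mixtral-8x7B-Instruct-v0.1 to produce one textbook-style synthetic sample. It is a minimal sketch rather than the released pipeline; the prompt wording, seed text, and sampling parameters are illustrative assumptions, and the actual pipeline distributes millions of such calls across many GPUs.

```python
# Minimal sketch (not the Cosmopedia codebase): expand one seed snippet into a
# textbook-style synthetic sample with an instruct model. The model ID matches the
# generator named above; the prompt template, seed text, and sampling parameters
# are illustrative assumptions, not the released prompt set.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mixtral-8x7B-Instruct-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

# Hypothetical seed snippet; in practice seeds come from curated sources or web data.
seed = "Photosynthesis converts light energy into chemical energy in plants."
prompt = (
    "Here is an extract from a web page:\n"
    f"{seed}\n\n"
    "Write a long, self-contained textbook section for college students that "
    "covers the topic of the extract in depth."
)

# Wrap the prompt in the model's chat template ([INST] ... [/INST] for Mixtral-Instruct).
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}], return_tensors="pt"
).to(model.device)

# Sample one synthetic document and strip the prompt tokens before decoding.
output = model.generate(inputs, max_new_tokens=512, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0][inputs.shape[-1]:], skip_special_tokens=True))
```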
Affected Systems
- Cosmo-1B
- Mixtral-8x7B-Instruct-v0.1
Date
- Not specified
Change type
- capability
Severity
- info