Cosmopedia: an open synthetic data pipeline for LLM pre-training using Mixtral-8x7B-Instruct-v0.1
AI Impact Summary
Cosmopedia describes an open, end-to-end pipeline that generates a 25B-token synthetic dataset for LLM pre-training. The dataset is built with Mixtral-8x7B-Instruct-v0.1 from 30M prompts seeded by curated sources (Stanford, OpenStax, WikiHow) and web data, and was used to train the cosmo-1b model. This demonstrates a scalable alternative to proprietary data for pre-training, but the compute footprint is substantial (hundreds of GPUs), and sustaining low duplication and broad topic coverage requires intensive prompt engineering. For engineering teams, the key implications are reproducibility, licensing considerations for source materials, and governance around synthetic-data quality and bias, along with a migration path away from reliance on web-scale corpora.
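The two mechanics the summary flags as hard to sustain are prompt diversity and deduplication. The Python sketch below is a rough illustration only, not the Cosmopedia codebase: it pairs seed excerpts with varied style templates, generates text with Mixtral-8x7B-Instruct-v0.1 via the huggingface_hub Inference API, and drops near-duplicates with a datasketch MinHash LSH index. The style templates, seed text, and 0.8 similarity threshold are illustrative assumptions, not values from the source, and the sketch assumes the Inference API is available for this model.

```python
# Minimal sketch of the two stages described above (not the actual pipeline):
# (1) seed excerpts -> varied prompts -> instruct-model generations,
# (2) MinHash LSH filtering of near-duplicate generations.
# Requires: pip install huggingface_hub datasketch
from huggingface_hub import InferenceClient
from datasketch import MinHash, MinHashLSH

client = InferenceClient("mistralai/Mixtral-8x7B-Instruct-v0.1")

# Varying the audience/format per seed is one way to widen topic coverage
# while reusing the same source excerpt (hypothetical template set).
STYLES = [
    "a textbook chapter for college students",
    "a blog post for a general audience",
    "a WikiHow-style tutorial",
]

def build_prompt(seed_excerpt: str, style: str) -> str:
    return (
        f"Write {style} based on the following excerpt. "
        f"Stay on topic and do not copy the excerpt verbatim.\n\n{seed_excerpt}"
    )

def minhash(text: str, num_perm: int = 128) -> MinHash:
    # Word-level shingling kept deliberately simple for the sketch.
    m = MinHash(num_perm=num_perm)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

# LSH index over kept documents; 0.8 Jaccard threshold is an assumption.
lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = []

seeds = ["Photosynthesis converts light energy into chemical energy ..."]
for i, seed in enumerate(seeds):
    for j, style in enumerate(STYLES):
        completion = client.text_generation(
            build_prompt(seed, style), max_new_tokens=1024
        )
        m = minhash(completion)
        if lsh.query(m):  # near-duplicate of an already-kept document
            continue
        lsh.insert(f"{i}-{j}", m)
        kept.append(completion)

print(f"kept {len(kept)} unique documents")
```

At the scale described in the summary, a run like this would be sharded across many workers against a dedicated inference cluster, which is where the hundreds-of-GPUs compute footprint comes from.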
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info