BigCode: large-scale near-deduplication to clean training data across BigScience datasets
AI Impact Summary
BigCode is implementing a large-scale near-deduplication workflow to clean training data across major datasets used by BigScience and BigCode. The approach combines document-level deduplication techniques such as MinHash LSH, SimHash, and exact matching to reduce duplicates and benchmark contamination across sources including OpenWebText2, Pile-CC, CC100-XL, and C4. This directly improves data quality, training efficiency, and evaluation reliability for code and language models, and provides a replicable blueprint for datasets such as the BigScience ROOTS corpus and RealNews.
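To make the MinHash LSH step concrete, here is a minimal, self-contained sketch of near-deduplication: documents are shingled into character n-grams, hashed into MinHash signatures, and bucketed by LSH bands so that near-duplicates surface as candidate pairs. The function names, parameters (`k`, `num_perm`, `bands`), and sample documents are illustrative assumptions, not BigCode's actual pipeline, which uses dedicated libraries at much larger scale.

```python
import hashlib

def shingles(text, k=5):
    # Character k-grams used as the document's feature set (assumed k=5).
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(features, num_perm=64):
    # One signature slot per seeded hash function; keep the minimum hash seen.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for f in features))
    return sig

def candidate_pairs(signatures, bands=16):
    # LSH banding: documents sharing any band bucket become candidate duplicates.
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for members in buckets.values():
        for x in members:
            for y in members:
                if x < y:
                    pairs.add((x, y))
    return pairs

# Toy corpus: doc_a and doc_b differ by one character; doc_c is unrelated.
docs = {
    "doc_a": "near-duplicate documents share most of their shingles and hashes",
    "doc_b": "near-duplicate documents share most of their shingles and hashes!",
    "doc_c": "an entirely unrelated sentence about training corpora cleanup",
}
sigs = {name: minhash(shingles(text)) for name, text in docs.items()}
pairs = candidate_pairs(sigs)
```

With high-overlap documents, at least one band of the two signatures matches with near-certainty, so `("doc_a", "doc_b")` lands in `pairs`, while the unrelated document almost never collides; candidate pairs would then be verified (e.g. by exact Jaccard similarity) before removal.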
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info