BigCode: large-scale near-deduplication to clean training data across BigScience datasets
AI Impact Summary
BigCode is implementing a large-scale near-deduplication workflow to clean training data across major datasets used by BigScience and BigCode. The approach combines document-level deduplication techniques such as MinHash LSH, SimHash, and exact matching to reduce duplicates and benchmark contamination across sources including OpenWebText2, Pile-CC, CC100-XL, and C4. This directly improves data quality, training efficiency, and evaluation reliability for code and language models, and provides a replicable blueprint for datasets such as the BigScience ROOTS corpus and RealNews.
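To make the MinHash LSH step concrete, here is a minimal, self-contained sketch of near-deduplication: documents are shingled into character n-grams, hashed into MinHash signatures, and bucketed by LSH bands so that near-duplicates surface as candidate pairs. The function names, parameters (`k`, `num_perm`, `bands`), and sample documents are illustrative assumptions, not BigCode's actual pipeline, which uses dedicated libraries at much larger scale.

```python
import hashlib

def shingles(text, k=5):
    # Character k-grams used as the document's feature set (assumed k=5).
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}

def minhash(features, num_perm=64):
    # One signature slot per seeded hash function; keep the minimum hash seen.
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode(), digest_size=8, salt=salt).digest(),
                "big")
            for f in features))
    return sig

def candidate_pairs(signatures, bands=16):
    # LSH banding: documents sharing any band bucket become candidate duplicates.
    rows = len(next(iter(signatures.values()))) // bands
    buckets = {}
    for doc_id, sig in signatures.items():
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(doc_id)
    pairs = set()
    for members in buckets.values():
        for x in members:
            for y in members:
                if x < y:
                    pairs.add((x, y))
    return pairs

# Toy corpus: doc_a and doc_b differ by one character; doc_c is unrelated.
docs = {
    "doc_a": "near-duplicate documents share most of their shingles and hashes",
    "doc_b": "near-duplicate documents share most of their shingles and hashes!",
    "doc_c": "an entirely unrelated sentence about training corpora cleanup",
}
sigs = {name: minhash(shingles(text)) for name, text in docs.items()}
pairs = candidate_pairs(sigs)
```

With high-overlap documents, at least one band of the two signatures matches with near-certainty, so `("doc_a", "doc_b")` lands in `pairs`, while the unrelated document almost never collides; candidate pairs would then be verified (e.g. by exact Jaccard similarity) before removal.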
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info