BigCode large-scale near-deduplication using MinHashLSH and SimHash to reduce duplicates and contamination
AI Impact Summary
The document describes the large-scale near-deduplication effort behind BigCode, which applies hashing-based similarity detection (MinHashLSH, SimHash) across text and code corpora to reduce duplicates and potential data leakage. It covers the data-quality motivation, benchmark contamination concerns, and dataset-level impacts across major text sources (OpenWebText2, Pile-CC, CC100-XL, MassiveText, RealNews, LM1B, WIKI40B, the BigScience ROOTS Corpus, C4) as well as code-model datasets (InCoder, CodeGen, AlphaCode, PolyCoder, PaLM Coder, CodeParrot, The Stack). For engineers, the implication is a scalable, distributed dedup pipeline built from shingling, multiple hashing passes (SHA-1/MD5 for exact matches, MinHash permutations for near-duplicate signatures), and careful LSH parameterization, plus governance around data quality and privacy.
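To make the pipeline shape concrete, the sketch below shows the core near-dedup loop: token shingles hashed into MinHash signatures and bucketed with LSH. It is a minimal, single-machine illustration assuming the `datasketch` library; the shingle width, permutation count, threshold, and helper names are illustrative assumptions, not the values or code used by the BigCode pipeline.

```python
# Minimal near-deduplication sketch: MinHash signatures over token shingles,
# grouped by LSH. Parameters are illustrative, not BigCode's actual settings.
from datasketch import MinHash, MinHashLSH

NUM_PERM = 128     # number of MinHash permutations per signature
THRESHOLD = 0.7    # approximate Jaccard similarity for LSH candidate pairs
SHINGLE_SIZE = 5   # width of the token n-grams ("shingles")


def shingles(text, k=SHINGLE_SIZE):
    """Yield whitespace-token k-grams from a document."""
    tokens = text.split()
    for i in range(max(len(tokens) - k + 1, 1)):
        yield " ".join(tokens[i:i + k])


def minhash(text):
    """Build a MinHash signature from a document's shingle set."""
    sig = MinHash(num_perm=NUM_PERM)
    for s in shingles(text):
        sig.update(s.encode("utf-8"))
    return sig


def near_duplicates(docs):
    """Return keys of documents that collide with an earlier, kept document."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    flagged = set()
    for key, text in docs.items():
        sig = minhash(text)
        if lsh.query(sig):        # any previously kept doc above the threshold?
            flagged.add(key)
        else:
            lsh.insert(key, sig)  # first occurrence becomes the canonical copy
    return flagged


if __name__ == "__main__":
    corpus = {
        "a": "def add(x, y): return x + y  # tiny helper used in many files",
        "b": "def add(x, y): return x + y  # tiny helper used in many files!",
        "c": "class Node: left = None; right = None",
    }
    print(near_duplicates(corpus))  # likely {'b'}; MinHash is probabilistic
```

In practice the similarity threshold is realized by the LSH band/row split, so the number of permutations and the band configuration are tuned together, and a cheaper exact-duplicate pass (e.g. SHA-1/MD5 over normalized content) typically runs before the MinHash stage.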
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info