BigCode implements large-scale near-deduplication with MinHash LSH
AI Impact Summary
BigCode is implementing large-scale near-deduplication across its datasets, primarily using MinHash LSH for efficiency. The process shingles each document's text, generates a MinHash fingerprint from the shingles, and applies locality-sensitive hashing (LSH) to group and remove near-duplicate documents. Because MinHash with LSH avoids comparing every pair of documents directly, it sharply reduces the computational cost of deduplication on massive corpora such as those used in BigScience and BigCode, and the cleaner data supports faster model training and evaluation.
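As a rough illustration only, not BigCode's actual pipeline, the sketch below walks through the three steps (shingling, MinHash fingerprinting, LSH lookup) using the Python datasketch library; the shingle size, number of permutations, and similarity threshold are illustrative assumptions rather than BigCode's settings.

```python
# Minimal sketch of MinHash-LSH near-deduplication (illustrative parameters only).
from datasketch import MinHash, MinHashLSH

NUM_PERM = 256      # number of hash permutations (assumed, not BigCode's value)
THRESHOLD = 0.85    # Jaccard similarity threshold for "near-duplicate" (assumed)
SHINGLE_SIZE = 5    # word-level shingle length (assumed)

def shingle(text: str, k: int = SHINGLE_SIZE) -> set[str]:
    """Split a document into overlapping k-word shingles."""
    tokens = text.split()
    if len(tokens) < k:
        return {" ".join(tokens)}
    return {" ".join(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}

def fingerprint(text: str) -> MinHash:
    """Build a MinHash fingerprint from the document's shingle set."""
    m = MinHash(num_perm=NUM_PERM)
    for s in shingle(text):
        m.update(s.encode("utf-8"))
    return m

def deduplicate(docs: dict[str, str]) -> list[str]:
    """Return the ids of documents kept after dropping likely near-duplicates."""
    lsh = MinHashLSH(threshold=THRESHOLD, num_perm=NUM_PERM)
    kept = []
    for doc_id, text in docs.items():
        mh = fingerprint(text)
        if lsh.query(mh):          # a previously kept document is likely a near-duplicate
            continue
        lsh.insert(doc_id, mh)     # index this document so later ones are compared against it
        kept.append(doc_id)
    return kept
```

In a sketch like this, the trade-off is between `NUM_PERM` (higher values give more accurate similarity estimates at higher cost) and `THRESHOLD` (lower values remove more aggressively but risk discarding distinct documents); large-scale pipelines typically distribute the fingerprinting and LSH bucketing steps across many workers.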
Affected Systems
Business Impact
Near-deduplication in BigCode will improve model training efficiency and reduce the risk of benchmark contamination by removing near-duplicate documents from the training data.
- Date: not specified
- Change type: capability
- Severity: info