Binary and Scalar Embedding Quantization for Faster Retrieval
AI Impact Summary
This document describes embedding quantization, specifically binary and scalar quantization, as a way to reduce the cost and improve the speed of retrieval systems. The core idea is to store each embedding dimension in fewer bits than the usual 32-bit float. Binary quantization keeps 1 bit per dimension, a 32x reduction in memory and storage, while scalar quantization maps each dimension to an int8 value, a 4x reduction. Binary retrieval compares embeddings with Hamming distance, which is cheap to compute, and a rescoring step over the top candidates recovers most of the lost accuracy, retaining about 96% of retrieval performance.
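The pipeline summarized above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the implementation the summarized document uses. The function names, the zero threshold for binarization, the per-dimension min/max calibration for int8, the `oversample` factor, and the choice to rescore with the float query against sign-unpacked document bits are all assumptions made for the example.

```python
import numpy as np

def binary_quantize(embeddings: np.ndarray) -> np.ndarray:
    """Keep 1 bit per dimension (assumed threshold: 0) and pack 8 bits per byte."""
    bits = (embeddings > 0).astype(np.uint8)
    return np.packbits(bits, axis=-1)

def scalar_quantize(embeddings: np.ndarray, mins: np.ndarray, maxs: np.ndarray) -> np.ndarray:
    """Map each dimension linearly from [min, max] to the int8 range [-128, 127]."""
    scale = (maxs - mins) / 255.0
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant dimensions
    q = np.round((embeddings - mins) / scale) - 128
    return np.clip(q, -128, 127).astype(np.int8)

def hamming_distances(query_bits: np.ndarray, corpus_bits: np.ndarray) -> np.ndarray:
    """Count differing bits between one packed query and every packed corpus row."""
    xor = np.bitwise_xor(corpus_bits, query_bits)
    return np.unpackbits(xor, axis=-1).sum(axis=-1)

def search(query: np.ndarray, corpus_bits: np.ndarray, top_k: int = 3, oversample: int = 4):
    """Shortlist by Hamming distance, then rescore the shortlist with the float query."""
    dim = query.shape[0]
    query_bits = binary_quantize(query[None, :])[0]
    dists = hamming_distances(query_bits, corpus_bits)
    n_candidates = min(len(corpus_bits), top_k * oversample)
    candidates = np.argsort(dists, kind="stable")[:n_candidates]
    # Rescoring (assumed variant): unpack candidate bits to {-1, +1} and take
    # the dot product with the full-precision query.
    signs = np.unpackbits(corpus_bits[candidates], axis=-1, count=dim).astype(np.float32) * 2 - 1
    scores = signs @ query
    order = np.argsort(-scores, kind="stable")[:top_k]
    return candidates[order], scores[order]

# Synthetic data, used only to exercise the functions.
rng = np.random.default_rng(0)
corpus = rng.standard_normal((1000, 64)).astype(np.float32)
corpus_bits = binary_quantize(corpus)
corpus_int8 = scalar_quantize(corpus, corpus.min(axis=0), corpus.max(axis=0))

query = (corpus[42] + 0.01 * rng.standard_normal(64)).astype(np.float32)
ids, scores = search(query, corpus_bits)

print("float32 bytes:", corpus.nbytes)
print("binary bytes: ", corpus_bits.nbytes)
print("int8 bytes:   ", corpus_int8.nbytes)
print("top ids:      ", ids)
```

The byte counts printed at the end make the storage ratios concrete: packing 64 float32 values into 8 bytes gives the 32x figure, and storing them as 64 int8 values gives 4x. The shortlist is deliberately larger than `top_k` so that the rescoring step has room to reorder candidates that Hamming distance ranked imperfectly.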
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info