Embedding quantization for faster, cheaper retrieval in Sentence Transformers and OpenAI embeddings
AI Impact Summary
The article introduces embedding quantization for Sentence Transformers and OpenAI embeddings, presenting binary and scalar (int8) quantization as post-processing steps that shrink embedding footprints and accelerate retrieval. In a 41-million-document demo, binary embeddings searched with Hamming distance, followed by a rescoring pass that uses the original float32 query, can preserve up to ~96% of retrieval accuracy while cutting memory and disk usage by ~32x and speeding up retrieval by up to 32x. Success hinges on adapting the retrieval pipeline to support binary or int8 embeddings, implementing a calibrated scalar quantization path where applicable, and adding a rescoring stage to recover performance; the calibration data and workload characteristics determine the final accuracy.
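The two-stage pattern described above (binary shortlist via Hamming distance, then float32 rescoring) and the calibrated int8 path can be sketched with plain NumPy. This is a minimal illustration, not the article's pipeline: the corpus, dimensions, oversampling factor, and min/max calibration scheme are all assumptions chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical corpus of float32 embeddings (illustrative sizes, not the
# article's 41M-document index).
corpus = rng.standard_normal((1000, 64)).astype(np.float32)
query = rng.standard_normal(64).astype(np.float32)

# Binary quantization: keep only the sign of each dimension (1 bit per dim),
# packed into bytes -> ~32x smaller than float32.
corpus_bin = np.packbits((corpus > 0).astype(np.uint8), axis=1)
query_bin = np.packbits((query > 0).astype(np.uint8))

# Hamming distance = popcount of the XOR between packed bit vectors.
xor = np.bitwise_xor(corpus_bin, query_bin)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)

# Stage 1: shortlist candidates by Hamming distance, oversampling 4x
# (an assumed factor) so rescoring has room to reorder.
k = 10
candidates = np.argsort(hamming)[: 4 * k]

# Stage 2: rescore the shortlist with the original float32 query
# (dot product), recovering most of the accuracy lost to binarization.
scores = corpus[candidates] @ query
top_k = candidates[np.argsort(-scores)[:k]]

# Calibrated scalar (int8) quantization: estimate per-dimension min/max on
# calibration data (here, the corpus itself) and map [lo, hi] -> [-128, 127].
lo, hi = corpus.min(axis=0), corpus.max(axis=0)
corpus_int8 = np.clip(
    np.round((corpus - lo) / (hi - lo) * 255.0 - 128.0), -128, 127
).astype(np.int8)
```

In practice the shortlist size, calibration sample, and rescoring depth are tuned per workload, which is why the summary notes that calibration data drives the final accuracy.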
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info