Blazing Fast SetFit Inference with 🤗 Optimum Intel on Xeon – Quantization
AI Impact Summary
This document details a technique for accelerating SetFit inference on Intel Xeon CPUs using 🤗 Optimum Intel, specifically post-training static quantization. By applying quantization with Intel Neural Compressor (INC), the model's weights and activations are converted to lower precision (INT8), which significantly reduces the memory footprint and accelerates computation by leveraging Intel AVX-512, VNNI, and AMX instructions. This yields a 7.8x inference speedup over the standard PyTorch and Transformers implementation, enabling production deployment of SetFit solutions.
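As a rough illustration of the workflow described above, the sketch below applies Intel Neural Compressor's post-training static quantization to a SetFit model body through Optimum Intel's INCQuantizer. The checkpoint name, calibration dataset, sequence length, and output directory are illustrative placeholders and are not taken from the original post.

```python
# Minimal sketch: post-training static INT8 quantization of a SetFit model
# body with Optimum Intel + Intel Neural Compressor. Checkpoint, dataset,
# and paths are placeholders, not values from the original post.
from datasets import load_dataset
from transformers import AutoModel, AutoTokenizer
from neural_compressor.config import PostTrainingQuantConfig
from optimum.intel import INCQuantizer

model_id = "path/to/finetuned-setfit-model"  # placeholder checkpoint
model = AutoModel.from_pretrained(model_id)  # transformer body of the SetFit model
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Static quantization needs a small calibration set to record activation ranges.
calibration_set = load_dataset("SetFit/sst2", split="train[:100]")  # placeholder data

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", max_length=64, truncation=True)

calibration_set = calibration_set.map(tokenize, batched=True)

# Convert weights and activations to INT8 and save the quantized model.
quantizer = INCQuantizer.from_pretrained(model)
quantizer.quantize(
    quantization_config=PostTrainingQuantConfig(approach="static"),
    calibration_dataset=calibration_set,
    save_directory="setfit-model-int8",
    batch_size=1,
)
```

The quantized body can then be swapped back into the SetFit pipeline for inference; the 7.8x speedup quoted above comes from the original benchmark, not from this sketch.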
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info