Hugging Face Transformers integrates AutoGPTQ for GPTQ-based quantization (8/4/3/2-bit) of LLMs
AI Impact Summary
Integrating AutoGPTQ into 🤗 Transformers enables quantization of large models with the GPTQ algorithm, down to 2-bit precision. The approach optimizes weights per layer/row against a calibration dataset and dequantizes them on the fly at inference, delivering memory reductions of roughly 4x for int4 while keeping FP16-like inference speed for small batch sizes. Supported hardware includes NVIDIA GPUs and ROCm-enabled AMD GPUs. The workflow leverages the Optimum integration to simplify quantizing models such as facebook/opt-125m and loading pre-quantized checkpoints such as TheBloke/Llama-2-7b-Chat-GPTQ. This lowers the barrier to deploying larger transformers in production by reducing memory footprint and enabling broader sharing of quantized models via the Hugging Face Hub.
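As a rough sketch of that workflow (assuming the optimum and auto-gptq packages are installed alongside transformers and accelerate), quantizing a small model and loading a pre-quantized checkpoint from the Hub might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Quantize a small model (facebook/opt-125m) to 4-bit using a calibration dataset.
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)  # calibration data drives the per-layer optimization
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Load an already-quantized checkpoint shared on the Hub;
# weights are dequantized on the fly during inference.
chat_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="auto",
)
```

The exact calibration dataset and bit width are illustrative; lower bit widths (3/2-bit) trade additional memory savings for accuracy.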
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info