Hugging Face Transformers integrates AutoGPTQ for GPTQ-based quantization (8/4/3/2-bit) of LLMs
AI Impact Summary
Integrating AutoGPTQ into 🤗 Transformers enables quantization of large models with the GPTQ algorithm, down to 2-bit precision. The approach optimizes weights per layer/row against a calibration dataset and dequantizes them on the fly at inference, delivering memory reductions of roughly 4x for int4 while keeping FP16-like inference speed for small batch sizes. Supported hardware includes NVIDIA GPUs and ROCm-enabled AMD GPUs. The workflow leverages the Optimum integration to simplify quantizing models such as facebook/opt-125m and loading pre-quantized checkpoints such as TheBloke/Llama-2-7b-Chat-GPTQ. This lowers the barrier to deploying larger transformers in production by reducing memory footprint and enabling broader sharing of quantized models via the Hugging Face Hub.
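As a rough sketch of that workflow (assuming the optimum and auto-gptq packages are installed alongside transformers and accelerate), quantizing a small model and loading a pre-quantized checkpoint from the Hub might look like this:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

# Quantize a small model (facebook/opt-125m) to 4-bit using a calibration dataset.
model_id = "facebook/opt-125m"
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)  # calibration data drives the per-layer optimization
quantized_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

# Load an already-quantized checkpoint shared on the Hub;
# weights are dequantized on the fly during inference.
chat_model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-7b-Chat-GPTQ",
    device_map="auto",
)
```

The exact calibration dataset and bit width are illustrative; lower bit widths (3/2-bit) trade additional memory savings for accuracy.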
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info