Quanto: PyTorch quantization backend for Optimum
AI Impact Summary
Quanto introduces a PyTorch quantization backend for Optimum that reduces model size and compute cost by using low-precision data types such as int8. This is particularly relevant for deploying Large Language Models on resource-constrained or consumer hardware. The backend provides a streamlined workflow, with features such as dynamic and static quantization, device support (CUDA, MPS), and automatic insertion of quantization stubs, simplifying the adaptation of models for efficient inference.
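The int8 approach described above can be sketched in plain Python as affine quantization: floats are mapped to 8-bit integers via a scale and zero-point, and dequantized back for computation. This is a minimal illustration of the general technique, not Quanto's API; the function names here (`quantize_int8`, `dequantize_int8`) are hypothetical, and the real backend operates on PyTorch tensors with optimized kernels.

```python
def quantize_int8(values):
    """Quantize a list of floats to int8 using an affine (asymmetric) mapping.

    Illustrative sketch only; Quanto itself works on torch tensors.
    """
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0  # guard against constant input
    zero_point = round(-lo / scale) - 128  # maps lo close to -128
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    """Recover approximate float values from the int8 codes."""
    return [(qi - zero_point) * scale for qi in q]

# Each dequantized value differs from the original by at most one
# quantization step (the scale), which is the source of the size/accuracy
# trade-off that int8 backends exploit.
weights = [-1.5, -0.2, 0.0, 0.7, 1.5]
q, scale, zp = quantize_int8(weights)
approx = dequantize_int8(q, scale, zp)
```

Dynamic quantization computes these scales on the fly from observed values, while static quantization fixes them ahead of time from calibration data.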
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info