Memory-efficient Diffusion Transformers with Quanto and Diffusers — FP8 quantization
AI Impact Summary
This post details a technique for significantly reducing the memory footprint of Transformer-based diffusion models, such as PixArt-Sigma, using Quanto's quantization utilities within the Diffusers library. By quantizing the model's parameters to FP8 or INT8, the authors achieve substantial memory savings (up to 70%) with minimal impact on inference latency, particularly when combined with techniques like horizontal attention fusion and bfloat16 precision. The ability to quantize individual text encoders further reduces memory usage, offering a practical path to running these computationally intensive models on consumer GPUs.
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info