Optimizing Stable Diffusion on Intel CPUs with NNCF and OpenVINO (ToMe, QAT)
AI Impact Summary
Intel-focused optimization of Stable Diffusion via OpenVINO, NNCF, and Diffusers demonstrates that CPU-bound inference can achieve GPU-like performance when a layered optimization stack is used. Converting to OpenVINO FP32 yields a ~1.9x latency improvement over PyTorch, 8-bit quantization raises the cumulative speedup to ~3.9x, and stacking Token Merging (ToMe) with quantization reaches ~5.1x faster inference without increasing the memory footprint. A key caveat is that post-training 8-bit quantization alone does not preserve accuracy for Stable Diffusion; quantization-aware training (QAT) with EMA and knowledge distillation is required to recover quality, and Token Merging must be adapted to work with OpenVINO. This workflow makes CPU-only Stable Diffusion viable on edge devices and CPU servers, enabling cheaper deployments with lower memory usage and faster prompt-to-image turnaround.
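As a minimal sketch of the OpenVINO path described above, the pipeline can be loaded through Optimum Intel's `OVStableDiffusionPipeline`; the model ID and prompt here are placeholder assumptions, not the article's exact configuration, and a quantized or ToMe-optimized checkpoint would be loaded the same way once exported.

```python
# Minimal sketch: exporting a Stable Diffusion checkpoint to OpenVINO IR
# and running CPU-only inference via Optimum Intel. The model ID and
# prompt are placeholders.
from optimum.intel import OVStableDiffusionPipeline

model_id = "runwayml/stable-diffusion-v1-5"  # assumed base checkpoint

# export=True converts the PyTorch weights to OpenVINO IR on the fly;
# a pre-converted (e.g. 8-bit quantized) model loads without it.
pipeline = OVStableDiffusionPipeline.from_pretrained(model_id, export=True)
pipeline.compile()  # compile up front so the first call is not penalized

image = pipeline("sailing ship in a storm by Rembrandt").images[0]
image.save("result.png")
```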
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info