StarCoder on Intel Xeon: 8/4-bit quantization and speculative decoding with IPEX
AI Impact Summary
Intel Xeon optimization of StarCoder demonstrates significant inference acceleration through 8-bit and 4-bit quantization (including SmoothQuant and groupwise weight quantization), with assisted generation (speculative decoding) mitigating the memory-bandwidth pressure of autoregressive decoding. Results show Q8-StarCoder delivering roughly 2.19x faster time to first token (TTFT) and 2.20x faster time per output token (TPOT), while 4-bit quantization yields a ~3.35x TPOT speedup at the cost of slower TTFT, since weights must be dequantized before compute. These paths depend on PyTorch 2.0 and Intel Extension for PyTorch (IPEX) on 4th-generation Xeon processors with Advanced Matrix Extensions (AMX); deployment should account for calibration requirements and the latency/throughput balance appropriate to your code-generation workloads.
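To make the groupwise 4-bit scheme concrete, here is a minimal, self-contained sketch of symmetric per-group weight quantization in plain Python. It is purely illustrative (the group size, helper names, and symmetric rounding scheme are assumptions, not the IPEX implementation): each group of weights gets its own scale, which keeps quantization error bounded by half a scale step, and the per-group dequantization before each matmul is the overhead the summary attributes to slower TTFT.

```python
def quantize_group(weights, n_bits=4):
    """Symmetric quantization of one weight group to signed n-bit ints."""
    qmax = 2 ** (n_bits - 1) - 1                 # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def quantize_groupwise(weights, group_size=4, n_bits=4):
    """Split weights into fixed-size groups, each with its own scale."""
    return [quantize_group(weights[i:i + group_size], n_bits)
            for i in range(0, len(weights), group_size)]

def dequantize_groupwise(groups):
    """Reconstruct float weights: scale * int per group (done at inference)."""
    out = []
    for q, scale in groups:
        out.extend(v * scale for v in q)
    return out

# Toy weight vector with very different magnitudes across groups.
w = [0.1, -0.2, 0.05, 0.3, 2.0, -1.5, 0.7, 0.0]
restored = dequantize_groupwise(quantize_groupwise(w, group_size=4))
max_err = max(abs(a - b) for a, b in zip(w, restored))
```

Because each group's scale tracks its own dynamic range, the small-magnitude group is not crushed by the large-magnitude one, which is the main accuracy argument for groupwise over per-tensor quantization.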
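The assisted-generation idea can also be sketched at the token level. The toy below (every name and both "models" are hypothetical stand-ins, not the StarCoder/IPEX code path): a cheap draft model proposes k tokens per step and the expensive target model verifies them, accepting the longest matching prefix, so the large model runs far fewer sequential steps while the output stays identical to its own greedy decoding. In a real engine the verification is a single batched forward pass, not a Python loop.

```python
def assisted_generate(target_next, draft_next, prompt, max_new_tokens=8, k=4):
    """Greedy speculative decoding against deterministic toy models.

    target_next / draft_next map a token sequence to its next token.
    """
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # Draft model cheaply proposes up to k candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(min(k, max_new_tokens - produced)):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target model verifies; keep the longest matching prefix.
        accepted, ctx = 0, list(tokens)
        for t in draft:
            if target_next(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        tokens = ctx
        produced += accepted
        # On a mismatch, emit the target's own token so output always
        # equals pure target-model greedy decoding (progress guaranteed).
        if accepted < len(draft) and produced < max_new_tokens:
            tokens.append(target_next(tokens))
            produced += 1
    return tokens

# Toy models: target counts up by 1; draft agrees except every 3rd token.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if len(seq) % 3 else seq[-1] + 2
out = assisted_generate(target, draft, [0], max_new_tokens=5)  # → [0, 1, 2, 3, 4, 5]
```

This is why assisted generation helps on bandwidth-bound CPUs: each accepted draft token is one fewer full-weight pass through the large model, trading a little extra compute for many fewer memory-bound sequential steps.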
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info