StarCoder on Intel Xeon: 8/4-bit quantization and speculative decoding with IPEX
AI Impact Summary
Intel Xeon optimization of StarCoder demonstrates significant inference acceleration through 8-bit and 4-bit quantization (including SmoothQuant and groupwise weight quantization), with assisted generation (speculative decoding) mitigating the memory-bandwidth pressure of autoregressive decoding. Results show Q8-StarCoder delivering roughly 2.19x faster time to first token (TTFT) and 2.20x faster time per output token (TPOT), while 4-bit quantization yields a ~3.35x TPOT speedup at the cost of slower TTFT, since weights must be dequantized before compute. These paths depend on PyTorch 2.0 and Intel Extension for PyTorch (IPEX) on 4th-generation Xeon processors with Advanced Matrix Extensions (AMX); deployment should account for calibration requirements and the latency/throughput balance appropriate to your code-generation workloads.
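To make the groupwise 4-bit scheme concrete, here is a minimal, self-contained sketch of symmetric per-group weight quantization in plain Python. It is purely illustrative (the group size, helper names, and symmetric rounding scheme are assumptions, not the IPEX implementation): each group of weights gets its own scale, which keeps quantization error bounded by half a scale step, and the per-group dequantization before each matmul is the overhead the summary attributes to slower TTFT.

```python
def quantize_group(weights, n_bits=4):
    """Symmetric quantization of one weight group to signed n-bit ints."""
    qmax = 2 ** (n_bits - 1) - 1                 # 7 for signed 4-bit
    scale = max(abs(w) for w in weights) / qmax or 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale

def quantize_groupwise(weights, group_size=4, n_bits=4):
    """Split weights into fixed-size groups, each with its own scale."""
    return [quantize_group(weights[i:i + group_size], n_bits)
            for i in range(0, len(weights), group_size)]

def dequantize_groupwise(groups):
    """Reconstruct float weights: scale * int per group (done at inference)."""
    out = []
    for q, scale in groups:
        out.extend(v * scale for v in q)
    return out

# Toy weight vector with very different magnitudes across groups.
w = [0.1, -0.2, 0.05, 0.3, 2.0, -1.5, 0.7, 0.0]
restored = dequantize_groupwise(quantize_groupwise(w, group_size=4))
max_err = max(abs(a - b) for a, b in zip(w, restored))
```

Because each group's scale tracks its own dynamic range, the small-magnitude group is not crushed by the large-magnitude one, which is the main accuracy argument for groupwise over per-tensor quantization.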
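The assisted-generation idea can also be sketched at the token level. The toy below (every name and both "models" are hypothetical stand-ins, not the StarCoder/IPEX code path): a cheap draft model proposes k tokens per step and the expensive target model verifies them, accepting the longest matching prefix, so the large model runs far fewer sequential steps while the output stays identical to its own greedy decoding. In a real engine the verification is a single batched forward pass, not a Python loop.

```python
def assisted_generate(target_next, draft_next, prompt, max_new_tokens=8, k=4):
    """Greedy speculative decoding against deterministic toy models.

    target_next / draft_next map a token sequence to its next token.
    """
    tokens = list(prompt)
    produced = 0
    while produced < max_new_tokens:
        # Draft model cheaply proposes up to k candidate tokens.
        draft, ctx = [], list(tokens)
        for _ in range(min(k, max_new_tokens - produced)):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # Target model verifies; keep the longest matching prefix.
        accepted, ctx = 0, list(tokens)
        for t in draft:
            if target_next(ctx) != t:
                break
            accepted += 1
            ctx.append(t)
        tokens = ctx
        produced += accepted
        # On a mismatch, emit the target's own token so output always
        # equals pure target-model greedy decoding (progress guaranteed).
        if accepted < len(draft) and produced < max_new_tokens:
            tokens.append(target_next(tokens))
            produced += 1
    return tokens

# Toy models: target counts up by 1; draft agrees except every 3rd token.
target = lambda seq: seq[-1] + 1
draft = lambda seq: seq[-1] + 1 if len(seq) % 3 else seq[-1] + 2
out = assisted_generate(target, draft, [0], max_new_tokens=5)  # → [0, 1, 2, 3, 4, 5]
```

This is why assisted generation helps on bandwidth-bound CPUs: each accepted draft token is one fewer full-weight pass through the large model, trading a little extra compute for many fewer memory-bound sequential steps.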
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info