Differential Transformer V2: faster inference with no custom kernels and improved training stability
AI Impact Summary
DIFF V2 is an upgraded variant of the Differential Transformer designed to deliver faster, kernel-free decoding and improved training stability for production-scale LLMs. It doubles the number of query heads while keeping the number of key-value heads fixed, adds token-specific lambda parameters, and runs on stock FlashAttention kernels, matching the decoding speed of the baseline Transformer without any custom kernels. Early pretraining experiments on dense models and a 30A3 MoE, trained on trillions of tokens at learning rates of 6e-4 to 1e-3, show lower LM loss and fewer gradient spikes, indicating better numerical stability at scale. For long sequences, the implementation recommends YOCO to linearize prefill and suggests comparing against a Transformer with an equivalent query dimension to quantify the resource trade-offs.
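To make the head arithmetic concrete, below is a minimal PyTorch sketch of a DIFF-V2-style attention layer based only on the summary above: two query groups per key-value head, a token-specific lambda, and two ordinary softmax attention calls whose outputs are subtracted. The class name `DiffAttnV2Sketch`, the `lambda_proj` projection, and the sigmoid normalization of lambda are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttnV2Sketch(nn.Module):
    """Sketch of DIFF-V2-style attention: query heads are doubled
    (two query groups per key-value head) while K/V heads stay fixed,
    and a token-specific lambda weights the subtracted attention map.
    The lambda parameterization here is a hypothetical choice."""

    def __init__(self, d_model: int, num_kv_heads: int):
        super().__init__()
        assert d_model % num_kv_heads == 0
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_kv_heads
        # Two query groups -> 2x query parameters; K/V projections unchanged.
        self.q_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Hypothetical token-specific lambda: one scalar per token per head.
        self.lambda_proj = nn.Linear(d_model, num_kv_heads, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        h, d = self.num_kv_heads, self.head_dim
        # Shape to (batch, heads, time, head_dim).
        q = self.q_proj(x).view(b, t, 2 * h, d).transpose(1, 2)
        k = self.k_proj(x).view(b, t, h, d).transpose(1, 2)
        v = self.v_proj(x).view(b, t, h, d).transpose(1, 2)
        q1, q2 = q[:, :h], q[:, h:]
        # Two standard causal attention calls: each one is plain softmax
        # attention, so FlashAttention kernels apply without modification.
        a1 = F.scaled_dot_product_attention(q1, k, v, is_causal=True)
        a2 = F.scaled_dot_product_attention(q2, k, v, is_causal=True)
        # Token-specific lambda weights the subtracted attention output.
        lam = torch.sigmoid(self.lambda_proj(x))   # (b, t, h)
        lam = lam.transpose(1, 2).unsqueeze(-1)    # (b, h, t, 1)
        out = a1 - lam * a2
        out = out.transpose(1, 2).reshape(b, t, h * d)
        return self.o_proj(out)
```

Because each branch is an ordinary causal attention call over the same shared K/V, stock FlashAttention/SDPA kernels and the usual KV cache apply unchanged, which is what the summary means by decoding at baseline speed with no custom kernels.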
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info