Differential Transformer V2: faster inference with no custom kernels and improved training stability
AI Impact Summary
DIFF V2 is an upgraded variant of the Differential Transformer designed to deliver faster, kernel-free decoding and improved training stability for production-scale LLMs. It doubles the number of query heads while keeping the number of key-value heads fixed, adds token-specific lambda parameters, and runs on stock FlashAttention kernels, matching the decoding speed of the baseline Transformer without any custom kernels. Early pretraining experiments on dense models and a 30A3 MoE, trained on trillions of tokens at learning rates of 6e-4 to 1e-3, show lower LM loss and fewer gradient spikes, indicating better numerical stability at scale. For long sequences, the implementation recommends YOCO to linearize prefill and suggests comparing against a Transformer with an equivalent query dimension to quantify the resource trade-offs.
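To make the head arithmetic concrete, below is a minimal PyTorch sketch of a DIFF-V2-style attention layer based only on the summary above: two query groups per key-value head, a token-specific lambda, and two ordinary softmax attention calls whose outputs are subtracted. The class name `DiffAttnV2Sketch`, the `lambda_proj` projection, and the sigmoid normalization of lambda are illustrative assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttnV2Sketch(nn.Module):
    """Sketch of DIFF-V2-style attention: query heads are doubled
    (two query groups per key-value head) while K/V heads stay fixed,
    and a token-specific lambda weights the subtracted attention map.
    The lambda parameterization here is a hypothetical choice."""

    def __init__(self, d_model: int, num_kv_heads: int):
        super().__init__()
        assert d_model % num_kv_heads == 0
        self.num_kv_heads = num_kv_heads
        self.head_dim = d_model // num_kv_heads
        # Two query groups -> 2x query parameters; K/V projections unchanged.
        self.q_proj = nn.Linear(d_model, 2 * d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Hypothetical token-specific lambda: one scalar per token per head.
        self.lambda_proj = nn.Linear(d_model, num_kv_heads, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        h, d = self.num_kv_heads, self.head_dim
        # Shape to (batch, heads, time, head_dim).
        q = self.q_proj(x).view(b, t, 2 * h, d).transpose(1, 2)
        k = self.k_proj(x).view(b, t, h, d).transpose(1, 2)
        v = self.v_proj(x).view(b, t, h, d).transpose(1, 2)
        q1, q2 = q[:, :h], q[:, h:]
        # Two standard causal attention calls: each one is plain softmax
        # attention, so FlashAttention kernels apply without modification.
        a1 = F.scaled_dot_product_attention(q1, k, v, is_causal=True)
        a2 = F.scaled_dot_product_attention(q2, k, v, is_causal=True)
        # Token-specific lambda weights the subtracted attention output.
        lam = torch.sigmoid(self.lambda_proj(x))   # (b, t, h)
        lam = lam.transpose(1, 2).unsqueeze(-1)    # (b, h, t, 1)
        out = a1 - lam * a2
        out = out.transpose(1, 2).reshape(b, t, h * d)
        return self.o_proj(out)
```

Because each branch is an ordinary causal attention call over the same shared K/V, stock FlashAttention/SDPA kernels and the usual KV cache apply unchanged, which is what the summary means by decoding at baseline speed with no custom kernels.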
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info