Intel Gaudi: Accelerated assisted generation with Speculative Sampling in Optimum Habana (transformers.generate)
AI Impact Summary
Intel Gaudi now supports accelerated text generation via speculative sampling (assisted generation), integrated into Optimum Habana to speed up inference for large transformer models. The generate() API gains an optional assistant_model parameter: a smaller draft model proposes candidate tokens that the target model then verifies, with each model keeping its own KV cache. This can deliver roughly 2x speedups for large models, reducing latency and infrastructure/power costs on Gaudi-based deployments, though it requires adopting the assisted-generation workflow and managing both caches.
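The draft-propose / target-verify loop can be sketched in miniature with greedy, deterministic stand-in "models" (plain functions). The names draft_next, target_next, and speculative_generate below are illustrative only, not part of the Optimum Habana API; in practice the same idea is invoked through model.generate(..., assistant_model=draft_model).

```python
def draft_next(tokens):
    # Fast draft model: cheap guess at the next token (toy rule).
    return (tokens[-1] + 1) % 50

def target_next(tokens):
    # Slow target model: the authoritative prediction (toy rule that
    # mostly agrees with the draft, but diverges after multiples of 7).
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else 0

def speculative_generate(prompt, num_tokens, k=4):
    """Generate num_tokens tokens: the draft proposes k tokens per step,
    the target keeps the longest agreeing prefix plus one of its own."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft model proposes k tokens autoregressively.
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposals in order; in the real
        # setting this verification is a single batched forward pass.
        ctx = list(tokens)
        for t in proposed:
            if target_next(ctx) != t:
                break
            tokens.append(t)
            ctx.append(t)
        # 3. The target always contributes one token itself, so progress
        # is guaranteed even when every draft token is rejected.
        tokens.append(target_next(tokens))
    return tokens[len(prompt):][:num_tokens]
```

With greedy decoding the loop reproduces the target-only output exactly; the speedup in the real setting comes from verifying all k draft tokens with one target forward pass instead of k sequential ones.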
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info