Intel Gaudi: Accelerated assisted generation with Speculative Sampling in Optimum Habana (transformers.generate)
AI Impact Summary
Intel Gaudi now supports accelerated text generation via speculative sampling (assisted generation), integrated into Optimum Habana to speed up inference for large transformer models. The generate() API gains an optional assistant_model parameter: a smaller draft model proposes candidate tokens that the target model then verifies, with each model keeping its own KV cache. This can deliver roughly 2x speedups for large models, reducing latency and infrastructure/power costs on Gaudi-based deployments, though it requires adopting the assisted-generation workflow and managing both caches.
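The draft-propose / target-verify loop can be sketched in miniature with greedy, deterministic stand-in "models" (plain functions). The names draft_next, target_next, and speculative_generate below are illustrative only, not part of the Optimum Habana API; in practice the same idea is invoked through model.generate(..., assistant_model=draft_model).

```python
def draft_next(tokens):
    # Fast draft model: cheap guess at the next token (toy rule).
    return (tokens[-1] + 1) % 50

def target_next(tokens):
    # Slow target model: the authoritative prediction (toy rule that
    # mostly agrees with the draft, but diverges after multiples of 7).
    return (tokens[-1] + 1) % 50 if tokens[-1] % 7 else 0

def speculative_generate(prompt, num_tokens, k=4):
    """Generate num_tokens tokens: the draft proposes k tokens per step,
    the target keeps the longest agreeing prefix plus one of its own."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < num_tokens:
        # 1. Draft model proposes k tokens autoregressively.
        proposed, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_next(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. Target model verifies the proposals in order; in the real
        # setting this verification is a single batched forward pass.
        ctx = list(tokens)
        for t in proposed:
            if target_next(ctx) != t:
                break
            tokens.append(t)
            ctx.append(t)
        # 3. The target always contributes one token itself, so progress
        # is guaranteed even when every draft token is rejected.
        tokens.append(target_next(tokens))
    return tokens[len(prompt):][:num_tokens]
```

With greedy decoding the loop reproduces the target-only output exactly; the speedup in the real setting comes from verifying all k draft tokens with one target forward pass instead of k sequential ones.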
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info