Together AI introduces cache-aware prefill–decode disaggregation (CPD) for faster LLM serving
Action Required
Organizations serving long-context LLM applications should evaluate CPD: it can significantly reduce latency and increase throughput, improving both user experience and operational efficiency.
AI Impact Summary
Together AI has launched cache-aware prefill–decode disaggregation (CPD), a new capability designed to improve the performance of long-context LLM serving. By separating warm and cold inference workloads and backing them with a three-level KV-cache hierarchy, CPD achieves up to 40% higher throughput and sharply reduces time-to-first-token. The architecture is most effective under high load when many requests share common context: keeping cold prefills off shared capacity prevents them from saturating it and improves overall system scalability.
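The announcement does not publish implementation details, so the sketch below is only illustrative: a minimal Python model of the two ideas named above, a three-level KV-cache hierarchy and cache-aware routing that keeps cold prefills away from warm capacity. Every name in it (`CacheTier`, `KVCacheHierarchy`, `route_request`) is hypothetical, not Together AI's API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch; not Together AI's implementation.

class CacheTier(Enum):
    GPU_HBM = 1    # hottest tier: KV blocks resident in accelerator memory
    HOST_DRAM = 2  # warm tier: KV blocks offloaded to CPU memory
    REMOTE = 3     # coldest tier: KV blocks in remote or disk storage

@dataclass
class Request:
    prompt_tokens: list[int]

class KVCacheHierarchy:
    """Tracks which tier, if any, holds the KV blocks for a prompt prefix."""

    def __init__(self) -> None:
        self.index: dict[int, CacheTier] = {}

    def lookup(self, prompt_tokens: list[int]) -> CacheTier | None:
        # Simplified: hash the whole prompt. A real system would match
        # block-aligned prefixes and return the longest cached hit.
        return self.index.get(hash(tuple(prompt_tokens)))

    def record(self, prompt_tokens: list[int], tier: CacheTier) -> None:
        self.index[hash(tuple(prompt_tokens))] = tier

def route_request(req: Request, cache: KVCacheHierarchy) -> str:
    """Cache-aware routing: warm requests and cold prefills use separate pools."""
    tier = cache.lookup(req.prompt_tokens)
    if tier is not None:
        # The prefix KV already exists somewhere in the hierarchy, so the
        # prefill is cheap; serve the request from the warm pool.
        return "warm_pool"
    # No cached prefix: a full cold prefill is required. Route it to a
    # dedicated pool so it cannot saturate shared warm capacity.
    return "cold_prefill_pool"

# Usage: a repeated shared prefix hits the warm pool on the second request.
cache = KVCacheHierarchy()
shared_prefix = [1, 2, 3, 4]
print(route_request(Request(shared_prefix), cache))  # -> cold_prefill_pool
cache.record(shared_prefix, CacheTier.GPU_HBM)
print(route_request(Request(shared_prefix), cache))  # -> warm_pool
```

In this model the scheduling decision reduces to a single cache lookup: a request whose prefix KV blocks already exist anywhere in the hierarchy is cheap to prefill and stays on the warm pool, while a full cold prefill is isolated so it cannot crowd out latency-sensitive decode traffic.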
Affected Systems
- Date: not specified
- Change type: capability
- Severity: high