Together AI introduces cache-aware prefill–decode disaggregation (CPD) for faster LLM serving
Action Required
Organizations serving long-context LLM applications should evaluate CPD: it can significantly reduce latency and increase throughput, improving both user experience and operational efficiency.
AI Impact Summary
Together AI has launched cache-aware prefill–decode disaggregation (CPD), a new capability designed to improve the performance of long-context LLM serving. By separating warm and cold inference workloads and backing them with a three-level KV-cache hierarchy, CPD achieves up to 40% higher throughput and sharply reduces time-to-first-token. The architecture is most effective under high load when many requests share common context: keeping cold prefills off shared capacity prevents them from saturating it and improves overall system scalability.
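The announcement does not publish implementation details, so the sketch below is only illustrative: a minimal Python model of the two ideas named above, a three-level KV-cache hierarchy and cache-aware routing that keeps cold prefills away from warm capacity. Every name in it (`CacheTier`, `KVCacheHierarchy`, `route_request`) is hypothetical, not Together AI's API.

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical sketch; not Together AI's implementation.

class CacheTier(Enum):
    GPU_HBM = 1    # hottest tier: KV blocks resident in accelerator memory
    HOST_DRAM = 2  # warm tier: KV blocks offloaded to CPU memory
    REMOTE = 3     # coldest tier: KV blocks in remote or disk storage

@dataclass
class Request:
    prompt_tokens: list[int]

class KVCacheHierarchy:
    """Tracks which tier, if any, holds the KV blocks for a prompt prefix."""

    def __init__(self) -> None:
        self.index: dict[int, CacheTier] = {}

    def lookup(self, prompt_tokens: list[int]) -> CacheTier | None:
        # Simplified: hash the whole prompt. A real system would match
        # block-aligned prefixes and return the longest cached hit.
        return self.index.get(hash(tuple(prompt_tokens)))

    def record(self, prompt_tokens: list[int], tier: CacheTier) -> None:
        self.index[hash(tuple(prompt_tokens))] = tier

def route_request(req: Request, cache: KVCacheHierarchy) -> str:
    """Cache-aware routing: warm requests and cold prefills use separate pools."""
    tier = cache.lookup(req.prompt_tokens)
    if tier is not None:
        # The prefix KV already exists somewhere in the hierarchy, so the
        # prefill is cheap; serve the request from the warm pool.
        return "warm_pool"
    # No cached prefix: a full cold prefill is required. Route it to a
    # dedicated pool so it cannot saturate shared warm capacity.
    return "cold_prefill_pool"

# Usage: a repeated shared prefix hits the warm pool on the second request.
cache = KVCacheHierarchy()
shared_prefix = [1, 2, 3, 4]
print(route_request(Request(shared_prefix), cache))  # -> cold_prefill_pool
cache.record(shared_prefix, CacheTier.GPU_HBM)
print(route_request(Request(shared_prefix), cache))  # -> warm_pool
```

In this model the scheduling decision reduces to a single cache lookup: a request whose prefix KV blocks already exist anywhere in the hierarchy is cheap to prefill and stays on the warm pool, while a full cold prefill is isolated so it cannot crowd out latency-sensitive decode traffic.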
Affected Systems
- Date: not specified
- Change type: capability
- Severity: high