Bloom inference optimization: 5x latency reduction and 50x throughput gain
AI Impact Summary
The Bloom inference stack was ported from the Megatron-DeepSpeed training code into a transformers-based path, enabling a multi-GPU inference server that shards the 176B Bloom model across GPUs via Accelerate's device_map and pipeline parallelism. To reach practical performance, the team accepted modest logit drift and implemented a configurable tradeoff between speed and exactness, after comparing bfloat16 against float16 and addressing tensor-parallelism differences on Linear layers. Initial benchmarks on 16 A100 GPUs showed 0.3 requests/sec and 350 ms per-token latency; iterative optimizations then achieved 5x lower latency and 50x higher throughput, served through an HTTP front-end. The effort also emphasized deterministic testing with fixed prompts and a circuit-breaker approach to handle high-load scenarios without degrading service.
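The circuit-breaker idea mentioned above can be sketched as a simple in-flight-request cap: once the server is saturated, new requests are rejected immediately rather than queued, so latency for admitted requests stays bounded. This is a minimal illustrative sketch, not the actual server code; the names `CircuitBreaker` and `handle_request` and the fixed-cap policy are assumptions for illustration.

```python
import threading


class CircuitBreaker:
    """Minimal load-shedding gate: admit a request only while the number
    of in-flight requests stays under a fixed cap; otherwise shed it."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Admit only if capacity remains; never block or queue.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle_request(breaker: CircuitBreaker, prompt: str) -> str:
    # Fail fast under overload instead of letting per-token latency grow.
    if not breaker.try_acquire():
        return "503 overloaded"
    try:
        return f"generated({prompt})"  # stand-in for actual model inference
    finally:
        breaker.release()
```

The key design choice is rejecting excess load up front: a generation request on a 176B model holds GPU resources for seconds, so queueing under overload would degrade every client rather than only the shed ones.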
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info