Bloom inference optimization: 5x latency reduction and 50x throughput gain
AI Impact Summary
The Bloom inference stack was ported from the Megatron-DeepSpeed training code into a transformers-based path, enabling a multi-GPU inference server that shards the 176B Bloom model across GPUs via Accelerate's device_map and pipeline parallelism. To reach practical performance, the team accepted modest logit drift and implemented a configurable tradeoff between speed and exactness, after comparing bfloat16 against float16 and addressing tensor-parallelism differences on Linear layers. Initial benchmarks on 16 A100 GPUs showed 0.3 requests/sec and 350 ms per-token latency; iterative optimizations then achieved 5x lower latency and 50x higher throughput, served through an HTTP front-end. The effort also emphasized deterministic testing with fixed prompts and a circuit-breaker approach to handle high-load scenarios without degrading service.
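The circuit-breaker idea mentioned above can be sketched as a simple in-flight-request cap: once the server is saturated, new requests are rejected immediately rather than queued, so latency for admitted requests stays bounded. This is a minimal illustrative sketch, not the actual server code; the names `CircuitBreaker` and `handle_request` and the fixed-cap policy are assumptions for illustration.

```python
import threading


class CircuitBreaker:
    """Minimal load-shedding gate: admit a request only while the number
    of in-flight requests stays under a fixed cap; otherwise shed it."""

    def __init__(self, max_in_flight: int):
        self.max_in_flight = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Admit only if capacity remains; never block or queue.
        with self._lock:
            if self._in_flight >= self.max_in_flight:
                return False
            self._in_flight += 1
            return True

    def release(self) -> None:
        with self._lock:
            self._in_flight -= 1


def handle_request(breaker: CircuitBreaker, prompt: str) -> str:
    # Fail fast under overload instead of letting per-token latency grow.
    if not breaker.try_acquire():
        return "503 overloaded"
    try:
        return f"generated({prompt})"  # stand-in for actual model inference
    finally:
        breaker.release()
```

The key design choice is rejecting excess load up front: a generation request on a 176B model holds GPU resources for seconds, so queueing under overload would degrade every client rather than only the shed ones.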
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info