BLOOM 176B training tech stack: Megatron-DeepSpeed 3D parallelism on Jean Zay with 384 A100 80GB GPUs
AI Impact Summary
BLOOM training reached 176B parameters using 384 NVIDIA A100 80GB GPUs across 48 nodes on the Jean Zay supercomputer, running for roughly 3.5 months (~1M GPU-hours) and processing 350B tokens spanning 59 languages. The run used a forked Megatron-DeepSpeed stack that combines Megatron-LM and DeepSpeed components to implement 3D parallelism (data, tensor, and pipeline) with DeepSpeed ZeRO for efficient cross-GPU training. The hardware interconnects (NVLink, Omni-Path), GPFS storage, and GENCI/Jean Zay provisioning were essential at this scale, illustrating the tight coupling of hardware, software, and funding behind a model of BLOOM's size. For teams planning comparable runs, this underscores the need for specialized tooling and access to large HPC facilities; reproduction or extension hinges on obtaining similar compute and a compatible forked stack rather than generic frameworks.
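To make the 3D-parallel decomposition concrete, below is a minimal Python sketch of how a 384-GPU pool factors into tensor, pipeline, and data parallel dimensions, plus a DeepSpeed-style ZeRO config fragment. The TP/PP/DP split and all config values are illustrative assumptions for this sketch, not BLOOM's confirmed production settings; the config keys follow DeepSpeed's public JSON schema.

```python
# Minimal sketch of a 3D-parallel layout over the GPU pool described above.
# Assumptions: TP=4 and PP=12 are illustrative, not the confirmed BLOOM layout.

TOTAL_GPUS = 384          # 48 nodes x 8 A100 80GB (from the summary above)
tensor_parallel = 4       # assumed: shards each layer's matmuls across GPUs within a node (NVLink)
pipeline_parallel = 12    # assumed: splits the layer stack into sequential stages across nodes
data_parallel = TOTAL_GPUS // (tensor_parallel * pipeline_parallel)  # replicas seeing different batches

# The three dimensions must multiply back to the full GPU count.
assert tensor_parallel * pipeline_parallel * data_parallel == TOTAL_GPUS
print(f"TP={tensor_parallel} x PP={pipeline_parallel} x DP={data_parallel} = {TOTAL_GPUS} GPUs")

# DeepSpeed-style config fragment enabling ZeRO sharding of optimizer state
# across the data-parallel replicas. Values are placeholders for illustration.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 128,
    "zero_optimization": {"stage": 1},   # partition optimizer states over DP ranks
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
}
```

The usual rationale for this kind of split is that tensor parallelism stays inside a node to exploit NVLink bandwidth, while pipeline and data parallelism span nodes over the slower Omni-Path interconnect.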
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info