BLOOM 176B training tech stack: Megatron-DeepSpeed 3D parallelism on Jean Zay with 384 A100 80GB GPUs
AI Impact Summary
BLOOM training reached 176B parameters using 384 NVIDIA A100 80GB GPUs across 48 nodes on the Jean Zay supercomputer, running for roughly 3.5 months (~1M GPU-hours) and processing 350B tokens spanning 59 languages. The run used a forked Megatron-DeepSpeed stack that combines Megatron-LM and DeepSpeed components to implement 3D parallelism (data, tensor, and pipeline) with DeepSpeed ZeRO for efficient cross-GPU training. The hardware interconnects (NVLink, Omni-Path), GPFS storage, and GENCI/Jean Zay provisioning were essential at this scale, illustrating the tight coupling of hardware, software, and funding behind a model of BLOOM's size. For teams planning comparable runs, this underscores the need for specialized tooling and access to large HPC facilities; reproduction or extension hinges on obtaining similar compute and a compatible forked stack rather than generic frameworks.
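To make the 3D-parallel decomposition concrete, below is a minimal Python sketch of how a 384-GPU pool factors into tensor, pipeline, and data parallel dimensions, plus a DeepSpeed-style ZeRO config fragment. The TP/PP/DP split and all config values are illustrative assumptions for this sketch, not BLOOM's confirmed production settings; the config keys follow DeepSpeed's public JSON schema.

```python
# Minimal sketch of a 3D-parallel layout over the GPU pool described above.
# Assumptions: TP=4 and PP=12 are illustrative, not the confirmed BLOOM layout.

TOTAL_GPUS = 384          # 48 nodes x 8 A100 80GB (from the summary above)
tensor_parallel = 4       # assumed: shards each layer's matmuls across GPUs within a node (NVLink)
pipeline_parallel = 12    # assumed: splits the layer stack into sequential stages across nodes
data_parallel = TOTAL_GPUS // (tensor_parallel * pipeline_parallel)  # replicas seeing different batches

# The three dimensions must multiply back to the full GPU count.
assert tensor_parallel * pipeline_parallel * data_parallel == TOTAL_GPUS
print(f"TP={tensor_parallel} x PP={pipeline_parallel} x DP={data_parallel} = {TOTAL_GPUS} GPUs")

# DeepSpeed-style config fragment enabling ZeRO sharding of optimizer state
# across the data-parallel replicas. Values are placeholders for illustration.
ds_config = {
    "train_micro_batch_size_per_gpu": 2,
    "gradient_accumulation_steps": 128,
    "zero_optimization": {"stage": 1},   # partition optimizer states over DP ranks
    "bf16": {"enabled": True},
    "fp16": {"enabled": False},
}
```

The usual rationale for this kind of split is that tensor parallelism stays inside a node to exploit NVLink bandwidth, while pipeline and data parallelism span nodes over the slower Omni-Path interconnect.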
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info