Slurm-on-Kubernetes v1.0 deployed for all new Slurm GPU clusters
AI Impact Summary
New Slurm GPU clusters will be provisioned on a Slurm-on-Kubernetes stack that introduces self-healing workers, durable sacct history on PVC-backed storage, kernel-level process tracking, and automated zombie reaping. These changes reduce operator intervention during failures, prevent loss of job-accounting data across restarts, and prevent GPU memory leaks and PID exhaustion on reschedules. GPU state is rebuilt on node boot, eliminating GPU-not-found errors after pod reschedules, and DCGM-based GPU utilization metrics are now exposed in Grafana for per-cluster visibility. Existing clusters can be migrated in place per the Slurm configuration guidance.
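The durable sacct history described above implies the Slurm accounting daemon's state lives on a PersistentVolumeClaim rather than ephemeral pod storage. A minimal sketch of what such a claim could look like, assuming a `slurmdbd` pod mounts it for its accounting database (the claim name, access mode, and size here are illustrative assumptions, not taken from the release):

```yaml
# Hypothetical PVC backing slurmdbd's accounting database so sacct
# history survives pod restarts and reschedules.
# Name and storage size are illustrative, not from the release notes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurmdbd-state
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```

Kernel-level process tracking in Slurm is conventionally enabled with `ProctrackType=proctrack/cgroup` in `slurm.conf`, which lets Slurm account for and reap every process a job spawns; whether this exact setting is what the stack uses is an assumption based on standard Slurm practice.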
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info