Slurm-on-Kubernetes v1.0 deployed for all new Slurm GPU clusters
AI Impact Summary
New Slurm GPU clusters will be provisioned on a Slurm-on-Kubernetes stack that introduces self-healing workers, durable sacct history on PVC-backed storage, kernel-level process tracking, and automated zombie reaping. These changes reduce operator intervention during failures, prevent loss of job-accounting data across restarts, and prevent GPU memory leaks and PID exhaustion on reschedules. GPU state is rebuilt on node boot, eliminating GPU-not-found errors after pod reschedules, and DCGM-based GPU utilization metrics are now exposed in Grafana for per-cluster visibility. Existing clusters can be migrated in place per the Slurm configuration guidance.
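The durable sacct history described above implies the Slurm accounting daemon's state lives on a PersistentVolumeClaim rather than ephemeral pod storage. A minimal sketch of what such a claim could look like, assuming a `slurmdbd` pod mounts it for its accounting database (the claim name, access mode, and size here are illustrative assumptions, not taken from the release):

```yaml
# Hypothetical PVC backing slurmdbd's accounting database so sacct
# history survives pod restarts and reschedules.
# Name and storage size are illustrative, not from the release notes.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: slurmdbd-state
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
```

Kernel-level process tracking in Slurm is conventionally enabled with `ProctrackType=proctrack/cgroup` in `slurm.conf`, which lets Slurm account for and reap every process a job spawns; whether this exact setting is what the stack uses is an assumption based on standard Slurm practice.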
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info