Custom ROCm kernels for AMD MI300X to accelerate vLLM with Llama 3.1 405B FP8
AI Impact Summary
AMD MI300X-specific kernels are being developed to boost inference performance, including a fused residual-add + RMS-norm kernel, FP8 conversion, a fused SwiGLU kernel, and a skinny GEMM kernel, targeting Llama 3.1 405B in FP8 on 8×MI300X nodes with vLLM. The work is published in the hf-rocm-kernels repo and is planned for integration into the AMD fork of vLLM, with Python bindings and benchmarking scripts to reproduce the results. This unlocks kernel-level optimization on AMD hardware, offering a clear path to higher throughput and lower latency for large-scale generative workloads, contingent on adopting the provided repo and the follow-on integration steps.
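To make the fused operations concrete, here is a minimal numpy sketch of the unfused reference math that kernels like these typically replace. This is not the hf-rocm-kernels implementation; the function names and signatures are illustrative assumptions, showing only the element-wise computations (residual-add + RMS norm, and SwiGLU) that a fused HIP kernel would perform in a single pass over memory.

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # Root-mean-square normalization as used in Llama-family models:
    # scale each row by the reciprocal RMS of its elements, then by a
    # learned per-channel weight.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

def fused_residual_rms_norm(x, residual, weight, eps=1e-6):
    # The "fused residual + RMS norm" pattern: add the residual stream,
    # then normalize. Returning the updated residual alongside the
    # normalized output lets the next layer reuse it without a second
    # pass over memory (which is what fusing saves on the GPU).
    residual = residual + x
    return rms_norm(residual, weight, eps), residual

def silu(x):
    # SiLU (a.k.a. swish): x * sigmoid(x).
    return x / (1.0 + np.exp(-x))

def swiglu(gate, up):
    # SwiGLU activation from Llama's MLP block: silu(gate) * up.
    # A fused kernel computes this element-wise product directly
    # instead of materializing silu(gate) as an intermediate tensor.
    return silu(gate) * up
```

A fused GPU kernel performs the same arithmetic but reads and writes each tensor once, which is where the bandwidth savings on MI300X come from.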
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info