Llama 2 on Amazon SageMaker benchmark with Hugging Face LLM Inference Container
AI Impact Summary
This benchmark exhaustively evaluates Llama 2 deployments on SageMaker using the Hugging Face LLM Inference Container, testing 60 configurations across 7B, 13B, and 70B models and multiple EC2 GPU instances to map cost, latency, and throughput. Key results include 13B with GPTQ 4-bit on g5.2xlarge delivering ~71 tokens/sec at $1.55/h, 296 tokens/sec max throughput on ml.g5.12xlarge, and 7B achieving 16.8 ms/token latency on ml.g5.12xlarge, with scalability considerations noted across load levels. The study provides reproducible data via GitHub and a documented benchmark dataset to help teams select SageMaker/Hugging Face configurations aligned to cost or performance goals.
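The cost and throughput figures above can be compared on a per-token basis. A minimal sketch of that conversion, assuming sustained throughput (the 13B GPTQ numbers come from the benchmark; the helper function name is illustrative):

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Convert an hourly instance price and a sustained generation rate
    into the cost of producing one million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Benchmark figure: 13B with GPTQ 4-bit on g5.2xlarge, ~71 tokens/sec at $1.55/h
print(round(cost_per_million_tokens(1.55, 71), 2))  # ~6.06 USD per 1M tokens
```

Real deployments rarely saturate the instance continuously, so this is a lower bound on the effective per-token cost.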
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium