Benchmark: Llama 2 deployment on Amazon SageMaker with Hugging Face LLM Inference Container
AI Impact Summary
Amazon SageMaker is evaluated with the Hugging Face LLM Inference Container and Text Generation Inference across 60 configurations of Llama 2 (7B, 13B, 70B). The study measures latency and throughput at varying concurrency levels on multiple GPU instances (g5.2xlarge, g5.12xlarge, g5.48xlarge, ml.g5.12xlarge, ml.p4d.24xlarge), with and without GPTQ 4-bit quantization, to reveal cost/performance tradeoffs. Key takeaways: 13B with GPTQ on g5.2xlarge is the most cost-efficient configuration (~71 tokens/sec at $1.55/h); 13B on ml.g5.12xlarge delivers the best throughput (~296 tokens/sec); and 7B on ml.g5.12xlarge achieves the lowest latency (~16.8 ms/token). The findings offer concrete deployment patterns, highlight future gains from newer hardware such as Inferentia2, and include reproducible data and code so teams can replicate the benchmark or tailor it to their own use cases.
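The cost-efficiency claim can be made concrete with a back-of-envelope calculation: dividing the instance's hourly price by its sustained token throughput gives a cost per generated token. The sketch below uses the headline figures from the summary (~71 tokens/sec at $1.55/h for 13B+GPTQ on g5.2xlarge); the helper function is illustrative, not part of the benchmark code.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Instance cost divided by sustained token throughput, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# 13B + GPTQ on g5.2xlarge: ~71 tokens/sec at $1.55/h
print(round(cost_per_million_tokens(1.55, 71.0), 2))  # ≈ 6.06 ($ per 1M tokens)
```

The same function lets you compare any of the benchmarked configurations once you know an instance's on-demand price and measured throughput.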
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium