Llama 2 on Amazon SageMaker Benchmark: Optimal Deployments
AI Impact Summary
This benchmark identifies optimal deployment strategies for Llama 2 on Amazon SageMaker, evaluated along three axes: cost, throughput, and latency.
- Most cost-effective (general use): Llama 2 13B quantized with GPTQ on an ml.g5.2xlarge instance, achieving 71 tokens per second at a reasonable hourly cost.
- Maximum throughput: Llama 2 13B on an ml.g5.12xlarge instance, processing 296 tokens per second.
- Minimum latency: Llama 2 7B on an ml.g5.12xlarge instance, delivering a median latency of 16 ms per token.
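The recommendations above can be encoded as a small lookup helper for picking an instance type by optimization goal. This is an illustrative sketch, not part of the SageMaker SDK; the function and dictionary names are my own, and the figures are taken from the benchmark summary.

```python
# Benchmark-derived recommendations for Llama 2 on SageMaker.
# Keys are optimization goals; values capture the model variant,
# quantization scheme, instance type, and headline metric reported above.
BENCHMARK_RECOMMENDATIONS = {
    "cost":       ("llama-2-13b", "gptq", "ml.g5.2xlarge",  "71 tokens/s"),
    "throughput": ("llama-2-13b", None,   "ml.g5.12xlarge", "296 tokens/s"),
    "latency":    ("llama-2-7b",  None,   "ml.g5.12xlarge", "16 ms/token median"),
}

def recommend(goal: str) -> dict:
    """Return the benchmarked deployment configuration for a given goal
    ('cost', 'throughput', or 'latency')."""
    model, quant, instance, metric = BENCHMARK_RECOMMENDATIONS[goal]
    return {
        "model": model,
        "quantization": quant,
        "instance_type": instance,
        "metric": metric,
    }
```

For example, `recommend("cost")` returns the GPTQ-quantized 13B configuration on `ml.g5.2xlarge`; the resulting instance type could then be passed to a SageMaker deployment call.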
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium