Benchmark: Llama 2 deployment on Amazon SageMaker with Hugging Face LLM Inference Container
AI Impact Summary
Amazon SageMaker is evaluated with the Hugging Face LLM Inference Container and Text Generation Inference across 60 configurations of Llama 2 (7B, 13B, 70B). The study measures latency and throughput at varying concurrency levels on multiple GPU instances (g5.2xlarge, g5.12xlarge, g5.48xlarge, ml.g5.12xlarge, ml.p4d.24xlarge), with and without GPTQ 4-bit quantization, to reveal cost/performance tradeoffs. Key takeaways: 13B with GPTQ on g5.2xlarge is the most cost-efficient configuration (~71 tokens/sec at $1.55/h); 13B on ml.g5.12xlarge delivers the best throughput (~296 tokens/sec); and 7B on ml.g5.12xlarge achieves the lowest latency (~16.8 ms/token). The findings offer concrete deployment patterns, highlight future gains from newer hardware such as Inferentia2, and include reproducible data and code so teams can replicate the benchmark or tailor it to their own use cases.
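The cost-efficiency claim can be made concrete with a back-of-envelope calculation: dividing the instance's hourly price by its sustained token throughput gives a cost per generated token. The sketch below uses the headline figures from the summary (~71 tokens/sec at $1.55/h for 13B+GPTQ on g5.2xlarge); the helper function is illustrative, not part of the benchmark code.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Instance cost divided by sustained token throughput, scaled to 1M tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# 13B + GPTQ on g5.2xlarge: ~71 tokens/sec at $1.55/h
print(round(cost_per_million_tokens(1.55, 71.0), 2))  # ≈ 6.06 ($ per 1M tokens)
```

The same function lets you compare any of the benchmarked configurations once you know an instance's on-demand price and measured throughput.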
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium