Llama 2 on Amazon SageMaker benchmark with Hugging Face LLM Inference Container
AI Impact Summary
This benchmark exhaustively evaluates Llama 2 deployments on SageMaker using the Hugging Face LLM Inference Container, testing 60 configurations across 7B, 13B, and 70B models and multiple EC2 GPU instances to map cost, latency, and throughput. Key results include 13B with GPTQ 4-bit on g5.2xlarge delivering ~71 tokens/sec at $1.55/h, 296 tokens/sec max throughput on ml.g5.12xlarge, and 7B achieving 16.8 ms/token latency on ml.g5.12xlarge, with scalability considerations noted across load levels. The study provides reproducible data via GitHub and a documented benchmark dataset to help teams select SageMaker/Hugging Face configurations aligned to cost or performance goals.
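The cost and throughput figures above can be compared on a per-token basis. A minimal sketch of that conversion, assuming sustained throughput (the 13B GPTQ numbers come from the benchmark; the helper function name is illustrative):

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_sec: float) -> float:
    """Convert an hourly instance price and a sustained generation rate
    into the cost of producing one million tokens."""
    tokens_per_hour = tokens_per_sec * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Benchmark figure: 13B with GPTQ 4-bit on g5.2xlarge, ~71 tokens/sec at $1.55/h
print(round(cost_per_million_tokens(1.55, 71), 2))  # ~6.06 USD per 1M tokens
```

Real deployments rarely saturate the instance continuously, so this is a lower bound on the effective per-token cost.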
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium