Llama 2 on Amazon SageMaker Benchmark: Optimal Deployments
AI Impact Summary
This benchmark identifies optimal deployment strategies for Llama 2 on Amazon SageMaker, evaluated along three axes: cost, throughput, and latency.
- Most cost-effective (general use): Llama 2 13B quantized with GPTQ on an ml.g5.2xlarge instance, achieving 71 tokens per second at a reasonable hourly cost.
- Maximum throughput: Llama 2 13B on an ml.g5.12xlarge instance, processing 296 tokens per second.
- Minimum latency: Llama 2 7B on an ml.g5.12xlarge instance, delivering a median latency of 16 ms per token.
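The recommendations above can be encoded as a small lookup helper for picking an instance type by optimization goal. This is an illustrative sketch, not part of the SageMaker SDK; the function and dictionary names are my own, and the figures are taken from the benchmark summary.

```python
# Benchmark-derived recommendations for Llama 2 on SageMaker.
# Keys are optimization goals; values capture the model variant,
# quantization scheme, instance type, and headline metric reported above.
BENCHMARK_RECOMMENDATIONS = {
    "cost":       ("llama-2-13b", "gptq", "ml.g5.2xlarge",  "71 tokens/s"),
    "throughput": ("llama-2-13b", None,   "ml.g5.12xlarge", "296 tokens/s"),
    "latency":    ("llama-2-7b",  None,   "ml.g5.12xlarge", "16 ms/token median"),
}

def recommend(goal: str) -> dict:
    """Return the benchmarked deployment configuration for a given goal
    ('cost', 'throughput', or 'latency')."""
    model, quant, instance, metric = BENCHMARK_RECOMMENDATIONS[goal]
    return {
        "model": model,
        "quantization": quant,
        "instance_type": instance,
        "metric": metric,
    }
```

For example, `recommend("cost")` returns the GPTQ-quantized 13B configuration on `ml.g5.2xlarge`; the resulting instance type could then be passed to a SageMaker deployment call.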
Affected Systems
- Date: not specified
- Change type: capability
- Severity: medium