Accelerate BERT inference with Hugging Face Transformers on AWS Inferentia via SageMaker
AI Impact Summary
This notebook describes accelerating BERT inference by compiling Hugging Face Transformer models for AWS Inferentia with the AWS Neuron SDK and deploying them on SageMaker Inf1 instances. A custom inference.py is required because zero-code deployment is not supported for Inferentia, and compilation fixes static input shapes via torch_neuron.trace. Deployment means packaging the artifacts into a model.tar.gz, uploading it to S3 for SageMaker hosting, and tuning the number of NeuronCores per worker via NEURON_RT_NUM_CORES. Expect lower cost per inference and higher throughput on Inf1 than on comparable GPU instances, at the price of extra deployment steps and fixed-shape constraints.
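As a minimal sketch of the compilation step (the checkpoint, sequence length, and file names are illustrative assumptions, not prescribed by the summary), tracing with torch_neuron bakes the input shapes in at compile time:

```python
import torch
import torch_neuron  # AWS Neuron SDK integration for PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Assumed example checkpoint; any BERT-family classifier works the same way.
model_id = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)

# Inferentia compiles against static shapes: every request must be padded to this length.
max_length = 128
dummy = tokenizer(
    "dummy input, padded to the fixed sequence length",
    max_length=max_length, padding="max_length", truncation=True, return_tensors="pt",
)
example_inputs = tuple(dummy.values())  # (input_ids, attention_mask)

# Compile for NeuronCores and save the TorchScript artifact for packaging.
model_neuron = torch_neuron.trace(model, example_inputs)
model_neuron.save("model_neuron.pt")
```

Because zero-code deployment is not available for Inferentia, a custom handler has to load the compiled artifact. The sketch below assumes the SageMaker Hugging Face Inference Toolkit's model_fn/predict_fn override points and the file layout from the previous step; the NEURON_RT_NUM_CORES value is an illustrative per-worker setting:

```python
# code/inference.py: hypothetical custom handler for a compiled Neuron model.
import os

# Claim one NeuronCore per worker process; tune per instance size (assumption).
os.environ["NEURON_RT_NUM_CORES"] = "1"

import torch
import torch_neuron  # registers the Neuron runtime with TorchScript
from transformers import AutoConfig, AutoTokenizer

MAX_LENGTH = 128  # must match the static shape used at compile time


def model_fn(model_dir):
    """Load the compiled model plus tokenizer/config shipped in model.tar.gz."""
    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    config = AutoConfig.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, "model_neuron.pt"))
    return model, tokenizer, config


def predict_fn(data, model_and_assets):
    """Pad to the compiled sequence length, run the Neuron model, return labels."""
    model, tokenizer, config = model_and_assets
    inputs = data.get("inputs", data)
    batch = tokenizer(
        inputs, max_length=MAX_LENGTH, padding="max_length",
        truncation=True, return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(*tuple(batch.values()))[0]
    scores = torch.nn.functional.softmax(logits, dim=1)
    preds = scores.argmax(dim=1)
    return [
        {"label": config.id2label[p.item()], "score": scores[i][p].item()}
        for i, p in enumerate(preds)
    ]
```

For hosting, the artifacts are packaged into model.tar.gz and uploaded to S3, then deployed to an Inf1 instance. The bucket prefix, framework versions, and instance type below are assumptions made to keep the sketch concrete:

```python
import sagemaker
from sagemaker.huggingface.model import HuggingFaceModel

sess = sagemaker.Session()
# model.tar.gz is assumed to contain model_neuron.pt, tokenizer files,
# config.json, and code/inference.py from the step above.
s3_model_uri = sess.upload_data("model.tar.gz", key_prefix="neuron-bert")

huggingface_model = HuggingFaceModel(
    model_data=s3_model_uri,
    role=sagemaker.get_execution_role(),
    transformers_version="4.12",  # assumed toolkit versions
    pytorch_version="1.9",
    py_version="py37",
)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf1.xlarge",  # Inferentia-backed instance
)
print(predictor.predict({"inputs": "Inference on Inferentia is fast and cheap."}))
```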
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info