Run Vicuna 13B Chatbot on AMD GPU with ROCm and GPTQ
AI Impact Summary
This guide details running the Vicuna 13B language model, a 13 billion parameter chatbot, on a single AMD GPU using ROCm and GPTQ quantization. The key technical challenge is the model's memory footprint (approximately 28GB in fp16), which is addressed through 4-bit GPTQ quantization, reducing the memory requirement to around 7.5GB. This allows the model to run on GPUs with more limited memory, such as the AMD Instinct MI210 or Radeon RX 6900 XT, demonstrating a viable path to deploying large language models on consumer hardware. The process involves setting up ROCm, Docker, and Python, followed by model quantization and inference, ultimately exposing the model via a web API.
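As a rough sanity check on those figures, the memory savings follow directly from bits-per-parameter arithmetic. The short Python sketch below is illustrative only: the round 13e9 parameter count and the exclusion of activations, KV cache, and quantization metadata are simplifying assumptions, not numbers taken from the guide.

```python
# Back-of-the-envelope weight-memory math for a 13B-parameter model.
# Assumption: exactly 13e9 parameters; activations, KV cache, and
# quantization metadata (scales/zero-points) are ignored here.

PARAMS = 13e9  # approximate parameter count of Vicuna 13B


def weight_footprint_gib(bits_per_param: float) -> float:
    """Approximate weight memory in GiB at a given precision."""
    return PARAMS * bits_per_param / 8 / 1024**3


print(f"fp16 weights:  ~{weight_footprint_gib(16):.1f} GiB")  # ~24.2 GiB
print(f"4-bit weights: ~{weight_footprint_gib(4):.1f} GiB")   # ~6.1 GiB

# The guide's figures (~28GB in fp16, ~7.5GB after 4-bit GPTQ) sit a
# few GB above these raw weight numbers because real inference also
# needs quantization scales, activations, and the KV cache.
```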
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info