Optimum-Intel and OpenVINO GenAI enable edge deployment of Hugging Face Transformers
AI Impact Summary
Optimum-Intel combined with OpenVINO GenAI enables edge and client-side deployment of Hugging Face Transformers models by exporting them to OpenVINO IR and running inference through a lightweight Python or C++ runtime, which shrinks the dependency footprint relative to a full framework stack. The workflow uses the OVModelForCausalLM wrapper or the optimum-cli export to produce the IR directory (including converted tokenizers), which is the artifact later loaded for deployment on Intel hardware. To meet latency and footprint targets, weight-only quantization with NNCF (INT8/INT4) is recommended, optionally combined with techniques such as AWQ and scale estimation; the resulting accuracy trade-offs should be validated against benchmarks such as lm-evaluation-harness. Deployment then runs through the OpenVINO GenAI LLMPipeline in Python or the equivalent OpenVINO GenAI C++ API, giving a compact generation pipeline on CPU and other supported devices.
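To make the export step concrete, here is a minimal sketch using the OVModelForCausalLM wrapper; the model ID and output directory are placeholder choices, not values from the source. The same IR can be produced from the shell with `optimum-cli export openvino --model <id> <output_dir>`, which (with openvino-tokenizers installed) also writes the converted tokenizer files the GenAI pipeline expects.

```python
# Sketch: export a Hugging Face causal LM to OpenVINO IR with Optimum-Intel.
# The model ID and output path are illustrative placeholders.
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # any HF causal LM

# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Writes openvino_model.xml/.bin plus tokenizer files into one directory
model.save_pretrained("tinyllama-ov")
tokenizer.save_pretrained("tinyllama-ov")
```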
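The weight-only quantization path can be driven from the same wrapper. The sketch below assumes a recent optimum-intel release with NNCF installed; the INT4 settings (group size, ratio, AWQ, scale estimation, calibration dataset) are illustrative choices for this example, not recommended defaults, and should be tuned against an accuracy benchmark such as lm-evaluation-harness.

```python
# Sketch: INT4 weight-only quantization during export (NNCF under the hood).
# Parameter values here are illustrative, not tuned recommendations.
from optimum.intel import OVModelForCausalLM, OVWeightQuantizationConfig

model_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"

quant_config = OVWeightQuantizationConfig(
    bits=4,               # INT4 weight compression
    group_size=64,        # per-group quantization granularity
    ratio=0.8,            # fraction of layers compressed to INT4 (rest INT8)
    awq=True,             # activation-aware weight quantization
    scale_estimation=True,
    dataset="wikitext2",  # calibration data for AWQ / scale estimation
)

model = OVModelForCausalLM.from_pretrained(
    model_id, export=True, quantization_config=quant_config
)
model.save_pretrained("tinyllama-ov-int4")
```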
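Finally, the exported (or quantized) IR directory can be served with the OpenVINO GenAI LLMPipeline. A minimal Python sketch follows, with the directory name and prompt as placeholders; the C++ API exposes the same pipeline shape, which is what enables the reduced-dependency client deployment described above.

```python
# Sketch: run generation on the exported IR directory with OpenVINO GenAI.
# Directory and prompt are placeholders; "CPU" can be swapped for another
# supported device such as "GPU".
import openvino_genai as ov_genai

# The directory must contain the IR plus the converted tokenizer files
pipe = ov_genai.LLMPipeline("tinyllama-ov-int4", "CPU")

result = pipe.generate("What is OpenVINO?", max_new_tokens=100)
print(result)
```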
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info