Holotron-12B: High-Throughput Multimodal Agent with Nemotron SSM Delivers 8.9k tokens/s at 100 Concurrency
AI Impact Summary
Holotron-12B is a production-oriented multimodal agent model built on Nemotron-Nano-2 VL and optimized for long-context, interactive workloads. Its hybrid state-space/attention architecture reduces memory footprint and scales throughput: served with vLLM on a single H100, it achieves 8.9k tokens/s at 100 concurrent requests. The model is released on Hugging Face under the NVIDIA Open Model License and targets enterprise data generation, annotation, and online reinforcement learning pipelines that require high-throughput agentic reasoning.
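A quick back-of-the-envelope check puts the headline figure in per-user terms. This sketch assumes the 8.9k tokens/s is aggregate decode throughput shared evenly across all concurrent requests, which is an assumption about the benchmark setup rather than a documented guarantee:

```python
# Back-of-the-envelope check of the headline throughput figure.
# Assumption: 8.9k tokens/s is aggregate throughput divided evenly
# across concurrent streams (not stated in the announcement).

def per_request_throughput(aggregate_tps: float, concurrency: int) -> float:
    """Average tokens/s each request sees if throughput divides evenly."""
    return aggregate_tps / concurrency

if __name__ == "__main__":
    tps = per_request_throughput(8_900, 100)
    print(f"{tps:.1f} tokens/s per request")  # prints "89.0 tokens/s per request"
```

At roughly 89 tokens/s per stream, each request still decodes faster than typical human reading speed, which is the property that matters for interactive agent workloads.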
Affected Systems
- Date: not specified
- Change type: capability
- Severity: info