Nvidia has released Nemotron-Nano-9B-v2, a small language model (SLM) that leads its class in benchmark performance and includes a toggle for AI “reasoning” (self-checking before answering). The model was pruned from 12B to 9B parameters so that it fits on a single Nvidia A10 GPU, a popular choice for deployment, according to Oleksii Kuchiaev, Nvidia’s Director of AI Model Post-Training. As a hybrid model, it supports larger batch sizes and runs up to 6× faster than similarly sized transformer models.

Nemotron-Nano-9B-v2 is based on Nemotron-H, which uses a hybrid Mamba-Transformer architecture developed with input from researchers at Carnegie Mellon University and Princeton. The Mamba architecture integrates selective state space models (SSMs), which handle long sequences efficiently by maintaining state. The hybrid design reduces compute costs by replacing most attention layers with linear-time state space layers, achieving 2–3× higher throughput on long contexts at comparable accuracy.

Nemotron-Nano-9B-v2 is a unified, text-only chat and reasoning model, trained from scratch, that by default generates a reasoning trace before producing a final answer. Users can control this behavior with control tokens such as /think or /no_think (a usage sketch follows at the end of this section). The model also introduces a “thinking budget,” which lets developers cap the number of internal reasoning tokens to balance accuracy against latency, a good fit for use cases like customer support and autonomous agents (a second sketch below illustrates one way such a cap could work).

Benchmark results show strong performance: 72.1% on AIME25, 97.8% on MATH500, 64.0% on GPQA, and 71.1% on LiveCodeBench. On these benchmarks, Nemotron-Nano-9B-v2 outperforms Qwen3-8B, a common point of comparison in the open SLM category.

The model can be put into production immediately, with no separate commercial license to negotiate and no fees tied to usage thresholds, revenue levels, or user counts. Unlike some tiered open licenses used by other providers, there are no clauses requiring a paid license once a company reaches a certain scale.
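
The following is a minimal sketch of the /think and /no_think toggle described above, assuming the model is served behind an OpenAI-compatible endpoint (for example via vLLM); the base URL, API key, and model identifier here are placeholders rather than official values.

```python
# Sketch: toggling Nemotron-Nano-9B-v2 reasoning via control tokens.
# Assumes an OpenAI-compatible serving endpoint (e.g., vLLM); base_url,
# api_key, and the model id are illustrative placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(question: str, reasoning: bool) -> str:
    # A /think or /no_think control token in the system prompt switches
    # the reasoning trace on or off, per the behavior described above.
    system = "/think" if reasoning else "/no_think"
    resp = client.chat.completions.create(
        model="nvidia/NVIDIA-Nemotron-Nano-9B-v2",  # assumed model id
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": question},
        ],
    )
    return resp.choices[0].message.content

print(ask("What is 17 * 24?", reasoning=True))   # answer preceded by a trace
print(ask("What is 17 * 24?", reasoning=False))  # direct answer only
```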
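
And here is one plausible client-side approximation of the “thinking budget” mechanic, not the official API: it assumes the reasoning trace is delimited by `<think>...</think>` markers and that the checkpoint is available under the Hugging Face id shown, both of which are assumptions for illustration. The idea is two-phase generation: let the model reason up to a token cap, force the trace closed if the cap is hit, then generate the final answer.

```python
# Sketch: capping reasoning tokens ("thinking budget") via two-phase generation.
# Assumptions: HF checkpoint id and <think>...</think> trace markers are
# illustrative, not confirmed interface details.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/NVIDIA-Nemotron-Nano-9B-v2"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

def generate_with_budget(question: str, thinking_budget: int,
                         answer_tokens: int = 256) -> str:
    messages = [
        {"role": "system", "content": "/think"},  # reasoning enabled
        {"role": "user", "content": question},
    ]
    prompt = tok.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    ids = tok(prompt, return_tensors="pt").to(model.device)

    # Phase 1: let the model reason, but stop once the budget is spent.
    draft = model.generate(**ids, max_new_tokens=thinking_budget)
    text = tok.decode(draft[0], skip_special_tokens=False)

    if "</think>" not in text:
        # Budget exhausted mid-trace: close the trace so the model must answer.
        text += "</think>"

    # Phase 2: generate the final answer from the (possibly truncated) trace.
    ids2 = tok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
    final = model.generate(**ids2, max_new_tokens=answer_tokens)
    new_tokens = final[0][ids2["input_ids"].shape[1]:]
    return tok.decode(new_tokens, skip_special_tokens=True)
```

Trading the budget parameter against answer quality is exactly the accuracy/latency balance the feature targets: a small cap keeps responses fast for interactive agents, while a larger cap lets harder problems use longer traces.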