Efficiency has always been NVIDIA's real competitive weapon, and Nemotron 3 Super is the clearest proof yet. Released on March 11, 2026, the model arrives as a 120-billion-parameter Mixture-of-Experts architecture that activates only 12 billion parameters per forward pass. The throughput benchmark reads like a typo: 5x the previous Nemotron Super and 7.5x Qwen3.5-122B on equivalent hardware. If those numbers hold, they're not describing a better model. They're describing a different cost structure for enterprise AI deployment.
What Actually Happened
NVIDIA unveiled the Nemotron 3 family as three distinct models designed for different deployment profiles across the enterprise AI stack. At the bottom, Nemotron 3 Nano carries 3.2 billion active parameters within a 31.6-billion-parameter model, delivering 4x the throughput of its predecessor while targeting edge inference and resource-constrained deployments. At the top, Nemotron 3 Ultra extends to approximately 500 billion total parameters with 50 billion active per token, designed for complex multi-step reasoning in enterprise applications with no concurrency constraints. Between them sits the flagship, Nemotron 3 Super, which occupies the enterprise sweet spot: 120 billion total parameters, 12 billion active per token, and throughput of 449 to 478 output tokens per second on B200 GPUs.
The architectural innovation that enables this performance profile is a hybrid Mamba-Transformer Mixture-of-Experts design. Traditional transformer attention scales quadratically with context length, which becomes a throughput bottleneck at the long sequences agentic AI tasks require. The Mamba state-space component in Nemotron 3 Super handles these long sequences with near-linear complexity, while sparse MoE routing ensures that only a fraction of the total parameters fire on any given input. The combination produces a model that is simultaneously faster per token and more capable at long-context tasks than models with comparable active parameter counts. NVIDIA released all three tiers under an Apache 2.0 license with full weights publicly available.
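The specific routing configuration isn't detailed here, but sparse MoE activation generally works through top-k gating: a small router scores every expert for each token, and only the top few experts actually run. Below is a minimal PyTorch sketch of that general pattern; the expert count, top-k value, and hidden dimension are illustrative assumptions, not Nemotron 3's actual configuration.

```python
import torch
import torch.nn.functional as F

# Illustrative dimensions only; not Nemotron 3's published configuration.
n_experts, top_k, d_model = 64, 4, 4096

gate = torch.nn.Linear(d_model, n_experts, bias=False)  # router
experts = torch.nn.ModuleList(
    [torch.nn.Linear(d_model, d_model) for _ in range(n_experts)]
)

def moe_forward(x: torch.Tensor) -> torch.Tensor:
    """x: (tokens, d_model) -> (tokens, d_model); only top_k experts fire per token."""
    weights, idx = gate(x).topk(top_k, dim=-1)   # route each token to top_k experts
    weights = F.softmax(weights, dim=-1)         # normalize over the chosen experts
    out = torch.zeros_like(x)
    for e in range(n_experts):
        # which tokens (and which of their top_k slots) selected expert e
        token_ids, slot = (idx == e).nonzero(as_tuple=True)
        if token_ids.numel():
            out[token_ids] += weights[token_ids, slot, None] * experts[e](x[token_ids])
    return out

y = moe_forward(torch.randn(16, d_model))  # usage: 16 tokens, each touching 4 of 64 experts
```

Per token, only top_k of n_experts feed-forward blocks execute, so compute per token tracks active parameters rather than total parameters. That ratio is the mechanism behind a 120-billion-parameter model that prices out like a 12-billion-parameter dense model at inference.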
The multimodal extension of the Nano tier, Nemotron 3 Nano Omni, adds a separate dimension to the family's story. Unifying vision, audio, and language processing in a single efficient model, Nano Omni delivers 9x the throughput of competing open omni models at equivalent interactivity while topping six leaderboards spanning complex document intelligence, video understanding, and audio understanding. This positions the Nemotron 3 family as a full-stack open answer: cloud-scale concurrent agent reasoning at the Super and Ultra tiers, and multimodal edge inference at the Nano Omni tier.
Why This Matters More Than People Think
The 85.6% PinchBench score will dominate initial coverage. PinchBench measures how well an LLM performs as the reasoning brain of an autonomous agent running inside NVIDIA's OpenClaw framework, and Nemotron 3 Super's result places it ahead of every public open competitor. But the number that changes procurement decisions is the throughput ratio. At 7.5x the tokens per second of Qwen3.5-122B on equivalent B200 hardware, Nemotron 3 Super doesn't just outperform its nearest large-scale open competitor; it settles the cost-per-useful-output comparison before it starts.
Enterprise AI deployments in 2026 are increasingly characterized by high-concurrency agent workloads, not single-user chat sessions. A legal research firm running 500 simultaneous document review agents faces very different infrastructure economics than a consumer chatbot with intermittent queries. For concurrent agent deployments, throughput is the variable that determines whether a deployment is economically viable. The math is direct: a model handling 7.5x the throughput of the next best open option on the same GPU cluster effectively reduces inference cost-per-token by that same factor for throughput-bound applications. For enterprises watching GPU bills compound monthly across thousands of concurrent agent sessions, that ratio is a procurement decision, not a benchmark talking point.
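To make that ratio concrete, here is a back-of-the-envelope sketch. Everything except the 7.5x factor and the published 449-to-478 token rate is an illustrative assumption; the hourly GPU cost in particular is a placeholder, not quoted pricing.

```python
# Back-of-the-envelope cost-per-token comparison for a throughput-bound
# deployment. The hourly GPU cost is a hypothetical placeholder.
gpu_cost_per_hour = 10.0             # assumed B200 hourly cost (USD)
nemotron_tps = 460                   # midpoint of the published 449-478 range
competitor_tps = nemotron_tps / 7.5  # implied by the claimed 7.5x gap

def cost_per_million_tokens(tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

print(f"Nemotron 3 Super: ${cost_per_million_tokens(nemotron_tps):.2f}/M tokens")
print(f"Competitor:       ${cost_per_million_tokens(competitor_tps):.2f}/M tokens")
# Under these assumptions: about $6.04/M vs $45.29/M, the same 7.5x factor.
```

Whatever the true hourly rate, the cost-per-token gap scales with the throughput ratio for throughput-bound workloads, which is why the ratio, not the absolute price, drives the procurement decision.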
The 91.75% RULER score at a 1 million token context opens a separate tier of use cases that has been confined to expensive frontier closed models. Most models degrade meaningfully beyond 128K tokens. A model that maintains near-perfect fidelity at 1 million tokens makes legal contract analysis, scientific literature review, and codebase-level AI reasoning economically viable for organizations that cannot justify frontier API pricing. The long-context capability, combined with the throughput advantage, makes Nemotron 3 Super the first open model to check both boxes for enterprise agentic deployment.
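One way to see why million-token contexts have been the preserve of frontier deployments: in a pure-transformer stack, the key-value cache grows linearly with sequence length and must sit in GPU memory for every concurrent request. A rough sketch, using hypothetical configuration numbers since Nemotron 3's exact layer mix isn't specified here:

```python
# KV-cache memory at 1M tokens for a pure-attention stack.
# All configuration numbers below are hypothetical, for illustration only.
seq_len      = 1_000_000   # tokens in context
n_layers     = 48          # assumed attention layer count
n_kv_heads   = 8           # assumed grouped-query attention KV heads
head_dim     = 128
bytes_per_el = 2           # fp16/bf16

# Keys and values (factor of 2) cached at every layer for every token.
kv_cache_bytes = seq_len * n_layers * n_kv_heads * head_dim * 2 * bytes_per_el
print(f"KV cache per sequence: {kv_cache_bytes / 2**30:.1f} GiB")
# About 183 GiB per concurrent sequence, before weights or activations.
# A Mamba-style state-space layer keeps a fixed-size state that does not
# grow with seq_len, which is the hybrid architecture's long-context lever.
```

Replacing a large share of attention layers with fixed-state layers is what lets long-context serving scale with concurrency rather than collapsing under per-request cache growth.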
The Competitive Landscape
The open model market entering Q2 2026 is the most competitive it has ever been. Meta's Llama 5 arrived as the generation's benchmark-setter. DeepSeek V4 Pro disrupted inference pricing expectations with frontier-comparable performance at drastically lower cost. Zhipu AI's GLM-5.1 captured the top position on SWE-Bench Pro. Into this field, Nemotron 3 Super introduces a variable none of those competitors can easily replicate: NVIDIA's ability to co-optimize model architecture with its own silicon, then benchmark that architecture on the hardware it was designed for.
Mistral faces the most direct competitive pressure. Its Large and Medium models have occupied the enterprise-grade open model niche precisely because they deliver capable performance at efficient compute costs. Nemotron 3 Super enters that same niche with Apache 2.0 licensing, CUDA optimization, and the full weight of NVIDIA's enterprise customer relationships behind it. Qwen3.5-122B's position as the throughput baseline for very large open models is now structurally weakened: no benchmark comparison on equivalent hardware will favor Qwen for throughput-sensitive tasks given the 7.5x gap.
Critics argue, however, that the benchmark picture is not fully independent. PinchBench was developed by NVIDIA and measures performance within NVIDIA's own OpenClaw agent framework. A model tuned against its creator's evaluation methodology raises legitimate questions about whether the 85.6% score generalizes to enterprise workflows using different orchestration tools. Nemotron 3 Super's 60.47% on SWE-Bench Verified, a third-party standard, is competitive but still below Claude Opus 4.7 and OpenAI's Codex-integrated results. The bear case is that enterprises testing on their own task distributions may find the real performance gap with frontier closed models is larger than the PinchBench headline suggests, while the throughput advantage remains genuine. That makes this a strong efficiency win, but not necessarily a capability win on the most demanding reasoning tasks.
Hidden Insight: The Model Is a Hardware Advertisement
NVIDIA has a history of shipping software products that exist primarily to demonstrate hardware capabilities. CUDA is the clearest example. When NVIDIA released CUDA in 2006, it was framed as a parallel computing platform for developers. The practical effect was to make NVIDIA GPUs the only viable hardware for the scientific computing workloads that would eventually become deep learning, creating an ecosystem dependency that competitors spent fifteen years trying to break. Nemotron 3 operates on the same logic, one level up the stack.
A model with a hybrid Mamba-Transformer MoE architecture that achieves 449 to 478 tokens per second on B200 GPUs is not, in any meaningful sense, hardware-agnostic. The architecture is specifically designed to leverage NVIDIA's memory bandwidth and compute patterns. The long-context performance at 1 million tokens is achievable on B200 hardware in part because NVIDIA's memory system handles the attention and state caching requirements of the hybrid architecture in ways competing GPU designs do not match. When NVIDIA publishes these throughput numbers, they are simultaneously benchmarking the model and benchmarking the hardware that was used to achieve them.
The strategic consequence is subtle but structurally important. Every enterprise that adopts Nemotron 3 Super as its standard open reasoning model is, implicitly, selecting B200 hardware as its inference infrastructure. The open source license doesn't undermine this dependency; it accelerates adoption, which deepens the hardware lock-in. An enterprise that builds agent workflows around Nemotron 3 Super's specific performance profile will find that migrating to AMD MI300X or future Intel Gaudi hardware introduces friction: the benchmark performance that justified the model selection was measured on NVIDIA silicon. Apache 2.0 is a free license. The hardware it implies is not.
What to Watch Next
The 60-day indicator is cloud availability. NVIDIA has strong relationships with AWS, Google Cloud, and Azure, and the Apache 2.0 license removes licensing barriers to hosting. If all three major clouds add Nemotron 3 Super to their managed inference APIs by mid-Q2 2026, the model transitions from technically impressive to commercially dominant in the open category. Watch for announcements at cloud-specific developer conferences in May and June. The absence of a major cloud API announcement by late May 2026 would be the first signal that enterprise adoption is slower than the benchmark numbers predict.
The 90-day indicator is the SWE-Bench trajectory on updated checkpoints. NVIDIA has framed Nemotron 3 Super as a model that continues improving with post-training and reinforcement learning on agentic data. A checkpoint update pushing SWE-Bench Verified above 65% would close the gap with frontier closed models on software engineering tasks, the benchmark category most directly relevant to enterprise developer productivity. That is the threshold that moves Nemotron 3 Super from "best open option" to "competitive alternative to frontier closed models at zero API cost." If NVIDIA reaches that threshold within the calendar year, the economics of enterprise AI tooling change in a way that directly pressures OpenAI and Anthropic enterprise pricing. Watch Hugging Face download trends and the SWE-Bench leaderboard for new checkpoints through the end of Q2.
When a GPU company ships the reference open model for agentic AI, the benchmarks are the product and the hardware is the hidden price tag.
Key Takeaways
- 85.6% PinchBench score — Highest-scoring open model on NVIDIA's agentic AI benchmark, placing Nemotron 3 Super above every public open competitor in its class
- 7.5x throughput over Qwen3.5-122B — On equivalent B200 hardware, this gap makes Nemotron 3 Super the default throughput-efficiency choice for enterprise concurrent agent workloads
- 12B active out of 120B total parameters — The sparse MoE architecture delivers frontier-comparable reasoning at sub-frontier compute cost per token, reshaping enterprise deployment economics
- 91.75% RULER at 1M context length — Near-perfect long-context fidelity unlocks legal, scientific, and codebase-level AI applications previously limited to expensive closed model pricing tiers
- Apache 2.0 license with public weights — Full commercial rights remove licensing friction and accelerate adoption, while the B200-optimized architecture embeds a hardware dependency below the model layer
Questions Worth Asking
- If the highest-performing open agentic model is architecture-optimized for NVIDIA hardware, does the open source license actually provide the infrastructure independence enterprises assume it does?
- What does Nemotron 3 Super's throughput advantage mean for the competitive viability of Mistral, Qwen, and other efficiency-focused open model providers in the enterprise market?
- If 7.5x throughput on the same hardware reduces your AI agent inference cost by a proportional factor, how does that change the ROI calculation for agent deployments you are currently deferring on cost grounds?