Model overview

NVIDIA-Nemotron-3-Super-120B-A12B-FP8 is a large language model developed by NVIDIA for agentic workflows, reasoning tasks, and high-volume workloads. Released in March 2026, it uses a hybrid Latent Mixture-of-Experts architecture that combines Mamba-2 and MoE layers with select Attention components. The model keeps 12 billion parameters active out of 120 billion total and supports a context length of up to 1 million tokens. It supports seven languages: English, French, German, Italian, Japanese, Spanish, and Chinese.

Compared to related models in the Nemotron family, this variant offers distinct trade-offs. The NVIDIA-Nemotron-3-Super-120B-A12B-NVFP4 uses NVFP4 quantization for a different efficiency profile, while the NVIDIA-Nemotron-3-Super-120B-A12B-BF16 retains full precision. For smaller deployments, the NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 provides a more compact alternative with 3 billion active parameters.

Model inputs and outputs

The model accepts text prompts and generates coherent, contextually relevant responses. It incorporates configurable reasoning capabilities, allowing users to enable or disable intermediate thinking traces through the chat template. The FP8 quantization optimizes memory usage and computational efficiency while maintaining quality across diverse tasks.
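The reasoning toggle can be sketched as a request payload for an OpenAI-compatible serving endpoint (such as vLLM). The `chat_template_kwargs` field and the `enable_thinking` flag name are assumptions here, not confirmed by the model card; check the actual chat template for the exact toggle.

```python
# Sketch: building a chat request that toggles intermediate reasoning traces.
# Nothing is sent over the network; this only assembles the payload.
# ASSUMPTION: the template exposes an "enable_thinking" flag via
# chat_template_kwargs -- the real flag name may differ.

def build_chat_request(prompt: str, enable_thinking: bool) -> dict:
    """Assemble an OpenAI-style chat-completion request payload."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

reasoning_on = build_chat_request("Prove that sqrt(2) is irrational.", True)
reasoning_off = build_chat_request("Summarize this support ticket.", False)
print(reasoning_on["chat_template_kwargs"])  # {'enable_thinking': True}
```

In practice the same payload shape works for both modes, so switching a deployment between thinking and non-thinking behavior is a one-field change.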

Inputs

- Text prompts: natural-language instructions, questions, or multi-turn conversation history in any of the seven supported languages
- Chat template flags: optional settings that enable or disable intermediate reasoning traces
- Tool definitions: optional function schemas for tool-calling workflows

Outputs

- Text responses: coherent, contextually relevant completions
- Reasoning traces: optional intermediate thinking steps when reasoning mode is enabled
- Tool calls: structured function invocations when tool definitions are provided

Capabilities

This model handles complex reasoning, code generation, and multi-turn conversations with strong performance across benchmarks. It achieves 83.63 on MMLU-Pro for general knowledge, 94.38 on HMMT Feb25 with tools for mathematical reasoning, and 78.44 on LiveCodeBench for coding tasks. The model maintains context awareness across up to 1 million tokens, scoring 96.85 on RULER-500 at 128k context length. It performs retrieval-augmented generation tasks effectively and supports tool use for building collaborative AI agents.
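Before feeding a long document to the model, it helps to estimate whether it fits the 1-million-token window. The sketch below uses a rough 4-characters-per-token heuristic for English prose; that ratio is an assumption, not the model's actual tokenizer, so use the real tokenizer for production budgeting.

```python
# Sketch: rough context-budget check against the 1M-token window.
# ASSUMPTION: ~4 characters per token, a common English-text heuristic.
CONTEXT_LIMIT = 1_000_000
CHARS_PER_TOKEN = 4  # assumed average; verify with the model's tokenizer

def fits_in_context(text: str, reserve_for_output: int = 4_096) -> bool:
    """Estimate token count and leave headroom for the generated response."""
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens + reserve_for_output <= CONTEXT_LIMIT

print(fits_in_context("hello " * 100_000))  # ~150k estimated tokens -> True
```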

What can I use it for?

This model is suited for IT ticket automation, customer service bots, technical support systems, and enterprise workflows requiring reasoning capabilities. Organizations can build retrieval-augmented generation systems for documentation search, deploy code generation assistants, or create reasoning-based question-answering systems. The agentic capabilities enable task automation where the model coordinates multiple operations. For developers building specialized AI systems, the open weights and training recipes from NVIDIA support fine-tuning on domain-specific data.
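A retrieval-augmented generation pipeline of the kind described above can be sketched end to end. Retrieval here is a toy bag-of-words overlap score and the prompt format is illustrative; a real deployment would use a vector index and the model's own chat template.

```python
# Sketch: minimal RAG loop -- retrieve relevant docs, then assemble a
# grounded prompt. The corpus and scoring are toy stand-ins.
from collections import Counter

DOCS = [
    "Nemotron-3 Super supports a context length of up to 1 million tokens.",
    "FP8 quantization reduces memory usage while preserving quality.",
    "The model supports seven languages including English and Japanese.",
]

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank documents by word overlap with the query and keep the top k."""
    q = Counter(query.lower().split())
    scored = sorted(
        docs,
        key=lambda d: -sum((q & Counter(d.lower().split())).values()),
    )
    return scored[:k]

def build_rag_prompt(query: str) -> str:
    """Prepend retrieved context so the model answers from the documents."""
    context = "\n".join(retrieve(query, DOCS))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer using only the context."

print(build_rag_prompt("How long a context does the model support"))
```

Swapping the toy scorer for embedding similarity leaves the rest of the loop unchanged, which is the main appeal of this structure for documentation-search systems.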

Things to try

Set temperature=1.0 and top_p=0.95 across all serving backends for consistent performance on reasoning, tool calling, and general chat tasks. Enable the reasoning mode flag to observe how intermediate thinking traces improve solution quality on complex prompts, then disable it for production scenarios prioritizing speed. Test the model's behavior on long-context documents near the 1 million token limit to understand context retention. Experiment with tool calling by structuring prompts to trigger function definitions, observing how the model sequences multiple tool invocations for complex workflows.
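The recommended sampling settings and a tool-calling prompt can be combined in one request payload. The tool schema below follows the widely used OpenAI function-calling format; the `get_weather` function and its fields are hypothetical, and nothing is sent to a server.

```python
# Sketch: recommended sampling settings plus a tool definition, assembled
# into an OpenAI-style request. The get_weather tool is hypothetical.
RECOMMENDED_SAMPLING = {"temperature": 1.0, "top_p": 0.95}

def build_tool_request(prompt: str) -> dict:
    """Assemble a chat request that exposes one callable tool to the model."""
    return {
        "model": "nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-FP8",
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical example tool
                "description": "Look up current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
        **RECOMMENDED_SAMPLING,
    }

req = build_tool_request("What's the weather in Tokyo?")
print(req["temperature"], req["top_p"])  # 1.0 0.95
```

Adding more entries to the `tools` list is how multi-step workflows are set up: the model can then sequence several invocations across turns, as described above.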


This is a simplified guide to an AI model called NVIDIA-Nemotron-3-Super-120B-A12B-FP8 maintained by nvidia. If you like these kinds of analysis, join AIModels.fyi or follow us on Twitter.