
NVIDIA’s Nemotron Ultra 253B
NVIDIA’s Nemotron Ultra 253B is a compact yet powerful open-source AI model that outperforms larger rivals with efficient reasoning and impressive benchmark results.
AIVO News
4/10/2025 · 5 min read
Introduction
With Nemotron Ultra 253B, NVIDIA has just made a huge statement in the large language model (LLM) arena. The model competes head-on with the giants, including DeepSeek R1, beating it on many tasks despite being significantly smaller. Built on Meta's Llama 3.1 405B Instruct base, it is a marvel of engineering, combining flexibility, modularity, and fine-tuned performance that help it stand out in the rough-and-tumble LLM landscape.
Let us look at how NVIDIA pulled off this technical feat, why Nemotron Ultra stands in a class of its own, and why it matters to developers, researchers, and business users alike.
Small Model, Big Results


Nemotron Ultra has 253 billion parameters, well under half the size of DeepSeek R1 (671B), yet it performs better on instruction following, question answering, and even some code-generation benchmarks. It runs on a single 8x H100 GPU node (Hopper architecture) in BF16 or FP8 precision, and its smaller footprint calls into question the assumption that near-trillion-parameter models are required for world-class performance.
The design is built for ease of deployment and cost savings.
A Smarter Architecture with NAS


The key innovation that distinguishes Nemotron Ultra is NVIDIA's use of Neural Architecture Search (NAS), which let the team depart from the stock Llama architecture and decide, layer by layer, what to keep, what to skip, and what to fuse. Examples include:
- Some blocks skip attention entirely.
- Some feedforward layers are compressed to a smaller width.
- Adjacent feedforward networks are fused into larger, more efficient ones.
The result is a build that is faster and more memory-efficient without abandoning the intelligence and reasoning power inside.
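To make the idea concrete, here is a purely illustrative PyTorch sketch of a decoder block whose attention sub-layer can be skipped and whose feedforward width can be compressed. It is a conceptual toy under our own assumptions, not Nemotron Ultra's actual architecture:

```python
# Illustrative sketch of NAS-style block variants: some blocks skip attention
# entirely, others use a compressed feedforward layer. This is NOT Nemotron
# Ultra's real implementation, just a toy to show the concept.
import torch
import torch.nn as nn

class VariableBlock(nn.Module):
    def __init__(self, d_model: int, n_heads: int,
                 use_attention: bool = True, ffn_ratio: float = 4.0):
        super().__init__()
        self.use_attention = use_attention
        if use_attention:
            self.norm1 = nn.LayerNorm(d_model)
            self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        hidden = int(d_model * ffn_ratio)  # a NAS search may shrink this per block
        self.ffn = nn.Sequential(
            nn.Linear(d_model, hidden), nn.GELU(), nn.Linear(hidden, d_model)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.use_attention:
            h = self.norm1(x)
            attn_out, _ = self.attn(h, h, h, need_weights=False)
            x = x + attn_out
        return x + self.ffn(self.norm2(x))

# A search procedure might pick a different variant for each layer, e.g.:
blocks = nn.ModuleList([
    VariableBlock(512, 8, use_attention=True,  ffn_ratio=4.0),
    VariableBlock(512, 8, use_attention=False, ffn_ratio=1.0),  # attention skipped, FFN compressed
])
x = torch.randn(1, 16, 512)
for blk in blocks:
    x = blk(x)
```

The intuition is that not every layer needs full attention or a full-width feedforward network, so a search over such variants can trade redundant compute for memory and speed.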
Reasoning Mode: On or Off, You Decide

One of the most distinctive features of Nemotron Ultra is a mechanism that lets users switch the model between "Reasoning On" and "Reasoning Off" modes. This toggle gives developers control over how the model processes input:
- On: deliberate, step-by-step thinking for difficult tasks such as in-depth question answering and code generation.
- Off: fast, short answers for simple instructions.
The difference in performance is dramatic. For example:
- On MATH500, accuracy jumped from 80.40% (Off) to 97.00% (On).
- AIME25 climbed from 16.67% with Reasoning Off to 72.50% with Reasoning On.
- LiveCodeBench results more than doubled, from 29.03% (Off) to 66.31% (On).
- GPQA rose from 56.60% to 76.01%.
These figures make it clear that the Reasoning toggle is not a gimmick; it is a practical way to adjust the precision and depth of the output.
Benchmark Showdown: Nemotron Ultra vs. DeepSeek R1
Nemotron Ultra holds its own in head-to-head tests and more often than not comes out on top.
- It beats DeepSeek R1 on benchmarks such as GPQA, IFEval instruction following, and code generation.
- It trails slightly on the hardest math problems: 72.50% on AIME25 versus DeepSeek's 79.8%.
- On MATH500 the gap is negligible: 97.00% versus 97.30%.
That showing is impressive when you consider Nemotron Ultra's much lower parameter count.
Open Source and Commercial Ready
Nemotron Ultra is fully open under the NVIDIA Open Model License and also inherits the Llama 3.1 Community License, since it is built on Meta's model.
What does this mean for you?
- The model weights and post-training data can be downloaded from Hugging Face.
- It is commercially usable: deploy it in chatbots, AI assistants, retrieval-augmented generation (RAG) applications, and more.
- NVIDIA encourages developers to run their own alignment, bias, and safety checks to support responsible AI use.
How It Was Trained: A Deep Dive
Nemotron Ultra's prowess comes not just from its size but, more importantly, from NVIDIA's multi-step post-training pipeline, which comprised:
1. Supervised fine-tuning on tasks such as math, code generation, chat, and tool use.
2. Reinforcement learning with Group Relative Policy Optimization (GRPO), sketched below, to sharpen instruction following.
3. Knowledge distillation over more than 65 billion tokens.
4. Continued pretraining on a further 88 billion tokens from public datasets such as FineWeb, Buzz-V1.2, and Dolma, supplemented with synthetic data.
Notably, the synthetic data proved especially useful for teaching the model to switch Reasoning modes on everyday tasks.
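As a rough sketch of the idea behind GRPO (not NVIDIA's training code), the algorithm samples a group of responses to the same prompt, scores them with a reward model, and normalizes each reward against the group's mean and standard deviation, so no separate value network is needed:

```python
# Toy illustration of the group-relative advantage at the heart of GRPO.
# Conceptual sketch only; the real pipeline also involves a clipped policy
# objective and a KL penalty against a reference model.
import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Normalize each sampled response's reward against its group's statistics."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four responses sampled for one prompt, scored by a reward model.
rewards = np.array([0.2, 0.9, 0.5, 0.4])
advantages = group_relative_advantages(rewards)
# Responses above the group mean get positive advantages (reinforced);
# those below get negative advantages (discouraged).
print(advantages)
```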
Huge Context Length and Real-World Use


What also sets Nemotron Ultra apart is its massive context window of up to 131,072 tokens, far beyond the 4K or 8K we would normally expect. Ideal applications include:
- Conducting long chatbot conversations.
- Analyzing full-length documents.
- Processing and summarizing code repositories.
For developers working with Hugging Face Transformers, integration is simple: use version 4.48.3, control Reasoning mode via the system prompt ("Detailed Thinking On" or "Off"), and set the decoding preferences below (a minimal sketch follows the list):
- On mode: temperature = 0.6, top_p = 0.95.
- Off mode: greedy decoding (temperature = 0) for deterministic output.
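Here is a minimal Transformers sketch of those settings. The model ID and the exact system-prompt wording are assumptions based on this article, so check the Hugging Face model card before relying on them, and note that a model of this size needs a multi-GPU node:

```python
# Minimal sketch of toggling Reasoning mode via the system prompt.
# MODEL_ID and the "detailed thinking on/off" strings are assumptions;
# verify them against the official model card.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1"  # assumed Hugging Face ID

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"  # requires an 8x H100-class node
)

def generate(prompt: str, reasoning: bool) -> str:
    # The system prompt toggles "Detailed Thinking" on or off.
    messages = [
        {"role": "system", "content": f"detailed thinking {'on' if reasoning else 'off'}"},
        {"role": "user", "content": prompt},
    ]
    inputs = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)
    if reasoning:
        # Recommended sampling settings for Reasoning On.
        outputs = model.generate(inputs, max_new_tokens=1024,
                                 do_sample=True, temperature=0.6, top_p=0.95)
    else:
        # Greedy decoding for deterministic, short answers.
        outputs = model.generate(inputs, max_new_tokens=256, do_sample=False)
    return tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True)

print(generate("What is the integral of x^2 from 0 to 3?", reasoning=True))
```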
NVIDIA also provides vLLM-based deployment guidelines to simplify API serving.
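Once a vLLM server is up per those guidelines, it exposes an OpenAI-compatible API. A minimal client sketch, with a placeholder URL and the same assumed model ID as above, might look like this:

```python
# Minimal sketch of querying a vLLM OpenAI-compatible endpoint.
# The URL, port, and model name are placeholders; adjust to your deployment.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "nvidia/Llama-3_1-Nemotron-Ultra-253B-v1",  # assumed model ID
        "messages": [
            {"role": "system", "content": "detailed thinking off"},
            {"role": "user", "content": "Summarize the attached document in two sentences."},
        ],
        "temperature": 0.0,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```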
Hardware Requirements and Optimization
The model is very large, yet deployment is efficient; tested setups include:
- 8x H100 80GB GPUs (BF16).
- 4x B100 GPUs.
- 4x H100 80GB GPUs (FP8 precision).
This flexibility, combined with the memory savings from architectural compression, makes it much easier to deploy than many larger models.
Timeline, Transparency, and the Model Family
NVIDIA started working on Nemotron Ultra in November 2024 and completed it by April 2025. Along the way, the team tested different attention-skipping strategies and ultimately released the Llama Nemotron post-training dataset for public appraisal.
It ships alongside sibling models:
- Nemotron Nano 8B V1: the lightweight version.
- Nemotron Super 49B V1: the mid-sized sibling.
Ultra thus sits at the "just right" point in the family, balancing power and deployability.
Multilingual, Multi-Purpose AI
Nemotron Ultra is not English-only: it also supports German, French, Italian, Portuguese, Hindi, Spanish, and Thai. It is built for general-purpose use cases, including:
- Code generation
- AI agents
- Retrieval-augmented generation (RAG)
- Chatbots, etc.
Just follow the official prompting guidelines: control the mode with the system prompt "Detailed Thinking On/Off" and avoid adding anything else that might confuse the model.
Final Thoughts: Nemotron Ultra Redefines What is Possible
In the open-source spirit, Nemotron Ultra 253B is a game changer. It proves that smart design beats sheer scale: a model does not need a trillion parameters to zoom past the competition.
Developers can start using it right now via Hugging Face, customize reasoning modes, handle long contexts, and build smart apps with it.
The bottom line: Nemotron Ultra packs a punch, is adaptable, and is economical, all in an open-source package. Give it a try; you might start to reimagine what is possible with LLMs.
