NVIDIA Vera Rubin — The Next-Generation AI Infrastructure Dramatically Lowering Inference Costs

Introduction: Why Inference Cost is a Problem Now

As 2026 begins, the discussion surrounding AI is rapidly shifting from “model performance” to “the economics of inference cost.” The capabilities of Large Language Models (LLMs) are no longer in doubt, but the primary barrier to actual business deployment is the “inference cost per token.”

Agent AI, in particular, requires hundreds to thousands of LLM calls to complete a single task. This incurs costs orders of magnitude higher than simple queries, making scaling difficult.
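
To see why the economics shift so sharply, consider a toy cost model. The prices and token counts below are illustrative assumptions, not vendor figures; the point is simply that an agent looping over a long context multiplies the per-call cost by the call count.

```python
# Toy cost model for agent workloads. All prices and token counts are
# illustrative assumptions, not actual vendor pricing.

PRICE_PER_MTOK_IN = 1.00    # hypothetical $ per 1M input tokens
PRICE_PER_MTOK_OUT = 4.00   # hypothetical $ per 1M output tokens

def call_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of a single LLM call under the hypothetical price sheet."""
    return (tokens_in * PRICE_PER_MTOK_IN + tokens_out * PRICE_PER_MTOK_OUT) / 1e6

# A simple chat query: one short call.
single_query = call_cost(tokens_in=1_000, tokens_out=500)       # ~$0.003

# An agent task: 500 calls, each re-reading a long working context.
agent_task = 500 * call_cost(tokens_in=20_000, tokens_out=800)  # ~$11.60

print(f"single query: ${single_query:.4f}")
print(f"agent task:   ${agent_task:.2f}  ({agent_task / single_query:,.0f}x)")
```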

At GTC 2026 in March, NVIDIA CEO Jensen Huang summed up the situation succinctly: “If they have more capacity, they can generate more tokens, and they make more revenue. As agentic applications begin to generate other agents, and then go off and do tasks, the number of tokens generated is exploding.” The emphasis was squarely on high-speed, low-cost inference infrastructure.

NVIDIA’s answer to this challenge is the Vera Rubin platform. First unveiled at CES 2026 (January 2026) and detailed at GTC 2026 (March 2026), this next-generation AI infrastructure promises to reduce inference costs by up to 10x compared to the previous Blackwell generation, capturing industry attention.

This article will delve technically into Vera Rubin’s architecture, exploring why such cost reductions are achievable and its potential impact on the future of agent AI.


What is Vera Rubin: A “Supercomputer for AI” Integrating 7 Chips

Vera Rubin is not a single GPU chip but an integrated AI platform built from 7 specialized chips under what NVIDIA calls “Extreme Co-Design.” At GTC 2026, NVIDIA also officially confirmed its December 2025 acquisition of Groq for approximately $20 billion, with the Groq 3 LPU becoming the seventh chip integrated into the platform.

The 7 integrated chips are as follows:

| Chip | Role |
|---|---|
| Vera CPU | Custom CPU for AI (88 Olympus cores) |
| Rubin GPU | Core of AI computation (50 PFLOPS NVFP4) |
| NVLink 6 Switch | High-speed GPU-to-GPU communication (3.6 TB/s) |
| ConnectX-9 SuperNIC | Network processing |
| BlueField-4 DPU | Data processing / inference context memory |
| Spectrum-6 Ethernet Switch | Ethernet communication |
| Groq 3 LPU | Low-latency inference accelerator (newly added) |

The entire system is integrated at the rack level and delivered as the Vera Rubin NVL72, which combines 72 Rubin GPUs and 36 Vera CPUs in a single rack. For even larger-scale deployments, a 40-rack Vera Rubin POD configuration is also available, offering 60 exaFLOPS of compute power.


Vera CPU: A Proprietary Processor Designed for AI

One of the most significant departures of Vera Rubin from previous platforms is the adoption of NVIDIA’s custom-designed “Vera” CPU.

The Vera CPU features 88 Olympus cores. Olympus is a proprietary NVIDIA core based on the ARMv9.2 instruction set, specifically optimized for AI data center workloads. Each core can process 2 threads in parallel using “Spatial Multithreading” technology, providing a total processing capacity of 176 threads. The L3 cache has been increased by 40% to 162MB, and the transistor count reaches 227 billion, a 2.2x increase over the previous generation.

A notable feature is its FP8 support. The Vera CPU is the industry’s first CPU to natively support FP8, allowing an entire AI workload to run on a uniform low-precision numerical format end to end.

On the memory side, it supports up to 1.5TB of SOCAMM LPDDR5X memory, offering a memory bandwidth of 1.2 TB/s. By widening the memory bus to 1024 bits and increasing the speed to 9600MT/s, it achieves 2.5x the bandwidth of the previous generation. Even more critical is its connection to the Rubin GPU. The 2nd-generation NVLink-C2C (Chip-to-Chip) provides a coherent bandwidth of 1.8 TB/s between the CPU and GPU. This is 7x faster than PCIe Gen 6.
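
These bandwidth figures check out with simple arithmetic. The sketch below reproduces them; the PCIe Gen 6 x16 baseline of roughly 256 GB/s bidirectional is my assumption for the comparison.

```python
# Reproducing the Vera CPU interconnect numbers quoted above.

# Memory: a 1024-bit bus at 9600 MT/s moves 1024/8 = 128 bytes per transfer.
mem_bw = (1024 // 8) * 9600e6                            # bytes/s
print(f"LPDDR5X bandwidth: {mem_bw / 1e12:.2f} TB/s")    # 1.23 TB/s, i.e. ~1.2 TB/s

# NVLink-C2C vs. PCIe Gen 6 x16 (~256 GB/s bidirectional, an assumed baseline).
nvlink_c2c = 1.8e12
pcie6_x16 = 256e9
print(f"NVLink-C2C advantage: {nvlink_c2c / pcie6_x16:.1f}x")  # ~7x
```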

Why a Custom CPU is Necessary

Traditional AI servers have used general-purpose CPUs, but these often become bottlenecks in LLM inference. The host CPU’s memory bandwidth and connection speed cannot keep up with the GPU’s processing power.

NVIDIA recognized that LLM inference is constrained by memory bandwidth and interconnects, and optimized the entire system by designing its own CPU. The high-speed coherent link between CPU and GPU minimizes data transfer overhead, improving GPU utilization.


Rubin GPU: The Next-Generation Compute Engine Specialized for Inference

The Rubin GPU incorporates numerous innovations specifically for AI inference.

Key Specifications

| Item | Value |
|---|---|
| NVFP4 inference performance | 50 PFLOPS (5x Blackwell) |
| NVFP4 training performance | 35 PFLOPS (3.5x Blackwell) |
| HBM4 memory | 288GB (per GPU) |
| HBM4 memory bandwidth | 22 TB/s |
| NVLink 6 bandwidth | 3.6 TB/s (per GPU) |
| Transistor count | 336 billion |

Particularly noteworthy is the adoption of HBM4. Compared to the previous generation HBM3, memory bandwidth is improved by approximately 2.8x, directly addressing the issue of LLM inference being constrained by memory bandwidth.
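
A rough roofline argument shows why this matters. During decode, every generated token must stream the model’s active weights from memory at least once, so memory bandwidth sets a hard ceiling on tokens per second. The model size, 4-bit weight assumption, and the 8 TB/s Blackwell-class figure below are illustrative assumptions.

```python
# Bandwidth-bound ceiling on decode throughput: tokens/s <= HBM bandwidth
# divided by the bytes read per token (roughly one pass over active weights).

def decode_ceiling(hbm_tbs: float, active_params_b: float,
                   bytes_per_param: float = 0.5) -> float:
    """Upper bound on tokens/s per GPU; 0.5 B/param assumes a 4-bit format."""
    bytes_per_token = active_params_b * 1e9 * bytes_per_param
    return hbm_tbs * 1e12 / bytes_per_token

# Hypothetical MoE model with 120B active parameters per token:
print(f"8 TB/s (Blackwell-class): {decode_ceiling(8, 120):.0f} tok/s per GPU")   # 133
print(f"22 TB/s (Rubin):          {decode_ceiling(22, 120):.0f} tok/s per GPU")  # 367
```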

NVFP4 and 3rd-Generation Transformer Engine

The Rubin GPU features the 3rd-generation Transformer Engine, built around the low-precision NVFP4 numerical format. NVFP4 packs far more arithmetic per unit of silicon and power than FP8, and Rubin’s implementation delivers 5x Blackwell’s NVFP4 throughput while maintaining precision. NVIDIA achieves real-world throughput gains beyond the raw FLOPS increase by deeply integrating low-precision execution into both the architecture and the software stack.


NVLink 6: Doubled GPU-to-GPU Bandwidth

In LLM inference, especially with Mixture-of-Experts (MoE) models in multi-GPU environments, GPU-to-GPU communication bandwidth is a key performance determinant.

NVLink 6 doubles the bandwidth compared to the previous generation (NVLink 5).

| Metric | NVLink 5 | NVLink 6 |
|---|---|---|
| Bandwidth per switch | 1,800 GB/s | 3,600 GB/s |
| Bandwidth per GPU | approx. 1.8 TB/s | 3.6 TB/s |
| Total per NVL72 rack | - | 260 TB/s |

The 260 TB/s internal bandwidth provided by the NVL72 rack enables efficient inference for large-scale MoE models.
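
A quick estimate illustrates the effect. In expert-parallel MoE inference, every layer performs an all-to-all exchange of token activations across GPUs; the shapes below are assumptions chosen for illustration.

```python
# Rough per-layer time for the MoE all-to-all: bytes moved / per-GPU bandwidth.

def all_to_all_us(tokens_in_flight: int, hidden_dim: int,
                  bytes_per_act: int, link_tbs: float) -> float:
    """Microseconds to exchange one layer's activations over the NVLink fabric."""
    return tokens_in_flight * hidden_dim * bytes_per_act / (link_tbs * 1e12) * 1e6

# Assumed shapes: 8,192 tokens in flight, hidden size 16,384, 1-byte activations.
for name, bw in [("NVLink 5", 1.8), ("NVLink 6", 3.6)]:
    print(f"{name}: {all_to_all_us(8192, 16384, 1, bw):.0f} us per MoE layer")
    # NVLink 5: ~75 us, NVLink 6: ~37 us
```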


Groq 3 LPU: Low-Latency Inference Accelerator

One of the biggest surprises at GTC 2026 was the integration of Groq’s LPU (Language Processing Unit) technology into the Vera Rubin platform. NVIDIA acquired Groq on December 24, 2025, for approximately $20 billion, securing key personnel and a non-exclusive license for Groq’s LPU technology.

Role Division Between GPU and LPU

In the Vera Rubin system, Rubin and Groq share the inference process.

  • Rubin GPU: Handles prefill processing and decode attention.
  • Groq 3 LPU: Executes the Feed-Forward Network (FFN).

This division of labor allows each chip to concentrate on its most proficient tasks.
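
In code, the split might look like the sketch below. This is a conceptual illustration with hypothetical `gpu` and `lpu` interfaces; NVIDIA has not published the actual programming model.

```python
# Conceptual decode step for the Rubin/Groq split described above.
# The gpu and lpu objects and their methods are hypothetical placeholders.

def decode_step(hidden, layers, kv_cache, gpu, lpu):
    for i, layer in enumerate(layers):
        # Rubin GPU: attention runs next to the HBM-resident KV cache.
        hidden = gpu.attention(layer, hidden, kv_cache[i])
        # Groq 3 LPU: the FFN runs from SRAM-resident weights for low latency.
        hidden = lpu.ffn(layer, hidden)
    return hidden  # final logits projection omitted for brevity
```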

Groq 3 LPX Rack Specifications

The Groq 3 LPX Rack, announced at GTC 2026, features 256 LPUs.

| Item | Value |
|---|---|
| SRAM capacity (per chip) | 500MB |
| SRAM bandwidth (per chip) | 150 TB/s |
| Scale-up bandwidth (per chip) | 2.5 TB/s |
| Total on-chip SRAM capacity (rack) | 128GB |
| Total scale-up bandwidth (rack) | 640 TB/s |

Groq 3 is designed with an emphasis on bandwidth over capacity: each chip carries just 500MB of SRAM but reads it at 150 TB/s. This SRAM-centric, high-bandwidth design is what enables low-latency FFN processing.
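
The payoff is easiest to see as a streaming-time comparison. The figures below use the per-chip table values, with Rubin’s 22 TB/s HBM4 figure for contrast.

```python
# Time to stream 500MB of FFN weights once, from Groq SRAM vs. from HBM4.

weights_bytes = 500e6
t_sram_us = weights_bytes / 150e12 * 1e6   # ~3.3 us at 150 TB/s SRAM
t_hbm_us = weights_bytes / 22e12 * 1e6     # ~22.7 us at 22 TB/s HBM4
print(f"SRAM: {t_sram_us:.1f} us   HBM4: {t_hbm_us:.1f} us "
      f"({t_hbm_us / t_sram_us:.1f}x slower)")
```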

Integration Benefits

The combination of Vera Rubin and Groq LPX results in up to 35x higher inference throughput for trillion-parameter models and a 35x increase in throughput per megawatt compared to Rubin GPU alone. This is achieved without significant changes to the CUDA platform, by leveraging the LPUs as highly specialized decode accelerators.


Inference Context Memory Storage: Specialization for Agent AI

A crucial feature indicating Vera Rubin’s design as a “foundation for agent AI” is its Inference Context Memory Storage Platform.

New Memory Hierarchy

NVIDIA utilizes the BlueField-4 DPU to build a new memory tier between the GPU and traditional storage.

The BlueField-4 STX storage rack functions as “dedicated context memory” to maintain context consistency for AI agents handling large multi-turn conversations. By offloading KV cache data to the BlueField-4 chip, the sharing and reuse of cache data across the entire AI inference infrastructure become possible, improving inference throughput by up to 5x.
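
Conceptually, this adds a lookup tier between GPU HBM and full recomputation. The sketch below is hypothetical pseudocode for such a tiered KV-cache fetch; the interfaces are placeholders, not NVIDIA APIs.

```python
# Conceptual tiered KV-cache lookup: GPU HBM first, then the BlueField-4
# context-memory tier, and only recompute the prefill on a full miss.

def get_kv_cache(session_id, hbm, context_store, recompute_prefill):
    if session_id in hbm:                     # hot: already on the GPU
        return hbm[session_id]
    blocks = context_store.fetch(session_id)  # warm: offloaded to BlueField-4 STX
    if blocks is not None:
        hbm[session_id] = blocks              # page back into HBM and reuse
        return blocks
    blocks = recompute_prefill(session_id)    # cold: pay the full prefill cost
    hbm[session_id] = blocks
    return blocks
```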

Impact on Agent AI

Agent AI exhibits computational patterns fundamentally different from simple queries.

Dozens to hundreds of LLM calls occur for a single instruction, each carrying a long context. Inference Context Memory Storage improves the overall throughput and cost-efficiency of agent AI by managing this KV cache efficiently.


The Mechanism for 10x Cost Reduction: Understanding the Numbers Accurately

It is crucial to accurately understand the conditions under which NVIDIA’s claim of “10x inference cost reduction” is achieved.

Key Improvement Factors

The 10x cost reduction is realized through the combined effect of multiple technological innovations.

  • Improved HBM4 memory bandwidth: approx. 2.8x
  • Improved NVLink 6 throughput: approx. 2x
  • NVFP4 Tensor Core performance improvement: approx. 5x
  • FFN processing efficiency from Groq LPU integration: additional factor
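
A naive way to sanity-check the headline claim is to compose these factors, keeping in mind that a real workload is bound by one bottleneck at a time, so the product is a ceiling rather than an expectation.

```python
# Composing the individual gains (an optimistic ceiling, since a real
# workload is limited by one bottleneck at a time, not all simultaneously).

factors = {"HBM4 bandwidth": 2.8, "NVLink 6 throughput": 2.0,
           "NVFP4 tensor cores": 5.0}
ceiling = 1.0
for f in factors.values():
    ceiling *= f
print(f"multiplicative ceiling: {ceiling:.0f}x")  # 28x, before LPU offload

# The claimed "up to 10x" cost reduction sits comfortably below that ceiling.
```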

Dramatic Improvement in Power Efficiency

Jensen Huang presented striking figures in his keynote address: “With the Blackwell generation, we could generate 22 million tokens per second from a 1GW data center. With Vera Rubin, we can generate 700 million tokens per second from the same power.” That works out to roughly a 32x improvement in two years.

| Metric | Blackwell | Vera Rubin | Improvement Factor |
|---|---|---|---|
| Tokens/sec per 1GW | 22 million | 700 million | approx. 32x |
| Token cost (long context) | baseline | up to 1/10 | up to 10x |
| Inference throughput per watt | baseline | 10x | 10x |
| MoE training GPU count | baseline | 1/4 | 4x efficiency |

Realistic Expectations

That said, these figures deserve a realistic reading. The 10x cost reduction is a benchmark result under specific “long context, long output” conditions. For dense-model inference with shorter contexts, a 2-3x improvement is a more realistic expectation.


NVL72 Rack: System-Wide Performance

The Vera Rubin NVL72 is a rack-scale system integrating all components.

NVL72 Specification Summary

| Item | Specification |
|---|---|
| GPU configuration | 72 x Rubin GPUs |
| CPU configuration | 36 x Vera CPUs |
| Total NVFP4 inference performance | 3.6 ExaFLOPS |
| Total HBM4 capacity | 20.7 TB |
| Total HBM4 bandwidth | 1.6 PB/s (petabytes per second) |
| Total NVLink 6 bandwidth | 260 TB/s |
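
These rack totals are simply the per-GPU Rubin figures from earlier scaled by 72, as a quick check confirms:

```python
# NVL72 totals = per-GPU Rubin specs x 72 GPUs.
gpus = 72
print(f"NVFP4 inference: {gpus * 50 / 1000:.1f} EFLOPS")  # 3.6
print(f"HBM4 capacity:   {gpus * 288 / 1000:.1f} TB")     # 20.7
print(f"HBM4 bandwidth:  {gpus * 22 / 1000:.2f} PB/s")    # 1.58, quoted as ~1.6
print(f"NVLink 6 fabric: {gpus * 3.6:.0f} TB/s")          # 259, quoted as ~260
```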

Vera Rubin POD: Data Center Scale Deployment

For even larger configurations, the Vera Rubin POD is available, comprising 40 racks.

| Item | Specification |
|---|---|
| Total GPUs | 2,880 |
| Total compute performance | 60 ExaFLOPS |
| Component count | over 1,300,000 |

The POD serves as the fundamental unit for NVIDIA’s next-generation data centers, which they refer to as “AI Factories.”


Comparison with Blackwell: Evolution Between Generations

Vera Rubin is positioned as NVIDIA’s successor to Blackwell. Let’s summarize the key improvements of each generation.

| Item | Blackwell | Vera Rubin | Improvement Factor |
|---|---|---|---|
| GPU inference performance (NVFP4) | 10 PFLOPS | 50 PFLOPS | 5x |
| GPU training performance | 10 PFLOPS | 35 PFLOPS | 3.5x |
| GPU-to-GPU bandwidth | 1,800 GB/s | 3,600 GB/s | 2x |
| HBM generation | HBM3 | HBM4 | approx. 2.8x (bandwidth) |
| CPU | general-purpose / Grace | Vera (88 Olympus cores) | - |
| Low-latency inference | - | Groq 3 LPU integration | - |
| MoE training GPU count | baseline | reduced to 1/4 | 4x |
| Token cost | baseline | up to 1/10 | up to 10x |

Deployment Timeline and Key Partners

Availability Schedule

NVIDIA plans to begin mass production and shipment of Vera Rubin in the second half of 2026. At GTC 2026 (March 16-19, 2026), Vera Rubin was confirmed to be in “full production status.”

Initial Deployment Partners

The following companies have been announced as initial partners offering cloud services based on Vera Rubin:

  • Hyperscalers: AWS, Google Cloud, Microsoft Azure, Oracle Cloud Infrastructure (OCI)
  • Specialized Clouds: CoreWeave, Lambda, Nebius, Nscale

Jensen Huang stated, “Cumulative orders for Blackwell and Rubin will exceed $1 trillion by the end of 2027,” indicating Vera Rubin’s central role in data center investment.


Technical Challenges and Future Outlook

Power Consumption and Data Center Investment

While the NVL72 rack offers immense computational power, its power consumption is also substantial. In 2026, data center capital expenditures by hyperscalers are projected to exceed $65 billion in total, necessitating significant investment in power and cooling infrastructure for Vera Rubin deployment.

Software Ecosystem Development

Although NVIDIA states that Groq 3 LPU integration does not require significant changes to the CUDA platform, optimization of the software stack (CUDA libraries, inference frameworks) is also crucial. NVIDIA is addressing this through initiatives like NIM (NVIDIA Inference Microservices).

Next-Generation “Vera Rubin Ultra”

At GTC 2026, a further next-generation platform, Vera Rubin Ultra, was previewed, suggesting NVIDIA’s commitment to annual platform evolution.


Conclusion: Towards the Next Stage of AI Infrastructure

NVIDIA Vera Rubin is more than just a “faster GPU.” It is an integrated AI platform in which 7 chips and the surrounding systems are co-designed as one: the proprietary Vera CPU, dramatically higher memory bandwidth with HBM4, doubled GPU-to-GPU communication with NVLink 6, low-latency inference with the integrated Groq 3 LPU, and KV cache management through Inference Context Memory Storage.

Up to 10x lower inference cost (under long-context conditions), a 4x reduction in GPU count for MoE model training, and roughly 32x more token-generation capacity at the same power fundamentally alter the economic viability of agent AI.

In 2026, as agent AI begins to be deployed for corporate automation, inference cost is a direct factor in business profitability. With Vera Rubin entering mass production in the second half of 2026, this cost equation will be rewritten. The practical realization of AI hinges not only on model intelligence but also on the economics of the infrastructure that powers it. In this context, Vera Rubin represents a significant infrastructure innovation defining 2026.


References

  • NVIDIA Kicks Off the Next Generation of AI With Rubin — Six New Chips, One Incredible AI Supercomputer (NVIDIA Newsroom, 2026/03/16): https://nvidianews.nvidia.com/news/rubin-platform-ai-supercomputer
  • NVIDIA Vera Rubin Opens Agentic AI Frontier (NVIDIA Newsroom, 2026/03/16): https://nvidianews.nvidia.com/news/nvidia-vera-rubin-platform
  • Inside the NVIDIA Vera Rubin Platform: Six New Chips, One AI Supercomputer (NVIDIA Technical Blog, 2026/03/16): https://developer.nvidia.com/blog/inside-the-nvidia-rubin-platform-six-new-chips-one-ai-supercomputer/
  • Inside NVIDIA Groq 3 LPX: The Low-Latency Inference Accelerator for the NVIDIA Vera Rubin Platform (NVIDIA Technical Blog, 2026/03/16): https://developer.nvidia.com/blog/inside-nvidia-groq-3-lpx-the-low-latency-inference-accelerator-for-the-nvidia-vera-rubin-platform/
  • NVIDIA Vera Rubin POD: Seven Chips, Five Rack-Scale Systems, One AI Supercomputer (NVIDIA Technical Blog, 2026/03/16): https://developer.nvidia.com/blog/nvidia-vera-rubin-pod-seven-chips-five-rack-scale-systems-one-ai-supercomputer/
  • Infrastructure for Scalable AI Reasoning (NVIDIA Official, 2026/03): https://www.nvidia.com/en-us/data-center/technologies/rubin/
  • Nvidia launches Vera Rubin NVL72 AI supercomputer at CES (Tom’s Hardware, 2026/01/06): https://www.tomshardware.com/pc-components/gpus/nvidia-launches-vera-rubin-nvl72-ai-supercomputer-at-ces-promises-up-to-5x-greater-inference-performance-and-10x-lower-cost-per-token-than-blackwell-coming-2h-2026
  • GTC 2026: Nvidia Unveils Vera Rubin AI Platform, Eyes $1T by 2027 (Data Center Knowledge, 2026/03/16): https://www.datacenterknowledge.com/data-center-chips/gtc-2026-nvidia-unveils-vera-rubin-ai-platform-eyes-1t-by-2027
  • Nvidia GTC 2026: CEO Jensen Huang sees $1 trillion in orders for Blackwell and Vera Rubin through ‘27 (CNBC, 2026/03/16): https://www.cnbc.com/2026/03/16/nvidia-gtc-2026-ceo-jensen-huang-keynote-blackwell-vera-rubin.html
  • Nvidia’s Rubin platform aims to cut AI training, inference costs (CIO Dive, 2026/03): https://www.ciodive.com/news/nvidia-rubin-cut-ai-training-inference-costs/808915/
  • NVIDIA Vera Rubin NVL72 Detailed: 72 GPUs, 36 CPUs, 260 TB/s Scale-Up Bandwidth (VideoCardz, 2026/01): https://videocardz.com/newz/nvidia-vera-rubin-nvl72-detailed-72-gpus-36-cpus-260-tb-s-scale-up-bandwidth
  • Decoding the Future of Inference At NVIDIA: Groq LPUs Join Vera Rubin Platform (ServeTheHome, 2026/03/16): https://www.servethehome.com/decoding-the-future-of-inference-at-nvidia-groq-lpus-join-vera-rubin-platform-for-low-latency-inference/
  • Nvidia Boasts 7 Chips in Production for Vera Rubin Platform, Including Groq 3 LPU (HPCwire, 2026/03/16): https://www.hpcwire.com/2026/03/16/nvidia-boasts-7-chips-in-production-for-vera-rubin-platform-including-groq-3-lpu/
  • NVIDIA Launches New Vera CPU: 88 Olympus Cores Designed From Scratch for AI (Knowledge Hub Media, 2026/01): https://knowledgehubmedia.com/nvidia-launches-new-vera-cpu-88-olympus-cores-designed-from-scratch-for-ai/
  • NVIDIA GTC 2026: Rubin GPUs, Groq LPUs, Vera CPUs, and What NVIDIA Is Building for Trillion-Parameter Inference (StorageReview, 2026/03/16): https://www.storagereview.com/news/nvidia-gtc-2026-rubin-gpus-groq-lpus-vera-cpus-and-what-nvidia-is-building-for-trillion-parameter-inference

This article was automatically generated by LLM. It may contain errors.