NVIDIA Vera Rubin Platform Deep Dive: 6-Chip Package, 288GB HBM4, 50 PFLOPS FP4

April 22, 2026 · 5 min read

Industry Research Team

The NVIDIA Vera Rubin platform is NVIDIA's next-generation flagship computing platform after Blackwell. This article provides an in-depth analysis covering the naming origin, 6-chip packaging, memory subsystem, compute matrix, networking architecture, rack-scale solution, and software ecosystem.

Naming Origin: Honoring Astronomer Vera Rubin

NVIDIA chose "Vera Rubin" as the codename for its next-generation platform in honor of astronomer Vera Florence Cooper Rubin (1928-2016). Her measurements of galaxy rotation curves in the 1960s and 1970s provided some of the most compelling early evidence for the existence of dark matter.

Naming the next-generation AI compute platform after her draws a deliberate parallel:

Dark matter is the "invisible" yet dominant mass component of the universe
AI compute is the "invisible" yet dominant underlying infrastructure of the digital economy

6-Chip Packaging (CoWoS-L)

The Vera Rubin platform adopts the industry's first 6-chip CoWoS-L packaging:

Chip	Quantity	Role	Process
Vera CPU	1	Host CPU / Prefetch / Interconnect Controller	TSMC 3NP
Rubin GPU Die	2	Matrix Compute Cores	TSMC 3NP / 4NP
I/O / HBM Base Die	3	HBM4 PHY + I/O + Interconnect	TSMC 4NP

Compared to the Blackwell B300 Ultra's 2-chip package (only 2 GPU dies), Vera Rubin features a 3× increase in chip count, with substantial transistor count growth.

Core Specifications (Per GPU)

Item	Specification
Architecture	Rubin
Transistor Count	~340 billion (per GPU)
Memory	288 GB HBM4
Memory Bandwidth	22 TB/s (2.75× HBM3e)
FP4 Tensor (sparse)	50 PFLOPS
FP8 Tensor (sparse)	25 PFLOPS
FP16/BF16 Tensor	12.5 PFLOPS
TDP	~1,800 W (liquid cooling required)

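Data Convention: Post-Blackwell NVIDIA products continue to quote sparse compute as the official metric, which is twice the dense figure: 50 PF sparse FP4 corresponds to 25 PF dense. For cross-vendor comparison, convert to a common basis first. AMD MI400's 40 PF dense FP4 would be quoted as 80 PF under NVIDIA's sparse convention, against 50 PF sparse for Rubin R200.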
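The sparse/dense convention can be written as a small conversion helper. This is a minimal sketch that assumes a fixed 2:1 ratio between sparse and dense throughput, as implied by "FP4 50 PF = dense 25 PF"; the function and constant names are illustrative and not part of any NVIDIA tool.

```python
# Assumption: sparse throughput is exactly 2x dense throughput,
# as implied by the article's "50 PF sparse = 25 PF dense".
SPARSITY_FACTOR = 2.0


def sparse_to_dense(pflops_sparse: float) -> float:
    """Convert a sparse PFLOPS figure to its dense equivalent."""
    return pflops_sparse / SPARSITY_FACTOR


def dense_to_sparse(pflops_dense: float) -> float:
    """Convert a dense PFLOPS figure to the sparse convention."""
    return pflops_dense * SPARSITY_FACTOR


# Figures taken from the article.
print("Rubin R200 FP4 dense:", sparse_to_dense(50.0), "PF")
print("MI400 FP4 in sparse terms:", dense_to_sparse(40.0), "PF")
```

Putting both vendors on the same basis before comparing avoids mixing a sparse headline number with a dense one.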

Vera CPU Deep Dive

The Vera CPU succeeds Grace and is NVIDIA's first server CPU built on custom-designed Arm cores (Grace used Arm's Neoverse cores):

Item	Specification
Architecture	ARM v9.2 Olympus
Core Count	88 cores (single package)
Process	TSMC 3NP
L2 Cache	1 MB per core
L3 Cache	Shared 264 MB
Memory	12-channel DDR5-8000
Memory Bandwidth	614 GB/s (equals 12 channels × 6,400 MT/s × 8 bytes; DDR5-8000 would give 768 GB/s)
CXL 2.0	Yes (Type-3 memory expansion)
PCIe	PCIe Gen 6 (128 lanes)
TDP	300-450 W
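The memory bandwidth row can be sanity-checked from the channel count and transfer rate. The sketch below assumes each channel is 64 bits (8 bytes) wide, which the article does not state; the function name is hypothetical.

```python
def ddr_peak_bandwidth_gbs(channels: int, mega_transfers_per_s: int,
                           bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s.

    Assumption: every channel moves `bytes_per_transfer` bytes per transfer
    (8 bytes = a 64-bit channel). Real sustained bandwidth is lower.
    """
    megabytes_per_s = channels * mega_transfers_per_s * bytes_per_transfer
    return megabytes_per_s / 1000.0


# 12 channels at the two transfer rates relevant to the table.
print("12ch @ 8000 MT/s:", ddr_peak_bandwidth_gbs(12, 8000), "GB/s")
print("12ch @ 6400 MT/s:", ddr_peak_bandwidth_gbs(12, 6400), "GB/s")
```

Under this assumption, 12 channels at 8,000 MT/s peak at 768 GB/s, while the listed 614 GB/s matches 12 channels at 6,400 MT/s.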

Why ARM? NVIDIA announced a bid to acquire Arm in 2020, and the deal was abandoned in 2022. NVIDIA still holds a long-term Arm architecture license and has access to the Neoverse core designs. Vera's "Olympus" cores are essentially NVIDIA's own Arm design.

Rubin GPU: 4x Blackwell Performance

The Rubin GPU targets up to 4× the performance of the Blackwell B300 Ultra (the gain varies by metric; peak FP4 throughput rises 2.5×) through:

2× SMs (Streaming Multiprocessors)
2.75× memory bandwidth (22 TB/s HBM4 vs 8 TB/s HBM3e)
FP4 Tensor Cores (2× FP8 throughput)
Transformer Engine 3 (9× DNN inference acceleration)
RAS Engine (Reliability, Availability, Serviceability)

Memory Subsystem: HBM4 288GB

The HBM4 standard, published by JEDEC in 2025, brings:

Per-stack capacity: 36 GB → 48 GB (33% increase)
Per-stack bandwidth: 1.2 TB/s → 1.6 TB/s (33% increase)
Stack height: 12-Hi → 16-Hi
Per-package capacity: 288 GB (6 stacks × 48 GB)
Per-package bandwidth: 22 TB/s (theoretical peak; this implies about 3.7 TB/s per stack, above the 1.6 TB/s baseline listed above)
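The per-package figures follow from the per-stack figures by multiplication, which is easy to check. The sketch below uses only numbers from the list above; the function names are illustrative.

```python
def package_totals(stacks: int, gb_per_stack: float, tbs_per_stack: float):
    """Return (capacity in GB, bandwidth in TB/s) for a package of HBM stacks."""
    return stacks * gb_per_stack, stacks * tbs_per_stack


def implied_per_stack_bandwidth(package_tbs: float, stacks: int) -> float:
    """Per-stack bandwidth needed to reach a quoted package bandwidth."""
    return package_tbs / stacks


capacity_gb, bandwidth_tbs = package_totals(stacks=6, gb_per_stack=48, tbs_per_stack=1.6)
print("Capacity:", capacity_gb, "GB")
print("Bandwidth at 1.6 TB/s per stack:", round(bandwidth_tbs, 2), "TB/s")
print("Per-stack rate needed for 22 TB/s:",
      round(implied_per_stack_bandwidth(22.0, 6), 2), "TB/s")
```

The capacity works out to 288 GB, but six stacks at 1.6 TB/s give only 9.6 TB/s, so the 22 TB/s package figure requires roughly 3.67 TB/s per stack.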

Compute Matrix: FP4 Everywhere

Vera Rubin builds on the native FP4 support introduced with Blackwell and supports the following precisions:

Precision	Throughput (per GPU, sparse)	Use Cases
FP4	50 PFLOPS	Inference, small-batch training
FP8	25 PFLOPS	LLM training, large model inference
FP16/BF16	12.5 PFLOPS	Traditional training, scientific computing
FP32	6.25 PFLOPS	HPC, traditional scientific computing
FP64	3.1 PFLOPS	Numerical simulation
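To make the FP4 row concrete, here is a minimal sketch of rounding values onto a 4-bit floating-point grid. It assumes the E2M1 layout (1 sign, 2 exponent, 1 mantissa bit) used by the OCP Microscaling formats; the article does not say which FP4 encoding Rubin uses. The arbitrary floating-point block scale is a simplification, and all names are hypothetical.

```python
# Non-negative magnitudes representable in E2M1 (assumed encoding).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def quantize_fp4(value: float, scale: float = 1.0) -> float:
    """Round `value` to the nearest representable FP4 value times `scale`.

    Values beyond the largest magnitude saturate at 6.0 * scale.
    Ties resolve to the smaller magnitude (a simplification).
    """
    scaled = abs(value) / scale
    nearest = min(E2M1_MAGNITUDES, key=lambda m: abs(m - scaled))
    return (-nearest if value < 0 else nearest) * scale


def quantize_block(values):
    """Quantize a block with one shared scale mapping the peak to 6.0."""
    peak = max(abs(v) for v in values)
    scale = peak / 6.0 if peak > 0 else 1.0
    return [quantize_fp4(v, scale) for v in values], scale


quantized, scale = quantize_block([12.0, -3.0, 1.0, 0.4])
print("scale:", scale)
print("quantized:", quantized)
```

With only eight magnitudes per sign, the shared scale does most of the work of preserving dynamic range, which is why FP4 is usually applied per small block of weights rather than per tensor.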

Networking: ConnectX-9 28.8 TB/s

The ConnectX-9 is NVIDIA's next-generation network adapter:

Bandwidth: 1.6 Tbps per port (2× ConnectX-8)
GPU ratio: 2× ConnectX-9 per GPU
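Aggregate GPU-to-GPU: 28.8 TB/s per Rubin GPU (far more than the 0.4 TB/s that two 1.6 Tbps ports provide, so this figure presumably includes the NVLink 6 fabric)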
Protocols: NVLink 6, PCIe Gen 6, RoCE v2, InfiniBand NDR
Encryption: AES-256-GCM hardware acceleration
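Network rates are quoted in terabits per second while memory and fabric rates are quoted in terabytes per second, so a unit conversion helps when comparing them. The sketch below assumes one 1.6 Tbps port per adapter and two adapters per GPU, based on the list above; the function names are illustrative.

```python
BITS_PER_BYTE = 8


def tbps_to_tb_per_s(terabits_per_s: float) -> float:
    """Convert terabits per second to terabytes per second."""
    return terabits_per_s / BITS_PER_BYTE


def nic_bandwidth_per_gpu(adapters: int, tbps_per_port: float,
                          ports_per_adapter: int = 1) -> float:
    """Aggregate NIC bandwidth per GPU in TB/s.

    Assumption: one port per adapter unless stated otherwise.
    """
    return adapters * ports_per_adapter * tbps_to_tb_per_s(tbps_per_port)


print("Per port:", tbps_to_tb_per_s(1.6), "TB/s")
print("Per GPU (2 adapters):", round(nic_bandwidth_per_gpu(2, 1.6), 3), "TB/s")
```

Under these assumptions the adapters contribute about 0.4 TB/s per GPU, a small fraction of the 28.8 TB/s aggregate figure.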

Rack-Scale Solutions

Vera Rubin supports two rack-scale configurations:

1. Rubin NVL72 (1 Rack)

GPU count: 72 (NVL72 1×72 topology)
CPU count: 36 Vera
Total HBM4: 20.7 TB
Total bandwidth: 1.6 PB/s
Peak FP4: 3.6 EFLOPS sparse (72 × 50 PFLOPS, single rack)
Power: ~130 kW

2. Rubin NVL576 (8 Racks)

GPU count: 576 (NVL576 8×72)
CPU count: 288 Vera
Total HBM4: ~166 TB
Total bandwidth: 12.7 PB/s
Peak FP4: 28.8 EFLOPS (full cluster)
Power: ~1.1 MW (single data hall)

At about 1.1 MW for a single data hall, a Vera Rubin NVL576 deployment calls for dedicated substation-level power, which poses new challenges for data center design.
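The rack-level totals are linear multiples of the per-GPU specifications. The sketch below recomputes them from the per-GPU values in the Core Specifications table, assuming ideal scaling with no interconnect or software overhead; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    """Per-GPU figures from the Core Specifications table."""
    hbm_gb: float = 288.0
    hbm_tbs: float = 22.0
    fp4_sparse_pflops: float = 50.0


def rack_totals(gpu_count: int, gpu: GpuSpec = GpuSpec()) -> dict:
    """Aggregate capacity, bandwidth, and peak FP4 (ideal linear scaling)."""
    return {
        "hbm_tb": gpu_count * gpu.hbm_gb / 1000.0,
        "bandwidth_pbs": gpu_count * gpu.hbm_tbs / 1000.0,
        "fp4_sparse_eflops": gpu_count * gpu.fp4_sparse_pflops / 1000.0,
    }


for name, count in (("NVL72", 72), ("NVL576", 576)):
    totals = rack_totals(count)
    print(name, {key: round(value, 3) for key, value in totals.items()})
```

Linear scaling gives 3.6 EFLOPS of sparse FP4 for 72 GPUs and 28.8 EFLOPS for 576, along with roughly 20.7 TB and 165.9 TB of HBM4.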

Software Ecosystem

The Vera Rubin platform's software stack:

CUDA 13.0: Full FP4 / FP8 / FP16 / BF16 / TF32 support
cuDNN 9.0: FP4 Tensor Core acceleration
TensorRT 11.0: LLM FP4 quantization deployment
Megatron-LM 0.12: Distributed training framework
TensorRT-LLM 2.0: LLM inference optimization
NeMo 2.0: LLM/Speech AI/Visual Agent framework

Competitive Analysis

Vendor	Product	Peak FP4 dense	Memory	Memory BW	Process
NVIDIA	Vera Rubin R200	25 PF (50 PF sparse)	288 GB HBM4	22 TB/s	3NP
AMD	MI400	40 PF dense	432 GB HBM4	~12 TB/s	3N + 5N
Google	TPU Ironwood v7	2.3 PF dense	192 GB HBM	7.4 TB/s	5N
AWS	Trainium 3	1.8 PF dense	144 GB HBM	2.7 TB/s	3N
Huawei	Ascend 920	0.45 PF dense	192 GB HBM2e	4 TB/s	6N SMIC
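One way to read the table is through the ratio of peak compute to memory bandwidth, the "ridge point" of a roofline model: a workload needs at least that many FLOPs per byte of memory traffic before it becomes compute-bound. The sketch below uses the dense figures from the table as given (treating AMD's "~12 TB/s" as 12.0); the names are illustrative.

```python
# (peak dense PFLOPS, memory bandwidth in TB/s), copied from the table above.
ACCELERATORS = {
    "NVIDIA Vera Rubin R200": (25.0, 22.0),
    "AMD MI400": (40.0, 12.0),
    "Google TPU Ironwood v7": (2.3, 7.4),
    "AWS Trainium 3": (1.8, 2.7),
    "Huawei Ascend 920": (0.45, 4.0),
}


def ridge_point(pflops: float, tb_per_s: float) -> float:
    """FLOPs per byte at which a chip shifts from memory-bound to compute-bound."""
    return (pflops * 1e15) / (tb_per_s * 1e12)


for name, (pflops, bandwidth) in ACCELERATORS.items():
    print(f"{name}: {ridge_point(pflops, bandwidth):.0f} FLOP/byte")
```

A lower ridge point means the chip is easier to keep busy on bandwidth-heavy workloads such as small-batch LLM decoding, so peak FLOPS alone does not settle the comparison.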

Apple Silicon M5 Ultra (predicted for 2026 H2) is estimated to remain about 6× behind the Rubin R200 (25 PF FP8 sparse) in single-chip performance. With 384 GB of unified memory (UMA) and 1 TB/s of bandwidth, however, it is a strong option for local LLM inference.

Conclusion

Vera Rubin is the culmination of 5 architectural innovations:

6-chip CoWoS-L packaging (industry first)
288GB HBM4 (1.5× B200's 192 GB)
22 TB/s memory bandwidth (2.75× HBM3e)
50 PFLOPS FP4 sparse (2.5× B300 Ultra)
ConnectX-9 networking: 1.6 Tbps per port (2× ConnectX-8), 28.8 TB/s aggregate GPU-to-GPU

For AI workloads, Vera Rubin is the next-generation foundation for trillion-parameter model training and Agentic AI inference.

💡 Disclaimer: All "predicted" and "rumored" data points are clearly marked and do not constitute investment advice.
