NVIDIA Vera Rubin Platform Deep Dive: 6-Chip Package, 288GB HBM4, 50 PFLOPS FP4
The NVIDIA Vera Rubin platform is NVIDIA's next-generation flagship computing platform after Blackwell. This article provides an in-depth analysis covering the naming origin, 6-chip packaging, memory subsystem, compute matrix, networking architecture, rack-scale solution, and software ecosystem.
Naming Origin: Honoring Astronomer Vera Rubin
NVIDIA chose "Vera Rubin" as the codename for its next-generation platform in honor of astronomer Vera Florence Cooper Rubin (1928-2016). Through her studies of galactic rotation curves in the 1960s and 1970s, she provided some of the most compelling early evidence for the existence of dark matter.
Naming the next-generation AI compute platform after her invites a parallel:
- Dark matter is the "invisible" yet dominant mass component of the universe
- AI compute is the "invisible" yet dominant underlying infrastructure of the digital economy
6-Chip Packaging (CoWoS-L)
The Vera Rubin platform adopts the industry's first 6-chip CoWoS-L packaging:
| Chip | Quantity | Role | Process |
|---|---|---|---|
| Vera CPU | 1 | Host CPU / Prefetch / Interconnect Controller | TSMC 3NP |
| Rubin GPU Die | 2 | Matrix Compute Cores | TSMC 3NP / 4NP |
| I/O / HBM Base Die | 3 | HBM4 PHY + I/O + Interconnect | TSMC 4NP |
Compared to the Blackwell B300 Ultra's 2-chip package (only 2 GPU dies), Vera Rubin features a 3× increase in chip count, with substantial transistor count growth.
Core Specifications (Per GPU)
| Item | Specification |
|---|---|
| Architecture | Rubin |
| Transistor Count | ~340 billion (per GPU) |
| Memory | 288 GB HBM4 |
| Memory Bandwidth | 22 TB/s (2.75× HBM3e) |
| FP4 Tensor (sparse) | 50 PFLOPS |
| FP8 Tensor (sparse) | 25 PFLOPS |
| FP16/BF16 Tensor | 12.5 PFLOPS |
| TDP | ~1,800 W (liquid cooling required) |
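Data Convention: Post-Blackwell NVIDIA products continue to quote sparse compute as the official metric, which is 2× the dense rate, so 50 PF FP4 sparse corresponds to 25 PF dense. For cross-vendor comparison, convert to a common basis first: AMD MI400's 40 PF FP4 dense would be 80 PF under NVIDIA's sparse convention, against 50 PF for the Rubin R200.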
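The conversion between the two conventions is a single factor of two. A minimal sketch, assuming the sparse figures reflect 2:4 structured sparsity (a 2× multiplier over dense); the function and variable names are illustrative only:

```python
# Assumption: "sparse" peak numbers are exactly 2x the dense rate
# (2:4 structured sparsity). Real workloads rarely reach this peak.
SPARSITY_FACTOR = 2.0


def sparse_to_dense(pflops_sparse: float) -> float:
    """Convert a sparse peak figure to its dense equivalent."""
    return pflops_sparse / SPARSITY_FACTOR


def dense_to_sparse(pflops_dense: float) -> float:
    """Convert a dense peak figure to the sparse convention."""
    return pflops_dense * SPARSITY_FACTOR


rubin_fp4_dense = sparse_to_dense(50.0)          # Rubin FP4, quoted sparse
mi400_fp4_sparse_equiv = dense_to_sparse(40.0)   # MI400 FP4, quoted dense

print(f"Rubin FP4 dense:        {rubin_fp4_dense:.1f} PF")
print(f"MI400 FP4 sparse-equiv: {mi400_fp4_sparse_equiv:.1f} PF")
```

Comparing on the dense basis (25 PF vs 40 PF) or on the sparse basis (50 PF vs 80 PF) gives the same ratio; mixing the two does not.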
Vera CPU Deep Dive
The Vera CPU succeeds Grace, which used ARM Neoverse cores, and is NVIDIA's first server CPU built on self-designed ARM cores:
| Item | Specification |
|---|---|
| Architecture | ARM v9.2 Olympus |
| Core Count | 88 cores (single package) |
| Process | TSMC 3NP |
| L2 Cache | 1 MB per core |
| L3 Cache | Shared 264 MB |
| Memory | 12-channel DDR5-8000 |
| Memory Bandwidth | 614 GB/s |
| CXL 2.0 | Yes (Type-3 memory expansion) |
| PCIe | PCIe Gen 6 (128 lanes) |
| TDP | 300-450 W |
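The memory bandwidth row can be sanity-checked from the channel count and transfer rate. The sketch below assumes a conventional 64-bit (8-byte) data path per channel, which is an assumption rather than a stated Vera detail. Under that assumption, 12 channels of DDR5-8000 peak at 768 GB/s, while the listed 614 GB/s matches 12 channels of DDR5-6400, so one of the two table rows may be off.

```python
def ddr_peak_gb_per_s(channels: int, mega_transfers_per_s: int,
                      bytes_per_transfer: int = 8) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    bytes_per_transfer=8 assumes a 64-bit data path per channel.
    """
    mb_per_s = channels * mega_transfers_per_s * bytes_per_transfer
    return mb_per_s / 1000.0


ddr5_8000 = ddr_peak_gb_per_s(12, 8000)
ddr5_6400 = ddr_peak_gb_per_s(12, 6400)

print(f"12 x DDR5-8000: {ddr5_8000:.1f} GB/s")
print(f"12 x DDR5-6400: {ddr5_6400:.1f} GB/s")
```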
Why ARM? NVIDIA announced a deal to acquire ARM in 2020, which was abandoned in 2022, but it retained a long-term ARM architecture license and access to the Neoverse series designs. Vera is essentially NVIDIA's own "Olympus" ARM core design.
Rubin GPU: 4x Blackwell Performance
The Rubin GPU is positioned at up to 4× Blackwell B300 Ultra performance (the headline FP4 figure alone is a 2.5× gain) through:
- 2× SMs (Streaming Multiprocessors)
- 2.75× memory bandwidth (HBM4 at 22 TB/s vs HBM3e at 8 TB/s)
- FP4 Tensor Cores (2× FP8 throughput: 50 vs 25 PFLOPS sparse)
- Transformer Engine 3 (9× DNN inference acceleration)
- RAS Engine (Reliability, Availability, Serviceability)
Memory Subsystem: HBM4 288GB
The HBM4 standard, published by JEDEC in 2025, brings:
- Per-stack capacity: 36 GB → 48 GB (33% increase)
- Per-stack bandwidth: 1.2 TB/s → 1.6 TB/s (33% increase)
- Stack height: 12-Hi → 16-Hi
- Per-package capacity: 288 GB (6 stacks × 48 GB)
- Per-package bandwidth: 22 TB/s (theoretical peak)
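The per-stack figures above can be rolled up to package level as a quick check. In this sketch the constants are taken from the list, and the variable names are illustrative. The capacity multiplies out to 288 GB, but six stacks at 1.6 TB/s sum to only 9.6 TB/s, so the 22 TB/s headline implies a considerably higher per-stack rate than the one listed.

```python
STACKS = 6
GB_PER_STACK = 48
TB_S_PER_STACK = 1.6      # per-stack bandwidth as listed
HEADLINE_TB_S = 22.0      # per-package bandwidth as listed

capacity_gb = STACKS * GB_PER_STACK
summed_bw_tb_s = STACKS * TB_S_PER_STACK
implied_per_stack_tb_s = HEADLINE_TB_S / STACKS

print(f"Package capacity:           {capacity_gb} GB")
print(f"Sum of per-stack bandwidth: {summed_bw_tb_s:.1f} TB/s")
print(f"Per-stack rate implied by {HEADLINE_TB_S:.0f} TB/s: "
      f"{implied_per_stack_tb_s:.2f} TB/s")
```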
Compute Matrix: FP4 Everywhere
Vera Rubin continues the native FP4 support that NVIDIA introduced with Blackwell; quoted throughput roughly halves with each step up in precision:
| Precision | Throughput (per GPU, sparse) | Use Cases |
|---|---|---|
| FP4 | 50 PFLOPS | Inference, small-batch training |
| FP8 | 25 PFLOPS | LLM training, large model inference |
| FP16/BF16 | 12.5 PFLOPS | Traditional training, scientific computing |
| FP32 | 6.25 PFLOPS | HPC, traditional scientific computing |
| FP64 | 3.1 PFLOPS | Numerical simulation |
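Precision also determines how large a model fits in the 288 GB of HBM4. The following is a rough, weights-only estimate: it assumes 1 GB = 10^9 bytes and ignores KV cache, activations, and the scale-factor overhead that real FP4/FP8 quantization formats carry, so actual limits are lower.

```python
HBM_BYTES = 288 * 10**9  # per-GPU HBM4 capacity from the spec table

BITS_PER_PARAM = {"FP4": 4, "FP8": 8, "FP16/BF16": 16, "FP32": 32}


def max_params_billions(precision: str) -> float:
    """Upper bound on parameter count (in billions), weights only."""
    bytes_per_param = BITS_PER_PARAM[precision] / 8
    return HBM_BYTES / bytes_per_param / 1e9


for name in BITS_PER_PARAM:
    print(f"{name:>10}: up to ~{max_params_billions(name):.0f}B parameters")
```

Halving the precision doubles the parameter budget, which is why FP4 is listed mainly for inference.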
Networking: ConnectX-9 28.8 TB/s
The ConnectX-9 is NVIDIA's next-generation network adapter:
- Bandwidth: 1.6 Tbps per port (2× ConnectX-8)
- GPU ratio: 2× ConnectX-9 per GPU
- Aggregate GPU-to-GPU: 28.8 TB/s across an NVL72 rack (72 GPUs × 2 NICs × 1.6 Tbps); per Rubin GPU, the two NICs supply 3.2 Tbps, i.e. 0.4 TB/s
- Protocols: NVLink 6, PCIe Gen 6, RoCE v2, InfiniBand NDR
- Encryption: AES-256-GCM hardware acceleration
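Network figures are quoted in terabits per second while memory figures use terabytes per second, so a unit conversion helps when comparing them. The sketch below assumes one 1.6 Tbps port per NIC and takes the 72-GPU rack size from the Rack-Scale section. Two NICs give each GPU 0.4 TB/s, and the 28.8 TB/s figure is reached when that is summed over a 72-GPU rack.

```python
BITS_PER_BYTE = 8
PORT_TBPS = 1.6        # per-port bandwidth as listed
NICS_PER_GPU = 2       # as listed; one port per NIC is an assumption
GPUS_PER_RACK = 72     # NVL72


def tbps_to_tb_per_s(tbps: float) -> float:
    """Convert terabits per second to terabytes per second."""
    return tbps / BITS_PER_BYTE


per_gpu_tb_s = tbps_to_tb_per_s(PORT_TBPS * NICS_PER_GPU)
per_rack_tb_s = per_gpu_tb_s * GPUS_PER_RACK

print(f"Per GPU:  {per_gpu_tb_s:.1f} TB/s")
print(f"Per rack: {per_rack_tb_s:.1f} TB/s")
```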
Rack-Scale Solutions
Vera Rubin supports two rack-scale configurations:
1. Rubin NVL72 (1 Rack)
- GPU count: 72 (NVL72 1×72 topology)
- CPU count: 36 Vera
- Total HBM4: 20.7 TB
- Total bandwidth: 1.6 PB/s
- Peak FP4: 3.6 EFLOPS sparse (72 × 50 PFLOPS, single rack)
- Power: ~130 kW
2. Rubin NVL576 (8 Racks)
- GPU count: 576 (NVL576 8×72)
- CPU count: 288 Vera
- Total HBM4: 165 TB
- Total bandwidth: 12.7 PB/s
- Peak FP4: 28.8 EFLOPS (full cluster)
- Power: ~1.1 MW (single data hall)
At roughly 1.1 MW for a single data hall, Vera Rubin NVL576 calls for substation-level dedicated power delivery, which poses new challenges for data center design.
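The rack-level totals follow directly from the per-GPU specifications. A small sketch, using the per-GPU values from the Core Specifications table and decimal unit prefixes (the class and function names are illustrative): 72 GPUs come to about 20.7 TB of HBM4, 1.58 PB/s of memory bandwidth, and 3.6 EFLOPS of sparse FP4, and 576 GPUs come to about 165.9 TB, 12.7 PB/s, and 28.8 EFLOPS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GpuSpec:
    hbm_gb: float = 288.0
    mem_bw_tb_s: float = 22.0
    fp4_sparse_pflops: float = 50.0


def rack_totals(gpu_count: int, spec: GpuSpec = GpuSpec()) -> dict:
    """Scale per-GPU figures to a whole configuration (decimal units)."""
    return {
        "hbm_tb": gpu_count * spec.hbm_gb / 1000,
        "mem_bw_pb_s": gpu_count * spec.mem_bw_tb_s / 1000,
        "fp4_sparse_eflops": gpu_count * spec.fp4_sparse_pflops / 1000,
    }


nvl72 = rack_totals(72)
nvl576 = rack_totals(576)

for label, totals in (("NVL72", nvl72), ("NVL576", nvl576)):
    print(label, {k: round(v, 2) for k, v in totals.items()})
```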
Software Ecosystem
The Vera Rubin platform's software stack:
- CUDA 13.0: Full FP4 / FP8 / FP16 / BF16 / TF32 support
- cuDNN 9.0: FP4 Tensor Core acceleration
- TensorRT 11.0: LLM FP4 quantization deployment
- Megatron-LM 0.12: Distributed training framework
- TensorRT-LLM 2.0: LLM inference optimization
- NeMo 2.0: LLM/Speech AI/Visual Agent framework
Competitive Analysis
| Vendor | Product | Peak FP4 dense | Memory | Memory BW | Process |
|---|---|---|---|---|---|
| NVIDIA | Vera Rubin R200 | 25 PF (50 PF sparse) | 288 GB HBM4 | 22 TB/s | 3NP |
| AMD | MI400 | 40 PF dense | 432 GB HBM4 | ~12 TB/s | 3N + 5N |
| Google | TPU Ironwood v7 | 2.3 PF dense | 192 GB HBM | 7.4 TB/s | 5N |
| AWS | Trainium 3 | 1.8 PF dense | 144 GB HBM | 2.7 TB/s | 3N |
| Huawei | Ascend 920 | 0.45 PF dense | 192 GB HBM2e | 4 TB/s | 6N SMIC |
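One way to read the table is the ratio of peak compute to memory bandwidth, the "ridge point" of a simple roofline model: a kernel needs at least that many FLOPs per byte moved to be compute-bound rather than bandwidth-bound. The sketch below takes the dense-peak and bandwidth columns at face value (the MI400 bandwidth is approximate in the table) and is an illustration, not a benchmark.

```python
# (peak dense PFLOPS, memory bandwidth in TB/s), as listed in the table
ACCELERATORS = {
    "NVIDIA Vera Rubin R200": (25.0, 22.0),
    "AMD MI400": (40.0, 12.0),
    "Google TPU Ironwood v7": (2.3, 7.4),
    "AWS Trainium 3": (1.8, 2.7),
    "Huawei Ascend 920": (0.45, 4.0),
}


def ridge_point(peak_pflops: float, mem_bw_tb_s: float) -> float:
    """FLOPs per byte at which compute and bandwidth limits meet."""
    return (peak_pflops * 1e15) / (mem_bw_tb_s * 1e12)


for name, (pflops, bw) in ACCELERATORS.items():
    print(f"{name:<24} {ridge_point(pflops, bw):8.0f} FLOP/byte")
```

A lower ridge point means the chip is easier to keep busy on bandwidth-heavy workloads such as small-batch LLM decoding.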
Apple Silicon M5 Ultra (predicted for 2026 H2) is estimated at roughly 1/6.25 of the Rubin R200's 25 PF FP8 sparse throughput, so Apple remains about 6× behind in single-chip performance. With a predicted 384 GB of unified memory (UMA) and 1 TB/s of bandwidth, however, it is a strong option for local LLM inference.
Conclusion
Vera Rubin is the culmination of 5 architectural innovations:
- 6-chip CoWoS-L packaging (industry first)
- 288GB HBM4 (1.5× B200's 192 GB)
- 22 TB/s memory bandwidth (2.75× HBM3e)
- 50 PFLOPS FP4 sparse (2.5× B300 Ultra)
- ConnectX-9 28.8 TB/s (2× ConnectX-8)
For AI workloads, Vera Rubin is the next-generation foundation for trillion-parameter model training and Agentic AI inference.
💡 Disclaimer: All "predicted" and "rumored" data points are clearly marked and do not constitute investment advice.