Huawei Ascend 950PR

Product Overview

Huawei Ascend 950PR is Huawei's flagship AI inference accelerator, officially released on March 20, 2026, with shipments beginning in Q1 2026. It uses a monolithic (non-MCM) design with 112 GB of Huawei's self-developed HiBL 1.0 HBM and delivers 1.56 PFLOPS of FP4 compute. It is Huawei's first product to significantly surpass NVIDIA's export-compliant H20 in inference performance: its FP4 throughput is about 2.8× that of the H20.

Strategic Position: The 950PR is the first product in Huawei's four-generation Ascend roadmap (2026-2028: 950PR → 950DT → 960 → 970). It is paired with the CANN Next software stack, which includes a CUDA compatibility layer, and marks Huawei's critical transition from "usable" to "easy to use".

Core Specifications

| Item | Parameter |
| --- | --- |
| Architecture | Ascend 950 (monolithic design) |
| Process | SMIC N+3 (equivalent to 5nm-class) |
| Packaging | Monolithic die, non-MCM |
| HBM | 112 GB HiBL 1.0 (Atlas 350 accelerator) / 128 GB (bare chip) |
| HBM Bandwidth | 1.4 TB/s (Atlas 350) / 1.6 TB/s (bare chip) |
| FP4 | 1.56 PFLOPS |
| FP8 / MXFP8 | 1 PFLOPS |
| FP16 | Not publicly disclosed (estimated ~780 TFLOPS) |
| Interconnect Bandwidth | 2 TB/s (self-developed LingQu protocol) |
| Memory Access Granularity | 128 bytes (vs. 512 bytes in the previous generation; small-tensor efficiency 4×) |
| TDP | 600 W (Atlas 350 accelerator) |
| Board Form Factor | Atlas 350 accelerator (PCIe / OAM) |
| Mass Production | Q1 2026 (March 20-21 official release) |
| Unit Price (Atlas 350) | ~¥111,000 (≈$16,000) |

⚠️ Monolithic Design Trade-off: A single die avoids the restrictions on advanced packaging (TSMC CoWoS), but it caps the achievable die size and yields worse than MCM solutions. This is an engineering compromise forced by US export controls, not a technical preference.
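
The 4× small-tensor figure in the Memory Access Granularity row can be illustrated with a simple model in which every access is rounded up to a whole number of granules. The sketch below assumes that model; the function names and the example tensor sizes are hypothetical and do not come from Huawei documentation. Under this model, 4× is the best case, reached only when the tensor fits in a single 128-byte granule.

```python
def bytes_transferred(tensor_bytes: int, granule_bytes: int) -> int:
    """Bytes moved when an access is rounded up to whole granules (assumed model)."""
    granules = (tensor_bytes + granule_bytes - 1) // granule_bytes  # ceiling division
    return granules * granule_bytes


def efficiency(tensor_bytes: int, granule_bytes: int) -> float:
    """Fraction of the transferred bytes that the tensor actually needed."""
    return tensor_bytes / bytes_transferred(tensor_bytes, granule_bytes)


# Hypothetical tensor sizes, chosen to show where the gain appears and where it vanishes.
for size in (64, 128, 200, 512, 4096):
    old = efficiency(size, 512)  # previous-generation granularity
    new = efficiency(size, 128)  # Ascend 950PR granularity
    print(f"{size:>5} B: 512 B granule {old:.2f}, 128 B granule {new:.2f}, gain {new / old:.2f}x")
```

For tensors that are already a multiple of 512 bytes the two granularities move the same number of bytes, so the benefit is concentrated in small or oddly sized accesses.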

Comparison with Ascend 910C

| Metric | Ascend 910C | Ascend 950PR | Improvement |
| --- | --- | --- | --- |
| Process | SMIC N+2 (equiv. 7nm) | SMIC N+3 (equiv. 5nm) | Significant upgrade |
| HBM | ~64 GB HBM2E | 112 GB HiBL 1.0 | +75% |
| Bandwidth | ~2 TB/s (estimated) | 1.4 TB/s | About -30% against the 910C estimate (no improvement) |
| FP4 | Not supported | 1.56 PF | New |
| FP8 | Not supported | 1 PF | New |
| Interconnect | ~800 GB/s | 2 TB/s | 2.5× |
| Memory Granularity | 512 bytes | 128 bytes | 4× small-tensor efficiency |
| TDP | ~400 W | 600 W | +50% |
| Software | CANN (poor compatibility) | CANN Next (CUDA compatible) | Significant improvement |

Comparison with Competitors (2026 Inference Cards)

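| Metric | Ascend 950PR | NVIDIA H20 (compliance) | NVIDIA L40S | Gap |
| --- | --- | --- | --- | --- |
| FP4 | 1.56 PF | ~0.56 PF | 0.16 PF | +179% vs H20 |
| HBM | 112 GB | 96 GB | 48 GB | +17% vs H20 |
| Bandwidth | 1.4 TB/s | 4.0 TB/s | 0.86 TB/s | -65% vs H20 |
| TDP | 600 W | 400 W | 350 W | +50% vs H20 |
| Software | CANN Next | CUDA | CUDA | Ecosystem disadvantage |
| Price | ~$16,000 | ~$20,000+ | ~$10,000 | Price advantage |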
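
The Gap column and two derived metrics (FP4 throughput per watt and price per PFLOPS) can be recomputed from the table's figures. This is a minimal sketch: the numbers are copied from the table, including its approximations for the NVIDIA cards, and the dictionary keys and function names are illustrative choices.

```python
# Figures copied from the comparison table above. The H20 price is listed as
# "~$20,000+", so 20_000 is a lower bound; all NVIDIA values are the page's estimates.
CARDS = {
    "Ascend 950PR": {"fp4_pflops": 1.56, "hbm_gb": 112, "bw_tbs": 1.4, "tdp_w": 600, "price_usd": 16_000},
    "NVIDIA H20": {"fp4_pflops": 0.56, "hbm_gb": 96, "bw_tbs": 4.0, "tdp_w": 400, "price_usd": 20_000},
    "NVIDIA L40S": {"fp4_pflops": 0.16, "hbm_gb": 48, "bw_tbs": 0.86, "tdp_w": 350, "price_usd": 10_000},
}


def pct_gap(value: float, baseline: float) -> float:
    """Relative difference of value against baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


def derived(card: dict) -> dict:
    """Efficiency metrics that the table does not list directly."""
    return {
        "fp4_tflops_per_watt": card["fp4_pflops"] * 1000.0 / card["tdp_w"],
        "usd_per_fp4_pflops": card["price_usd"] / card["fp4_pflops"],
    }


ascend, h20 = CARDS["Ascend 950PR"], CARDS["NVIDIA H20"]
for key in ("fp4_pflops", "hbm_gb", "bw_tbs", "tdp_w"):
    print(f"{key:>10}: {pct_gap(ascend[key], h20[key]):+.1f}% vs H20")

for name, card in CARDS.items():
    d = derived(card)
    print(f"{name}: {d['fp4_tflops_per_watt']:.2f} FP4 TFLOPS/W, ${d['usd_per_fp4_pflops']:,.0f} per FP4 PFLOPS")
```

On these figures the 950PR leads on both derived metrics, but peak FP4 numbers say nothing about the bandwidth-bound decode stage.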

Bandwidth Disadvantage Explanation: The 950PR's 1.4 TB/s is far below the H20's 4.0 TB/s. In the prefill stage of inference the bottleneck is compute (FP4), so the bandwidth deficit has limited impact. The decode (token generation) stage is bandwidth-bound, and that is where the 950PR is weakest.
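
A roofline-style estimate makes the prefill/decode argument concrete: attainable throughput is the smaller of peak compute and bandwidth multiplied by arithmetic intensity (FLOP per byte moved). The sketch below uses the peak figures quoted on this page. The intensity model is a deliberate simplification and an assumption of this example: it counts 2 FLOP per weight per token in flight, assumes 4-bit weights (0.5 bytes), and ignores KV-cache and activation traffic.

```python
PEAK_FLOPS = {"Ascend 950PR": 1.56e15, "NVIDIA H20": 0.56e15}  # FP4, from this page
BANDWIDTH = {"Ascend 950PR": 1.4e12, "NVIDIA H20": 4.0e12}     # bytes/s, from this page


def intensity(tokens_in_flight: int, bytes_per_weight: float = 0.5) -> float:
    """Assumed model: each weight is read once and used for 2 FLOP per token."""
    return 2.0 * tokens_in_flight / bytes_per_weight


def attainable_flops(chip: str, flop_per_byte: float) -> float:
    """Roofline: limited by either peak compute or memory bandwidth."""
    return min(PEAK_FLOPS[chip], BANDWIDTH[chip] * flop_per_byte)


def bound(chip: str, flop_per_byte: float) -> str:
    """Which side of the roofline ridge point the workload falls on."""
    ridge = PEAK_FLOPS[chip] / BANDWIDTH[chip]
    return "compute" if flop_per_byte >= ridge else "bandwidth"


# Hypothetical workloads: decode at batch size 1, prefill over a 4096-token prompt.
for label, tokens in (("decode (1 token)", 1), ("prefill (4096 tokens)", 4096)):
    ai = intensity(tokens)
    for chip in PEAK_FLOPS:
        tflops = attainable_flops(chip, ai) / 1e12
        print(f"{label:>22} on {chip}: {tflops:8.1f} TFLOPS, {bound(chip, ai)}-bound")
```

Under these assumptions the 950PR is compute-bound in prefill and keeps its FP4 advantage there, while in single-token decode both chips are bandwidth-bound and the H20's higher bandwidth wins. Larger decode batches raise the intensity and narrow that gap.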

LingQu Interconnect Protocol

| Item | Parameter |
| --- | --- |
| Protocol Name | LingQu |
| Bandwidth | 2 TB/s (inter-chip) |
| vs. Previous Gen | Ascend 910 series: ~800 GB/s (2.5×) |
| Cluster Expansion | Supports UnifiedBus 2.0, expandable to 1 million NPUs |
| vs. NVIDIA | NVLink 5: 1.8 TB/s (single GPU); NVLink 6: 3.5 TB/s (Rubin) |
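
To see what the 2.5× interconnect gain could mean for collective operations, the bandwidth term of a ring all-reduce is a useful first estimate: each rank sends and receives 2(N-1)/N times the message size. The sketch below assumes the quoted figures are usable per-chip bandwidth and ignores latency, topology, and protocol overhead; the message size and rank count are hypothetical.

```python
def ring_allreduce_seconds(message_bytes: float, num_ranks: int, link_bytes_per_s: float) -> float:
    """Bandwidth-only estimate of ring all-reduce time (no latency term)."""
    if num_ranks < 2:
        return 0.0  # nothing to exchange with a single rank
    traffic_per_rank = 2.0 * (num_ranks - 1) / num_ranks * message_bytes
    return traffic_per_rank / link_bytes_per_s


MESSAGE_BYTES = 1e9  # hypothetical 1 GB buffer
RANKS = 8            # hypothetical group size

lingqu = ring_allreduce_seconds(MESSAGE_BYTES, RANKS, 2.0e12)    # 2 TB/s (950PR)
previous = ring_allreduce_seconds(MESSAGE_BYTES, RANKS, 0.8e12)  # ~800 GB/s (910 series)

print(f"LingQu 2 TB/s:  {lingqu * 1e3:.3f} ms")
print(f"~800 GB/s link: {previous * 1e3:.3f} ms")
print(f"speedup:        {previous / lingqu:.2f}x")
```

Because only the bandwidth term is modeled, the speedup equals the bandwidth ratio. For small messages the latency term dominates and the real gain would be smaller.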

Atlas 950 SuperCluster Rack Solution

| Item | Parameter |
| --- | --- |
| Max Scale | 1 million NPUs |
| vs. NVIDIA NVL144 | Total compute 6.7×, memory capacity 15×, interconnect bandwidth 62× (claimed by Huawei, not third-party verified) |
| Switch | Huawei self-developed Qingtian series |
| Target Scenario | National-level AI training clusters, large model inference cloud services |

CANN Next Software Stack

| Layer | Tool | Description |
| --- | --- | --- |
| CUDA Compatibility Layer | CANN Next Runtime | About 80% of standard PyTorch inference code needs only configuration changes, with no major rewrites |
| Graph Compiler | CANN Graph | Similar to XLA, automatic operator fusion |
| Quantization Tool | CANN Quant | FP8 / MXFP8 post-training quantization |
| Communication Library | HCCL | Collective communication (AllReduce, etc.), similar to NCCL |
| Model Library | ModelZoo | Pre-optimized LLMs (Qwen, ChatGLM, etc.) |

Customers and Orders (2026)

| Customer | Order Amount | Description |
| --- | --- | --- |
| ByteDance | $5.6B (confirmed) | Largest single order in 2026 |
| Alibaba Cloud | Large amount (undisclosed) | For Tongyi Qianwen inference |
| Tencent | Large amount (undisclosed) | For Hunyuan large model |
| Baidu | Large amount (undisclosed) | For ERNIE 4.0 |

Estimated 2026 total sales: $12B, which would be the first time Huawei matches NVIDIA's revenue scale in the China AI chip market.

Ascend Roadmap (2026-2028)

| Product | Launch Date | Positioning |
| --- | --- | --- |
| 950PR | Q1 2026 | Flagship inference (this page) |
| 950DT | Q4 2026 | Decode + training scenario |
| 960 | Q4 2027 | Target to match Blackwell architecture |
| 970 | Q4 2028 | Target to match Rubin architecture |

Key Features

  • Monolithic design: Avoids US advanced packaging controls, but sacrifices maximum bare die size
  • FP4 leadership: 1.56 PFLOPS, surpassing NVIDIA H20 (compliance version) by 2.8×
  • LingQu 2 TB/s interconnect: Domestic protocol, supports ultra-large clusters
  • CANN Next CUDA compatibility: Reduces migration cost, key to ecosystem improvement
  • 600W TDP: High power consumption, requires liquid cooling or enhanced air cooling
  • Weakness: HBM bandwidth only 1.4 TB/s (vs H20 4.0 TB/s), generation stage performance limited

Suitable Scenarios

  • Ultra-long context inference (Pre-fill stage, FP4 compute bottleneck scenario)
  • Domestic AI cloud services (Alibaba Cloud, Tencent Cloud, Baidu Cloud)
  • Government/SOE AI projects (supply chain security priority)
  • Large model inference as a service (MaaS)
  • ❌ Large-scale training (bandwidth disadvantage)
  • ❌ International market (export controls + ecosystem disadvantage)
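
For the ultra-long context scenario listed above, HBM capacity sets a ceiling on context length because the weights and the KV cache must both fit on the card. A common back-of-the-envelope estimate is that the KV cache needs 2 (keys and values) × layers × KV heads × head dimension × bytes per element for every token. The model shape below is hypothetical and chosen only for illustration; the sketch also ignores activations and runtime overhead, so real limits would be lower.

```python
HBM_BYTES = 112 * 10**9  # Atlas 350 capacity quoted on this page


def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Memory taken by the model weights at a given precision."""
    return num_params * bits_per_weight // 8


def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int, bytes_per_elem: int) -> int:
    """Keys plus values stored for one token across all layers."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


def max_cached_tokens(hbm_bytes: int, weights: int, per_token: int) -> int:
    """Tokens of KV cache that fit after the weights are loaded."""
    return max(0, (hbm_bytes - weights) // per_token)


# Hypothetical model: 70B parameters, 80 layers, 8 KV heads of dimension 128,
# FP4 weights (4 bits) and an FP8 KV cache (1 byte per element).
weights = weight_bytes(70_000_000_000, bits_per_weight=4)
per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, bytes_per_elem=1)
tokens = max_cached_tokens(HBM_BYTES, weights, per_token)

print(f"weights:            {weights / 1e9:.1f} GB")
print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"max cached tokens:  {tokens:,}")
```

The token budget is shared by all concurrent requests, so a single very long context and a large batch of short ones compete for the same memory.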
