Huawei Ascend 950DT

Product Overview

Huawei Ascend 950DT is the high-bandwidth version of the fourth-generation Ascend AI chip, scheduled to launch on Huawei Cloud in August 2026. It uses the same Da Vinci v5 compute cores as the cost-optimized 950PR, but pairs them with the self-developed HiZQ 2.0 HBM memory system (144 GB capacity, up to 4 TB/s bandwidth). It is designed for the Decode stage of inference (token-by-token generation) and for model training.

950PR vs 950DT dual-version strategy: Huawei adopts a "scenario segmentation" design for the Ascend 950 series, pairing the same compute cores with different memory subsystems to match the differing demands of AI workloads. The 950PR targets Prefill (processing the prompt to produce the first token), while the 950DT targets Decode and training.
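The split between Prefill and Decode can be illustrated with a rough roofline estimate. The sketch below is a simplified model, not Huawei's methodology: it assumes FP8 weights, assumes each weight is read once per forward pass and reused for every token in that pass, and takes the 1 PFLOPS and 4 TB/s figures from the specification table. The function names and the example token counts are hypothetical.

```python
# Rough arithmetic-intensity model for a dense layer.
# Assumption: each weight is read from HBM once per forward pass and
# reused for every token processed in that pass.

PEAK_FP8_FLOPS = 1e15    # 1 PFLOPS FP8 (spec table)
HBM_BANDWIDTH = 4e12     # 4 TB/s in bytes per second (spec table)
BYTES_PER_WEIGHT = 1.0   # FP8 weights (assumed)


def arithmetic_intensity(tokens_per_pass: int,
                         bytes_per_weight: float = BYTES_PER_WEIGHT) -> float:
    """FLOPs per byte of weights read: 2 FLOPs (multiply + add) per weight per token."""
    return 2.0 * tokens_per_pass / bytes_per_weight


def ridge_point(peak_flops: float = PEAK_FP8_FLOPS,
                bandwidth: float = HBM_BANDWIDTH) -> float:
    """Intensity above which the chip is limited by compute rather than memory."""
    return peak_flops / bandwidth


def bound(tokens_per_pass: int) -> str:
    """Classify a pass as compute-bound or memory-bound under this model."""
    if arithmetic_intensity(tokens_per_pass) >= ridge_point():
        return "compute"
    return "memory"


if __name__ == "__main__":
    cases = [("prefill, 2048-token prompt", 2048), ("decode, batch of 8", 8)]
    for name, tokens in cases:
        print(f"{name}: {arithmetic_intensity(tokens):.0f} FLOP/byte, {bound(tokens)}-bound")
```

Under these assumptions a long prompt sits far above the ridge point while a small decode batch sits far below it, which is the motivation for giving the two stages different memory subsystems.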

Core Specifications

Item | Parameter
Architecture | Da Vinci v5 (4th-gen Ascend)
Process | SMIC N+2 (equivalent to improved 7nm)
Programming Model | SIMD + SIMT dual model
HBM Type | HiZQ 2.0 (self-developed, bandwidth-first)
HBM Capacity | 144 GB
HBM Bandwidth | 4 TB/s
Interconnect Bandwidth | 2 TB/s (HCCS protocol)
FP8 Compute | 1 PFLOPS (HiF8 format)
FP4 Compute | 2 PFLOPS (MXFP4 format)
BF16/FP16 Compute | ~500 TFLOPS
INT8 Compute | ~2,000 TOPS
TDP | ~500 W
PCIe | Gen 5 ×16
Launch Date | August 2026 (Huawei Cloud launch)
Price | ~¥120K-150K per card (estimated)

950DT vs 950PR Detailed Comparison

Dimension | 950PR | 950DT
Target Scenario | Inference Prefill (first token), video recommendation, real-time interaction | Inference Decode (token-by-token), model training, high-concurrency inference
HBM Type | HiBL 1.0 (cost-first) | HiZQ 2.0 (bandwidth-first)
HBM Capacity | 128 GB | 144 GB
HBM Bandwidth | ~3 TB/s | 4 TB/s
Interconnect Bandwidth | HCCS 784 GB/s | HCCS 2 TB/s
Supported Precision | FP8, HiF8 | FP8, MXFP8, MXFP4, HiF8
Typical Application | Video recommendation, search | Dialogue generation, text continuation, SFT fine-tuning
Pricing | Lower (~¥70K per card) | Higher (~¥120K-150K per card)

Key Technical Breakthroughs

1. HiZQ 2.0 Self-Developed HBM

  • Bandwidth up to 4 TB/s, above the 3.35 TB/s of the NVIDIA H100 (HBM3) and below the 4.8 TB/s of the NVIDIA H200 (HBM3e)
  • 144 GB capacity, supporting larger batch sizes and longer context windows
  • Removes the dependency on SK Hynix and Samsung HBM, securing supply chain autonomy

2. Decode Stage Specific Optimization

  • High-bandwidth memory subsystem: the Decode stage is bottlenecked by memory bandwidth rather than compute, so the 4 TB/s bandwidth directly raises long-context inference throughput
  • MXFP4/MXFP8 support: low-precision formats reduce the volume of data moved from memory, further improving Decode efficiency
  • Cooperation with 950PR: the Prefill stage is handled by the 950PR and the Decode stage by the 950DT, forming a "heterogeneous inference pipeline"
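The bandwidth argument above can be turned into a back-of-the-envelope ceiling on decode speed. This is a minimal sketch under stated assumptions: a single request (batch size 1), every weight read from HBM once per generated token, and KV-cache and activation traffic ignored. The 70-billion-parameter model is hypothetical, and the MX bit widths assume the OCP Microscaling layout (one 8-bit shared scale per block of 32 elements); whether the 950DT implements the formats exactly this way is not stated in the source.

```python
HBM_BANDWIDTH = 4e12  # 4 TB/s in bytes per second (spec table)

# Effective bits per weight. The MX entries add one 8-bit shared scale
# per 32-element block (assumed OCP Microscaling layout).
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "MXFP8": 8.0 + 8.0 / 32,
    "MXFP4": 4.0 + 8.0 / 32,
}


def decode_tokens_per_second(num_params: float, fmt: str,
                             bandwidth: float = HBM_BANDWIDTH) -> float:
    """Bandwidth-bound ceiling on tokens/s for one request at batch size 1."""
    bytes_per_token = num_params * BITS_PER_WEIGHT[fmt] / 8.0
    return bandwidth / bytes_per_token


if __name__ == "__main__":
    params = 70e9  # hypothetical dense model size
    for fmt in BITS_PER_WEIGHT:
        print(f"{fmt}: at most {decode_tokens_per_second(params, fmt):.1f} tokens/s")
```

Real throughput will be lower than these ceilings, but the ratio between formats shows why halving the bytes per weight matters more for Decode than adding compute.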

3. SIMD + SIMT Dual Programming Model

  • SIMD: efficient vector compute (continuing the Da Vinci core advantage from the 910C)
  • SIMT: new model that supports flexible scheduling and is better suited to the Decode stage's irregular memory access patterns
  • Memory access granularity reduced from 512 bytes to 128 bytes, improving the efficiency of discrete (scattered) memory accesses
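The effect of a smaller access granularity can be sketched with simple rounding arithmetic. The model below is an illustration, not a measurement: it assumes each scattered access is isolated, starts on a granule boundary, and always moves whole granules. The 64-byte element size is a hypothetical example.

```python
import math


def fetched_bytes(useful_bytes: int, granule: int) -> int:
    """Bytes moved for one isolated, aligned access, rounded up to whole granules."""
    return math.ceil(useful_bytes / granule) * granule


def efficiency(useful_bytes: int, granule: int) -> float:
    """Fraction of the moved bytes that the kernel actually needed."""
    return useful_bytes / fetched_bytes(useful_bytes, granule)


if __name__ == "__main__":
    element = 64  # hypothetical scattered element size in bytes
    for granule in (512, 128):
        print(f"{granule}-byte granule: {efficiency(element, granule):.1%} useful")
```

In this model a 64-byte gather uses one eighth of a 512-byte fetch but half of a 128-byte fetch, so the gain depends heavily on how small and how scattered the accesses are.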

4. CloudMatrix 384 System Integration

  • 384 950DT chips can form a super node (requires mixed deployment with 950PR)
  • Total compute: 384 × 1 PFLOPS FP8 ≈ 384 PFLOPS
  • Total memory: 384 × 144 GB = 55,296 GB (about 55.3 TB, or 54 TB when counted at 1,024 GB per TB)
  • AI cluster performance comparable to NVIDIA GB300 NVL72
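The super node totals follow directly from the per-chip figures. The sketch below reproduces that arithmetic for an idealized node made up entirely of 950DT chips; it ignores the mixed deployment with 950PR and any interconnect overhead, and the aggregate bandwidth is a simple sum that does not appear in the source.

```python
CHIPS = 384
FP8_PFLOPS_PER_CHIP = 1   # spec table
HBM_GB_PER_CHIP = 144     # spec table
HBM_TBPS_PER_CHIP = 4     # spec table


def supernode_totals(chips: int = CHIPS) -> dict:
    """Peak aggregate figures for an idealized all-950DT super node."""
    total_gb = chips * HBM_GB_PER_CHIP
    return {
        "fp8_pflops": chips * FP8_PFLOPS_PER_CHIP,
        "hbm_gb": total_gb,
        "hbm_tb_decimal": total_gb / 1000,   # 1 TB = 1,000 GB
        "hbm_tb_binary": total_gb / 1024,    # 1 TB counted as 1,024 GB
        "hbm_bandwidth_tbps": chips * HBM_TBPS_PER_CHIP,  # derived, not in source
    }


if __name__ == "__main__":
    for key, value in supernode_totals().items():
        print(f"{key}: {value}")
```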

Comparison with Competitors

Metric | Ascend 950DT | NVIDIA H200 | NVIDIA B200 | AMD MI355X
FP8 Compute | 1 PFLOPS | 1.97 PFLOPS | 4.5 PFLOPS | 2.3 PFLOPS
HBM Capacity | 144 GB | 141 GB | 192 GB | 288 GB
HBM Bandwidth | 4 TB/s | 4.8 TB/s | 8 TB/s | 6 TB/s
TDP | ~500 W | 700 W | 1,000 W | 1,400 W
Process | SMIC N+2 | TSMC 4NP | TSMC 4NP | TSMC 3nm
Ecosystem | CANN (Huawei) | CUDA | CUDA | ROCm
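Two derived ratios make the table easier to read: peak FP8 compute per watt, and memory bandwidth per unit of compute. The sketch below computes both from the values exactly as listed in the table (the 950DT TDP is approximate); the dictionary layout and function name are illustrative only.

```python
# Values copied from the comparison table above.
CHIPS = {
    "Ascend 950DT": {"fp8_pflops": 1.0, "hbm_tbps": 4.0, "tdp_w": 500},
    "NVIDIA H200": {"fp8_pflops": 1.97, "hbm_tbps": 4.8, "tdp_w": 700},
    "NVIDIA B200": {"fp8_pflops": 4.5, "hbm_tbps": 8.0, "tdp_w": 1000},
    "AMD MI355X": {"fp8_pflops": 2.3, "hbm_tbps": 6.0, "tdp_w": 1400},
}


def derived(spec: dict) -> dict:
    """Peak FP8 TFLOPS per watt and HBM bytes available per 1,000 FP8 FLOPs."""
    return {
        "tflops_per_watt": spec["fp8_pflops"] * 1000 / spec["tdp_w"],
        # (TB/s) / (PFLOPS) = 1e12 / 1e15 bytes per FLOP = bytes per 1,000 FLOPs
        "bytes_per_kflop": spec["hbm_tbps"] / spec["fp8_pflops"],
    }


if __name__ == "__main__":
    for name, spec in CHIPS.items():
        d = derived(spec)
        print(f"{name}: {d['tflops_per_watt']:.2f} TFLOPS/W, "
              f"{d['bytes_per_kflop']:.2f} bytes per 1,000 FLOPs")
```

On these figures the 950DT has the highest bandwidth-to-compute ratio of the four chips, which matches its "bandwidth-first" positioning, while trailing on absolute FP8 compute.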

Ecosystem Progress

DeepSeek Priority Deployment

  • DeepSeek V4 already runs on the Ascend 950 computing platform (including the 950DT)
  • V4.1 (optimized for the 950PR) is expected in June 2026
  • V4.2 (optimized for the 950DT) is expected in August 2026, further unlocking model capabilities
  • Goal: surpass top US closed-source AI models in areas such as AI programming

CANN Next CUDA Compatibility Layer

  • About 80% of PyTorch code can be migrated with configuration changes only
  • Supports mainstream models like DeepSeek, Qwen, LLaMA
  • Huawei Cloud ModelArts platform provides one-click migration tools

Launch Date and Availability

  • First Announcement: September 18, 2025 (Huawei Connect 2025)
  • Original Plan: Q4 2026
  • Moved Up To: August 2026, official launch on the Huawei Cloud platform
  • Availability: Huawei Cloud compute rental (hourly/monthly billing), not available for individual purchase
  • Physical Card Release: Expected Q4 2026 through partners (e.g., Inspur, Sugon)