Product Overview
Huawei Ascend 950DT is the high-bandwidth variant of the fourth-generation Ascend AI chip, scheduled to launch on Huawei Cloud in August 2026. It uses the same Da Vinci v5 compute cores as the 950PR (the cost-optimized variant) but pairs them with Huawei's self-developed HiZQ 2.0 HBM memory system, which provides 144 GB of capacity and up to 4 TB/s of bandwidth. It is designed for the inference Decode stage (token-by-token generation) and for model training.
950PR vs 950DT dual-version strategy: Huawei adopts a "scenario segmentation" design for the Ascend 950 series, pairing the same compute cores with different memory subsystems to match the differing demands of AI workloads. The 950PR targets Prefill (first-token generation), while the 950DT targets Decode and training.
Core Specifications
| Item | Parameter |
|---|---|
| Architecture | Da Vinci v5 (4th-gen Ascend) |
| Process | SMIC N+2 (roughly equivalent to an enhanced 7 nm node) |
| Programming Model | SIMD + SIMT dual model |
| HBM Type | HiZQ 2.0 (self-developed, bandwidth-first) |
| HBM Capacity | 144 GB |
| HBM Bandwidth | 4 TB/s |
| Interconnect Bandwidth | 2 TB/s (HCCS protocol) |
| FP8 Compute | 1 PFLOPS (HiF8 format) |
| FP4 Compute | 2 PFLOPS (MXFP4 format) |
| BF16/FP16 Compute | ~500 TFLOPS |
| INT8 Compute | ~2,000 TOPS |
| TDP | ~500 W |
| PCIe | Gen 5 ×16 |
| Launch Date | August 2026 (Huawei Cloud launch) |
| Price | ~¥120K-150K per card (estimated) |
950DT vs 950PR Detailed Comparison
| Dimension | 950PR | 950DT |
|---|---|---|
| Target Scenario | Inference Prefill (first token), video recommendation, real-time interaction | Inference Decode (token-by-token), model training, high-concurrency inference |
| HBM Type | HiBL 1.0 (cost-first) | HiZQ 2.0 (bandwidth-first) |
| HBM Capacity | 128 GB | 144 GB |
| HBM Bandwidth | ~3 TB/s | 4 TB/s |
| Interconnect Bandwidth | HCCS 784 GB/s | HCCS 2 TB/s |
| Supported Precision | FP8, HiF8 | FP8, MXFP8, MXFP4, HiF8 |
| Typical Application | Video recommendation, search | Dialogue generation, text continuation, SFT fine-tuning |
| Pricing | Lower (~¥70K per card) | Higher (~¥120K-150K per card) |
Key Technical Breakthroughs
1. HiZQ 2.0 Self-Developed HBM
- Bandwidth up to 4 TB/s, above the 3.35 TB/s of the HBM3 on NVIDIA's H100 but below the 4.8 TB/s of the HBM3e on the H200
- 144 GB capacity, which supports larger batch sizes and longer context windows
- Removes the dependency on SK Hynix and Samsung HBM, supporting supply-chain autonomy
2. Decode Stage Specific Optimization
- High-bandwidth memory subsystem: The Decode stage is limited by memory bandwidth rather than compute, and the 4 TB/s bandwidth is stated to improve long-context inference throughput by 2×
- MXFP4/MXFP8 support: Low-precision formats reduce the volume of data moved from memory, further improving Decode efficiency
- Cooperation with 950PR: The Prefill stage is handled by the 950PR and the Decode stage by the 950DT, forming a "heterogeneous inference pipeline"
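The claim that Decode is bandwidth-bound can be checked with a simple roofline estimate. The sketch below uses the peak FP8 and HBM figures from the specification table; the 70B-parameter model, the 2 FLOPs per parameter per token rule of thumb, and the function names are illustrative assumptions, and KV-cache traffic is ignored.

```python
# Peak figures from the specification table; the model below is hypothetical.
PEAK_FP8_FLOPS = 1e15   # 1 PFLOPS FP8
HBM_BANDWIDTH = 4e12    # 4 TB/s, in bytes per second


def ridge_point(peak_flops, bandwidth):
    """Arithmetic intensity (FLOP/byte) above which a kernel is compute-bound."""
    return peak_flops / bandwidth


def decode_step_time(weight_bytes, flops_per_token, batch, peak_flops, bandwidth):
    """Roofline estimate of one decode step: the slower of memory and compute.

    Assumes the weights are read once per step and shared by the whole batch.
    """
    memory_time = weight_bytes / bandwidth
    compute_time = batch * flops_per_token / peak_flops
    return max(memory_time, compute_time)


# Assumed workload: 70B parameters stored in FP8 (1 byte each),
# about 2 FLOPs per parameter for every generated token.
params = 70e9
weight_bytes = params * 1
flops_per_token = 2 * params

ridge = ridge_point(PEAK_FP8_FLOPS, HBM_BANDWIDTH)
t_batch1 = decode_step_time(weight_bytes, flops_per_token, 1,
                            PEAK_FP8_FLOPS, HBM_BANDWIDTH)
t_batch256 = decode_step_time(weight_bytes, flops_per_token, 256,
                              PEAK_FP8_FLOPS, HBM_BANDWIDTH)

print(f"ridge point: {ridge:.0f} FLOP/byte")
print(f"batch 1 step:   {t_batch1 * 1e3:.2f} ms")
print(f"batch 256 step: {t_batch256 * 1e3:.2f} ms")
```

Under these assumptions a batch-1 step is dominated by reading the weights (about 17.5 ms), and the step only becomes compute-bound once the batch exceeds roughly 125 sequences. Real throughput also depends on KV-cache size, kernel efficiency, and interconnect behavior, none of which this estimate models.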
3. SIMD + SIMT Dual Programming Model
- SIMD: Efficient vector compute, continuing the Da Vinci core advantage from the 910C
- SIMT: New model that supports flexible scheduling and is better suited to the irregular memory access patterns of the Decode stage
- Memory access granularity is reduced from 512 bytes to 128 bytes, improving the efficiency of scattered memory accesses by up to 4×
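The 4× figure follows from the ratio of the two granularities. A minimal sketch, assuming each scattered request is aligned, is rounded up to whole access granules, and never shares a granule with another request (the function names are hypothetical):

```python
import math


def fetched_bytes(request_bytes, granularity):
    """Bytes actually moved when a request is rounded up to whole granules."""
    return math.ceil(request_bytes / granularity) * granularity


def access_efficiency(request_bytes, granularity):
    """Fraction of the moved bytes that the request actually needed."""
    return request_bytes / fetched_bytes(request_bytes, granularity)


# Scattered read that needs 128 useful bytes per access.
old = access_efficiency(128, 512)   # 512-byte granule: 128 of 512 bytes used
new = access_efficiency(128, 128)   # 128-byte granule: every byte used

print(f"512 B granularity: {old:.0%} useful")
print(f"128 B granularity: {new:.0%} useful")
print(f"improvement: {new / old:.0f}x")
```

The gain is an upper bound: it applies to accesses of 128 bytes or less, while a contiguous 512-byte read is fully efficient at either granularity.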
4. CloudMatrix 384 System Integration
- 384 950DT chips can form a super node (requires mixed deployment with 950PR)
- Total compute: 384 × 1 PFLOPS FP8 ≈ 384 PFLOPS
- Total memory: 384 × 144 GB = 55,296 GB (54 TB in binary units, about 55.3 TB in decimal units)
- AI cluster performance comparable to NVIDIA GB300 NVL72
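The super node totals are straightforward multiplication of the per-chip figures. The sketch below reproduces them and also derives an aggregate HBM bandwidth, which is not stated in the source. It assumes a node populated entirely with 950DT chips, which is a simplification given the mixed-deployment requirement, and all values are theoretical peaks that ignore interconnect overhead.

```python
# Per-chip figures from the specification table.
CHIPS = 384
FP8_PFLOPS_PER_CHIP = 1
HBM_GB_PER_CHIP = 144
HBM_TBPS_PER_CHIP = 4

total_fp8_pflops = CHIPS * FP8_PFLOPS_PER_CHIP
total_hbm_gb = CHIPS * HBM_GB_PER_CHIP
total_hbm_tb_binary = total_hbm_gb / 1024    # 1 TB taken as 1024 GB
total_hbm_tb_decimal = total_hbm_gb / 1000   # 1 TB taken as 1000 GB
total_hbm_tbps = CHIPS * HBM_TBPS_PER_CHIP   # derived, not given in the source

print(f"FP8 compute:   {total_fp8_pflops} PFLOPS")
print(f"HBM capacity:  {total_hbm_gb} GB "
      f"({total_hbm_tb_binary:.0f} TB binary, {total_hbm_tb_decimal:.1f} TB decimal)")
print(f"HBM bandwidth: {total_hbm_tbps} TB/s aggregate")
```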
Comparison with Competitors
| Metric | Ascend 950DT | NVIDIA H200 | NVIDIA B200 | AMD MI355X |
|---|---|---|---|---|
| FP8 Compute | 1 PFLOPS | 1.97 PFLOPS | 4.5 PFLOPS | 2.3 PFLOPS |
| HBM Capacity | 144 GB | 141 GB | 192 GB | 288 GB |
| HBM Bandwidth | 4 TB/s | 4.8 TB/s | 8 TB/s | 6 TB/s |
| TDP | ~500W | 700W | 1,000W | 1,400W |
| Process | SMIC N+2 | TSMC 4N | TSMC 4NP | TSMC 3 nm |
| Ecosystem | CANN (Huawei) | CUDA | CUDA | ROCm |
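Because the chips differ widely in power draw, a per-watt view can make the comparison easier to read. The sketch below takes the FP8, bandwidth, and TDP figures exactly as listed in the table, including the approximate 500 W for the 950DT, and derives two ratios. The derived values are illustrative and are not vendor-published efficiency numbers.

```python
# (FP8 PFLOPS, HBM bandwidth in TB/s, TDP in watts), copied from the table above.
CHIPS = {
    "Ascend 950DT": (1.0, 4.0, 500),
    "NVIDIA H200": (1.97, 4.8, 700),
    "NVIDIA B200": (4.5, 8.0, 1000),
    "AMD MI355X": (2.3, 6.0, 1400),
}


def per_watt(fp8_pflops, hbm_tbps, tdp_w):
    """Return (FP8 TFLOPS per watt, HBM GB/s per watt)."""
    return fp8_pflops * 1000 / tdp_w, hbm_tbps * 1000 / tdp_w


efficiency = {name: per_watt(*spec) for name, spec in CHIPS.items()}

for name, (tflops_w, gbps_w) in efficiency.items():
    print(f"{name:<13} {tflops_w:5.2f} TFLOPS/W  {gbps_w:5.2f} GB/s per W")
```

On these figures the 950DT trails on compute per watt but matches the B200 on memory bandwidth per watt, which is consistent with its positioning as a Decode-oriented part.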
Ecosystem Progress
DeepSeek Priority Deployment
- DeepSeek V4 already runs on the Ascend 950 computing platform (including the 950DT)
- Version V4.1 (optimized for the 950PR) is expected in June 2026
- Version V4.2 (optimized for the 950DT) is expected in August 2026, further unlocking model capabilities
- Goal: Surpass the top closed-source US AI models in areas such as AI programming
CANN Next CUDA Compatibility Layer
- About 80% of PyTorch code can be migrated with configuration changes only
- Supports mainstream models such as DeepSeek, Qwen, and LLaMA
- The Huawei Cloud ModelArts platform provides one-click migration tools
Launch Date and Availability
- First Announcement: September 18, 2025 (Huawei Connect 2025)
- Original Plan: Q4 2026
- Revised Schedule: Moved up to August 2026, with the official launch on the Huawei Cloud platform
- Availability: Huawei Cloud compute rental (hourly or monthly billing); not available for individual purchase
- Physical Card Release: Expected in Q4 2026 through partners (e.g., Inspur, Sugon)
External Links