3 posts tagged with "Ascend"

Huawei Ascend series AI chips

Milestone: Huawei Ascend 910C Completes Full-Parameter Training of a 1.6-Trillion-Parameter Model

June 16, 2026 · 7 min read

AI Compute Cards Wiki Editorial

Industry Research Team

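On June 5, 2026, the official Shenzhen Fabu (深圳发布) channel announced that Shenzhen Hetao College (深圳河套学院), together with Harbin Institute of Technology (Shenzhen), Huawei, and other teams, had completed full-parameter post-training of the 1.6-trillion-parameter DeepSeek-V4-Pro model on 1,000 Huawei Ascend 910C chips.

This was not a tentative experiment but a milestone technical breakthrough. It demonstrates, with a concrete engineering result, that domestic AI chips can support the training of world-class, very large models.

Why Does This Matter?

The Two Hurdles for AI Chips: Inference and Training

Inference: using an existing model to chat or write copy. Domestic chips could already do this
Training: adjusting model parameters so the model learns new capabilities. Full-parameter training adjusts all 1.6 trillion parameters at once, which is as hard as it gets

Until now, full-parameter training of trillion-parameter models had been monopolized by NVIDIA H100/H200. Domestic chips were used for inference only, not for large-scale training.

The significance of this breakthrough: domestic compute has moved from "usable" to "good to use", and from inference to training.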

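Technical Details

Training Configuration

Item	Value
Chips	Huawei Ascend 910C × 1,000
Model	DeepSeek-V4-Pro
Parameter count	1.6 trillion (1,600B)
Training type	Full-parameter post-training
Training framework	MindSpore (昇思) + torch_npu
Completion	Officially announced June 5, 2026

Performance Metrics

Metric	Value	Assessment
Compute utilization	>30%	Industrial-grade level (top overseas chips: ~40%)
Efficiency gain on key training operators	14%	Compared with the previous-generation 910B
Communication bandwidth utilization	>60% (speculated)	All-to-All communication for the MoE model
Stability	1,000 chips trained continuously without failure	Cluster stability meets the bar

💡 About the 30% compute utilization: many people think 30% is not high, but in large-model training it is already a solid industrial-grade figure. Even on the best overseas chips, many teams see real-world utilization of only about 40%.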
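To make the utilization figure concrete, here is a minimal sketch of how model FLOPs utilization (MFU) is commonly estimated. The throughput and active-parameter values are hypothetical placeholders chosen for illustration, not numbers from the announcement, and the "6 FLOPs per active parameter per token" rule is a standard approximation rather than the team's stated method.

```python
def model_flops_utilization(tokens_per_second, active_params,
                            num_chips, peak_flops_per_chip):
    """Return achieved training FLOP/s divided by aggregate peak FLOP/s."""
    # Common approximation: forward + backward costs ~6 FLOPs
    # per active parameter per token.
    achieved = 6.0 * active_params * tokens_per_second
    peak = num_chips * peak_flops_per_chip
    return achieved / peak


mfu = model_flops_utilization(
    tokens_per_second=800_000,   # hypothetical cluster throughput
    active_params=50e9,          # hypothetical active parameters per token (MoE)
    num_chips=1000,              # from the article
    peak_flops_per_chip=800e12,  # 800 TFLOPS BF16, from the article
)
print(f"MFU = {mfu:.1%}")  # MFU = 30.0%
```

For an MoE model only the parameters activated per token count toward the numerator, which is why the estimate uses active rather than total parameters.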

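Ascend 910C Detailed Specifications

The Ascend 910C is an AI training/inference chip that Huawei unveiled on April 24, 2024 (Huawei Analyst Summit). Its theoretical peak compute is 800 TFLOPS (BF16), in the same class as the NVIDIA H100.

Spec	Ascend 910C	Ascend 910B	NVIDIA H100
Architecture	Ascend 910C	Ascend 910B	Hopper
Process	TSMC 7nm (speculated)	TSMC 7nm	TSMC 4NP
BF16 compute	800 TFLOPS	256 TFLOPS	989 TFLOPS (dense)
Memory	64GB HBM (speculated)	64GB HBM2e (B1/B2)	80GB HBM3
Memory bandwidth	~2 TB/s (speculated)	600 GB/s (B1/B2)	3.35 TB/s
TDP	~400W (speculated)	300-400W	700W
Mass production	April 2026 (formal mass production)	November 2022	March 2022

Key upgrades:

✅ Roughly 3× compute: from the 910B's 256 TFLOPS to 800 TFLOPS
✅ More complete software ecosystem: torch_npu adapts PyTorch, and the MindSpore framework has matured
✅ Cluster stability: 1,000 chips trained continuously without failure (the biggest breakthrough)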

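Technical Challenges and Solutions

Challenge 1: Memory Requirements of a Trillion-Parameter Model

For a 1.6-trillion-parameter model, the parameters alone require:

FP16 precision: 1.6T × 2 bytes = 3.2 TB
Adding gradients and optimizer states: at least 10 TB of device memory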
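The arithmetic above can be checked with a short sketch. The per-parameter byte counts assume a conventional mixed-precision Adam setup (FP16 weights and gradients plus FP32 master weights and two FP32 optimizer moments); the article does not state which optimizer or layout the team used, so treat the total as an illustrative estimate consistent with the "at least 10 TB" lower bound, not a reported figure.

```python
PARAMS = 1.6e12  # 1.6 trillion parameters, from the article

# Assumed bytes per parameter for mixed-precision Adam (not from the article).
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "fp32 Adam first moment": 4,
    "fp32 Adam second moment": 4,
}


def terabytes(n_bytes):
    """Convert bytes to decimal terabytes (1 TB = 1e12 bytes)."""
    return n_bytes / 1e12


weights_tb = terabytes(PARAMS * BYTES_PER_PARAM["fp16 weights"])
total_tb = terabytes(PARAMS * sum(BYTES_PER_PARAM.values()))

print(f"weights only: {weights_tb:.1f} TB")            # weights only: 3.2 TB
print(f"weights + grads + optimizer: {total_tb:.1f} TB")  # weights + grads + optimizer: 25.6 TB
```

Activation memory is excluded here; it depends on batch size, sequence length, and recomputation settings.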

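Huawei's solution:

Model parallelism: shard the model across the 1,000 910C chips
ZeRO optimizer: partition optimizer states (and optionally gradients and parameters) across devices to cut per-chip memory
Gradient accumulation: accumulate gradients over several micro-batches before each parameter update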
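As a rough illustration of why sharding matters at this scale, the sketch below applies the per-device memory formulas from the ZeRO paper to a 1.6T-parameter model on 1,000 devices. It assumes pure data-parallel ZeRO with mixed-precision Adam (12 bytes of optimizer state per parameter) and ignores activations; the actual run reportedly combined several parallelism strategies, so these numbers are not a description of the team's configuration.

```python
def zero_bytes_per_device(params, num_devices, stage, opt_bytes_per_param=12):
    """Model-state bytes per device under ZeRO stage 0-3 (activations excluded)."""
    weights = 2 * params                  # FP16 weights
    grads = 2 * params                    # FP16 gradients
    opt = opt_bytes_per_param * params    # FP32 master weights + Adam moments
    if stage >= 1:
        opt /= num_devices                # stage 1: shard optimizer states
    if stage >= 2:
        grads /= num_devices              # stage 2: also shard gradients
    if stage >= 3:
        weights /= num_devices            # stage 3: also shard parameters
    return weights + grads + opt


PARAMS = 1.6e12
DEVICES = 1000
HBM_BYTES = 64e9  # 64 GB per chip, the article's speculated 910C capacity

for stage in range(4):
    per_device = zero_bytes_per_device(PARAMS, DEVICES, stage)
    fits = "fits" if per_device < HBM_BYTES else "does not fit"
    print(f"stage {stage}: {per_device / 1e9:,.1f} GB per device ({fits})")
```

Under these assumptions only full sharding (stage 3, about 25.6 GB per device) leaves the model states within a 64 GB chip, which is why some form of parameter partitioning is unavoidable for a model this large.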

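Challenge 2: Communication Efficiency in a Thousand-Chip Cluster

With 1,000 chips training together, inter-chip communication becomes the bottleneck. MoE models require All-to-All communication (tokens are routed to experts hosted on other chips, so each chip may need to exchange data with every other chip).

Huawei's solution:

HCCS (Huawei Cache Coherence System): Huawei's in-house high-speed chip-to-chip interconnect
Hierarchical communication: HCCS within a node, plus a scale-out network (typically RoCE) between nodes
Communication-computation overlap: data transfer proceeds while computation is running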
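The All-to-All pattern described above can be illustrated with a single-process simulation. This is only a sketch of the data movement, not the HCCL implementation: ranks are plain Python lists, and the routing rule (`token % world_size`) is a hypothetical stand-in for a learned MoE router.

```python
def all_to_all(send_buffers):
    """Simulate All-to-All: send_buffers[src][dst] becomes recv[dst][src]."""
    world = len(send_buffers)
    return [[send_buffers[src][dst] for src in range(world)]
            for dst in range(world)]


def dispatch_tokens(tokens_per_rank, route, world):
    """Bucket each rank's tokens by destination expert rank, then exchange."""
    send = []
    for rank_tokens in tokens_per_rank:
        buckets = [[] for _ in range(world)]
        for tok in rank_tokens:
            buckets[route(tok)].append(tok)
        send.append(buckets)
    return all_to_all(send)


WORLD = 2
tokens = [[0, 1, 2, 3], [4, 5, 6, 7]]  # tokens initially held by rank 0 and rank 1
received = dispatch_tokens(tokens, route=lambda t: t % WORLD, world=WORLD)
print(received)  # [[[0, 2], [4, 6]], [[1, 3], [5, 7]]]
```

Every rank sends a bucket to every other rank, so traffic grows with the square of the number of participants in the exchange; that is what makes this collective the dominant communication cost for MoE training.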

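Challenge 3: Training Stability

Training a trillion-parameter model takes weeks or even months, and the failure of any single chip can interrupt the entire run.

Huawei's solution:

Fault detection and automatic recovery: monitor chip status in real time; on failure, restart automatically and restore the training state
Checkpoint optimization: save training state frequently (once every N steps)
Ascend cluster management software: designed specifically for enterprise-scale training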
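The checkpoint-and-resume loop described above can be sketched as follows. This is a toy, single-process illustration: the JSON file, the `train` function, and the simulated failure are all hypothetical and stand in for the distributed checkpointing that a real cluster would use.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write to a temp file, then rename, so a crash never leaves a partial file."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    """Return the last saved state, or a fresh state if none exists."""
    if not os.path.exists(path):
        return {"step": 0, "weight": 0.0}
    with open(path) as f:
        return json.load(f)


def train(path, total_steps, save_every, fail_at=None):
    """Resume from the latest checkpoint and save every `save_every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"] + 1, total_steps + 1):
        if step == fail_at:
            raise RuntimeError(f"simulated hardware fault at step {step}")
        state = {"step": step, "weight": state["weight"] + 0.5}  # toy update
        if step % save_every == 0:
            save_checkpoint(path, state)
    return state


with tempfile.TemporaryDirectory() as workdir:
    ckpt = os.path.join(workdir, "ckpt.json")
    try:
        train(ckpt, total_steps=30, save_every=10, fail_at=25)
    except RuntimeError:
        pass  # a supervisor process would restart the job here
    resumed_from = load_checkpoint(ckpt)["step"]
    final = train(ckpt, total_steps=30, save_every=10)

print(resumed_from, final)  # 20 {'step': 30, 'weight': 15.0}
```

The failure at step 25 costs only the five steps since the last save, which is the trade-off behind choosing N: smaller intervals lose less work per fault but spend more time writing checkpoints.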

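Comparison with Competitors

Vendor	Chip	1.6T-parameter training	Ecosystem maturity	Availability
Huawei	Ascend 910C	✅ Completed	⭐⭐⭐ (improving)	Domestic China
NVIDIA	H100/H200	✅ Industry standard	⭐⭐⭐⭐⭐	Global (subject to export controls)
AMD	MI300X	✅ Feasible	⭐⭐⭐⭐	Global
Google	TPU v5p/8t	✅ JAX-native	⭐⭐⭐⭐	Google Cloud

Conclusion: on paper, the Ascend 910C has largely caught up with the H100 in hardware performance. The software ecosystem still lags, but this successful training run proves engineering feasibility.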

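Industry Impact

1. The "Zunyi Conference" of Domestic Compute

Industry insiders have called this breakthrough the "Zunyi Conference" of domestic compute, a reference to the 1935 turning point in Chinese Communist Party history: the moment of shifting from passive defense to strategic counteroffensive.

Specific effects:

✅ Breaks the prejudice that "domestic chips can only do inference"
✅ Proves that domestic chips can train frontier models
✅ Provides a compute foundation for domestic large models (such as DeepSeek-V4 and ERNIE 5.0)

2. Impact on NVIDIA

The Ascend 910C completing trillion-parameter training means China's AI industry is less dependent on NVIDIA.

Scenario	Before	Now
Inference	Domestic chips usable	Domestic chips good to use
Training	H100/H200 required	910C is an option
Large-scale training	H100 clusters required	910C clusters are an option

3. A Boost for the Domestic Chip Industry

This breakthrough is expected to lift the whole domestic AI chip supply chain:

Chip design: Cambricon, MetaX, Moore Threads, and others accelerate their iteration
Wafer fabrication: SMIC, Hua Hong, and others win more orders
Packaging and testing: JCET, Tongfu Microelectronics, and others benefit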

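Huawei Ascend Chip Roadmap (2025-2028)

Date	Chip	Positioning
2025 Q1	Ascend 910C	Flagship training/inference (in mass production)
2026 Q1	Ascend 950PR	Inference-optimized (~500 TFLOPS BF16)
2026 Q4	Ascend 950DT	Data-center training
2027 Q4	Ascend 960	Next-generation flagship
2028 Q4	Ascend 970	The generation after that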

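Lessons from the Training Run

The Shenzhen Hetao College team accumulated valuable experience during training:

✅ What Worked

Progressive training: start with a small model (7B) and scale up step by step to 1.6T
Mixed-precision training: BF16 for the main training computation + FP32 for gradient accumulation
Communication optimization: overlap All-to-All communication with computation
Fault recovery: save a checkpoint every 1,000 steps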
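⚠️ Challenges Encountered

Memory fragmentation: device memory fragments badly over long runs and needs periodic defragmentation
Communication bottleneck: the MoE model's All-to-All communication takes 30%+ of training time
Software bugs: torch_npu occasionally leaks memory, requiring the training process to be restarted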

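References

This article is compiled from public reports. Our respects to the teams at Shenzhen Hetao College, Harbin Institute of Technology (Shenzhen), Huawei, and others: you have shown through engineering practice that China's AI compute is viable.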

Huawei Ascend 950 Mass Production and the Full Picture of China's AI Chip Ecosystem

June 4, 2026 · 4 min read

AI Compute Cards Wiki Editorial

Industry Research Team

June 2026 — Huawei's Ascend 950 series (950PR / 950DT) has entered formal mass production and delivery, a landmark event for China's AI chip industry in 2026. Meanwhile, Cambricon's MLU690 has begun shipping and Moore Threads has announced MTT S5000 specifications, formally establishing China's tri-polar AI chip landscape.

Ascend 950 Series: A Historic Breakthrough with Self-Developed HBM

Huawei HiSilicon's Ascend 950 series is the fourth-generation Ascend AI chip, first revealed at Huawei Connect 2025 in September and entering mass production in Q1 2026.

950PR (Prefill Inference Specialized)

Item	Specification
Architecture	Da Vinci v5 (SIMD + SIMT dual-model)
Process	N+2 (SMIC domestic)
HBM	HiBL 1.0 (Huawei self-developed), 128 GB
FP8 Compute	1 PFLOPS (HiF8 format)
TDP	~400 W
Target	Inference Prefill (video recommendation, real-time interaction)

950DT (Decode + Training Specialized)

Item	Specification
Architecture	Da Vinci v5 (SIMD + SIMT dual-model)
Process	N+2 (SMIC domestic)
HBM	HiZQ 2.0 (Huawei self-developed), 144 GB, 4 TB/s
FP8 Compute	1 PFLOPS (HiF8 format)
TDP	~500 W
Target	Inference Decode + Model Training

Historical Significance

Self-developed HBM (HiBL 1.0 / HiZQ 2.0) represents the most important technical breakthrough of Huawei Ascend 950 — this is the first time a Chinese enterprise has achieved self-developed mass production of HBM memory, completely eliminating dependence on SK Hynix / Samsung HBM supply. Combined with the domestic N+2 process, Ascend 950 has achieved full-chain domestic production from HBM → Compute Die → Packaging → System.

Cambricon MLU690: China's Only Native FP8 Support

Cambricon's seventh-generation AI chip MLU 690 (Siyuan 690) began volume production and shipping in H1 2026. This is the first domestic AI chip with native FP8 precision support.

Item	MLU 690
Process	5nm (TSMC / SMIC)
FP8 dense	2 PFLOPS
HBM	192GB HBM3E, 5 TB/s
TDP	~500 W
Unit Price (OAM)	~$8,000-12,000

On paper, MLU 690's FP8 compute (2 PFLOPS dense) is within about 2× of NVIDIA Blackwell (B200: 4.5 PFLOPS FP8 dense). Leveraging its financing advantage as a STAR Market listed company, Cambricon targets 2026 revenue of ¥15-20B (2025: ¥7.2B).

Moore Threads MTT S5000: From Graphics to Training-Inference Unified

Moore Threads publicly disclosed detailed specifications of the MTT S5000 in February 2026, featuring the fourth-generation MUSA "Pinghu" architecture, single-card AI compute of 1,000 TFLOPS, 80GB GDDR6X memory, 1.6 TB/s bandwidth.

Moore Threads pursues a full-function GPU path (graphics rendering + AI compute + general-purpose compute), closest to NVIDIA's strategy. The founding team comes from former NVIDIA China, and the MUSIFY toolchain helps auto-migrate CUDA code to the MUSA platform, lowering ecosystem migration costs.

China's Tri-Polar AI Chip Landscape

Dimension	Huawei Ascend	Cambricon	Moore Threads
Core Architecture	Da Vinci v5	MLUv07	MUSA 4th Gen
Process	N+2 domestic	5nm	6nm
FP8 Compute	~1 PFLOPS	2 PFLOPS	0.5 PFLOPS (estimated)
HBM Self-Sufficiency	✅ Self-developed HiBL/HiZQ	❌ Purchased	❌ Purchased
Ecosystem	CANN + MindSpore	NeuWare + MindSpore	MUSA + MUSIFY
Advantage	Full-chain domestic	Highest FP8 compute	Full-function + CUDA migration
2025 Revenue	(Huawei internal)	¥7.2B	¥2.2B

Global Market Comparison (Q2 2026 Update)

Tier	Vendor	Flagship Chip	FP8 (PFLOPS)	HBM	Mass Production
Tier 1	NVIDIA	Rubin R200	25 PF (sparse)	288GB HBM4	2026 H2
Tier 2	AMD	MI400	20 PF (dense)	432GB HBM4	2026
Tier 2	Huawei	Ascend 950DT	1 PF (dense)	144GB self-developed HBM	2026 Q1
Tier 2	Cambricon	MLU690	2 PF (dense)	192GB HBM3E	2026 H1
Tier 2	AWS	Trainium 3	5.7 PF (dense)	144GB HBM	2025 Q4 GA
Tier 3	Intel	Gaudi 3	1.8 PF	128GB HBM2e	In production
Tier 3	Google	TPU v7	4.6 PF	192GB HBM	2025
Tier 3	Moore Threads	MTT S5000	1 PF	80GB GDDR6X	2025 Q1

Note: NVIDIA uses sparse compute as standard, while AMD / Huawei / Cambricon use dense — not directly comparable.
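One way to put the table's figures on a roughly common footing is to convert sparse numbers to a dense-equivalent estimate. The sketch below assumes that a "sparse" figure is 2× the dense figure, as is usual for 2:4 structured sparsity; that factor is an assumption for illustration and has not been confirmed for every chip listed.

```python
SPARSITY_SPEEDUP = 2.0  # assumption: sparse peak = 2 x dense peak (2:4 sparsity)


def dense_equivalent(pflops, basis):
    """Convert a quoted peak to an estimated dense figure."""
    if basis == "dense":
        return pflops
    if basis == "sparse":
        return pflops / SPARSITY_SPEEDUP
    raise ValueError(f"unknown basis: {basis!r}")


# (quoted FP8 PFLOPS, basis) taken from the table above
quoted = {
    "NVIDIA Rubin R200": (25.0, "sparse"),
    "AMD MI400": (20.0, "dense"),
    "Huawei Ascend 950DT": (1.0, "dense"),
    "Cambricon MLU690": (2.0, "dense"),
}

normalized = {chip: dense_equivalent(*spec) for chip, spec in quoted.items()}
for chip, pf in sorted(normalized.items(), key=lambda kv: -kv[1]):
    print(f"{chip}: ~{pf:g} PF dense-equivalent")
```

Even after this adjustment the comparison remains approximate, since vendors differ in number formats, clocks, and how peak figures are measured.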

Outlook for H2 2026

NVIDIA Rubin R200: Official shipment in H2 2026, 288GB HBM4, 6-chip CoWoS-L packaging
Huawei Ascend 960: Roadmap H2 2027, expected FP8 compute doubled to 2 PFLOPS
Cambricon MLU790: Expected 2027, 3nm, 384GB HBM4, 2.5 PFLOPS
Moore Threads: Next-gen GPU expected with HBM3, 2× MTT S5000 compute

By 2026, China's AI chip industry has formed a complete product matrix from Training (Cambricon MLU690 / Ascend 950DT) → Inference (Ascend 950PR / Moore Threads S5000) → Systems (CloudMatrix / Distributed Clusters).

This article is based on public information from Huawei Connect 2025 (2025-09-18), industry analysis reports from April 2026, and the latest market data as of June 2026.

China AI Chip Landscape 2025: Ascend, Cambricon, Hygon — Who Will Dominate?

June 3, 2025 · 5 min read

AI Compute Cards Wiki Editorial

Industry Research Team

Escalating U.S. export controls are forcing China's AI chip industry to accelerate self-reliance. By 2025, the discussion around domestic Chinese AI chips has shifted from "are they usable?" to "which one should I choose?"

This article systematically reviews the major players, core products, and actual deployment status of domestic AI chips, helping developers and procurement decision-makers understand the competitive landscape.
