
Gahow Wang a57afa86b4 Initial commit: obsidian to gitea

2026-05-07 15:04:41 +08:00

8.7 KiB


Background

Hardware spec

Device	Memory	Memory bandwidth	FP32 TFLOPS	TF32 TFLOPS	FP16 TFLOPS	FP8 TFLOPS (FP16 accumulate)	Interconnect bandwidth
M3-Ultra	512 GB	800 GB/s	43	NA	114.688	NA	Thunderbolt 5, 15 GB/s per link (max: 6 links * 15 GB/s)
4090	24 GB	1.01 TB/s	82.6	82.6 / 165.2	82.58	660.6 / 1321.2
5090	32 GB	1.79 TB/s	104.8	104.8 / 209.5	419 / 838	838 / 1676
H20-HGX	96 GB	4 TB/s	44	74	148	296
H100-SXM	80 GB	3.35 TB/s	67	495 / 989	990 / 1979	1979 / 3958
A100-SXM	80 GB	2039 GB/s	19.5	156 / 312	312 / 624	NA

M3-Ultra

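Compute: For the DeepSeek-V3/R1 architecture, prefill costs 0.0763 TFLOP per token. Assuming M3-Ultra sustains 80% of its 114 TFLOPS FP16 peak, a single M3-Ultra can prefill at most 1195 tokens/s. With an average seq_len of 2k tokens, one decode step costs 0.1055 TFLOP per token. Assuming M3-Ultra sustains 30% of peak during decode, it delivers at most 324 tokens/s; with batch_size = 16, each user gets about 20 tps (TPOT = 50 ms), which meets the SLO. Even at seq_len = 8k it still provides about 10 tps, enough for basic use.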

seq_len	TFLOPs/token	TPS for a batch (30% utilization)
256	0.075	455.825
512	0.079	430.848
1024	0.088	388.296
2048	0.105	324.247
4096	0.140	243.813
8192	0.210	162.963
16384	0.349	97.981
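
The throughput column follows directly from the per-token cost. Below is a minimal sketch, assuming the 114 TFLOPS FP16 peak used elsewhere in these notes and the rounded TFLOPs/token values from the table, so the results differ from the table in the last digits. The function and variable names are illustrative only.

```python
# Peak FP16 throughput used in these notes (TFLOPS).
PEAK_TFLOPS = 114.0


def tokens_per_second(tflop_per_token: float, utilization: float) -> float:
    """Compute-bound token throughput for the whole batch."""
    return PEAK_TFLOPS * utilization / tflop_per_token


# Rounded per-token decode cost copied from the table (seq_len -> TFLOP/token).
DECODE_COST = {256: 0.075, 512: 0.079, 1024: 0.088, 2048: 0.105,
               4096: 0.140, 8192: 0.210, 16384: 0.349}

# 80% utilization for prefill and 30% for decode are the notes' assumptions.
prefill_tps = tokens_per_second(0.0763, 0.8)
decode_tps = {s: tokens_per_second(c, 0.3) for s, c in DECODE_COST.items()}

if __name__ == "__main__":
    print(f"prefill: {prefill_tps:.0f} tokens/s")
    for seq_len, tps in decode_tps.items():
        print(f"seq_len={seq_len:>5}: {tps:7.1f} tokens/s per batch, "
              f"{tps / 16:5.1f} tps per user at batch_size=16")
```

Dividing the batch throughput by batch_size gives the per-user rate, which is how the 20 tps (2k) and 10 tps (8k) figures arise.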

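Memory access: Each token activates 37 B parameters. With FP8 weights that is 37 GB to load per decode step; assuming 85% of the 800 GB/s bandwidth is achievable, loading the parameters takes 54.41 ms. Added to the 50 ms of compute, this roughly halves throughput, from 20 TPS to about 10 TPS.
Communication: The MLA all_reduce moves 0.000814438 GB per token. During decode with bsz = 64 over a 15 GB/s link at 50% utilization, this takes 6.95 ms; with bsz = 16 it takes 1.74 ms. The 50 ms decode step then becomes 51.74 ms, which gives each user about 19.3 tps.
KV cache load: With seq_len = 4096 and bsz = 16, loading the KV cache takes 6.7 ms; counting this memory access alone, each user could reach 149 tps. Taking TPOT = 50 ms and adding the communication and KV cache load overheads keeps the step within 60 ms, so each user gets about 16.7 tps.
KV cache storage: Stored in FP16, the KV cache takes 68.62 KB per token. With bsz = 16 and an average input + output of 8k tokens, the total is 8.58 GB, which is trivial for an M3-Ultra with up to 512 GB of memory.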
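
The 68.62 KB and 8.58 GB figures can be reproduced from the MLA cache layout. This is a sketch, assuming the public DeepSeek-V3 config values (kv_lora_rank = 512, qk_rope_head_dim = 64, 61 layers), which these notes do not state, and reading KB/GB as binary units (KiB/GiB).

```python
KV_LORA_RANK = 512      # assumption: public DeepSeek-V3 config
QK_ROPE_HEAD_DIM = 64   # assumption: public DeepSeek-V3 config
N_LAYERS = 61           # assumption: public DeepSeek-V3 config
BYTES_FP16 = 2


def kv_bytes_per_token() -> int:
    """Bytes cached per token: compressed KV latent plus the RoPE key, for every layer."""
    return (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * N_LAYERS * BYTES_FP16


def kv_cache_bytes(batch_size: int, tokens_per_request: int) -> int:
    """Total KV cache footprint for a batch of equally long requests."""
    return kv_bytes_per_token() * batch_size * tokens_per_request


per_token_kib = kv_bytes_per_token() / 1024
total_gib = kv_cache_bytes(16, 8 * 1024) / 1024**3

if __name__ == "__main__":
    print(f"{per_token_kib:.2f} KiB per token, {total_gib:.2f} GiB for 16 x 8k tokens")
```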

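Putting these together, assume an input batch of 16 reqs * 4k context served on a single M3-Ultra. The time to output the next token is:

load model params: 37 GB / (800 GB/s * 0.85) = 54.41 ms
load KV cache: 6.7 ms
compute (using 0.105 TFLOP/token, the seq_len = 2048 figure): (0.105 / (114 * 0.3)) * 16 = 49.12 ms

This gives 110.23 ms / token for each request, i.e. ~9 TPS.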
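
The three terms can be combined in a small latency model. This is a sketch under the same assumptions as the notes: the terms are simply added (no overlap between memory traffic and compute), and the 6.7 ms KV cache load is taken as a given input. The function name and parameter names are illustrative.

```python
def decode_step_ms(batch_size: int,
                   tflop_per_token: float = 0.105,
                   peak_tflops: float = 114.0,
                   compute_util: float = 0.3,
                   weights_gb: float = 37.0,
                   mem_bw_gb_s: float = 800.0,
                   mem_util: float = 0.85,
                   kv_load_ms: float = 6.7) -> dict:
    """Per-step decode latency as a serial sum of weight load, KV load and compute."""
    load_params_ms = weights_gb / (mem_bw_gb_s * mem_util) * 1e3
    compute_ms = tflop_per_token / (peak_tflops * compute_util) * batch_size * 1e3
    total_ms = load_params_ms + kv_load_ms + compute_ms
    return {
        "load_params_ms": load_params_ms,
        "kv_load_ms": kv_load_ms,
        "compute_ms": compute_ms,
        "total_ms": total_ms,
        "tps_per_request": 1e3 / total_ms,
    }


step = decode_step_ms(batch_size=16)

if __name__ == "__main__":
    for name, value in step.items():
        print(f"{name}: {value:.2f}")
```

Changing `tflop_per_token` to another row of the seq_len table shows how sensitive the per-request rate is to context length.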

KTransformers

Assuming a single 4090, the reported performance is 286 tokens/s prefill and 14 tokens/s decode.

TBD:

Theoretical estimate of KT performance

On-demand quantization, Module injection, Operator placement

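Core features:

Writing a YAML file is enough to match specified parts of the model and offload those parameters to the CPU for computation
An operator injection framework that replaces the operators of specified modules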

References

Backup

KTransformers

Assuming a single 4090, the reported performance is 286 tokens/s prefill and 14 tokens/s decode.

TBD: Theoretical estimate of KT performance

On-demand quantization, Module injection, Operator placement

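Core features:

Writing a YAML file is enough to match specified parts of the model and offload those parameters to the CPU for computation
An operator injection framework that replaces the operators of specified modules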

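Model details

DeepSeek-V3, calculation reference. Each expert is 44,040,192 B and the router is 1,835,264 B. Each layer loads 8 experts: 8 * 44,040,192 + 1,835,264 = 354,156,800 B, so each layer loads about 0.35 GB of parameters. There are 58 such layers (the first 3 layers are dense). If each layer's parameters go over PCIe 5.0 x8, the transfer takes 11 ms.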
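
The byte counts and the 11 ms transfer time can be checked with a few lines. This is a sketch, assuming the public DeepSeek-V3 config values (hidden_size = 7168, moe_intermediate_size = 2048, 256 routed experts), 1 byte per parameter (FP8), and a nominal 32 GB/s for PCIe 5.0 x8. None of these are stated in the notes; they are one decomposition that reproduces the numbers.

```python
HIDDEN_SIZE = 7168            # assumption: public DeepSeek-V3 config
MOE_INTERMEDIATE_SIZE = 2048  # assumption: public DeepSeek-V3 config
N_ROUTED_EXPERTS = 256        # assumption: public DeepSeek-V3 config
EXPERTS_LOADED = 8            # from the notes
BYTES_PER_PARAM = 1           # assumption: FP8 weights
PCIE5_X8_GB_S = 32.0          # assumption: nominal PCIe 5.0 x8 bandwidth

# Gate, up and down projections of one expert.
expert_bytes = 3 * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE * BYTES_PER_PARAM
# Router weight matrix plus one bias value per routed expert.
router_bytes = (N_ROUTED_EXPERTS * HIDDEN_SIZE + N_ROUTED_EXPERTS) * BYTES_PER_PARAM
layer_bytes = EXPERTS_LOADED * expert_bytes + router_bytes

transfer_ms = layer_bytes / (PCIE5_X8_GB_S * 1e9) * 1e3

if __name__ == "__main__":
    print(f"expert: {expert_bytes:,} B, router: {router_bytes:,} B")
    print(f"per layer: {layer_bytes:,} B, transfer: {transfer_ms:.2f} ms")
```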

Compute requirement:

8 \cdot 7168^2 + 2 \cdot 14336 \cdot L^2 + 473432064 \cdot L

With 114 TFLOPS of FP16 compute, decode at full utilization can handle ~4k tokens in a batch.

Per-layer compute requirements

MLA


\begin{aligned}
o_{t, i} &= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{q_{t, i}^T k_{j, i}}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[q_{t, i}^C; q_{t, i}^R]^T [k_{j, i}^C; k_j^R]}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}c_j^{KV}; k_j^R]}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W_i^{UV}\textcolor{blue}{c_j^{KV}}) \\
\end{aligned}


\begin{aligned}
u_{t, i} &= W^O_i \cdot o_{t, i} \\
&= W^O_i \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W_i^{UV}\textcolor{blue}{c_j^{KV}}) \\
&= \sum_{j = 1}^t \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W^O_i W_i^{UV})\textcolor{blue}{c_j^{KV}} \\
\end{aligned}
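
The two rewrites used above (folding W^{UK} into the query, and folding W^O W^{UV} into a single matrix applied to the cached latents) can be checked numerically. This is a toy sketch in pure Python with made-up small dimensions and random weights for a single head; all names and sizes are hypothetical, and the softmax weights are a stand-in rather than real attention scores.

```python
import math
import random

random.seed(0)


def rand_matrix(rows, cols):
    return [[random.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def matvec(a, x):
    return [sum(a_k * x_k for a_k, x_k in zip(row, x)) for row in a]


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


# Toy sizes: latent dim (kv_lora_rank), head dim, hidden size, cached tokens.
D_C, D_H, D_MODEL, T = 6, 4, 5, 3

W_UK = rand_matrix(D_H, D_C)      # c^{KV} -> k^C for one head
W_UV = rand_matrix(D_H, D_C)      # c^{KV} -> v^C for one head
W_O = rand_matrix(D_MODEL, D_H)   # output projection slice for the same head
q_c = [random.uniform(-1, 1) for _ in range(D_H)]   # stands in for W^{UQ} c^Q
c_kv = rand_matrix(T, D_C)        # cached latents, one row per token

# Key side: q^T (W_UK c) == (W_UK^T q)^T c, so the score needs only the latent.
W_UK_T = [list(col) for col in zip(*W_UK)]
q_absorbed = matvec(W_UK_T, q_c)
scores_normal = [dot(q_c, matvec(W_UK, c)) for c in c_kv]
scores_absorbed = [dot(q_absorbed, c) for c in c_kv]

# Stand-in softmax weights over the cached tokens.
z = sum(math.exp(s) for s in scores_normal)
attn = [math.exp(s) / z for s in scores_normal]

# Normal path: up-project every latent to a value, mix, then apply W^O.
values = [matvec(W_UV, c) for c in c_kv]
o = [sum(attn[j] * values[j][d] for j in range(T)) for d in range(D_H)]
u_normal = matvec(W_O, o)

# Absorbed path: precompute W^O W^{UV} once and mix the latents directly.
W_OUV = matmul(W_O, W_UV)
c_mix = [sum(attn[j] * c_kv[j][d] for j in range(T)) for d in range(D_C)]
u_absorbed = matvec(W_OUV, c_mix)

scores_match = all(math.isclose(a, b, abs_tol=1e-9)
                   for a, b in zip(scores_normal, scores_absorbed))
outputs_match = all(math.isclose(a, b, abs_tol=1e-9)
                    for a, b in zip(u_normal, u_absorbed))
```

Both paths agree up to floating-point error, which is why only the latents c^{KV} and k^R need to be cached.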

h_t \in \mathbb{R}^{\text{hidden\_size} \times 1}

W^{DQ} \in \mathbb{R}^{\text{q\_lora\_rank} \times \text{hidden\_size}}, c_t^Q \in \mathbb{R}^{\text{q\_lora\_rank} \times 1}

W^{UQ} \in \mathbb{R}^{(\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{q\_lora\_rank}}, q_t^C \in \mathbb{R}^{(\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}

W^{QR} \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times \text{q\_lora\_rank}}, q_t^R \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times 1}

q_{t, i} \in \mathbb{R}^{(\text{qk\_nope\_head\_dim} + \text{qk\_rope\_head\_dim}) \times 1}

W^{DKV} \in \mathbb{R}^{\text{kv\_lora\_rank} \times \text{hidden\_size}}, c_t^{KV} \in \mathbb{R}^{\text{kv\_lora\_rank} \times 1}

W^{UK} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{kv\_lora\_rank}}, k_t^{C} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}

W^{KR} \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times \text{kv\_lora\_rank}}, k_t^R \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times 1}

k_{t, i} \in \mathbb{R}^{(\text{qk\_nope\_head\_dim} + \text{qk\_rope\_head\_dim}) \times 1}

W^{UV} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{kv\_lora\_rank}}, v_t^{C} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}

W_O \in \mathbb{R}^{\text{num\_head} }

Total compute: n_h * ((qk_nope_head_dim + qk_rope_head_dim))

The KV cache holds $c_t^{KV}$ and $k_t^R$; for a sequence of $l$ tokens, the total cache size is $(\text{kv\_lora\_rank} + \text{qk\_rope\_head\_dim}) \cdot l$ elements per layer.

TBD

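For per-layer on-demand load/unload, compare the bandwidth-bound transfer time against the Mac's poor TFLOPS under unified memory: which one becomes the bottleneck?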

batch_size = 2, seq_len = 256

n_layers	TFLOPs
3	1.9059
4	2.5418
5	3.1778
6	3.8137

layers 61: 1.9059 + (61 - 3) * 0.6359333333 = 38.79 TFLOPs
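
The extrapolation above is a straight-line fit through the first and last rows of the table. A minimal sketch follows, with illustrative function names; it assumes the cost grows linearly with the number of layers, as the notes do.

```python
# n_layers -> TFLOPs, copied from the table (batch_size = 2, seq_len = 256).
measured = {3: 1.9059, 4: 2.5418, 5: 3.1778, 6: 3.8137}


def per_layer_tflops(data: dict) -> float:
    """Slope between the smallest and largest measured layer counts."""
    lo, hi = min(data), max(data)
    return (data[hi] - data[lo]) / (hi - lo)


def extrapolate(data: dict, n_layers: int) -> float:
    """Linear extrapolation from the smallest measured layer count."""
    base = min(data)
    return data[base] + (n_layers - base) * per_layer_tflops(data)


slope = per_layer_tflops(measured)
total_61 = extrapolate(measured, 61)

if __name__ == "__main__":
    print(f"per layer: {slope:.10f} TFLOPs, 61 layers: {total_61:.2f} TFLOPs")
```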

With 86 TFLOPS of FP16 compute, a ~1k-token context takes about 1 s to process, so throughput is only about 1 tps: completely unusable!

layers: 6
2 * 512: 8.0623
1 * 1024: 8.9369

layers: 5
1 * 1024: 7.4458
1 * 2048: 17.8143

583.48 M + (187.11 M + (1.84 M + 44.04 M * 9)) * (61 - 3) = 34531.46 M = 33.72 GB
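
The sum above can be re-derived term by term. In this sketch the constant names are interpretations rather than labels from the notes: 1.84 M and 44.04 M match the router and expert sizes given earlier, while the meaning of the other two terms is assumed. The final division by 1024 mirrors the notes' conversion from M to G.

```python
BASE_M = 583.48            # first term in the notes (meaning not stated there)
OTHER_PER_LAYER_M = 187.11 # per-layer term outside the MoE block (presumably attention)
ROUTER_M = 1.84            # matches the router size given earlier
EXPERT_M = 44.04           # matches the per-expert size given earlier
EXPERTS_COUNTED = 9        # as written in the notes
N_LAYERS = 61
N_DENSE_LAYERS = 3

per_layer_m = OTHER_PER_LAYER_M + ROUTER_M + EXPERT_M * EXPERTS_COUNTED
total_m = BASE_M + per_layer_m * (N_LAYERS - N_DENSE_LAYERS)
total_g = total_m / 1024   # same M -> G conversion as in the notes

if __name__ == "__main__":
    print(f"per layer: {per_layer_m:.2f} M, total: {total_m:.2f} M = {total_g:.2f} G")
```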

1.49 * 61 = 90.89 GFLOPs

Per-layer compute requirement (GFLOPs)

MLA: normal: 0.374292, absorb: 0.714439
MLP: 0.792723
MoE: 1 expert: 0.088080
