# Background

## Hardware spec

![[250326-153847.png]]

|          | Memory | Memory bandwidth | FP32 TFLOPS | TF32 TFLOPS   | FP16 TFLOPS | FP8 TFLOPS (FP16 accumulate) | Interconnect bandwidth                                   |
| -------- | ------ | ---------------- | ----------- | ------------- | ----------- | ---------------------------- | -------------------------------------------------------- |
| M3-Ultra | 512 GB | 800 GB/s         | 43          | NA            | 114.688     | NA                           | Thunderbolt 5: 15 GB/s per link (max 6 links * 15 GB/s)  |
| 4090     | 24 GB  | 1.01 TB/s        | 82.6        | 82.6 / 165.2  | 82.58       | 660.6 / 1321.2               |                                                          |
| 5090     | 32 GB  | 1.79 TB/s        | 104.8       | 104.8 / 209.5 | 419 / 838   | 838 / 1676                   |                                                          |
| H20-HGX  | 96 GB  | 4 TB/s           | 44          | 74            | 148         | 296                          |                                                          |
| H100-SXM | 80 GB  | 3.35 TB/s        | 67          | 495 / 989     | 990 / 1979  | 1979 / 3958                  |                                                          |
| A100-SXM | 80 GB  | 2039 GB/s        | 19.5        | 156 / 312     | 312 / 624   | NA                           |                                                          |

## M3-Ultra

- Compute

For the DeepSeek-V3/R1 architecture, prefill costs 0.0763 TFLOPs per token. Assuming M3-Ultra sustains 80% of its peak FP16 compute (taken as 114 TFLOPS), a single M3-Ultra can prefill at most 1195 tokens/s.

At an average seq_len of 2k tokens, decoding one token costs 0.1055 TFLOPs. Assuming M3-Ultra sustains 30% of its peak compute during decode, it delivers at most 324 tokens/s across the batch. With batch_size = 16, each user gets about 20 tps (TPOT = 50 ms), which meets the SLO. Even at seq_len = 8k it still delivers about 10 tps per user, enough for basic use.

| seq_len | TFLOPs/token | Batch TPS (30% utilization) |
| ------- | ------------ | --------------------------- |
| 256     | 0.075        | 455.825                     |
| 512     | 0.079        | 430.848                     |
| 1024    | 0.088        | 388.296                     |
| 2048    | 0.105        | 324.247                     |
| 4096    | 0.140        | 243.813                     |
| 8192    | 0.210        | 162.963                     |
| 16384   | 0.349        | 97.981                      |

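
The compute-bound estimates above reduce to a single division. Here is a minimal sketch, assuming the 114 TFLOPS peak and the utilization figures quoted in this section. The function names are hypothetical, and the TFLOPs/token inputs are the rounded table values, so the results differ slightly from the TPS column.

```python
# Peak FP16 throughput used in this note (TFLOPS) and the assumed decode utilization.
PEAK_TFLOPS = 114.0
DECODE_UTILIZATION = 0.30


def batch_tps(tflops_per_token: float,
              peak_tflops: float = PEAK_TFLOPS,
              utilization: float = DECODE_UTILIZATION) -> float:
    """Compute-bound tokens/s summed over the whole batch."""
    return peak_tflops * utilization / tflops_per_token


def per_user_tps(tflops_per_token: float, batch_size: int) -> float:
    """Tokens/s seen by each request when the batch shares the compute."""
    return batch_tps(tflops_per_token) / batch_size


# Prefill at 80% utilization, then decode at 2k and 8k context with batch_size = 16.
print("prefill tokens/s:", round(batch_tps(0.0763, utilization=0.80), 1))
for seq_len, cost in [(2048, 0.105), (8192, 0.210)]:
    print(seq_len, round(batch_tps(cost), 1), round(per_user_tps(cost, 16), 1))
```

Keeping utilization as a parameter makes it easy to see how sensitive the 20 tps figure is to the 30% assumption.
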
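- Memory access

Each token activates 37 B parameters. With FP8 weights, 37 GB must be loaded per decode step. Assuming 85% of the 800 GB/s bandwidth is achieved, loading the parameters takes 54.41 ms. On top of the 50 ms of compute, this drops per-user TPS from 20 to about 10.

- Communication

In the MLA part, all_reduce moves 0.000814438 GB per token. During decode with bsz = 64, 15 GB/s of bandwidth, and 50% utilization, this takes 6.95 ms; with bsz = 16 it takes 1.74 ms. Accounting for this overhead, a 50 ms decode step becomes 51.74 ms, which gives each user about 19.3 tps.
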
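
The communication estimate can be reproduced with a short sketch. It assumes one 15 GB/s Thunderbolt 5 link at 50% utilization, as in the text; the constant and function names are hypothetical.

```python
ALLREDUCE_GB_PER_TOKEN = 0.000814438  # MLA all_reduce volume per token (from the text)
LINK_GB_PER_S = 15.0                  # one Thunderbolt 5 link (from the hardware table)
LINK_UTILIZATION = 0.5                # assumed achievable fraction of the link


def allreduce_ms(batch_size: int) -> float:
    """Time to move one decode step's all_reduce traffic, in milliseconds."""
    volume_gb = ALLREDUCE_GB_PER_TOKEN * batch_size
    return volume_gb / (LINK_GB_PER_S * LINK_UTILIZATION) * 1000.0


def per_user_tps(step_ms: float) -> float:
    """Each request emits one token per decode step."""
    return 1000.0 / step_ms


for bsz in (64, 16):
    print(bsz, round(allreduce_ms(bsz), 2))
print(round(per_user_tps(50.0 + allreduce_ms(16)), 2))
```
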
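- KV cache load

With seq_len = 4096 and bsz = 16, loading the KV cache takes 6.7 ms; bounded by this memory access alone, each user could reach 149 tps. Starting from TPOT = 50 ms and adding the communication and KV cache load overheads, a step stays within 60 ms, so each user gets about 16.7 tps.

- KV cache storage

Stored in FP16, the KV cache needs 68.62 KB per token. With bsz = 16 and an average input + output of 8k tokens, it needs 8.58 GB in total, which is trivial for an M3-Ultra with up to 512 GB of memory.
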
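
The KV cache figures above can be rebuilt from the MLA cache layout. This is a sketch under stated assumptions: `kv_lora_rank = 512` and `qk_rope_head_dim = 64` are taken from the public DeepSeek-V3 configuration rather than from this note, 61 layers is the count used elsewhere in the note, and the "KB"/"GB" figures are read as KiB/GiB because that reading reproduces 68.62 and 8.58.

```python
KV_LORA_RANK = 512       # assumption: public DeepSeek-V3 config value
QK_ROPE_HEAD_DIM = 64    # assumption: public DeepSeek-V3 config value
NUM_LAYERS = 61
BYTES_PER_ELEMENT = 2    # FP16


def kv_bytes_per_token() -> int:
    """Compressed latent plus RoPE key, cached once per layer."""
    return (KV_LORA_RANK + QK_ROPE_HEAD_DIM) * NUM_LAYERS * BYTES_PER_ELEMENT


def kv_cache_gib(batch_size: int, tokens_per_request: int) -> float:
    """Total KV cache footprint in GiB."""
    return batch_size * tokens_per_request * kv_bytes_per_token() / 2**30


def kv_load_ms(batch_size: int, seq_len: int,
               bandwidth_gb_s: float = 800.0, utilization: float = 0.85) -> float:
    """Time to read the whole cache once per decode step, in milliseconds."""
    total_bytes = batch_size * seq_len * kv_bytes_per_token()
    return total_bytes / (bandwidth_gb_s * 1e9 * utilization) * 1000.0


print(kv_bytes_per_token() / 1024)        # KiB per token
print(kv_cache_gib(16, 8192))             # GiB for 16 requests of 8k tokens
print(round(kv_load_ms(16, 4096), 2))     # ms per decode step at 4k context
```

Under these assumptions the load time comes out slightly above the 6.7 ms quoted above, which is within rounding of the unit convention.
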
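Putting the above together: for an input batch of 16 reqs * 4k context on a single M3-Ultra, generating the next token takes:

- load model params: 37 GB / (800 GB/s * 0.85) = 54.41 ms
- load KV cache: 6.7 ms
- compute: (0.105 / (114 * 0.3)) * 16 = 49.12 ms

This comes to 110.23 ms / token for each request, or ~9 TPS. Note that the compute term uses the seq_len = 2048 cost of 0.105 TFLOPs/token; with the seq_len = 4096 value of 0.140 from the table above it would be 65.50 ms.
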
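
The per-token budget can be written as a small script. It is a sketch that, like the list above, assumes the three phases run back to back with no overlap; the function names and default arguments are illustrative, with the defaults copied from the list.

```python
def param_load_ms(active_gb: float = 37.0, bandwidth_gb_s: float = 800.0,
                  utilization: float = 0.85) -> float:
    """Time to stream the activated FP8 weights once, in milliseconds."""
    return active_gb / (bandwidth_gb_s * utilization) * 1000.0


def compute_ms(tflops_per_token: float = 0.105, batch_size: int = 16,
               peak_tflops: float = 114.0, utilization: float = 0.30) -> float:
    """Compute time for one decode step over the whole batch, in milliseconds."""
    return tflops_per_token * batch_size / (peak_tflops * utilization) * 1000.0


KV_LOAD_MS = 6.7  # KV cache load for 16 requests at 4k context (from the text)


def step_ms(**compute_kwargs) -> float:
    """Serial sum of weight load, KV cache load, and compute."""
    return param_load_ms() + KV_LOAD_MS + compute_ms(**compute_kwargs)


total = step_ms()
print(f"{total:.2f} ms/token, {1000.0 / total:.2f} TPS per request")
```

Passing a different `tflops_per_token` to `step_ms` shows how the budget shifts with context length.
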
## KTransformers

With a single 4090, the reported performance is 286 tokens/s for prefill and 14 tokens/s for decode.

TBD:

- Derive KT's theoretical performance

On-demand quantization, Module injection, Operator placement

Core features:

- A single YAML file matches specified parts of the model and offloads those parameters to the CPU for computation
- An operator injection framework that replaces the operators of specified modules

# References

- [DeepSeek-V3/R1推理效率分析(v0.15)](https://mp.weixin.qq.com/s/-KWSFpjiTggFP4Y_jJa1oA) (DeepSeek-V3/R1 inference efficiency analysis, v0.15)
- [GPU specs database](https://www.techpowerup.com/gpu-specs/)
- [M3 Ultra is a slightly weakened 3090 w/ 512GB](https://www.reddit.com/r/LocalLLaMA/comments/1j4jpij/m3_ultra_is_a_slightly_weakened_3090_w_512gb/)

---

# Backup

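## Model overview

DeepSeek-V3 ([calculation reference](https://zhuanlan.zhihu.com/p/24954705040)).

Each expert is 44,040,192 B and the router is 1,835,264 B. Each layer loads 8 experts: 8 * 44,040,192 + 1,835,264 = 354,156,800 B, so roughly 0.35 GB of parameters is loaded per layer. There are 58 such layers (the first 3 layers are dense). Sending each layer's parameters over PCIe 5.0 x8 (about 32 GB/s) takes about 11 ms.
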
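
The sizes above can be reproduced from the model dimensions. This is a sketch under stated assumptions: the expert intermediate size (2048), the number of routed experts (256), a router with one weight matrix plus a per-expert bias, one byte per FP8 parameter, and a nominal 32 GB/s for PCIe 5.0 x8 are not stated in this note; only the hidden size 7168 and the resulting byte counts appear in it.

```python
HIDDEN_SIZE = 7168
MOE_INTERMEDIATE_SIZE = 2048   # assumption: public DeepSeek-V3 config value
NUM_ROUTED_EXPERTS = 256       # assumption: public DeepSeek-V3 config value
EXPERTS_PER_TOKEN = 8
PCIE5_X8_GB_PER_S = 32.0       # assumption: nominal one-direction bandwidth

# A gated MLP expert has three projections (gate, up, down).
expert_params = 3 * HIDDEN_SIZE * MOE_INTERMEDIATE_SIZE
# Router: one score per expert from the hidden state, plus a per-expert bias.
router_params = HIDDEN_SIZE * NUM_ROUTED_EXPERTS + NUM_ROUTED_EXPERTS

# With FP8 weights, one parameter is one byte.
layer_bytes = EXPERTS_PER_TOKEN * expert_params + router_params
load_ms = layer_bytes / (PCIE5_X8_GB_PER_S * 1e9) * 1000.0

print(expert_params, router_params, layer_bytes)
print(round(load_ms, 2), "ms per layer")
```
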
Compute requirement:

$$8 \cdot 7168^2 + 2 \cdot 14336 \cdot L^2 + 473432064 \cdot L$$

With 114 TFLOPS of FP16 compute, decode at full compute utilization can handle ~4k tokens in a batch.

## Per-layer compute requirements

### MLA

![[250326-153847-1.png]]

$$
\begin{aligned}
o_{t, i} &= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{q_{t, i}^T k_{j, i}}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[q_{t, i}^C; q_{t, i}^R]^T [k_{j, i}^C; k_j^R]}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}c_j^{KV}; k_j^R]}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W_i^{UV}\textcolor{blue}{c_j^{KV}}) \\
\end{aligned}
$$

$$
\begin{aligned}
u_{t, i} &= W^O_i \cdot o_{t, i} \\
&= W^O_i \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W_i^{UV}\textcolor{blue}{c_j^{KV}}) \\
&= \sum_{j = 1}^t \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W^O_i W_i^{UV})\textcolor{blue}{c_j^{KV}} \\
\end{aligned}
$$

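
The last step of the derivation, folding $W^O_i W_i^{UV}$ into one matrix so that attention mixes the cached latents $c_j^{KV}$ directly, relies only on linearity. The toy check below illustrates it in pure Python. All sizes and values are made up for illustration, and the integer weights are a stand-in for the softmax scores so that the comparison is exact.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def weighted_sum(weights, vectors):
    """Return sum_j weights[j] * vectors[j]."""
    return [sum(w * vec[k] for w, vec in zip(weights, vectors))
            for k in range(len(vectors[0]))]


# Toy sizes: latent dim 3, head dim 2, hidden dim 4.
W_UV = [[1, 0, 2], [0, 3, 1]]                 # head_dim x latent_dim
W_O = [[1, 2], [0, 1], [2, 0], [1, 1]]        # hidden_dim x head_dim
latents = [[1, 2, 3], [0, 1, 0], [2, 2, 1]]   # cached c_j^{KV}, one per position
attn = [1, 2, 3]                              # stand-in for the softmax weights

# Unabsorbed: up-project every cached latent to v_j, mix, then apply W_O.
values = [matvec(W_UV, c) for c in latents]
unabsorbed = matvec(W_O, weighted_sum(attn, values))

# Absorbed: precompute W_O W_UV once and mix the latents directly.
W_absorbed = matmul(W_O, W_UV)                # hidden_dim x latent_dim
absorbed = matvec(W_absorbed, weighted_sum(attn, latents))

print(unabsorbed, absorbed)
```

Both paths give the same vector; they differ in how many multiply-accumulates each one needs, which is what the "normal" versus "absorb" cost comparison measures.
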
$h_t \in \mathbb{R}^{\text{hidden\_size} \times 1}$

$W^{DQ} \in \mathbb{R}^{\text{q\_lora\_rank} \times \text{hidden\_size}}$, $c_t^Q \in \mathbb{R}^{\text{q\_lora\_rank} \times 1}$
$W^{UQ} \in \mathbb{R}^{(\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{q\_lora\_rank}}$
$q_t^C \in \mathbb{R}^{(\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}$
$W^{QR} \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times \text{q\_lora\_rank}}$, $q_t^R \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times 1}$
$q_{t, i} \in \mathbb{R}^{(\text{qk\_nope\_head\_dim} + \text{qk\_rope\_head\_dim}) \times 1}$

$W^{DKV} \in \mathbb{R}^{\text{kv\_lora\_rank} \times \text{hidden\_size}}$, $c_t^{KV} \in \mathbb{R}^{\text{kv\_lora\_rank} \times 1}$
$W^{UK} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{kv\_lora\_rank}}$
$k_t^{C} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}$
$W^{KR} \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times \text{kv\_lora\_rank}}$, $k_t^R \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times 1}$
$k_{t, i} \in \mathbb{R}^{(\text{qk\_nope\_head\_dim} + \text{qk\_rope\_head\_dim}) \times 1}$

$W^{UV} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{kv\_lora\_rank}}$
$v_t^{C} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}$

$W^O \in \mathbb{R}^{\text{num\_head} }$

Total compute: `n_h * (qk_nope_head_dim + qk_rope_head_dim)`

The KV cache stores $c_t^{KV}$ and $k_t^R$. For a sequence of $l$ tokens, the total cache size is $(\text{kv\_lora\_rank} + \text{qk\_rope\_head\_dim}) \cdot l$ elements per layer.

## TBD

- [ ] When weights are loaded/unloaded on demand for each layer, which becomes the bottleneck: the bandwidth time of the load/unload, or the Mac's poor TFLOPS under unified memory?

batch_size = 2, seq_len = 256

| n_layers | TFLOPs |
| -------- | ------ |
| 3        | 1.9059 |
| 4        | 2.5418 |
| 5        | 3.1778 |
| 6        | 3.8137 |

Extrapolating to 61 layers, where 0.6359333333 = (3.8137 - 1.9059) / 3 is the average cost of each additional layer: 1.9059 + (61 - 3) * 0.6359333333 = 38.79 TFLOPs

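
The extrapolation is a straight line through the first and last measurements. A minimal sketch follows; the helper names are hypothetical, and it assumes every layer beyond the third costs the same as the measured increment.

```python
# n_layers -> measured TFLOPs for batch_size = 2, seq_len = 256 (from the table).
measured = {3: 1.9059, 4: 2.5418, 5: 3.1778, 6: 3.8137}


def per_layer_increment(points: dict) -> float:
    """Average cost of one extra layer between the smallest and largest runs."""
    lo, hi = min(points), max(points)
    return (points[hi] - points[lo]) / (hi - lo)


def extrapolate(points: dict, n_layers: int) -> float:
    """Linear extrapolation from the smallest measured run."""
    lo = min(points)
    return points[lo] + (n_layers - lo) * per_layer_increment(points)


total = extrapolate(measured, 61)
tokens = 2 * 256
print(round(total, 2), "TFLOPs for the batch")
print(round(total / tokens, 4), "TFLOPs per token")
```

The per-token figure lands close to the 0.075 TFLOPs/token listed for seq_len = 256 in the M3-Ultra table.
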
86 TFLOPS of FP16 compute means that processing a context of ~1k tokens takes about 1 s, so only 1 tps is reachable, which is completely unusable!

layers: 6

- 2 * 512: 8.0623
- 1 * 1024: 8.9369

layers: 5

- 1 * 1024: 7.4458
- 1 * 2048: 17.8143

583.48 M + (187.11 M + (1.84 M + 44.04 M * 9)) * (61 - 3) = 34531.46 M = 33.72 GB

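
The sum above can be checked with a few lines. The note does not label its terms, so the parameter names below are hypothetical: 1.84 M and 44.04 M match the router and expert sizes from the model overview, while reading the factor 9 as 8 routed experts plus 1 shared expert is an assumption. The final conversion divides by 1024, which is what reproduces 33.72.

```python
def total_m(base_m: float = 583.48, per_layer_other_m: float = 187.11,
            router_m: float = 1.84, expert_m: float = 44.04,
            experts_counted: int = 9, n_layers: int = 61,
            n_dense_layers: int = 3) -> float:
    """Evaluate the sum from the line above, in millions (M)."""
    per_moe_layer = per_layer_other_m + router_m + expert_m * experts_counted
    return base_m + per_moe_layer * (n_layers - n_dense_layers)


total = total_m()
print(round(total, 2), "M")
print(round(total / 1024, 2), "GB (using 1024 M per G)")
```
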
1.49 * 61 = 90.89 GFLOPs

Per-layer compute requirement (GFLOPs):

- MLA: normal: 0.374292, absorb: 0.714439
- MLP: 0.792723
- MoE: 1 expert: 0.088080
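
The MLP and single-expert figures are consistent with counting 2 FLOPs per multiply-accumulate over three projection matrices for one token. This sketch assumes intermediate sizes of 18432 (dense MLP) and 2048 (MoE expert) from the public DeepSeek-V3 configuration; neither value is stated in this note, and the function name is hypothetical.

```python
HIDDEN_SIZE = 7168
DENSE_INTERMEDIATE_SIZE = 18432  # assumption: public DeepSeek-V3 config value
MOE_INTERMEDIATE_SIZE = 2048     # assumption: public DeepSeek-V3 config value


def gated_mlp_gflops(hidden: int, intermediate: int) -> float:
    """Per-token cost of gate, up, and down projections at 2 FLOPs per MAC."""
    return 2 * 3 * hidden * intermediate / 1e9


mlp = gated_mlp_gflops(HIDDEN_SIZE, DENSE_INTERMEDIATE_SIZE)
expert = gated_mlp_gflops(HIDDEN_SIZE, MOE_INTERMEDIATE_SIZE)
print(f"MLP: {mlp:.6f} GFLOPs, one expert: {expert:.6f} GFLOPs")
```
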