# Background
## Hardware spec
![[250326-153847.png]]
| Device | Memory | Bandwidth | FP32 TFLOPS | TF32 TFLOPS | FP16 TFLOPS | FP8 TFLOPS (FP16 accumulate) | Interconnect bandwidth |
| -------- | ------ | --------- | ----------- | ------------- | ----------- | ---------------------------- | ---------------------------------------- |
| M3-Ultra | 512 GB | 800 GB/s | 43 | NA | 114.688 | NA | Thunderbolt 5: 15 GB/s per link (max: 6 links × 15) |
| 4090 | 24 GB | 1.01 TB/s | 82.6 | 82.6 / 165.2 | 82.58 | 660.6 / 1321.2 | |
| 5090 | 32 GB | 1.79 TB/s | 104.8 | 104.8 / 209.5 | 419 / 838 | 838 / 1676 | |
| H20-HGX | 96 GB | 4 TB/s | 44 | 74 | 148 | 296 | |
| H100-SXM | 80 GB | 3.35 TB/s | 67 | 495 / 989 | 990 / 1979 | 1979 / 3958 | |
| A100-SXM | 80 GB | 2.04 TB/s | 19.5 | 156 / 312 | 312 / 624 | NA | |

Slashed values are dense / with structured sparsity.
## M3-Ultra
- Compute
For the DeepSeek-V3/R1 architecture, prefill needs 0.0763 TFLOPs per token. Assuming the M3-Ultra can sustain 80% of its peak compute, a single M3-Ultra tops out at 1195 tokens/s of prefill.
At an average seq_len of 2k tokens, one decode step needs 0.1055 TFLOPs. Assuming the M3-Ultra sustains 30% of its compute here, it can do at most 324 tokens/s; with batch_size = 16 that is 20 tps per user (TPOT = 50 ms), which meets the SLO. Even at seq_len = 8k it still delivers 10 tps, enough for basic use.
| seq_len | TFLOPs/token | TPS for a batch (30% utilization) |
| ------- | ------------ | --------------------------------- |
| 256 | 0.075 | 455.825 |
| 512 | 0.079 | 430.848 |
| 1024 | 0.088 | 388.296 |
| 2048 | 0.105 | 324.247 |
| 4096 | 0.140 | 243.813 |
| 8192 | 0.210 | 162.963 |
| 16384 | 0.349 | 97.981 |
- Memory access
Each decode step activates 37 B parameters; with FP8 weights that means loading 37 GB. Assuming 85% of the 800 GB/s bandwidth is achievable, parameter loading takes 54.41 ms, which alone drops TPS from 20 to about 10.
- Communication
The MLA part needs 0.000814438 GB of all_reduce traffic per token. In the decode phase with bsz = 64, a 15 GB/s link at 50% utilization takes 6.95 ms; with bsz = 16 it takes 1.74 ms. Adding that to the 50 ms decode step gives 51.74 ms, i.e. 19.32 tps per user.
- KV cache load
At seq_len = 4096 and bsz = 16, loading the KV cache costs 6.7 ms; on memory access alone that would allow 149 tps per user. Starting from TPOT = 50 ms, adding the communication and KV-cache-load overheads keeps the total under 60 ms, i.e. 16.7 tps per user.
- KV cache storage
Stored in FP16, each token needs 68.62 KB. With bsz = 16 and an average of 8k input+output tokens per request, the total is 8.58 GB, barely a dent in the M3-Ultra's 512 GB. (The sketch below reproduces the arithmetic in this list.)
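A minimal sketch reproducing the numbers above. Every constant comes from this note (114 TFLOPS assumed FP16 peak, 30% decode compute utilization, 85% bandwidth utilization, 37 GB of FP8 weights per step), none is measured:

```python
# Back-of-envelope decode model for a single M3-Ultra; all constants are
# this note's assumptions, not measurements.

PEAK_TFLOPS = 114      # assumed usable FP16 peak
COMPUTE_UTIL = 0.30    # assumed decode-time compute utilization
BW_GBPS = 800          # memory bandwidth
BW_UTIL = 0.85         # assumed achievable fraction of bandwidth
PARAMS_GB = 37         # activated parameters per token, FP8

def batch_tps(tflops_per_token: float) -> float:
    """Compute-bound decode throughput for the whole batch."""
    return PEAK_TFLOPS * COMPUTE_UTIL / tflops_per_token

def weight_load_ms() -> float:
    """Time to stream all activated FP8 weights once per decode step."""
    return PARAMS_GB / (BW_GBPS * BW_UTIL) * 1000

def allreduce_ms(bsz: int, gb_per_token: float = 0.000814438,
                 link_gbps: float = 15, util: float = 0.5) -> float:
    """MLA all_reduce time over a Thunderbolt 5 class link."""
    return bsz * gb_per_token / (link_gbps * util) * 1000

def kv_cache_gb(bsz: int, avg_tokens: int, kb_per_token: float = 68.62) -> float:
    """FP16 KV-cache footprint for a batch."""
    return bsz * avg_tokens * kb_per_token / 1024**2

print(batch_tps(0.1055))      # ~324 tokens/s at seq_len = 2048
print(weight_load_ms())       # ~54.41 ms
print(allreduce_ms(16))       # ~1.74 ms
print(kv_cache_gb(16, 8192))  # ~8.58 GB
```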
Putting the above together, assume a batch of 16 reqs × 4k context served on a single M3-Ultra; the time to emit the next token is:
- load model params: 37 GB / (800 GB/s × 0.85) = 54.41 ms
- load KV cache: 6.7 ms
- compute: (0.105 / (114 × 0.3)) × 16 = 49.12 ms
which totals 110.23 ms/token per request, i.e. ~9 TPS.
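The same three components combined into a TPOT estimate, again a sketch under this note's assumptions (0.105 TFLOPs/token, 30% compute utilization, 85% bandwidth utilization):

```python
# Per-step latency for bsz = 16 on one M3-Ultra, using this note's constants.
load_params_ms = 37 / (800 * 0.85) * 1000        # ~54.41 ms
load_kv_ms = 6.7                                 # from the KV-load bullet above
compute_ms = 0.105 / (114 * 0.30) * 16 * 1000    # ~49.12 ms for the whole batch

tpot_ms = load_params_ms + load_kv_ms + compute_ms
print(tpot_ms, 1000 / tpot_ms)                   # ~110.23 ms -> ~9 TPS per request
```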
## KTransformers
Assuming a single 4090, the reported performance is 286 tokens/s prefill and 14 tokens/s decode.
TBD:
- derive KT's performance theoretically
On-demand quantization, module injection, operator placement.
Core features:
- a YAML rule matches a specified part of the model and offloads those parameters to the CPU for computation
- an operator-injection framework swaps in replacement operators for the matched modules (see the sketch after this list)
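This is not KTransformers' actual YAML schema or API; below is only a generic PyTorch sketch of the injection idea, where the regex pattern, `CPUOffloadLinear`, and `inject` are all hypothetical names, showing how matching module paths by rule and replacing the operator in place can work:

```python
import re
import torch
import torch.nn as nn

class CPUOffloadLinear(nn.Module):
    """Hypothetical replacement operator: keeps weights on CPU and runs there."""
    def __init__(self, orig: nn.Linear):
        super().__init__()
        self.inner = orig.to("cpu")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.inner(x.cpu()).to(x.device)

def inject(model: nn.Module, pattern: str) -> None:
    """Replace every nn.Linear whose dotted module path matches `pattern`."""
    for name, module in list(model.named_modules()):
        if isinstance(module, nn.Linear) and re.fullmatch(pattern, name):
            parent_name, _, child = name.rpartition(".")
            parent = model.get_submodule(parent_name)  # "" -> model itself
            setattr(parent, child, CPUOffloadLinear(module))

# e.g. offload every expert projection to the CPU (hypothetical path pattern):
# inject(model, r"layers\.\d+\.mlp\.experts\.\d+\.(gate|up|down)_proj")
```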
# References
- [DeepSeek-V3/R1 Inference Efficiency Analysis (v0.15)](https://mp.weixin.qq.com/s/-KWSFpjiTggFP4Y_jJa1oA)
- [GPU specs database](https://www.techpowerup.com/gpu-specs/)
- [M3 Ultra is a slightly weakened 3090 w/ 512GB](https://www.reddit.com/r/LocalLLaMA/comments/1j4jpij/m3_ultra_is_a_slightly_weakened_3090_w_512gb/)
---
# Backup
## Model details
DeepSeek-V3; [calculation reference](https://zhuanlan.zhihu.com/p/24954705040).
Each expert is 44,040,192 B and the router 1,835,264 B (FP8, 1 byte per parameter). Each layer loads 8 experts: 8 × 44,040,192 + 1,835,264 = 354,156,800 B, i.e. about 0.35 GB of parameters per layer. There are 58 such MoE layers (the first 3 of the 61 layers are dense). Streaming one layer's parameters over PCIe 5.0 x8 (~32 GB/s) takes about 11 ms.
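A quick sketch of the per-layer offload arithmetic, assuming PCIe 5.0 x8 delivers ~32 GB/s effective:

```python
# Bytes of expert + router weights streamed per MoE layer (FP8, 1 byte/param).
expert_bytes = 44_040_192
router_bytes = 1_835_264
layer_bytes = 8 * expert_bytes + router_bytes    # 354,156,800 B ~= 0.35 GB

PCIE5_X8_GBPS = 32                               # assumed effective bandwidth
print(layer_bytes / 1e9 / PCIE5_X8_GBPS * 1000)  # ~11 ms per layer
```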
Compute requirement:
$$8 \cdot 7168^2 + 2 \cdot 14336 \cdot L^2 + 473432064 \cdot L$$
At 114 TFLOPS of FP16 compute, decode can saturate the hardware with a batch of ~4k tokens.
## Per-layer compute requirements
### MLA
![[250326-153847-1.png]]
$$
\begin{aligned}
o_{t, i} &= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{q_{t, i}^T k_{j, i}}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[q_{t, i}^C; q_{t, i}^R]^T [k_{j, i}^C; k_j^R]}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}c_j^{KV}; k_j^R]}{\sqrt{d_h + d_h^R}})v_{j, i}^C \\
&= \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W_i^{UV}\textcolor{blue}{c_j^{KV}}) \\
\end{aligned}
$$
$$
\begin{aligned}
u_{t, i} &= W^O_i \cdot o_{t, i} \\
&= W^O_i \sum_{j = 1}^{t} \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W_i^{UV}\textcolor{blue}{c_j^{KV}}) \\
&= \sum_{j = 1}^t \mathrm{Softmax}_j(\frac{[W_i^{UQ}c_t^{Q}; q_{t, i}^R]^T [W_i^{UK}\textcolor{blue}{c_j^{KV}}; \textcolor{blue}{k_j^R}]}{\sqrt{d_h + d_h^R}})(W^O_i W_i^{UV})\textcolor{blue}{c_j^{KV}} \\
\end{aligned}
$$
$h_t \in \mathbb{R}^{\text{hidden\_size} \times 1}$
$W^{DQ} \in \mathbb{R}^{\text{q\_lora\_rank} \times \text{hidden\_size}}$, $c_t^Q \in \mathbb{R}^{\text{q\_lora\_rank} \times 1}$
$W^{UQ} \in \mathbb{R}^{(\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{q\_lora\_rank}}$
$q_t^C \in \mathbb{R}^{(\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}$
$W^{QR} \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times \text{q\_lora\_rank}}$, $q_t^R \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times 1}$
$q_{t, i} \in \mathbb{R}^{(\text{qk\_nope\_head\_dim} + \text{qk\_rope\_head\_dim}) \times 1}$
$W^{DKV} \in \mathbb{R}^{\text{kv\_lora\_rank} \times \text{hidden\_size}}$, $c_t^{KV} \in \mathbb{R}^{\text{kv\_lora\_rank} \times 1}$
$W^{UK} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{kv\_lora\_rank}}$
$k_t^{C} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}$
$W^{KR} \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times \text{kv\_lora\_rank}}$, $k_t^R \in \mathbb{R}^{\text{qk\_rope\_head\_dim} \times 1}$
$k_{t, i} \in \mathbb{R}^{(\text{qk\_nope\_head\_dim} + \text{qk\_rope\_head\_dim}) \times 1}$
$W^{UV} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times \text{kv\_lora\_rank}}$
$v_t^{C} \in \mathbb{R}^{(\text{num\_key\_value\_heads} \cdot \text{qk\_nope\_head\_dim}) \times 1}$
$W^O \in \mathbb{R}^{\text{hidden\_size} \times (\text{num\_attention\_heads} \cdot \text{qk\_nope\_head\_dim})}$
Total compute: `n_h * (qk_nope_head_dim + qk_rope_head_dim)`
The KV cache stores $c_t^{KV}$ and $k_t^R$; for $l$ tokens the total cache is $(\text{kv\_lora\_rank} + \text{qk\_rope\_head\_dim}) \cdot l$ elements per layer.
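Plugging in DeepSeek-V3's values (kv_lora_rank = 512, qk_rope_head_dim = 64, 61 layers, 2 B per element in FP16), this matches the 68.62 KB/token figure used earlier:

$$(512 + 64) \cdot 61 \cdot 2 \text{ B} = 70272 \text{ B} \approx 68.62 \text{ KB per token}$$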
## TBD
- [ ] Which becomes the bottleneck: the bandwidth cost of on-demand load/unload at each layer, or the Mac's poor TFLOPS under unified memory?
With batch_size = 2, seq_len = 256:
| n_layers | TFLOPs |
| -------- | ------ |
| 3 | 1.9059 |
| 4 | 2.5418 |
| 5 | 3.1778 |
| 6 | 3.8137 |
Extrapolating to 61 layers: 1.9059 + (61 - 3) × 0.6359 = 38.79 TFLOPs.
At 86 TFLOPS FP16, this means processing a ~1k-token context takes about a second, i.e. only ~1 tps: completely unusable.
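The 61-layer figure is a straight linear extrapolation from the table; a sketch, using the per-layer increment implied by the first two rows:

```python
# Linear fit: each extra layer adds a constant amount of prefill FLOPs.
tflops_3_layers = 1.9059
per_layer = 2.5418 - 1.9059                    # ~0.6359 TFLOPs per layer
print(tflops_3_layers + (61 - 3) * per_layer)  # ~38.79 TFLOPs
```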
| n_layers | bsz × seq_len | TFLOPs |
| -------- | ------------- | ------- |
| 6 | 2 × 512 | 8.0623 |
| 6 | 1 × 1024 | 8.9369 |
| 5 | 1 × 1024 | 7.4458 |
| 5 | 1 × 2048 | 17.8143 |
Parameters activated per token: 583.48 M + (187.11 M + (1.84 M + 44.04 M × 9)) × (61 - 3) = 34,531.46 M ≈ 33.72 GB in FP8.
Per-token compute: 1.49 GFLOPs/layer × 61 layers = 90.89 GFLOPs.
Per-layer compute requirements (GFLOPs):
- MLA: normal: 0.374292, absorb: 0.714439
- MLP: 0.792723
- MoE: 1 expert: 0.088080
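As a rough cross-check against the 1.49 GFLOPs/layer figure above, assuming an MoE layer costs absorbed MLA plus 9 activated experts (8 routed + 1 shared):

$$0.714439 + 9 \cdot 0.088080 \approx 1.507 \text{ GFLOPs per layer}$$

which is in the same ballpark.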