Theoretical Analysis

prefill

For attention

For a transformer attention layer over a sequence of length S (multi-head, with d = H * head_dim):

  1. Q/K/V projection (all 3 linear transforms in one shot):

    $\text{FLOPs}_{\text{QKV\_proj}} \approx 2 \times S \times d \times (3d) = 6Sd^2$

    using the common matmul approximation: an (m×k)·(k×n) product costs about 2·m·n·k FLOPs.

  2. Attention score matmul (Q·K^T):

    $\text{FLOPs}_{QK} \approx 2 \times H \times S \times d_{\text{head}} \times S = 2S^2d$

    since $H \cdot d_{\text{head}} = d$.

  3. Attention·V (multiplying the attention weights with V):

    $\text{FLOPs}_{AV} \approx 2S^2d$

  4. Output projection (concatenate the heads back to d, then one linear transform):

    $\text{FLOPs}_{\text{out}} \approx 2Sd^2$

Total FLOPs: $8Sd^2 + 4S^2d$
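
As a quick sanity check, a minimal Python sketch of the count above (the function name is mine):

```python
def attention_prefill_flops(S: int, d: int) -> int:
    """Approximate attention FLOPs for one layer during prefill."""
    qkv_proj = 6 * S * d**2   # Q/K/V projections: 2 * S * d * 3d
    scores = 2 * S**2 * d     # Q @ K^T across all heads
    av = 2 * S**2 * d         # attention weights @ V
    out_proj = 2 * S * d**2   # output projection
    return qkv_proj + scores + av + out_proj  # == 8*S*d**2 + 4*S**2*d
```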


$$T_{\text{comp}} = \frac{\text{FLOPs}_{\text{per\_GPU}}}{\text{peak\_flops\_per\_GPU} \times \text{compute\_utils}}$$

Total memory: $\text{bytes}_\text{prefill} \approx N \cdot \alpha \cdot B \cdot L \cdot d \cdot \text{elem\_bytes}$, with $\alpha \sim 6$ (here N is the layer count, B the batch size, L the sequence length).


$$T_{\text{mem}} = \frac{\text{bytes}_{\text{per\_GPU}}}{\text{bandwidth\_per\_GPU} \times \text{mem\_utils}}$$
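
A sketch combining the byte and time estimates; `elem_bytes=2` (bf16) and the utilization defaults are illustrative assumptions, not values from the note:

```python
def prefill_bytes(N: int, B: int, L: int, d: int,
                  elem_bytes: int = 2, alpha: float = 6.0) -> float:
    # bytes_prefill ~= N * alpha * B * L * d * elem_bytes, alpha ~ 6 per the note.
    # elem_bytes=2 assumes bf16/fp16 tensors.
    return N * alpha * B * L * d * elem_bytes

def roofline_times(flops_per_gpu: float, bytes_per_gpu: float,
                   peak_flops_per_gpu: float, bandwidth_per_gpu: float,
                   compute_utils: float = 0.5, mem_utils: float = 0.8) -> tuple:
    # T_comp and T_mem from the formulas above; the *_utils defaults
    # are placeholder efficiency factors, not measured values.
    t_comp = flops_per_gpu / (peak_flops_per_gpu * compute_utils)
    t_mem = bytes_per_gpu / (bandwidth_per_gpu * mem_utils)
    return t_comp, t_mem
```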

decode

Total FLOPs (per token): $8d^2 + 4dL$

Total memory: $\text{bytes}_\text{decode} \approx N \cdot \beta \cdot B \cdot L \cdot d \cdot \text{elem\_bytes}$, with $\beta \sim 4$
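
The decode-side analog, mirroring the two formulas above (`elem_bytes=2` is again an assumption):

```python
def decode_flops_per_token(d: int, L: int) -> int:
    # Per layer, per generated token: 8d^2 for the projections
    # plus 4dL for attending over a length-L KV cache.
    return 8 * d**2 + 4 * d * L

def decode_bytes(N: int, B: int, L: int, d: int,
                 elem_bytes: int = 2, beta: float = 4.0) -> float:
    # bytes_decode ~= N * beta * B * L * d * elem_bytes, beta ~ 4 per the note.
    return N * beta * B * L * d * elem_bytes
```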

Each MoE expert's FFN computes:

$$\text{output} = \text{SiLU}(xW_1)W_2$$
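
A minimal numpy sketch of this expert FFN (shapes are illustrative):

```python
import numpy as np

def silu(x: np.ndarray) -> np.ndarray:
    # SiLU(x) = x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def expert_ffn(x: np.ndarray, W1: np.ndarray, W2: np.ndarray) -> np.ndarray:
    # output = SiLU(x @ W1) @ W2
    # x: (tokens, hidden_size), W1: (hidden_size, moe_intermediate_size),
    # W2: (moe_intermediate_size, hidden_size)
    return silu(x @ W1) @ W2
```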

Under TP, the communication for an expert E activated by a token T is (see the sketch after this list):

  1. AllGather the input x to all TP ranks; communication volume: hidden_size * (TP - 1)
  2. Each TP rank independently computes its shard $xW_1'$
  3. After an AllGather, every rank holds the full $xW_1$; communication volume: moe_intermediate_size / TP * (TP - 1)
  4. Each rank computes the SiLU and then $IW_2'$ (with $I = \text{SiLU}(xW_1)$); an AllReduce then gives every rank the full output; communication volume: hidden_size * (TP - 1)
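
The per-token, per-expert TP communication volume implied by these steps (element counts; multiply by elem_bytes for bytes):

```python
def tp_comm_elems(hidden_size: int, moe_intermediate_size: int, tp: int) -> int:
    gather_x = hidden_size * (tp - 1)                   # step 1: AllGather x
    gather_h = moe_intermediate_size // tp * (tp - 1)   # step 3: AllGather xW1 shards
    reduce_out = hidden_size * (tp - 1)                 # step 4: AllReduce the output
    return gather_x + gather_h + reduce_out
```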

Under EP, dispatch + combine communication per token is:

2 * hidden_size * (EP - 1) / EP (assuming balanced routing)
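
The EP counterpart; dividing by an interconnect bandwidth gives a comm-time estimate (the bandwidth default is a placeholder, not a value from the note):

```python
def ep_comm_elems(hidden_size: int, ep: int) -> float:
    # dispatch + combine, assuming perfectly balanced routing
    return 2 * hidden_size * (ep - 1) / ep

def ep_comm_time(hidden_size: int, ep: int,
                 elem_bytes: int = 2, bandwidth_bytes_per_s: float = 50e9) -> float:
    # bandwidth default (~50 GB/s) is an illustrative assumption
    return ep_comm_elems(hidden_size, ep) * elem_bytes / bandwidth_bytes_per_s
```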


With batch_size=2000, seq_len=2048:

| Model | EP | attention comp time | MoE combine comm time | MoE comp time | MoE comm time with TP |
| --- | --- | --- | --- | --- | --- |
| Qwen-235B | 8 | 0.06944874306412531 | 0.00028672 | 0.0004069259060131379 | 0.00045056 |
| Qwen-30B | 8 | 0.01736218682573421 | 0.00014336 | 0.00010173147650328448 | 0.00022528 |
| Qwen-235B | 64 | 0.06944874306412531 | 0.00032256 | 0.0004069259060131379 | 0.00045056 |