Gahow Wang dae98c6472 Working-set sizing tool + GLM-5.1-FP8/B300 result

Configurable KV working-set analyzer (GPU model x TP/PP/EP x model
config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T),
oracle [first,last], and retain-forever footprints vs a per-replica KV
pool, plus the APC captured at each retention window.

GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool):
live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs
~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-28 16:03:25 +08:00

README.md


KV-cache Working-Set Sizing — GLM-5.1-FP8 · TP=8 · 1× B300 node

Tool: scripts/working_set_analysis.py (configurable GPU model / parallelism TP·PP·EP / model config.json / KV dtype / weight size). Figure: figs/working_set/glm5_fp8_tp8_b300.png.

Reproduce

.venv/bin/python scripts/working_set_analysis.py \
  /home/gahow/phd/kvcache-simulator/bailian-traces/glm_coder_blksz_512_040915-040917.jsonl \
  --model-config /home/gahow/phd/kvcache-simulator/models/GLM-5/config.json \
  --gpu B300 --tp 8 --ep 8 --kv-dtype-bytes 1 --weight-gb 744 --min-ts 0 \
  --out figs/working_set/glm5_fp8_tp8_b300.png

Method

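hash_ids are global content-addressed block ids (identical content gets the same id, so reuse shows up as the same id reappearing). vLLM's prefix cache works at block granularity, so the cluster-level KV footprint is the number of distinct blocks that must be resident at any given moment, independent of placement (affinity only moves blocks around; it does not change the total). Three working sets are computed:

W_all: never evict (the true upper bound)
W_oracle: each block is resident only over [first access, last reuse] (Belady-style perfect foresight, giving the minimum HBM that still reaches the full APC ceiling)
W_denning(T): distinct blocks accessed within a sliding window T (a realistic TTL-LRU)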
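
The three definitions above can be sketched on a toy trace. This is a minimal illustration, not the code in scripts/working_set_analysis.py: the function name `peak_footprints`, the in-memory trace layout (a time-sorted list of `(timestamp, block_ids)` pairs), and the choice to sample the Denning window at each access time are all assumptions made for the example.

```python
def peak_footprints(trace, window):
    """Return peak distinct-block counts (w_all, w_oracle, w_denning).

    trace  : list of (timestamp_seconds, [block_id, ...]), sorted by time
    window : Denning retention window T in seconds
    """
    first, last = {}, {}
    for ts, blocks in trace:
        for b in blocks:
            first.setdefault(b, ts)  # first access
            last[b] = ts             # last access seen so far

    # W_all: nothing is ever evicted, so the peak is every distinct block.
    w_all = len(first)

    # W_oracle: block b is resident on [first[b], last[b]].
    # Sweep the interval endpoints; opens sort before closes at equal times.
    events = []
    for b in first:
        events.append((first[b], 0, +1))
        events.append((last[b], 1, -1))
    live = w_oracle = 0
    for _, _, delta in sorted(events):
        live += delta
        w_oracle = max(w_oracle, live)

    # W_denning(T): blocks whose latest access is younger than T,
    # sampled at every access time (a linear scan, fine for a sketch).
    last_seen = {}
    w_denning = 0
    for ts, blocks in trace:
        for b in blocks:
            last_seen[b] = ts
        for b in [b for b, t in last_seen.items() if ts - t >= window]:
            del last_seen[b]
        w_denning = max(w_denning, len(last_seen))

    return w_all, w_oracle, w_denning


toy_trace = [
    (0.0, [1, 2]),
    (1.0, [1, 3]),
    (5.0, [4]),
    (20.0, [2, 5]),
]
print(peak_footprints(toy_trace, window=10.0))
print(peak_footprints(toy_trace, window=2.0))
```

On this toy trace the 10-second Denning window peaks above the oracle footprint, because a TTL keeps blocks that will never be reused; the same effect appears in the results table, where the 300s window needs more capacity than the oracle yet captures less APC.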

KV/token: MLA → L·(kv_lora_rank+qk_rope_head_dim)·dtype; GQA → 2·L·kv_heads·head_dim·dtype (consistent with kvcache-simulator/src/config.rs::kv_block_bytes).
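
As a worked check of the KV/token formula, here is a small sketch. `kv_lora_rank` and `qk_rope_head_dim` come from the text; the other dictionary keys (`num_hidden_layers`, `num_key_value_heads`, `head_dim`) follow the usual Hugging Face config.json naming and are an assumption about what the tool reads, as is detecting MLA by the presence of `kv_lora_rank`. The GQA configuration is hypothetical, chosen only so the result lands on the 96 KiB figure quoted for Qwen3.

```python
def kv_bytes_per_token(cfg, dtype_bytes):
    """KV-cache bytes per token for one replica, summed over all layers."""
    layers = cfg["num_hidden_layers"]
    if "kv_lora_rank" in cfg:
        # MLA: one compressed latent plus the RoPE key part per layer.
        return layers * (cfg["kv_lora_rank"] + cfg["qk_rope_head_dim"]) * dtype_bytes
    # GQA: separate K and V, each kv_heads * head_dim wide per layer.
    return 2 * layers * cfg["num_key_value_heads"] * cfg["head_dim"] * dtype_bytes


# Values from the configuration table: L=78, kv_lora=512, rope=64, 1-byte KV.
glm_cfg = {"num_hidden_layers": 78, "kv_lora_rank": 512, "qk_rope_head_dim": 64}
glm_token = kv_bytes_per_token(glm_cfg, dtype_bytes=1)
glm_block = glm_token * 512  # 512-token blocks, as in the trace

# Hypothetical GQA config (illustrative only), 2-byte KV.
gqa_cfg = {"num_hidden_layers": 48, "num_key_value_heads": 4, "head_dim": 128}
gqa_token = kv_bytes_per_token(gqa_cfg, dtype_bytes=2)

print(f"MLA: {glm_token / 1024:.1f} KiB/token, {glm_block / 1e6:.1f} MB/block")
print(f"GQA: {gqa_token / 1024:.1f} KiB/token")
```

The MLA result is 44,928 bytes per token, which rounds to the 43.9 KiB and 23.0 MB per 512-token block reported in the configuration table.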

Configuration

Item	Value
Model	GLM-5.1-FP8 (MLA, L=78, kv_lora=512 + rope=64)
KV/token · KV/block (512 tokens)	43.9 KiB · 23.0 MB (about half of the 96 KiB for Qwen3 GQA)
Hardware	8× B300 (288 GB each) = 2304 GB HBM/replica
Budget	FP8 weights 744 GB + activations 32 GB → KV pool = 1528 GB/node
Trace	dash0 glm_coder, 475k requests, 1.25 h active @ 106 QPS, ~40k tokens/request (77 warm-up records with negative timestamps dropped)
APC ceiling	80.4%
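
The budget row is simple arithmetic, sketched below together with how many 512-token blocks the pool can hold. The function name `kv_pool_gb` is illustrative, and treating GB as decimal (10^9 bytes) is an assumption, though it matches the 23.0 MB block size above.

```python
GB = 10**9  # assumption: the GB/MB figures in this README are decimal


def kv_pool_gb(gpus, hbm_gb_per_gpu, weight_gb, activation_gb):
    """HBM left for the KV pool on one replica after weights and activations."""
    return gpus * hbm_gb_per_gpu - weight_gb - activation_gb


pool_gb = kv_pool_gb(gpus=8, hbm_gb_per_gpu=288, weight_gb=744, activation_gb=32)

# Block size from the MLA formula: L * (kv_lora + rope) * dtype * 512 tokens.
block_bytes = 78 * (512 + 64) * 1 * 512
blocks_in_pool = pool_gb * GB // block_bytes
tokens_in_pool = blocks_in_pool * 512

print(pool_gb, blocks_in_pool, tokens_in_pool)
```

Under these assumptions one replica holds roughly 66 thousand blocks, or about 34 million tokens of KV.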

Results

Retention window T	Peak footprint	= nodes (GPUs)	APC@T
2s (in-flight lower bound)	533 GB	0.3 (3)	1.7%
10s	2,068 GB	1.4 (11)	15%
30s	4,906 GB	3.2 (26)	42%
60s	7,698 GB	5.0 (40)	56%
300s	21,960 GB	14.4 (115)	74%
oracle (full 80.4%)	21,399 GB	14.0 (112)	80.4%
retain-forever	167,018 GB	109 (874)	n/a
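
The node and GPU columns appear consistent with dividing each peak footprint by the 1528 GB per-node KV pool and scaling by 8 GPUs per node. The sketch below reproduces that conversion; the helper name `nodes_and_gpus` and the rounding rule are assumptions inferred from the table, not taken from the tool.

```python
POOL_GB_PER_NODE = 1528  # KV pool per node, from the configuration table
GPUS_PER_NODE = 8        # TP=8 on one B300 node


def nodes_and_gpus(footprint_gb):
    """Convert a peak KV footprint in GB to (nodes, GPUs) at the per-node pool size."""
    nodes = footprint_gb / POOL_GB_PER_NODE
    return round(nodes, 1), round(nodes * GPUS_PER_NODE)


for label, footprint in [
    ("2s", 533),
    ("10s", 2068),
    ("30s", 4906),
    ("60s", 7698),
    ("300s", 21960),
    ("oracle", 21399),
]:
    nodes, gpus = nodes_and_gpus(footprint)
    print(f"{label:>7}: {nodes} nodes ({gpus} GPUs)")
```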

Conclusions

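Serving: one node is more than enough. In-flight KV (τ≈2-5s) is only 533–1157 GB, far below the 1528 GB single-node pool. MLA plus the B300's large HBM makes the live footprint negligible; serving does not run short of HBM.
Caching all reuse (80.4%): one node falls short by ~14×. The oracle lower bound is 21.4 TB = 14 nodes (112 GPUs); a real LRU costs ~2× that, so ~28 nodes. A single node (1528 GB) can only hold a ~10s window, so the cache-side APC is only ~10-15%. Reaching ~56% takes 5 nodes, and ~74% takes ~14 nodes.
The bottleneck is the long tail, not live KV. Fitting APC 50%→80% into GPU HBM takes 5→14 nodes, which is highly uneconomical, so offload/migration to CPU DRAM (~1.5 TB per node) is the quantitative motivation. This points in the same direction as the Qwen conclusion.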

Notes

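The footprints assume TTL-LRU (the most wasteful policy) and are a shared-cache lower bound: a real capacity-bounded LRU achieves higher APC at the same capacity, but partitioning/affinity imbalance pushes the requirement back up; oracle and retain-forever give the lower and upper bounds.
The GLM trace averages ~40k tokens/request, ~3.5× the Qwen trace (11k), because the tokenizer and the extraction differ. Absolute GB figures are therefore not comparable across models; the method and the qualitative conclusions are.
EP does not change total KV (it only affects how expert weights are distributed); --ep is recorded as a label only.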
