Working-set sizing tool + GLM-5.1-FP8/B300 result

Configurable KV working-set analyzer, parameterized by GPU model,
TP/PP/EP layout, model config.json (MLA/GQA auto-detected), and
KV/weight dtype. Computes the Denning W(T), oracle [first,last], and
retain-forever footprints against a per-replica KV pool, plus the APC
captured at each retention window.
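
For readers unfamiliar with the metric, a minimal sketch of a Denning
working set over an access trace follows. It is not the tool's code: the
trace format (timestamp, block_id), the function names, and the reading
of "APC captured" as the fraction of accesses that re-hit a block still
inside the retention window are all assumptions made for illustration.

```python
from collections import OrderedDict


def denning_working_set(trace, window):
    """Return [(t, W(t, window))] for a time-sorted trace of (t, block_id).

    W(t, window) is the number of distinct blocks referenced in
    (t - window, t]. The trace format is hypothetical.
    """
    last_seen = OrderedDict()  # block_id -> last access time, oldest first
    sizes = []
    for t, block in trace:
        # Move the block to the most-recent end with its new timestamp.
        last_seen.pop(block, None)
        last_seen[block] = t
        # Drop blocks whose last reference fell out of the window.
        while last_seen:
            _, oldest_t = next(iter(last_seen.items()))
            if oldest_t > t - window:
                break
            last_seen.popitem(last=False)
        sizes.append((t, len(last_seen)))
    return sizes


def captured_hit_rate(trace, window):
    """Fraction of accesses whose previous reference is < window old."""
    prev = {}
    hits = 0
    total = 0
    for t, block in trace:
        if block in prev and t - prev[block] < window:
            hits += 1
        prev[block] = t
        total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    demo = [(0, "a"), (1, "b"), (2, "a"), (5, "c"), (6, "a")]
    print(denning_working_set(demo, window=3))
    print(captured_hit_rate(demo, window=3))
```

Multiplying W(T) by the block size in bytes gives the footprint to
compare against the KV pool; sweeping the window shows how much reuse
each retention setting would capture.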

GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool):
live KV fits trivially (~533 GB), but reaching the full 80.4% APC
ceiling needs ~14 nodes (oracle), so long-tail reuse motivates DRAM
offload rather than more HBM.
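
The figures above can be sanity-checked with a few lines of arithmetic.
This is a sketch only: it assumes "GB" means 10^9 bytes, and the oracle
footprint is a rough value implied by "~14 nodes", not a number reported
by the tool.

```python
KIB = 1024
GB = 10**9  # assumption: decimal gigabytes

kv_bytes_per_token = 43.9 * KIB  # MLA KV cost per token, from the result
pool_bytes = 1528 * GB           # per-node KV pool
live_bytes = 533 * GB            # approximate live KV

# Tokens of KV one node's pool can hold.
tokens_per_pool = pool_bytes / kv_bytes_per_token

# Share of the pool occupied by live KV.
live_fraction = live_bytes / pool_bytes

# Rough oracle footprint implied by "~14 nodes" (approximate by design).
oracle_bytes_approx = 14 * pool_bytes

if __name__ == "__main__":
    print(f"tokens per pool: {tokens_per_pool / 1e6:.1f} M")
    print(f"live KV share:   {live_fraction:.1%}")
    print(f"oracle (approx): {oracle_bytes_approx / 1e12:.1f} TB")
```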

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

Gahow Wang

2026-05-28 16:03:25 +08:00

parent 2e6a369046

commit dae98c6472

3 changed files with 343 additions and 0 deletions

figs/working_set/glm5_fp8_tp8_b300.png (new binary file, 134 KiB, not shown)
