f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling

Old f2c plotted per-request KV footprint (MiB) against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong: a 30B-A3B bf16
deployment spends roughly
  ~50% of HBM on model params (~48 GiB on a 96 GiB H20)
  ~10% on runtime activation buffers (~9.6 GiB)
  ~40% on the KV cache pool (~38.4 GiB)
so the 95 GiB line overstated the available pool by about 2.5×.
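
The budget split above can be checked with a few lines of arithmetic. This is a minimal sketch: the 50/10/40 fractions, the 96 GiB HBM size, and the old 95 GiB reference line come from this message, while the variable names are illustrative only.

```python
# Rough HBM budget for one instance. The fractions are the approximate
# split quoted in this commit message, not measured values.
HBM_GIB = 96.0
FRACTIONS = {"model_params": 0.50, "activations": 0.10, "kv_pool": 0.40}

budget_gib = {name: HBM_GIB * frac for name, frac in FRACTIONS.items()}

old_ceiling_gib = 95.0  # reference line drawn by the old figure
overstatement = old_ceiling_gib / budget_gib["kv_pool"]

for name, gib in budget_gib.items():
    print(f"{name:>12}: {gib:5.1f} GiB")
print(f"old ceiling / KV pool = {overstatement:.2f}x")
```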

New f2c reframes the same data as the question that actually motivates
the paper: how many concurrent decodes does a single instance hold,
and how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:
  Combined 8C (N_D=8), PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key readings from the figure (system-wide: 8C / 4P+4D / 6P+2D):
  p50 ( 1.8 GiB/req): 20 fit/inst → 160 / 80 / 40
  p90 ( 8.0 GiB/req):  4 fit/inst →  32 / 16 /  8
  p95 ( 9.6 GiB/req):  4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req):  3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D halves the decode population relative to Combined 8C at
the same per-request KV pressure, and 6P+2D cuts it to a quarter. This
is the concrete §3.2 "KV memory wall" penalty stated in terms users
care about (concurrency).
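
The readings above follow from capacity = N_D × floor(KV_pool / req_size). Below is a minimal sketch of that arithmetic using the rounded GiB sizes quoted in this message; this is an assumption, since the figure script works from the unrounded MiB values in kv_footprint_summary.json. With the rounded 1.8 GiB, p50 comes out at 21 per instance rather than the 20 reported above, presumably because the unrounded p50 is slightly larger than 1.8 GiB; the other percentiles match. `Decimal` is used so that 38.4 / 9.6 floors to exactly 4 regardless of binary floating-point rounding.

```python
from decimal import Decimal

KV_POOL_GIB = Decimal("38.4")  # 0.4 x 96 GiB, per instance
REQ_GIB = {  # rounded per-request KV sizes from the commit message
    "p50": Decimal("1.8"),
    "p90": Decimal("8.0"),
    "p95": Decimal("9.6"),
    "p99": Decimal("11.5"),
}
# Deployment name -> number of decode-capable instances (N_D).
DEPLOYMENTS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}


def fits_per_instance(req_gib: Decimal) -> int:
    """Whole requests of this size that fit in one instance's KV pool."""
    return int(KV_POOL_GIB // req_gib)


def system_capacity(req_gib: Decimal) -> dict[str, int]:
    """System-wide concurrent decodes: N_D x per-instance fit."""
    per_inst = fits_per_instance(req_gib)
    return {name: n_d * per_inst for name, n_d in DEPLOYMENTS.items()}


for pct, size in REQ_GIB.items():
    print(pct, fits_per_instance(size), system_capacity(size))
```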

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; it reads the same
  kv_footprint_summary.json but computes floor(KV_pool / req_size) × N_D
  and annotates the per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the
  new ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 11:28:47 +08:00
parent 922d79ac95
commit 555cabcf1f
4 changed files with 62 additions and 14 deletions


@@ -27,7 +27,7 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
 |---|---|---|
 | KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_reuse_topology.png) |
 | Sessions are extremely skewed | on the production trace, the top 1% / 5% / 10% / 25% / 50% of sessions hold **46.5% / 66.5% / 74.6% / 87.5% / 96.0%** of input mass | ![](figs/f2b_session_skew.png) |
-| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of H20 | ![](figs/f2c_kv_footprint_cdf.png) |
+| Per-request KV footprint is large, so a single instance's KV pool fills up quickly | per-instance KV pool ≈ **38 GiB** (0.4 × 96 GiB H20; the other 50% goes to params and 10% to activations); a p99 request is 11.5 GiB, so one instance holds only **3 p99 decodes**; 4P+4D halves system decode capacity | ![](figs/f2c_kv_footprint_cdf.png) |
 Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. **Any scheduling without session affinity loses the vast majority of reuse.**


@@ -50,7 +50,7 @@ Three essential differences between agentic workloads and chatbots:
 - **Prefill-dominated**: input/output token ratio **75x** (chatbots: 1-10x); 98% of compute is in the prefill phase
 - **Skewed sessions** (from the Qwen3 production trace, n = 1.3M sessions / 2.1M requests / 7200 s): the top 1% contribute **46.5%** of input tokens, top 5% **66.5%**, top 10% **74.6%**, top 25% **87.5%**, top 50% **96.0%**; half of the sessions account for nearly all of the input mass
-Average session length: TBD turns, TBD input tokens; p99 per-request KV footprint **11.49 GiB** (12% of the H20's 96 GB HBM).
+Average session length: TBD turns, TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 **1.8 GiB**, p90 **8.0 GiB**, p95 **9.6 GiB**, p99 **11.5 GiB**. Per-instance KV pool ≈ 0.4 × 96 GiB = **38.4 GiB** (the other 50% goes to bf16 model params and 10% to runtime activations), so at the p99 request size one instance holds only **3 concurrent decodes**; switching to PD-disagg 4P+4D halves system decode capacity (system-wide concurrency 24 → 12).
 ### §2.2 KV Cache Reuse Topology
@@ -70,7 +70,7 @@ Decomposition of KV reuse on the trace:
 ![F2b Session input-token mass CDF: production trace top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% (replay window overlaid for sanity)](figs/f2b_session_skew.png)
-![F2c KV footprint CDF: p99 = 11.8 GiB ≈ 12% of H20](figs/f2c_kv_footprint_cdf.png)
+![F2c Per-instance decode concurrency vs deployment (KV pool 38.4 GiB; p99 req fits only 3/inst; PD-disagg halves system decode capacity)](figs/f2c_kv_footprint_cdf.png)
 > 📝 Layout TBD: combine the three figures into a 1×3 grid, or spread them across §2.1/§2.2/§2.4, one each.


@@ -308,19 +308,67 @@ def fig_reuse_decomposition(reuse: dict, out: Path) -> None:
 def fig_kv_footprint_cdf(kv: dict, out: Path) -> None:
+    """How many concurrent decodes fit per percentile, under three deployments.
+
+    KV pool assumption: 96 GiB H20 HBM split ~50% model params (Qwen3-Coder-
+    30B-A3B in bf16 + headroom), ~10% runtime activations, leaving ~40% for
+    the KV cache pool, i.e. ~38.4 GiB per instance.
+
+    For each request-size percentile, we report system-wide concurrent
+    decode capacity = N_D × floor(KV_pool / req_size_MiB) under three 8-GPU
+    deployments: all-combined, 4P+4D, 6P+2D. The point is that going from
+    combined 8C to 4P+4D halves the system's decode population at the
+    same per-request KV pressure.
+    """
     s = kv.get("kv_mib_per_request") or {}
-    vals = [s.get(k) for k in ("p50", "p90", "p95", "p99")]
-    labels = ["p50", "p90", "p95", "p99"]
-    fig, ax = plt.subplots(figsize=(6, 3.5))
-    ax.bar(labels, vals, color="#1f77b4", edgecolor="black", linewidth=0.5)
-    for i, v in enumerate(vals):
-        ax.text(i, v, f"{v:.0f} MiB", ha="center", va="bottom", fontsize=9)
-    ax.axhline(95 * 1024, color="red", linestyle="--", alpha=0.5,
-               label="H20 ~95 GiB usable")
-    ax.set_ylabel("KV bytes per request (MiB)")
-    ax.set_title("B1' Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token)")
-    ax.legend()
+    pct_keys = ["p50", "p90", "p95", "p99"]
+    req_mib = [float(s.get(k, 0.0)) for k in pct_keys]
+    req_gib = [v / 1024 for v in req_mib]
+    hbm_gib = 96.0
+    kv_pool_frac = 0.40
+    kv_pool_mib = hbm_gib * kv_pool_frac * 1024  # ≈ 39322 MiB per instance
+    deploys = [
+        ("Combined 8C", 8, "#2ca02c"),
+        ("PD-disagg 4P+4D", 4, "#ff7f0e"),
+        ("PD-disagg 6P+2D", 2, "#d62728"),
+    ]
+    import numpy as _np
+    x = _np.arange(len(pct_keys))
+    bar_w = 0.26
+    fig, ax = plt.subplots(figsize=(9, 5.2))
+    for i, (label, n_d, color) in enumerate(deploys):
+        per_inst = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+        sys_cap = [n_d * pi for pi in per_inst]
+        bars = ax.bar(x + (i - 1) * bar_w, sys_cap, bar_w,
+                      label=f"{label} (N_D={n_d})",
+                      color=color, edgecolor="black", linewidth=0.5)
+        for j, (b, n) in enumerate(zip(bars, sys_cap)):
+            ax.text(b.get_x() + b.get_width() / 2, n, str(n),
+                    ha="center", va="bottom", fontsize=9, color="#333")
+    # Annotate per-request KV size and per-instance fit just above the x-axis
+    per_inst_combined = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+    annot = [
+        f"{pct}\n{rg:.1f} GiB / req\nfits {pi}/inst"
+        for pct, rg, pi in zip(pct_keys, req_gib, per_inst_combined)
+    ]
+    ax.set_xticks(x)
+    ax.set_xticklabels(annot, fontsize=10)
+    ax.set_ylabel("System-wide concurrent decodes")
+    ax.set_title(
+        f"Per-instance KV pool ≈ {kv_pool_mib / 1024:.1f} GiB "
+        f"(0.4 × H20 96 GiB; remaining 0.5 model + 0.1 activation)\n"
+        f"PD-disagg halves the decode population at p90+ "
+        f"(Qwen3-Coder-30B-A3B, 98304 B/token)"
+    )
+    ax.legend(loc="upper right")
+    ax.grid(alpha=0.3, axis="y")
+    ax.margins(y=0.15)
     fig.tight_layout()
     fig.savefig(out, dpi=120)
     plt.close(fig)
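
The 98304 B/token constant in the figure title can be sanity-checked from the attention geometry. The sketch below assumes a grouped-query configuration of 48 layers, 4 KV heads, head dimension 128, and 2-byte bf16 entries. None of these values are stated in this commit; they are an assumption that happens to reproduce the constant. The helper `tokens_for` is hypothetical and converts the quoted per-request sizes into approximate context lengths.

```python
# Assumed model geometry (NOT taken from this commit); chosen because it
# reproduces the 98304 B/token constant used in the figure title.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # bf16

# K and V each store NUM_KV_HEADS * HEAD_DIM elements per layer per token.
kv_bytes_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM


def tokens_for(gib: float, bytes_per_token: int = kv_bytes_per_token) -> int:
    """Approximate context length whose KV cache occupies `gib` GiB."""
    return int(gib * 1024**3 // bytes_per_token)


print(kv_bytes_per_token)
for pct, gib in {"p50": 1.8, "p90": 8.0, "p95": 9.6, "p99": 11.5}.items():
    print(pct, tokens_for(gib))
```

Under these assumptions, a p99 request corresponds to a context on the order of 10^5 tokens, which is consistent with the prefill-dominated workload described in the outline.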

Binary file not shown (size before: 36 KiB; after: 73 KiB).