f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling

Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB usable" reference line. That ceiling was wrong: a 30B-A3B bf16 deployment burns roughly

  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)

so 95 GiB was overstating the available pool by 2.5×.

New f2c reframes the same data into the answer that actually motivates the paper: how many concurrent decodes does a single instance hold, and how does PD-disagg change that? Grouped bars per percentile show system-wide concurrent decode capacity for three 8-GPU deployments:

  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:

  p50 (1.8 GiB/req):  20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):   4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):   4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req):  3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same per-request KV pressure; this is the concrete §3.2 "KV memory wall" penalty stated in terms users care about (concurrency).

- analysis/characterization/render_window1_figures.py: fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json but computes floor(KV_pool / req_size) × N_D and annotates the per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the new ceiling and the "3 p99 decodes per instance / halved by PD-disagg" framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
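
The capacity arithmetic in this message can be re-derived with a minimal sketch like the one below. It is not the code in render_window1_figures.py: the helper names (fits_per_instance, system_capacity, tokens_per_request) are hypothetical, and it uses the rounded GiB values quoted above rather than the unrounded MiB values in kv_footprint_summary.json.

```python
import math

HBM_GIB = 96.0               # H20 HBM, as stated in the commit message
KV_POOL_FRAC = 0.40          # assumed share left after ~50% params, ~10% activations
KV_BYTES_PER_TOKEN = 98304   # Qwen3-Coder-30B-A3B figure quoted in the diff

# Rounded per-request footprints (GiB) copied from the message; the real
# script works from unrounded MiB values, so results can differ at the margin.
REQ_GIB = {"p50": 1.8, "p90": 8.0, "p95": 9.6, "p99": 11.5}

# Decode instance count (N_D) for each 8-GPU deployment.
DEPLOYS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}


def fits_per_instance(req_gib, pool_gib=HBM_GIB * KV_POOL_FRAC):
    """Whole requests of size req_gib that fit in one instance's KV pool."""
    return math.floor(pool_gib / req_gib) if req_gib > 0 else 0


def system_capacity(req_gib):
    """System-wide concurrent decodes: N_D x per-instance fit, per deployment."""
    per_inst = fits_per_instance(req_gib)
    return {name: n_d * per_inst for name, n_d in DEPLOYS.items()}


def tokens_per_request(req_gib):
    """Context length (tokens) implied by a KV footprint at 98304 B/token."""
    return int(req_gib * 1024**3 // KV_BYTES_PER_TOKEN)


if __name__ == "__main__":
    for pct, gib in REQ_GIB.items():
        print(pct, fits_per_instance(gib), system_capacity(gib),
              tokens_per_request(gib))
```

For p90, p95 and p99 this reproduces the fit counts listed above (4, 4, 3). For p50 the rounded 1.8 GiB gives 21 rather than the reported 20; presumably the unrounded p50 sits slightly above 1.83 GiB, where the floor drops to 20, but that is an inference from the numbers, not something the commit states.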
2026-05-27 11:28:47 +08:00
parent 922d79ac95
commit 555cabcf1f
4 changed files with 62 additions and 14 deletions
--- a/MEETING.md
+++ b/MEETING.md
@@ -27,7 +27,7 @@ L = Λ · N · W_turn(L)        # agentic, T_human≈0
 |---|---|---|
 | KV reuse happens almost entirely within a session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_reuse_topology.png) |
 | Sessions are extremely skewed | on the production trace, top 1% / 5% / 10% / 25% / 50% = **46.5% / 66.5% / 74.6% / 87.5% / 96.0%** of input mass | ![](figs/f2b_session_skew.png) |
-| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of H20 | ![](figs/f2c_kv_footprint_cdf.png) |
+| Per-request KV footprint is large, so a single instance's KV pool fills up quickly | per-instance KV pool ≈ **38 GiB** (0.4 × 96 GiB H20; the rest is 50% params + 10% activation); a p99 request is 11.5 GiB → one instance holds only **3 p99 decodes**; 4P+4D halves system decode capacity outright | ![](figs/f2c_kv_footprint_cdf.png) |

 Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. **Any scheduling without session affinity loses the vast majority of reuse.**

--- a/PAPER_OUTLINE.md
+++ b/PAPER_OUTLINE.md
@@ -50,7 +50,7 @@ Three essential differences between agentic workloads and chatbots:
 - **Prefill-dominated**: input/output token ratio **75x**, 98% of compute is in the prefill phase (chatbots: 1-10x)
 - **Skewed sessions** (from the Qwen3 production trace, n=1.3M sessions / 2.1M req / 7200s): the top 1% contribute **46.5%** of input tokens, top 5% **66.5%**, top 10% **74.6%**, top 25% **87.5%**, top 50% **96.0%**; half of the sessions account for nearly all input mass

-Average session length: TBD turns, TBD input tokens; p99 per-request KV footprint **11.49 GiB** (12% of the H20's 96GB HBM).
+Average session length: TBD turns, TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 **1.8 GiB**, p90 **8.0 GiB**, p95 **9.6 GiB**, p99 **11.5 GiB**. Per-instance KV pool ≈ 0.4 × 96 GiB = **38.4 GiB** (the rest is 50% bf16 model params + 10% runtime activation), so for p99 requests one instance holds only **3 concurrent decodes**; switching to PD-disagg 4P+4D halves system decode capacity outright (system concurrency 24 → 12).

 ### §2.2 KV Cache Reuse Topology

@@ -70,7 +70,7 @@ Decomposition of KV reuse on the trace:

 ![F2b Session input-token mass CDF — production trace top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% (replay window overlaid for sanity)](figs/f2b_session_skew.png)

-![F2c KV footprint CDF — p99 = 11.8 GiB ≈ 12% of H20](figs/f2c_kv_footprint_cdf.png)
+![F2c Per-instance decode concurrency vs deployment (KV pool 38.4 GiB; p99 req fits only 3/inst; PD-disagg halves system decode capacity)](figs/f2c_kv_footprint_cdf.png)

 > 📝 Layout TBD: combine the three figures into a 1×3 panel, or spread them across §2.1/§2.2/§2.4, one each.

--- a/analysis/characterization/render_window1_figures.py
+++ b/analysis/characterization/render_window1_figures.py
@@ -308,19 +308,67 @@ def fig_reuse_decomposition(reuse: dict, out: Path) -> None:


 def fig_kv_footprint_cdf(kv: dict, out: Path) -> None:
+    """How many concurrent decodes fit per percentile, under three deployments.
+
+    KV pool assumption: 96 GiB H20 HBM split ~50% model params (Qwen3-Coder-
+    30B-A3B in bf16 + headroom), ~10% runtime activations, leaving ~40% for
+    the KV cache pool — i.e. ~38.4 GiB per instance.
+
+    For each request-size percentile, we report system-wide concurrent
+    decode capacity = N_D × floor(KV_pool / req_size_MiB) under three 8-GPU
+    deployments: all-combined, 4P+4D, 6P+2D. The point is that going from
+    combined 8C to 4P+4D halves the system's decode population at the
+    same per-request KV pressure.
+    """
    s = kv.get("kv_mib_per_request") or {}
-    vals = [s.get(k) for k in ("p50", "p90", "p95", "p99")]
-    labels = ["p50", "p90", "p95", "p99"]
-    fig, ax = plt.subplots(figsize=(6, 3.5))
-    ax.bar(labels, vals, color="#1f77b4", edgecolor="black", linewidth=0.5)
-    for i, v in enumerate(vals):
-        ax.text(i, v, f"{v:.0f} MiB", ha="center", va="bottom", fontsize=9)
-    ax.axhline(95 * 1024, color="red", linestyle="--", alpha=0.5,
-                label="H20 ~95 GiB usable")
-    ax.set_ylabel("KV bytes per request (MiB)")
-    ax.set_title("B1' Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token)")
-    ax.legend()
+    pct_keys = ["p50", "p90", "p95", "p99"]
+    req_mib = [float(s.get(k, 0.0)) for k in pct_keys]
+    req_gib = [v / 1024 for v in req_mib]
+
+    hbm_gib = 96.0
+    kv_pool_frac = 0.40
+    kv_pool_mib = hbm_gib * kv_pool_frac * 1024  # ≈ 39322 MiB per instance
+
+    deploys = [
+        ("Combined 8C",     8, "#2ca02c"),
+        ("PD-disagg 4P+4D", 4, "#ff7f0e"),
+        ("PD-disagg 6P+2D", 2, "#d62728"),
+    ]
+
+    import numpy as _np
+    x = _np.arange(len(pct_keys))
+    bar_w = 0.26
+
+    fig, ax = plt.subplots(figsize=(9, 5.2))
+    for i, (label, n_d, color) in enumerate(deploys):
+        per_inst = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+        sys_cap = [n_d * pi for pi in per_inst]
+        bars = ax.bar(x + (i - 1) * bar_w, sys_cap, bar_w,
+                      label=f"{label} (N_D={n_d})",
+                      color=color, edgecolor="black", linewidth=0.5)
+        for b, n in zip(bars, sys_cap):
+            ax.text(b.get_x() + b.get_width() / 2, n, str(n),
+                    ha="center", va="bottom", fontsize=9, color="#333")
+
+    # Annotate per-request KV size and per-instance fit just above the x-axis
+    per_inst_combined = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+    annot = [
+        f"{pct}\n{rg:.1f} GiB / req\nfits {pi}/inst"
+        for pct, rg, pi in zip(pct_keys, req_gib, per_inst_combined)
+    ]
+    ax.set_xticks(x)
+    ax.set_xticklabels(annot, fontsize=10)
+
+    ax.set_ylabel("System-wide concurrent decodes")
+    ax.set_title(
+        f"Per-instance KV pool ≈ {kv_pool_mib / 1024:.1f} GiB "
+        f"(0.4 × H20 96 GiB; remaining 0.5 model + 0.1 activation)\n"
+        f"PD-disagg halves the decode population at p90+ "
+        f"(Qwen3-Coder-30B-A3B, 98304 B/token)"
+    )
+    ax.legend(loc="upper right")
    ax.grid(alpha=0.3, axis="y")
+    ax.margins(y=0.15)
    fig.tight_layout()
    fig.savefig(out, dpi=120)
    plt.close(fig)
--- a/figs/f2c_kv_footprint_cdf.png
+++ b/figs/f2c_kv_footprint_cdf.png