diff --git a/MEETING.md b/MEETING.md
index 66bc308..5e93890 100644
--- a/MEETING.md
+++ b/MEETING.md
@@ -27,7 +27,7 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
 |---|---|---|
 | KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% | ![](figs/f2a_reuse_topology.png) |
 | Sessions are extremely skewed | on the production trace, top 1% / 5% / 10% / 25% / 50% = **46.5% / 66.5% / 74.6% / 87.5% / 96.0%** of input mass | ![](figs/f2b_session_skew.png) |
-| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of H20 | ![](figs/f2c_kv_footprint_cdf.png) |
+| Per-request KV footprint is large, so a single instance's KV pool fills up quickly | per-instance KV pool ≈ **38 GiB** (0.4 × 96 GiB H20; the rest is 50% params + 10% activation); p99 req 11.5 GiB → one instance fits only **3 p99 decodes**; 4P+4D halves system decode capacity outright | ![](figs/f2c_kv_footprint_cdf.png) |
 
 Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. **Any scheduling that is not affinity-aware loses the vast majority of reuse.**
 
diff --git a/PAPER_OUTLINE.md b/PAPER_OUTLINE.md
index a43f449..73728a4 100644
--- a/PAPER_OUTLINE.md
+++ b/PAPER_OUTLINE.md
@@ -50,7 +50,7 @@ Three essential differences between agentic workloads and chatbots:
 - **Prefill-dominated**: input/output token ratio of **75x**; 98% of compute is in the prefill phase (chatbots: 1-10x)
 - **Skewed sessions** (from the Qwen3 production trace, n=1.3M sessions / 2.1M req / 7200s): the top 1% contribute **46.5%** of input tokens, top 5% **66.5%**, top 10% **74.6%**, top 25% **87.5%**, top 50% **96.0%**; half of the sessions account for almost all of the input mass
 
-Average session length: TBD turns, TBD input tokens; p99 per-request KV footprint is **11.49 GiB** (12% of the H20's 96GB HBM).
+Average session length: TBD turns, TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 **1.8 GiB**, p90 **8.0 GiB**, p95 **9.6 GiB**, p99 **11.5 GiB**. Per-instance KV pool ≈ 0.4 × 96 GiB = **38.4 GiB** (the rest is 50% bf16 model params + 10% runtime activation), so one instance can hold only **3 concurrent decodes** of p99 requests; switching to PD-disagg 4P+4D halves system decode capacity outright (system concurrency 24 → 12).
 
 ### §2.2 KV Cache Reuse Topology
 
@@ -70,7 +70,7 @@ Breakdown of KV reuse on the trace:
 
 ![F2b Session input-token mass CDF: production trace top 1%/5%/10%/25%/50% = 46.5%/66.5%/74.6%/87.5%/96.0% (replay window overlaid for sanity)](figs/f2b_session_skew.png)
 
-![F2c KV footprint CDF: p99 = 11.8 GiB ≈ 12% of H20](figs/f2c_kv_footprint_cdf.png)
+![F2c Per-instance decode concurrency vs deployment (KV pool 38.4 GiB; p99 req fits only 3/inst; PD-disagg halves system decode capacity)](figs/f2c_kv_footprint_cdf.png)
 
 > 📝 Layout TBD: combine the three figures into a 1×3 panel, or place one each in §2.1/§2.2/§2.4.
 
diff --git a/analysis/characterization/render_window1_figures.py b/analysis/characterization/render_window1_figures.py
index a336863..4c2e6f9 100644
--- a/analysis/characterization/render_window1_figures.py
+++ b/analysis/characterization/render_window1_figures.py
@@ -308,19 +308,67 @@ def fig_reuse_decomposition(reuse: dict, out: Path) -> None:
 
 
 def fig_kv_footprint_cdf(kv: dict, out: Path) -> None:
+    """How many concurrent decodes fit per percentile, under three deployments.
+
+    KV pool assumption: 96 GiB H20 HBM split ~50% model params (Qwen3-Coder-
+    30B-A3B in bf16 + headroom), ~10% runtime activations, leaving ~40% for
+    the KV cache pool, i.e. ~38.4 GiB per instance.
+
+    For each request-size percentile, we report system-wide concurrent
+    decode capacity = N_D × floor(KV_pool / req_size_MiB) under three 8-GPU
+    deployments: all-combined, 4P+4D, 6P+2D. The point is that going from
+    combined 8C to 4P+4D halves the system's decode population at the
+    same per-request KV pressure.
+    """
     s = kv.get("kv_mib_per_request") or {}
-    vals = [s.get(k) for k in ("p50", "p90", "p95", "p99")]
-    labels = ["p50", "p90", "p95", "p99"]
-    fig, ax = plt.subplots(figsize=(6, 3.5))
-    ax.bar(labels, vals, color="#1f77b4", edgecolor="black", linewidth=0.5)
-    for i, v in enumerate(vals):
-        ax.text(i, v, f"{v:.0f} MiB", ha="center", va="bottom", fontsize=9)
-    ax.axhline(95 * 1024, color="red", linestyle="--", alpha=0.5,
-               label="H20 ~95 GiB usable")
-    ax.set_ylabel("KV bytes per request (MiB)")
-    ax.set_title("B1' Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token)")
-    ax.legend()
+    pct_keys = ["p50", "p90", "p95", "p99"]
+    req_mib = [float(s.get(k, 0.0)) for k in pct_keys]
+    req_gib = [v / 1024 for v in req_mib]
+
+    hbm_gib = 96.0
+    kv_pool_frac = 0.40
+    kv_pool_mib = hbm_gib * kv_pool_frac * 1024  # ≈ 39322 MiB per instance
+
+    deploys = [
+        ("Combined 8C", 8, "#2ca02c"),
+        ("PD-disagg 4P+4D", 4, "#ff7f0e"),
+        ("PD-disagg 6P+2D", 2, "#d62728"),
+    ]
+
+    import numpy as _np
+    x = _np.arange(len(pct_keys))
+    bar_w = 0.26
+
+    fig, ax = plt.subplots(figsize=(9, 5.2))
+    for i, (label, n_d, color) in enumerate(deploys):
+        per_inst = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+        sys_cap = [n_d * pi for pi in per_inst]
+        bars = ax.bar(x + (i - 1) * bar_w, sys_cap, bar_w,
+                      label=f"{label} (N_D={n_d})",
+                      color=color, edgecolor="black", linewidth=0.5)
+        for j, (b, n) in enumerate(zip(bars, sys_cap)):
+            ax.text(b.get_x() + b.get_width() / 2, n, str(n),
+                    ha="center", va="bottom", fontsize=9, color="#333")
+
+    # Annotate per-request KV size and per-instance fit just above the x-axis
+    per_inst_combined = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+    annot = [
+        f"{pct}\n{rg:.1f} GiB / req\nfits {pi}/inst"
+        for pct, rg, pi in zip(pct_keys, req_gib, per_inst_combined)
+    ]
+    ax.set_xticks(x)
+    ax.set_xticklabels(annot, fontsize=10)
+
+    ax.set_ylabel("System-wide concurrent decodes")
+    ax.set_title(
+        f"Per-instance KV pool ≈ {kv_pool_mib / 1024:.1f} GiB "
+        f"(0.4 × H20 96 GiB; remaining 0.5 model + 0.1 activation)\n"
+        f"PD-disagg halves the decode population at p90+ "
+        f"(Qwen3-Coder-30B-A3B, 98304 B/token)"
+    )
+    ax.legend(loc="upper right")
     ax.grid(alpha=0.3, axis="y")
+    ax.margins(y=0.15)
     fig.tight_layout()
     fig.savefig(out, dpi=120)
     plt.close(fig)
diff --git a/figs/f2c_kv_footprint_cdf.png b/figs/f2c_kv_footprint_cdf.png
index 63ac975..7138f87 100644
Binary files a/figs/f2c_kv_footprint_cdf.png and b/figs/f2c_kv_footprint_cdf.png differ
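
Reviewer note: the capacity figures quoted in this change (3 p99 decodes per instance, system concurrency 24 → 12) can be cross-checked with a small standalone sketch of the formula from the new docstring, capacity = N_D × floor(KV_pool / req_size). The helper name `decode_capacity` and the hard-coded percentile sizes (copied from the PAPER_OUTLINE.md hunk, in GiB rather than the MiB values the script reads from `kv`) are assumptions made for illustration; this is not code from the repository.

```python
import math


def decode_capacity(req_gib: float, n_decode: int,
                    hbm_gib: float = 96.0, kv_pool_frac: float = 0.40):
    """Return (decodes per instance, system-wide decodes).

    Hypothetical helper: assumes every decode instance reserves
    `kv_pool_frac` of its HBM for the KV cache pool, as the diff does.
    """
    kv_pool_gib = hbm_gib * kv_pool_frac  # ~38.4 GiB on a 96 GiB H20
    per_instance = math.floor(kv_pool_gib / req_gib) if req_gib > 0 else 0
    return per_instance, n_decode * per_instance


# Per-request KV footprints (GiB) as quoted in PAPER_OUTLINE.md.
percentiles = {"p50": 1.8, "p90": 8.0, "p95": 9.6, "p99": 11.5}
# Number of decode-capable instances in each 8-GPU deployment.
deployments = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}

for name, n_d in deployments.items():
    row = {p: decode_capacity(size, n_d)[1] for p, size in percentiles.items()}
    print(f"{name:16s} {row}")
```

Under these assumptions the p99 column comes out as 24 for the combined deployment and 12 for 4P+4D, which matches the numbers in the outline text.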