f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling
Old f2c plotted per-request KV footprint (MiB) against an "H20 ~95 GiB
usable" reference line. That ceiling was wrong: a 30B-A3B bf16 deployment
burns roughly:

  ~50% HBM for model params (~48 GiB on 96 GiB H20)
  ~10% for runtime activation buffers
  ~40% left for the KV cache pool (~38.4 GiB)

so 95 GiB was overstating the available pool by about 2.5×.

New f2c reframes the same data into the answer that actually motivates
the paper: how many concurrent decodes does a single instance hold, and
how does PD-disagg change that? Grouped bars per percentile show
system-wide concurrent decode capacity for three 8-GPU deployments:

  Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2)

Key reads off the figure:

  p50 (1.8 GiB/req):  20 fit/inst → 160 / 80 / 40 system-wide
  p90 (8.0 GiB/req):   4 fit/inst →  32 / 16 /  8
  p95 (9.6 GiB/req):   4 fit/inst →  32 / 16 /  8
  p99 (11.5 GiB/req):  3 fit/inst →  24 / 12 /  6

PD-disagg 4P+4D literally halves the decode population at the same
per-request KV pressure. This is the concrete §3.2 "KV memory wall"
penalty stated in terms users care about (concurrency).

- analysis/characterization/render_window1_figures.py:
  fig_kv_footprint_cdf rewritten; reads the same
  kv_footprint_summary.json but computes floor(KV_pool / req_size) × N_D
  and annotates the per-instance fit count below each percentile group.
- figs/f2c_kv_footprint_cdf.png: regenerated.
- MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the new
  ceiling and the "3 p99 decodes per instance / halved by PD-disagg"
  framing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
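The key reads above follow from N_D × floor(KV_pool / req_size). Below is a minimal sketch of that arithmetic, assuming the rounded GiB figures quoted in this message rather than the unrounded MiB values in kv_footprint_summary.json. The names are illustrative and not taken from render_window1_figures.py. Exact fractions are used because 38.4 / 9.6 sits exactly on the 4.0 boundary, where floating-point rounding could tip the floor either way.

```python
from fractions import Fraction

# Assumptions from the commit message: 96 GiB HBM, 40% of it left for KV.
KV_POOL_GIB = Fraction(96) * Fraction(2, 5)  # 38.4 GiB per instance

# Rounded per-request footprints quoted in the message (GiB).
REQ_GIB = {
    "p50": Fraction("1.8"),
    "p90": Fraction("8.0"),
    "p95": Fraction("9.6"),
    "p99": Fraction("11.5"),
}

# Number of decode-capable instances (N_D) in each 8-GPU deployment.
DEPLOYS = {"Combined 8C": 8, "PD-disagg 4P+4D": 4, "PD-disagg 6P+2D": 2}


def fits_per_instance(req_gib: Fraction) -> int:
    """Whole requests of size req_gib that fit in one instance's KV pool."""
    return int(KV_POOL_GIB // req_gib)


def capacity_table() -> dict:
    """Map percentile -> (fit per instance, system-wide capacity per deployment)."""
    table = {}
    for pct, req in REQ_GIB.items():
        fit = fits_per_instance(req)
        table[pct] = (fit, [n_d * fit for n_d in DEPLOYS.values()])
    return table


if __name__ == "__main__":
    for pct, (fit, system) in capacity_table().items():
        print(pct, fit, system)
```

With these rounded inputs, p90, p95, and p99 reproduce the quoted 4, 4, and 3 per instance. p50 comes out at 21 rather than 20, which suggests the unrounded p50 is somewhat above 1.83 GiB; the figure itself is computed from the unrounded MiB values, so the quoted 20 is not necessarily an error.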
@@ -27,7 +27,7 @@ L = Λ · N · W_turn(L) # agentic, T_human≈0
 |---|---|---|
 | KV reuse is almost entirely intra-session | intra 93.2% / cross 5.7% / shared 1.1% |  |
 | Sessions are extremely skewed | on the production trace, top 1% / 5% / 10% / 25% / 50% = **46.5% / 66.5% / 74.6% / 87.5% / 96.0%** of input mass |  |
-| Per-request KV footprint is already large | p99 = 11.8 GiB ≈ 12% of H20 |  |
+| Per-request KV footprint is large; a single instance's KV pool fills up quickly | per-instance KV pool ≈ **38 GiB** (0.4 × 96 GiB H20; the rest is 50% params + 10% activation); a p99 request is 11.5 GiB → one instance holds only **3 p99 decodes**; 4P+4D halves system decode capacity |  |
 
 Theoretical APC upper bound = intra-session 79.6% / any-session 80.3%, a gap of <1pp. **Any scheduling without affinity loses the vast majority of reuse.**
 
@@ -50,7 +50,7 @@ Three essential differences between agentic workloads and chatbots:
 - **Prefill-dominated**: input/output token ratio of **75x**; 98% of compute is in the prefill phase (chatbots are 1-10x)
 - **Skewed sessions** (from the Qwen3 production trace, n=1.3M sessions / 2.1M requests / 7200s): the top 1% contribute **46.5%** of input tokens, top 5% **66.5%**, top 10% **74.6%**, top 25% **87.5%**, top 50% **96.0%**; half of the sessions account for nearly all of the input mass
 
-Average session length is TBD turns and TBD input tokens; p99 per-request KV footprint is **11.49 GiB** (12% of the H20's 96 GB HBM).
+Average session length is TBD turns and TBD input tokens. Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token): p50 **1.8 GiB**, p90 **8.0 GiB**, p95 **9.6 GiB**, p99 **11.5 GiB**. A single instance's KV pool ≈ 0.4 × 96 GiB = **38.4 GiB** (the rest is 50% bf16 model params + 10% runtime activation), so one instance can hold only **3 concurrent decodes** of p99 requests; switching to PD-disagg 4P+4D halves system decode capacity (system concurrency 24 → 12).
 
 ### §2.2 KV Cache Reuse Topology
 
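The 98304 B/token constant quoted in the hunk above can be related to token counts with a little arithmetic. The sketch below is hedged: the layer count, KV-head count, and head dimension are assumptions that are not stated in this commit (they happen to reproduce 98304 under bf16), and the helper names are hypothetical.

```python
# Assumed model config (NOT stated in the source): 48 layers, 4 KV heads,
# head_dim 128, bf16 (2 bytes per element). Both K and V are cached,
# hence the leading factor of 2.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
DTYPE_BYTES = 2

GIB = 2**30


def kv_bytes_per_token() -> int:
    """KV cache bytes for one token under the assumed config."""
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES


def tokens_for_footprint(footprint_gib: float, bytes_per_token: int = 98304) -> int:
    """Whole tokens whose KV cache fits in footprint_gib."""
    return int(footprint_gib * GIB) // bytes_per_token


def footprint_gib(num_tokens: int, bytes_per_token: int = 98304) -> float:
    """KV footprint in GiB for a request holding num_tokens cached tokens."""
    return num_tokens * bytes_per_token / GIB


if __name__ == "__main__":
    print(kv_bytes_per_token())
    print(tokens_for_footprint(11.5))
    print(footprint_gib(131072))
```

Under this constant, the p99 footprint of 11.5 GiB corresponds to roughly 125.6k cached tokens, and a 131072-token context would take 12 GiB.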
@@ -70,7 +70,7 @@ Breakdown of KV reuse on the trace:
 
 
 
 
 
 > 📝 Layout TBD: whether to combine the three figures into a 1×3 panel or spread them across §2.1/§2.2/§2.4, one each.
 
@@ -308,19 +308,67 @@ def fig_reuse_decomposition(reuse: dict, out: Path) -> None:
 
 
 def fig_kv_footprint_cdf(kv: dict, out: Path) -> None:
+    """How many concurrent decodes fit per percentile, under three deployments.
+
+    KV pool assumption: 96 GiB H20 HBM split ~50% model params (Qwen3-Coder-
+    30B-A3B in bf16 + headroom), ~10% runtime activations, leaving ~40% for
+    the KV cache pool, i.e. ~38.4 GiB per instance.
+
+    For each request-size percentile, we report system-wide concurrent
+    decode capacity = N_D × floor(KV_pool / req_size_MiB) under three 8-GPU
+    deployments: all-combined, 4P+4D, 6P+2D. The point is that going from
+    combined 8C to 4P+4D halves the system's decode population at the
+    same per-request KV pressure.
+    """
     s = kv.get("kv_mib_per_request") or {}
-    vals = [s.get(k) for k in ("p50", "p90", "p95", "p99")]
-    labels = ["p50", "p90", "p95", "p99"]
-    fig, ax = plt.subplots(figsize=(6, 3.5))
-    ax.bar(labels, vals, color="#1f77b4", edgecolor="black", linewidth=0.5)
-    for i, v in enumerate(vals):
-        ax.text(i, v, f"{v:.0f} MiB", ha="center", va="bottom", fontsize=9)
-    ax.axhline(95 * 1024, color="red", linestyle="--", alpha=0.5,
-               label="H20 ~95 GiB usable")
-    ax.set_ylabel("KV bytes per request (MiB)")
-    ax.set_title("B1' Per-request KV footprint (Qwen3-Coder-30B-A3B, 98304 B/token)")
-    ax.legend()
+    pct_keys = ["p50", "p90", "p95", "p99"]
+    req_mib = [float(s.get(k, 0.0)) for k in pct_keys]
+    req_gib = [v / 1024 for v in req_mib]
+
+    hbm_gib = 96.0
+    kv_pool_frac = 0.40
+    kv_pool_mib = hbm_gib * kv_pool_frac * 1024  # ≈ 39322 MiB per instance
+
+    deploys = [
+        ("Combined 8C", 8, "#2ca02c"),
+        ("PD-disagg 4P+4D", 4, "#ff7f0e"),
+        ("PD-disagg 6P+2D", 2, "#d62728"),
+    ]
+
+    import numpy as _np
+    x = _np.arange(len(pct_keys))
+    bar_w = 0.26
+
+    fig, ax = plt.subplots(figsize=(9, 5.2))
+    for i, (label, n_d, color) in enumerate(deploys):
+        per_inst = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+        sys_cap = [n_d * pi for pi in per_inst]
+        bars = ax.bar(x + (i - 1) * bar_w, sys_cap, bar_w,
+                      label=f"{label} (N_D={n_d})",
+                      color=color, edgecolor="black", linewidth=0.5)
+        for j, (b, n) in enumerate(zip(bars, sys_cap)):
+            ax.text(b.get_x() + b.get_width() / 2, n, str(n),
+                    ha="center", va="bottom", fontsize=9, color="#333")
+
+    # Annotate per-request KV size and per-instance fit just above the x-axis
+    per_inst_combined = [int(kv_pool_mib // r) if r > 0 else 0 for r in req_mib]
+    annot = [
+        f"{pct}\n{rg:.1f} GiB / req\nfits {pi}/inst"
+        for pct, rg, pi in zip(pct_keys, req_gib, per_inst_combined)
+    ]
+    ax.set_xticks(x)
+    ax.set_xticklabels(annot, fontsize=10)
+
+    ax.set_ylabel("System-wide concurrent decodes")
+    ax.set_title(
+        f"Per-instance KV pool ≈ {kv_pool_mib / 1024:.1f} GiB "
+        f"(0.4 × H20 96 GiB; remaining 0.5 model + 0.1 activation)\n"
+        f"PD-disagg halves the decode population at p90+ "
+        f"(Qwen3-Coder-30B-A3B, 98304 B/token)"
+    )
+    ax.legend(loc="upper right")
     ax.grid(alpha=0.3, axis="y")
+    ax.margins(y=0.15)
     fig.tight_layout()
     fig.savefig(out, dpi=120)
     plt.close(fig)
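The capacity computation in the hunk above is interleaved with plotting, and the 0.40 pool fraction is a hard-coded assumption. One way to make both easier to check is a plotting-free helper. This is only a sketch: `system_decode_capacity` and `EXAMPLE_SUMMARY` are hypothetical names, and the MiB values are made up to land on the per-instance counts quoted in the commit message; they are not the contents of kv_footprint_summary.json.

```python
PCT_KEYS = ("p50", "p90", "p95", "p99")


def system_decode_capacity(kv_summary, n_d, hbm_gib=96.0, kv_pool_frac=0.40):
    """Return {percentile: n_d * floor(kv_pool_mib / req_mib)}.

    Mirrors the arithmetic in the diff: missing or zero sizes count as 0.
    """
    per_req = kv_summary.get("kv_mib_per_request") or {}
    kv_pool_mib = hbm_gib * kv_pool_frac * 1024
    capacity = {}
    for key in PCT_KEYS:
        req_mib = float(per_req.get(key) or 0.0)
        fit = int(kv_pool_mib // req_mib) if req_mib > 0 else 0
        capacity[key] = n_d * fit
    return capacity


# Hypothetical per-request sizes in MiB, for illustration only.
EXAMPLE_SUMMARY = {
    "kv_mib_per_request": {
        "p50": 1900.0,
        "p90": 8192.0,
        "p95": 9800.0,
        "p99": 11776.0,
    }
}

if __name__ == "__main__":
    # How sensitive is the 4P+4D capacity to the assumed KV pool fraction?
    for frac in (0.35, 0.40, 0.50):
        print(frac, system_decode_capacity(EXAMPLE_SUMMARY, n_d=4, kv_pool_frac=frac))
```

With these illustrative inputs, the p99 fit per instance moves from 2 to 3 to 4 as the pool fraction goes from 0.35 to 0.40 to 0.50, so the headline "3 p99 decodes per instance" depends directly on the 50/10/40 split.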
Binary file not shown.
Before: 36 KiB | After: 73 KiB