Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts

Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is net negative under agentic workloads" paper section. It contains plot scripts for C1 (workload characteristics), C6 (roofline), and C7 (routing vs PD-sep as design levers), the already-rendered C6/C7 PDFs, and a README that maps candidate claims to required figures and lists open re-run items.

Removes --enforce-eager from bench.sh and all active launch scripts so that CUDA graphs are captured. The prior methodology suppressed one of PD-sep's structural advantages (fixed-shape decode on the D node).

Legacy scripts under scripts/legacy/ are intentionally untouched as historical records.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:24:16 +08:00
parent 6a27f75337
commit d71a111099
11 changed files with 576 additions and 9 deletions
--- a/analysis/pd_sep_paper_section/README.md
+++ b/analysis/pd_sep_paper_section/README.md
@@ -0,0 +1,87 @@
+# Paper section: PD separation under agentic workloads
+
+This directory collects everything produced for the "PD-sep is net negative
+on agentic workloads" paper section. It is one section of a larger paper,
+not the whole paper.
+
+## Layout
+
+```
+analysis/pd_sep_paper_section/
+├── README.md                       # this file
+├── scripts/
+│   ├── plot_workload.py            # C1: input/output CDF + KV reuse decomposition
+│   ├── plot_roofline.py            # C6: prefill roofline at varying cache reuse
+│   └── plot_routing_lever.py       # C7: routing vs PD-sep as design levers
+└── figures/
+    ├── fig_c6_roofline.pdf         # rendered locally (analytical, no trace needed)
+    ├── fig_c7_routing_lever.pdf    # rendered locally (from REPORT.md §3.1)
+    └── (fig_c1a_io_cdf.pdf,        # produced on dash0 when trace is available
+          fig_c1b_reuse.pdf)
+```
+
+## Candidate claims -> figures (status)
+
+| Claim | Figure | Status |
+|---|---|---|
+| C1: 98% prefill share + 91% intra-session KV reuse | `figures/fig_c1a_io_cdf.pdf`, `figures/fig_c1b_reuse.pdf` | **needs trace on dash0** |
+| C2: PD-sep vs Combined headline numbers | (not yet) | **needs re-run without --enforce-eager on `traces/w600_r0.0015_st30.jsonl`** |
+| C3: decode KV cache memory wall (time-series) | (not yet) | needs step-level vLLM telemetry during PD-sep run |
+| C4: TTFT stacked breakdown (prefill / KV pull / decode wait) | (not yet) | needs per-request breakdown.json from PD-sep run |
+| C5: cuda-graph ablation (eager vs cudagraph × Combined vs PD-sep) | (not yet) | needs the 2×2 matrix |
+| C6: prefill stays compute-bound at 95% reuse | `figures/fig_c6_roofline.pdf` | **rendered** |
+| C7: cache-aware routing is a larger lever than PD-sep | `figures/fig_c7_routing_lever.pdf` | **rendered** (legacy data, footer caveat) |
+
+## In-place edits made for this task
+
+These edits are in the repo, not in this directory, because they modify
+existing launch scripts. `--enforce-eager` was removed so cuda graphs can be
+captured — PD-sep's D-node is a particularly clean case for cuda-graph
+benefit and the prior methodology suppressed it.
+
+| File | Lines | Change |
+|---|---|---|
+| `scripts/bench.sh` | 150, 161 | drop `--enforce-eager` (elastic + baseline modes) |
+| `scripts/launch_pd_mooncake.sh` | 47, 64 | drop `--enforce-eager` (P and D instances) |
+| `scripts/launch_pd_separated.sh` | 52, 68 | drop `--enforce-eager` (P and D instances) |
+| `scripts/launch_phase1_ps.sh` | 32, 43 | drop `--enforce-eager` (C and PS instances) |
+| `scripts/launch_elastic_p2p.sh` | 57 | drop `--enforce-eager` (kv_both instances) |
+
+`scripts/legacy/*.sh` are intentionally left as-is — they record the
+configuration of past experiments.
+
+`REPORT.md` and `analysis/pd_separation_analysis.md` still describe the
+old `--enforce-eager` setup. Update them once the new runs land.
+
+## Reproducing the figures
+
+From repo root:
+
+```bash
+# C1 (needs sampled trace on dash0)
+.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_workload.py \
+    --trace traces/w600_r0.0015_st30.jsonl
+
+# C6 (analytical, runs anywhere with matplotlib)
+.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_roofline.py
+
+# C7 (hardcoded REPORT.md §3.1 numbers; no inputs)
+.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_routing_lever.py
+```
+
+All three default `--outdir` to `analysis/pd_sep_paper_section/figures`.
+
+## Caveats / open items
+
+- **C7 uses legacy data**. The footer of `fig_c7_routing_lever.pdf` says so:
+  PD-sep numbers come from the random-sampled trace + `--enforce-eager`. Re-run
+  on `traces/w600_r0.0015_st30.jsonl` with cuda-graphs on before paper-grade
+  citation. The plotting code keeps the source numbers in a single `ROWS`
+  table (top of `plot_routing_lever.py`) for a one-line swap.
+- **C2/C3/C4/C5 figures are not produced** because the experiments have not
+  been re-run. The previously proposed 4h run matrix is the prerequisite:
+  Combined + RR, Combined + cache-aware, PD-sep 4P+4D, PD-sep 6P+2D, plus
+  the eager-vs-cudagraph ablation, ×3 seeds.
+- **C6 is analytical**, so it is independent of any re-run. The numbers
+  match `scripts/compute_roofline.py` (constants are duplicated; if one
+  changes, the other must change too).
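
The lockstep requirement in the last caveat could be checked mechanically instead of by convention. The following is a minimal sketch, not part of this commit: `literal_constants` and `constant_drift` are hypothetical helper names, and it assumes both files define their constants as plain module-level literal assignments (as `plot_roofline.py` does). Whether `scripts/compute_roofline.py` uses the same constant names is not visible in this diff.

```python
import ast
from pathlib import Path


def literal_constants(path):
    """Collect module-level NAME = <literal> assignments without importing the file."""
    consts = {}
    tree = ast.parse(Path(path).read_text())
    for node in tree.body:
        if not isinstance(node, ast.Assign) or len(node.targets) != 1:
            continue
        target, value = node.targets[0], node.value
        if isinstance(target, ast.Name):
            pairs = [(target, value)]
        elif (isinstance(target, ast.Tuple) and isinstance(value, ast.Tuple)
              and len(target.elts) == len(value.elts)):
            # handles tuple assignments such as "L, D, H_KV = 48, 2048, 4"
            pairs = list(zip(target.elts, value.elts))
        else:
            continue
        for name_node, value_node in pairs:
            if not isinstance(name_node, ast.Name):
                continue
            try:
                consts[name_node.id] = ast.literal_eval(value_node)
            except (ValueError, TypeError, SyntaxError):
                pass  # derived values such as RIDGE = PEAK_FLOPS / HBM_BW are skipped
    return consts


def constant_drift(path_a, path_b):
    """Return {name: (value_in_a, value_in_b)} for shared names whose values differ."""
    a, b = literal_constants(path_a), literal_constants(path_b)
    return {k: (a[k], b[k]) for k in sorted(a.keys() & b.keys()) if a[k] != b[k]}
```

An empty dict from `constant_drift("analysis/pd_sep_paper_section/scripts/plot_roofline.py", "scripts/compute_roofline.py")` would indicate that every shared literal constant agrees. Parsing with `ast` avoids importing matplotlib just to compare numbers.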
--- a/analysis/pd_sep_paper_section/figures/fig_c6_roofline.pdf
+++ b/analysis/pd_sep_paper_section/figures/fig_c6_roofline.pdf
--- a/analysis/pd_sep_paper_section/figures/fig_c7_routing_lever.pdf
+++ b/analysis/pd_sep_paper_section/figures/fig_c7_routing_lever.pdf
--- a/analysis/pd_sep_paper_section/scripts/plot_roofline.py
+++ b/analysis/pd_sep_paper_section/scripts/plot_roofline.py
@@ -0,0 +1,144 @@
+"""C6: roofline plot for Qwen3-Coder-30B-A3B on H20.
+
+Reproduces the analytical roofline used in scripts/compute_roofline.py and
+plots it as a single PDF: AI vs achievable throughput, with annotated
+operating points for prefill at reuse {0, 70, 90, 95}% and decode.
+
+The constants must stay in lockstep with compute_roofline.py. If you change
+one, change the other.
+"""
+import argparse
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+# ---- model constants (mirror scripts/compute_roofline.py) ----
+L, D, H_KV, D_HEAD, D_FFN = 48, 2048, 4, 128, 6144
+K_EXPERTS = 8
+BYTES = 2  # bf16
+
+# ---- H20 ----
+PEAK_FLOPS = 148e12
+HBM_BW = 4.0e12
+RIDGE = PEAK_FLOPS / HBM_BW  # ~37
+
+
+def attn_prefill_flops(seq_len, new_tokens):
+    d_kv = H_KV * D_HEAD
+    qkv = new_tokens * (D * D * 2 + D * d_kv * 2 * 2)
+    attn = new_tokens * seq_len * D * 2 * 2
+    out = new_tokens * D * D * 2
+    return (qkv + attn + out) * L
+
+
+def attn_prefill_bytes(seq_len, new_tokens, cached_tokens):
+    d_kv = H_KV * D_HEAD
+    weight = D * (D + 2 * d_kv + D) * BYTES * L
+    cached_kv = cached_tokens * 2 * d_kv * BYTES * L
+    act = new_tokens * D * BYTES * 2 * L
+    new_kv = new_tokens * 2 * d_kv * BYTES * L
+    return weight + cached_kv + act + new_kv
+
+
+def ffn_flops(n):
+    return 3 * n * D * D_FFN * 2 * K_EXPERTS * L
+
+
+def ffn_bytes(n):
+    weight = K_EXPERTS * 3 * D * D_FFN * BYTES * L
+    act = n * D * BYTES * 2 * L
+    return weight + act
+
+
+def point(seq_len, reuse):
+    cached = int(seq_len * reuse)
+    new = max(1, seq_len - cached)
+    f = attn_prefill_flops(seq_len, new) + ffn_flops(new)
+    b = attn_prefill_bytes(seq_len, new, cached) + ffn_bytes(new)
+    return f, b, new
+
+
+def decode_point(seq_len):
+    f = attn_prefill_flops(seq_len, 1) + ffn_flops(1)
+    b = attn_prefill_bytes(seq_len, 1, seq_len) + ffn_bytes(1)
+    return f, b
+
+
+def plot(out_path, seq_len=64000):
+    fig, ax = plt.subplots(figsize=(6.5, 4.2))
+
+    ai_grid = np.logspace(-1, 5, 400)
+    achievable = np.minimum(ai_grid * HBM_BW, PEAK_FLOPS) / 1e12
+    ax.plot(ai_grid, achievable, color="#222", lw=1.5, label="H20 roofline")
+
+    ax.axvline(RIDGE, color="#888", ls=":", lw=1)
+    ax.text(RIDGE, 420, f"ridge = {RIDGE:.0f}", color="#666",
+            fontsize=8, ha="center", va="top",
+            bbox=dict(boxstyle="round,pad=0.2", fc="white", ec="none", alpha=0.85))
+
+    ax.axhline(PEAK_FLOPS / 1e12, color="#aaa", ls="--", lw=0.6)
+    ax.text(2, PEAK_FLOPS / 1e12 * 1.08, "compute ceiling (148 TFLOPS bf16)",
+            fontsize=8, color="#666", ha="left")
+
+    # operating points: use a legend (not annotations with leader lines, since
+    # all 4 prefill points sit on the compute ceiling and would overlap).
+    reuses = [0.0, 0.7, 0.9, 0.95]
+    colors = ["#d62728", "#ff7f0e", "#2ca02c", "#1f77b4"]
+    for reuse, color in zip(reuses, colors):
+        f, b, new = point(seq_len, reuse)
+        ai = f / b
+        thpt = min(ai * HBM_BW, PEAK_FLOPS) / 1e12
+        ax.scatter([ai], [thpt], color=color, s=80, zorder=5,
+                   edgecolor="white", linewidth=1.2,
+                   label=f"prefill reuse={int(reuse*100):>2}%  "
+                         f"(new={new:>6,} tok, AI={ai:>6,.0f})")
+
+    f, b = decode_point(seq_len)
+    ai_dec = f / b
+    thpt_dec = min(ai_dec * HBM_BW, PEAK_FLOPS) / 1e12
+    ax.scatter([ai_dec], [thpt_dec], color="#8c564b", s=80, marker="D",
+               zorder=5, edgecolor="white", linewidth=1.2,
+               label=f"decode  (per-token, seqlen={seq_len:,}, AI={ai_dec:.1f})")
+
+    ax.legend(loc="lower right", framealpha=0.95,  # font size is set via prop
+              prop={"family": "monospace", "size": 8})
+
+    ax.set_xscale("log")
+    ax.set_yscale("log")
+    ax.set_xlim(0.5, 1e5)
+    ax.set_ylim(0.5, 500)
+    ax.set_xlabel("Arithmetic intensity (FLOP/byte)")
+    ax.set_ylabel("Achievable throughput (TFLOPS)")
+    ax.set_title(
+        f"Prefill stays compute-bound even at 95% reuse  "
+        f"(Qwen3-Coder-30B-A3B, H20, seqlen={seq_len:,})",
+        fontsize=10,
+    )
+    ax.grid(True, which="both", alpha=0.25)
+    fig.tight_layout()
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C6] wrote {out_path}")
+    for reuse in reuses:
+        f, b, new = point(seq_len, reuse)
+        print(f"     reuse={int(reuse*100):>3}%  new={new:>6,}  AI={f/b:>8.1f}  "
+              f"bound={'COMPUTE' if f/b > RIDGE else 'MEMORY'}")
+    f, b = decode_point(seq_len)
+    print(f"     decode             AI={f/b:>8.1f}  bound={'COMPUTE' if f/b > RIDGE else 'MEMORY'}")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--seq-len", type=int, default=64000)
+    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
+    args = ap.parse_args()
+    out = Path(args.outdir)
+    out.mkdir(parents=True, exist_ok=True)
+    plot(out / "fig_c6_roofline.pdf", seq_len=args.seq_len)
+
+
+if __name__ == "__main__":
+    main()
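
For readers checking the C6 classification by hand, the roofline rule that `plot()` applies to each operating point reduces to one `min`. This is a standalone sketch that reuses the two H20 constants from the file above; `attainable_flops` and `bound` are illustrative names, not functions in the script.

```python
PEAK_FLOPS = 148e12   # H20 bf16 peak, same value as plot_roofline.py
HBM_BW = 4.0e12       # H20 HBM bandwidth in bytes/s, same value as plot_roofline.py
RIDGE = PEAK_FLOPS / HBM_BW  # arithmetic intensity where the two ceilings meet


def attainable_flops(ai):
    """Roofline: throughput is capped by bandwidth * AI or by peak compute."""
    return min(ai * HBM_BW, PEAK_FLOPS)


def bound(ai):
    """Same strict comparison the script uses when printing COMPUTE / MEMORY."""
    return "COMPUTE" if ai > RIDGE else "MEMORY"


for ai in (1.0, 10.0, RIDGE, 100.0, 5000.0):
    print(f"AI={ai:>7.1f}  attainable={attainable_flops(ai) / 1e12:>6.1f} TFLOPS  {bound(ai)}")
```

Any operating point with an arithmetic intensity above the ridge of 37 FLOP/byte sits on the flat 148 TFLOPS ceiling, which is why the four prefill points overlap in the figure and are distinguished only in the legend.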
--- a/analysis/pd_sep_paper_section/scripts/plot_routing_lever.py
+++ b/analysis/pd_sep_paper_section/scripts/plot_routing_lever.py
@@ -0,0 +1,123 @@
+"""C7: routing lever vs PD-separation lever.
+
+Side-by-side comparison of the magnitude of two design changes on the same
+agentic workload:
+  (A) Round-robin -> cache-aware routing, both Combined-mode
+  (B) Combined -> PD-separated, both cache-aware
+
+For each, plot delta TTFT p50 / TPOT p90 / APC. Green = improvement, red =
+regression. Numbers come from REPORT.md §3.1 (analysis/pd_separation_analysis.md §3.1).
+
+CAVEAT shown on the figure: these numbers are from the legacy
+trace methodology (random sampling, 1 req/GPU). They are not yet reproduced
+on the trace-driven 850-req sampling at production concurrency, and the
+PD-sep runs were captured with --enforce-eager. The current plot is meant
+to show the qualitative gap between the two levers; a re-run is required
+for paper-grade quantitative claims.
+"""
+import argparse
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+# (label, RR baseline, cache-aware baseline, PD-sep w/ cache-aware,
+#  unit, format, "improve_when_smaller")
+ROWS = [
+    ("TTFT p50 (s)",  1.836, 0.731, 1.261, "s",  "{:.2f}", True),
+    ("TPOT p90 (s)",  0.086, 0.073, 0.074, "s",  "{:.3f}", True),
+    ("APC (%)",       20.8,  44.7,  40.2,  "pp", "{:.1f}", False),
+]
+
+
+def pct_delta(before, after, improve_when_smaller):
+    """Return signed % change framed so positive = improvement.
+
+    For APC (pp): return absolute pp delta because relative % is misleading.
+    """
+    diff = after - before
+    if improve_when_smaller:
+        improvement = -(diff / before) * 100
+        return improvement, f"{improvement:+.0f}%"
+    pp = diff
+    return pp, f"{pp:+.1f}pp"
+
+
+def plot(out_path):
+    fig, axes = plt.subplots(1, 3, figsize=(10, 3.5))
+
+    bar_colors = lambda val: "#2ca02c" if val >= 0 else "#d62728"
+
+    for ax, (metric, rr, ca, pdsep, unit, fmt, smaller_better) in zip(axes, ROWS):
+        # lever A: RR -> cache-aware (both combined)
+        a_val, a_txt = pct_delta(rr, ca, smaller_better)
+        # lever B: combined -> PD-sep (both cache-aware)
+        b_val, b_txt = pct_delta(ca, pdsep, smaller_better)
+
+        bars = ax.bar(
+            ["RR → cache-aware\n(within Combined)",
+             "Combined → PD-Sep\n(both cache-aware)"],
+            [a_val, b_val],
+            color=[bar_colors(a_val), bar_colors(b_val)],
+            edgecolor="black", linewidth=0.6, width=0.55,
+        )
+
+        ymax = max(abs(a_val), abs(b_val))
+        ax.set_ylim(-ymax * 1.35, ymax * 1.35)
+        ax.axhline(0, color="black", lw=0.6)
+
+        for bar, val, txt in zip(bars, [a_val, b_val], [a_txt, b_txt]):
+            yoff = ymax * 0.06 if val >= 0 else -ymax * 0.06
+            ax.text(bar.get_x() + bar.get_width() / 2,
+                    val + yoff,
+                    txt,
+                    ha="center", va="bottom" if val >= 0 else "top",
+                    fontsize=10, fontweight="bold")
+
+        ax.set_title(metric, fontsize=10)
+        if smaller_better:
+            ax.set_ylabel("Δ (positive = improvement)")
+        else:
+            ax.set_ylabel("Δ percentage points")
+        ax.grid(True, axis="y", alpha=0.25)
+        ax.tick_params(axis="x", labelsize=8.5)
+        u = "" if unit == "pp" else unit
+        ax.set_xlabel(
+            f"RR={fmt.format(rr)}{u}  ·  CA={fmt.format(ca)}{u}  ·  PD-Sep={fmt.format(pdsep)}{u}",
+            fontsize=8, color="#555", labelpad=8,
+        )
+
+    fig.suptitle(
+        "Cache-aware routing is a larger lever than PD separation on agentic workload",
+        fontsize=11, y=1.02,
+    )
+    fig.tight_layout(rect=(0, 0.10, 1, 0.96))
+    footer = (
+        "Source: REPORT.md §3.1 / analysis/pd_separation_analysis.md §3.1. "
+        "Legacy random-sampling methodology + --enforce-eager. "
+        "Re-run on trace-driven w600_r0.0015_st30 with cuda-graph required before paper-grade citation."
+    )
+    fig.text(0.5, 0.01, footer, ha="center", fontsize=7.5, color="#666",
+             style="italic", wrap=True)
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C7] wrote {out_path}")
+    for metric, rr, ca, pdsep, unit, fmt, smaller in ROWS:
+        a, a_txt = pct_delta(rr, ca, smaller)
+        b, b_txt = pct_delta(ca, pdsep, smaller)
+        print(f"     {metric:14s}  RR→CA: {a_txt:>7s}   Combined→PD-Sep: {b_txt:>7s}")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
+    args = ap.parse_args()
+    out = Path(args.outdir)
+    out.mkdir(parents=True, exist_ok=True)
+    plot(out / "fig_c7_routing_lever.pdf")
+
+
+if __name__ == "__main__":
+    main()
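
The bar heights in C7 can be re-derived from `ROWS` without matplotlib. The sketch below copies the three rows from the file above and reports the absolute delta next to the relative one. `lever` is a hypothetical helper, not the script's `pct_delta`; it returns no relative value for APC, mirroring the script's choice to show APC in percentage points.

```python
# (metric, RR, cache-aware, PD-sep + cache-aware, smaller_is_better), copied from ROWS
ROWS = [
    ("TTFT p50 (s)", 1.836, 0.731, 1.261, True),
    ("TPOT p90 (s)", 0.086, 0.073, 0.074, True),
    ("APC (%)",      20.8,  44.7,  40.2,  False),
]


def lever(before, after, smaller_is_better):
    """Return (absolute delta, relative % with positive = improvement, or None)."""
    diff = after - before
    rel = -(diff / before) * 100 if smaller_is_better else None
    return diff, rel


def fmt(delta):
    diff, rel = delta
    return f"{diff:+.3f}" + (f" ({rel:+.0f}%)" if rel is not None else " pp")


levers = {}
for metric, rr, ca, pdsep, smaller in ROWS:
    levers[metric] = {
        "routing": lever(rr, ca, smaller),   # lever A: RR -> cache-aware
        "pd_sep": lever(ca, pdsep, smaller),  # lever B: Combined -> PD-sep
    }
    print(f"{metric:14s} routing: {fmt(levers[metric]['routing'])}   "
          f"PD-sep: {fmt(levers[metric]['pd_sep'])}")
```

Lever A is normalised by the RR baseline and lever B by the cache-aware baseline, so the two percentages are not on a common scale. For TTFT p50 the absolute deltas are -1.105 s for routing and +0.530 s for PD-sep, which may be the safer comparison to cite alongside the relative bars.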
--- a/analysis/pd_sep_paper_section/scripts/plot_workload.py
+++ b/analysis/pd_sep_paper_section/scripts/plot_workload.py
@@ -0,0 +1,217 @@
+"""C1: workload characterization figures.
+
+Generates two figures from the sampled trace:
+  fig_c1a_io_cdf.pdf  -- input / output token CDF (two panels)
+  fig_c1b_reuse.pdf   -- KV-block reuse decomposition
+
+Run on dash0 where the trace lives and matplotlib is installed.
+
+Usage:
+  .venv/bin/python analysis/pd_sep_paper_section/scripts/plot_workload.py \
+      --trace traces/w600_r0.0015_st30.jsonl \
+      --outdir analysis/pd_sep_paper_section/figures
+"""
+import argparse
+import json
+import sys
+from collections import Counter
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+BLOCK_SIZE = 512
+
+
+def load_trace(path):
+    with open(path) as fh:  # close the file promptly; ignore blank lines
+        rows = [json.loads(line) for line in fh if line.strip()]
+    return sorted(rows, key=lambda r: float(r["timestamp"]))
+
+
+def percentile_markers(arr, qs=(0.5, 0.9, 0.99)):
+    arr = np.asarray(arr)
+    return {q: float(np.quantile(arr, q)) for q in qs}
+
+
+def plot_io_cdf(rows, out_path):
+    inputs = np.array([r["input_length"] for r in rows if r["input_length"] > 0])
+    outputs = np.array([r["output_length"] for r in rows if r["output_length"] > 0])
+
+    fig, axes = plt.subplots(1, 2, figsize=(8.5, 3.2))
+
+    for ax, data, label, log in [
+        (axes[0], inputs, "input tokens (log scale)", True),
+        (axes[1], outputs, "output tokens", False),
+    ]:
+        sorted_d = np.sort(data)
+        cdf = np.arange(1, len(sorted_d) + 1) / len(sorted_d)
+        ax.plot(sorted_d, cdf, color="#1f77b4", lw=1.6)
+        if log:
+            ax.set_xscale("log")
+        ax.set_xlabel(label)
+        ax.set_ylabel("CDF")
+        ax.set_ylim(0, 1.02)
+        ax.grid(True, alpha=0.3)
+
+        pcts = percentile_markers(data)
+        for q, v in pcts.items():
+            ax.axvline(v, color="#888", ls=":", lw=0.8)
+            ax.annotate(
+                f"p{int(q*100)}={int(v):,}",
+                xy=(v, q),
+                xytext=(4, -8),
+                textcoords="offset points",
+                fontsize=8,
+                color="#444",
+            )
+
+    io_ratio = inputs.sum() / max(outputs.sum(), 1)
+    fig.suptitle(
+        f"Agentic workload I/O: aggregate ratio = {io_ratio:.1f}x  "
+        f"(N={len(rows)} requests, sampled from GLM-5.1)",
+        fontsize=10,
+    )
+    fig.tight_layout(rect=(0, 0, 1, 0.94))
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C1a] wrote {out_path}")
+    print(f"      input  p50={int(np.quantile(inputs, 0.5)):,} "
+          f"p90={int(np.quantile(inputs, 0.9)):,} "
+          f"p99={int(np.quantile(inputs, 0.99)):,}")
+    print(f"      output p50={int(np.quantile(outputs, 0.5)):,} "
+          f"p90={int(np.quantile(outputs, 0.9)):,} "
+          f"p99={int(np.quantile(outputs, 0.99)):,}")
+    print(f"      aggregate I/O ratio = {io_ratio:.2f}x")
+
+
+def reuse_decomposition(rows):
+    """Classify every cacheable block as intra-session / cross-session / unique.
+
+    Walk requests in timestamp order. For each block (hash_id) in the request:
+      - if first time seen globally -> 'unique-or-future-reuse' (resolved later)
+      - if already seen earlier within the same session -> 'intra-session'
+      - if already seen in a different session -> 'cross-session'
+    After the pass, first-seen blocks with a global refcount of 1 are
+    'unique'; those with refcount > 1 are counted once as 'first emission
+    (reused later)', and their later hits as intra- or cross-session reuse.
+
+    Token counts use BLOCK_SIZE = 512.
+    """
+    # Session id resolution mirrors analyze_cache_hit.py.
+    chat_to_session = {}
+    block_first_session = {}      # hid -> session_id of first emitter
+    block_seen_in_session = {}    # hid -> set of session_ids that have seen it
+    block_global_count = Counter()
+
+    intra = 0
+    cross = 0
+    first_time = 0  # token-count of blocks the first time they appear
+
+    for r in rows:
+        cid = int(r["chat_id"])
+        pid = int(r["parent_chat_id"])
+        sid = r.get("session_id",
+                    str(cid) if pid < 0 else chat_to_session.get(pid, str(pid)))
+        sid = str(sid)
+        chat_to_session[cid] = sid
+
+        for hid in r.get("hash_ids", []):
+            block_global_count[hid] += 1
+            if hid not in block_first_session:
+                block_first_session[hid] = sid
+                block_seen_in_session[hid] = {sid}
+                first_time += BLOCK_SIZE
+            else:
+                if sid in block_seen_in_session[hid]:
+                    intra += BLOCK_SIZE
+                else:
+                    cross += BLOCK_SIZE
+                    block_seen_in_session[hid].add(sid)
+
+    # Of the first-time tokens, those whose block was never reused are 'unique'.
+    unique_tokens = 0
+    reused_first = 0
+    for hid, count in block_global_count.items():
+        if count == 1:
+            unique_tokens += BLOCK_SIZE
+        else:
+            reused_first += BLOCK_SIZE  # first emission of a reused block
+
+    # Total tokens (block-rounded) = intra + cross + first_time
+    #   first_time decomposes into: unique_tokens + reused_first
+    # For the reuse story we attribute first_time to 'unique vs the
+    # first-emit-of-a-shared-block'. Convention used in the figure:
+    #   intra-session reuse = subsequent hits within the same session
+    #   cross-session reuse = subsequent hits across sessions
+    #   first emission (will-reuse) = block emitted once, reused later
+    #   unique (never-reuse) = block emitted exactly once, never hit again
+    return {
+        "intra_session_reuse_tokens": intra,
+        "cross_session_reuse_tokens": cross,
+        "first_emission_will_reuse_tokens": reused_first,
+        "unique_no_reuse_tokens": unique_tokens,
+    }
+
+
+def plot_reuse(rows, out_path):
+    d = reuse_decomposition(rows)
+    total = sum(d.values()) or 1  # avoid ZeroDivisionError if no row has hash_ids
+    parts = [
+        ("intra-session reuse",          d["intra_session_reuse_tokens"], "#2ca02c"),
+        ("cross-session reuse",          d["cross_session_reuse_tokens"], "#1f77b4"),
+        ("first emission (reused later)", d["first_emission_will_reuse_tokens"], "#ff7f0e"),
+        ("unique (never reused)",        d["unique_no_reuse_tokens"], "#d62728"),
+    ]
+
+    fig, ax = plt.subplots(figsize=(8.5, 1.9))
+    left = 0
+    for label, val, color in parts:
+        frac = val / total
+        ax.barh(0, frac, left=left, color=color, edgecolor="white", height=0.6, label=label)
+        if frac > 0.025:
+            ax.text(left + frac / 2, 0,
+                    f"{label}\n{frac*100:.1f}%",
+                    ha="center", va="center", fontsize=8.5, color="white")
+        left += frac
+
+    ax.set_xlim(0, 1)
+    ax.set_yticks([])
+    ax.set_xlabel("share of total cacheable tokens (block-aligned, 512 tok blocks)")
+    ax.set_title("Where do prefix cache hits come from?  "
+                 f"(N={len(rows)} requests, sampled trace)")
+    ax.legend(loc="upper center", bbox_to_anchor=(0.5, -0.45), ncol=4, fontsize=8, frameon=False)
+    for spine in ("top", "right", "left"):
+        ax.spines[spine].set_visible(False)
+    fig.tight_layout()
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C1b] wrote {out_path}")
+    for label, val, _ in parts:
+        print(f"      {label:40s} {val/total*100:5.1f}%  ({val:>12,} tokens)")
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--trace", default="traces/w600_r0.0015_st30.jsonl")
+    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
+    args = ap.parse_args()
+
+    trace = Path(args.trace)
+    outdir = Path(args.outdir)
+    outdir.mkdir(parents=True, exist_ok=True)
+
+    if not trace.exists():
+        sys.exit(f"trace not found: {trace}")
+
+    rows = load_trace(trace)
+    print(f"loaded {len(rows)} requests from {trace}")
+
+    plot_io_cdf(rows, outdir / "fig_c1a_io_cdf.pdf")
+    plot_reuse(rows, outdir / "fig_c1b_reuse.pdf")
+
+
+if __name__ == "__main__":
+    main()
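
The four categories produced by `reuse_decomposition` are easiest to sanity-check on a toy trace. The following is a condensed re-statement of the same classification rule, not the function itself: `classify` and the `(session_id, hash_ids)` tuple input are assumptions made for brevity, and session-id resolution from `chat_id` / `parent_chat_id` is left out.

```python
from collections import Counter

BLOCK_SIZE = 512  # same block size as plot_workload.py


def classify(requests):
    """requests: list of (session_id, [hash_ids]) in timestamp order."""
    seen_by = {}        # hash_id -> sessions that have touched the block
    hits = Counter()    # hash_id -> total appearances across the trace
    tokens = Counter()  # category -> block-rounded token count
    for sid, hash_ids in requests:
        for hid in hash_ids:
            hits[hid] += 1
            if hid not in seen_by:
                seen_by[hid] = {sid}            # first emission, resolved below
            elif sid in seen_by[hid]:
                tokens["intra"] += BLOCK_SIZE   # repeat hit in the same session
            else:
                tokens["cross"] += BLOCK_SIZE   # first hit from another session
                seen_by[hid].add(sid)
    # Split first emissions by whether the block was ever seen again.
    for hid, n in hits.items():
        tokens["unique" if n == 1 else "first_emission"] += BLOCK_SIZE
    return dict(tokens)


toy = [
    ("s1", [1, 2]),     # blocks 1 and 2 are new
    ("s1", [1, 2, 3]),  # 1 and 2 are intra-session hits; 3 is new
    ("s2", [1, 4]),     # 1 is a cross-session hit; 4 is new
]
result = classify(toy)
print(result)
```

The toy trace has seven block appearances, and the four categories sum to 7 × 512 tokens. That sum is the denominator `plot_reuse` uses for its shares.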
--- a/scripts/bench.sh
+++ b/scripts/bench.sh
@@ -147,7 +147,7 @@ launch_instances() {
            $VLLM serve "$MODEL" \
                --host 0.0.0.0 --port $port \
                --tensor-parallel-size 1 \
-                --trust-remote-code --enable-prefix-caching --enforce-eager \
+                --trust-remote-code --enable-prefix-caching \
                --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
                --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
                $vllm_extra_args \
@@ -158,7 +158,7 @@ launch_instances() {
            $VLLM serve "$MODEL" \
                --host 0.0.0.0 --port $port \
                --tensor-parallel-size 1 \
-                --trust-remote-code --enable-prefix-caching --enforce-eager \
+                --trust-remote-code --enable-prefix-caching \
                --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
                $vllm_extra_args \
                > "$logfile" 2>&1 &
--- a/scripts/launch_elastic_p2p.sh
+++ b/scripts/launch_elastic_p2p.sh
@@ -54,7 +54,7 @@ for i in $(seq 0 $((N_INSTANCES - 1))); do
    $VLLM serve "$MODEL" \
        --host 0.0.0.0 --port $port \
        --tensor-parallel-size 1 \
-        --trust-remote-code --enable-prefix-caching --enforce-eager \
+        --trust-remote-code --enable-prefix-caching \
        --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --kv-transfer-config \
        '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
--- a/scripts/launch_pd_mooncake.sh
+++ b/scripts/launch_pd_mooncake.sh
@@ -44,7 +44,6 @@ $VLLM serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --enable-prefix-caching \
-    --enforce-eager \
    --dtype auto \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
@@ -61,7 +60,6 @@ $VLLM serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --enable-prefix-caching \
-    --enforce-eager \
    --dtype auto \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config \
--- a/scripts/launch_pd_separated.sh
+++ b/scripts/launch_pd_separated.sh
@@ -49,7 +49,6 @@ CUDA_VISIBLE_DEVICES=0,1,2,3 $VLLM serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --enable-prefix-caching \
-    --enforce-eager \
    --dtype auto \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config \
@@ -65,7 +64,6 @@ CUDA_VISIBLE_DEVICES=4,5,6,7 $VLLM serve "$MODEL_PATH" \
    --tensor-parallel-size 4 \
    --trust-remote-code \
    --enable-prefix-caching \
-    --enforce-eager \
    --dtype auto \
    --gpu-memory-utilization 0.8 \
    --kv-transfer-config \
--- a/scripts/launch_phase1_ps.sh
+++ b/scripts/launch_phase1_ps.sh
@@ -29,7 +29,7 @@ for i in $(seq 0 6); do
    echo "Starting C instance $i on GPU $i, port $((8000+i)), bootstrap $((8998+i))"
    VLLM_MOONCAKE_BOOTSTRAP_PORT=$((8998+i)) MASTER_PORT=$((29500+i)) CUDA_VISIBLE_DEVICES=$i \
    .venv/bin/vllm serve "$MODEL" --host 0.0.0.0 --port $((8000+i)) --tensor-parallel-size 1 \
-        --trust-remote-code --enable-prefix-caching --enforce-eager \
+        --trust-remote-code --enable-prefix-caching \
        --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        > "$OUTDIR/vllm_c_$i.log" 2>&1 &
@@ -40,7 +40,7 @@ done
 echo "=== Launching PS instance on GPU 7, port 8007, bootstrap 9005 ==="
 VLLM_MOONCAKE_BOOTSTRAP_PORT=9005 MASTER_PORT=29507 CUDA_VISIBLE_DEVICES=7 \
 .venv/bin/vllm serve "$MODEL" --host 0.0.0.0 --port 8007 --tensor-parallel-size 1 \
-    --trust-remote-code --enable-prefix-caching --enforce-eager \
+    --trust-remote-code --enable-prefix-caching \
    --dtype auto --gpu-memory-utilization 0.9 --max-model-len 200000 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
    > "$OUTDIR/vllm_ps_0.log" 2>&1 &