Captures 5 runs from the experiment matrix (combined-ca x3 seeds, pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl with cuda graphs enabled.

The headline:
  combined-ca: TTFT p50 0.91s, success 99.5%
  pdsep-4p4d: TTFT p50 62.8s, success 52% (69x worse, half dropped)
  pdsep-6p2d: TTFT p50 51.1s, success 68% (56x worse, third dropped)

C2 (fig_c2): headline bars per config with error bars.

C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep splits hit the memory wall, but the side differs by P:D ratio -- 4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures P-side).

C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side prefill compute; D-side wait + first token is <=1.2s. The bottleneck is P-side prefill queueing, not D-side decode wait as the original analytical model assumed.

system_analysis.md gains a Layer 5b that reconciles the analytical KV-wall model (which considered D-side only) with the empirical finding that the wall hits whichever side has fewer GPUs, and co-saturates both at extreme splits via D-side back-pressure.

plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.

bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during this work but not used by the current matrix's data).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Paper section: PD separation under agentic workloads
This directory collects everything produced for the "PD-sep is net negative on agentic workloads" paper section. It is one section of a larger paper, not the whole paper.
Layout
analysis/pd_sep_paper_section/
├── README.md  # this file
├── system_analysis.md  # why PD-sep loses despite compute-bound prefill (6 layers)
├── scripts/
│   ├── plot_workload.py  # C1: input/output CDF + KV reuse decomposition
│   ├── plot_roofline.py  # C6: prefill roofline at varying cache reuse
│   ├── plot_routing_lever.py  # C7: routing vs PD-sep as design levers
│   ├── plot_kv_memory_wall.py  # KV mem-wall model + empirical anchor
│   └── bench_pd_matrix.sh  # orchestrates the C2/C3/C4/C5 experiment matrix on dash0
└── figures/
    ├── fig_c1a_io_cdf.pdf  # input/output token CDF (from traces/w600_r0.0015_st30.jsonl)
    ├── fig_c1b_reuse.pdf  # KV reuse decomposition: 79% intra-session
    ├── fig_c6_roofline.pdf  # analytical roofline
    ├── fig_c7_routing_lever.pdf  # routing vs PD-sep (legacy data, footer caveat)
    └── fig_kv_memory_wall.pdf  # the explanatory figure for system_analysis.md
Candidate claims -> figures (status)
| Claim | Figure | Status |
|---|---|---|
| C1a: agentic input distribution (p50=33.5k, p90=101k, p99=132k); I/O = 142x | figures/fig_c1a_io_cdf.pdf | rendered |
| C1b: 79% intra-session reuse + 0.8% cross-session | figures/fig_c1b_reuse.pdf | rendered |
| C2: PD-sep vs Combined headline (TTFT 69× worse, success 52%) | figures/fig_c2_pdsep_vs_combined.pdf | rendered (N=3 combined-ca, N=1 each PD-sep config) |
| C3: KV cache time-series (both PD-sep splits hit the wall) | figures/fig_c3_kv_timeseries.pdf | rendered |
| C4: TTFT decomposition (99% is P-side prefill compute) | figures/fig_c4_ttft_stacked.pdf | rendered |
| C5: cuda-graph ablation (eager vs cudagraph × Combined vs PD-sep) | (not yet) | needs --with-eager re-run |
| C6: prefill stays compute-bound at 95% reuse | figures/fig_c6_roofline.pdf | rendered |
| C7: cache-aware routing is a larger lever than PD-sep | figures/fig_c7_routing_lever.pdf | rendered (legacy data, footer caveat) |
| KV-WALL: per-D-instance KV demand vs PD layout (system mechanism) | figures/fig_kv_memory_wall.pdf | rendered (analytical, audit constants in script) |
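The C2 headline multipliers follow directly from the p50 TTFT and success numbers reported for the matrix runs (combined-ca 0.91 s / 99.5%, pdsep-4p4d 62.8 s / 52%, pdsep-6p2d 51.1 s / 68%). A minimal sketch of that arithmetic; the dictionary and function names are illustrative and do not come from any script in this repo:

```python
# p50 TTFT (seconds) and success rate as reported for the matrix runs on
# traces/w600_r0.0015_st30.jsonl with cuda graphs enabled.
TTFT_P50_S = {"combined-ca": 0.91, "pdsep-4p4d": 62.8, "pdsep-6p2d": 51.1}
SUCCESS_RATE = {"combined-ca": 0.995, "pdsep-4p4d": 0.52, "pdsep-6p2d": 0.68}


def slowdown_vs_baseline(config: str, baseline: str = "combined-ca") -> float:
    """Ratio of a config's p50 TTFT to the baseline's p50 TTFT."""
    return TTFT_P50_S[config] / TTFT_P50_S[baseline]


def dropped_fraction(config: str) -> float:
    """Fraction of requests that did not succeed."""
    return 1.0 - SUCCESS_RATE[config]


if __name__ == "__main__":
    for name in ("pdsep-4p4d", "pdsep-6p2d"):
        print(
            f"{name}: {slowdown_vs_baseline(name):.0f}x slower, "
            f"{dropped_fraction(name):.0%} dropped"
        )
```

Note that the PD-sep rows are N=1 each, so these ratios carry no seed-to-seed spread.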
System-level argument (system_analysis.md)
The doc answers: if prefill stays compute-bound even at 95% reuse, why does PD separation not help? Six layers, each pointing to a figure in this directory:
- compute-bound is a kernel property, not a system claim
- absolute prefill work after cache hit is small (~hundreds of ms savings ceiling)
- PD separation relocates compute; it doesn't accelerate it
- PD separation's costs (KV transfer, decode-side concentration) scale with workload size
- decode-side KV memory wall, quantified in fig_kv_memory_wall.pdf
- the DistServe / Splitwise assumption that silently breaks: concurrent × KV/req / (N_D × HBM) is ≪ 1 for chatbot but ≥ 1 for agentic at p90+ context
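The ratio in the last layer, concurrent × KV/req / (N_D × HBM), can be evaluated with a few lines of Python. This is a sketch only: the function name, the GiB units, and every number in the usage example are placeholders chosen for illustration, not the audit constants from plot_kv_memory_wall.py and not measured values.

```python
def kv_wall_ratio(
    concurrent: int,
    kv_gib_per_req: float,
    n_decode: int,
    hbm_gib_per_gpu: float,
) -> float:
    """concurrent x KV/req / (N_D x HBM).

    Values well below 1 mean decode-side KV fits comfortably; values at or
    above 1 mean the decode instances cannot hold the resident KV.
    Simplification (assumption): all of HBM is treated as available for KV,
    so model weights and activations are ignored.
    """
    if n_decode <= 0 or hbm_gib_per_gpu <= 0:
        raise ValueError("n_decode and hbm_gib_per_gpu must be positive")
    return concurrent * kv_gib_per_req / (n_decode * hbm_gib_per_gpu)


if __name__ == "__main__":
    # Placeholder inputs, NOT taken from the traces or from REPORT.md.
    chatbot = kv_wall_ratio(concurrent=64, kv_gib_per_req=0.25,
                            n_decode=4, hbm_gib_per_gpu=80.0)
    agentic = kv_wall_ratio(concurrent=64, kv_gib_per_req=6.0,
                            n_decode=2, hbm_gib_per_gpu=80.0)
    print(f"chatbot-like ratio: {chatbot:.2f}")
    print(f"agentic-like ratio: {agentic:.2f}")
```

Because N_D sits in the denominator, shifting GPUs from D to P (for example 4P+4D to 6P+2D) raises the ratio for a fixed workload.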
In-place edits made for this task
These edits live elsewhere in the repo rather than in this directory, because
they modify existing launch scripts. --enforce-eager was removed so that cuda
graphs can be captured: PD-sep's D-node is a particularly clean case for
cuda-graph benefit, and the prior methodology suppressed it.
| File | Lines | Change |
|---|---|---|
| scripts/bench.sh | 150, 161 | drop --enforce-eager (elastic + baseline modes) |
| scripts/launch_pd_mooncake.sh | 47, 64 | drop --enforce-eager (P and D instances) |
| scripts/launch_pd_separated.sh | 52, 68 | drop --enforce-eager (P and D instances) |
| scripts/launch_phase1_ps.sh | 32, 43 | drop --enforce-eager (C and PS instances) |
| scripts/launch_elastic_p2p.sh | 57 | drop --enforce-eager (kv_both instances) |
scripts/legacy/*.sh are intentionally left as-is — they record the
configuration of past experiments.
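A quick way to confirm that the flag is gone from the active launchers while scripts/legacy/ stays untouched is to scan the shell scripts for it. The helper below is a hypothetical sketch (find_flag is not part of the repo), and it assumes the launchers are the *.sh files under scripts/:

```python
from pathlib import Path

FLAG = "--enforce-eager"


def find_flag(scripts_dir: Path, skip_dirs=("legacy",)) -> list[tuple[Path, int]]:
    """Return (file, line number) pairs where FLAG appears outside comments.

    Files under any directory named in skip_dirs are ignored, since the
    legacy scripts intentionally keep their historical configuration.
    """
    hits = []
    for path in sorted(scripts_dir.rglob("*.sh")):
        parents = path.relative_to(scripts_dir).parts[:-1]
        if any(part in skip_dirs for part in parents):
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            # Skip comment-only lines that merely mention the flag.
            if FLAG in line and not line.lstrip().startswith("#"):
                hits.append((path, lineno))
    return hits


if __name__ == "__main__":
    for path, lineno in find_flag(Path("scripts")):
        print(f"{path}:{lineno}")
```

A hit is not automatically a regression: if bench.sh implements its --eager ablation by passing the flag conditionally, that line will still show up and needs a manual look.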
REPORT.md and analysis/pd_separation_analysis.md still describe the
old --enforce-eager setup. Update them once the new runs land.
Reproducing the figures
From repo root:
# C1 (needs traces/w600_r0.0015_st30.jsonl; ~1.2 MB, pull from dash0 if missing)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_workload.py \
--trace traces/w600_r0.0015_st30.jsonl
# C6 (analytical, runs anywhere with matplotlib)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_roofline.py
# C7 (hardcoded REPORT.md §3.1 numbers; no inputs)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_routing_lever.py
# KV mem-wall (analytical; audit constants at top of the script)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_kv_memory_wall.py
All four scripts default --outdir to analysis/pd_sep_paper_section/figures.
Running the experiment matrix (gating C2/C3/C4/C5)
bench_pd_matrix.sh orchestrates the experiments that claims C2/C3/C4/C5
depend on. It uses the extended scripts/bench.sh, which now supports
--mode pdsep --pd-ratio 4:4|6:2 and --eager for the cuda-graph
ablation; the launchers no longer pin --enforce-eager.
On dash0:
cd ~/agentic-kv
# minimal set: 3 configs (combined-ca, pdsep-4p4d, pdsep-6p2d) x 3 seeds
# = 9 runs ~= 2 h
bash analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh
# full matrix (adds combined-rr and the eager ablation)
bash analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh \
--with-rr --with-eager
Each run writes to outputs/pd_matrix/<config>_<mode>_seed<N>/ with
metrics.summary.json, breakdown.json, apc.txt, gpu_util.csv, and
per-instance vLLM logs (the latter contain the step-level
KV cache: X% lines needed for the C3 time-series figure).
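Turning those step-level lines into a series for the C3 figure is mostly a regex pass over each instance log. The sketch below is an assumption-laden example: the exact log wording is not specified here beyond "KV cache: X%", so the pattern also tolerates a "KV cache usage: X%" variant, and the function names are hypothetical rather than taken from any plotter in this repo.

```python
import re
from pathlib import Path

# Assumed line shape: "... KV cache: 12.5% ..." or "... KV cache usage: 12.5% ...".
KV_LINE = re.compile(r"KV cache(?: usage)?:\s*(\d+(?:\.\d+)?)\s*%")


def kv_utilization(lines) -> list[float]:
    """Extract KV utilization percentages, in log order, from an iterable of lines."""
    series = []
    for line in lines:
        match = KV_LINE.search(line)
        if match:
            series.append(float(match.group(1)))
    return series


def kv_utilization_from_log(path: Path) -> list[float]:
    """Read one per-instance log file and return its KV utilization series."""
    with path.open(errors="replace") as handle:
        return kv_utilization(handle)


if __name__ == "__main__":
    sample = [
        "INFO step=1 KV cache: 12.5% running=3",
        "INFO unrelated line",
        "INFO step=2 KV cache: 97.0% running=9",
    ]
    print(kv_utilization(sample))
```

The series is indexed by occurrence, not by wall-clock time; plotting against time would need the timestamp format of the logs, which is not assumed here.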
Caveats / open items
- C7 uses legacy data. The footer of fig_c7_routing_lever.pdf says so: PD-sep numbers come from the random-sampled trace + --enforce-eager. After pd_matrix lands, swap the four numbers in plot_routing_lever.py's ROWS table and re-render.
- C2/C3/C4/C5 figures depend on pd_matrix outputs. Followup plotters (TBD) will read outputs/pd_matrix/*/metrics.summary.json, breakdown.json, and the KV cache: X% lines from per-instance logs to produce: bar chart with error bars (C2), KV utilization time-series (C3), TTFT stacked breakdown (C4), 2x2 cuda-graph ablation (C5).
- C8 (mined logs): rejected, because existing PD-sep outputs/exp3_* directories have per-request metrics but no per-stage breakdown and no step-level KV utilization. C3/C4 require fresh runs with proxy /breakdown collection (already automatic in bench.sh collect_artifacts()).
- C6 is analytical, so it is independent of any re-run. The numbers match scripts/compute_roofline.py (constants are duplicated; if one changes, the other must change too).
- fig_kv_memory_wall.pdf is analytical with one empirical anchor (the star marker for REPORT.md §3.3 6P+2D @ 97 % KV utilization). It does not need a re-run, but the empirical anchor's pinpoint would be more rigorous from a pd_matrix 6P+2D log (KV-utilization time-series rather than the single snapshot).
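For the C2 bar chart with error bars, the per-seed summaries have to be grouped by configuration and reduced to a mean and a spread. A minimal sketch of that aggregation, assuming run directories named <config>_<mode>_seed<N> as described above; the JSON key "ttft_p50" is a hypothetical field name, since the schema of metrics.summary.json is not documented here:

```python
import json
import re
import statistics
from collections import defaultdict
from pathlib import Path

# Matches "<config>_<mode>_seed<N>"; everything before "_seed<N>" is the label.
RUN_DIR = re.compile(r"^(?P<label>.+)_seed(?P<seed>\d+)$")


def aggregate(root: Path, key: str = "ttft_p50") -> dict[str, tuple[float, float, int]]:
    """Return {label: (mean, sample stdev, n_seeds)} for one summary metric.

    `key` is an assumed field name; runs whose summary lacks it are skipped.
    With a single seed the stdev is reported as 0.0 (no spread is estimable).
    """
    per_label = defaultdict(list)
    for summary in sorted(root.glob("*/metrics.summary.json")):
        match = RUN_DIR.match(summary.parent.name)
        if match is None:
            continue
        data = json.loads(summary.read_text())
        if key in data:
            per_label[match["label"]].append(float(data[key]))
    result = {}
    for label, values in per_label.items():
        spread = statistics.stdev(values) if len(values) > 1 else 0.0
        result[label] = (statistics.mean(values), spread, len(values))
    return result


if __name__ == "__main__":
    for label, (mean, spread, n) in aggregate(Path("outputs/pd_matrix")).items():
        print(f"{label}: mean={mean:.2f} stdev={spread:.2f} n={n}")
```

Returning the seed count alongside the mean makes N=1 configurations easy to flag in the figure instead of drawing a misleading zero-height error bar.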