agentic-kvc/analysis/pd_sep_paper_section/README.md
Gahow Wang cd82b8c2a2 PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined
Captures 5 runs from the experiment matrix (combined-ca x3 seeds,
pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl
with cuda graphs enabled. The headline:

  combined-ca:  TTFT p50 0.91s   success 99.5%
  pdsep-4p4d:   TTFT p50 62.8s   success 52%   (69x worse, half dropped)
  pdsep-6p2d:   TTFT p50 51.1s   success 68%   (56x worse, third dropped)

C2 (fig_c2): headline bars per config with error bars.
C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep
  splits hit the memory wall, but the side differs by P:D ratio --
  4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures
  P-side).
C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side
  prefill compute; D-side wait + first token is <=1.2s. The bottleneck
  is P-side prefill queueing, not D-side decode wait as the original
  analytical model assumed.

system_analysis.md gains a Layer 5b that reconciles the analytical
KV-wall model (which considered D-side only) with the empirical
finding that the wall hits whichever side has fewer GPUs, and
co-saturates both at extreme splits via D-side back-pressure.

plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.
bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during
this work but not used by the current matrix's data).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:23:52 +08:00


Paper section: PD separation under agentic workloads

This directory collects everything produced for the "PD-sep is net negative on agentic workloads" paper section. It is one section of a larger paper, not the whole paper.

Layout

analysis/pd_sep_paper_section/
├── README.md                       # this file
├── system_analysis.md              # why PD-sep loses despite compute-bound prefill (6 layers)
├── scripts/
│   ├── plot_workload.py            # C1: input/output CDF + KV reuse decomposition
│   ├── plot_roofline.py            # C6: prefill roofline at varying cache reuse
│   ├── plot_routing_lever.py       # C7: routing vs PD-sep as design levers
│   ├── plot_kv_memory_wall.py      # KV mem-wall model + empirical anchor
│   └── bench_pd_matrix.sh          # orchestrates the C2/C3/C4/C5 experiment matrix on dash0
└── figures/
    ├── fig_c1a_io_cdf.pdf          # input/output token CDF (from traces/w600_r0.0015_st30.jsonl)
    ├── fig_c1b_reuse.pdf           # KV reuse decomposition: 79% intra-session
    ├── fig_c6_roofline.pdf         # analytical roofline
    ├── fig_c7_routing_lever.pdf    # routing vs PD-sep (legacy data, footer caveat)
    └── fig_kv_memory_wall.pdf      # the explanatory figure for system_analysis.md

Candidate claims -> figures (status)

| Claim | Figure | Status |
|---|---|---|
| C1a: agentic input distribution (p50=33.5k, p90=101k, p99=132k); I/O = 142x | figures/fig_c1a_io_cdf.pdf | rendered |
| C1b: 79% intra-session reuse + 0.8% cross-session | figures/fig_c1b_reuse.pdf | rendered |
| C2: PD-sep vs Combined headline (TTFT 69× worse, success 52%) | figures/fig_c2_pdsep_vs_combined.pdf | rendered (N=3 combined-ca, N=1 each PD-sep config) |
| C3: KV cache time-series; both PD-sep splits hit the wall | figures/fig_c3_kv_timeseries.pdf | rendered |
| C4: TTFT decomposition; 99% is P-side prefill compute | figures/fig_c4_ttft_stacked.pdf | rendered |
| C5: cuda-graph ablation (eager vs cudagraph × Combined vs PD-sep) | (not yet) | needs --with-eager re-run |
| C6: prefill stays compute-bound at 95% reuse | figures/fig_c6_roofline.pdf | rendered |
| C7: cache-aware routing is a larger lever than PD-sep | figures/fig_c7_routing_lever.pdf | rendered (legacy data, footer caveat) |
| KV-WALL: per-D-instance KV demand vs PD layout (system mechanism) | figures/fig_kv_memory_wall.pdf | rendered (analytical, audit constants in script) |

System-level argument (system_analysis.md)

The doc answers: if prefill stays compute-bound even at 95% reuse, why does PD separation not help? Six layers, each pointing to a figure in this directory:

  1. compute-bound is a kernel property, not a system claim
  2. absolute prefill work after cache hit is small (~hundreds of ms savings ceiling)
  3. PD separation relocates compute; it doesn't accelerate it
  4. PD separation's costs (KV transfer, decode-side concentration) scale with workload size
  5. decode-side KV memory wall — quantified in fig_kv_memory_wall.pdf
  6. the DistServe / Splitwise assumption that silently breaks: concurrent × KV/req / (N_D × HBM) is ≪ 1 for chatbot but ≥ 1 for agentic at p90+ context
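The ratio in layer 6 can be made concrete with a few lines of arithmetic. The sketch below is not the model in plot_kv_memory_wall.py. Only the agentic p90 context (101k tokens) and the decode-instance counts (4 and 2) come from this README. The KV bytes per token, the HBM left for KV, the concurrency, and the chatbot context length are placeholder assumptions, chosen only to show how the ratio moves from well below 1 to above 1.

```python
def kv_wall_ratio(concurrent, tokens_per_req, kv_bytes_per_token,
                  n_decode, hbm_kv_bytes):
    """concurrent x KV/req / (N_D x HBM); a value >= 1 means the
    decode side cannot hold the in-flight KV cache."""
    if n_decode <= 0 or hbm_kv_bytes <= 0:
        raise ValueError("n_decode and hbm_kv_bytes must be positive")
    kv_per_req = tokens_per_req * kv_bytes_per_token
    return concurrent * kv_per_req / (n_decode * hbm_kv_bytes)


# ASSUMED placeholders, not measured values from this repo.
KV_BYTES_PER_TOKEN = 128 * 1024      # assumed 128 KiB of KV per token
HBM_KV_BYTES = 40 * 1024**3          # assumed 40 GiB per GPU left for KV
CONCURRENT = 32                      # assumed in-flight requests
CHATBOT_TOKENS = 2_000               # assumed typical chatbot context

AGENTIC_P90_TOKENS = 101_000         # p90 input length from claim C1a

chatbot = kv_wall_ratio(CONCURRENT, CHATBOT_TOKENS,
                        KV_BYTES_PER_TOKEN, 4, HBM_KV_BYTES)
agentic_4d = kv_wall_ratio(CONCURRENT, AGENTIC_P90_TOKENS,
                           KV_BYTES_PER_TOKEN, 4, HBM_KV_BYTES)
agentic_2d = kv_wall_ratio(CONCURRENT, AGENTIC_P90_TOKENS,
                           KV_BYTES_PER_TOKEN, 2, HBM_KV_BYTES)

print(f"chatbot, 4 D-instances:     {chatbot:.3f}")
print(f"agentic p90, 4 D-instances: {agentic_4d:.2f}")
print(f"agentic p90, 2 D-instances: {agentic_2d:.2f}")
```

Under these assumptions the chatbot case stays far below 1, while the agentic p90 case exceeds 1 with four decode instances and doubles again when N_D drops to two. For the audited constants, use the ones at the top of plot_kv_memory_wall.py.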

In-place edits made for this task

These edits live elsewhere in the repo, not in this directory, because they modify existing launch scripts. --enforce-eager was removed so that cuda graphs can be captured. PD-sep's D-node is a particularly clean case for cuda-graph benefit, and the prior methodology suppressed it.

| File | Lines | Change |
|---|---|---|
| scripts/bench.sh | 150, 161 | drop --enforce-eager (elastic + baseline modes) |
| scripts/launch_pd_mooncake.sh | 47, 64 | drop --enforce-eager (P and D instances) |
| scripts/launch_pd_separated.sh | 52, 68 | drop --enforce-eager (P and D instances) |
| scripts/launch_phase1_ps.sh | 32, 43 | drop --enforce-eager (C and PS instances) |
| scripts/launch_elastic_p2p.sh | 57 | drop --enforce-eager (kv_both instances) |

scripts/legacy/*.sh are intentionally left as-is — they record the configuration of past experiments.
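One way to confirm that the edits listed above are complete is to scan the launch scripts for any remaining --enforce-eager. The helper below is a hypothetical sketch (find_flag is not part of the repo). It assumes it is run from the repo root and skips scripts/legacy/ on purpose, since those files are meant to keep the flag.

```python
from pathlib import Path


def find_flag(root, flag="--enforce-eager", skip_dirs=("legacy",)):
    """Return (path, line_number, line) for each occurrence of `flag`
    in *.sh files under `root`, ignoring directories in `skip_dirs`."""
    hits = []
    for path in sorted(Path(root).rglob("*.sh")):
        # Directory components between root and the file itself.
        parents = path.relative_to(root).parts[:-1]
        if any(part in skip_dirs for part in parents):
            continue
        text = path.read_text(errors="replace")
        for lineno, line in enumerate(text.splitlines(), start=1):
            if flag in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


if __name__ == "__main__":
    for path, lineno, line in find_flag("scripts"):
        print(f"{path}:{lineno}: {line}")
```

The match is a plain substring test, so a comment that mentions the flag is reported as well. An empty result means no non-legacy launcher pins eager mode.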

REPORT.md and analysis/pd_separation_analysis.md still describe the old --enforce-eager setup. Update them once the new runs land.

Reproducing the figures

From repo root:

# C1 (needs traces/w600_r0.0015_st30.jsonl; ~1.2 MB, pull from dash0 if missing)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_workload.py \
    --trace traces/w600_r0.0015_st30.jsonl

# C6 (analytical, runs anywhere with matplotlib)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_roofline.py

# C7 (hardcoded REPORT.md §3.1 numbers; no inputs)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_routing_lever.py

# KV mem-wall (analytical; audit constants at top of the script)
.venv/bin/python analysis/pd_sep_paper_section/scripts/plot_kv_memory_wall.py

All four default --outdir to analysis/pd_sep_paper_section/figures.

Running the experiment matrix (gating C2/C3/C4/C5)

bench_pd_matrix.sh orchestrates the experiments that claims C2 through C5 depend on. It uses the extended scripts/bench.sh, which now supports --mode pdsep --pd-ratio 4:4|6:2 and --eager for the cuda-graph ablation. None of the launchers pin --enforce-eager any longer.

On dash0:

cd ~/agentic-kv
# minimal set: 3 configs (combined-ca, pdsep-4p4d, pdsep-6p2d) x 3 seeds
# = 9 runs ~= 2 h
bash analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh

# full matrix (adds combined-rr and the eager ablation)
bash analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh \
    --with-rr --with-eager

Each run writes to outputs/pd_matrix/<config>_<mode>_seed<N>/ with metrics.summary.json, breakdown.json, apc.txt, gpu_util.csv, and per-instance vLLM logs (the latter contain the step-level KV cache: X% lines needed for the C3 time-series figure).
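The step-level KV lines can be turned into a utilization series with a small parser. The sketch below is an assumption about the log format, which this README does not show in full: it matches any line containing "KV cache", optional words, a colon, and a percentage. The sample lines are made up for illustration, and kv_utilization_series is a hypothetical helper, not a function from the repo.

```python
import re

# Assumed shape: "... KV cache<optional words>: <number>% ..."
KV_LINE = re.compile(r"KV cache[^:%\n]*:\s*([0-9]+(?:\.[0-9]+)?)\s*%")


def kv_utilization_series(lines):
    """Return KV utilization percentages in log order,
    one value per matching line; other lines are skipped."""
    series = []
    for line in lines:
        match = KV_LINE.search(line)
        if match:
            series.append(float(match.group(1)))
    return series


# Made-up sample lines, only to exercise the parser.
sample_log = [
    "INFO step=1 KV cache: 41.5%",
    "INFO unrelated line without a utilization figure",
    "INFO step=2 GPU KV cache usage: 97.0%, running: 12 reqs",
]

series = kv_utilization_series(sample_log)
print(series, "peak:", max(series))
```

On a real run, pass an open log file instead of the sample list, for example `kv_utilization_series(open(path))`, and adjust the pattern if the actual lines differ.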

Caveats / open items

  • C7 uses legacy data. The footer of fig_c7_routing_lever.pdf says so: PD-sep numbers come from the random-sampled trace + --enforce-eager. After pd_matrix lands, swap the four numbers in plot_routing_lever.py's ROWS table and re-render.
  • C2/C3/C4/C5 figures depend on pd_matrix outputs. Per the status table above, C2, C3, and C4 are rendered and only C5 still waits on the --with-eager re-run. The plotters read outputs/pd_matrix/*/metrics.summary.json, breakdown.json, and the KV cache: X% lines from per-instance logs to produce: bar chart with error bars (C2), KV utilization time-series (C3), TTFT stacked breakdown (C4), 2x2 cuda-graph ablation (C5).
  • C8 (mined logs): rejected — existing PD-sep outputs/exp3_* directories have per-request metrics but no per-stage breakdown and no step-level KV utilization. C3/C4 require fresh runs with proxy /breakdown collection (already automatic in bench.sh collect_artifacts()).
  • C6 is analytical, so it is independent of any re-run. The numbers match scripts/compute_roofline.py (constants are duplicated; if one changes, the other must change too).
  • fig_kv_memory_wall.pdf is analytical with one empirical anchor (the star marker for REPORT.md §3.3 6P+2D @ 97 % KV utilization). It does not need a re-run, but the anchor would be more rigorous if taken from a pd_matrix 6P+2D log (a KV-utilization time-series rather than a single snapshot).
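For the C2 error bars mentioned above, per-seed results have to be grouped by configuration. The sketch below shows one way to do that under two assumptions: that metrics.summary.json is a flat JSON object, and that it carries a key named ttft_p50_s. That key name is hypothetical, since the real schema is not documented here. The directory pattern `<config>_<mode>_seed<N>` is the one described in this README.

```python
import json
import re
import statistics
from collections import defaultdict
from pathlib import Path

# Splits "<config>_<mode>_seed<N>" into a group name and a seed number.
RUN_DIR = re.compile(r"^(?P<name>.+)_seed(?P<seed>\d+)$")


def aggregate_runs(root, key="ttft_p50_s"):
    """Group runs by '<config>_<mode>' and return
    {name: (mean, sample_stdev, n)} for the given summary key."""
    values = defaultdict(list)
    for summary in sorted(Path(root).glob("*/metrics.summary.json")):
        match = RUN_DIR.match(summary.parent.name)
        if not match:
            continue  # directory does not follow the naming pattern
        data = json.loads(summary.read_text())
        if key in data:
            values[match.group("name")].append(float(data[key]))
    result = {}
    for name, xs in values.items():
        # Sample stdev needs at least two seeds; report 0.0 otherwise.
        spread = statistics.stdev(xs) if len(xs) > 1 else 0.0
        result[name] = (statistics.mean(xs), spread, len(xs))
    return result


if __name__ == "__main__":
    for name, (mean, spread, n) in aggregate_runs("outputs/pd_matrix").items():
        print(f"{name}: mean={mean:.2f} stdev={spread:.2f} n={n}")
```

A spread of 0.0 with n=1 means "no replicate", not "no variance", which matters for the PD-sep configs that currently have a single seed.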