MB1: prefill-decode interference under chunked-prefill default; §3.2 headline

Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on, no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps. Method recap (driver: microbench/interference/driver.py, repurposed): - Pin D streaming decode requests at constant max_tokens - Inject one prefill-only request (max_tokens=1) of varying input length - Bin decode-stream token timestamps into "during prefill" vs baseline - Headline metric: effective per-stream TPOT during the prefill burst, = prefill_ttft / (num_tokens_during_prefill / D). This is the average rate at which each decode stream produces tokens during the burst. p50 of inter-token intervals is deceptive (chunked-prefill makes most intervals look normal); the burst-average gives the true cost. Results (D=8 row, the most agentic-realistic case): P (tokens) | prefill_ttft | per-stream TPOT during | penalty 2048 | 143 ms | 32 ms | 4× 8192 | 583 ms | 114 ms | 15× 32768 | 4520 ms | 388 ms | 52× 65536 | 15615 ms | 757 ms | 99× 131072 | 56991 ms | 1419 ms | 183× Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst each ongoing decode is running ~183× slower (i.e. essentially halted) for ~57 seconds. §3.2 implication: PD-disagg's promised phase-isolation benefit per agentic request is bounded by the decode duration, which is 50–200 ms for tool-call output. MB2 says the KV-transfer cost of PD-disagg is 300 ms – 10 s for agentic-size requests. Cost > benefit for every KV size above ~80 MiB (well below trace mean 192 MiB). The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling (50–200 ms band, capped by decode) onto MB2 transfer cost curve and marks the agentic-distribution waypoints (trace mean, p90, p95, p99) on the x-axis. Across the entire agentic distribution, the cost curve sits above the benefit band. Adds: - microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192) - microbench/fresh_setup/mb1_driver.py: copy of the existing microbench/interference/driver.py for cpfs deployment - microbench/fresh_setup/analyze_mb1.py: aggregator emitting per-(D, P) effective-TPOT-during + max PD-disagg-benefit table - microbench/fresh_setup/plot_mb1.py: mb1 standalone + pd_cost_vs_benefit headline figure - analysis/mb1/summary.csv: 45 raw rows from the sweep - analysis/mb1/breakdown.json: per-(D, P) aggregate - analysis/mb1/README.md: persistent doc - figs/mb1_interference.png: effective TPOT during prefill, one line per D - figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere) Caveats noted in README: - chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would interleave decode more aggressively. Chunk-size sensitivity is flagged as next run. - D ≤ 8; higher D may saturate or shrink the penalty further. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 21:25:09 +08:00
parent 90127c3389
commit 029821c1b6
9 changed files with 1151 additions and 0 deletions
--- a/analysis/mb1/README.md
+++ b/analysis/mb1/README.md
@@ -0,0 +1,177 @@
+# MB1 — Prefill–Decode Interference (chunked-prefill on, vLLM 0.18.1 default)
+
+Persistent record of the phase-interference microbench used to put a
+quantitative upper bound on **what PD-disaggregation can buy** under the
+chunked-prefill-on baseline. Re-runs append a dated section at the
+bottom; the **Summary** block is what gets cited.
+
+---
+
+## Summary (latest)
+
+| Headline | Value |
+|---|---|
+| Baseline single-stream TPOT (D=1, idle GPU) | **4.8 ms** |
+| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** |
+| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** |
+| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** |
+| Maximum PD-disagg benefit per agentic decode | **≤ 50–200 ms** (= decode duration) |
+
+**§3.2 headline (cost vs benefit, this run + MB2)**:
+
+> Under chunked-prefill, every ongoing decode stream is essentially
+> **halted while a prefill chunk is in flight** — per-stream effective
+> TPOT during the burst is 15× to 2000× baseline, scaling with prefill
+> size. PD-disagg can recover this stall, but the recovery is bounded by
+> the **decode duration** of the request being protected. For agentic,
+> decode is 50–200 ms (tool-call output). MB2 shows PD-disagg pays
+> 300 ms – 10 s of KV-transfer cost per request to do that recovery. The
+> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB
+> (~830 tokens) — well below all agentic operating points. The benefit
+> never beats the cost in this workload.
+
+## Setup
+
+| Component | Value |
+|---|---|
+| Host | dash1, H20 96 GiB, driver 570.133.20 |
+| Venv | `/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` |
+| vLLM | 0.18.1 official wheel (chunked-prefill default-on, V1 engine) |
+| Model | `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` |
+| Launch flags | `--tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192` |
+| kv_connector | **none** (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2) |
+
+## Method
+
+Adapted from `microbench/interference/driver.py`:
+
+1. Start D streaming decode requests on `/v1/chat/completions` with a
+   long max_tokens cap. Discard the first 32 tokens as warmup.
+2. After 1 s, inject one prefill-only request with `max_tokens=1` and
+   an input of `P` synthetic tokens (uuid-seeded for zero prefix-cache
+   reuse). Measure the prefill's TTFT.
+3. Bin the *during-prefill* tokens from each decode stream by whether
+   their wall-clock falls inside `[prefill_inject_ts, prefill_inject_ts +
+   prefill_ttft]`. Report inter-token p50 / p90.
+4. Bin a baseline run (D streams, no prefill injection) the same way.
+
+We additionally compute the **effective per-stream TPOT during the
+prefill burst** as the single most informative summary:
+
+```
+eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)
+```
+
+This is the average rate at which each decode stream produces tokens
+while a prefill is in flight. Compared to baseline TPOT it gives the
+real per-stream throughput penalty (chunked-prefill p50 looks deceptively
+fine because most decode-token intervals during the burst are at normal
+speed; p90 sees the stall but is itself noisy; the effective TPOT is
+the cleanest "average over the whole burst window" number).
+
+## Results — 2026-05-27, dash1 GPU 0, chunk_tokens=8192
+
+3 D × 5 P × 3 reps. Aggregated by `analyze_mb1.py`.
+
+| D | P (tok) | base TPOT (ms) | prefill_ttft (ms) | per-stream tokens during | effective TPOT during (ms) | penalty | max PD-disagg benefit per stream (ms) |
+|--:|--:|--:|--:|--:|--:|--:|--:|
+| 1 |   2 048 | 4.79 |   163 | 4.0 |     41 |     8× |    144 |
+| 1 |   8 192 | 4.78 |   584 | 5.0 |    117 |    24× |    560 |
+| 1 |  32 768 | 4.78 | 4 515 | 5.0 |    903 |   189× |  4 491 |
+| 1 |  65 536 | 4.78 | 15 568 | 5.3 |  2 919 |   610× | 15 542 |
+| 1 | 131 072 | 4.78 | 56 765 | 5.7 | 10 017 | 2 094× | 56 738 |
+| 4 |   2 048 | 5.62 |   138 | 3.9 |     36 |     6× |    117 |
+| 4 |   8 192 | 6.08 |   574 | 4.5 |    128 |    21× |    547 |
+| 4 |  32 768 | 6.09 | 4 529 | 11.9 |    381 |    63× |  4 457 |
+| 4 |  65 536 | 5.85 | 15 587 | 19.8 |    789 |   135× | 15 471 |
+| 4 | 131 072 | 6.27 | 56 697 | 37.4 |  1 517 |   242× | 56 463 |
+| 8 |   2 048 | 7.71 |   143 | 4.5 |     32 |     4× |    109 |
+| 8 |   8 192 | 7.69 |   583 | 5.1 |    114 |    15× |    544 |
+| 8 |  32 768 | 7.42 | 4 520 | 11.7 |    387 |    52× |  4 434 |
+| 8 |  65 536 | 7.67 | 15 615 | 20.6 |    757 |    99× | 15 457 |
+| 8 | 131 072 | 7.74 | 56 991 | 40.2 |  1 419 |   183× | 56 680 |
+
+**Reading the table**:
+
+- *Baseline TPOT* grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8).
+  Multi-stream decoding has small but nonzero contention even without
+  prefill.
+- *Effective TPOT during* grows mostly with P: a single 8k prefill stalls
+  decode for ~580 ms regardless of D, so each stream emits only a handful
+  of tokens during that 580 ms window — effective per-stream TPOT
+  collapses to 100–130 ms. Larger prefill = more chunks = larger stall.
+- *Penalty* is the eff/baseline ratio. Above 50× for P ≥ 32k. Above
+  500× for D=1 at P ≥ 65k.
+- *Max PD-disagg benefit per stream* = `prefill_ttft − per_stream_tokens
+  × baseline_TPOT` ≈ `prefill_ttft` (since interference essentially
+  halts decode). This is the entire prefill duration's worth of decode
+  time that could in principle be recovered.
+
+Two big caveats for **agentic** application:
+
+1. **Decode is short** (~50–200 ms for tool-call output). The actual
+   recoverable benefit per request is bounded by the decode duration,
+   not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill
+   collides with it, PD-disagg can save at most 100 ms — not 5 s.
+2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms – 10 s per request
+   for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the
+   ~100 ms benefit ceiling. Cost > benefit across the whole agentic
+   distribution.
+
+## §3.2 cost-vs-benefit figure
+
+`figs/pd_cost_vs_benefit.png` overlays MB1 benefit ceiling (50–200 ms
+band, capped by decode duration) on top of MB2 transfer cost curve. The
+cost curve crosses the benefit ceiling somewhere around **80 MiB / 830
+tokens** of KV — well below the trace mean (192 MiB / 2k tok ≈ trace
+mean per request KV, and we know agentic averages 33k tokens, p99
+125k). For anything bigger PD-disagg pays more than it can recover.
+
+## Reproduction
+
+```bash
+# vllm pair-free single-instance launch
+ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \
+  bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start'
+
+# sweep
+ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \
+  python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \
+    --host 127.0.0.1 --port 8000 \
+    --model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \
+    --decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \
+    --reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results'
+
+# pull + analyze
+scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \
+    analysis/mb1/summary.csv
+.venv/bin/python microbench/fresh_setup/analyze_mb1.py \
+  --summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json
+.venv/bin/python microbench/fresh_setup/plot_mb1.py \
+  --mb1 analysis/mb1/breakdown.json \
+  --mb2-intra analysis/mb2/intra_kvboth_breakdown.json \
+  --mb2-inter analysis/mb2/inter_kvboth_breakdown.json
+
+# teardown
+ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop'
+```
+
+## Open questions / next runs
+
+- **Chunk size sensitivity**: this run uses `--max-num-batched-tokens
+  8192`. Sarathi-Serve goes smaller (e.g. 1024) and recovers more
+  decode interleaving inside each prefill burst. Worth running
+  chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis.
+- **Higher D**: 12, 16 streams to see whether the penalty saturates or
+  keeps shrinking per-stream.
+- **Cross-validate effective_TPOT_during with token-time-series plot**:
+  raw per-token timestamps could reveal whether the stall is a few big
+  spikes or many small ones (currently inferred from p50/p90 spread).
+
+## Run log
+
+### 2026-05-27 — dash1 GPU 0, chunk_tokens=8192
+
+3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on
+dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`.
+Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`.
--- a/analysis/mb1/breakdown.json
+++ b/analysis/mb1/breakdown.json
@@ -0,0 +1,199 @@
+{
+  "summary": [
+    {
+      "decode_batch_size": 1,
+      "new_prefill_tokens": 2048,
+      "baseline_tpot_ms": 4.79,
+      "during_tpot_p50_ms_raw": 35.43,
+      "during_tpot_p90_ms_raw": 79.91,
+      "prefill_ttft_ms": 163.3,
+      "num_tokens_during_prefill_total": 4.0,
+      "per_stream_tokens_during": 4.0,
+      "effective_tpot_during_ms": 40.8,
+      "interference_penalty_x": 8.5,
+      "max_pd_disagg_benefit_ms_per_stream": 144.2
+    },
+    {
+      "decode_batch_size": 1,
+      "new_prefill_tokens": 8192,
+      "baseline_tpot_ms": 4.78,
+      "during_tpot_p50_ms_raw": 6.56,
+      "during_tpot_p90_ms_raw": 328.57,
+      "prefill_ttft_ms": 583.9,
+      "num_tokens_during_prefill_total": 5.0,
+      "per_stream_tokens_during": 5.0,
+      "effective_tpot_during_ms": 116.8,
+      "interference_penalty_x": 24.4,
+      "max_pd_disagg_benefit_ms_per_stream": 560.0
+    },
+    {
+      "decode_batch_size": 1,
+      "new_prefill_tokens": 32768,
+      "baseline_tpot_ms": 4.78,
+      "during_tpot_p50_ms_raw": 4.75,
+      "during_tpot_p90_ms_raw": 4.9,
+      "prefill_ttft_ms": 4515.3,
+      "num_tokens_during_prefill_total": 5.0,
+      "per_stream_tokens_during": 5.0,
+      "effective_tpot_during_ms": 903.1,
+      "interference_penalty_x": 188.8,
+      "max_pd_disagg_benefit_ms_per_stream": 4491.4
+    },
+    {
+      "decode_batch_size": 1,
+      "new_prefill_tokens": 65536,
+      "baseline_tpot_ms": 4.78,
+      "during_tpot_p50_ms_raw": 4.69,
+      "during_tpot_p90_ms_raw": 4.97,
+      "prefill_ttft_ms": 15567.6,
+      "num_tokens_during_prefill_total": 5.3,
+      "per_stream_tokens_during": 5.33,
+      "effective_tpot_during_ms": 2918.9,
+      "interference_penalty_x": 610.2,
+      "max_pd_disagg_benefit_ms_per_stream": 15542.0
+    },
+    {
+      "decode_batch_size": 1,
+      "new_prefill_tokens": 131072,
+      "baseline_tpot_ms": 4.78,
+      "during_tpot_p50_ms_raw": 4.71,
+      "during_tpot_p90_ms_raw": 4.9,
+      "prefill_ttft_ms": 56765.2,
+      "num_tokens_during_prefill_total": 5.7,
+      "per_stream_tokens_during": 5.67,
+      "effective_tpot_during_ms": 10017.4,
+      "interference_penalty_x": 2094.5,
+      "max_pd_disagg_benefit_ms_per_stream": 56738.1
+    },
+    {
+      "decode_batch_size": 4,
+      "new_prefill_tokens": 2048,
+      "baseline_tpot_ms": 5.62,
+      "during_tpot_p50_ms_raw": 22.18,
+      "during_tpot_p90_ms_raw": 84.85,
+      "prefill_ttft_ms": 138.3,
+      "num_tokens_during_prefill_total": 15.5,
+      "per_stream_tokens_during": 3.88,
+      "effective_tpot_during_ms": 35.7,
+      "interference_penalty_x": 6.3,
+      "max_pd_disagg_benefit_ms_per_stream": 116.6
+    },
+    {
+      "decode_batch_size": 4,
+      "new_prefill_tokens": 8192,
+      "baseline_tpot_ms": 6.08,
+      "during_tpot_p50_ms_raw": 8.45,
+      "during_tpot_p90_ms_raw": 515.39,
+      "prefill_ttft_ms": 574.1,
+      "num_tokens_during_prefill_total": 18.0,
+      "per_stream_tokens_during": 4.5,
+      "effective_tpot_during_ms": 127.6,
+      "interference_penalty_x": 21.0,
+      "max_pd_disagg_benefit_ms_per_stream": 546.8
+    },
+    {
+      "decode_batch_size": 4,
+      "new_prefill_tokens": 32768,
+      "baseline_tpot_ms": 6.09,
+      "during_tpot_p50_ms_raw": 9.83,
+      "during_tpot_p90_ms_raw": 1314.87,
+      "prefill_ttft_ms": 4529.1,
+      "num_tokens_during_prefill_total": 47.5,
+      "per_stream_tokens_during": 11.88,
+      "effective_tpot_during_ms": 381.4,
+      "interference_penalty_x": 62.7,
+      "max_pd_disagg_benefit_ms_per_stream": 4456.9
+    },
+    {
+      "decode_batch_size": 4,
+      "new_prefill_tokens": 65536,
+      "baseline_tpot_ms": 5.85,
+      "during_tpot_p50_ms_raw": 6.41,
+      "during_tpot_p90_ms_raw": 2077.47,
+      "prefill_ttft_ms": 15586.5,
+      "num_tokens_during_prefill_total": 79.0,
+      "per_stream_tokens_during": 19.75,
+      "effective_tpot_during_ms": 789.2,
+      "interference_penalty_x": 135.0,
+      "max_pd_disagg_benefit_ms_per_stream": 15471.0
+    },
+    {
+      "decode_batch_size": 4,
+      "new_prefill_tokens": 131072,
+      "baseline_tpot_ms": 6.27,
+      "during_tpot_p50_ms_raw": 6.3,
+      "during_tpot_p90_ms_raw": 4405.18,
+      "prefill_ttft_ms": 56697.1,
+      "num_tokens_during_prefill_total": 149.5,
+      "per_stream_tokens_during": 37.38,
+      "effective_tpot_during_ms": 1517.0,
+      "interference_penalty_x": 241.8,
+      "max_pd_disagg_benefit_ms_per_stream": 56462.6
+    },
+    {
+      "decode_batch_size": 8,
+      "new_prefill_tokens": 2048,
+      "baseline_tpot_ms": 7.71,
+      "during_tpot_p50_ms_raw": 8.38,
+      "during_tpot_p90_ms_raw": 98.98,
+      "prefill_ttft_ms": 143.1,
+      "num_tokens_during_prefill_total": 35.7,
+      "per_stream_tokens_during": 4.46,
+      "effective_tpot_during_ms": 32.1,
+      "interference_penalty_x": 4.2,
+      "max_pd_disagg_benefit_ms_per_stream": 108.8
+    },
+    {
+      "decode_batch_size": 8,
+      "new_prefill_tokens": 8192,
+      "baseline_tpot_ms": 7.69,
+      "during_tpot_p50_ms_raw": 9.34,
+      "during_tpot_p90_ms_raw": 519.29,
+      "prefill_ttft_ms": 583.3,
+      "num_tokens_during_prefill_total": 41.0,
+      "per_stream_tokens_during": 5.12,
+      "effective_tpot_during_ms": 113.8,
+      "interference_penalty_x": 14.8,
+      "max_pd_disagg_benefit_ms_per_stream": 543.9
+    },
+    {
+      "decode_batch_size": 8,
+      "new_prefill_tokens": 32768,
+      "baseline_tpot_ms": 7.42,
+      "during_tpot_p50_ms_raw": 11.61,
+      "during_tpot_p90_ms_raw": 1315.48,
+      "prefill_ttft_ms": 4520.3,
+      "num_tokens_during_prefill_total": 93.3,
+      "per_stream_tokens_during": 11.67,
+      "effective_tpot_during_ms": 387.5,
+      "interference_penalty_x": 52.2,
+      "max_pd_disagg_benefit_ms_per_stream": 4433.7
+    },
+    {
+      "decode_batch_size": 8,
+      "new_prefill_tokens": 65536,
+      "baseline_tpot_ms": 7.67,
+      "during_tpot_p50_ms_raw": 19.09,
+      "during_tpot_p90_ms_raw": 2471.4,
+      "prefill_ttft_ms": 15615.5,
+      "num_tokens_during_prefill_total": 165.0,
+      "per_stream_tokens_during": 20.62,
+      "effective_tpot_during_ms": 757.1,
+      "interference_penalty_x": 98.8,
+      "max_pd_disagg_benefit_ms_per_stream": 15457.4
+    },
+    {
+      "decode_batch_size": 8,
+      "new_prefill_tokens": 131072,
+      "baseline_tpot_ms": 7.74,
+      "during_tpot_p50_ms_raw": 11.51,
+      "during_tpot_p90_ms_raw": 4895.27,
+      "prefill_ttft_ms": 56991.4,
+      "num_tokens_during_prefill_total": 321.3,
+      "per_stream_tokens_during": 40.17,
+      "effective_tpot_during_ms": 1418.9,
+      "interference_penalty_x": 183.3,
+      "max_pd_disagg_benefit_ms_per_stream": 56680.4
+    }
+  ]
+}
--- a/analysis/mb1/summary.csv
+++ b/analysis/mb1/summary.csv
@@ -0,0 +1,46 @@
+chunk_size,decode_batch_size,new_prefill_tokens,repetition,tpot_baseline_p50_ms,tpot_baseline_p90_ms,tpot_during_prefill_p50_ms,tpot_during_prefill_p90_ms,tpot_after_prefill_p50_ms,prefill_ttft_ms,num_tokens_during_prefill,tpot_penalty_p50_ms,tpot_penalty_ratio
+8192,1,131072,0,4.777565016411245,4.900234832894057,4.701301048044115,4.948397364933044,0.0,56719.25117995124,7,-0.07626396836712956,0.9840370632099913
+8192,1,131072,1,4.779465030878782,4.883405601140112,4.707481013610959,4.85471700085327,0.0,56696.089847013354,5,-0.07198401726782322,0.9849388965495606
+8192,1,131072,2,4.790953011251986,4.880544205661863,4.728371975943446,4.907831805758178,0.0,56880.19039196661,5,-0.06258103530853987,0.9869376645603573
+8192,1,2048,0,4.77885699365288,4.894876398611814,41.434570477576926,88.97331730695441,0.0,183.2046649651602,4,36.655713483924046,8.670393471202205
+8192,1,2048,1,4.788161953911185,4.949774022679776,41.68213551747613,83.5143867880106,0.0,175.55483896285295,4,36.89397356356494,8.705247633369687
+8192,1,2048,2,4.7893429873511195,4.874200583435595,23.186982492916286,67.25202781381086,0.0,131.23180496040732,4,18.397639505565166,4.841370215946989
+8192,1,32768,0,4.789774015080184,4.870833398308605,4.738486022688448,4.886626999359578,0.0,4500.839321000967,5,-0.051287992391735315,0.9892921895207875
+8192,1,32768,1,4.776834975928068,4.891659819986671,4.729953012429178,4.9245511763729155,0.0,4496.073378017172,5,-0.0468819634988904,0.9901855593221991
+8192,1,32768,2,4.784431017469615,4.866032593417913,4.782894975505769,4.8977664206177,0.0,4549.013931944501,5,-0.0015360419638454914,0.9996789499193871
+8192,1,65536,0,4.778854956384748,4.9255444086156785,4.633405013009906,4.895579582080245,0.0,15530.37424501963,5,-0.1454499433748424,0.9695638506080803
+8192,1,65536,1,4.784283053595573,4.8808404128067195,4.754905996378511,4.985795798711479,0.0,15584.887631004676,5,-0.02937705721706152,0.99385967408534
+8192,1,65536,2,4.787993966601789,4.9004736240021884,4.6836750116199255,5.0271204963792115,0.0,15587.390075030271,6,-0.1043189549818635,0.9782123879625725
+8192,1,8192,0,4.785028984770179,4.878618801012635,7.490115996915847,324.06569679733366,0.0,573.2795029762201,5,2.7050870121456683,1.565323014919123
+8192,1,8192,1,4.778591974172741,4.899543372448534,5.9131429879926145,336.8099076091312,0.0,606.6823820001446,5,1.1345510138198733,1.237423705550061
+8192,1,8192,2,4.78826800826937,4.90188361145556,6.276679981965572,324.8370993998833,0.0,571.7499859747477,5,1.488411973696202,1.310845585736994
+8192,4,131072,0,6.113810988608748,6.309205386787653,0.0,0.0,0.0,56702.702289039735,0,-6.113810988608748,0.0
+8192,4,131072,1,6.630807969486341,7.086459483252838,6.2820459716022015,4400.500871409893,0.0,56807.70832300186,150,-0.3487619978841394,0.9474027902045915
+8192,4,131072,2,6.073819473385811,6.344516028184444,6.326125003397465,4409.856556192978,0.0,56580.784838995896,149,0.2523055300116539,1.0415398467335428
+8192,4,2048,0,5.402160517405719,5.543816485442221,6.210724503034726,84.62208869168535,6.125201500253752,140.3041940066032,18,0.8085639856290072,1.1496741873966574
+8192,4,2048,1,6.067108013667166,6.381415005307645,0.0,0.0,0.0,140.06177097326145,0,-6.067108013667166,0.0
+8192,4,2048,2,5.400336522143334,5.536347016459331,38.15686801681295,85.07051098858938,5.25214200024493,134.67552902875468,13,32.756531494669616,7.065646346363043
+8192,4,32768,0,6.115561525803059,6.369604001520202,7.216634490760043,1314.6978712815326,5.17624247004278,4522.433568025008,50,1.101072964956984,1.1800444587649532
+8192,4,32768,1,6.070095987524837,6.3612310332246125,0.0,0.0,0.0,4508.074064040557,0,-6.070095987524837,0.0
+8192,4,32768,2,6.0734800063073635,6.312666402664036,12.442811043001711,1315.0411327951588,4.754714027512819,4556.892123946454,45,6.369331036694348,2.0487119460473635
+8192,4,65536,0,5.406292999396101,5.540905491216108,0.0,0.0,0.0,15581.590663990937,0,-5.406292999396101,0.0
+8192,4,65536,1,6.076910009142011,6.315114628523588,0.0,0.0,0.0,15574.196094006766,0,-6.076910009142011,0.0
+8192,4,65536,2,6.060379033442587,6.384042033459991,6.411670008674264,2077.4700703914277,4.8022730043157935,15603.720718005206,79,0.3512909752316773,1.0579651822589267
+8192,4,8192,0,6.110575021011755,6.416070973500609,8.451583969872445,515.3855616226792,5.358011490898207,574.6672929963097,18,2.34100894886069,1.3831077993169092
+8192,4,8192,1,6.051429023500532,6.398122606333345,0.0,0.0,0.0,573.6081749782898,0,-6.051429023500532,0.0
+8192,4,8192,2,6.064729997888207,6.366449000779539,0.0,0.0,0.0,574.1707819979638,0,-6.064729997888207,0.0
+8192,8,131072,0,7.737616979284212,7.99839201499708,10.740376019384712,4742.438135773409,7.792441989295185,57010.66731195897,335,3.0027590401005,1.388072845701685
+8192,8,131072,1,7.744895527139306,8.013638522243127,8.647068490972742,5123.228083999129,7.672236970392987,56970.40947602363,310,0.9021729638334364,1.116486137310966
+8192,8,131072,2,7.740180502878502,8.016240986762568,15.140031988266855,4820.136589207682,7.68946303287521,56993.02393599646,319,7.3998514853883535,1.9560308680962177
+8192,8,2048,0,7.741285488009453,8.022559515666217,8.103576023131609,124.87094267853536,7.6825070136692375,141.97922096354887,30,0.36229053512215614,1.046799789993963
+8192,8,2048,1,7.728310010861605,8.021069981623441,8.17067950265482,84.82906777062453,7.745136506855488,144.1582590341568,38,0.4423694917932153,1.0572401328584768
+8192,8,2048,2,7.662211020942777,8.034424972720444,8.87883099494502,87.23540699575096,7.592331967316568,143.27958395006135,39,1.216619974002242,1.1587818412566437
+8192,8,32768,0,7.295333489309996,7.422819995554164,11.429400008637458,1315.43214758276,7.8034960315562785,4523.641717038117,94,4.134066519327462,1.5666727265292526
+8192,8,32768,1,7.278127042809501,7.490781514206901,12.640403030673042,1315.491412486881,7.821676495950669,4519.993302994408,90,5.362275987863541,1.736765922925357
+8192,8,32768,2,7.684049021918327,8.047712198458612,10.752685484476388,1315.5166705255397,7.80402502277866,4517.200137954205,96,3.068636462558061,1.3993514947399404
+8192,8,65536,0,7.708174001891166,8.017168991500512,26.662671996746212,2496.8427699001018,7.768569514155388,15603.601168957539,160,18.954497994855046,3.459012729889679
+8192,8,65536,1,7.594842027174309,7.9874323040712625,13.054963492322713,2459.1690181812737,7.54699349636212,15620.474929979537,174,5.460121465148404,1.7189249553331216
+8192,8,65536,2,7.693717983784154,7.933055714238435,17.5579380011186,2458.176895044744,7.808708498487249,15622.32490995666,161,9.864220017334446,2.2821135422594123
+8192,8,8192,0,7.636573514901102,7.904737605713308,10.151655005756766,514.8188057704829,7.7977380133233964,575.7745200535282,37,2.515081490855664,1.3293468577167538
+8192,8,8192,1,7.687711506150663,7.965393498307094,9.002390026580542,524.0793236298487,7.753994490485638,592.1044679707848,45,1.3146785204298794,1.1710103870804793
+8192,8,8192,2,7.756220467854291,8.035426988499239,8.864110975991935,518.9726910321042,7.770269992761314,581.98908099439,41,1.1078905081376433,1.1428389655411813
--- a/figs/mb1_interference.png
+++ b/figs/mb1_interference.png
--- a/figs/pd_cost_vs_benefit.png
+++ b/figs/pd_cost_vs_benefit.png
--- a/microbench/fresh_setup/analyze_mb1.py
+++ b/microbench/fresh_setup/analyze_mb1.py
@@ -0,0 +1,98 @@
+#!/usr/bin/env python3
+"""Aggregate MB1 results: per-(D, P) baseline vs during-prefill effective TPOT.
+
+The driver's `tpot_during_prefill_p50_ms` is computed per-token and can be
+misleading: chunked-prefill schedules decode alongside each prefill chunk,
+so most decode-token intervals during the prefill burst look "normal" — but
+each chunk completion creates a long-stall token. p50 hides this, p90
+exposes it, but the BEST single-number summary of "how much was decode
+slowed by prefill" is the *effective TPOT during the prefill burst*:
+
+    effective_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D)
+
+i.e. wall-clock time divided by per-stream tokens emitted in that window.
+This captures the true average throughput of each decode stream while a
+prefill burst is underway. Compared to baseline_TPOT it gives the
+"phase-interference penalty" PD-disagg could in principle recover.
+"""
+from __future__ import annotations
+
+import argparse
+import csv
+import json
+import statistics
+from collections import defaultdict
+from pathlib import Path
+
+
+def main() -> None:
+    p = argparse.ArgumentParser()
+    p.add_argument("--summary", type=Path, required=True)
+    p.add_argument("--out", type=Path, required=True)
+    args = p.parse_args()
+
+    rows = list(csv.DictReader(args.summary.open()))
+    by_dp: dict[tuple[int, int], list[dict]] = defaultdict(list)
+    for r in rows:
+        D = int(r["decode_batch_size"])
+        P = int(r["new_prefill_tokens"])
+        by_dp[(D, P)].append(r)
+
+    summary = []
+    for (D, P) in sorted(by_dp):
+        rs = by_dp[(D, P)]
+        base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rs)
+        during_p50_vals = [float(r["tpot_during_prefill_p50_ms"]) for r in rs
+                            if float(r["tpot_during_prefill_p50_ms"]) > 0]
+        during_p90_vals = [float(r["tpot_during_prefill_p90_ms"]) for r in rs
+                            if float(r["tpot_during_prefill_p90_ms"]) > 0]
+        ttft_vals = [float(r["prefill_ttft_ms"]) for r in rs]
+        n_tok_vals = [float(r["num_tokens_during_prefill"]) for r in rs
+                       if float(r["num_tokens_during_prefill"]) > 0]
+
+        if not n_tok_vals or D == 0:
+            continue
+        ttft = statistics.mean(ttft_vals)
+        n_tok_total = statistics.mean(n_tok_vals)
+        per_stream_tokens = n_tok_total / D
+        eff_tpot_during = ttft / per_stream_tokens if per_stream_tokens > 0 else 0
+        penalty_x = eff_tpot_during / base if base > 0 else 0
+
+        # PD-disagg potential benefit (per stream, ms):
+        #   if decode ran at baseline rate throughout the prefill window,
+        #   it would emit ttft/baseline tokens. Actual is per_stream_tokens.
+        #   Time saved if no interference = ttft - per_stream_tokens * baseline
+        time_saved_per_stream = ttft - per_stream_tokens * base
+
+        summary.append({
+            "decode_batch_size": D,
+            "new_prefill_tokens": P,
+            "baseline_tpot_ms": round(base, 2),
+            "during_tpot_p50_ms_raw": (round(statistics.mean(during_p50_vals), 2)
+                                        if during_p50_vals else None),
+            "during_tpot_p90_ms_raw": (round(statistics.mean(during_p90_vals), 2)
+                                        if during_p90_vals else None),
+            "prefill_ttft_ms": round(ttft, 1),
+            "num_tokens_during_prefill_total": round(n_tok_total, 1),
+            "per_stream_tokens_during": round(per_stream_tokens, 2),
+            "effective_tpot_during_ms": round(eff_tpot_during, 1),
+            "interference_penalty_x": round(penalty_x, 1),
+            "max_pd_disagg_benefit_ms_per_stream": round(time_saved_per_stream, 1),
+        })
+
+    args.out.parent.mkdir(parents=True, exist_ok=True)
+    args.out.write_text(json.dumps({"summary": summary}, indent=2))
+
+    print(f"{'D':>3} {'P':>7} {'base_ms':>9} {'eff_during_ms':>15} "
+          f"{'penalty':>10} {'pd_benefit_ms':>15}")
+    for s in summary:
+        print(f"{s['decode_batch_size']:>3} {s['new_prefill_tokens']:>7} "
+              f"{s['baseline_tpot_ms']:>9.2f} "
+              f"{s['effective_tpot_during_ms']:>15.1f} "
+              f"{s['interference_penalty_x']:>9.1f}x "
+              f"{s['max_pd_disagg_benefit_ms_per_stream']:>15.0f}")
+    print(f"\nwrote {args.out}")
+
+
+if __name__ == "__main__":
+    main()
--- a/microbench/fresh_setup/mb1_driver.py
+++ b/microbench/fresh_setup/mb1_driver.py
@@ -0,0 +1,422 @@
+#!/usr/bin/env python3
+"""Prefill-Decode Interference Microbenchmark Driver.
+
+Measures TPOT degradation caused by prefill chunks interfering with ongoing decode batches.
+Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) -> TPOT_penalty_ms
+
+Usage:
+    python driver.py --host 127.0.0.1 --port 8000 \
+        --decode-batch-sizes 0,1,2,4,6,8,12 \
+        --prefill-tokens 512,1024,2048,4096,8192,16384,32768 \
+        --reps 5 --output-dir results/
+"""
+
+import argparse
+import asyncio
+import json
+import os
+import time
+from dataclasses import dataclass, asdict
+from pathlib import Path
+from typing import Optional
+
+import httpx
+import numpy as np
+
+
+FIXED_SEED_PROMPT = (
+    "You are a helpful assistant. Please analyze the following document carefully "
+    "and provide a comprehensive summary covering all key points, main arguments, "
+    "supporting evidence, and conclusions. The document discusses various aspects "
+    "of distributed systems, including consensus protocols, fault tolerance mechanisms, "
+    "and performance optimization strategies for large-scale deployments.\n\n"
+) * 50  # ~4k tokens worth of repeated text for prefix cache sharing
+
+WARMUP_TOKENS = 32
+MEASURE_WINDOW_TOKENS = 500
+
+
+@dataclass
+class Config:
+    decode_batch_size: int
+    new_prefill_tokens: int
+    chunk_size: int
+    model: str
+    repetition: int
+
+
+@dataclass
+class BaselineResult:
+    tpot_p50_ms: float
+    tpot_p90_ms: float
+    tpot_p99_ms: float
+    tokens_collected: int
+
+
+@dataclass
+class InterferenceResult:
+    tpot_during_prefill_p50_ms: float
+    tpot_during_prefill_p90_ms: float
+    tpot_after_prefill_p50_ms: float
+    prefill_ttft_ms: float
+    num_tokens_during_prefill: int
+
+
+async def stream_tokens(client: httpx.AsyncClient, url: str, payload: dict) -> list[float]:
+    """Send a streaming request, return list of timestamps (seconds) for each token."""
+    timestamps = []
+    async with client.stream("POST", url, json=payload, timeout=300.0) as resp:
+        resp.raise_for_status()
+        async for line in resp.aiter_lines():
+            if line.startswith("data: "):
+                data = line[6:]
+                if data.strip() == "[DONE]":
+                    break
+                try:
+                    chunk = json.loads(data)
+                    choices = chunk.get("choices", [])
+                    if not choices:
+                        continue
+                    delta = choices[0].get("delta", {})
+                    if "role" in delta:
+                        continue
+                    timestamps.append(time.perf_counter())
+                except json.JSONDecodeError:
+                    continue
+    return timestamps
+
+
+def compute_tpot(timestamps: list[float], skip_first: int = 0) -> np.ndarray:
+    """Compute inter-token intervals in ms, skipping first N tokens."""
+    if len(timestamps) < skip_first + 2:
+        return np.array([])
+    ts = np.array(timestamps[skip_first:])
+    return np.diff(ts) * 1000.0  # seconds → ms
+
+
+def make_decode_payload(model: str) -> dict:
+    return {
+        "model": model,
+        "messages": [{"role": "user", "content": FIXED_SEED_PROMPT}],
+        "max_tokens": WARMUP_TOKENS + MEASURE_WINDOW_TOKENS + 50,
+        "temperature": 0,
+        "stream": True,
+    }
+
+
+def make_prefill_payload(model: str, num_tokens: int) -> dict:
+    import hashlib
+    import uuid
+    # Generate UNIQUE content every call to guarantee zero prefix cache hits.
+    # Calibration: each "Block N: <32-hex>" → ~35 tokens after tokenization
+    unique_id = f"{uuid.uuid4().hex}_{time.time_ns()}"
+    n_parts = max(1, num_tokens // 35)
+    content_parts = []
+    for i in range(n_parts):
+        seed = hashlib.md5(f"{unique_id}_{i}".encode()).hexdigest()
+        content_parts.append(f"Block {i}: {seed}")
+    content = " ".join(content_parts)
+    return {
+        "model": model,
+        "messages": [{"role": "user", "content": content}],
+        "max_tokens": 1,
+        "temperature": 0,
+        "stream": True,
+    }
+
+
+async def wait_for_steady_state(decode_streams: list[asyncio.Task], min_tokens: int = 32):
+    """Wait until all decode streams have emitted at least min_tokens."""
+    # We don't directly control this — we wait a fixed time based on expected TPOT
+    # At ~50ms/token, 32 tokens ≈ 1.6s. Wait 3s to be safe.
+    await asyncio.sleep(3.0)
+
+
+async def run_baseline(
+    client: httpx.AsyncClient, url: str, model: str, decode_batch_size: int
+) -> Optional[BaselineResult]:
+    """Measure decode-only TPOT (no prefill interference)."""
+    if decode_batch_size == 0:
+        return BaselineResult(tpot_p50_ms=0, tpot_p90_ms=0, tpot_p99_ms=0, tokens_collected=0)
+
+    payloads = [make_decode_payload(model) for _ in range(decode_batch_size)]
+    tasks = [asyncio.create_task(stream_tokens(client, url, p)) for p in payloads]
+
+    all_timestamps = await asyncio.gather(*tasks, return_exceptions=True)
+
+    all_tpots = []
+    for ts in all_timestamps:
+        if isinstance(ts, Exception):
+            print(f"  [WARN] decode stream error: {ts}")
+            continue
+        tpot = compute_tpot(ts, skip_first=WARMUP_TOKENS)
+        if len(tpot) > 0:
+            all_tpots.extend(tpot.tolist())
+
+    if not all_tpots:
+        return None
+
+    arr = np.array(all_tpots)
+    return BaselineResult(
+        tpot_p50_ms=float(np.percentile(arr, 50)),
+        tpot_p90_ms=float(np.percentile(arr, 90)),
+        tpot_p99_ms=float(np.percentile(arr, 99)),
+        tokens_collected=len(arr),
+    )
+
+
+async def run_interference(
+    client: httpx.AsyncClient,
+    url: str,
+    model: str,
+    decode_batch_size: int,
+    new_prefill_tokens: int,
+) -> Optional[InterferenceResult]:
+    """Measure TPOT while a prefill request is being processed."""
+    if decode_batch_size == 0:
+        # No decode to interfere with; just measure prefill TTFT
+        prefill_payload = make_prefill_payload(model, new_prefill_tokens)
+        t_start = time.perf_counter()
+        ts = await stream_tokens(client, url, prefill_payload)
+        prefill_ttft = (ts[0] - t_start) * 1000.0 if ts else 0
+        return InterferenceResult(
+            tpot_during_prefill_p50_ms=0,
+            tpot_during_prefill_p90_ms=0,
+            tpot_after_prefill_p50_ms=0,
+            prefill_ttft_ms=prefill_ttft,
+            num_tokens_during_prefill=0,
+        )
+
+    # Phase 1: Start decode streams
+    decode_payloads = [make_decode_payload(model) for _ in range(decode_batch_size)]
+
+    decode_timestamps: list[list[float]] = [[] for _ in range(decode_batch_size)]
+    prefill_done_event = asyncio.Event()
+    prefill_ttft_ms = 0.0
+    prefill_inject_time = 0.0
+
+    async def decode_stream_with_tracking(idx: int, payload: dict):
+        timestamps = await stream_tokens(client, url, payload)
+        decode_timestamps[idx] = timestamps
+
+    async def prefill_after_warmup():
+        nonlocal prefill_ttft_ms, prefill_inject_time
+        # Wait for decode streams to stabilize
+        await asyncio.sleep(1.0)
+        prefill_inject_time = time.perf_counter()
+        prefill_payload = make_prefill_payload(model, new_prefill_tokens)
+        ts = await stream_tokens(client, url, prefill_payload)
+        if ts:
+            prefill_ttft_ms = (ts[0] - prefill_inject_time) * 1000.0
+        prefill_done_event.set()
+
+    # Launch all
+    decode_tasks = [
+        asyncio.create_task(decode_stream_with_tracking(i, p))
+        for i, p in enumerate(decode_payloads)
+    ]
+    prefill_task = asyncio.create_task(prefill_after_warmup())
+
+    await asyncio.gather(*decode_tasks, prefill_task, return_exceptions=True)
+
+    # Analyze: split decode tokens into "during prefill" and "after prefill"
+    prefill_end_time = prefill_inject_time + prefill_ttft_ms / 1000.0
+
+    tpot_during = []
+    tpot_after = []
+
+    for ts_list in decode_timestamps:
+        if len(ts_list) < WARMUP_TOKENS + 5:
+            continue
+        for i in range(WARMUP_TOKENS + 1, len(ts_list)):
+            t_prev = ts_list[i - 1]
+            t_curr = ts_list[i]
+            interval_ms = (t_curr - t_prev) * 1000.0
+
+            if prefill_inject_time <= t_prev <= prefill_end_time:
+                tpot_during.append(interval_ms)
+            elif t_curr > prefill_end_time + 0.05:  # 50ms after prefill settles
+                tpot_after.append(interval_ms)
+
+    during_arr = np.array(tpot_during) if tpot_during else np.array([0.0])
+    after_arr = np.array(tpot_after) if tpot_after else np.array([0.0])
+
+    return InterferenceResult(
+        tpot_during_prefill_p50_ms=float(np.percentile(during_arr, 50)),
+        tpot_during_prefill_p90_ms=float(np.percentile(during_arr, 90)),
+        tpot_after_prefill_p50_ms=float(np.percentile(after_arr, 50)),
+        prefill_ttft_ms=prefill_ttft_ms,
+        num_tokens_during_prefill=len(tpot_during),
+    )
+
+
+async def run_single_config(
+    client: httpx.AsyncClient,
+    url: str,
+    model: str,
+    decode_batch_size: int,
+    new_prefill_tokens: int,
+    chunk_size: int,
+    rep: int,
+    output_dir: Path,
+):
+    """Run one (D, P) configuration."""
+    config = Config(
+        decode_batch_size=decode_batch_size,
+        new_prefill_tokens=new_prefill_tokens,
+        chunk_size=chunk_size,
+        model=model,
+        repetition=rep,
+    )
+
+    print(f"  [rep {rep}] Running baseline (D={decode_batch_size})...")
+    baseline = await run_baseline(client, url, model, decode_batch_size)
+    if baseline is None:
+        print(f"  [rep {rep}] Baseline failed, skipping")
+        return
+
+    # Brief cooldown between baseline and interference
+    await asyncio.sleep(2.0)
+
+    print(f"  [rep {rep}] Running interference (D={decode_batch_size}, P={new_prefill_tokens})...")
+    interference = await run_interference(
+        client, url, model, decode_batch_size, new_prefill_tokens
+    )
+    if interference is None:
+        print(f"  [rep {rep}] Interference measurement failed, skipping")
+        return
+
+    # Compute derived metrics
+    tpot_penalty_p50 = interference.tpot_during_prefill_p50_ms - baseline.tpot_p50_ms
+    penalty_ratio = (
+        interference.tpot_during_prefill_p50_ms / baseline.tpot_p50_ms
+        if baseline.tpot_p50_ms > 0 else 0
+    )
+
+    result = {
+        "config": asdict(config),
+        "baseline": asdict(baseline),
+        "interference": asdict(interference),
+        "derived": {
+            "tpot_penalty_p50_ms": tpot_penalty_p50,
+            "tpot_penalty_ratio": penalty_ratio,
+        },
+    }
+
+    # Save
+    fname = f"D{decode_batch_size}_P{new_prefill_tokens}_rep{rep}.json"
+    out_path = output_dir / fname
+    out_path.write_text(json.dumps(result, indent=2))
+    print(f"  [rep {rep}] Done. penalty={tpot_penalty_p50:.1f}ms ratio={penalty_ratio:.2f}")
+
+
+async def main():
+    parser = argparse.ArgumentParser(description="Prefill-Decode Interference Microbenchmark")
+    parser.add_argument("--host", default="127.0.0.1")
+    parser.add_argument("--port", type=int, default=8000)
+    parser.add_argument("--model", default="Qwen3-Coder-30B-A3B-Instruct")
+    parser.add_argument("--decode-batch-sizes", default="0,1,2,4,6,8,12",
+                        help="Comma-separated decode batch sizes")
+    parser.add_argument("--prefill-tokens", default="512,1024,2048,4096,8192,16384,32768",
+                        help="Comma-separated prefill token counts")
+    parser.add_argument("--chunk-size", type=int, default=8192,
+                        help="vLLM max_num_batched_tokens (effective chunk size)")
+    parser.add_argument("--reps", type=int, default=5)
+    parser.add_argument("--output-dir", default="results/interference")
+    args = parser.parse_args()
+
+    decode_sizes = [int(x) for x in args.decode_batch_sizes.split(",")]
+    prefill_tokens = [int(x) for x in args.prefill_tokens.split(",")]
+
+    output_dir = Path(args.output_dir) / f"chunk{args.chunk_size}"
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    url = f"http://{args.host}:{args.port}/v1/chat/completions"
+    print(f"Target: {url}")
+    print(f"Model: {args.model}")
+    print(f"Chunk size: {args.chunk_size}")
+    print(f"Decode batch sizes: {decode_sizes}")
+    print(f"Prefill tokens: {prefill_tokens}")
+    print(f"Repetitions: {args.reps}")
+    print(f"Output: {output_dir}")
+    print()
+
+    async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client:
+        # Sanity check: is the server up?
+        try:
+            resp = await client.get(f"http://{args.host}:{args.port}/v1/models")
+            resp.raise_for_status()
+            models = resp.json()
+            print(f"Server ready. Models: {[m['id'] for m in models.get('data', [])]}")
+        except Exception as e:
+            print(f"ERROR: Cannot reach server at {args.host}:{args.port}: {e}")
+            return
+
+        total_configs = len(decode_sizes) * len(prefill_tokens)
+        done = 0
+
+        for D in decode_sizes:
+            for P in prefill_tokens:
+                done += 1
+                print(f"\n[{done}/{total_configs}] D={D}, P={P}")
+
+                for rep in range(args.reps):
+                    try:
+                        await run_single_config(
+                            client, url, args.model, D, P,
+                            args.chunk_size, rep, output_dir,
+                        )
+                    except Exception as e:
+                        print(f"  [rep {rep}] ERROR: {e}")
+
+                    # Cooldown between reps
+                    await asyncio.sleep(1.0)
+
+                # Cooldown between configs
+                await asyncio.sleep(3.0)
+
+    print("\n\nDone! Results in:", output_dir)
+    # Generate summary CSV
+    await generate_summary(output_dir, args.chunk_size)
+
+
+async def generate_summary(output_dir: Path, chunk_size: int):
+    """Aggregate all per-run JSONs into a summary CSV."""
+    import csv
+
+    rows = []
+    for f in sorted(output_dir.glob("D*_P*_rep*.json")):
+        data = json.loads(f.read_text())
+        cfg = data["config"]
+        bl = data["baseline"]
+        itf = data["interference"]
+        drv = data["derived"]
+        rows.append({
+            "chunk_size": cfg["chunk_size"],
+            "decode_batch_size": cfg["decode_batch_size"],
+            "new_prefill_tokens": cfg["new_prefill_tokens"],
+            "repetition": cfg["repetition"],
+            "tpot_baseline_p50_ms": bl["tpot_p50_ms"],
+            "tpot_baseline_p90_ms": bl["tpot_p90_ms"],
+            "tpot_during_prefill_p50_ms": itf["tpot_during_prefill_p50_ms"],
+            "tpot_during_prefill_p90_ms": itf["tpot_during_prefill_p90_ms"],
+            "tpot_after_prefill_p50_ms": itf["tpot_after_prefill_p50_ms"],
+            "prefill_ttft_ms": itf["prefill_ttft_ms"],
+            "num_tokens_during_prefill": itf["num_tokens_during_prefill"],
+            "tpot_penalty_p50_ms": drv["tpot_penalty_p50_ms"],
+            "tpot_penalty_ratio": drv["tpot_penalty_ratio"],
+        })
+
+    if not rows:
+        return
+
+    csv_path = output_dir / "summary.csv"
+    with open(csv_path, "w", newline="") as f:
+        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
+        writer.writeheader()
+        writer.writerows(rows)
+    print(f"Summary CSV written: {csv_path}")
+
+
+if __name__ == "__main__":
+    asyncio.run(main())
--- a/microbench/fresh_setup/mb1_launch.sh
+++ b/microbench/fresh_setup/mb1_launch.sh
@@ -0,0 +1,80 @@
+#!/usr/bin/env bash
+# Launch a SINGLE vLLM instance on dash1 for MB1 (prefill-decode interference).
+# No kv_connector — MB1 measures intra-GPU phase interference, not transfer.
+# chunked_prefill is enabled by default in vLLM 0.18.1 (this is the regime
+# we want to characterize: how much benefit can PD-disagg buy on top of
+# the existing chunked-prefill colocated baseline?).
+#
+# Usage:
+#   GPU=0 PORT=8000 CHUNK_TOKENS=8192 bash mb1_launch.sh start
+#   bash mb1_launch.sh status
+#   bash mb1_launch.sh stop
+
+set -eo pipefail
+
+FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh"
+VENV="${FRESH_ROOT}/.venv"
+MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
+LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb1_logs}"
+
+GPU="${GPU:-0}"
+PORT="${PORT:-8000}"
+MASTER="${MASTER:-29500}"
+# max_num_batched_tokens — controls the chunked-prefill chunk granularity.
+# vLLM 0.18.1 default is 8192; we keep that as the headline run and
+# optionally repeat at 32768 to expose the chunk-size effect.
+CHUNK_TOKENS="${CHUNK_TOKENS:-8192}"
+
+mkdir -p "${LOGS_DIR}"
+
+stop_local() {
+    pkill -9 -f "vllm serve.*--port ${PORT} " 2>/dev/null || true
+    pkill -9 -f "EngineCore" 2>/dev/null || true
+    sleep 2
+}
+
+case "${1:-start}" in
+    stop)
+        stop_local; exit 0;;
+    status)
+        if curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; then
+            echo "port ${PORT}: UP"
+        else
+            echo "port ${PORT}: DOWN"
+        fi
+        exit 0;;
+    start) ;;
+    *) echo "Unknown command: $1"; exit 1;;
+esac
+
+stop_local
+source "${VENV}/bin/activate"
+
+echo "[mb1] launching: gpu=${GPU} port=${PORT} chunk_tokens=${CHUNK_TOKENS} (no kv_connector)"
+
+PYTHONHASHSEED=42 \
+CUDA_VISIBLE_DEVICES="${GPU}" \
+MASTER_PORT="${MASTER}" \
+nohup vllm serve "${MODEL}" \
+    --host 0.0.0.0 --port "${PORT}" \
+    --tensor-parallel-size 1 \
+    --trust-remote-code --enable-prefix-caching \
+    --dtype auto --gpu-memory-utilization 0.9 \
+    --max-model-len 200000 \
+    --max-num-batched-tokens "${CHUNK_TOKENS}" \
+    --enable-prompt-tokens-details \
+    > "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" 2>&1 &
+disown
+
+echo "[mb1] waiting for /health on port ${PORT}..."
+tries=0
+while ! curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; do
+    tries=$((tries+1))
+    if [ ${tries} -gt 180 ]; then
+        echo "[mb1] FATAL port ${PORT} did not come up in 6 min"
+        tail -40 "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" || true
+        exit 1
+    fi
+    sleep 2
+done
+echo "[mb1] UP on $(hostname -s):${PORT} (GPU ${GPU}, chunk_tokens=${CHUNK_TOKENS})"
--- a/microbench/fresh_setup/plot_mb1.py
+++ b/microbench/fresh_setup/plot_mb1.py
@@ -0,0 +1,129 @@
+#!/usr/bin/env python3
+"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure.
+
+Two outputs:
+
+  mb1_interference.png
+    Effective TPOT during prefill vs prefill size, one line per D.
+    Log-log. Annotates typical agentic decode duration (~100 ms) as a
+    horizontal band so reader can spot when decode would be stalled.
+
+  pd_cost_vs_benefit.png
+    The §3.2 headline. X axis: KV size (MiB). Two stacked curves:
+      - benefit ceiling (MB1) — at most one decode-duration per request
+        of phase isolation can be recovered. Drawn as a flat 100 ms line.
+      - cost (MB2) — Mooncake pure_transfer p50 at that size.
+    Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg
+    structurally loses.
+"""
+from __future__ import annotations
+
+import argparse
+import json
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+
+def main() -> None:
+    p = argparse.ArgumentParser()
+    p.add_argument("--mb1", type=Path, required=True)
+    p.add_argument("--mb2-intra", type=Path, required=True)
+    p.add_argument("--mb2-inter", type=Path, default=None)
+    p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png"))
+    p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png"))
+    args = p.parse_args()
+
+    mb1 = json.loads(args.mb1.read_text())["summary"]
+
+    # ---- mb1_interference.png ----
+    fig, ax = plt.subplots(figsize=(9, 5.5))
+    Ds = sorted({s["decode_batch_size"] for s in mb1})
+    colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"}
+    for D in Ds:
+        rows = [s for s in mb1 if s["decode_batch_size"] == D]
+        rows.sort(key=lambda s: s["new_prefill_tokens"])
+        xs = [s["new_prefill_tokens"] for s in rows]
+        ys = [s["effective_tpot_during_ms"] for s in rows]
+        ax.plot(xs, ys, "o-", lw=2, markersize=7,
+                color=colors.get(D, "gray"),
+                label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)")
+
+    for tdec, lbl in [(50, "tool-call decode (~50 ms)"),
+                       (100, "agentic decode (~100 ms)"),
+                       (200, "long agentic decode (~200 ms)")]:
+        ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6)
+        ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444")
+
+    ax.set_xscale("log"); ax.set_yscale("log")
+    ax.set_xlabel("Prefill burst size (tokens, log)")
+    ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)")
+    ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n"
+                 "(chunked-prefill ON, vLLM 0.18.1 default, single H20)")
+    ax.grid(True, which="both", alpha=0.3)
+    ax.legend(loc="upper left", fontsize=9)
+    args.out_interf.parent.mkdir(parents=True, exist_ok=True)
+    fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig)
+    print(f"wrote {args.out_interf}")
+
+    # ---- pd_cost_vs_benefit.png ----
+    mb2_intra = json.loads(args.mb2_intra.read_text())["summary"]
+    mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64]
+    intra_x_mib = [s["kv_mib"] for s in mb2_intra]
+    intra_y_ms  = [s["pure_transfer_ms_p50"] for s in mb2_intra]
+
+    fig, ax = plt.subplots(figsize=(9, 5.5))
+    ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4,
+            markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)")
+    if args.mb2_inter:
+        mb2_inter = json.loads(args.mb2_inter.read_text())["summary"]
+        mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64]
+        inter_x = [s["kv_mib"] for s in mb2_inter]
+        inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter]
+        ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7,
+                alpha=0.7, label="MB2 inter-node (same numbers)")
+
+    # Benefit ceiling: typical agentic decode duration (PD-disagg max savings).
+    ax.axhline(100, color="#2ca02c", lw=2.4, ls="-",
+               label="MB1 max benefit ≤ agentic decode (~100 ms)")
+    ax.axhspan(50, 200, alpha=0.15, color="#2ca02c",
+               label="benefit range (50–200 ms decode)")
+
+    # Mark agentic-tail request sizes
+    for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"),
+                         (3072, "p90\n(~33k tok)"),
+                         (6144, "p95\n(~65k tok)"),
+                         (11500, "p99\n(11.5 GiB)")]:
+        ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5)
+        ax.text(kv_mib, 2, lbl, fontsize=8, color="#444",
+                ha="center", va="bottom")
+
+    ax.set_xscale("log"); ax.set_yscale("log")
+    ax.set_xlim(40, 14000)
+    ax.set_ylim(1, 12000)
+    ax.set_xlabel("Per-request KV size (MiB, log)")
+    ax.set_ylabel("Time per request (ms, log)")
+    ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n"
+                 "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)")
+    ax.grid(True, which="both", alpha=0.3)
+    ax.legend(loc="upper left", fontsize=9)
+
+    # Add explanatory annotation
+    ax.text(10000, 5000,
+            "Cost > benefit for ANY KV size above\n"
+            "the green band (~80 MiB / ~830 tokens).\n"
+            "Below: cost is marginal (<10 ms) but\n"
+            "benefit is also small (decode is short).",
+            fontsize=9, color="#333",
+            ha="right", va="top",
+            bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888"))
+
+    fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig)
+    print(f"wrote {args.out_cb}")
+
+
+if __name__ == "__main__":
+    main()