diff --git a/analysis/mb1/README.md b/analysis/mb1/README.md new file mode 100644 index 0000000..1d56b0d --- /dev/null +++ b/analysis/mb1/README.md @@ -0,0 +1,177 @@ +# MB1 — Prefill–Decode Interference (chunked-prefill on, vLLM 0.18.1 default) + +Persistent record of the phase-interference microbench used to put a +quantitative upper bound on **what PD-disaggregation can buy** under the +chunked-prefill-on baseline. Re-runs append a dated section at the +bottom; the **Summary** block is what gets cited. + +--- + +## Summary (latest) + +| Headline | Value | +|---|---| +| Baseline single-stream TPOT (D=1, idle GPU) | **4.8 ms** | +| Effective per-stream TPOT during **8k-token** prefill burst (D=8) | **114 ms (≈15× baseline)** | +| Effective per-stream TPOT during **32k-token** prefill burst (D=8) | **388 ms (≈52×)** | +| Effective per-stream TPOT during **131k-token** prefill burst (D=8) | **1419 ms (≈183×)** | +| Maximum PD-disagg benefit per agentic decode | **≤ 50–200 ms** (= decode duration) | + +**§3.2 headline (cost vs benefit, this run + MB2)**: + +> Under chunked-prefill, every ongoing decode stream is essentially +> **halted while a prefill chunk is in flight** — per-stream effective +> TPOT during the burst is 15× to 2000× baseline, scaling with prefill +> size. PD-disagg can recover this stall, but the recovery is bounded by +> the **decode duration** of the request being protected. For agentic, +> decode is 50–200 ms (tool-call output). MB2 shows PD-disagg pays +> 300 ms – 10 s of KV-transfer cost per request to do that recovery. The +> cost exceeds the benefit ceiling for any per-request KV ≥ ~80 MiB +> (~830 tokens) — well below all agentic operating points. The benefit +> never beats the cost in this workload. + +## Setup + +| Component | Value | +|---|---| +| Host | dash1, H20 96 GiB, driver 570.133.20 | +| Venv | `/home/admin/cpfs/wjh/agentic-kv-fresh/.venv` | +| vLLM | 0.18.1 official wheel (chunked-prefill default-on, V1 engine) | +| Model | `/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct` | +| Launch flags | `--tensor-parallel-size 1 --enable-prefix-caching --gpu-memory-utilization 0.9 --max-model-len 200000 --max-num-batched-tokens 8192` | +| kv_connector | **none** (this measures pure single-GPU phase interference; PD-disagg cost lives in MB2) | + +## Method + +Adapted from `microbench/interference/driver.py`: + +1. Start D streaming decode requests on `/v1/chat/completions` with a + long max_tokens cap. Discard the first 32 tokens as warmup. +2. After 1 s, inject one prefill-only request with `max_tokens=1` and + an input of `P` synthetic tokens (uuid-seeded for zero prefix-cache + reuse). Measure the prefill's TTFT. +3. Bin the *during-prefill* tokens from each decode stream by whether + their wall-clock falls inside `[prefill_inject_ts, prefill_inject_ts + + prefill_ttft]`. Report inter-token p50 / p90. +4. Bin a baseline run (D streams, no prefill injection) the same way. + +We additionally compute the **effective per-stream TPOT during the +prefill burst** as the single most informative summary: + +``` +eff_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D) +``` + +This is the average rate at which each decode stream produces tokens +while a prefill is in flight. Compared to baseline TPOT it gives the +real per-stream throughput penalty (chunked-prefill p50 looks deceptively +fine because most decode-token intervals during the burst are at normal +speed; p90 sees the stall but is itself noisy; the effective TPOT is +the cleanest "average over the whole burst window" number). + +## Results — 2026-05-27, dash1 GPU 0, chunk_tokens=8192 + +3 D × 5 P × 3 reps. Aggregated by `analyze_mb1.py`. + +| D | P (tok) | base TPOT (ms) | prefill_ttft (ms) | per-stream tokens during | effective TPOT during (ms) | penalty | max PD-disagg benefit per stream (ms) | +|--:|--:|--:|--:|--:|--:|--:|--:| +| 1 | 2 048 | 4.79 | 163 | 4.0 | 41 | 8× | 144 | +| 1 | 8 192 | 4.78 | 584 | 5.0 | 117 | 24× | 560 | +| 1 | 32 768 | 4.78 | 4 515 | 5.0 | 903 | 189× | 4 491 | +| 1 | 65 536 | 4.78 | 15 568 | 5.3 | 2 919 | 610× | 15 542 | +| 1 | 131 072 | 4.78 | 56 765 | 5.7 | 10 017 | 2 094× | 56 738 | +| 4 | 2 048 | 5.62 | 138 | 3.9 | 36 | 6× | 117 | +| 4 | 8 192 | 6.08 | 574 | 4.5 | 128 | 21× | 547 | +| 4 | 32 768 | 6.09 | 4 529 | 11.9 | 381 | 63× | 4 457 | +| 4 | 65 536 | 5.85 | 15 587 | 19.8 | 789 | 135× | 15 471 | +| 4 | 131 072 | 6.27 | 56 697 | 37.4 | 1 517 | 242× | 56 463 | +| 8 | 2 048 | 7.71 | 143 | 4.5 | 32 | 4× | 109 | +| 8 | 8 192 | 7.69 | 583 | 5.1 | 114 | 15× | 544 | +| 8 | 32 768 | 7.42 | 4 520 | 11.7 | 387 | 52× | 4 434 | +| 8 | 65 536 | 7.67 | 15 615 | 20.6 | 757 | 99× | 15 457 | +| 8 | 131 072 | 7.74 | 56 991 | 40.2 | 1 419 | 183× | 56 680 | + +**Reading the table**: + +- *Baseline TPOT* grows mildly with D (4.8 ms → 7.7 ms as D goes 1 → 8). + Multi-stream decoding has small but nonzero contention even without + prefill. +- *Effective TPOT during* grows mostly with P: a single 8k prefill stalls + decode for ~580 ms regardless of D, so each stream emits only a handful + of tokens during that 580 ms window — effective per-stream TPOT + collapses to 100–130 ms. Larger prefill = more chunks = larger stall. +- *Penalty* is the eff/baseline ratio. Above 50× for P ≥ 32k. Above + 500× for D=1 at P ≥ 65k. +- *Max PD-disagg benefit per stream* = `prefill_ttft − per_stream_tokens + × baseline_TPOT` ≈ `prefill_ttft` (since interference essentially + halts decode). This is the entire prefill duration's worth of decode + time that could in principle be recovered. + +Two big caveats for **agentic** application: + +1. **Decode is short** (~50–200 ms for tool-call output). The actual + recoverable benefit per request is bounded by the decode duration, + not by `prefill_ttft`. If a decode lasts 100 ms and a 5-second prefill + collides with it, PD-disagg can save at most 100 ms — not 5 s. +2. **PD-disagg pays KV-transfer cost** (MB2: 300 ms – 10 s per request + for agentic sizes). For any KV ≥ ~80 MiB the cost already exceeds the + ~100 ms benefit ceiling. Cost > benefit across the whole agentic + distribution. + +## §3.2 cost-vs-benefit figure + +`figs/pd_cost_vs_benefit.png` overlays MB1 benefit ceiling (50–200 ms +band, capped by decode duration) on top of MB2 transfer cost curve. The +cost curve crosses the benefit ceiling somewhere around **80 MiB / 830 +tokens** of KV — well below the trace mean (192 MiB / 2k tok ≈ trace +mean per request KV, and we know agentic averages 33k tokens, p99 +125k). For anything bigger PD-disagg pays more than it can recover. + +## Reproduction + +```bash +# vllm pair-free single-instance launch +ssh dash1 'GPU=0 PORT=8000 CHUNK_TOKENS=8192 \ + bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh start' + +# sweep +ssh dash1 'source /home/admin/cpfs/wjh/agentic-kv-fresh/.venv/bin/activate && \ + python /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_driver.py \ + --host 127.0.0.1 --port 8000 \ + --model /home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct \ + --decode-batch-sizes 1,4,8 --prefill-tokens 2048,8192,32768,65536,131072 \ + --reps 3 --output-dir /home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results' + +# pull + analyze +scp dash1:/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/summary.csv \ + analysis/mb1/summary.csv +.venv/bin/python microbench/fresh_setup/analyze_mb1.py \ + --summary analysis/mb1/summary.csv --out analysis/mb1/breakdown.json +.venv/bin/python microbench/fresh_setup/plot_mb1.py \ + --mb1 analysis/mb1/breakdown.json \ + --mb2-intra analysis/mb2/intra_kvboth_breakdown.json \ + --mb2-inter analysis/mb2/inter_kvboth_breakdown.json + +# teardown +ssh dash1 'bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb1_launch.sh stop' +``` + +## Open questions / next runs + +- **Chunk size sensitivity**: this run uses `--max-num-batched-tokens + 8192`. Sarathi-Serve goes smaller (e.g. 1024) and recovers more + decode interleaving inside each prefill burst. Worth running + chunk_tokens ∈ {1024, 2048, 4096, 16384} to map the chunk-size axis. +- **Higher D**: 12, 16 streams to see whether the penalty saturates or + keeps shrinking per-stream. +- **Cross-validate effective_TPOT_during with token-time-series plot**: + raw per-token timestamps could reveal whether the stall is a few big + spikes or many small ones (currently inferred from p50/p90 spread). + +## Run log + +### 2026-05-27 — dash1 GPU 0, chunk_tokens=8192 + +3 × 5 × 3 sweep. CSV: `analysis/mb1/summary.csv`. Per-config JSONs on +dash1 at `/home/admin/cpfs/wjh/agentic-kv-fresh/mb1_results/chunk8192/`. +Figures: `figs/mb1_interference.png`, `figs/pd_cost_vs_benefit.png`. diff --git a/analysis/mb1/breakdown.json b/analysis/mb1/breakdown.json new file mode 100644 index 0000000..100dddd --- /dev/null +++ b/analysis/mb1/breakdown.json @@ -0,0 +1,199 @@ +{ + "summary": [ + { + "decode_batch_size": 1, + "new_prefill_tokens": 2048, + "baseline_tpot_ms": 4.79, + "during_tpot_p50_ms_raw": 35.43, + "during_tpot_p90_ms_raw": 79.91, + "prefill_ttft_ms": 163.3, + "num_tokens_during_prefill_total": 4.0, + "per_stream_tokens_during": 4.0, + "effective_tpot_during_ms": 40.8, + "interference_penalty_x": 8.5, + "max_pd_disagg_benefit_ms_per_stream": 144.2 + }, + { + "decode_batch_size": 1, + "new_prefill_tokens": 8192, + "baseline_tpot_ms": 4.78, + "during_tpot_p50_ms_raw": 6.56, + "during_tpot_p90_ms_raw": 328.57, + "prefill_ttft_ms": 583.9, + "num_tokens_during_prefill_total": 5.0, + "per_stream_tokens_during": 5.0, + "effective_tpot_during_ms": 116.8, + "interference_penalty_x": 24.4, + "max_pd_disagg_benefit_ms_per_stream": 560.0 + }, + { + "decode_batch_size": 1, + "new_prefill_tokens": 32768, + "baseline_tpot_ms": 4.78, + "during_tpot_p50_ms_raw": 4.75, + "during_tpot_p90_ms_raw": 4.9, + "prefill_ttft_ms": 4515.3, + "num_tokens_during_prefill_total": 5.0, + "per_stream_tokens_during": 5.0, + "effective_tpot_during_ms": 903.1, + "interference_penalty_x": 188.8, + "max_pd_disagg_benefit_ms_per_stream": 4491.4 + }, + { + "decode_batch_size": 1, + "new_prefill_tokens": 65536, + "baseline_tpot_ms": 4.78, + "during_tpot_p50_ms_raw": 4.69, + "during_tpot_p90_ms_raw": 4.97, + "prefill_ttft_ms": 15567.6, + "num_tokens_during_prefill_total": 5.3, + "per_stream_tokens_during": 5.33, + "effective_tpot_during_ms": 2918.9, + "interference_penalty_x": 610.2, + "max_pd_disagg_benefit_ms_per_stream": 15542.0 + }, + { + "decode_batch_size": 1, + "new_prefill_tokens": 131072, + "baseline_tpot_ms": 4.78, + "during_tpot_p50_ms_raw": 4.71, + "during_tpot_p90_ms_raw": 4.9, + "prefill_ttft_ms": 56765.2, + "num_tokens_during_prefill_total": 5.7, + "per_stream_tokens_during": 5.67, + "effective_tpot_during_ms": 10017.4, + "interference_penalty_x": 2094.5, + "max_pd_disagg_benefit_ms_per_stream": 56738.1 + }, + { + "decode_batch_size": 4, + "new_prefill_tokens": 2048, + "baseline_tpot_ms": 5.62, + "during_tpot_p50_ms_raw": 22.18, + "during_tpot_p90_ms_raw": 84.85, + "prefill_ttft_ms": 138.3, + "num_tokens_during_prefill_total": 15.5, + "per_stream_tokens_during": 3.88, + "effective_tpot_during_ms": 35.7, + "interference_penalty_x": 6.3, + "max_pd_disagg_benefit_ms_per_stream": 116.6 + }, + { + "decode_batch_size": 4, + "new_prefill_tokens": 8192, + "baseline_tpot_ms": 6.08, + "during_tpot_p50_ms_raw": 8.45, + "during_tpot_p90_ms_raw": 515.39, + "prefill_ttft_ms": 574.1, + "num_tokens_during_prefill_total": 18.0, + "per_stream_tokens_during": 4.5, + "effective_tpot_during_ms": 127.6, + "interference_penalty_x": 21.0, + "max_pd_disagg_benefit_ms_per_stream": 546.8 + }, + { + "decode_batch_size": 4, + "new_prefill_tokens": 32768, + "baseline_tpot_ms": 6.09, + "during_tpot_p50_ms_raw": 9.83, + "during_tpot_p90_ms_raw": 1314.87, + "prefill_ttft_ms": 4529.1, + "num_tokens_during_prefill_total": 47.5, + "per_stream_tokens_during": 11.88, + "effective_tpot_during_ms": 381.4, + "interference_penalty_x": 62.7, + "max_pd_disagg_benefit_ms_per_stream": 4456.9 + }, + { + "decode_batch_size": 4, + "new_prefill_tokens": 65536, + "baseline_tpot_ms": 5.85, + "during_tpot_p50_ms_raw": 6.41, + "during_tpot_p90_ms_raw": 2077.47, + "prefill_ttft_ms": 15586.5, + "num_tokens_during_prefill_total": 79.0, + "per_stream_tokens_during": 19.75, + "effective_tpot_during_ms": 789.2, + "interference_penalty_x": 135.0, + "max_pd_disagg_benefit_ms_per_stream": 15471.0 + }, + { + "decode_batch_size": 4, + "new_prefill_tokens": 131072, + "baseline_tpot_ms": 6.27, + "during_tpot_p50_ms_raw": 6.3, + "during_tpot_p90_ms_raw": 4405.18, + "prefill_ttft_ms": 56697.1, + "num_tokens_during_prefill_total": 149.5, + "per_stream_tokens_during": 37.38, + "effective_tpot_during_ms": 1517.0, + "interference_penalty_x": 241.8, + "max_pd_disagg_benefit_ms_per_stream": 56462.6 + }, + { + "decode_batch_size": 8, + "new_prefill_tokens": 2048, + "baseline_tpot_ms": 7.71, + "during_tpot_p50_ms_raw": 8.38, + "during_tpot_p90_ms_raw": 98.98, + "prefill_ttft_ms": 143.1, + "num_tokens_during_prefill_total": 35.7, + "per_stream_tokens_during": 4.46, + "effective_tpot_during_ms": 32.1, + "interference_penalty_x": 4.2, + "max_pd_disagg_benefit_ms_per_stream": 108.8 + }, + { + "decode_batch_size": 8, + "new_prefill_tokens": 8192, + "baseline_tpot_ms": 7.69, + "during_tpot_p50_ms_raw": 9.34, + "during_tpot_p90_ms_raw": 519.29, + "prefill_ttft_ms": 583.3, + "num_tokens_during_prefill_total": 41.0, + "per_stream_tokens_during": 5.12, + "effective_tpot_during_ms": 113.8, + "interference_penalty_x": 14.8, + "max_pd_disagg_benefit_ms_per_stream": 543.9 + }, + { + "decode_batch_size": 8, + "new_prefill_tokens": 32768, + "baseline_tpot_ms": 7.42, + "during_tpot_p50_ms_raw": 11.61, + "during_tpot_p90_ms_raw": 1315.48, + "prefill_ttft_ms": 4520.3, + "num_tokens_during_prefill_total": 93.3, + "per_stream_tokens_during": 11.67, + "effective_tpot_during_ms": 387.5, + "interference_penalty_x": 52.2, + "max_pd_disagg_benefit_ms_per_stream": 4433.7 + }, + { + "decode_batch_size": 8, + "new_prefill_tokens": 65536, + "baseline_tpot_ms": 7.67, + "during_tpot_p50_ms_raw": 19.09, + "during_tpot_p90_ms_raw": 2471.4, + "prefill_ttft_ms": 15615.5, + "num_tokens_during_prefill_total": 165.0, + "per_stream_tokens_during": 20.62, + "effective_tpot_during_ms": 757.1, + "interference_penalty_x": 98.8, + "max_pd_disagg_benefit_ms_per_stream": 15457.4 + }, + { + "decode_batch_size": 8, + "new_prefill_tokens": 131072, + "baseline_tpot_ms": 7.74, + "during_tpot_p50_ms_raw": 11.51, + "during_tpot_p90_ms_raw": 4895.27, + "prefill_ttft_ms": 56991.4, + "num_tokens_during_prefill_total": 321.3, + "per_stream_tokens_during": 40.17, + "effective_tpot_during_ms": 1418.9, + "interference_penalty_x": 183.3, + "max_pd_disagg_benefit_ms_per_stream": 56680.4 + } + ] +} \ No newline at end of file diff --git a/analysis/mb1/summary.csv b/analysis/mb1/summary.csv new file mode 100644 index 0000000..04e055c --- /dev/null +++ b/analysis/mb1/summary.csv @@ -0,0 +1,46 @@ +chunk_size,decode_batch_size,new_prefill_tokens,repetition,tpot_baseline_p50_ms,tpot_baseline_p90_ms,tpot_during_prefill_p50_ms,tpot_during_prefill_p90_ms,tpot_after_prefill_p50_ms,prefill_ttft_ms,num_tokens_during_prefill,tpot_penalty_p50_ms,tpot_penalty_ratio +8192,1,131072,0,4.777565016411245,4.900234832894057,4.701301048044115,4.948397364933044,0.0,56719.25117995124,7,-0.07626396836712956,0.9840370632099913 +8192,1,131072,1,4.779465030878782,4.883405601140112,4.707481013610959,4.85471700085327,0.0,56696.089847013354,5,-0.07198401726782322,0.9849388965495606 +8192,1,131072,2,4.790953011251986,4.880544205661863,4.728371975943446,4.907831805758178,0.0,56880.19039196661,5,-0.06258103530853987,0.9869376645603573 +8192,1,2048,0,4.77885699365288,4.894876398611814,41.434570477576926,88.97331730695441,0.0,183.2046649651602,4,36.655713483924046,8.670393471202205 +8192,1,2048,1,4.788161953911185,4.949774022679776,41.68213551747613,83.5143867880106,0.0,175.55483896285295,4,36.89397356356494,8.705247633369687 +8192,1,2048,2,4.7893429873511195,4.874200583435595,23.186982492916286,67.25202781381086,0.0,131.23180496040732,4,18.397639505565166,4.841370215946989 +8192,1,32768,0,4.789774015080184,4.870833398308605,4.738486022688448,4.886626999359578,0.0,4500.839321000967,5,-0.051287992391735315,0.9892921895207875 +8192,1,32768,1,4.776834975928068,4.891659819986671,4.729953012429178,4.9245511763729155,0.0,4496.073378017172,5,-0.0468819634988904,0.9901855593221991 +8192,1,32768,2,4.784431017469615,4.866032593417913,4.782894975505769,4.8977664206177,0.0,4549.013931944501,5,-0.0015360419638454914,0.9996789499193871 +8192,1,65536,0,4.778854956384748,4.9255444086156785,4.633405013009906,4.895579582080245,0.0,15530.37424501963,5,-0.1454499433748424,0.9695638506080803 +8192,1,65536,1,4.784283053595573,4.8808404128067195,4.754905996378511,4.985795798711479,0.0,15584.887631004676,5,-0.02937705721706152,0.99385967408534 +8192,1,65536,2,4.787993966601789,4.9004736240021884,4.6836750116199255,5.0271204963792115,0.0,15587.390075030271,6,-0.1043189549818635,0.9782123879625725 +8192,1,8192,0,4.785028984770179,4.878618801012635,7.490115996915847,324.06569679733366,0.0,573.2795029762201,5,2.7050870121456683,1.565323014919123 +8192,1,8192,1,4.778591974172741,4.899543372448534,5.9131429879926145,336.8099076091312,0.0,606.6823820001446,5,1.1345510138198733,1.237423705550061 +8192,1,8192,2,4.78826800826937,4.90188361145556,6.276679981965572,324.8370993998833,0.0,571.7499859747477,5,1.488411973696202,1.310845585736994 +8192,4,131072,0,6.113810988608748,6.309205386787653,0.0,0.0,0.0,56702.702289039735,0,-6.113810988608748,0.0 +8192,4,131072,1,6.630807969486341,7.086459483252838,6.2820459716022015,4400.500871409893,0.0,56807.70832300186,150,-0.3487619978841394,0.9474027902045915 +8192,4,131072,2,6.073819473385811,6.344516028184444,6.326125003397465,4409.856556192978,0.0,56580.784838995896,149,0.2523055300116539,1.0415398467335428 +8192,4,2048,0,5.402160517405719,5.543816485442221,6.210724503034726,84.62208869168535,6.125201500253752,140.3041940066032,18,0.8085639856290072,1.1496741873966574 +8192,4,2048,1,6.067108013667166,6.381415005307645,0.0,0.0,0.0,140.06177097326145,0,-6.067108013667166,0.0 +8192,4,2048,2,5.400336522143334,5.536347016459331,38.15686801681295,85.07051098858938,5.25214200024493,134.67552902875468,13,32.756531494669616,7.065646346363043 +8192,4,32768,0,6.115561525803059,6.369604001520202,7.216634490760043,1314.6978712815326,5.17624247004278,4522.433568025008,50,1.101072964956984,1.1800444587649532 +8192,4,32768,1,6.070095987524837,6.3612310332246125,0.0,0.0,0.0,4508.074064040557,0,-6.070095987524837,0.0 +8192,4,32768,2,6.0734800063073635,6.312666402664036,12.442811043001711,1315.0411327951588,4.754714027512819,4556.892123946454,45,6.369331036694348,2.0487119460473635 +8192,4,65536,0,5.406292999396101,5.540905491216108,0.0,0.0,0.0,15581.590663990937,0,-5.406292999396101,0.0 +8192,4,65536,1,6.076910009142011,6.315114628523588,0.0,0.0,0.0,15574.196094006766,0,-6.076910009142011,0.0 +8192,4,65536,2,6.060379033442587,6.384042033459991,6.411670008674264,2077.4700703914277,4.8022730043157935,15603.720718005206,79,0.3512909752316773,1.0579651822589267 +8192,4,8192,0,6.110575021011755,6.416070973500609,8.451583969872445,515.3855616226792,5.358011490898207,574.6672929963097,18,2.34100894886069,1.3831077993169092 +8192,4,8192,1,6.051429023500532,6.398122606333345,0.0,0.0,0.0,573.6081749782898,0,-6.051429023500532,0.0 +8192,4,8192,2,6.064729997888207,6.366449000779539,0.0,0.0,0.0,574.1707819979638,0,-6.064729997888207,0.0 +8192,8,131072,0,7.737616979284212,7.99839201499708,10.740376019384712,4742.438135773409,7.792441989295185,57010.66731195897,335,3.0027590401005,1.388072845701685 +8192,8,131072,1,7.744895527139306,8.013638522243127,8.647068490972742,5123.228083999129,7.672236970392987,56970.40947602363,310,0.9021729638334364,1.116486137310966 +8192,8,131072,2,7.740180502878502,8.016240986762568,15.140031988266855,4820.136589207682,7.68946303287521,56993.02393599646,319,7.3998514853883535,1.9560308680962177 +8192,8,2048,0,7.741285488009453,8.022559515666217,8.103576023131609,124.87094267853536,7.6825070136692375,141.97922096354887,30,0.36229053512215614,1.046799789993963 +8192,8,2048,1,7.728310010861605,8.021069981623441,8.17067950265482,84.82906777062453,7.745136506855488,144.1582590341568,38,0.4423694917932153,1.0572401328584768 +8192,8,2048,2,7.662211020942777,8.034424972720444,8.87883099494502,87.23540699575096,7.592331967316568,143.27958395006135,39,1.216619974002242,1.1587818412566437 +8192,8,32768,0,7.295333489309996,7.422819995554164,11.429400008637458,1315.43214758276,7.8034960315562785,4523.641717038117,94,4.134066519327462,1.5666727265292526 +8192,8,32768,1,7.278127042809501,7.490781514206901,12.640403030673042,1315.491412486881,7.821676495950669,4519.993302994408,90,5.362275987863541,1.736765922925357 +8192,8,32768,2,7.684049021918327,8.047712198458612,10.752685484476388,1315.5166705255397,7.80402502277866,4517.200137954205,96,3.068636462558061,1.3993514947399404 +8192,8,65536,0,7.708174001891166,8.017168991500512,26.662671996746212,2496.8427699001018,7.768569514155388,15603.601168957539,160,18.954497994855046,3.459012729889679 +8192,8,65536,1,7.594842027174309,7.9874323040712625,13.054963492322713,2459.1690181812737,7.54699349636212,15620.474929979537,174,5.460121465148404,1.7189249553331216 +8192,8,65536,2,7.693717983784154,7.933055714238435,17.5579380011186,2458.176895044744,7.808708498487249,15622.32490995666,161,9.864220017334446,2.2821135422594123 +8192,8,8192,0,7.636573514901102,7.904737605713308,10.151655005756766,514.8188057704829,7.7977380133233964,575.7745200535282,37,2.515081490855664,1.3293468577167538 +8192,8,8192,1,7.687711506150663,7.965393498307094,9.002390026580542,524.0793236298487,7.753994490485638,592.1044679707848,45,1.3146785204298794,1.1710103870804793 +8192,8,8192,2,7.756220467854291,8.035426988499239,8.864110975991935,518.9726910321042,7.770269992761314,581.98908099439,41,1.1078905081376433,1.1428389655411813 diff --git a/figs/mb1_interference.png b/figs/mb1_interference.png new file mode 100644 index 0000000..8e660d6 Binary files /dev/null and b/figs/mb1_interference.png differ diff --git a/figs/pd_cost_vs_benefit.png b/figs/pd_cost_vs_benefit.png new file mode 100644 index 0000000..b1f181f Binary files /dev/null and b/figs/pd_cost_vs_benefit.png differ diff --git a/microbench/fresh_setup/analyze_mb1.py b/microbench/fresh_setup/analyze_mb1.py new file mode 100644 index 0000000..781cc02 --- /dev/null +++ b/microbench/fresh_setup/analyze_mb1.py @@ -0,0 +1,98 @@ +#!/usr/bin/env python3 +"""Aggregate MB1 results: per-(D, P) baseline vs during-prefill effective TPOT. + +The driver's `tpot_during_prefill_p50_ms` is computed per-token and can be +misleading: chunked-prefill schedules decode alongside each prefill chunk, +so most decode-token intervals during the prefill burst look "normal" — but +each chunk completion creates a long-stall token. p50 hides this, p90 +exposes it, but the BEST single-number summary of "how much was decode +slowed by prefill" is the *effective TPOT during the prefill burst*: + + effective_TPOT_during = prefill_ttft_ms / (num_tokens_during_prefill / D) + +i.e. wall-clock time divided by per-stream tokens emitted in that window. +This captures the true average throughput of each decode stream while a +prefill burst is underway. Compared to baseline_TPOT it gives the +"phase-interference penalty" PD-disagg could in principle recover. +""" +from __future__ import annotations + +import argparse +import csv +import json +import statistics +from collections import defaultdict +from pathlib import Path + + +def main() -> None: + p = argparse.ArgumentParser() + p.add_argument("--summary", type=Path, required=True) + p.add_argument("--out", type=Path, required=True) + args = p.parse_args() + + rows = list(csv.DictReader(args.summary.open())) + by_dp: dict[tuple[int, int], list[dict]] = defaultdict(list) + for r in rows: + D = int(r["decode_batch_size"]) + P = int(r["new_prefill_tokens"]) + by_dp[(D, P)].append(r) + + summary = [] + for (D, P) in sorted(by_dp): + rs = by_dp[(D, P)] + base = statistics.mean(float(r["tpot_baseline_p50_ms"]) for r in rs) + during_p50_vals = [float(r["tpot_during_prefill_p50_ms"]) for r in rs + if float(r["tpot_during_prefill_p50_ms"]) > 0] + during_p90_vals = [float(r["tpot_during_prefill_p90_ms"]) for r in rs + if float(r["tpot_during_prefill_p90_ms"]) > 0] + ttft_vals = [float(r["prefill_ttft_ms"]) for r in rs] + n_tok_vals = [float(r["num_tokens_during_prefill"]) for r in rs + if float(r["num_tokens_during_prefill"]) > 0] + + if not n_tok_vals or D == 0: + continue + ttft = statistics.mean(ttft_vals) + n_tok_total = statistics.mean(n_tok_vals) + per_stream_tokens = n_tok_total / D + eff_tpot_during = ttft / per_stream_tokens if per_stream_tokens > 0 else 0 + penalty_x = eff_tpot_during / base if base > 0 else 0 + + # PD-disagg potential benefit (per stream, ms): + # if decode ran at baseline rate throughout the prefill window, + # it would emit ttft/baseline tokens. Actual is per_stream_tokens. + # Time saved if no interference = ttft - per_stream_tokens * baseline + time_saved_per_stream = ttft - per_stream_tokens * base + + summary.append({ + "decode_batch_size": D, + "new_prefill_tokens": P, + "baseline_tpot_ms": round(base, 2), + "during_tpot_p50_ms_raw": (round(statistics.mean(during_p50_vals), 2) + if during_p50_vals else None), + "during_tpot_p90_ms_raw": (round(statistics.mean(during_p90_vals), 2) + if during_p90_vals else None), + "prefill_ttft_ms": round(ttft, 1), + "num_tokens_during_prefill_total": round(n_tok_total, 1), + "per_stream_tokens_during": round(per_stream_tokens, 2), + "effective_tpot_during_ms": round(eff_tpot_during, 1), + "interference_penalty_x": round(penalty_x, 1), + "max_pd_disagg_benefit_ms_per_stream": round(time_saved_per_stream, 1), + }) + + args.out.parent.mkdir(parents=True, exist_ok=True) + args.out.write_text(json.dumps({"summary": summary}, indent=2)) + + print(f"{'D':>3} {'P':>7} {'base_ms':>9} {'eff_during_ms':>15} " + f"{'penalty':>10} {'pd_benefit_ms':>15}") + for s in summary: + print(f"{s['decode_batch_size']:>3} {s['new_prefill_tokens']:>7} " + f"{s['baseline_tpot_ms']:>9.2f} " + f"{s['effective_tpot_during_ms']:>15.1f} " + f"{s['interference_penalty_x']:>9.1f}x " + f"{s['max_pd_disagg_benefit_ms_per_stream']:>15.0f}") + print(f"\nwrote {args.out}") + + +if __name__ == "__main__": + main() diff --git a/microbench/fresh_setup/mb1_driver.py b/microbench/fresh_setup/mb1_driver.py new file mode 100644 index 0000000..4eeb909 --- /dev/null +++ b/microbench/fresh_setup/mb1_driver.py @@ -0,0 +1,422 @@ +#!/usr/bin/env python3 +"""Prefill-Decode Interference Microbenchmark Driver. + +Measures TPOT degradation caused by prefill chunks interfering with ongoing decode batches. +Produces: f(decode_batch_size, new_prefill_tokens, chunk_size) -> TPOT_penalty_ms + +Usage: + python driver.py --host 127.0.0.1 --port 8000 \ + --decode-batch-sizes 0,1,2,4,6,8,12 \ + --prefill-tokens 512,1024,2048,4096,8192,16384,32768 \ + --reps 5 --output-dir results/ +""" + +import argparse +import asyncio +import json +import os +import time +from dataclasses import dataclass, asdict +from pathlib import Path +from typing import Optional + +import httpx +import numpy as np + + +FIXED_SEED_PROMPT = ( + "You are a helpful assistant. Please analyze the following document carefully " + "and provide a comprehensive summary covering all key points, main arguments, " + "supporting evidence, and conclusions. The document discusses various aspects " + "of distributed systems, including consensus protocols, fault tolerance mechanisms, " + "and performance optimization strategies for large-scale deployments.\n\n" +) * 50 # ~4k tokens worth of repeated text for prefix cache sharing + +WARMUP_TOKENS = 32 +MEASURE_WINDOW_TOKENS = 500 + + +@dataclass +class Config: + decode_batch_size: int + new_prefill_tokens: int + chunk_size: int + model: str + repetition: int + + +@dataclass +class BaselineResult: + tpot_p50_ms: float + tpot_p90_ms: float + tpot_p99_ms: float + tokens_collected: int + + +@dataclass +class InterferenceResult: + tpot_during_prefill_p50_ms: float + tpot_during_prefill_p90_ms: float + tpot_after_prefill_p50_ms: float + prefill_ttft_ms: float + num_tokens_during_prefill: int + + +async def stream_tokens(client: httpx.AsyncClient, url: str, payload: dict) -> list[float]: + """Send a streaming request, return list of timestamps (seconds) for each token.""" + timestamps = [] + async with client.stream("POST", url, json=payload, timeout=300.0) as resp: + resp.raise_for_status() + async for line in resp.aiter_lines(): + if line.startswith("data: "): + data = line[6:] + if data.strip() == "[DONE]": + break + try: + chunk = json.loads(data) + choices = chunk.get("choices", []) + if not choices: + continue + delta = choices[0].get("delta", {}) + if "role" in delta: + continue + timestamps.append(time.perf_counter()) + except json.JSONDecodeError: + continue + return timestamps + + +def compute_tpot(timestamps: list[float], skip_first: int = 0) -> np.ndarray: + """Compute inter-token intervals in ms, skipping first N tokens.""" + if len(timestamps) < skip_first + 2: + return np.array([]) + ts = np.array(timestamps[skip_first:]) + return np.diff(ts) * 1000.0 # seconds → ms + + +def make_decode_payload(model: str) -> dict: + return { + "model": model, + "messages": [{"role": "user", "content": FIXED_SEED_PROMPT}], + "max_tokens": WARMUP_TOKENS + MEASURE_WINDOW_TOKENS + 50, + "temperature": 0, + "stream": True, + } + + +def make_prefill_payload(model: str, num_tokens: int) -> dict: + import hashlib + import uuid + # Generate UNIQUE content every call to guarantee zero prefix cache hits. + # Calibration: each "Block N: <32-hex>" → ~35 tokens after tokenization + unique_id = f"{uuid.uuid4().hex}_{time.time_ns()}" + n_parts = max(1, num_tokens // 35) + content_parts = [] + for i in range(n_parts): + seed = hashlib.md5(f"{unique_id}_{i}".encode()).hexdigest() + content_parts.append(f"Block {i}: {seed}") + content = " ".join(content_parts) + return { + "model": model, + "messages": [{"role": "user", "content": content}], + "max_tokens": 1, + "temperature": 0, + "stream": True, + } + + +async def wait_for_steady_state(decode_streams: list[asyncio.Task], min_tokens: int = 32): + """Wait until all decode streams have emitted at least min_tokens.""" + # We don't directly control this — we wait a fixed time based on expected TPOT + # At ~50ms/token, 32 tokens ≈ 1.6s. Wait 3s to be safe. + await asyncio.sleep(3.0) + + +async def run_baseline( + client: httpx.AsyncClient, url: str, model: str, decode_batch_size: int +) -> Optional[BaselineResult]: + """Measure decode-only TPOT (no prefill interference).""" + if decode_batch_size == 0: + return BaselineResult(tpot_p50_ms=0, tpot_p90_ms=0, tpot_p99_ms=0, tokens_collected=0) + + payloads = [make_decode_payload(model) for _ in range(decode_batch_size)] + tasks = [asyncio.create_task(stream_tokens(client, url, p)) for p in payloads] + + all_timestamps = await asyncio.gather(*tasks, return_exceptions=True) + + all_tpots = [] + for ts in all_timestamps: + if isinstance(ts, Exception): + print(f" [WARN] decode stream error: {ts}") + continue + tpot = compute_tpot(ts, skip_first=WARMUP_TOKENS) + if len(tpot) > 0: + all_tpots.extend(tpot.tolist()) + + if not all_tpots: + return None + + arr = np.array(all_tpots) + return BaselineResult( + tpot_p50_ms=float(np.percentile(arr, 50)), + tpot_p90_ms=float(np.percentile(arr, 90)), + tpot_p99_ms=float(np.percentile(arr, 99)), + tokens_collected=len(arr), + ) + + +async def run_interference( + client: httpx.AsyncClient, + url: str, + model: str, + decode_batch_size: int, + new_prefill_tokens: int, +) -> Optional[InterferenceResult]: + """Measure TPOT while a prefill request is being processed.""" + if decode_batch_size == 0: + # No decode to interfere with; just measure prefill TTFT + prefill_payload = make_prefill_payload(model, new_prefill_tokens) + t_start = time.perf_counter() + ts = await stream_tokens(client, url, prefill_payload) + prefill_ttft = (ts[0] - t_start) * 1000.0 if ts else 0 + return InterferenceResult( + tpot_during_prefill_p50_ms=0, + tpot_during_prefill_p90_ms=0, + tpot_after_prefill_p50_ms=0, + prefill_ttft_ms=prefill_ttft, + num_tokens_during_prefill=0, + ) + + # Phase 1: Start decode streams + decode_payloads = [make_decode_payload(model) for _ in range(decode_batch_size)] + + decode_timestamps: list[list[float]] = [[] for _ in range(decode_batch_size)] + prefill_done_event = asyncio.Event() + prefill_ttft_ms = 0.0 + prefill_inject_time = 0.0 + + async def decode_stream_with_tracking(idx: int, payload: dict): + timestamps = await stream_tokens(client, url, payload) + decode_timestamps[idx] = timestamps + + async def prefill_after_warmup(): + nonlocal prefill_ttft_ms, prefill_inject_time + # Wait for decode streams to stabilize + await asyncio.sleep(1.0) + prefill_inject_time = time.perf_counter() + prefill_payload = make_prefill_payload(model, new_prefill_tokens) + ts = await stream_tokens(client, url, prefill_payload) + if ts: + prefill_ttft_ms = (ts[0] - prefill_inject_time) * 1000.0 + prefill_done_event.set() + + # Launch all + decode_tasks = [ + asyncio.create_task(decode_stream_with_tracking(i, p)) + for i, p in enumerate(decode_payloads) + ] + prefill_task = asyncio.create_task(prefill_after_warmup()) + + await asyncio.gather(*decode_tasks, prefill_task, return_exceptions=True) + + # Analyze: split decode tokens into "during prefill" and "after prefill" + prefill_end_time = prefill_inject_time + prefill_ttft_ms / 1000.0 + + tpot_during = [] + tpot_after = [] + + for ts_list in decode_timestamps: + if len(ts_list) < WARMUP_TOKENS + 5: + continue + for i in range(WARMUP_TOKENS + 1, len(ts_list)): + t_prev = ts_list[i - 1] + t_curr = ts_list[i] + interval_ms = (t_curr - t_prev) * 1000.0 + + if prefill_inject_time <= t_prev <= prefill_end_time: + tpot_during.append(interval_ms) + elif t_curr > prefill_end_time + 0.05: # 50ms after prefill settles + tpot_after.append(interval_ms) + + during_arr = np.array(tpot_during) if tpot_during else np.array([0.0]) + after_arr = np.array(tpot_after) if tpot_after else np.array([0.0]) + + return InterferenceResult( + tpot_during_prefill_p50_ms=float(np.percentile(during_arr, 50)), + tpot_during_prefill_p90_ms=float(np.percentile(during_arr, 90)), + tpot_after_prefill_p50_ms=float(np.percentile(after_arr, 50)), + prefill_ttft_ms=prefill_ttft_ms, + num_tokens_during_prefill=len(tpot_during), + ) + + +async def run_single_config( + client: httpx.AsyncClient, + url: str, + model: str, + decode_batch_size: int, + new_prefill_tokens: int, + chunk_size: int, + rep: int, + output_dir: Path, +): + """Run one (D, P) configuration.""" + config = Config( + decode_batch_size=decode_batch_size, + new_prefill_tokens=new_prefill_tokens, + chunk_size=chunk_size, + model=model, + repetition=rep, + ) + + print(f" [rep {rep}] Running baseline (D={decode_batch_size})...") + baseline = await run_baseline(client, url, model, decode_batch_size) + if baseline is None: + print(f" [rep {rep}] Baseline failed, skipping") + return + + # Brief cooldown between baseline and interference + await asyncio.sleep(2.0) + + print(f" [rep {rep}] Running interference (D={decode_batch_size}, P={new_prefill_tokens})...") + interference = await run_interference( + client, url, model, decode_batch_size, new_prefill_tokens + ) + if interference is None: + print(f" [rep {rep}] Interference measurement failed, skipping") + return + + # Compute derived metrics + tpot_penalty_p50 = interference.tpot_during_prefill_p50_ms - baseline.tpot_p50_ms + penalty_ratio = ( + interference.tpot_during_prefill_p50_ms / baseline.tpot_p50_ms + if baseline.tpot_p50_ms > 0 else 0 + ) + + result = { + "config": asdict(config), + "baseline": asdict(baseline), + "interference": asdict(interference), + "derived": { + "tpot_penalty_p50_ms": tpot_penalty_p50, + "tpot_penalty_ratio": penalty_ratio, + }, + } + + # Save + fname = f"D{decode_batch_size}_P{new_prefill_tokens}_rep{rep}.json" + out_path = output_dir / fname + out_path.write_text(json.dumps(result, indent=2)) + print(f" [rep {rep}] Done. penalty={tpot_penalty_p50:.1f}ms ratio={penalty_ratio:.2f}") + + +async def main(): + parser = argparse.ArgumentParser(description="Prefill-Decode Interference Microbenchmark") + parser.add_argument("--host", default="127.0.0.1") + parser.add_argument("--port", type=int, default=8000) + parser.add_argument("--model", default="Qwen3-Coder-30B-A3B-Instruct") + parser.add_argument("--decode-batch-sizes", default="0,1,2,4,6,8,12", + help="Comma-separated decode batch sizes") + parser.add_argument("--prefill-tokens", default="512,1024,2048,4096,8192,16384,32768", + help="Comma-separated prefill token counts") + parser.add_argument("--chunk-size", type=int, default=8192, + help="vLLM max_num_batched_tokens (effective chunk size)") + parser.add_argument("--reps", type=int, default=5) + parser.add_argument("--output-dir", default="results/interference") + args = parser.parse_args() + + decode_sizes = [int(x) for x in args.decode_batch_sizes.split(",")] + prefill_tokens = [int(x) for x in args.prefill_tokens.split(",")] + + output_dir = Path(args.output_dir) / f"chunk{args.chunk_size}" + output_dir.mkdir(parents=True, exist_ok=True) + + url = f"http://{args.host}:{args.port}/v1/chat/completions" + print(f"Target: {url}") + print(f"Model: {args.model}") + print(f"Chunk size: {args.chunk_size}") + print(f"Decode batch sizes: {decode_sizes}") + print(f"Prefill tokens: {prefill_tokens}") + print(f"Repetitions: {args.reps}") + print(f"Output: {output_dir}") + print() + + async with httpx.AsyncClient(timeout=httpx.Timeout(600.0)) as client: + # Sanity check: is the server up? + try: + resp = await client.get(f"http://{args.host}:{args.port}/v1/models") + resp.raise_for_status() + models = resp.json() + print(f"Server ready. Models: {[m['id'] for m in models.get('data', [])]}") + except Exception as e: + print(f"ERROR: Cannot reach server at {args.host}:{args.port}: {e}") + return + + total_configs = len(decode_sizes) * len(prefill_tokens) + done = 0 + + for D in decode_sizes: + for P in prefill_tokens: + done += 1 + print(f"\n[{done}/{total_configs}] D={D}, P={P}") + + for rep in range(args.reps): + try: + await run_single_config( + client, url, args.model, D, P, + args.chunk_size, rep, output_dir, + ) + except Exception as e: + print(f" [rep {rep}] ERROR: {e}") + + # Cooldown between reps + await asyncio.sleep(1.0) + + # Cooldown between configs + await asyncio.sleep(3.0) + + print("\n\nDone! Results in:", output_dir) + # Generate summary CSV + await generate_summary(output_dir, args.chunk_size) + + +async def generate_summary(output_dir: Path, chunk_size: int): + """Aggregate all per-run JSONs into a summary CSV.""" + import csv + + rows = [] + for f in sorted(output_dir.glob("D*_P*_rep*.json")): + data = json.loads(f.read_text()) + cfg = data["config"] + bl = data["baseline"] + itf = data["interference"] + drv = data["derived"] + rows.append({ + "chunk_size": cfg["chunk_size"], + "decode_batch_size": cfg["decode_batch_size"], + "new_prefill_tokens": cfg["new_prefill_tokens"], + "repetition": cfg["repetition"], + "tpot_baseline_p50_ms": bl["tpot_p50_ms"], + "tpot_baseline_p90_ms": bl["tpot_p90_ms"], + "tpot_during_prefill_p50_ms": itf["tpot_during_prefill_p50_ms"], + "tpot_during_prefill_p90_ms": itf["tpot_during_prefill_p90_ms"], + "tpot_after_prefill_p50_ms": itf["tpot_after_prefill_p50_ms"], + "prefill_ttft_ms": itf["prefill_ttft_ms"], + "num_tokens_during_prefill": itf["num_tokens_during_prefill"], + "tpot_penalty_p50_ms": drv["tpot_penalty_p50_ms"], + "tpot_penalty_ratio": drv["tpot_penalty_ratio"], + }) + + if not rows: + return + + csv_path = output_dir / "summary.csv" + with open(csv_path, "w", newline="") as f: + writer = csv.DictWriter(f, fieldnames=rows[0].keys()) + writer.writeheader() + writer.writerows(rows) + print(f"Summary CSV written: {csv_path}") + + +if __name__ == "__main__": + asyncio.run(main()) diff --git a/microbench/fresh_setup/mb1_launch.sh b/microbench/fresh_setup/mb1_launch.sh new file mode 100644 index 0000000..7248355 --- /dev/null +++ b/microbench/fresh_setup/mb1_launch.sh @@ -0,0 +1,80 @@ +#!/usr/bin/env bash +# Launch a SINGLE vLLM instance on dash1 for MB1 (prefill-decode interference). +# No kv_connector — MB1 measures intra-GPU phase interference, not transfer. +# chunked_prefill is enabled by default in vLLM 0.18.1 (this is the regime +# we want to characterize: how much benefit can PD-disagg buy on top of +# the existing chunked-prefill colocated baseline?). +# +# Usage: +# GPU=0 PORT=8000 CHUNK_TOKENS=8192 bash mb1_launch.sh start +# bash mb1_launch.sh status +# bash mb1_launch.sh stop + +set -eo pipefail + +FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh" +VENV="${FRESH_ROOT}/.venv" +MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}" +LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb1_logs}" + +GPU="${GPU:-0}" +PORT="${PORT:-8000}" +MASTER="${MASTER:-29500}" +# max_num_batched_tokens — controls the chunked-prefill chunk granularity. +# vLLM 0.18.1 default is 8192; we keep that as the headline run and +# optionally repeat at 32768 to expose the chunk-size effect. +CHUNK_TOKENS="${CHUNK_TOKENS:-8192}" + +mkdir -p "${LOGS_DIR}" + +stop_local() { + pkill -9 -f "vllm serve.*--port ${PORT} " 2>/dev/null || true + pkill -9 -f "EngineCore" 2>/dev/null || true + sleep 2 +} + +case "${1:-start}" in + stop) + stop_local; exit 0;; + status) + if curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; then + echo "port ${PORT}: UP" + else + echo "port ${PORT}: DOWN" + fi + exit 0;; + start) ;; + *) echo "Unknown command: $1"; exit 1;; +esac + +stop_local +source "${VENV}/bin/activate" + +echo "[mb1] launching: gpu=${GPU} port=${PORT} chunk_tokens=${CHUNK_TOKENS} (no kv_connector)" + +PYTHONHASHSEED=42 \ +CUDA_VISIBLE_DEVICES="${GPU}" \ +MASTER_PORT="${MASTER}" \ +nohup vllm serve "${MODEL}" \ + --host 0.0.0.0 --port "${PORT}" \ + --tensor-parallel-size 1 \ + --trust-remote-code --enable-prefix-caching \ + --dtype auto --gpu-memory-utilization 0.9 \ + --max-model-len 200000 \ + --max-num-batched-tokens "${CHUNK_TOKENS}" \ + --enable-prompt-tokens-details \ + > "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" 2>&1 & +disown + +echo "[mb1] waiting for /health on port ${PORT}..." +tries=0 +while ! curl -sf "http://127.0.0.1:${PORT}/health" >/dev/null 2>&1; do + tries=$((tries+1)) + if [ ${tries} -gt 180 ]; then + echo "[mb1] FATAL port ${PORT} did not come up in 6 min" + tail -40 "${LOGS_DIR}/vllm_gpu${GPU}_chunk${CHUNK_TOKENS}.log" || true + exit 1 + fi + sleep 2 +done +echo "[mb1] UP on $(hostname -s):${PORT} (GPU ${GPU}, chunk_tokens=${CHUNK_TOKENS})" diff --git a/microbench/fresh_setup/plot_mb1.py b/microbench/fresh_setup/plot_mb1.py new file mode 100644 index 0000000..488437e --- /dev/null +++ b/microbench/fresh_setup/plot_mb1.py @@ -0,0 +1,129 @@ +#!/usr/bin/env python3 +"""Plot MB1 interference results + the §3.2 cost-vs-benefit headline figure. + +Two outputs: + + mb1_interference.png + Effective TPOT during prefill vs prefill size, one line per D. + Log-log. Annotates typical agentic decode duration (~100 ms) as a + horizontal band so reader can spot when decode would be stalled. + + pd_cost_vs_benefit.png + The §3.2 headline. X axis: KV size (MiB). Two stacked curves: + - benefit ceiling (MB1) — at most one decode-duration per request + of phase isolation can be recovered. Drawn as a flat 100 ms line. + - cost (MB2) — Mooncake pure_transfer p50 at that size. + Anywhere the cost curve sits ABOVE the benefit ceiling, PD-disagg + structurally loses. +""" +from __future__ import annotations + +import argparse +import json +from pathlib import Path + +import matplotlib +matplotlib.use("Agg") +import matplotlib.pyplot as plt +import numpy as np + + +def main() -> None: + p = argparse.ArgumentParser() + p.add_argument("--mb1", type=Path, required=True) + p.add_argument("--mb2-intra", type=Path, required=True) + p.add_argument("--mb2-inter", type=Path, default=None) + p.add_argument("--out-interf", type=Path, default=Path("figs/mb1_interference.png")) + p.add_argument("--out-cb", type=Path, default=Path("figs/pd_cost_vs_benefit.png")) + args = p.parse_args() + + mb1 = json.loads(args.mb1.read_text())["summary"] + + # ---- mb1_interference.png ---- + fig, ax = plt.subplots(figsize=(9, 5.5)) + Ds = sorted({s["decode_batch_size"] for s in mb1}) + colors = {1: "#1f77b4", 4: "#ff7f0e", 8: "#d62728"} + for D in Ds: + rows = [s for s in mb1 if s["decode_batch_size"] == D] + rows.sort(key=lambda s: s["new_prefill_tokens"]) + xs = [s["new_prefill_tokens"] for s in rows] + ys = [s["effective_tpot_during_ms"] for s in rows] + ax.plot(xs, ys, "o-", lw=2, markersize=7, + color=colors.get(D, "gray"), + label=f"D={D} (baseline {rows[0]['baseline_tpot_ms']:.1f} ms)") + + for tdec, lbl in [(50, "tool-call decode (~50 ms)"), + (100, "agentic decode (~100 ms)"), + (200, "long agentic decode (~200 ms)")]: + ax.axhline(tdec, color="#444", lw=0.6, ls=":", alpha=0.6) + ax.text(2200, tdec * 1.1, lbl, fontsize=8, color="#444") + + ax.set_xscale("log"); ax.set_yscale("log") + ax.set_xlabel("Prefill burst size (tokens, log)") + ax.set_ylabel("Per-stream effective TPOT during prefill burst (ms, log)") + ax.set_title("MB1: each ongoing decode is essentially halted while prefill runs\n" + "(chunked-prefill ON, vLLM 0.18.1 default, single H20)") + ax.grid(True, which="both", alpha=0.3) + ax.legend(loc="upper left", fontsize=9) + args.out_interf.parent.mkdir(parents=True, exist_ok=True) + fig.tight_layout(); fig.savefig(args.out_interf, dpi=150); plt.close(fig) + print(f"wrote {args.out_interf}") + + # ---- pd_cost_vs_benefit.png ---- + mb2_intra = json.loads(args.mb2_intra.read_text())["summary"] + mb2_intra = [s for s in mb2_intra if s["input_tokens"] >= 64] + intra_x_mib = [s["kv_mib"] for s in mb2_intra] + intra_y_ms = [s["pure_transfer_ms_p50"] for s in mb2_intra] + + fig, ax = plt.subplots(figsize=(9, 5.5)) + ax.plot(intra_x_mib, intra_y_ms, "o-", color="#d62728", lw=2.4, + markersize=8, label="MB2 PD-disagg KV transfer cost (Mooncake, p50)") + if args.mb2_inter: + mb2_inter = json.loads(args.mb2_inter.read_text())["summary"] + mb2_inter = [s for s in mb2_inter if s["input_tokens"] >= 64] + inter_x = [s["kv_mib"] for s in mb2_inter] + inter_y = [s["pure_transfer_ms_p50"] for s in mb2_inter] + ax.plot(inter_x, inter_y, "s--", color="#7a1d1d", lw=2, markersize=7, + alpha=0.7, label="MB2 inter-node (same numbers)") + + # Benefit ceiling: typical agentic decode duration (PD-disagg max savings). + ax.axhline(100, color="#2ca02c", lw=2.4, ls="-", + label="MB1 max benefit ≤ agentic decode (~100 ms)") + ax.axhspan(50, 200, alpha=0.15, color="#2ca02c", + label="benefit range (50–200 ms decode)") + + # Mark agentic-tail request sizes + for kv_mib, lbl in [(192, "trace mean\n(~2k tok)"), + (3072, "p90\n(~33k tok)"), + (6144, "p95\n(~65k tok)"), + (11500, "p99\n(11.5 GiB)")]: + ax.axvline(kv_mib, color="#666", lw=0.5, ls=":", alpha=0.5) + ax.text(kv_mib, 2, lbl, fontsize=8, color="#444", + ha="center", va="bottom") + + ax.set_xscale("log"); ax.set_yscale("log") + ax.set_xlim(40, 14000) + ax.set_ylim(1, 12000) + ax.set_xlabel("Per-request KV size (MiB, log)") + ax.set_ylabel("Time per request (ms, log)") + ax.set_title("§3.2 headline — PD-disagg KV transfer cost vs phase-isolation benefit\n" + "(both measured on vanilla vLLM 0.18.1 + Mooncake 0.3.11, agentic regime)") + ax.grid(True, which="both", alpha=0.3) + ax.legend(loc="upper left", fontsize=9) + + # Add explanatory annotation + ax.text(10000, 5000, + "Cost > benefit for ANY KV size above\n" + "the green band (~80 MiB / ~830 tokens).\n" + "Below: cost is marginal (<10 ms) but\n" + "benefit is also small (decode is short).", + fontsize=9, color="#333", + ha="right", va="top", + bbox=dict(boxstyle="round,pad=0.4", facecolor="#fffacd", alpha=0.9, edgecolor="#888")) + + fig.tight_layout(); fig.savefig(args.out_cb, dpi=150); plt.close(fig) + print(f"wrote {args.out_cb}") + + +if __name__ == "__main__": + main()