agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	54e1f5266a	MB5 PD ablation v2 results: concurrency axis + reuse 3-way writeup - fig3_conc32k.json + fig3_concurrency_axis.png: agentic-corner concurrency sweep (in=32768, reuse=0.984, out=128), N 8->128, PD capped 600s / colo uncapped. colo completes 100% at every N (graceful, E2E 2.4s->81s); every static PD split collapses, earlier as N rises (viable only N<=16; <27% by N=32). - analysis/mb5_pd_ablation/README.md: self-contained v2 writeup. Reuse axis 3-way (A=d2048/o256, C=d2048/o128, B=d1024/o128) decomposes shape: output ~negligible, delta (real prefill/turn) dominant; crossover to colo at reuse ~90-95% robust. Run on dash2 (dash0 NICs faulted for Mooncake). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 09:35:25 +08:00
Gahow Wang	9c105cf05a	MB5 PD ablation: controlled-variable reuse/conc redo + campaign tooling Reuse and concurrency axes redone with proper controlled variables, plus the orchestration used to run them on dash0: - run_reuse_fixed.sh: hold REAL prefill work (delta) constant, vary only cached prefix -> reuse = C/(C+U). Supersedes old fig1 (which held input=8192 and sliced prefix out, confounding "more reuse" with "less prefill"). - run_conc.sh: agentic-corner config (in=32768, delta=512, reuse=0.984, out=128) that exposes PD's structural KV-transfer tax. Supersedes old fig3. - run_campaign{,2,3}.sh, backfill_d2048o128.sh: serial campaign drivers (strictly one driver at a time), out=128 sweeps, PD wall-cap for collapse-draining high-reuse arms, and flaked-arm backfill. - mb5_run_gpu.sh: per-config bring-up / replay / teardown orchestrator. - plot_pd_crossover.py: render the reuse_compare figures from fig_agg dumps. - fig_agg.py: tolerate null stats from fully-collapsed arms (0 successes write the stat keys as null; `dict.get(k, {})` returns null, not {}). Data: fig1_reuse_fixed.json, fig1_reuse_d{1024,2048}_o128.json Figs: reuse_compare_AB.png, reuse_compare_ABC.png Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-06-01 01:03:27 +08:00
Gahow Wang	32f7f55990	v2: linear (default cache-aware) baseline + 2x wall-cap on first600s Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware load + sticky session affinity, the cache_aware_proxy default) and cap each PD-disagg arm at 2x the colo bench wall (SIGTERM bench.sh once cap is exceeded; the cleanup trap clears vLLM and proxy; capped runs lack metrics.summary.json so the analysis script computes from raw metrics.jsonl). Headline: the success-rate ceiling is policy-invariant. arm linear (capped at 2x) lmetric (uncapped) colo 807/807 = 100%, 964s 807/807 = 100%, 1021s pd6 (6:2) 472/807 = 58%, 2280s ⊗ 474/807 = 59%, 3325s pd4 (4:4) 349/807 = 43%, 2281s ⊗ 348/807 = 43%, 6850s pd2 (2:6) 176/807 = 22%, 2280s ⊗ 180/807 = 22%, 19275s Routing affects only how much wall is wasted timing out unreachable requests at 600s each. Linear hits the same ceiling in 2280s as LMetric does in 3300-19000s. This strengthens the §5 D-pool capacity-ceiling thesis -- the cap is structural, not a routing artifact. Artifacts: analysis/v2/fig4r_linear.json -- 4-arm linear summary analysis/v2/PD_DISAGG_LMETRIC.md -- extended with wall-cap section figs/v2/fig4_linear_vs_lmetric.png -- 3-panel side-by-side comparison microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py	2026-06-01 00:55:40 +08:00
Gahow Wang	7529284cee	v2: LMetric PD-colo vs PD-disagg on the real agentic trace Anchor experiment for the clean-stack PD comparison using the canonical cache-aware proxy with --policy lmetric (scripts/bench.sh harness). Two traces x four arms = eight runs on dash1. Headline: with the right routing baseline (LMetric), PD-colo holds 100% completion on both traces while every static PD-disagg ratio fails (14-65% completion), and the failure mode rotates with the split -- no static partition has a working operating point on this workload. LMetric improves colo dramatically (TTFT p50 1.0s vs original §3 RR 7.0s; 7x) but does NOT rescue PD-disagg, confirming the bottleneck is structural (D-pool admission + multi-turn KV accumulation), not routing. Completion matrix: first600s full colo 100% 100% pd6 (6:2) 58.7% 65.3% (decode-bound) pd4 (4:4) 43.1% 43.9% (both bottlenecks) pd2 (2:6) 22.3% 13.9% (prefill-bound) The original §3 RR "100% PD completion" appears to be a measurement artifact of `e13391e`: producer-KV eviction acted as a relief valve, letting more requests squeeze under the 600s timeout at the (uncosted) price of cross-turn re-prefill. With the eviction off, PD-disagg is worse than §3 advertised, not better. Artifacts: analysis/v2/fig4l_lmetric.json -- 8-arm summary data analysis/v2/PD_DISAGG_LMETRIC.md -- writeup + reproduce recipe figs/v2/fig4_lmetric_pd_vs_colo.png -- 4-panel comparison figure microbench/fresh_setup/plot_fig4l_lmetric.py -- plot script	2026-05-31 20:15:10 +08:00
Gahow Wang	fafc44da79	MB5 PD reuse-centric ablation: tooling, data, Fig 1-3 Three-axis controlled ablation of PD-colo vs PD-disagg on synthetic regular traces (closed-loop, controlled reuse via REPLAY_NO_REALIZED_PREFIX) on the clean stack (`e13391e` gated off). Axis 1 (Fig 1) -- reuse 6%->94% at N=8, in8192/out256 Axis 2 (Fig 2) -- shape in2048/out2048 -> in32768/out64 at N=8, reuse~70% Axis 3 (Fig 3) -- concurrency N=8/16/32/64 at reuse~71%, in8192/out256 Findings: * APC parity colo=PD at every reuse (5.5/22/44/66/77/82%) -- contamination fix validated. * PD edge erodes 1.57x->1.10x with reuse; prefill GPUs strand 26%->9%. * Shape: PD-best peaks mid-sweep (1.34x at in8192/out512); wrong PD ratio catastrophic at prefill extreme (in32768/out64 pd2 = 378/400, p99 432s). * Concurrency: PD wins N<=32 (1.23-1.29x), TIPS at N=64 -- pd2/pd4 crater (APC 71%->1.4%, TPS -30%) while colo scales cleanly. Infrastructure: * replayer: --max-inflight-sessions, --inter-turn-think, --no-realized-prefix (env-defaulted via REPLAY_MAX_INFLIGHT, REPLAY_INTER_TURN_THINK_S, REPLAY_NO_REALIZED_PREFIX). * mb5_run.sh: writes bench_config.json + gpu_util.csv + run_window.json + instance_apc.txt + metrics.jsonl for bench_report/fig_agg ingest. * fig_agg.py: per-arm GPU role split + producer-side APC; --json mode. * gpu_util_report.py: companion per-GPU util report from gpu_util.csv. * partial_summary.py: stats from in-flight replay_metrics.jsonl (works before metrics.summary.json exists). Data: analysis/mb5_pd_ablation/fig{1,2,3}.json (24 + 20 + 16 rows). Figures: figs/mb5_pd_ablation/fig{1_reuse,2_shape,3_concurrency}_axis.png.	2026-05-31 20:14:46 +08:00
Gahow Wang	a2111b6e18	PD-disagg docs: annotated corrections for `e13391e` contamination Adds dated, non-destructive correction notes to the contaminated PD-vs-colo artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on `finished_sending`, deployed over the "fresh" pip vLLM by `scripts/deploy_vllm_patches.sh`) was found and gated behind `VLLM_EVICT_SENT_BLOCKS` (default off). PD_DISAGG_RESULTS.md top CORRECTION banner + §6 RETRACTED marker. §6 (session-affinity hot-pin) was an `e13391e` artifact under controlled concurrency; §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash stand. RESULTS_SUMMARY.md §4 confirm+refine note: clean ablation confirms the D-pool capacity thesis and adds regime-dependence. pd_separation_analysis.md scoped caution: thesis confirmed; flags only reuse-dependent figures for cross-check (this study used a different stack). figs/mb5/CORRECTION.md flags mb5_producer_hotspot.png as retracted; §3 RR and §5 D-pool figures stand.	2026-05-31 20:14:14 +08:00
Gahow Wang	8d422c4301	Migration trigger validation: unified_v4 fires at 2x QPS, not at 1x Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate, w600_r0.0015_st30_first600s trace). Key findings: - At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for 95% of routing decisions because instances complete prefill before the next request arrives. The relative arm (src_pp > fleet_median*1.5) never fires. - At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of eligible decisions. Trigger correctly identifies genuinely overloaded instances (src_pp 13k–73k vs fleet median 3.8k–33k). Conclusion: mechanism is correct but migration benefit requires higher concurrency (scale-out or >3x QPS) where queue pressure makes the signal non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing) is sufficient and Pillar 2 gracefully degrades to no-op. Next: scale-out validation (16+ GPU) where session skew naturally concentrates load and triggers migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:36:58 +08:00
Gahow Wang	cf812b6264	Workload characterization C1-C3 on full production trace Joint/temporal characterizations of the full 051315 cluster trace (2.11M req / 1.31M sessions / 2h), beyond the existing single-variable marginals: - C1 mixture: 90.3% sessions single-turn, but multi-turn (9.7%) = 44% reqs / 67% prefill mass; continuation hazard rises 10%->94% (Lindy); heaviness unpredictable at turn 1 (corr 0.04-0.15) => reactive routing justified. - C2 resident/delta: resident context 11k->56k while new-prefill 2.7k->~200; per-turn reuse ->99.6%; resident/delta ("PD tax") ->~250-450x. - C3 prefill/decode: token mass 98.7% input / 1.3% output, BUT decode ~70% of TIME (robust 68-71%); "decode negligible" is wrong (tokens != time). Correct colo argument = roofline complementarity, not "no decode". Maps each to (1) PD-colocation and (2) routing. compute_chars.py + chars.json + figs/workload_chars/. Raw-file exact validation (cached_tokens, real timings) pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:39 +08:00
Gahow Wang	847f52f03b	PD-disagg crossover: regular synthetic trace + goodput sweep + figure gen_synthetic_trace.py --mode regular: maximally-regular multi-turn trace (fixed prefix/delta/turns, constant arrivals, zero session skew) to isolate the structural PD cost (per-turn full-context transfer + P/D capacity split) from the skew/hot-pin artifact. analysis/crossover/: SLO-goodput PD_advantage sweeps bracketing the prefill<->decode bottleneck axis (D1 grow input -> prefill-bound; D2 grow output -> decode-bound). figs/crossover_pd_advantage.png shows the crossover (y=1) with the agentic operating region annotated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:23 +08:00
Gahow Wang	a0db3cbe77	Add leastwork_kappa decode-aware ablation (net-negative, documented) --policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok / HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 + kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax on a new prefill. Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%, E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted. Decode is too cheap in agentic (output p50~80) for the term to help; it just bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not decode interference. Kept in-tree as a documented ablation justifying LPWL's omission of any decode term; do not revive without a decode-heavy regime. See analysis/lpwl_5policy_600s.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 17:07:23 +08:00
Gahow Wang	71b0747b3b	600s-truncated trace + LPWL 5-policy results traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600 trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder, lower-locality regime; whitelisted alongside the parent anonymized trace. analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B was tuned on full w600 yet is beaten by the knob-free policy on this regime. Includes the run_5policy_600s.sh repro driver. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:35 +08:00
Gahow Wang	a2f2645fda	PD_DISAGG_RESULTS §6.3: producer hot-pinning figure Direct per-producer KV-pool evidence for the session-affinity backfire. At the same 4P+4D ratio: - round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01) - session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25) A 25x jump in producer load imbalance — heavy multi-turn sessions concentrate onto single producers, the same hot-pinning pathology as sticky routing in the colocated §3.3 study. plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs session comparison) — same two-stage pattern as aggregate_mb5.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:38:20 +08:00
Gahow Wang	dae98c6472	Working-set sizing tool + GLM-5.1-FP8/B300 result Configurable KV working-set analyzer (GPU model x TP/PP/EP x model config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T), oracle [first,last], and retain-forever footprints vs a per-replica KV pool, plus the APC captured at each retention window. GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool): live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs ~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:03:25 +08:00
Gahow Wang	8596135680	MB5 analysis: per-role KV split proves static-partition mismatch aggregate_mb5.py: - Split the cluster KV timeline by role (P-pool vs D-pool) using a PID->role map parsed from vllm_logs filenames. The cluster average hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool is actually pegged at ~100% while prefill idles at ~30%. - Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced (matplotlib) renders locally. matplotlib import is now lazy. - New plot_role_split figure + p/d peak/steady columns in the CSV. PD_DISAGG_RESULTS.md: consolidated writeup with figures inline. Verdict: no static P:D ratio beats 8C colocation. The binding constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D, P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner, the MB1 phase-isolation benefit is real) but loses TTFT and sheds load. Round-robin P routing also zeroes prefix-cache reuse; a session-affinity re-run of 6P+2D is in flight to test the fix. Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization, mb5_latency_compare + mb5_summary.csv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 12:05:17 +08:00
Gahow Wang	da39ab6804	Correct PD-disagg cost/benefit framing across repo The §3.2 cost-vs-benefit math in commits `029821c` (MB1 plot + pd_cost_vs_benefit.png) and `abde010` (RESULTS_SUMMARY.md) was wrong. What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)" — implicitly treating the benefit as per-request and bounded by that request's own decode. The correct accounting is per-prefill-event across all stalled streams: benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during) ≈ D × T_prefill which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill). Plug MB1 + MB2 numbers in: prefill size \| T_prefill \| T_transfer \| D=8 benefit \| cost/benefit 2k tok \| 0.14 s \| 8 ms \| 1.1 s \| 0.7 % 33k tok \| 4.5 s \| 320 ms \| 36 s \| 0.9 % 125k tok \| 57 s \| 1.9 s \| 456 s \| 0.4 % On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the opposite of what the deleted figure showed. The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) — p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency. 
Changes (all touched files explicitly listed; no `git add -u`): - figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math) - microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode) - figs/mb1_interference.png : regenerated, no misleading band annotation - analysis/mb1/README.md : Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong - analysis/mb2/README.md : Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4 - RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim — they already (correctly) frame §3.2 around the D-side KV memory wall. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:04:49 +08:00
Gahow Wang	029821c1b6	MB1: prefill-decode interference under chunked-prefill default; §3.2 headline Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on, no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps. Method recap (driver: microbench/interference/driver.py, repurposed): - Pin D streaming decode requests at constant max_tokens - Inject one prefill-only request (max_tokens=1) of varying input length - Bin decode-stream token timestamps into "during prefill" vs baseline - Headline metric: effective per-stream TPOT during the prefill burst, = prefill_ttft / (num_tokens_during_prefill / D). This is the average rate at which each decode stream produces tokens during the burst. p50 of inter-token intervals is deceptive (chunked-prefill makes most intervals look normal); the burst-average gives the true cost. Results (D=8 row, the most agentic-realistic case): P (tokens) \| prefill_ttft \| per-stream TPOT during \| penalty 2048 \| 143 ms \| 32 ms \| 4× 8192 \| 583 ms \| 114 ms \| 15× 32768 \| 4520 ms \| 388 ms \| 52× 65536 \| 15615 ms \| 757 ms \| 99× 131072 \| 56991 ms \| 1419 ms \| 183× Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst each ongoing decode is running ~183× slower (i.e. essentially halted) for ~57 seconds. §3.2 implication: PD-disagg's promised phase-isolation benefit per agentic request is bounded by the decode duration, which is 50–200 ms for tool-call output. MB2 says the KV-transfer cost of PD-disagg is 300 ms – 10 s for agentic-size requests. Cost > benefit for every KV size above ~80 MiB (well below trace mean 192 MiB). The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling (50–200 ms band, capped by decode) onto MB2 transfer cost curve and marks the agentic-distribution waypoints (trace mean, p90, p95, p99) on the x-axis. Across the entire agentic distribution, the cost curve sits above the benefit band. 
Adds: - microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192) - microbench/fresh_setup/mb1_driver.py: copy of the existing microbench/interference/driver.py for cpfs deployment - microbench/fresh_setup/analyze_mb1.py: aggregator emitting per-(D, P) effective-TPOT-during + max PD-disagg-benefit table - microbench/fresh_setup/plot_mb1.py: mb1 standalone + pd_cost_vs_benefit headline figure - analysis/mb1/summary.csv: 45 raw rows from the sweep - analysis/mb1/breakdown.json: per-(D, P) aggregate - analysis/mb1/README.md: persistent doc - figs/mb1_interference.png: effective TPOT during prefill, one line per D - figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere) Caveats noted in README: - chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would interleave decode more aggressively. Chunk-size sensitivity is flagged as next run. - D ≤ 8; higher D may saturate or shrink the penalty further. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:25:09 +08:00
Gahow Wang	90127c3389	MB2 inter-node: dash1↔dash2 transfer cost is identical to intra-node Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE. remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep config as the 2026-05-27 intra-node run. Per-size pure_transfer (p50) lines up within 1–3% of the intra-node numbers across all sizes: size intra p50 inter p50 512 tok 5.3 ms 5.2 ms 2048 tok 20.6 20.0 8192 tok 83.7 80.9 32k tok 320.9 309.6 64k tok 1895 1734 (bimodal in both) 128k tok 2835 2818 (bimodal in both) => Mooncake's batch_transfer_sync_write does not use NVLink for intra-node peers; both paths go through the 200 Gbps RDMA NIC, and that NIC (not the GPU interconnect) is the bottleneck. The ~9.7 GB/s steady-state ceiling and the 6+ GiB variance regime are identical across topologies. Operational implication for §3.2: PD-disaggregation does not get cheaper by co-locating P and D on the same node; every routed request pays the same ~10 GB/s ceiling for KV transfer, no matter where it lands. The transfer cost cannot be reduced by changing topology. Caveat: B's receive_kv events did not log on dash2: the `MB2_LOG_DIR` env var did not propagate through vLLM's EngineCore subprocess on the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2 for that var, but the producer host on dash1 worked). For this run pure_transfer numbers are from A's send_blocks alone; full rx_total breakdown is not available, but pure_transfer is the dominant term.
Adds: - analyze_mb2_send_only.py — analyzer that works from A's send_blocks alone when B's receive_kv events are absent - plot_mb2_compare.py — overlay intra vs inter on the same axes - plot_mb2.py — tolerate the `rows`-less send-only schema - figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve - figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay - analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json, inter_kvboth_breakdown.json - analysis/mb2/README.md — Summary block updated to reference both paths, dated 2026-05-27 run-log entry appended with the full table and the topology-independence framing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 20:56:08 +08:00
Gahow Wang	3f791ee074	MB2 doc: analysis/mb2/README.md as persistent record Lifts the MB2 intra-node results out of commit messages into a single place the paper can cite. Structure: Summary — one-line table + headline numbers for §3.2 Setup — exact hardware/software/config Method — 3-step bench, instrumentation, pair-by-time-window Results — full per-size table (latest run dated) Known limitations — kv_both vs strict, serial-only, intra-only, sanity preamble in the logs §3.2 implications — transfer/decode ratio table at agentic sizes Open questions / next runs — inter-node, bandwidth-ceiling investigation, concurrent transfers, strict kv_producer/consumer Reproduction — exact commands Run log — dated entries; new runs append here The latest "intra-node" entry references `de164e5` for the raw artifacts + figures. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 20:23:50 +08:00
Gahow Wang	de164e5a64	MB2: pure KV-transfer cost on dash1 intra-node — Mooncake ~9.7 GB/s steady Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 + mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with B's receive_kv enter/finish by time window). Steady-state (1k..32k tokens, 96 MiB..3 GiB KV): pure_transfer ≈ size / 9.7 GB/s rx_overhead ≈ 2–3 ms (ZMQ handshake + P-side setup) bandwidth ≈ 9.6–10.1 GB/s, very stable Large-size regime (65k..131k tokens, 6..12 GiB): p50 bandwidth collapses to 3.4–4.5 GB/s max bandwidth still hits ~9.7 GB/s (some runs achieve it) p99 agentic request (11.5 GiB) lands here Implication for §3.2 PD-disaggregation cost argument: median agentic decode = 50–200 ms (tool-call JSON output) median agentic-tail KV transfer (p99 11.5 GiB): best case (9.7 GB/s) ≈ 1.19 s observed range 1.5 – 10 s ⇒ KV transfer is 8–100× larger than the decode it enables. This is intra-node — the lower-bound transfer cost. Inter-node RDMA will be slower; that's MB2 phase 2. Adds: - analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window; per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max) - plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart - analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events (51 + 102 events including the sanity preamble) - analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated - figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 19:04:03 +08:00
Gahow Wang	876d09db83	Add chatbot T_external CDF; overlay on f3a vs agentic User-requested comparison of inter-turn external gap distribution between the production agentic trace (Qwen3-Coder) and a production chatbot trace (qwen3-max chat). Both computed as T_external = next_turn.start_ms - prev_turn.end_ms on the same kind of pipeline (raw input + raw output join on request_id, session structure from the formatted trace's parent_chat_id chains). The chatbot trace lives as two files on dash0: input : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl The raw input has no session_id (uuid is per-record, user_id has only 4 distinct tenant values for 346 k requests). We recover session structure from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which groups requests by parent_chat_id), matching each formatted record to a raw record by (timestamp, output_length) — prompt_token_num is anonymized to 0 in this trace, so we use generate_token_num as the join key. End time is derived from time_to_finish_token (ms duration) not the "time" string field (which is the log-write time, not request completion). Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions): p25 4.85 s p50 7.18 s p75 8.22 s p90 15.0 s p99 43 s 4% gaps < 1 s 29% < 5 s 78% < 10 s 98% < 30 s Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py): p25 0.69 s p50 1.6 s p75 8.6 s p90 44 s p99 738 s 39% gaps < 1 s 67% < 5 s 77% < 10 s 87% < 30 s Distributions differ in shape, not just location: - Chatbot is tight, unimodal around 5–10 s (human interaction). - Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s) plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where the operator steps away. - The sub-second tool-call mass is where dispatch coupling lives — those turns have W_turn ≫ T_external for any current scheduler. 
The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically. The right framing for §2.3 is "agentic has a sub-second tool-call mode that chatbot doesn't", not "chatbot has think-time and agentic doesn't". Adds: - scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator (raw input/output join + formatted alignment by ts + output_length) - analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache - scripts/plot_inter_turn_gap.py: overlays both curves on log-x Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 14:49:44 +08:00
Gahow Wang	41232f49d3	Measure inter-turn T_external on the raw production trace; add f3a CDF The earlier conversation suggested agentic might "have no human think-time" and therefore live in a strict closed-loop regime. The user pushed back: tool calls also take time and might restore a chatbot-like buffer between turns. To resolve this, we go to the actual data. The previously-published per-record formatted trace only carries arrival timestamps, so an arrival-to-arrival diff conflates W_turn + T_external. The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms, which lets us compute the pure inter-turn external gap T_external = next.request_ready_time_ms - prev.request_end_time_ms for each session's consecutive turn pair. Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions): p25 = 0.69 s p50 = 1.6 s p75 = 8.6 s p90 = 44 s mean = 37 s (heavy long-tail; paused/abandoned sessions) 39 % of gaps < 1 s 67 % of gaps < 5 s 87 % of gaps < 30 s The bulk of the distribution is dominated by sub-second to a-few-seconds tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 = 7.3 s, lmetric 15.7 s), W_turn is already near or above the 75th percentile of T_external, so dispatch coupling is the dominant regime for the majority of turns, not a corner case. This corrects the earlier conflated arrival-to-arrival "median gap 11 s" figure (which folded W_turn into T_external). The true T_external median is 1.6 s.
Adds: - scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator - analysis/characterization/data/agentic_inter_turn_gap.json: 500-point CDF cache + summary stats, scp'd back from dash0 - scripts/plot_inter_turn_gap.py: local figure renderer - figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and unified/lmetric TTFT p90 reference lines Next step (per user): pull a chatbot trace through the same pipeline and compare distributions side by side; this will let §2.3 stop hand-waving about "no think-time" and instead present the regime split empirically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 12:37:32 +08:00
Gahow Wang	555cabcf1f	f2c: switch to per-instance decode-concurrency view; correct KV pool ceiling Old f2c plotted per-request KV footprint MiB against an "H20 ~95 GiB usable" reference line. That ceiling was wrong — a 30B-A3B bf16 deployment burns roughly: ~50% HBM for model params (~48 GiB on 96 GiB H20) ~10% for runtime activation buffers ~40% left for the KV cache pool (~38.4 GiB) so 95 GiB was overstating the available pool by 2.5×. New f2c reframes the same data into the answer that actually motivates the paper: how many concurrent decodes does a single instance hold, and how does PD-disagg change that? Grouped bars per percentile show system-wide concurrent decode capacity for three 8-GPU deployments: Combined 8C, PD-disagg 4P+4D (N_D=4), PD-disagg 6P+2D (N_D=2) Key reads off the figure: p50 (1.8 GiB/req): 20 fit/inst → 160 / 80 / 40 system-wide p90 (8.0 GiB/req): 4 fit/inst → 32 / 16 / 8 p95 (9.6 GiB/req): 4 fit/inst → 32 / 16 / 8 p99 (11.5 GiB/req): 3 fit/inst → 24 / 12 / 6 PD-disagg 4P+4D literally halves the decode population at the same per-request KV pressure — this is the concrete §3.2 "KV memory wall" penalty stated in terms users care about (concurrency). - analysis/characterization/render_window1_figures.py: fig_kv_footprint_cdf rewritten; reads same kv_footprint_summary.json but computes floor(KV_pool / req_size) × N_D and annotates the per-instance fit count below each percentile group. - figs/f2c_kv_footprint_cdf.png: regenerated. - MEETING.md / PAPER_OUTLINE.md §2.1, §2.4: prose updated with the new ceiling and the "3 p99 decodes per instance / halved by PD-disagg" framing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 11:28:47 +08:00
Gahow Wang	922d79ac95	Add full latency grid (mean/p50/p90/p99 × TTFT/TPOT/E2E) as f6 companion The headline f6_e2e_latency_bars only shows p90, hiding three regimes: - mean: unified dominates (3.3s TTFT, 7.0s E2E vs sticky 5.6s / 12.1s) - p50: sticky and unified are tied on first-turn TTFT (0.5s each) — sticky's first turn of each session is free, after which queues accumulate. Unified beats sticky everywhere else. - p99: tail amplification reveals unified's biggest gap — TTFT 42.3s vs sticky 74.1s; E2E 68.8s vs sticky 139.7s. The 12-panel figure is the honest full picture; the 3-panel headline stays for slide-friendly summary. - analysis/characterization/window_1_results/raw_stats/{policy}.json: cached ttft/tpot/e2e {mean,p50,p90,p99} pulled from dash0 /home/admin/cpfs/wjh/agentic-kv/outputs/b3_sweep_20260525_095043/ (b3_policy_comparison.json doesn't record mean, only percentiles). - analysis/characterization/render_window1_figures.py: new fig_b3_latency_full_grid renders the 4×3 grid from the cache. - figs/f6_e2e_latency_full_grid.png: 12-panel companion. - PAPER_OUTLINE.md §5.2: both figures embedded; main table column renamed from "Hotspot idx" to "Worker p90 (median / max)" to match the new metric convention. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 11:15:18 +08:00
Gahow Wang	5e6e98aee7	Replace max/median hotspot index with (median, max) absolute pair The max/median ratio inverts the actual user-facing p90 ranking: sticky: hotspot=2.73 but system e2e p90 = 34.6s (worst) unified: hotspot=3.67 but system e2e p90 = 18.0s (best) because sticky's median is also high (everyone slow) while unified concentrates the damage on one worker and keeps the other 7 fast. Any "imbalance" metric structurally punishes the affinity-then-escape schemes that we actually want to advocate for. Changes: - analysis/characterization/render_window1_figures.py: fig_b3_per_worker_ttft now annotates each subplot with "median X.Xs · max Y.Ys" instead of "hotspot=Y.YY"; docstring documents why we drop the ratio. - figs/f4c_per_worker_ttft.png: regenerated with new titles. - figs/f4c_apc_vs_hotspot_tradeoff.png: deleted. The scatter's y-axis was the deprecated ratio; superseded by f4c per-worker bars + f6 e2e bars which together carry the same information honestly. - PAPER_OUTLINE.md: C3, §3.3, §4.1 wording, §5 metric list, §8 conclusion — replace "hotspot index" mentions with "worst-worker p90" or "(median, max) worker p90"; promote the §3.3 methodology note to a top-level sub-finding ("hot pin failure must be measured with per-worker absolute latency, not normalized ratio"). - MEETING.md: §3.3 narrative reworded to lead with the (median, max) pair directly; explicit one-line note on why the ratio is dropped. Conceptual uses of "hot session" / "hot instance" / "hot pin" remain unchanged — only the metric called hotspot index is retired. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 11:07:12 +08:00
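The inversion described in the commit above can be reproduced with a small sketch. The per-worker TTFT p90 values below are invented for illustration and are not the measured B3 numbers; the function names are hypothetical. The point is only that a max/median ratio can rank an "everyone slow" policy ahead of one that keeps seven workers fast and concentrates the damage on one.

```python
from statistics import median

# Synthetic per-worker TTFT p90 values in seconds (8 workers each), NOT measured data.
sticky_like = [10, 11, 12, 12, 13, 14, 15, 30]   # every worker is slow
unified_like = [2, 2, 2, 3, 3, 3, 3, 11]         # one hot worker, seven fast ones


def hotspot_ratio(per_worker_p90):
    """The retired metric: max worker p90 divided by median worker p90."""
    return max(per_worker_p90) / median(per_worker_p90)


def median_max_pair(per_worker_p90):
    """The replacement: report the absolute (median, max) pair directly."""
    return median(per_worker_p90), max(per_worker_p90)


for name, workers in (("sticky_like", sticky_like), ("unified_like", unified_like)):
    med, worst = median_max_pair(workers)
    print(f"{name}: ratio={hotspot_ratio(workers):.2f} median={med:.1f}s max={worst:.1f}s")
```

On this synthetic input the ratio is higher for `unified_like` even though both its median and its worst worker are faster in absolute terms, which is the reason given for reporting the pair instead.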
Gahow Wang	09ff1069c3	Drop 'capped' from per-policy figures (f4a, f4c×2, f6) 'capped' is not a routing policy — it's lmetric run on a separately truncated trace (sessions capped to 8 turns via build_capped_trace.py). Putting it alongside lmetric/load_only/sticky/unified in per-policy comparison figures is misleading because the workload differs, not the routing decision. Comparing apples to a different-trace orange inflates/deflates apparent policy gaps for the wrong reasons. Regenerated 4 figures with --exclude-policies capped on analysis/characterization/render_window1_figures.py: - f4a_apc_loss.png (APC bars) - f4c_apc_vs_hotspot_tradeoff.png (APC vs hotspot scatter) - f4c_per_worker_ttft.png (per-worker TTFT panel) - f6_e2e_latency_bars.png (TTFT/TPOT/E2E bars) Added --exclude-policies CLI flag to the renderer so this is a reversible choice, not a permanent script mutation. capped data remains in b3_policy_comparison.json and can be brought back in workload- sensitivity sections (where it actually belongs) by omitting the flag. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:57:43 +08:00
Gahow Wang	1220da249c	f2b: regenerate CDF from production trace (1.3M sessions on dash0) Pulls 456 (rank%, cum%) sample points from the raw production trace at dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl, cached locally so the figure is reproducible without ssh access. Sampled anchors match the precomputed summary exactly: top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6% plus newly readable points: top 25% = 87.5%, top 50% = 96.0% Workload characterization is now consistent with the production distribution rather than the small replay subset. Replay window CDF kept as an overlay to show the same hockey-stick shape on the data §5 actually uses. - analysis/characterization/data/production_session_skew_cdf.json: cached sample points (29 KB), so the figure rebuilds locally - scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw - MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace, add top-25%/50% data points Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:41:53 +08:00
Gahow Wang	dc6d24d1ca	Add NIXL substrate isolation control + attribution decomposition
Adds unified_nixl_both to elastic_migration_v2: same picker as unified_kv_both (never triggers PD-sep), but launches vLLM with NixlConnector instead of MooncakeConnector. Compared against plain unified and unified_kv_both (Mooncake) we can now attribute the substrate overhead between "v1 connector framework irreducible cost" (proxied by the leaner NIXL) and "Mooncake implementation extra" (Mooncake - NIXL).
Result (vs plain unified, both substrates never PD-sep):
  metric        plain    NIXL     Mooncake
  TTFT p90      7.35s    +37.9%   +45.3%   (NIXL: +7pp better)
  TPOT p90      17.1ms   +15.5%   +24.5%   (NIXL: +9pp better)
  E2E p90       18.03s   +17.4%   +27.0%   (NIXL: +10pp better)
  hotspot       3.667    +0.2%    +19.0%   (NIXL: keeps it flat)
  APC           79.4%    -0.3pp   -1.1pp
  interference  -        5.58     8.57     (NIXL: ~35% lower)
The cleanest signal is hotspot: NIXL preserves plain-unified's distribution (3.674 vs 3.667), while Mooncake's per-scheduler-step O(|cache|) `set(self._block_pool.cache.keys())` diff against _known_hash_keys (mooncake_connector.py:432-456) inflates routing imbalance by 19%. The hash sync runs unconditionally even when no direct_read consumer is present.
Attribution: NIXL-plain ~= v1 framework irreducible cost (kv_buffer GPU memory, per-step SchedulerOutput.kv_connector_metadata round-trip, altered kv_cache_manager block-lifecycle). Mooncake-NIXL ~= Mooncake-specific overhead (the hash-sync loop and stricter delay_free semantics).
Practical implication: NIXL is meaningfully better than Mooncake on this stack, but even NIXL imposes 16-38% across metrics, which is too expensive for selective-PD-sep on agentic workloads where the trigger rate is < 0.5%.
Launch fixes required for NIXL multi-instance:
- VLLM_NIXL_SIDE_CHANNEL_PORT must be unique per instance (default 5600; we use 5600+i). Without this, 7 of 8 instances silently hang in `zmq.error.ZMQError: Address already in use` and the launcher trap kills all of them at health-check timeout.
- Health-check timeout raised from 180s to 360s; NIXL initialization (UCX agent + memory registration) is ~100-150s per instance under 8-way concurrent load, vs Mooncake's ~30-60s. New figure: fig_connector_substrate_attribution.png stacks plain / framework / Mooncake-extra / v2-branch overhead per metric. Existing figures (fig_kv_both_overhead, fig_three_way_hotspot) updated to include NIXL as a fourth bar. README updated with 4-way table, Result 1 reframed as "the cost is mostly framework, not Mooncake — but Mooncake adds the hotspot penalty", and the substrate-vs-PD-sep tradeoff math. Refs: nixl_connector.py:700 handshake listener bind, factory.py register_connector for the NixlConnector entry. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 16:02:12 +08:00
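The per-instance side-channel port fix described in the commit above could look roughly like this. It is a sketch only: `nixl_env` is a hypothetical helper, not the repository's launcher, and the only details taken from the message are the `VLLM_NIXL_SIDE_CHANNEL_PORT` variable, the default of 5600, and the 5600+i scheme.

```python
import os

BASE_SIDE_CHANNEL_PORT = 5600  # default cited in the commit message


def nixl_env(instance_index, base_env=None):
    """Return a copy of the environment with a unique NIXL side-channel port.

    Instance i gets port 5600 + i, so concurrently launched instances on the
    same host do not collide on the ZMQ bind.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["VLLM_NIXL_SIDE_CHANNEL_PORT"] = str(BASE_SIDE_CHANNEL_PORT + instance_index)
    return env


# Eight instances, as in the 8-way deployment discussed above.
envs = [nixl_env(i, base_env={}) for i in range(8)]
ports = [e["VLLM_NIXL_SIDE_CHANNEL_PORT"] for e in envs]
print(ports)
```

Each resulting dictionary would then be passed as the `env` argument when spawning the corresponding instance, for example via `subprocess.Popen`.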
Gahow Wang	d76eb02637	Elastic migration v2 section: PD-sep on agentic workload is net negative New analysis/characterization/elastic_migration_v2/ packages the unified_v2 + unified_kv_both experiments into a self-contained results section that the paper can cite as the "we tried selective PD-sep migration" case study. The section finds three independent reasons PD-sep doesn't help on agentic w600: 1. Mooncake kv_both substrate alone (no PD-sep ever firing) imposes TTFT p90 +45%, TPOT p90 +25%, hotspot index +19% vs plain unified. Per-step KVConnectorMetadata maintenance and block reservation semantics dominate even when no transfer is pending. 2. PD-sep gate fires only 0.16-0.41% of requests across two gate-tightness configurations. 88-76% are killed by new_local < threshold because 93% intra-session reuse on agentic traces leaves a small uncached tail; 19% are killed by chosen_no_active_decode (snapshot-time gate). Even relaxed thresholds can't grow trigger rate past 0.5%. 3. When PD-sep fires, the calibrated cost model (0.3s + bytes / 2.7 GB/s) is wrong by 10-20x. 5 triggered requests in v2.1 saw realized TTFT 12-45s vs model-predicted migrate cost 0.7-2.2s, consistent with the E2 audit's finding that D-side block pre-reservation and missing layerwise pipelining dominate the decode_sent -> first_token clock. Three-way comparison (unified vs unified_kv_both vs unified_v2): v2 vs the kv_both control is roughly net-zero (-10% hotspot, -14% TPOT p90, +3% TTFT p90, +9% TTFT p99). v2 vs plain unified is strictly worse by 27-49% across latency percentiles because the kv_both substrate tax is unavoidable when the policy is enabled. Contents: - README.md: the four results sections, the three-way comparison table, an explicit "what this claims for the paper" list, and a cross-reference index to the earlier characterization documents. - data/: b3_policy_comparison.json + per-policy breakdown.json + per-policy hotspot_index.json for the four policies in scope. 
- figures/: 4 PNGs rendered by render_figures.py: * fig_kv_both_overhead.png — 4-metric bar chart with delta annotations showing kv_both alone costs +45% TTFT p90. * fig_v2_trigger_funnel.png — per-reason request count for the two gate configurations on log scale. * fig_v2_predicted_vs_actual.png — scatter of model-predicted migrate cost vs realized TTFT for the 5 triggered requests, with y=x, 10x, and 20x reference lines. * fig_three_way_hotspot.png — per-worker TTFT p90 grouped bars across the three policies. The section is intentionally self-contained: it lists what the experiment validates (cost model picks correct candidates; shadow-drift fix is necessary; same-worker interference is real) alongside what it disproves (per-request PD-sep on agentic via Mooncake is not a net win in current implementation). Refs: E1/E2 subagent audits, B2 microbench, unified_v2 commits `19f69a9` / `4b833d3` / `95c8ef8`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 13:28:37 +08:00
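The calibrated cost model quoted in the commit above (0.3s + bytes / 2.7 GB/s) and the misprediction factor it is compared against can be written down as a short sketch. The function names are hypothetical, 1 GB is assumed to mean 1e9 bytes, and the payload size and realized TTFT in the usage lines are illustrative rather than one of the five triggered requests.

```python
def predicted_migrate_cost_s(kv_bytes, fixed_s=0.3, bandwidth_bytes_per_s=2.7e9):
    """Fixed setup cost plus transfer time at the calibrated bandwidth."""
    return fixed_s + kv_bytes / bandwidth_bytes_per_s


def misprediction_factor(realized_ttft_s, kv_bytes):
    """How many times larger the realized TTFT is than the model's prediction."""
    return realized_ttft_s / predicted_migrate_cost_s(kv_bytes)


# Illustrative inputs: a 2.7 GB KV payload and a 13 s realized TTFT.
example_bytes = 2.7e9
predicted = predicted_migrate_cost_s(example_bytes)
factor = misprediction_factor(13.0, example_bytes)
print(f"predicted={predicted:.2f}s factor={factor:.1f}x")
```

With these inputs the model predicts 1.3 s and the realized value is 10 times that, which is the low end of the 10-20x gap the message reports.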
Gahow Wang	c63dc151a0	Agentic PD / Unified routing story plan draft User's 2026-05-25 draft aligning three threads (agentic-kv vLLM experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into a single story for the paper. Tracked so future iterations and review history are in version control. Co-Authored-By: Gahow Wang <chiahaco@gmail.com>	2026-05-26 01:12:42 +08:00
Gahow Wang	0881942cf3	Window 1 results: recompute with fixed metrics + reframe limitations After the B3 audit bug fixes (joined_analysis hotspot median + b3_analyze percentile interp), regenerate b3_policy_comparison.json and the per-policy hotspot_index.json from the same raw run on dash0 and re-render the three affected figures (apc-vs-hotspot, latency-bars, per-worker TTFT). Key number changes in window_1_results.md: - hotspot_index magnitudes corrected (all five policies; lmetric smallest delta at +0.7%, sticky largest at +16.1%) - "capped reduces hotspot 13%" -> "~10% (2.253 -> 2.020)" - TTFT/E2E/TPOT percentiles shift by <1% from floor->interp (unified TTFT p90 7.24 -> 7.35 s) Restructured "Caveats" into "Limitations (read this before quoting B3 numbers)": 1. Agentic dispatch coupling is by design — promoted from caveat to top-level methodology framing, tied to agentic_dispatch_coupling.md 2. B3 interference_index is binary (not size-graded) — added 3. Hot-sweep cache contamination (<1%) — kept 4. Unified interference unrecoverable — kept with explicit warning not to read unified's failure attribution as causal 5. w600 is a sample, not full trace — kept 6. Reuse decomposition is per-token in expectation — added current_results/characterization_claim_matrix.md updates: - The "heavy-tail not sole cause" claim now cites the corrected ~10% drop with the median bug noted - New supported claim: "B3 saturated-replay latency gaps include an agentic dispatch-coupling feedback term, which is intentional and matches production"; cited against agentic_dispatch_coupling.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:08:55 +08:00
Gahow Wang	0e82612100	Fix B3 analysis bugs from subagent audit (median + percentile + sweep)
Three fixes from the B3 audit:
1) joined_analysis.hotspot_index used sorted[n//2] as median, which returns the ~60th percentile for n=8 (even-length). Systematically under-states the hotspot index. Recomputed values:
  lmetric    2.238 -> 2.253 (+0.7%)
  load_only  1.140 -> 1.294 (+13.5%)
  sticky     2.349 -> 2.728 (+16.1%)
  unified    3.350 -> 3.667 (+9.5%)
  capped     1.937 -> 2.020 (+4.3%)
Qualitative ranking preserved; "capped only modestly reduces hotspot" story holds with ~10% drop instead of the previously reported 13%. Added test_hotspot_index_uses_true_median_for_even_n to lock in the fix.
2) b3_analyze.sh's pct() helper used floor-indexed percentile sorted[int(p*(n-1))], inconsistent with metrics._percentile and joined_analysis._percentile which both use linear interpolation. Now matches.
3) b3_sweep.sh's capped step called run_policy "capped", but the proxy's argparse has no "capped" choice, so the hot-sweep variant would have crashed on this step. The actual capped data was produced via b3_isolated_policy.sh with --policy lmetric. Replace the broken inline call with an explicit launch_proxy lmetric + inline replayer block so the sweep script matches the data path it documents.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:08:37 +08:00
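The first two fixes in the commit above are easy to see on a small example. The sketch below is not the repository's code: the function names are hypothetical, the inputs are assumed non-empty, and the eight worker values are synthetic. It contrasts `sorted[n//2]` with a true median for even n, and the floor-indexed percentile with linear interpolation.

```python
def buggy_median(values):
    """sorted[n//2]: picks the upper-middle element when n is even."""
    s = sorted(values)
    return s[len(s) // 2]


def true_median(values):
    """Average of the two middle elements when n is even."""
    s = sorted(values)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def floor_percentile(values, p):
    """Floor-indexed percentile: sorted[int(p * (n - 1))]."""
    s = sorted(values)
    return s[int(p * (len(s) - 1))]


def interp_percentile(values, p):
    """Percentile with linear interpolation between neighbouring ranks."""
    s = sorted(values)
    rank = p * (len(s) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (rank - lo)


workers = [1, 2, 3, 4, 5, 6, 7, 8]  # synthetic per-worker values, n=8
print(buggy_median(workers), true_median(workers))
print(max(workers) / buggy_median(workers), max(workers) / true_median(workers))
print(floor_percentile(workers, 0.9), interp_percentile(workers, 0.9))
```

Because the buggy median is too high for even n, a max/median ratio built on it comes out too low, which matches the direction of every recomputed value listed in the message.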
Gahow Wang	8ac41a8684	Agentic dispatch coupling: trace-replay session-sequentiality is realistic The B3 audit flagged the trace replayer's "fire turn N+1 immediately if turn N is behind schedule" semantics as a potential benchmark crime, because under saturation the effective arrival process becomes policy-dependent (slow policy -> longer session lifetimes -> more concurrent in-flight -> harder system -> still slower). The audit called this dispatch slip. But in agentic workloads, turn N+1 is generated by a tool-call response or an autonomous-loop step, not by a human reading the previous reply. There is no inter-turn think-time. So the replayer's "no think-time, sequential within session, fire-immediately-when- ready" behavior is the correct model of agentic production, and the feedback amplification is a real property of production systems under saturation rather than an artifact of the replayer. The note (analysis/characterization/agentic_dispatch_coupling.md) lays out: - The dispatch rule and the apparent feedback loop - Why agentic workloads do not have user think-time - Application of Little's Law: slower policy carries higher concurrent in-flight load, so the policy x feedback gap is real, not artifact - Reframes B3 as the "production-replay" experiment and B4 as the orthogonal "controlled-load" experiment, complementary not hierarchical - Calls the feedback amplification itself out as a finding worth reporting (e.g. unified's ~2x latency-p90 gap over lmetric in B3 reflects both the routing improvement and the in-flight reduction) - Contrasts with chat workloads (human think-time partially breaks the feedback loop, agentic removes that floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:00:25 +08:00
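The Little's Law step in the note above (mean in-flight count = arrival rate × mean time in system) can be illustrated with a few lines. The arrival rate and latencies below are hypothetical and chosen only to show the proportionality; they are not B3 measurements, and the law applies to mean latency rather than to the p90 values quoted elsewhere.

```python
def mean_in_flight(arrival_rate_per_s, mean_latency_s):
    """Little's Law: L = lambda * W."""
    return arrival_rate_per_s * mean_latency_s


# Hypothetical: same offered load, one policy twice as slow on average.
arrival_rate = 2.0          # requests per second
fast_policy_latency = 9.0   # seconds, mean time in system
slow_policy_latency = 18.0  # seconds, mean time in system

fast_load = mean_in_flight(arrival_rate, fast_policy_latency)
slow_load = mean_in_flight(arrival_rate, slow_policy_latency)
print(fast_load, slow_load, slow_load / fast_load)
```

At equal arrival rate the slower policy carries proportionally more concurrent requests, which is the feedback term the note argues is a property of the workload rather than of the replayer.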
Gahow Wang	559faa1e26	B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops to 2.26x at 65k. The naive reading is "interference gets weaker for huge prefills"; the actual mechanism is a regime shift, and reading TPOT p90 alone is misleading. Three superimposed effects: 1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that chunked-prefill keeps interleaving decode steps, so overlapping decodes trickle tokens out at painful per-token rates. A 65k prefill is long enough that overlapping decodes are fully blocked for ~10s; once they break through, the injection is winding down and subsequent iterations run unobstructed. The cost lands on the TTFT clock (14s) instead of inflating TPOT. 2. Bimodal TPOT distribution. At 65k overlap, decodes split into "blocked entire prefill then normal rate" and "trickled slowly through prefill chunks". p99 sits on the second population and grows 59 -> 169.5 ms; p90 sits on the first and shrinks. 3. "Clean" stops being clean. With 4x ~10s injections in 60s, the 110 "clean" decodes at 65k are squeezed into 2-3s recovery pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (40%), shrinking the denominator of the ratio. window_1_results.md adds a new B2 subsection laying out the mechanism with the per-cell data table and the explicit reading rule: headline interference metric is TTFT idx (monotone); TPOT p99 is the right tail indicator; TPOT p90 alone is unsafe across regime shifts. Direct implication: TTFT and TPOT need separate SLO thresholds under PD-colo, because they measure costs from different points in the request lifecycle and the cost migration between them is workload-dependent. current_results/characterization_claim_matrix.md adds a new supported claim for the cost migration, listed against the existing B2 evidence. current_results/reviewer_risk_register.md adds a low-severity entry warning future readers off TPOT p90 alone. 
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 00:35:45 +08:00
Gahow Wang	4722883903	Audit package refresh: Window 1 supported claims + risk register Refresh the standing audit package now that B1' / B2 / B3 are complete. current_results/characterization_claim_matrix.md Flips seven entries from "not_yet_supported" / "partially_supported" to "supported" with pointers into window_1_results/. New entries cover per-session sequentiality, KV per request, real reuse decomposition, theoretical APC ceiling, the LMetric locality gap, Unified breaking the locality-vs-latency tradeoff, B2 causal interference proof, sticky's interference inflation, and the partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work). current_results/main_claim_allowed_runs.md New "Allowed For Routing-Policy Comparison" section pins the five B3 policy directories. New "Allowed For PD-colo Interference" section pins the B2 sweep. Legacy section retained for the pre-instrumentation 200/500/1000-req runs. current_results/reviewer_risk_register.md Marks the two old "high"-severity risks (sequentiality / reuse decomposition) as resolved; adds new entries for the APC contamination empirics, the b3_analyze.sh truncate-write bug that cost unified's interference index, the GPU-0 EngineCore ghost cleanup, the saturated-replay caveat for trace-timestamp dispatch, and the synthetic B2 decode workload. current_results/all_figures_index.md Adds the 8 new Window 1 figures alongside the existing 6 from the legacy summarize_runs run. current_results/reproduction_commands.sh Records the full B3 + B2 + figure pipeline. analysis/characterization_todo_for_interns.md Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only B4 and B5 remain (Window 2). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:25:27 +08:00
Gahow Wang	0c3220cbb8	Window 1 results: combined B1' + B2 + B3 report and artifacts analysis/characterization/window_1_results.md is the headline write-up for Window 1: workload characterization (KV per request, real reuse decomposition, APC theoretical ceilings), B3 5-policy sweep with per-policy interpretation, B2 same-vs-different-worker interference microbench with causal reading, and an explicit list of what Window 1 does not answer (deferred to B4 SRR sweep + B5 attribution). Under window_1_results/: - 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC upper bound, and the KV footprint - per-policy hotspot_index.json snapshots so render_window1_figures.py can plot per-worker TTFT p90 distributions - 8 PNG figures (figures/) covering the headline claims Three takeaways the figures pin down: 1) intra-session reuse dominates (93.2%), so session-affinity routing is the right primary lever 2) unified hybrid affinity hits 79.4% APC (97% of the 79.6% intra- session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s 3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill- size variation; same-worker TTFT idx scales 2.15× -> 218×, which is the cleanest causal evidence for same-worker prefill-decode interference Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:25:09 +08:00
Gahow Wang	b7902061d1	Window 1 analysis: APC upper bound, B2 window-overlap, figure renderer Three CPU-only analysis pieces that turn raw Window 1 artifacts into publishable numbers and figures. scripts/compute_apc_upper_bound.py Block-level trie walk over hash_ids to compute the theoretical APC ceiling on a trace, decomposed into intra-session / any-session / shared-prefix-only. Gives a fixed reference for what each routing policy could possibly achieve. w600 result: 79.6% intra-session, 80.3% any-session, 0.1% shared-prefix. analysis/characterization/b2_sweep_analysis.py (rewrite) Previous version used joined_analysis.interference_index() which labeled overlap = "any prefill in any other request during this decode". With short-prompt decode load this is always true (everyone's prefill overlaps everyone else's decode); n_overlap was 239/240 even in the different-worker control. New version labels overlap iff the decode's [t_first_token, t_finish] intersects an actual large injection window, computed from the cell's "prefill"-tagged metric rows. Different-worker control now cleanly sits at idx ≈ 1.0, same-worker scales monotonically. analysis/characterization/render_window1_figures.py Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling / APC vs hotspot scatter / per-worker TTFT / failure breakdown, B2 TPOT and TTFT curves (overlap vs clean and idx), reuse decomposition, KV footprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:24:54 +08:00
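The revised overlap rule in b2_sweep_analysis.py, as described above, labels a decode as overlapping only if its [t_first_token, t_finish] interval intersects a large-injection window. A minimal sketch of that test follows; the function names and the window list are hypothetical, and closed intervals are an assumption made here.

```python
def intervals_intersect(a_start, a_end, b_start, b_end):
    """True if closed intervals [a_start, a_end] and [b_start, b_end] share a point."""
    return a_start <= b_end and b_start <= a_end


def overlaps_injection(t_first_token, t_finish, injection_windows):
    """Label a decode as 'overlap' iff it intersects any injection window."""
    return any(
        intervals_intersect(t_first_token, t_finish, start, end)
        for start, end in injection_windows
    )


# Hypothetical injection windows (seconds since cell start).
windows = [(10.0, 20.0), (40.0, 50.0)]
print(overlaps_injection(18.0, 25.0, windows))  # decode straddles the first window
print(overlaps_injection(21.0, 39.0, windows))  # decode sits in the gap between windows
```

Under this rule a decode that runs entirely between injections is counted as clean, which is what lets a different-worker control separate from the same-worker variant.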
Gahow Wang	08530b3915	B3 policies: pseudocode reference for the five-policy sweep Documents each pick_instance_* function from cache_aware_proxy.py in pseudocode so the policy semantics can be cited without re-reading implementation details. Covers lmetric (main baseline), load_only (no cache / no affinity control), sticky (hard affinity control), unified (gated affinity + LMetric fallback), and capped (lmetric on a per-session turn-capped trace). Includes a decision matrix that maps each policy to whether it uses session affinity, cache awareness, load awareness, and overload break, plus a one-liner per control explaining what comparison isolates which factor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 19:57:02 +08:00
Gahow Wang	e23128ad65	B2: PD-colo interference microbench harness + sweep aggregator scripts/b2_interference.py is the controlled microbench. It runs two coroutines against the open proxy bypass (direct vLLM endpoints): - decode_load: continuous short-prompt requests at fixed QPS into a designated decode instance, to keep it decode-saturated. - prefill_injections: N large one-token requests at fixed interval, pointed at either the same instance (same-worker variant) or a paired one (different-worker control). Each cell (variant × prefill_size) gets its own metrics.jsonl plus a run_window.json containing t_start_unix/t_end_unix. The shared engine_*.jsonl from the scheduler patch is sliced by that window in the aggregator. analysis/characterization/b2_sweep_analysis.py walks the cell tree, slices the per-worker step log by each cell's window, runs the A5 interference_index() against the slice, and emits a single b2_sweep_summary.json with one row per cell. This is what feeds the "interference vs uncached prefill size" figure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:51 +08:00
Gahow Wang	763355b825	A5 fix: worker-id resolution and vLLM cmpl- rid stripping Smoke validation on dash0 surfaced three real bugs that broke interference and failure-attribution labels end-to-end: 1. endpoint_url in metrics is the proxy URL (e.g. http://h:9200); the vLLM worker URL lives in breakdown's routed_to. The interference index and label path were taking endpoint_url first, so every request looked routed to a non-existent worker and the overlap counter stayed at zero. 2. _normalize_worker hard-coded base port 8000, so a smoke run on port 9100 resolved to engine_1100 instead of engine_0. Added a --worker-map URL=engine_id CLI flag and _resolve_worker() that prefers the explicit map and falls back to the heuristic. 3. vLLM rewrites the per-step rid as cmpl-<proxy_id>-<i>-<hash>, so the str equality check between per_req rid and our proxy request_id never matched -> every prefill step looked like "other request prefill", which would have flipped overlap to 100%. Added _vllm_rid_matches() that strips the cmpl-/chatcmpl- prefix. After the fix, the same smoke run reports interference_index = 22.9 across 24 overlap / 6 clean requests on a single instance, which is the expected shape for serial dispatch into a cold engine. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:47:23 +08:00
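The third bug above comes from comparing vLLM's rewritten request id against the proxy's id with plain string equality. A sketch of a prefix-stripping comparison is shown below. It is not the repository's `_vllm_rid_matches`: the function name, the request ids, and the handling of the trailing `-<i>-<hash>` part are assumptions based only on the `cmpl-<proxy_id>-<i>-<hash>` shape quoted in the message.

```python
_VLLM_PREFIXES = ("chatcmpl-", "cmpl-")


def vllm_rid_matches(vllm_rid, proxy_request_id):
    """Compare a vLLM per-step rid with the proxy request id.

    Strips a leading 'chatcmpl-' or 'cmpl-' and accepts either an exact match
    or the proxy id followed by '-' and further suffix fields.
    """
    rid = vllm_rid
    for prefix in _VLLM_PREFIXES:
        if rid.startswith(prefix):
            rid = rid[len(prefix):]
            break
    return rid == proxy_request_id or rid.startswith(proxy_request_id + "-")


# Hypothetical ids for illustration.
print(vllm_rid_matches("cmpl-req42-0-abc123", "req42"))
print(vllm_rid_matches("cmpl-req421-0-abc123", "req42"))
```

Requiring the `-` separator after the proxy id avoids treating `req421` as a match for `req42`, a case that a bare prefix check would get wrong.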
Gahow Wang	cd82b8c2a2	PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined
Captures 5 runs from the experiment matrix (combined-ca x3 seeds, pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl with cuda graphs enabled. The headline:
  combined-ca: TTFT p50 0.91s  success 99.5%
  pdsep-4p4d:  TTFT p50 62.8s  success 52%   (69x worse, half dropped)
  pdsep-6p2d:  TTFT p50 51.1s  success 68%   (56x worse, third dropped)
C2 (fig_c2): headline bars per config with error bars.
C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep splits hit the memory wall, but the side differs by P:D ratio -- 4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures P-side).
C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side prefill compute; D-side wait + first token is <=1.2s. The bottleneck is P-side prefill queueing, not D-side decode wait as the original analytical model assumed.
system_analysis.md gains a Layer 5b that reconciles the analytical KV-wall model (which considered D-side only) with the empirical finding that the wall hits whichever side has fewer GPUs, and co-saturates both at extreme splits via D-side back-pressure.
plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures. bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during this work but not used by the current matrix's data).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:23:52 +08:00
Gahow Wang	25445e3d18	A5: joined analysis with reuse decomp, interference, hot-spot, labels New analysis/characterization/joined_analysis.py joins replayer metrics.jsonl + proxy breakdown.json + worker_state.jsonl by request_id, plus engine_*.jsonl by worker_id, and emits: - joined.jsonl per-request merged record - reuse_decomposition.json real intra/cross/shared classification using session_id + hash_ids + cached_tokens - interference_index.json TPOT_p90(same-worker prefill overlap) / TPOT_p90(clean), per Batch 2 - hotspot_index.json max/median worker TTFT-p90, per Batch 3 - failure_label.jsonl per-slow-request cause label, per Batch 5 - failure_breakdown.json label histogram - window_summary.json SRR warmup/steady/drain aggregates Closes the analyzer side of Phase A; replaces the status: unavailable placeholders the existing scaffold emits when join sources are missing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:33 +08:00
Gahow Wang	e5761fa6f3	Characterization plan: progress snapshot + Claude work plan - Add Progress Snapshot table to the intern TODO so per-batch status (DONE / partial / blocked-on-instrumentation) is visible at a glance. - New analysis/claude_characterization_work_plan.md scopes the Phase A instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2 (B4+B5) on dash0, with locked decisions for model, topology, trace, SLO style, and GPU phasing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:18:41 +08:00
Gahow Wang	5ed6f6fe5b	Add characterization result figures	2026-05-25 15:15:10 +08:00
Gahow Wang	0f64fb3261	Add agentic workload characterization audit scaffold	2026-05-25 15:01:18 +08:00
Gahow Wang	21ffb3d4f7	PD-sep matrix infrastructure: bench.sh pdsep mode + matrix driver Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5) in the PD-sep paper section. Three pieces: 1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an --eager flag to re-enable --enforce-eager for the cuda-graph ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and swaps the proxy command from --combined to --prefill/--decode. baseline and elastic flows are unchanged. 2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr; --with-eager doubles to ~5 h with the cuda-graph ablation. Skips completed runs, captures per-instance vLLM logs (needed for C3 step-level KV-utilization mining). 3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's observed 6P+2D 97% KV utilization. The marker lands on the model's predicted curve at p90 input, confirming the steady-state analysis. README updated with the run command, output layout, and the followup plotters that consume outputs/pd_matrix/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:47:33 +08:00
Gahow Wang	4028c587b1	Paper section: system analysis + workload figures + KV-wall model Adds the system-level argument resolving the roofline/PD-sep paradox. Even at 95% cache reuse prefill stays compute-bound (the C6 roofline fact), yet PD separation regresses TTFT 72%. The new system_analysis.md walks through six layers showing why the roofline claim is necessary but not sufficient, with the falsifiable condition being decode-side KV memory budget: concurrent_decode * KV_per_req / (N_D * HBM_pool). For chatbot this ratio is << 1 at any layout; for agentic at p90+ context it goes >> 1 under 4P+4D and 6P+2D, predicting the empirical 97% decode KV occupancy. fig_kv_memory_wall.pdf visualizes the model with audit-able constants; fig_c1a/b ground the per-request KV-size inputs in the actual sampled trace (input p50=33.5k, p90=101k, intra-session reuse 79.2%). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:41:31 +08:00
Gahow Wang	d71a111099	Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is net negative under agentic workloads" paper section: plot scripts for C1 (workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7 PDFs already rendered, and a README mapping candidate claims to required figures plus open re-run items. Removes --enforce-eager from bench.sh and all active launch scripts so cuda graphs are captured -- the prior methodology suppressed one of PD-sep's structural advantages (D-node fixed-shape decode). Legacy scripts under scripts/legacy/ are intentionally untouched as historical records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:24:16 +08:00
Gahow Wang	6a27f75337	Docs: reconcile routing docs with current hybrid direction Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit `255c8e6`). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after `cc6e562` / 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:47:14 +08:00
Gahow Wang	8e0c6e78b0	Add comprehensive research findings document Synthesizes all experiments into a paper-ready analysis: - Agentic workload characteristics vs chatbot/API - Why PD-Sep, LMetric, elastic RDMA, chunk-size tuning don't work - Why cache-aware session-sticky routing IS the key optimization (-60% TTFT, +24pp APC vs round-robin) - System-level insights: prefill-decode interference threshold, Mooncake limitations, effective request weight after cache - GPU balance → HEAVY TTFT -10.5% (demonstrated) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 07:16:31 +08:00
Gahow Wang	baf7ffb08c	16-session contention: TPOT +45% from prefill-decode interference Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%). This is the first time we've reproduced real prefill-decode interference in controlled experiments. Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency. Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show ~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not arrival rate. Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis. The real bottleneck is vLLM's chunked prefill scheduling, not routing or PD disaggregation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 05:51:47 +08:00

61 Commits