agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	837df6bc9e	v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss) Measures TTFT to serve a reused prefix of length L from each KV tier on a single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier hit (native DRAM offload), GPU-tier hit (HBM prefix cache). Each measured request is bracketed by /metrics scrapes so the tier is verified (vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed. Result: GPU hit is ~flat (42->111 ms over 1k->64k tokens); CPU hit is transfer-bound (PCIe H2D ~54 GB/s, 57->272 ms); miss grows superlinearly (78 ms -> 15.2 s). GPU beats CPU 1.4-2.5x (gap grows with context); miss/CPU up to 56x, miss/GPU up to 137x. pcie_transfer.py is the independent CPU-hit floor backstop. Evidence for the GPU-hit-first principle (paper section 2.2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 11:23:04 +08:00
Gahow Wang	cf812b6264	Workload characterization C1-C3 on full production trace Joint/temporal characterizations of the full 051315 cluster trace (2.11M req / 1.31M sessions / 2h), beyond the existing single-variable marginals: - C1 mixture: 90.3% sessions single-turn, but multi-turn (9.7%) = 44% reqs / 67% prefill mass; continuation hazard rises 10%->94% (Lindy); heaviness unpredictable at turn 1 (corr 0.04-0.15) => reactive routing justified. - C2 resident/delta: resident context 11k->56k while new-prefill 2.7k->~200; per-turn reuse ->99.6%; resident/delta ("PD tax") ->~250-450x. - C3 prefill/decode: token mass 98.7% input / 1.3% output, BUT decode ~70% of TIME (robust 68-71%); "decode negligible" is wrong (tokens != time). Correct colo argument = roofline complementarity, not "no decode". Maps each to (1) PD-colocation and (2) routing. compute_chars.py + chars.json + figs/workload_chars/. Raw-file exact validation (cached_tokens, real timings) pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:39 +08:00
Gahow Wang	847f52f03b	PD-disagg crossover: regular synthetic trace + goodput sweep + figure gen_synthetic_trace.py --mode regular: maximally-regular multi-turn trace (fixed prefix/delta/turns, constant arrivals, zero session skew) to isolate the structural PD cost (per-turn full-context transfer + P/D capacity split) from the skew/hot-pin artifact. analysis/crossover/: SLO-goodput PD_advantage sweeps bracketing the prefill<->decode bottleneck axis (D1 grow input -> prefill-bound; D2 grow output -> decode-bound). figs/crossover_pd_advantage.png shows the crossover (y=1) with the agentic operating region annotated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:23 +08:00
Gahow Wang	48ae72467a	Replayer: closed-loop inter-turn think-time mode Add --inter-turn-think (env REPLAY_INTER_TURN_THINK_S): turn 1 fires on session admission, each later turn a FIXED think-time after the previous turn COMPLETES, ignoring absolute trace timestamps. Combined with --max-inflight-sessions (env REPLAY_MAX_INFLIGHT) this is a stable N-user closed loop, removing the open-loop "fire immediately because timestamp is in the past" retrigger artifact. Needed for the dispatch-coupling (wall-clock amplification) sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:12 +08:00
Gahow Wang	657cd36f3d	Gate evict_sent_blocks behind VLLM_EVICT_SENT_BLOCKS Fork commit `e13391e` unconditionally evicts sent blocks from the prefix cache on every KV transfer. That is correct only for session MIGRATION (source won't see the session again); for plain PD-disagg producer-> consumer transfers it destroys cross-turn producer reuse and contaminates PD reuse experiments. Default OFF; enable for migration runs via VLLM_EVICT_SENT_BLOCKS=1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:18:59 +08:00
Gahow Wang	a0db3cbe77	Add leastwork_kappa decode-aware ablation (net-negative, documented) --policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok / HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 + kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax on a new prefill. Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%, E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted. Decode is too cheap in agentic (output p50~80) for the term to help; it just bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not decode interference. Kept in-tree as a documented ablation justifying LPWL's omission of any decode term; do not revive without a decode-heavy regime. See analysis/lpwl_5policy_600s.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 17:07:23 +08:00
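The kappa default and score quoted in the commit above can be reproduced with a few lines of arithmetic. This is a hedged sketch, not the repository's code: the function names are hypothetical, and reading kappa as "time to stream one token's KV from HBM, as a fraction of one decode step" is an assumption about what the derivation means.

```python
# Constants as quoted in the commit message (H20 + Qwen3-30B-A3B).
KV_BYTES_PER_TOKEN = 100e3  # ~100 KB of KV cache per token
HBM_BYTES_PER_S = 4e12      # 4 TB/s HBM bandwidth
TPOT_S = 10e-3              # 10 ms time per output token


def derive_kappa(kv_bytes=KV_BYTES_PER_TOKEN, hbm_bw=HBM_BYTES_PER_S, tpot=TPOT_S):
    """Fraction of one decode step spent reading one token's KV (assumed reading)."""
    return kv_bytes / hbm_bw / tpot


def leastwork_kappa_score(prefill_work, ongoing_decode_tokens, kappa):
    """score = prefill_work * (1 + kappa * ongoing_decode_tokens), as in the commit."""
    return prefill_work * (1.0 + kappa * ongoing_decode_tokens)
```

With these inputs `derive_kappa()` comes out at about 2.5e-6, so a host with 100,000 ongoing decode tokens inflates a 10,000-token prefill by roughly 25%. That is a small penalty on the load term, but the commit reports it is still enough to push heavy requests off their cache owner.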
Gahow Wang	71b0747b3b	600s-truncated trace + LPWL 5-policy results traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600 trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder, lower-locality regime; whitelisted alongside the parent anonymized trace. analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B was tuned on full w600 yet is beaten by the knob-free policy on this regime. Includes the run_5policy_600s.sh repro driver. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:35 +08:00
Gahow Wang	160c29133d	Unified bench report: mean+TPS+per-worker GPU util, auto-captured scripts/bench_report.py is now the canonical analyzer: per run + per input- class it emits TTFT/TPOT/E2E mean+p50+p90+p99, decode/prefill TPS (aggregate and per-worker), APC, per-worker GPU util mean/max, and load-spread ratios. b3_isolated_policy.sh auto-captures the inputs for every run: gpu_util.csv (via gpu_monitor.sh, 5s, replay-window only) + bench_config.json (worker->GPU map); teardown stops the sampler. Future runs populate per-worker GPU util automatically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:22 +08:00
Gahow Wang	d9046322c6	Add parameter-free LPWL routing policy (--policy leastwork) Least-Prefill-Work-Left: score = pending_prefill_tokens + max(0, input - cache_hit_here), pure argmin with (num_requests, round-robin) tie-break. Zero hyperparameters — derived from the agentic pattern: decode is cheap (I/O ~217x) so outstanding prefill-token-work is the only load worth modelling. Dropping LMetric's x num_requests factor (a) un-swallows the cache signal so affinity emerges with no gate, and (b) makes an idle-but- decoding host score `input` (its true marginal cost) instead of 0, removing the empty-batch degeneracy. Stick-vs-spill crossover is computed from real token-work, replacing overload_factor + cache_ratio gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:10 +08:00
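The LPWL rule described in the commit above can be sketched as follows. This is a minimal illustration under stated assumptions, not the proxy's implementation: the `Worker` dataclass, its field names, `lpwl_score`, and `pick_worker` are hypothetical, and the round-robin tie-break is modelled with an explicit `rr_offset` argument.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    pending_prefill_tokens: int  # prefill token-work already queued on this host
    num_requests: int            # requests currently on this host (first tie-break)
    cache_hit_tokens: int        # tokens of the incoming request cached on this host


def lpwl_score(worker, input_tokens):
    """score = pending_prefill_tokens + max(0, input - cache_hit_here)."""
    return worker.pending_prefill_tokens + max(0, input_tokens - worker.cache_hit_tokens)


def pick_worker(workers, input_tokens, rr_offset=0):
    """Pure argmin over the score; ties broken by num_requests, then round-robin."""
    n = len(workers)

    def key(i):
        w = workers[i]
        return (lpwl_score(w, input_tokens), w.num_requests, (i - rr_offset) % n)

    return workers[min(range(n), key=key)]
```

For a 10,000-token request, an idle-but-decoding host with no cache scores 10,000 (its true marginal cost), while a cache owner with 9,000 cached tokens and 4,000 pending prefill tokens scores 5,000, so the request sticks. Once the owner's pending prefill exceeds 9,000 tokens, its score passes 10,000 and the request spills, with no gate or overload factor involved.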
Gahow Wang	8a876e90d1	traces/README: clarify w600 is the session-start window, not span The trace actually spans ~2912 s (~48.5 min): all 274 sessions START within the 600 s --window-seconds window, but their later multi-turn requests (34% of rows, inter-turn gaps up to ~700 s) extend well past t=600 s. Remove the misleading "~600 s span". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 12:04:14 +08:00
Gahow Wang	e532e83d3e	mb5_run: scrape per-instance prefix-cache counters before teardown Per-port vllm:prefix_cache_{queries,hits}_total -> instance_apc.txt. For PD this is the only honest reuse signal: producer ports show cross-turn prefix hits, while the consumer's per-request cached_tokens just counts transferred KV. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:56:43 +08:00
Gahow Wang	d376d91fe1	Engine-state ablation: full sweep harness + results Real-time engine state is NOT the routing lever. Across 6 policies × es0/es1, real state reshuffles 44-76% of decisions but never beats the champion (unified+A+B, p90 7.62s). The effect's SIGN is set by reactivity: one-shot placement (sticky) HELPS -26%; per-request affinity-dominated is a wash; per-request pure-load (lmetric +17%, load_only +27%) HURTS via herding (stale shadow was a dampener). Feed verified fresh (median 25ms, <=92ms during prefills). Prior shadow-state results stand. ES_ABLATION_RESULTS.md has the table + mechanism; run_full_ablation.sh / fresh_sampler.py / cmp_es.py are the harness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:55:49 +08:00
Gahow Wang	08c3cf48aa	Ship anonymized benchmark trace w600_r0.0015_st30 + provenance Whitelist the sampled replay trace (1214 reqs / 274 sessions / ~600 s) past the traces/ ignore so the repo is runnable without dash0 access. Metadata only (token counts, opaque KV-block hashes, timing, session structure) — no prompts/outputs/PII. traces/README documents schema, provenance (sampled from the internal GLM-5.1 production trace via scripts/sample_trace.py), and the regeneration command. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:54:43 +08:00
Gahow Wang	8708b75520	Merge layerwise KV transfer + engine-state ablation onto main Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector, write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench, v3 trace re-profile, A/B x migration matrix runner) into main so the repo is self-contained for these experiments. Disjoint paths (microbench/connector_tax/layerwise/*) => clean merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:40 +08:00
Gahow Wang	ee5db0b321	MB5 driver updates: PD-proxy + snapshot instrument + launcher tweaks Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:27 +08:00
Gahow Wang	bad512d3c5	PD-disagg crossover: synthetic-trace generator + morpher + plotter gen_synthetic_trace (vanilla Poisson, zero prefix reuse — the regime where PD-disagg is expected to win), mutate_trace (morph reuse/burst/skew toward the agentic regime), and plot_crossover. Emits the replayer's JSONL schema. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:21 +08:00
Gahow Wang	41a0c1c48f	Migration correctness smoke tests: direct-read, partial-transfer, NIXL Standalone smoke tests validating KV-migration correctness paths before trace replay: full migrate-cache, partial-prefill transfer, and a NIXL-connector variant, each with a runner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:13 +08:00
Gahow Wang	1262c9c22e	Migration transfer-cost study: KV transfer is slow on busy GPUs MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at ~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into ~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s) + ~45% control-plane GIL starvation during long prefills. Reproduced on a fresh upstream venv (byte-identical transfer path) -> upstream/hardware inherent, not our patch. Layerwise is the wrong lever; the tax is structural on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6, instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:01 +08:00
Gahow Wang	67fcec7933	Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the LMetric fallback score) and a v3 anti-hotspot recent-migration penalty (effective_load = num_req + recent-migration count over a sliding window), preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by ~20%. Runners/analyzer for the b3 trace replay included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:52:44 +08:00
Gahow Wang	a2f2645fda	PD_DISAGG_RESULTS §6.3: producer hot-pinning figure Direct per-producer KV-pool evidence for the session-affinity backfire. At the same 4P+4D ratio: - round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01) - session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25) A 25x jump in producer load imbalance — heavy multi-turn sessions concentrate onto single producers, the same hot-pinning pathology as sticky routing in the colocated §3.3 study. plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs session comparison) — same two-stage pattern as aggregate_mb5.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:38:20 +08:00
Gahow Wang	7947831e0f	run_v3_trace.sh: stage LAYERWISE conn + enhanced proxy from shared cpfs (dash1-ready)	2026-05-29 00:29:56 +08:00
Gahow Wang	6243b78bba	PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD Swept session-affinity P routing (MB5_P_ROUTING=session) across all four ratios on the metrics-fixed stack. Findings: - Strictly worse than round-robin at every ratio. 4P+4D: round-robin 100% vs session-affinity 36% completion. - Success DECREASES monotonically as decode capacity grows (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the "session prefill is faster so it needs more D" hypothesis. - GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster stalls on KV-transfer/admission coordination, not compute. This is the deepest anti-PD argument: paid-for hardware does nothing while requests pile up; colocation keeps every GPU busy. - Mechanism: session-affinity pins heavy multi-turn sessions onto single producers (producer hot-pinning, same pathology as sticky routing in the colocated §3.3 study); fewer producers -> worse concentration -> the monotonic decline. Failed transfers also pin producer KV (kv_load_failure_policy=fail), compounding to deadlock. Verdict: neither ratio tuning nor routing policy rescues static PD-disagg for this agentic workload — the failure is structural. mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:25:10 +08:00
Gahow Wang	5b26c345f4	P2: all routing policies read real state via eff_ accessors + ablation harness InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens} = max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac. Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1 toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config ES0-vs-ES1 to test whether real state changes policy performance/ranking. All unit-tested without GPU. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:21:12 +08:00
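The `eff_` accessor rule in the commit above (max of shadow and real state when the feed is fresh, shadow only otherwise) can be sketched as below. The class name, field names, and the 1-second freshness threshold are hypothetical; the commit does not state the actual freshness cutoff.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InstanceStateSketch:
    shadow_num_requests: int                   # router-side counter, includes in-flight dispatches
    real_num_requests: Optional[int] = None    # last value published by the engine, if any
    real_timestamp: Optional[float] = None     # when that value was published (seconds)

    def feed_fresh(self, now, max_age_s=1.0):
        """True if a real sample exists and is newer than max_age_s (assumed threshold)."""
        return self.real_timestamp is not None and (now - self.real_timestamp) <= max_age_s

    def eff_num_requests(self, now, max_age_s=1.0):
        """max(shadow, real) when the feed is fresh; plain shadow when it is off or stale."""
        if self.feed_fresh(now, max_age_s) and self.real_num_requests is not None:
            return max(self.shadow_num_requests, self.real_num_requests)
        return self.shadow_num_requests
```

Taking the max keeps both properties the commit names: a stale shadow counter can no longer under-count a busy engine, and a request dispatched but not yet visible in the engine's state is still counted through the shadow value.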
Gahow Wang	be948d32b8	P2: real engine-state feed replaces stale shadow counters for migration targeting vLLM scheduler publishes real state (running/waiting, KV free, and the max-in-progress-prefill signal /metrics lacks) to a tmpfs/redis store ~20Hz; router reads it and avoids GIL-stall (mid-large-prefill) + KV-capacity-wall targets, using real load over 30s-stale shadow counters. Components: engine_state.py (canonical+reader), instrument_engine_state.py (scheduler patch, file/redis writer), migration_target.py (scorer), proxy wiring (--engine-state-uri, off=unchanged). All unit-tested without GPU; not yet run live. See P2_ENGINE_STATE.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:01:26 +08:00
Gahow Wang	19191940e6	A/B x migration matrix runner (parameterized run_v3_trace.sh + wrapper)	2026-05-28 19:23:16 +08:00
Gahow Wang	63387f614d	Full v3 trace re-profile with layer-wise: matched migrations improve 1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s, scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99 -5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms layer-wise removes the transfer half of migration overhead but not the control-plane/queue residual. DESIGN.md updated with results. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:16:37 +08:00
Gahow Wang	21db2affb4	Trace runner (run_v3_trace.sh) + concurrent mb7 correctness test	2026-05-28 17:28:48 +08:00
Gahow Wang	e705bb33b6	Proxy write-mode: concurrent prefill+decode dispatch for v3 (EAR_WRITE_MODE=1)	2026-05-28 17:22:18 +08:00
Gahow Wang	4242bba034	Chunk-safe + concurrent layer-wise connector (per-step incremental shipping) Scheduler tracks per-producer block_ids (accumulated from scheduler_output) and emits per-step LWSendMeta with cumulative computed_tokens. Worker lw_wait_for_save records a CUDA event per step and enqueues progress; the sender-loop ship loop drains it, shipping only computed+dst-wanted+unshipped blocks in order (correct under chunked prefill). Per-transfer state = concurrent-safe. Keeps v1 single-transfer version as reference. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 17:15:54 +08:00
Gahow Wang	4cd71b6631	Working-set figure: extend left panel to ~50 nodes Include T=600s/1800s points so the diminishing-returns tail is visible: 14 -> 52 nodes buys only +6pp APC (74%->79.8%), still under the 80.4% ceiling that oracle/LRU reaches at 14 nodes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 17:11:12 +08:00
Gahow Wang	2247d1de08	Working-set figure: right panel = W(t) time series Replace the (redundant) nodes-vs-T cost curve with the working-set W(t) over wall-clock time for T=2/30/300s. Shows footprint is steady (peak ~ median) after a short warm-up, so peak-based sizing is sound; the 300s curve hugs the 14-node ceiling throughout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:31:26 +08:00
Gahow Wang	e77bdcac5a	Layerwise under load: overlap benefit survives (bg=16) mb7 with background decode load (8/instance). Critical-path transfer overhead stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at 32k), prefill not slowed, KV correct. Confirms the overlap holds on busy instances. DESIGN.md updated with idle-vs-load table + the two blockers (chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:30:14 +08:00
Gahow Wang	c94b2e237a	Working-set figure: linear node axes + benefit/cost split Drop log node axis (decade ticks were unreadable). Left = APC vs #nodes (linear), right = #nodes vs retention window T. Mark the 1-node budget crossing (~7s reuse, ~8% APC) and the 14-node oracle ceiling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:24:15 +08:00
Gahow Wang	3b8be5bb61	Working-set figure: express footprint in node count, not GB Both axes now in "# nodes" (footprint / per-node KV pool) so the cluster-size implication is direct: 1-node budget line + 14-node oracle ceiling, instead of raw GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:16:00 +08:00
Gahow Wang	dae98c6472	Working-set sizing tool + GLM-5.1-FP8/B300 result Configurable KV working-set analyzer (GPU model x TP/PP/EP x model config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T), oracle [first,last], and retain-forever footprints vs a per-replica KV pool, plus the APC captured at each retention window. GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool): live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs ~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:03:25 +08:00
Gahow Wang	fec50fa45d	Layerwise KV transfer on Mooncake: PoC + microbench (worktree exploration) Implements per-layer KV push during prefill (write mode) on vLLM's MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench (mb7) shows correctness (KV lands, cached==prompt) and that the transfer is hidden behind prefill compute: critical-path overhead drops from O(KV size) (123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled, single concurrent transfer — see DESIGN.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 15:34:43 +08:00
Gahow Wang	2e6a369046	PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token KV transfer fails -> negative prompt-token counter -> prometheus ValueError -> engine dead -> cliff failure). Explains the round-robin 6P+2D rep variance (100/56/80%) as intermittent consumer death, and notes the counter-clamp patch needed to compare routing arms fairly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:02:21 +08:00
Gahow Wang	3957c2df86	MB5 patch: clamp PD-consumer metrics counter underflow Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%, rep3 80%, session-routing 6.6%): not load-shedding, but a consumer EngineCore crash. Failure chain observed in the consumer logs: 1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story) 2. a large request's KV transfer fails: "Mooncake transfer engine returned -1" (112k-token request, pool full) 3. scheduler fails the request (kv_load_failure_policy=fail) 4. PromptTokenStats.local_cache_hit = num_cached + recomputed - num_external_computed goes NEGATIVE (external transfer exceeded cached count) 5. loggers.record() calls Counter.inc(negative) -> prometheus raises "Counters can only be incremented by non-negative amounts." 6. EngineCore dies -> every subsequent request fails (the cliff: all successes in the first ~110s, zero after) This turns ONE failed request into a total config collapse, and is what made the round-robin 6P+2D reps look randomly variable. Fix: clamp the three per-source prompt-token counts to >= 0 in loggers.record() before they hit Counter.inc(). Pure insertion, revertible via the existing sentinel mechanism. Lets a transfer failure stay a single failed request instead of killing the engine, so routing arms can be compared on equal footing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:01:23 +08:00
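Steps 4 to 6 of the failure chain above, and the clamp that breaks it, can be illustrated with a stand-in counter. This is a sketch only: `CounterSketch` is a hypothetical stand-in that mimics the quoted error for negative increments, and `record_clamped` is not the patched `loggers.record()` code.

```python
class CounterSketch:
    """Hypothetical stand-in for a monotonic metrics counter."""

    def __init__(self):
        self.value = 0

    def inc(self, amount=1):
        if amount < 0:
            raise ValueError("Counters can only be incremented by non-negative amounts.")
        self.value += amount


def local_cache_hit(num_cached, recomputed, num_external_computed):
    """Formula quoted in the commit; negative when the external count exceeds the rest."""
    return num_cached + recomputed - num_external_computed


def record_clamped(counter, value):
    """Clamp to >= 0 before incrementing so one bad request cannot raise."""
    counter.inc(max(0, value))
```

With `num_cached=0`, `recomputed=0`, and `num_external_computed=112000`, the formula yields -112000; an unclamped `inc` raises, while `record_clamped` leaves the counter unchanged and the engine alive.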
Gahow Wang	8596135680	MB5 analysis: per-role KV split proves static-partition mismatch aggregate_mb5.py: - Split the cluster KV timeline by role (P-pool vs D-pool) using a PID->role map parsed from vllm_logs filenames. The cluster average hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool is actually pegged at ~100% while prefill idles at ~30%. - Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced (matplotlib) renders locally. matplotlib import is now lazy. - New plot_role_split figure + p/d peak/steady columns in the CSV. PD_DISAGG_RESULTS.md: consolidated writeup with figures inline. Verdict: no static P:D ratio beats 8C colocation. The binding constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D, P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner, the MB1 phase-isolation benefit is real) but loses TTFT and sheds load. Round-robin P routing also zeroes prefix-cache reuse; a session-affinity re-run of 6P+2D is in flight to test the fix. Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization, mb5_latency_compare + mb5_summary.csv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 12:05:17 +08:00
Gahow Wang	e8980ce957	MB5 proxy: session-affinity P routing (MB5_P_ROUTING=session) The upstream mooncake_connector_proxy round-robins both P and D selection. For agentic multi-turn sessions this destroys prefix-cache reuse on the producer side — every turn of a session lands on a different P, so the prefix-cache hit ratio collapses to 0 (observed in the 6P+2D round-robin baseline) and every turn re-prefills from scratch, piling extra load on the P pool. Add an env-gated routing mode so the same proxy serves both arms of a clean A/B: MB5_P_ROUTING=rr round-robin (default, = upstream behavior) MB5_P_ROUTING=session consistent md5 hash on X-Session-Id -> same producer for all turns of a session Decode side stays round-robin (load balance) in both modes — decode KV is freshly transferred per turn, so D gains nothing from affinity but everything from even load spreading. mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the active mode. Default path is byte-for-byte the old behavior, so an in-flight round-robin sweep is unaffected if this is redeployed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 11:05:25 +08:00
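The two routing modes in the commit above can be sketched as follows. This is an illustration under assumptions, not the proxy's code: `pick_producer`, the `routing` argument, and the "first 8 digest bytes modulo pool size" mapping are hypothetical; the commit only states that an md5 hash of `X-Session-Id` selects the producer.

```python
import hashlib
from itertools import count

_rr_counter = count()  # shared round-robin cursor for the "rr" mode


def pick_producer(producers, session_id=None, routing="rr"):
    """Return one producer endpoint according to the routing mode."""
    if routing == "session" and session_id is not None:
        # md5 is stable across processes and restarts, unlike Python's built-in hash().
        digest = hashlib.md5(session_id.encode("utf-8")).digest()
        return producers[int.from_bytes(digest[:8], "big") % len(producers)]
    # Default: round-robin, matching the upstream behavior described in the commit.
    return producers[next(_rr_counter) % len(producers)]
```

Every turn that carries the same session id maps to the same producer as long as the producer list is unchanged. The flip side, documented in the later session-affinity results, is that heavy sessions stay pinned to one producer regardless of its load.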
Gahow Wang	b13ca10d19	PD_DISAGG_INVESTIGATION: snapshot Phase 0 done + sweep in flight Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher, per-instance + cross-config plotters) is fully assembled and smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps on w600) is running on dash1. Also realigned the figure list with what `aggregate_mb5.py` actually produces (mb5_kv_timeline, mb5_peak_utilization, mb5_latency_compare, mb5_summary.csv). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:51:28 +08:00
Gahow Wang	a66f24d242	MB5 aggregate: cross-config KV-pool + latency comparison Reads sweep root + tag, for each (config, rep): - merges per-PID snapshots into cluster-wide KV timeline (carry-forward for PIDs without a sample in the bin) - computes peak (max) and steady-state (10-90% median) pool utilization - pulls latency p50/p90/p99 from replay_metrics.summary.json Produces 4 outputs in --out-dir: - mb5_kv_timeline.png — N-panel cluster KV % over time, one panel per config, faint per-rep lines + bold median - mb5_peak_utilization.png — bar chart (peak vs steady) with ±std error bars - mb5_latency_compare.png — bar chart p50/p90/p99 e2e latency per config - mb5_summary.csv — flat per-(config, rep) table for the writeup Validated on 4P+4D × 20-req smoke: 4P+4D rep1: peak=12.8% steady=10.7% peak_wait=1 p50=1.3s p90=10.5s p99=17.1s (vs. <1s for 8C — expected gap). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:49:21 +08:00
Gahow Wang	a9c7310f4a	MB5 PD-disagg pipeline: working end-to-end Three independent bugs were blocking PD-disagg smoke; each fix is isolated so the next PD experiment doesn't re-hit them. 1. mb5_launch.sh - stop_all() also kills mb5_pd_proxy.py (our vendored copy), not just the upstream filename, and asserts ports 8000-8007 + PROXY_PORT are free before launching — stale proxies were silently passing the readiness check. - Proxy readiness uses a generic "any HTTP response" probe; mooncake_connector_proxy only exposes /v1/completions so /v1/models 404 is expected. 2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it) - Force min_tokens=1 on the prefill leg. Clients that set min_tokens == max_tokens (our replayer does) collide with vLLM's min_tokens<=max_tokens check after the proxy caps max_tokens=1. 3. instrument_kv_snapshot.py - Adds a second patch target: initialize MooncakeConnectorWorker.bootstrap_server = None in __init__. vLLM 0.18.1 only sets it under the is_kv_producer branch, so kv_consumer hits AttributeError as soon as the first remote prefill request lands. - apply/revert refactored to iterate over (path, patches) pairs. plot_kv_pool_timeline.py also handles snapshot files that never captured a running request (would otherwise IndexError on an empty stackplot input). Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs all writing snapshots (601 total), well above the 8C baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:14:22 +08:00
Gahow Wang	e0d3b5150a	MB5 driver fixes: bash env-prefix + replayer flag names + python date math Two bugs caught by the 8C smoke run: mb5_launch.sh: using ${env_bp_arg} as a command-line env prefix doesn't work, because bash only treats VAR=val as an environment assignment when it appears literally in the parsed command, not when it results from expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as a literal, defaulting to 9999 when the caller passed no port (consumer mode ignores the var, so the placeholder is harmless). mb5_run.sh: the replayer's actual CLI flags are --trace / --output / --endpoint / --model, not the ---path / ---name variants I had. Also, dash1 has no `bc`, so wall_clock_s is now computed via python instead. Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs end-to-end in ~30 s: - 8 vLLM kv_both instances on GPU 0-7 come up - replayer round-robins 20 reqs across them - MB5 instrumentation captures 8 snapshot files (one per EngineCore PID), ranging from 7 to 139 snapshots each, so the ~10 Hz throttle works - plot_kv_pool_timeline.py renders the stacked-area + queue-depth chart cleanly (figs/mb5_smoke/*.png) Pipeline validated. Ready for the real PD-ratio sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:23:23 +08:00
Gahow Wang	e9abd70c8d	MB5 driver: launcher, orchestrator, KV-pool timeline plotter Three new files to drive the PD ratio sweep + per-request KV occupancy capture, plus a deploy.sh update so the patched replayer rides along to the fresh-venv host. mb5_launch.sh One script handles all four configs we plan to sweep: CONFIG=8C / 6P+2D / 4P+4D / 2P+6D - For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer talks to them via the existing comma-separated round-robin in replayer/replay.py — no proxy. - For PD configs: kv_role=kv_producer for the P pool (with VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool, routed by the official vLLM example third_party/vllm/examples/online_serving/disaggregated_serving/ mooncake_connector/mooncake_connector_proxy.py — no policy choice made by us, per user instruction to use the standard recipe. - Applies instrument_kv_snapshot.py before launching so every EngineCore writes its per-step KV snapshot to $RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl - Reverts the patch on stop. - Emits ENDPOINTS= line on stdout for the orchestrator to read. mb5_run.sh For each CONFIG × rep: launch, replay w600 trace via the existing replayer, capture wall-clock, tear down, cool down 10 s. Defaults: CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=3 TRACE=traces/w600_r0.0015_st30.jsonl All artefacts go under $FRESH_ROOT/mb5_runs/$RUN_TAG_${config}_rep${rep}/ (vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt). plot_kv_pool_timeline.py Reads one or more mb5_kv_snapshot_pid.jsonl files and renders a stacked-area chart per file: x = wall-clock since first snapshot y = KV block count, stacked by per-request contribution overlay: pool-total ceiling, 90% line, waiting-queue depth subplot Bands are colored by a deterministic hash of request_id so individual requests are visually tractable across the run. 
This is the figure the user asked for: it turns the headline "PD-disagg is 10× worse" into a system-level picture of where the KV pool is blocked, when, and by which requests. deploy.sh: also tar-syncs the local replayer/ dir to /home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can run `python -m replayer` against the patched (trace_span_s/amplification) version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:02:57 +08:00
Gahow Wang	a4f5dd56aa	MB5 instrumentation: per-request KV-block snapshot from vLLM V1 scheduler The §3.2 H1 (D-pool capacity wall) argument needs system-level evidence, not just headline latency. This patch lets us record, every ~100 ms, the exact composition of each vLLM instance's KV pool: - total / free / used block counts - for each RUNNING request: blocks held, computed tokens, prompt tokens - for each WAITING request: prompt tokens, status Hook: inside Scheduler.schedule() right before the return. Per-request blocks come from coordinator.single_type_managers[*].req_to_blocks (vLLM 0.18.1's own per-request bookkeeping; no new tracking layer). Throttled by MB5_PERIOD_MS env var (default 100 ms = 10 Hz) so a 13-min trace replay produces ~8 k snapshots per instance instead of ~80 k unthrottled. Output: $MB5_LOG_DIR/mb5_kv_snapshot_pid<pid>.jsonl (default MB5_LOG_DIR=/tmp). One file per EngineCore PID. Apply/revert idempotent, same pattern as instrument_mooncake.py. Markers: # MB5_INSTRUMENT_START / # MB5_INSTRUMENT_END. Validated on dash1 venv: apply → py_compile ok → revert → py_compile ok. With this in place we can build the stacked-area "KV pool composition over time" figure the user asked for: x = wall-clock, y = block count, colored bands = per-request portions. Comparing 8C colo vs 4P+4D on the same trace will directly show whether (and when) the D pool hits its ceiling — turning "PD-disagg is X× worse" into "PD-disagg is X× worse BECAUSE these specific requests at this specific time filled the pool and forced this queue depth". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:30:53 +08:00
Gahow Wang	4a93096c1e	Add PD_DISAGG_INVESTIGATION.md — living TODO for proving H1–H4 We don't have paper-grade evidence yet that PD-disagg fails in agentic. MB1+MB2 corrected accounting puts phase-isolation cost-benefit on PD-disagg's side; the only direct support is colleague's one data point on a patched dash0 build (TTFT p50 62×, success 52%) and the f4b geometric capacity argument. To close §3.2 properly we need fresh-venv empirical replication PLUS system-level instrumentation that tells the reviewer which component is the bottleneck — not just headline latency. This document tracks the four candidate failure hypotheses (H1 D-pool capacity, H2 static-partition mismatch, H3 cache reuse + P-pool hotspot, H4 end-to-end throughput loss), their current evidence status, and the phased experiment plan to address each. Key findings already recorded: - Phase 0 TODO 0.1 (find standard PD-disagg deployment) is done — vLLM ships an official example at examples/online_serving/disaggregated_serving/mooncake_connector/ with a kv_producer+kv_consumer launcher and a Mooncake-aware proxy that supports arbitrary P:D ratios via env vars. Per user direction, we will NOT polish PD-disagg policy ourselves; we use the official recipe as the "PD-disagg" baseline in §3.2 / §5.2. - Phase 1 (MB5+3 combined: PD ratio sweep with D-pool occupancy logging) is the critical path. Designed to either confirm H1 with system breakdown evidence (D-pool ≥ 90% for ≥ 30% of trace + queue depth spike) or falsify it (some ratio matches 8C colo, in which case §3.2 needs rewriting). - D-pool occupancy timeline is the single most important new instrumentation — turns "PD-disagg is 10× worse" into "PD-disagg is 10× worse BECAUSE the D pool sits at >90% for X% of the trace". Configurations to run on dash1 8-GPU first: 8C (colo baseline), 6P+2D, 4P+4D, 2P+6D × 3 reps × w600 trace. 
Open question still in the doc: vLLM 0.18.1 had an AttributeError on self.bootstrap_server in kv_consumer mode when we hit it during MB2 sanity; likely the issue was bad kv_transfer_params from our side (missing transfer_id, wrong field names), which we have since fixed. Official proxy uses the same handshake we now have, so it should just work. If not, single-line patch to initialize self.bootstrap_server = None for consumer mode. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:24:31 +08:00
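The occupancy half of the Phase 1 confirmation criterion above (D-pool ≥ 90% for ≥ 30% of the trace) could be checked against the snapshot stream roughly as follows. This is a sketch under assumptions: the `total_blocks` and `used_blocks` keys are hypothetical names for the total/used block counts the MB5 commit describes, snapshots are assumed to be evenly spaced so that a fraction of snapshots approximates a fraction of wall-clock time, and the queue-depth part of the criterion is not covered.

```python
def dpool_saturation_fraction(snapshots, threshold=0.90):
    """Fraction of snapshots whose KV-pool occupancy is at or above `threshold`.

    `snapshots` is a list of dicts with hypothetical keys
    'total_blocks' and 'used_blocks'.
    """
    if not snapshots:
        return 0.0
    saturated = sum(
        1
        for s in snapshots
        if s["total_blocks"] > 0
        and s["used_blocks"] / s["total_blocks"] >= threshold
    )
    return saturated / len(snapshots)


def h1_occupancy_criterion(snapshots, occupancy=0.90, min_fraction=0.30):
    """True if the pool sits at >= `occupancy` for >= `min_fraction` of snapshots."""
    return dpool_saturation_fraction(snapshots, occupancy) >= min_fraction
```

If the snapshot period drifts under load, weighting each snapshot by the gap to the next timestamp would give a closer estimate of time spent saturated.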
Gahow Wang	f739f7d461	Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy scripts/b3_isolated_policy.sh: Recognize unified_v3 as a kv_both-requiring policy; respect explicit KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both can run against either Mooncake or Nixl back-end). When Nixl is selected, skip the bootstrap-ports plumbing; Nixl uses its own UCX side-channel and the proxy forwards kv_transfer_params from the src response body instead of pre-baking engine_id/bootstrap_addr. scripts/cache_aware_proxy.py: - New unified_v3 policy (~250 lines): prefill stays on session-affinity host (preserves intra-session prefix-cache reuse), decode is migrated to a lower-load target when the affinity host is busy with concurrent decodes. KV transfer flows prefill_host → decode_target, opposite of v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy, v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity, v3_prefer_cache_target. cache_miss_audit found rotation hurts cross-turn locality (9.5% hit rate with rotation vs ~80% without), so default v3_rotate_affinity=False. - New connector_type setting ("mooncake" | "nixl") gating the PD-sep handshake form: mooncake uses pre-baked kv_transfer_params, nixl forwards them from the response body. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:05:19 +08:00
Gahow Wang	da39ab6804	Correct PD-disagg cost/benefit framing across repo The §3.2 cost-vs-benefit math in commits `029821c` (MB1 plot + pd_cost_vs_benefit.png) and `abde010` (RESULTS_SUMMARY.md) was wrong. What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode. The correct accounting is per-prefill-event across all stalled streams:
benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during) ≈ D × T_prefill
which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill). Plug MB1 + MB2 numbers in:
prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
2k tok | 0.14 s | 8 ms | 1.1 s | 0.7 %
33k tok | 4.5 s | 320 ms | 36 s | 0.9 %
125k tok | 57 s | 1.9 s | 456 s | 0.4 %
On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed. The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency. 
Changes (all touched files explicitly listed; no `git add -u`): - figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math) - microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode) - figs/mb1_interference.png : regenerated, no misleading band annotation - analysis/mb1/README.md : Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong - analysis/mb2/README.md : Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4 - RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim — they already (correctly) frame §3.2 around the D-side KV memory wall. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:04:49 +08:00
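The corrected per-prefill accounting in the commit above can be reproduced with a few lines of arithmetic. The sketch below uses only the T_prefill and T_transfer values quoted in the commit's table and the approximation benefit ≈ D × T_prefill. The function names and the `ROWS` constant are illustrative and not part of the repository.

```python
def benefit_per_prefill(d_streams, t_prefill_s, tpot_baseline_s=None, tpot_during_s=None):
    """Aggregate decode stall avoided per prefill event, in seconds.

    Implements D * T_prefill * (1 - TPOT_baseline / TPOT_during); when the TPOT
    values are omitted the factor is taken as 1, giving the commit's
    approximation D * T_prefill.
    """
    factor = 1.0
    if tpot_baseline_s is not None and tpot_during_s is not None:
        factor = 1.0 - tpot_baseline_s / tpot_during_s
    return d_streams * t_prefill_s * factor


# (label, T_prefill in s, T_transfer in s), as quoted in the commit message.
ROWS = [
    ("2k tok", 0.14, 0.008),
    ("33k tok", 4.5, 0.320),
    ("125k tok", 57.0, 1.9),
]


def cost_benefit_table(rows=ROWS, d_streams=8):
    """Return (label, benefit_s, cost/benefit ratio) for each prefill size."""
    table = []
    for label, t_prefill, t_transfer in rows:
        benefit = benefit_per_prefill(d_streams, t_prefill)
        table.append((label, benefit, t_transfer / benefit))
    return table
```

With D = 8 this yields benefits of about 1.1 s, 36 s, and 456 s and cost/benefit ratios of about 0.7 %, 0.9 %, and 0.4 %, matching the table in the commit message.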
Gahow Wang	abde010b64	Add RESULTS_SUMMARY.md: concise Chinese summary of current findings One-page distillation of what the paper can claim today, with figure / data path next to each row. Sections: 1. Workload properties: intra-session reuse, skew, KV footprint 2. Dispatch Coupling: agentic vs chatbot inter-turn gap regime 3. Three failure classes of existing scheduling: load-balance / static PD-disagg / pure sticky 4. PD-disagg cost vs benefit: MB2 (transfer 9.7 GB/s ceiling, topology-independent) + MB1 (decode halted during prefill 15-200x), joined into the §3.2 cost > benefit headline for any KV ≥ 80 MiB 5. EAR empirical status: Pillar 1 (affinity) validated, Pillar 2 (migration) substrate validated + strategy-layer pending 6. Paper claims we can already write (ordered by confidence) 7. To do (MB3-5, migration e2e, wall-clock sweep, scale-out) Designed to be the one doc to read when re-entering the project after a break. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:38:28 +08:00

235 Commits