agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	e705bb33b6	Proxy write-mode: concurrent prefill+decode dispatch for v3 (EAR_WRITE_MODE=1)	2026-05-28 17:22:18 +08:00
Gahow Wang	4242bba034	Chunk-safe + concurrent layer-wise connector (per-step incremental shipping) Scheduler tracks per-producer block_ids (accumulated from scheduler_output) and emits per-step LWSendMeta with cumulative computed_tokens. Worker lw_wait_for_save records a CUDA event per step and enqueues progress; the sender-loop ship loop drains it, shipping only computed+dst-wanted+unshipped blocks in order (correct under chunked prefill). Per-transfer state = concurrent-safe. Keeps v1 single-transfer version as reference. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 17:15:54 +08:00
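The per-step incremental shipping rule described in the commit above (ship only blocks that are computed, wanted by the destination, and not yet shipped, in order) can be sketched as follows. This is a minimal illustration, not the connector's code: `TransferState`, `blocks_to_ship`, and the block size of 16 tokens are hypothetical, and only full blocks are considered, so a real connector would still have to handle a trailing partial block.

```python
from dataclasses import dataclass, field


@dataclass
class TransferState:
    """Per-transfer shipping state (hypothetical; one instance per transfer)."""
    block_ids: list            # producer block ids, in prompt order
    wanted: set                # block ids the destination asked for
    block_size: int = 16       # tokens per block (assumed value)
    next_index: int = 0        # first block index not yet examined
    shipped: list = field(default_factory=list)

    def blocks_to_ship(self, computed_tokens: int) -> list:
        """Return blocks that became shippable, given cumulative computed tokens."""
        full_blocks = min(computed_tokens // self.block_size, len(self.block_ids))
        # Only look at blocks past the cursor, so nothing is shipped twice.
        batch = [b for b in self.block_ids[self.next_index:full_blocks]
                 if b in self.wanted]
        self.next_index = max(self.next_index, full_blocks)
        self.shipped.extend(batch)
        return batch
```

Because the cursor only moves forward and each transfer owns its own state, repeated calls with the same `computed_tokens` return nothing new, which is the property chunked prefill needs.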
Gahow Wang	e77bdcac5a	Layerwise under load: overlap benefit survives (bg=16) mb7 with background decode load (8/instance). Critical-path transfer overhead stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at 32k), prefill not slowed, KV correct. Confirms the overlap holds on busy instances. DESIGN.md updated with idle-vs-load table + the two blockers (chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:30:14 +08:00
Gahow Wang	fec50fa45d	Layerwise KV transfer on Mooncake: PoC + microbench (worktree exploration) Implements per-layer KV push during prefill (write mode) on vLLM's MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench (mb7) shows correctness (KV lands, cached==prompt) and that the transfer is hidden behind prefill compute: critical-path overhead drops from O(KV size) (123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled, single concurrent transfer — see DESIGN.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 15:34:43 +08:00
Gahow Wang	e0d3b5150a	MB5 driver fixes: bash env-prefix + replayer flag names + python date math Two bugs caught by 8C smoke: mb5_launch.sh ${env_bp_arg} expanded as a literal command line prefix doesn't work when env_bp_arg is itself a variable — bash only treats VAR=val as an env assignment if it sees the literal in the parsed command, not after expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as a literal, defaulting to 9999 when caller passed no port (consumer mode ignores the var so the placeholder is harmless). mb5_run.sh replayer's actual CLI flags are --trace / --output / --endpoint / --model, not the ---path / ---name variants I had. Plus dash1 has no `bc`; compute wall_clock_s via python instead. Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs end-to-end in ~30 s: - 8 vLLM kv_both instances on GPU 0-7 come up - replayer round-robins 20 reqs across them - MB5 instrumentation captures 8 snapshot files (one per EngineCore PID), ranging 7-139 snapshots each = ~10 Hz throttle works - plot_kv_pool_timeline.py renders the stacked-area + queue-depth chart cleanly (figs/mb5_smoke/*.png) Pipeline validated. Ready for the real PD-ratio sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:23:23 +08:00
Gahow Wang	e9abd70c8d	MB5 driver: launcher, orchestrator, KV-pool timeline plotter Three new files to drive the PD ratio sweep + per-request KV occupancy capture, plus a deploy.sh update so the patched replayer rides along to the fresh-venv host. mb5_launch.sh One script handles all four configs we plan to sweep: CONFIG=8C / 6P+2D / 4P+4D / 2P+6D - For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer talks to them via the existing comma-separated round-robin in replayer/replay.py — no proxy. - For PD configs: kv_role=kv_producer for the P pool (with VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool, routed by the official vLLM example third_party/vllm/examples/online_serving/disaggregated_serving/ mooncake_connector/mooncake_connector_proxy.py — no policy choice made by us, per user instruction to use the standard recipe. - Applies instrument_kv_snapshot.py before launching so every EngineCore writes its per-step KV snapshot to $RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl - Reverts the patch on stop. - Emits ENDPOINTS= line on stdout for the orchestrator to read. mb5_run.sh For each CONFIG × rep: launch, replay w600 trace via the existing replayer, capture wall-clock, tear down, cool down 10 s. Defaults: CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=3 TRACE=traces/w600_r0.0015_st30.jsonl All artefacts go under $FRESH_ROOT/mb5_runs/$RUN_TAG_${config}_rep${rep}/ (vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt). plot_kv_pool_timeline.py Reads one or more mb5_kv_snapshot_pid.jsonl files and renders a stacked-area chart per file: x = wall-clock since first snapshot y = KV block count, stacked by per-request contribution overlay: pool-total ceiling, 90% line, waiting-queue depth subplot Bands are colored by a deterministic hash of request_id so individual requests are visually tractable across the run. 
This is the figure the user asked for: it turns the headline "PD-disagg is 10× worse" into a system-level picture of where the KV pool is blocked, when, and by which requests. deploy.sh Also tar-syncs the local replayer/ dir to /home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can `python -m replayer` against the patched (trace_span_s/amplification) version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:02:57 +08:00
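Coloring bands by a deterministic hash of request_id could look like the sketch below. The function name `request_color` and the choice of MD5 are assumptions for illustration; the plotter's actual hashing scheme is not shown in the commit message. Python's built-in `hash()` is salted per process for strings, so a digest from `hashlib` is used to keep colors stable across runs.

```python
import hashlib


def request_color(request_id: str) -> tuple:
    """Map a request id to a stable RGB triple with components in [0, 1]."""
    digest = hashlib.md5(request_id.encode("utf-8")).digest()
    # The first three digest bytes become the red, green and blue channels.
    return tuple(byte / 255.0 for byte in digest[:3])
```

The returned triple can be passed directly as a `color=` argument to matplotlib plotting calls.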
Gahow Wang	a4f5dd56aa	MB5 instrumentation: per-request KV-block snapshot from vLLM V1 scheduler The §3.2 H1 (D-pool capacity wall) argument needs system-level evidence, not just headline latency. This patch lets us record, every ~100 ms, the exact composition of each vLLM instance's KV pool: - total / free / used block counts - for each RUNNING request: blocks held, computed tokens, prompt tokens - for each WAITING request: prompt tokens, status Hook: inside Scheduler.schedule() right before the return. Per-request blocks come from coordinator.single_type_managers[*].req_to_blocks (vLLM 0.18.1's own per-request bookkeeping; no new tracking layer). Throttled by MB5_PERIOD_MS env var (default 100 ms = 10 Hz) so a 13-min trace replay produces ~8 k snapshots per instance instead of ~80 k unthrottled. Output: $MB5_LOG_DIR/mb5_kv_snapshot_pid<pid>.jsonl (default MB5_LOG_DIR=/tmp). One file per EngineCore PID. Apply/revert idempotent, same pattern as instrument_mooncake.py. Markers: # MB5_INSTRUMENT_START / # MB5_INSTRUMENT_END. Validated on dash1 venv: apply → py_compile ok → revert → py_compile ok. With this in place we can build the stacked-area "KV pool composition over time" figure the user asked for: x = wall-clock, y = block count, colored bands = per-request portions. Comparing 8C colo vs 4P+4D on the same trace will directly show whether (and when) the D pool hits its ceiling — turning "PD-disagg is X× worse" into "PD-disagg is X× worse BECAUSE these specific requests at this specific time filled the pool and forced this queue depth". Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:30:53 +08:00
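The MB5_PERIOD_MS throttle described above can be sketched as a small gate that the hook consults before writing a snapshot. `SnapshotThrottle` and its injectable `clock` are hypothetical names, not the patch's code; only the env var name and the 100 ms default come from the commit message.

```python
import os
import time


class SnapshotThrottle:
    """Allow at most one snapshot per period (hypothetical helper)."""

    def __init__(self, period_ms=None, clock=time.monotonic):
        if period_ms is None:
            # Default of 100 ms (10 Hz) as stated in the commit message.
            period_ms = float(os.environ.get("MB5_PERIOD_MS", "100"))
        self.period_s = period_ms / 1000.0
        self.clock = clock
        self._last = None

    def should_emit(self) -> bool:
        now = self.clock()
        if self._last is not None and now - self._last < self.period_s:
            return False
        self._last = now
        return True


# A 13-minute replay at 10 Hz gives about 7.8k snapshots per instance.
expected_snapshots = 13 * 60 * 10
```

Passing the clock in as a parameter keeps the gate testable without sleeping.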
Gahow Wang	4a93096c1e	Add PD_DISAGG_INVESTIGATION.md — living TODO for proving H1–H4 We don't have paper-grade evidence yet that PD-disagg fails in agentic. MB1+MB2 corrected accounting puts phase-isolation cost-benefit on PD-disagg's side; the only direct support is colleague's one data point on a patched dash0 build (TTFT p50 62×, success 52%) and the f4b geometric capacity argument. To close §3.2 properly we need fresh-venv empirical replication PLUS system-level instrumentation that tells the reviewer which component is the bottleneck — not just headline latency. This document tracks the four candidate failure hypotheses (H1 D-pool capacity, H2 static-partition mismatch, H3 cache reuse + P-pool hotspot, H4 end-to-end throughput loss), their current evidence status, and the phased experiment plan to address each. Key findings already recorded: - Phase 0 TODO 0.1 (find standard PD-disagg deployment) is done — vLLM ships an official example at examples/online_serving/disaggregated_serving/mooncake_connector/ with a kv_producer+kv_consumer launcher and a Mooncake-aware proxy that supports arbitrary P:D ratios via env vars. Per user direction, we will NOT polish PD-disagg policy ourselves; we use the official recipe as the "PD-disagg" baseline in §3.2 / §5.2. - Phase 1 (MB5+3 combined: PD ratio sweep with D-pool occupancy logging) is the critical path. Designed to either confirm H1 with system breakdown evidence (D-pool ≥ 90% for ≥ 30% of trace + queue depth spike) or falsify it (some ratio matches 8C colo, in which case §3.2 needs rewriting). - D-pool occupancy timeline is the single most important new instrumentation — turns "PD-disagg is 10× worse" into "PD-disagg is 10× worse BECAUSE the D pool sits at >90% for X% of the trace". Configurations to run on dash1 8-GPU first: 8C (colo baseline), 6P+2D, 4P+4D, 2P+6D × 3 reps × w600 trace. 
Open question still in the doc: vLLM 0.18.1 had an AttributeError on self.bootstrap_server in kv_consumer mode when we hit it during MB2 sanity; likely the issue was bad kv_transfer_params from our side (missing transfer_id, wrong field names), which we have since fixed. Official proxy uses the same handshake we now have, so it should just work. If not, single-line patch to initialize self.bootstrap_server = None for consumer mode. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:24:31 +08:00
Gahow Wang	da39ab6804	Correct PD-disagg cost/benefit framing across repo The §3.2 cost-vs-benefit math in commits `029821c` (MB1 plot + pd_cost_vs_benefit.png) and `abde010` (RESULTS_SUMMARY.md) was wrong. What was wrong: I framed PD-disagg's max phase-isolation benefit as "≤ decode duration of the new request (~50–200 ms)", implicitly treating the benefit as per-request and bounded by that request's own decode. The correct accounting is per-prefill-event across all stalled streams:
benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during) ≈ D × T_prefill
which follows from the chunked-prefill math (each of L/N chunks slows D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill). Plug MB1 + MB2 numbers in:
prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
2k tok | 0.14 s | 8 ms | 1.1 s | 0.7 %
33k tok | 4.5 s | 320 ms | 36 s | 0.9 %
125k tok | 57 s | 1.9 s | 456 s | 0.4 %
On the phase-isolation axis alone, PD-disagg WINS by 100×–250×, the opposite of what the deleted figure showed. The actual dominant reason static PD-disagg fails in agentic is the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png): p99 single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D halves system decode capacity. Colleague's 4P+4D experiment showed TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool overflow + queueing, not by transfer latency. 
Changes (all touched files explicitly listed; no `git add -u`): - figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math) - microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit function; keep mb1_interference.png and update its title to note per-prefill aggregate stall = D × T_prefill (not capped by decode) - figs/mb1_interference.png : regenerated, no misleading band annotation - analysis/mb1/README.md : Summary block rewritten ("what MB1 measures"; no more "max benefit = decode duration" claim); §3.2 implications section replaced with the corrected per-prefill-event table; explicit ⚠ Correction note documents what was wrong - analysis/mb2/README.md : Summary block + §3.2 implications section rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4 - RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side capacity argument (the real failure mode), MB1/MB2 demoted from "kill-shot for PD-disagg" to "supporting context inputs to a cost-benefit table that actually favors PD-disagg on this axis"; §6 paper-claims list reordered to remove the wrong "PD-disagg loses on cost-vs-benefit" claim and replace with the corrected ones PAPER_OUTLINE.md and MEETING.md were checked and never picked up this specific wrong claim — they already (correctly) frame §3.2 around the D-side KV memory wall. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:04:49 +08:00
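The corrected per-prefill-event accounting from this commit can be checked with a few lines of arithmetic. The sketch below uses the approximation benefit ≈ D × T_prefill and the three (T_prefill, T_transfer) pairs quoted in the commit message; the function and variable names are illustrative only.

```python
def benefit_per_prefill(d, t_prefill_s, tpot_baseline_s=None, tpot_during_s=None):
    """Aggregate decode stall avoided per prefill event, in seconds.

    Without TPOT inputs this returns the approximation D * T_prefill.
    """
    factor = 1.0
    if tpot_baseline_s is not None and tpot_during_s is not None:
        factor = 1.0 - tpot_baseline_s / tpot_during_s
    return d * t_prefill_s * factor


# (label, T_prefill in s, T_transfer in s), taken from the commit message.
rows = [("2k tok", 0.14, 0.008), ("33k tok", 4.5, 0.320), ("125k tok", 57.0, 1.9)]

for label, t_prefill, t_transfer in rows:
    benefit = benefit_per_prefill(8, t_prefill)
    ratio = t_transfer / benefit
    print(f"{label}: benefit={benefit:.1f} s, cost/benefit={ratio:.1%}")
```

With D=8 every row lands below 1 %, which is the sense in which the transfer cost is two orders of magnitude smaller than the phase-isolation benefit.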
Gahow Wang	029821c1b6	MB1: prefill-decode interference under chunked-prefill default; §3.2 headline Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on, no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps. Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst, = prefill_ttft / (num_tokens_during_prefill / D). This is the average time each decode stream takes per token during the burst. p50 of inter-token intervals is deceptive (chunked-prefill makes most intervals look normal); the burst-average gives the true cost.
Results (D=8 row, the most agentic-realistic case):
P (tokens) | prefill_ttft | per-stream TPOT during | penalty
2048 | 143 ms | 32 ms | 4×
8192 | 583 ms | 114 ms | 15×
32768 | 4520 ms | 388 ms | 52×
65536 | 15615 ms | 757 ms | 99×
131072 | 56991 ms | 1419 ms | 183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst each ongoing decode is running ~183× slower (i.e. essentially halted) for ~57 seconds. §3.2 implication: PD-disagg's promised phase-isolation benefit per agentic request is bounded by the decode duration, which is 50–200 ms for tool-call output. MB2 says the KV-transfer cost of PD-disagg is 300 ms – 10 s for agentic-size requests. Cost > benefit for every KV size above ~80 MiB (well below trace mean 192 MiB). The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling (50–200 ms band, capped by decode) onto MB2 transfer cost curve and marks the agentic-distribution waypoints (trace mean, p90, p95, p99) on the x-axis. Across the entire agentic distribution, the cost curve sits above the benefit band. 
Adds: - microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192) - microbench/fresh_setup/mb1_driver.py: copy of the existing microbench/interference/driver.py for cpfs deployment - microbench/fresh_setup/analyze_mb1.py: aggregator emitting per-(D, P) effective-TPOT-during + max PD-disagg-benefit table - microbench/fresh_setup/plot_mb1.py: mb1 standalone + pd_cost_vs_benefit headline figure - analysis/mb1/summary.csv: 45 raw rows from the sweep - analysis/mb1/breakdown.json: per-(D, P) aggregate - analysis/mb1/README.md: persistent doc - figs/mb1_interference.png: effective TPOT during prefill, one line per D - figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere) Caveats noted in README: - chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would interleave decode more aggressively. Chunk-size sensitivity is flagged as next run. - D ≤ 8; higher D may saturate or shrink the penalty further. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 21:25:09 +08:00
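The MB1 headline metric, effective per-stream TPOT during the prefill burst, follows directly from the formula in the commit message. The sketch below is an illustration with hypothetical function names; the token count of 36 in the example is made up to show the arithmetic and is not a measured value.

```python
def effective_tpot_ms(prefill_ttft_ms, tokens_during_prefill, d):
    """prefill_ttft / (num_tokens_during_prefill / D), in ms per token."""
    if tokens_during_prefill <= 0:
        raise ValueError("no decode tokens observed during the prefill burst")
    return prefill_ttft_ms / (tokens_during_prefill / d)


def penalty(tpot_during_ms, tpot_baseline_ms):
    """Slowdown factor of each decode stream relative to its baseline TPOT."""
    return tpot_during_ms / tpot_baseline_ms


# Illustrative only: 8 streams emitting 36 tokens in total during a 143 ms burst.
example = effective_tpot_ms(143.0, 36, 8)
```

Dividing by the per-stream token count (total tokens over D) rather than taking a percentile of inter-token gaps is what makes the metric sensitive to the few long stalls that chunked prefill produces.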
Gahow Wang	90127c3389	MB2 inter-node: dash1↔dash2 transfer cost is identical to intra-node Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE. remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep config as the 2026-05-27 intra-node run. Per-size pure_transfer (p50) lines up within about 4% of the intra-node numbers at every size except the bimodal 64k point (about 8.5%):
size | intra p50 (ms) | inter p50 (ms)
512 tok | 5.3 | 5.2
2048 tok | 20.6 | 20.0
8192 tok | 83.7 | 80.9
32k tok | 320.9 | 309.6
64k tok | 1895 | 1734 (bimodal in both)
128k tok | 2835 | 2818 (bimodal in both)
=> Mooncake's batch_transfer_sync_write does not use NVLink for intra-node peers; both paths go through the 200 Gbps RDMA NIC, which (not the GPU interconnect) is the bottleneck. The ~9.7 GB/s steady-state ceiling and the 6+ GiB variance regime are identical across topologies. Operational implication for §3.2: PD-disaggregation does not get cheaper by co-locating P and D on the same node; every routed request pays the same ~10 GB/s ceiling for KV transfer, no matter where it lands. The transfer cost cannot be reduced by changing topology. Caveat: B's receive_kv events did not log on dash2: the `MB2_LOG_DIR` env var did not propagate through vLLM's EngineCore subprocess on the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2 for that var, but the producer host on dash1 worked). For this run pure_transfer numbers are from A's send_blocks alone; full rx_total breakdown is not available, but pure_transfer is the dominant term. 
Adds: - analyze_mb2_send_only.py — analyzer that works from A's send_blocks alone when B's receive_kv events are absent - plot_mb2_compare.py — overlay intra vs inter on the same axes - plot_mb2.py — tolerate the `rows`-less send-only schema - figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve - figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay - analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json, inter_kvboth_breakdown.json - analysis/mb2/README.md — Summary block updated to reference both paths, dated 2026-05-27 run-log entry appended with the full table and the topology-independence framing Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 20:56:08 +08:00
Gahow Wang	50f72d8875	MB2 inter-node scaffolding: per-host single-instance launcher + client host args Adds the pieces needed to run the producer on dash1 and the consumer on dash2 with the same shared cpfs venv: start_vllm_single.sh INSTANCE / GPU / PORT / BP / MASTER / ROLE env vars; brings up ONE vLLM instance + applies the mooncake instrumentation patch (idempotent since the venv is cpfs-shared, so the first invocation applies and the second is a no-op). Per-instance MB2_LOG_DIR keeps producer/consumer events separate even though both directories live on the same cpfs path visible to both hosts. mb2_kv_transfer.py New --src-host / --dst-host args. Defaults stay 127.0.0.1 for backward-compat with the intra-node sweep. /v1/completions URLs and /query URLs now use the supplied hosts. remote_bootstrap_addr is built as http://<src_host>:<src_bp> so the consumer's do_remote_prefill request carries a routable address. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 20:26:54 +08:00
Gahow Wang	de164e5a64	MB2: pure KV-transfer cost on dash1 intra-node — Mooncake ~9.7 GB/s steady Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 + mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with B's receive_kv enter/finish by time window). Steady-state (1k..32k tokens, 96 MiB..3 GiB KV): pure_transfer ≈ size / 9.7 GB/s rx_overhead ≈ 2–3 ms (ZMQ handshake + P-side setup) bandwidth ≈ 9.6–10.1 GB/s, very stable Large-size regime (65k..131k tokens, 6..12 GiB): p50 bandwidth collapses to 3.4–4.5 GB/s max bandwidth still hits ~9.7 GB/s (some runs achieve it) p99 agentic request (11.5 GiB) lands here Implication for §3.2 PD-disaggregation cost argument: median agentic decode = 50–200 ms (tool-call JSON output) median agentic-tail KV transfer (p99 11.5 GiB): best case (9.7 GB/s) ≈ 1.19 s observed range 1.5 – 10 s ⇒ KV transfer is 8–100× larger than the decode it enables. This is intra-node — the lower-bound transfer cost. Inter-node RDMA will be slower; that's MB2 phase 2. Adds: - analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window; per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max) - plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart - analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events (51 + 102 events including the sanity preamble) - analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated - figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 19:04:03 +08:00
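The steady-state model pure_transfer ≈ size / 9.7 GB/s can be turned into a quick estimator. The per-token KV size of 98304 B is the Qwen3-Coder-30B-A3B figure quoted in the MB2 scaffolding commit; treating "GB/s" as decimal (1e9 bytes per second) is an assumption, and the function names are illustrative.

```python
BYTES_PER_TOKEN = 98304  # per-token KV size quoted in the MB2 scaffolding commit


def kv_bytes(tokens: int) -> int:
    """KV-cache footprint of a prompt, in bytes."""
    return tokens * BYTES_PER_TOKEN


def kv_mib(tokens: int) -> float:
    """KV-cache footprint in MiB."""
    return kv_bytes(tokens) / 2**20


def transfer_time_s(tokens: int, bandwidth_gb_s: float = 9.7) -> float:
    """Steady-state transfer estimate; assumes decimal GB (1e9 bytes)."""
    return kv_bytes(tokens) / (bandwidth_gb_s * 1e9)


for tokens in (512, 2048, 8192, 32768):
    print(tokens, f"{kv_mib(tokens):.0f} MiB", f"{transfer_time_s(tokens) * 1e3:.1f} ms")
```

The estimate applies only to the stable regime; the commit reports that p50 bandwidth collapses for 6 to 12 GiB transfers, so the model is a lower bound there.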
Gahow Wang	91673f1fb8	MB2: working end-to-end intra-node KV transfer microbench This commit closes the loop on the fresh-venv MB2 path. Three corrections on top of the previous scaffold made the bench fire successfully on dash1 GPU 0+1 with kv_both connector roles: 1. Re-target instrumentation patch to vLLM's shipped MooncakeConnector (vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py). The mooncake-package's own mooncake_connector_v1.py turned out not to be the implementation vLLM 0.18.1 loads — the '{"kv_connector": "MooncakeConnector"}' config picks up the vLLM-shipped one. Patches go at _send_blocks (P-side) and receive_kv_from_single_worker (D-side, async, both entry and FINISH branch). 2. /query lives on the mooncake bootstrap port, not the vLLM HTTP port. Add --src-bp / --dst-bp args; default 8998 / 8999. 3. kv_transfer_params schema for the vanilla connector: do_remote_decode → {transfer_id} do_remote_prefill → {transfer_id, remote_engine_id, remote_bootstrap_addr} where remote_bootstrap_addr must include the http:// scheme. The dash0 smoke_test_migrate_cache.py was written for the patched build, which used a different field-name set (remote_host, remote_port, remote_block_ids); those are rejected here. Also discovered (and worked around): vLLM 0.18.1 with kv_role=kv_consumer raises AttributeError on `self.bootstrap_server` because that attribute is only assigned conditionally inside `if not self.is_kv_consumer`. We sidestep by running kv_both for the microbench — transfer mechanics are identical (same batch_transfer_sync_write call); the role gate only affects which request types each instance accepts. For §5 strict PD-disagg baseline we'll need either to fix this bug or front the pair with a role-aware proxy. 
Sanity smoke (3 sizes × 2 repeats, dash1 GPU 0+1, kv_both intra-node):
input | KV-MiB | send_blocks_ms (P) | receive_kv_ms (D) | client_step2_ms
512 | 48 | 5–23 | 7–33 | 18–91
2048 | 192 | 21 | 23 | 37
8192 | 768 | 85 | 88 | 110
=> intra-node bandwidth ~9 GB/s on the actual transfer for 768 MiB, which is well below NVLink p2p; likely PCIe-staged. Worth verifying. Next step (in flight): full sweep 512..128k tokens × 5 repeats with the per-stage analyzer. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 18:53:25 +08:00
Gahow Wang	622e0bc04c	MB2: parameterize vLLM roles (kv_producer + kv_consumer default) start_vllm_pair.sh ROLE_A / ROLE_B env vars (default kv_producer / kv_consumer for strict PD-disagg). Override to kv_both for the kv_both control. The role is injected into --kv-transfer-config so vLLM imposes the role restriction. mb2_kv_transfer.py --skip-verify flag drops step 3 (the plain completion sanity-check on the destination), required when the dst is kv_consumer-only since a kv_consumer instance refuses to serve a request without do_remote_prefill. The transfer-time itself is still measured from step 2 (do_remote_prefill on the consumer). Also: per-step client-side wall-clock timestamps (t_step1_client_unix, t_step2_client_unix, t_step2_end_unix) are now captured so the post-hoc breakdown analyzer can join with the per-instance JSONL logs on absolute time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 18:17:42 +08:00
Gahow Wang	efdcf3c555	MB2: per-stage instrumentation patch + launcher integration Per-stage breakdown of "step 2" (the B-side do_remote_prefill) requires vLLM/mooncake-internal timing — we cannot infer it from black-box HTTP E2E. This commit adds the four pieces to do that breakdown: instrument_mooncake.py apply / revert / check patches on mooncake_connector_v1.py to emit structured JSONL transfer events at two key sites: send_blocks (P-side, on batch_transfer_sync_write): {event, remote_session, total_bytes, duration_s, t_start_unix, ret, tp_rank, t_log_unix} receive_kv (D-side, on the ZMQ-driven pull request): {event, path, local_req_ids, remote_req_ids, duration_s, t_start_unix, tp_rank, t_log_unix} All injected code is bracketed by `# MB2_INSTRUMENT_START/END` so the --revert pass is a single regex scan. Apply-revert round-trip validated on dash1 (PATCHED → py_compile ok → revert → CLEAN → ok). start_vllm_pair.sh (updated) - Picks up instrument_mooncake.py via SCRIPT_DIR. - On `start`: applies patch before launching the two vLLM instances. - On `stop` (or trap exit): reverts patch. - Sets per-instance MB2_LOG_DIR = $FRESH_ROOT/mb2_transfer_logs/{A,B}/ so send-side and receive-side events land in cleanly separated dirs. deploy.sh tar-over-ssh sync of microbench/fresh_setup/ → cpfs /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/ so dash1 / dash2 see the same scripts (dash{1,2} don't have rsync; tar pipe works). The mb2_kv_transfer.py client still uses black-box E2E timing — the next commit will teach it to ingest the per-instance JSONL logs to produce the 4-way breakdown (queueing / setup / transfer / decode). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 18:12:44 +08:00
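The marker-bracketed apply/revert pattern described above (injected code between `# MB2_INSTRUMENT_START` and `# MB2_INSTRUMENT_END`, reverted by a single regex scan) can be sketched as below. The helper names and the string-anchor insertion strategy are assumptions; instrument_mooncake.py itself is not reproduced here.

```python
import re

START = "# MB2_INSTRUMENT_START"
END = "# MB2_INSTRUMENT_END"

# One bracketed block: from the START marker line through the END marker line.
_BLOCK = re.compile(
    r"^[ \t]*" + re.escape(START) + r".*?" + re.escape(END) + r"[^\n]*\n?",
    re.DOTALL | re.MULTILINE,
)


def is_patched(source: str) -> bool:
    return START in source


def apply_patch(source: str, anchor: str, snippet: str) -> str:
    """Insert the bracketed snippet before the first anchor; no-op if patched."""
    if is_patched(source):
        return source
    if anchor not in source:
        raise ValueError("anchor not found; refusing to patch")
    block = f"{START}\n{snippet.rstrip()}\n{END}\n"
    return source.replace(anchor, block + anchor, 1)


def revert_patch(source: str) -> str:
    """Remove every bracketed block in one regex pass."""
    return _BLOCK.sub("", source)
```

Making `apply_patch` a no-op on already-patched source is what lets two launchers share one venv: the first call patches and the second changes nothing.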
Gahow Wang	7437422618	MB2 scaffolding: launch script for vLLM pair + KV-transfer-time client Two new files prepare measurement of T_transfer(KV_size, network_path), the gap §3.2's PD-disagg cost argument has had since day one. microbench/fresh_setup/start_vllm_pair.sh start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B) with --kv-transfer-config '{"kv_connector":"MooncakeConnector", "kv_role":"kv_both"}' running off the fresh venv (vanilla wheel + vanilla mooncake 0.3.11, NOT the dash0 patched build). GPU IDs and ports are env-overridable so the same script drives the intra-node pair (GPU_A=0 GPU_B=1 on one host) and the inter-node pair (GPU_A=0 on dash1, GPU_B=0 on dash2, launched per host separately). microbench/fresh_setup/mb2_kv_transfer.py Three-step measurement borrowed from connector_tax/.../smoke_test_migrate_cache.py: 1. do_remote_decode on A (compute & cache KV; max_tokens=1) 2. do_remote_prefill on B (pull KV from A; this is the timed step) 3. plain completion on B (sanity check: cached_tokens ≈ prompt len) Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5 repeats each; reports mean / p50 / p90 transfer time and a per-size raw log. Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the upper end ≈ 6 GiB transfers, which is below the p99 of 11.5 GiB from §2 (the model's max_model_len 200000 caps the absolute upper bound). What we will NOT learn from this design: - Bandwidth saturation when the system is loaded (single-request bench) - vLLM-internal scheduling overhead vs pure transfer (the timed step folds them together, but for the §3.2 argument that's the right "what does PD-disagg actually pay" number) Intentionally not committed yet: an orchestrator that loops over intra-/inter-node configs. We start manual on dash1 intra-node to verify the measurement is sane before scaling out. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:47:04 +08:00
Gahow Wang	0a63de5bcf	Phase 0: fresh vllm 0.18.1 + mooncake-transfer-engine on dash1/dash2 Install script lives in microbench/fresh_setup/install.sh. Single shared venv at /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (cpfs is mounted at the same path on dash0/1/2 so one install serves all three). vllm : 0.18.1 (official wheel) mooncake-transfer-engine: 0.3.11.post1 Smoke-tested on dash1 + dash2: imports succeed, kv_transfer module resolves. This venv is the vanilla reference for all subsequent microbench / PD-disagg experiments — not the dash0 patched build that carries the connector_tax fix. The script defines proxyOn inline (ipads 127.0.0.1:11235) so it works under non-interactive ssh (~/.bashrc proxyOn is interactive-only). Sets -eo pipefail (not -u) because venv activation references unset PS1-like vars under -u. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 17:42:36 +08:00
Gahow Wang	ef9e0102ec	Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs, 274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)
TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90: 23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
Two findings: 1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE on current codebase: kv_both is now faster than plain at p90. Likely fixed by `e3a1d70` (RDMA-READ → bootstrap PUSH refactor) and the connector-mode delay_free_blocks extending cross-turn prefix cache hits on a 93%-intra-session-reuse trace. 2. DR-fix removes another 22% from TTFT p90 by skipping the O(|cache|) hash sync in build_connector_meta. Cache-sweep with DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks. Adds: - run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch) - analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis - REPORT_TRACE_REPLAY.md: summary + reproduction - results/20260526_1627_drfix/: cache-sweep with DR-fix - results/trace_replay_20260526_1652/: full trace-replay A/B/C Implication for EAR paper: the kv_both substrate is no longer the bottleneck blocking session migration. The prior 4 migration reverts were dominated by transfer overhead that has now been characterized and (partially) removed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:13:50 +08:00
Gahow Wang	31cf8c9b11	DR-fix A/B: env-gate hash sync drops slope from +81 to -0.7 μs/1k blocks Adds an env-gated skip for the per-step `set(cache.keys())` walk in MooncakeConnectorScheduler.build_connector_meta() that was introduced in our own commit `a7df84b` (Direct RDMA read). Re-runs the cache_sweep A/B with three configs: plain (control), mooncake_both (baseline), and mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1). Files:
apply_direct_read_fix.py one-line env-gate patch (markered revert)
run_drfix.sh orchestrator for plain + mooncake_both + drfix
analyze.py extended to compare mooncake_both_drfix vs plain and mooncake_both vs mooncake_both_drfix
REPORT_DRFIX.md findings
results/20260526_1543_drfix/ run artifacts
Headline:
config | slope (μs/1k blocks) | step_dur p50 @ 16.6k
mooncake_both | +81.0 | 1 550 μs
mooncake_both_drfix | -0.7 (≈ 0) | 95 μs
plain (control) | -1.8 (≈ 0) | 72 μs
build_meta p50 @ 16.6k blocks:
mooncake_both = 1 459 μs
mooncake_both_drfix = 6 μs (residual loop bookkeeping)
worker get_finished p50:
mooncake_both = 178 μs (unchanged; this fix doesn't touch it)
mooncake_both_drfix = 183 μs
The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at |cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within ±50 μs across the full cache range, which is noise-level. The slope goes from +81 to essentially zero. Worker-side get_finished (180 μs constant) is unchanged because the DR-fix touches scheduler.build_connector_meta only. That's the next target if we want to bring kv_both fully back to plain-level. 
Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
→ 85% reduction in per-step connector cost
→ TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
Confirms: the entire O(|cache|) slope was introduced by our own direct-RDMA-read implementation (commit `a7df84b`), not upstream Mooncake. Production fix: gate the sync on the presence of any direct_read consumer, or replace per-step diff with an incremental delta listener fed by block_pool add/remove callbacks. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 00:03:23 +08:00
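The trace-replay extrapolation above is plain arithmetic and can be re-derived in a few lines. The numbers are the ones quoted in the commit message (build_meta and get_finished costs in μs, a 7 ms decode step); the variable names are illustrative.

```python
DECODE_STEP_US = 7000  # ~7 ms decode forward pass, as quoted in the commit


def connector_cost_us(build_meta_us, get_finished_us):
    """Per-step connector overhead: scheduler-side plus worker-side cost."""
    return build_meta_us + get_finished_us


before_us = connector_cost_us(1060, 180)   # mooncake_both at |cache| ~ 13k
after_us = connector_cost_us(6, 180)       # with DR-fix enabled

reduction = 1 - after_us / before_us
inflation_before = before_us / DECODE_STEP_US
inflation_after = after_us / DECODE_STEP_US

print(f"before={before_us} us, after={after_us} us, reduction={reduction:.0%}")
print(f"TPOT inflation: {inflation_before:.1%} -> {inflation_after:.1%}")
```

The residual after the fix is almost entirely the constant get_finished cost, which is why the commit names it as the next target.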
Gahow Wang	8829928fc5	Cache-size sweep: build_meta is O(\|cache\|), +85.6 μs / 1k blocks Follow-up to Microbench 3 that finally tests H5 (cache-size dependence) and instruments worker-side connector callbacks the original patch missed. Patch v2 (apply_step_timing_v2.py) adds: scheduler: `cache_size` field in engine_step.jsonl worker: `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl uses BLOCK_BEGIN/END sentinels for safe multi-line revert (the original v1 patch survives this v2's apply/revert cycle) Driver: continuous open-loop (1.5 req/s, 4096x256 random per req) that lets APC fill from 0 → ceiling within one vLLM lifetime so a single run produces the full cache_size sweep. Decode-only steps are filtered post-hoc to remove prefill-mix variance. Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode steps per config): config \| slope (μs / 1k blocks) \| step_dur p50 @ \|cache\|=16.6k ---------------\|------------------------\|----------------------------- mooncake_both \| +85.6 \| 1528 μs (build_meta=1442, 94%) noop_connector \| -0.8 (≈0) \| 79 μs plain \| +1.0 (≈0) \| 84 μs Worker-side get_finished p50/p90/p99 (μs/step): mooncake_both: 180 / 257 / 333 noop_connector: 0 / 0 / 2 H5 PASSES. mooncake_both step_duration scales linearly with \|cache\| because build_connector_meta walks set(cache.keys()) every step (`mooncake_connector.py:434-450`). plain and noop are flat. The previously-uninstrumented get_finished() adds a constant 180 μs/step on top — two `run_coroutine_threadsafe(...).result()` blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`) fire every step even when no transfer is pending. Trace-replay reconciliation (APC ≈ 79% → \|cache\| ≈ 13k blocks): build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step On ~7 ms decode forward → +15-20% TPOT per step. 
This explains most of the trace-replay +25% TPOT p90 gap from single-instance per-step cost alone, leaving a smaller residual for multi-instance coupling than originally assumed. Two clear fixes pointed out in REPORT.md: 1. replace O(\|cache\|) per-step walk with incremental delta listener using block_pool's add/remove callbacks 2. short-circuit get_finished() when both producer/consumer queues are empty in kv_both Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr, .vllm.pid) are .gitignored — they re-derive from `bash run_all.sh` and SUMMARY.md / per_config.json fully capture the conclusions. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 23:34:21 +08:00
Gahow Wang	54de78eb11	Connector tax RESULTS.md: errata + run-to-run variance disclosure The prior write-up presented one specific reading of the data as the headline without flagging methodology gaps. Three corrections: 1. The "0% low-concurrency tax" comes from a single back-to-back mooncake_both_v2/plain_v2 rerun. The original Phase A pair showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2 — a 40 percentage-point swing between two consecutive runs that the original write-up did not call out. The run-to-run noise floor is too high to claim "0%" at low concurrency. 2. get_finished() was never instrumented. The patch only times step_duration_us and build_meta_us. "100% of per-step cost is build_meta" is an upper bound on what was timed, not a true decomposition. 3. H5 (cache-size dependence) was the central hypothesis but was never tested in the prior run; random content kept APC near empty. The +7-9% high-concurrency (single instance, 512x64, rate=8-16) and +17% 8-instance-saturated numbers are kept; they were measured with adequate sample sizes and are reproducible. The follow-up sweep in cache_sweep/ tests H5 directly and revises the decomposition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 23:33:01 +08:00
Gahow Wang	e3480f7d28	8-instance connector tax: +2% at non-saturated, +17% only at saturation 8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total: Rate=32 (non-saturated, thr=0.95-0.97): plain TTFT p90=64ms, mooncake_both=65ms → +2% (noise) Rate=64 (non-saturated, thr=0.96): plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise) Rate=128 (saturated, thr=0.70-0.71): plain TTFT p90=702ms, mooncake_both=822ms → +17% plain TTFT p50=339ms, mooncake_both=470ms → +39% Conclusion: The elastic_migration_v2 +45% is a saturation artifact. Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's 1.4ms/step build_connector_meta overhead is completely masked by the scheduler-model async pipeline. The tax only manifests when the system is already saturated and queueing amplifies per-step differences. For practical deployment: enabling kv_role=kv_both has effectively zero cost as long as the serving system stays within SLO capacity bounds.	2026-05-26 21:32:46 +08:00
Gahow Wang	c8ec73c548	Connector tax: high-concurrency confirms +7-9% tax, resolves trace-replay gap High-concurrency test (512 input, 64 output, rates 4-32 req/s): Rate=8: plain TTFT p90=94ms, mooncake_both=102ms → +9% tax Rate=16: plain TTFT p90=144ms, mooncake_both=156ms → +8% tax Rate=32: both saturated at ~6.1s → no distinguishable difference Low-concurrency back-to-back retest (4096 input, 256 output): mooncake_both_v2 vs plain_v2: tax is ≈0% (within noise) because scheduler's 1.4ms/step is hidden behind model forward. Decomposition of trace-replay's +45%: +7-9% from build_connector_meta per-step cost (this microbench) +20-30% from multi-instance coupling amplification (not measurable here) remainder from large-cache O(\|cache\|) scaling (Phase B follow-up) Also: bench_loop.py now emits mean/p50/p90/p99 for all three metrics.	2026-05-26 21:00:25 +08:00
Gahow Wang	a473c71cac	Connector tax Phase A: build_connector_meta is 1.4ms/step (the tax source) Per-step timing from engine_step.jsonl definitively resolves H3: plain: 53 μs/step (p50) noop_connector: 69 μs/step (+16 μs = negligible framework cost) mooncake_producer: 1461 μs/step (build_connector_meta = 1386 μs) mooncake_both: 1452 μs/step (same as producer) The substrate tax is NOT in the v1 framework — it's specifically in Mooncake's build_connector_meta() which walks set(cache.keys()) every scheduler step (O(\|cache\|) per step, E2 audit §6.5). Accumulated per-request tax: 256 decode steps × 1.4ms = 358ms. Observed TTFT tax at rate=1.0: plain 378ms vs mooncake_both 422ms (+12%). At rate=2.0 (near saturation): +29%, approaching trace-replay's +45%. Also fixes kill_vllm() to properly kill EngineCore subprocesses.	2026-05-26 19:33:15 +08:00
Gahow Wang	297fed6e73	Microbench 3 (connector_tax): infrastructure for KV connector substrate tax Validates the elastic_migration_v2 finding that kv_role=kv_both adds TTFT p90 +45% even when PD-sep never fires. Replicates under single-instance, synthetic, open-loop workload to disambiguate mechanism cost from 8-instance feedback amplification. Configurations (8): plain, noop_connector, mooncake_{producer,consumer,both}, nixl_both, lmcache_only, multi_mooncake_lmcache. Pre-flight verification gates risky configs (kv_consumer needs dummy bootstrap, multi-connector composition, NoOp custom class loading). Workload: two-phase sweep Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024}) Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit with step_duration_us and build_meta_us — directly measures per-step substrate cost, not just user-visible TTFT/TPOT. run_all.sh runs as 5-stage barrier: 0 pre-flight + apply patch 1 Phase A all configs 2 pick ref_safe / ref_load 3 Phase B all configs 4 revert patch + analyze + plot Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures. Estimated runtime: 4-5.5 hours on idle dash0 H20.	2026-05-26 17:27:41 +08:00
Gahow Wang	06dd175441	Microbench 1 plots: prefill-decode interference heatmap + lines plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps, cold prefill prompts) and produces: fig_interference_heatmap.png TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k. fig_interference_lines.png (a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed (b) Cold prefill TTFT vs P (interference window length) Confirms B2 finding: cold prefill on the same worker stalls overlapping decodes for 14-214x baseline TPOT. The interference window grows linearly with P (from ~140ms at 2k to ~4.6s at 32k) and is essentially independent of decode batch size — prefill compute time dominates.	2026-05-26 14:21:30 +08:00
Gahow Wang	72790ae6c1	PD-sep server-side profiling: vLLM patches + per-request breakdown Instrumentation patches (microbench/patches/): - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var) - apply_patches.py: idempotent patch installer for mooncake_connector.py and scheduler.py, marks insertions with # PD_PROFILE_PATCH - analyze_events.py: joins per-process JSONL event logs by transfer_id into per-request phase durations Seven events captured per request: D_get_num_matched → P_zmq_received → P_prefill_done → P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted Driver fix (microbench/lifecycle/driver.py): seed_prefix_cache now sends via the proxy URL so P and D both cache the seeded prefix with matching block hashes. Previously seeding D directly produced different block hashes than the proxy-routed measurement requests, making incremental transfer impossible. Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93): prefill_compute 620 ms median (95% of overhead) rdma_transfer 42 ms median (~71 Gbps effective) other overhead 10 ms median (dispatch + params + signal + promote) Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.	2026-05-26 13:59:09 +08:00
Gahow Wang	f784e49c07	Microbench: prefill-decode interference + PD transfer lifecycle Two microbenchmarks quantifying the elastic offload decision: 1. Interference (corrected): cold prefill causes 14-214x TPOT p90 degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}). Earlier run had a prefix-cache bug (deterministic prompts hit cache after rep 0); fixed with uuid+time_ns unique prompts. 2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy, measuring prefill→RDMA→decode startup overhead. Key finding: offload wins at all P≥2048 operating points — transfer cost is 25-50% of interference cost even with bulk Mooncake.	2026-05-26 00:57:06 +08:00

29 Commits