agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	8a6b22c11c	Replayer think-time dispatch mode + benchmarking guidance Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that agentic serving should be benchmarked with `thinktime` (the faithful load). - `tracets` (old default): turn-k at the absolute trace timestamp, i.e. max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the system is behind, manufacturing request bursts. - `thinktime`: turn-1 at trace arrival; turn-k at prev_finished + time_to_parent_chat (real production gap). scripts/add_time_to_parent.py annotates a trace with that gap from the raw trace's request_ready/end_ms. exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6 (saturated) they converge. So tracets makes the system look ~30% worse on tail latency than realistic agent pacing. Root README.md carries the headline guidance; raw per-request metrics gitignored (perf_summary.json kept). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 16:28:36 +08:00
Gahow Wang	f0d085ceda	Merge remote-tracking branch 'origin/main'	2026-05-30 15:39:25 +08:00
Gahow Wang	8d422c4301	Migration trigger validation: unified_v4 fires at 2x QPS, not at 1x Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate, w600_r0.0015_st30_first600s trace). Key findings: - At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for 95% of routing decisions because instances complete prefill before the next request arrives. The relative arm (src_pp > fleet_median*1.5) never fires. - At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of eligible decisions. Trigger correctly identifies genuinely overloaded instances (src_pp 13k–73k vs fleet median 3.8k–33k). Conclusion: mechanism is correct but migration benefit requires higher concurrency (scale-out or >3x QPS) where queue pressure makes the signal non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing) is sufficient and Pillar 2 gracefully degrades to no-op. Next: scale-out validation (16+ GPU) where session skew naturally concentrates load and triggers migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-30 15:36:58 +08:00
Gahow Wang	d9cf3126c6	docs: reframe PAPER_OUTLINE to GPU-hit-first + embed v2 figures Reorganizes the outline from the EAR / dispatch-coupling framing (kept in git history) into the GPU-hit-first structure: - §1 background splits PD-colo / PD-disagg / KV storage hierarchy, each with a forward pointer to where it is used or refuted. - §2 leads with the metric argument (request latency / TPS / GPU util, not TTFT/TPOT); dispatch coupling is demoted to that justification. §2.2 embeds the two new v2 figures -- the measured 4-tier hit hierarchy (GPU < CPU-local < remote-RDMA-store << miss) and the capacity->APC/latency knee (Evidence #1) -- plus the cluster-scale correction to the working_set "14 nodes" number. - §3 recasts the three optimizations as corollaries of GPU-hit-first: make PD-colocation default (3.1), biased KV-awareness routing (3.2), dedup via migration not replication (3.3). - §5 related work now engages the storage-hierarchy camp directly. - Validation-status table and work plan updated (top priority: wall-clock amplification sweep). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 13:34:19 +08:00
Gahow Wang	dc8e6dd5a8	v2 exp(a): add remote KV-store (RDMA) tier Extends the hit-latency microbench to a 4th tier: a remote global-KV-store hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector instances (run_rdma.sh); for each prefix length, instance B serves the request by pulling instance A's cached prefix over RDMA (do_remote_prefill, via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the timed pull is the remote-hit latency. Result (TTFT p50, 11 reps): strict tier ordering GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA 15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win over recompute (the blog's 46x) but pays the NIC tax (~5-7 GB/s effective) and sits a tier below local CPU and two below GPU -- reinforcing GPU-hit-first. README + figure updated to four tiers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 12:48:37 +08:00
Gahow Wang	ad754cfe0b	v2 exp(b): GPU KV-capacity APC/latency knee + writeup Sweeps GPU KV-cache capacity (--num-gpu-blocks-override) under a closed-loop replay (concurrency 4) of a controlled multi-turn workload (cumulative intra-session prefix, gen_synth_trace.py), measuring realized APC (prefix_cache hits/queries delta) and latency per capacity. Result: a sharp knee at 3.6 GB = exactly the active working set (4 sessions x 0.91 GB). APC rises 7->12->36->80% then saturates at the ~71% intra-session ceiling; TTFT p90 collapses 13.0 s -> 0.53 s at the same point; dead flat to 14.5 GB, 100% completion throughout. So only the active working set needs HBM; capacity beyond it -- and the CPU/storage tier built to chase the reuse tail -- buys ~0. Knee scales linearly with concurrency = cluster GPU count. README.md ties exp(a)+exp(b) into the section-2.2 GPU-hit-first argument with tables, conclusions, and caveats. Raw per-request dumps gitignored; summary/m0/m1 deltas kept. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 11:23:31 +08:00
Gahow Wang	837df6bc9e	v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss) Measures TTFT to serve a reused prefix of length L from each KV tier on a single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier hit (native DRAM offload), GPU-tier hit (HBM prefix cache). Each measured request is bracketed by /metrics scrapes so the tier is verified (vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed. Result: GPU hit is ~flat (42->111 ms over 1k->64k tokens); CPU hit is transfer-bound (PCIe H2D ~54 GB/s, 57->272 ms); miss grows superlinearly (78 ms -> 15.2 s). GPU beats CPU 1.4-2.5x (gap grows with context); miss/CPU up to 56x, miss/GPU up to 137x. pcie_transfer.py is the independent CPU-hit floor backstop. Evidence for the GPU-hit-first principle (paper section 2.2). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 11:23:04 +08:00
Gahow Wang	cf812b6264	Workload characterization C1-C3 on full production trace Joint/temporal characterizations of the full 051315 cluster trace (2.11M req / 1.31M sessions / 2h), beyond the existing single-variable marginals: - C1 mixture: 90.3% sessions single-turn, but multi-turn (9.7%) = 44% reqs / 67% prefill mass; continuation hazard rises 10%->94% (Lindy); heaviness unpredictable at turn 1 (corr 0.04-0.15) => reactive routing justified. - C2 resident/delta: resident context 11k->56k while new-prefill 2.7k->~200; per-turn reuse ->99.6%; resident/delta ("PD tax") ->~250-450x. - C3 prefill/decode: token mass 98.7% input / 1.3% output, BUT decode ~70% of TIME (robust 68-71%); "decode negligible" is wrong (tokens != time). Correct colo argument = roofline complementarity, not "no decode". Maps each to (1) PD-colocation and (2) routing. compute_chars.py + chars.json + figs/workload_chars/. Raw-file exact validation (cached_tokens, real timings) pending. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:39 +08:00
Gahow Wang	847f52f03b	PD-disagg crossover: regular synthetic trace + goodput sweep + figure gen_synthetic_trace.py --mode regular: maximally-regular multi-turn trace (fixed prefix/delta/turns, constant arrivals, zero session skew) to isolate the structural PD cost (per-turn full-context transfer + P/D capacity split) from the skew/hot-pin artifact. analysis/crossover/: SLO-goodput PD_advantage sweeps bracketing the prefill<->decode bottleneck axis (D1 grow input -> prefill-bound; D2 grow output -> decode-bound). figs/crossover_pd_advantage.png shows the crossover (y=1) with the agentic operating region annotated. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:23 +08:00
Gahow Wang	48ae72467a	Replayer: closed-loop inter-turn think-time mode Add --inter-turn-think (env REPLAY_INTER_TURN_THINK_S): turn 1 fires on session admission, each later turn a FIXED think-time after the previous turn COMPLETES, ignoring absolute trace timestamps. Combined with --max-inflight-sessions (env REPLAY_MAX_INFLIGHT) this is a stable N-user closed loop, removing the open-loop "fire immediately because timestamp is in the past" retrigger artifact. Needed for the dispatch-coupling (wall-clock amplification) sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:12 +08:00
Gahow Wang	657cd36f3d	Gate evict_sent_blocks behind VLLM_EVICT_SENT_BLOCKS Fork commit `e13391e` unconditionally evicts sent blocks from the prefix cache on every KV transfer. That is correct only for session MIGRATION (source won't see the session again); for plain PD-disagg producer-> consumer transfers it destroys cross-turn producer reuse and contaminates PD reuse experiments. Default OFF; enable for migration runs via VLLM_EVICT_SENT_BLOCKS=1. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:18:59 +08:00
Gahow Wang	a0db3cbe77	Add leastwork_kappa decode-aware ablation (net-negative, documented) --policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok / HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 + kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax on a new prefill. Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%, E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted. Decode is too cheap in agentic (output p50~80) for the term to help; it just bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not decode interference. Kept in-tree as a documented ablation justifying LPWL's omission of any decode term; do not revive without a decode-heavy regime. See analysis/lpwl_5policy_600s.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 17:07:23 +08:00
Gahow Wang	71b0747b3b	600s-truncated trace + LPWL 5-policy results traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600 trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder, lower-locality regime; whitelisted alongside the parent anonymized trace. analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B was tuned on full w600 yet is beaten by the knob-free policy on this regime. Includes the run_5policy_600s.sh repro driver. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:35 +08:00
Gahow Wang	160c29133d	Unified bench report: mean+TPS+per-worker GPU util, auto-captured scripts/bench_report.py is now the canonical analyzer: per run + per input- class it emits TTFT/TPOT/E2E mean+p50+p90+p99, decode/prefill TPS (aggregate and per-worker), APC, per-worker GPU util mean/max, and load-spread ratios. b3_isolated_policy.sh auto-captures the inputs for every run: gpu_util.csv (via gpu_monitor.sh, 5s, replay-window only) + bench_config.json (worker->GPU map); teardown stops the sampler. Future runs populate per-worker GPU util automatically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:22 +08:00
Gahow Wang	d9046322c6	Add parameter-free LPWL routing policy (--policy leastwork) Least-Prefill-Work-Left: score = pending_prefill_tokens + max(0, input - cache_hit_here), pure argmin with (num_requests, round-robin) tie-break. Zero hyperparameters — derived from the agentic pattern: decode is cheap (I/O ~217x) so outstanding prefill-token-work is the only load worth modelling. Dropping LMetric's x num_requests factor (a) un-swallows the cache signal so affinity emerges with no gate, and (b) makes an idle-but- decoding host score `input` (its true marginal cost) instead of 0, removing the empty-batch degeneracy. Stick-vs-spill crossover is computed from real token-work, replacing overload_factor + cache_ratio gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:10 +08:00
Gahow Wang	8a876e90d1	traces/README: clarify w600 is the session-start window, not span The trace actually spans ~2912 s (~48.5 min): all 274 sessions START within the 600 s --window-seconds window, but their later multi-turn requests (34% of rows, inter-turn gaps up to ~700 s) extend well past t=600 s. Remove the misleading "~600 s span". Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 12:04:14 +08:00
Gahow Wang	e532e83d3e	mb5_run: scrape per-instance prefix-cache counters before teardown Per-port vllm:prefix_cache_{queries,hits}_total -> instance_apc.txt. For PD this is the only honest reuse signal: producer ports show cross-turn prefix hits, while the consumer's per-request cached_tokens just counts transferred KV. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:56:43 +08:00
Gahow Wang	d376d91fe1	Engine-state ablation: full sweep harness + results Real-time engine state is NOT the routing lever. Across 6 policies × es0/es1, real state reshuffles 44-76% of decisions but never beats the champion (unified+A+B, p90 7.62s). The effect's SIGN is set by reactivity: one-shot placement (sticky) HELPS -26%; per-request affinity-dominated is a wash; per-request pure-load (lmetric +17%, load_only +27%) HURTS via herding (stale shadow was a dampener). Feed verified fresh (median 25ms, <=92ms during prefills). Prior shadow-state results stand. ES_ABLATION_RESULTS.md has the table + mechanism; run_full_ablation.sh / fresh_sampler.py / cmp_es.py are the harness. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:55:49 +08:00
Gahow Wang	08c3cf48aa	Ship anonymized benchmark trace w600_r0.0015_st30 + provenance Whitelist the sampled replay trace (1214 reqs / 274 sessions / ~600 s) past the traces/ ignore so the repo is runnable without dash0 access. Metadata only (token counts, opaque KV-block hashes, timing, session structure) — no prompts/outputs/PII. traces/README documents schema, provenance (sampled from the internal GLM-5.1 production trace via scripts/sample_trace.py), and the regeneration command. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:54:43 +08:00
Gahow Wang	8708b75520	Merge layerwise KV transfer + engine-state ablation onto main Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector, write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench, v3 trace re-profile, A/B x migration matrix runner) into main so the repo is self-contained for these experiments. Disjoint paths (microbench/connector_tax/layerwise/*) => clean merge. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:40 +08:00
Gahow Wang	ee5db0b321	MB5 driver updates: PD-proxy + snapshot instrument + launcher tweaks Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:27 +08:00
Gahow Wang	bad512d3c5	PD-disagg crossover: synthetic-trace generator + morpher + plotter gen_synthetic_trace (vanilla Poisson, zero prefix reuse — the regime where PD-disagg is expected to win), mutate_trace (morph reuse/burst/skew toward the agentic regime), and plot_crossover. Emits the replayer's JSONL schema. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:21 +08:00
Gahow Wang	41a0c1c48f	Migration correctness smoke tests: direct-read, partial-transfer, NIXL Standalone smoke tests validating KV-migration correctness paths before trace replay: full migrate-cache, partial-prefill transfer, and a NIXL-connector variant, each with a runner. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:13 +08:00
Gahow Wang	1262c9c22e	Migration transfer-cost study: KV transfer is slow on busy GPUs MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at ~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into ~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s) + ~45% control-plane GIL starvation during long prefills. Reproduced on a fresh upstream venv (byte-identical transfer path) -> upstream/hardware inherent, not our patch. Layerwise is the wrong lever; the tax is structural on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6, instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:53:01 +08:00
Gahow Wang	67fcec7933	Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the LMetric fallback score) and a v3 anti-hotspot recent-migration penalty (effective_load = num_req + recent-migration count over a sliding window), preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by ~20%. Runners/analyzer for the b3 trace replay included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:52:44 +08:00
Gahow Wang	a2f2645fda	PD_DISAGG_RESULTS §6.3: producer hot-pinning figure Direct per-producer KV-pool evidence for the session-affinity backfire. At the same 4P+4D ratio: - round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01) - session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25) A 25x jump in producer load imbalance — heavy multi-turn sessions concentrate onto single producers, the same hot-pinning pathology as sticky routing in the colocated §3.3 study. plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs session comparison) — same two-stage pattern as aggregate_mb5.py. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:38:20 +08:00
Gahow Wang	7947831e0f	run_v3_trace.sh: stage LAYERWISE conn + enhanced proxy from shared cpfs (dash1-ready)	2026-05-29 00:29:56 +08:00
Gahow Wang	6243b78bba	PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD Swept session-affinity P routing (MB5_P_ROUTING=session) across all four ratios on the metrics-fixed stack. Findings: - Strictly worse than round-robin at every ratio. 4P+4D: round-robin 100% vs session-affinity 36% completion. - Success DECREASES monotonically as decode capacity grows (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the "session prefill is faster so it needs more D" hypothesis. - GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster stalls on KV-transfer/admission coordination, not compute. This is the deepest anti-PD argument: paid-for hardware does nothing while requests pile up; colocation keeps every GPU busy. - Mechanism: session-affinity pins heavy multi-turn sessions onto single producers (producer hot-pinning, same pathology as sticky routing in the colocated §3.3 study); fewer producers -> worse concentration -> the monotonic decline. Failed transfers also pin producer KV (kv_load_failure_policy=fail), compounding to deadlock. Verdict: neither ratio tuning nor routing policy rescues static PD-disagg for this agentic workload — the failure is structural. mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-29 00:25:10 +08:00
Gahow Wang	5b26c345f4	P2: all routing policies read real state via eff_ accessors + ablation harness InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens} = max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac. Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1 toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config ES0-vs-ES1 to test whether real state changes policy performance/ranking. All unit-tested without GPU. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:21:12 +08:00
Gahow Wang	be948d32b8	P2: real engine-state feed replaces stale shadow counters for migration targeting vLLM scheduler publishes real state (running/waiting, KV free, and the max-in-progress-prefill signal /metrics lacks) to a tmpfs/redis store ~20Hz; router reads it and avoids GIL-stall (mid-large-prefill) + KV-capacity-wall targets, using real load over 30s-stale shadow counters. Components: engine_state.py (canonical+reader), instrument_engine_state.py (scheduler patch, file/redis writer), migration_target.py (scorer), proxy wiring (--engine-state-uri, off=unchanged). All unit-tested without GPU; not yet run live. See P2_ENGINE_STATE.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 20:01:26 +08:00
Gahow Wang	19191940e6	A/B x migration matrix runner (parameterized run_v3_trace.sh + wrapper)	2026-05-28 19:23:16 +08:00
Gahow Wang	63387f614d	Full v3 trace re-profile with layer-wise: matched migrations improve. 1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s, scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99 -5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms layer-wise removes the transfer half of migration overhead but not the control-plane/queue residual. DESIGN.md updated with results. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 19:16:37 +08:00
Gahow Wang	21db2affb4	Trace runner (run_v3_trace.sh) + concurrent mb7 correctness test	2026-05-28 17:28:48 +08:00
Gahow Wang	e705bb33b6	Proxy write-mode: concurrent prefill+decode dispatch for v3 (EAR_WRITE_MODE=1)	2026-05-28 17:22:18 +08:00
Gahow Wang	4242bba034	Chunk-safe + concurrent layer-wise connector (per-step incremental shipping) Scheduler tracks per-producer block_ids (accumulated from scheduler_output) and emits per-step LWSendMeta with cumulative computed_tokens. Worker lw_wait_for_save records a CUDA event per step and enqueues progress; the sender-loop ship loop drains it, shipping only computed+dst-wanted+unshipped blocks in order (correct under chunked prefill). Per-transfer state = concurrent-safe. Keeps v1 single-transfer version as reference. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 17:15:54 +08:00
Gahow Wang	4cd71b6631	Working-set figure: extend left panel to ~50 nodes Include T=600s/1800s points so the diminishing-returns tail is visible: 14 -> 52 nodes buys only +6pp APC (74%->79.8%), still under the 80.4% ceiling that oracle/LRU reaches at 14 nodes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 17:11:12 +08:00
Gahow Wang	2247d1de08	Working-set figure: right panel = W(t) time series Replace the (redundant) nodes-vs-T cost curve with the working-set W(t) over wall-clock time for T=2/30/300s. Shows footprint is steady (peak ~ median) after a short warm-up, so peak-based sizing is sound; the 300s curve hugs the 14-node ceiling throughout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:31:26 +08:00
Gahow Wang	e77bdcac5a	Layerwise under load: overlap benefit survives (bg=16) mb7 with background decode load (8/instance). Critical-path transfer overhead stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at 32k), prefill not slowed, KV correct. Confirms the overlap holds on busy instances. DESIGN.md updated with idle-vs-load table + the two blockers (chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:30:14 +08:00
Gahow Wang	c94b2e237a	Working-set figure: linear node axes + benefit/cost split Drop log node axis (decade ticks were unreadable). Left = APC vs #nodes (linear), right = #nodes vs retention window T. Mark the 1-node budget crossing (~7s reuse, ~8% APC) and the 14-node oracle ceiling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:24:15 +08:00
Gahow Wang	3b8be5bb61	Working-set figure: express footprint in node count, not GB Both axes now in "# nodes" (footprint / per-node KV pool) so the cluster-size implication is direct: 1-node budget line + 14-node oracle ceiling, instead of raw GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:16:00 +08:00
Gahow Wang	dae98c6472	Working-set sizing tool + GLM-5.1-FP8/B300 result Configurable KV working-set analyzer (GPU model x TP/PP/EP x model config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T), oracle [first,last], and retain-forever footprints vs a per-replica KV pool, plus the APC captured at each retention window. GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool): live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs ~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:03:25 +08:00
Gahow Wang	fec50fa45d	Layerwise KV transfer on Mooncake: PoC + microbench (worktree exploration) Implements per-layer KV push during prefill (write mode) on vLLM's MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench (mb7) shows correctness (KV lands, cached==prompt) and that the transfer is hidden behind prefill compute: critical-path overhead drops from O(KV size) (123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled, single concurrent transfer — see DESIGN.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 15:34:43 +08:00
Gahow Wang	2e6a369046	PD_DISAGG_RESULTS §5.1: D-pool pressure crashes consumers Document the consumer EngineCore crash chain (D-pool 97% -> 112k-token KV transfer fails -> negative prompt-token counter -> prometheus ValueError -> engine dead -> cliff failure). Explains the round-robin 6P+2D rep variance (100/56/80%) as intermittent consumer death, and notes the counter-clamp patch needed to compare routing arms fairly. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:02:21 +08:00
Gahow Wang	3957c2df86	MB5 patch: clamp PD-consumer metrics counter underflow Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%, rep3 80%, session-routing 6.6%): not load-shedding, but a consumer EngineCore crash. Failure chain observed in the consumer logs: 1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story) 2. a large request's KV transfer fails: "Mooncake transfer engine returned -1" (112k-token request, pool full) 3. scheduler fails the request (kv_load_failure_policy=fail) 4. PromptTokenStats.local_cache_hit = num_cached + recomputed - num_external_computed goes NEGATIVE (external transfer exceeded cached count) 5. loggers.record() calls Counter.inc(negative) -> prometheus raises "Counters can only be incremented by non-negative amounts." 6. EngineCore dies -> every subsequent request fails (the cliff: all successes in the first ~110s, zero after) This turns ONE failed request into a total config collapse, and is what made the round-robin 6P+2D reps look randomly variable. Fix: clamp the three per-source prompt-token counts to >= 0 in loggers.record() before they hit Counter.inc(). Pure insertion, revertible via the existing sentinel mechanism. Lets a transfer failure stay a single failed request instead of killing the engine, so routing arms can be compared on equal footing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 13:01:23 +08:00
Gahow Wang	8596135680	MB5 analysis: per-role KV split proves static-partition mismatch aggregate_mb5.py: - Split the cluster KV timeline by role (P-pool vs D-pool) using a PID->role map parsed from vllm_logs filenames. The cluster average hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool is actually pegged at ~100% while prefill idles at ~30%. - Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced (matplotlib) renders locally. matplotlib import is now lazy. - New plot_role_split figure + p/d peak/steady columns in the CSV. PD_DISAGG_RESULTS.md: consolidated writeup with figures inline. Verdict: no static P:D ratio beats 8C colocation. The binding constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D, P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner, the MB1 phase-isolation benefit is real) but loses TTFT and sheds load. Round-robin P routing also zeroes prefix-cache reuse; a session-affinity re-run of 6P+2D is in flight to test the fix. Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization, mb5_latency_compare + mb5_summary.csv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 12:05:17 +08:00
Gahow Wang	e8980ce957	MB5 proxy: session-affinity P routing (MB5_P_ROUTING=session) The upstream mooncake_connector_proxy round-robins both P and D selection. For agentic multi-turn sessions this destroys prefix-cache reuse on the producer side — every turn of a session lands on a different P, so the prefix-cache hit ratio collapses to 0 (observed in the 6P+2D round-robin baseline) and every turn re-prefills from scratch, piling extra load on the P pool. Add an env-gated routing mode so the same proxy serves both arms of a clean A/B: MB5_P_ROUTING=rr round-robin (default, = upstream behavior) MB5_P_ROUTING=session consistent md5 hash on X-Session-Id -> same producer for all turns of a session Decode side stays round-robin (load balance) in both modes — decode KV is freshly transferred per turn, so D gains nothing from affinity but everything from even load spreading. mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the active mode. Default path is byte-for-byte the old behavior, so an in-flight round-robin sweep is unaffected if this is redeployed. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 11:05:25 +08:00
Gahow Wang	b13ca10d19	PD_DISAGG_INVESTIGATION: snapshot Phase 0 done + sweep in flight Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher, per-instance + cross-config plotters) is fully assembled and smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps on w600) is running on dash1. Also realigned the figure list with what `aggregate_mb5.py` actually produces (mb5_kv_timeline, mb5_peak_utilization, mb5_latency_compare, mb5_summary.csv). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:51:28 +08:00
Gahow Wang	a66f24d242	MB5 aggregate: cross-config KV-pool + latency comparison Reads sweep root + tag, for each (config, rep): - merges per-PID snapshots into cluster-wide KV timeline (carry-forward for PIDs without a sample in the bin) - computes peak (max) and steady-state (10-90% median) pool utilization - pulls latency p50/p90/p99 from replay_metrics.summary.json Produces 4 outputs in --out-dir: - mb5_kv_timeline.png — N-panel cluster KV % over time, one panel per config, faint per-rep lines + bold median - mb5_peak_utilization.png — bar chart (peak vs steady) with ±std error bars - mb5_latency_compare.png — bar chart p50/p90/p99 e2e latency per config - mb5_summary.csv — flat per-(config, rep) table for the writeup Validated on 4P+4D × 20-req smoke: 4P+4D rep1: peak=12.8% steady=10.7% peak_wait=1 p50=1.3s p90=10.5s p99=17.1s (vs. <1s for 8C — expected gap). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:49:21 +08:00
Gahow Wang	a9c7310f4a	MB5 PD-disagg pipeline: working end-to-end Three independent bugs were blocking PD-disagg smoke; each fix is isolated so the next PD experiment doesn't re-hit them. 1. mb5_launch.sh - stop_all() also kills mb5_pd_proxy.py (our vendored copy), not just the upstream filename, and asserts ports 8000-8007 + PROXY_PORT are free before launching — stale proxies were silently passing the readiness check. - Proxy readiness uses a generic "any HTTP response" probe; mooncake_connector_proxy only exposes /v1/completions so /v1/models 404 is expected. 2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it) - Force min_tokens=1 on the prefill leg. Clients that set min_tokens == max_tokens (our replayer does) collide with vLLM's min_tokens<=max_tokens check after the proxy caps max_tokens=1. 3. instrument_kv_snapshot.py - Adds a second patch target: initialize MooncakeConnectorWorker.bootstrap_server = None in __init__. vLLM 0.18.1 only sets it under the is_kv_producer branch, so kv_consumer hits AttributeError as soon as the first remote prefill request lands. - apply/revert refactored to iterate over (path, patches) pairs. plot_kv_pool_timeline.py also handles snapshot files that never captured a running request (would otherwise IndexError on an empty stackplot input). Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s, 8 PIDs all writing snapshots (601 total), well above the 8C baseline. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 00:14:22 +08:00
Gahow Wang	e0d3b5150a	MB5 driver fixes: bash env-prefix + replayer flag names + python date math Two bugs caught by 8C smoke: mb5_launch.sh ${env_bp_arg} expanded as a literal command line prefix doesn't work when env_bp_arg is itself a variable — bash only treats VAR=val as an env assignment if it sees the literal in the parsed command, not after expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as a literal, defaulting to 9999 when caller passed no port (consumer mode ignores the var so the placeholder is harmless). mb5_run.sh replayer's actual CLI flags are --trace / --output / --endpoint / --model, not the ---path / ---name variants I had. Plus dash1 has no `bc`; compute wall_clock_s via python instead. Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs end-to-end in ~30 s: - 8 vLLM kv_both instances on GPU 0-7 come up - replayer round-robins 20 reqs across them - MB5 instrumentation captures 8 snapshot files (one per EngineCore PID), ranging 7-139 snapshots each = ~10 Hz throttle works - plot_kv_pool_timeline.py renders the stacked-area + queue-depth chart cleanly (figs/mb5_smoke/*.png) Pipeline validated. Ready for the real PD-ratio sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 23:23:23 +08:00
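
Illustrative sketch of the LPWL rule from commit d9046322c6 (score = pending_prefill_tokens + max(0, input - cache_hit_here), pure argmin with a (num_requests, round-robin) tie-break). This is not the repository's code: `Instance`, `lpwl_score`, `lpwl_pick`, the `cache_hits` mapping, and the `rr_offset` cursor are hypothetical names chosen for the example, and the round-robin tie-break is modelled as a rotating index offset, which is an assumption about how that tie-break might work.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical per-instance view; field names follow the commit message.
    name: str
    pending_prefill_tokens: int  # prefill tokens already queued on this instance
    num_requests: int            # in-flight requests, used only as a tie-break


def lpwl_score(inst: Instance, input_tokens: int, cache_hit_tokens: int) -> int:
    """Prefill work left if the request lands on `inst`."""
    # Only the uncached part of the prompt costs new prefill work.
    return inst.pending_prefill_tokens + max(0, input_tokens - cache_hit_tokens)


def lpwl_pick(instances, input_tokens, cache_hits, rr_offset=0):
    """Argmin over the score; ties go to fewer requests, then round-robin order.

    `cache_hits` maps instance name -> tokens of this prompt already cached there.
    `rr_offset` is an assumed rotating cursor the caller advances per request.
    """
    n = len(instances)

    def key(i):
        inst = instances[i]
        hit = cache_hits.get(inst.name, 0)
        return (lpwl_score(inst, input_tokens, hit), inst.num_requests, (i - rr_offset) % n)

    return instances[min(range(n), key=key)]


if __name__ == "__main__":
    fleet = [Instance("a", 0, 1), Instance("b", 1000, 0)]
    # "b" has queued work but owns most of the prefix, so it still wins.
    print(lpwl_pick(fleet, input_tokens=5000, cache_hits={"b": 4500}).name)
```

Under these assumptions, an idle-but-decoding instance with no cached prefix scores exactly `input`, matching the commit's point that its marginal cost is the full prompt rather than 0.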

241 Commits
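
Illustrative sketch of the two replayer dispatch modes described in commit 8a6b22c11c: `tracets` fires turn-k at max(prev_finished, trace_ts), while `thinktime` fires turn-1 at trace arrival and turn-k at prev_finished + time_to_parent_chat. The function name `dispatch_time`, its signature, and the use of seconds as the unit are assumptions made for the example; the replayer's actual implementation is not shown in the log.

```python
def dispatch_time(mode, turn, trace_ts, prev_finished=None, time_to_parent_chat=0.0):
    """Return when a turn should be sent (all times in seconds, assumed).

    mode: "tracets" or "thinktime", mirroring --dispatch-mode.
    turn: 1-based turn index within the session.
    trace_ts: absolute timestamp of this turn in the trace.
    prev_finished: completion time of the previous turn (None for turn 1).
    time_to_parent_chat: recorded gap between the parent turn finishing
        and this turn being issued.
    """
    if turn == 1 or prev_finished is None:
        # First turn: both modes use the trace arrival time.
        return trace_ts
    if mode == "tracets":
        # When the system is behind, prev_finished > trace_ts, so the turn
        # fires immediately and the inter-turn gap collapses to ~0.
        return max(prev_finished, trace_ts)
    if mode == "thinktime":
        # Preserve the recorded gap regardless of how late the system is.
        return prev_finished + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode!r}")


if __name__ == "__main__":
    # System is behind: the trace wanted turn 2 at t=10, but turn 1 finished at t=30.
    for mode in ("tracets", "thinktime"):
        print(mode, dispatch_time(mode, 2, trace_ts=10.0, prev_finished=30.0,
                                  time_to_parent_chat=5.0))
```

In the lagging case above, `tracets` dispatches at the moment the previous turn finishes, which is the burst behaviour the commit describes, whereas `thinktime` keeps the 5-second gap.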