agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	ef9e0102ec	Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs, 274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
  - plain unified
  - unified + Mooncake kv_both
  - unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)
TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90: 23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
Two findings:
  1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE on current codebase: kv_both is now faster than plain at p90. Likely fixed by `e3a1d70` (RDMA-READ → bootstrap PUSH refactor) and the connector-mode delay_free_blocks extending cross-turn prefix cache hits on a 93%-intra-session-reuse trace.
  2. DR-fix removes another 22% from TTFT p90 by skipping the O(|cache|) hash sync in build_connector_meta. Cache-sweep with DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.
Adds:
  - run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
  - analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
  - REPORT_TRACE_REPLAY.md: summary + reproduction
  - results/20260526_1627_drfix/: cache-sweep with DR-fix
  - results/trace_replay_20260526_1652/: full trace-replay A/B/C
Implication for EAR paper: the kv_both substrate is no longer the bottleneck blocking session migration. The prior 4 migration reverts were dominated by transfer overhead that has now been characterized and (partially) removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 09:13:50 +08:00
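The percentage deltas quoted in the commit above are plain relative changes against a baseline p90. A minimal sketch of that arithmetic follows; it makes no claim about how analyze_trace_replay.py is actually written, and the function name `pct_delta` is hypothetical.

```python
def pct_delta(baseline: float, candidate: float) -> float:
    """Relative change of candidate vs. baseline in percent (negative = faster)."""
    return (candidate - baseline) / baseline * 100.0


# p90 values in seconds, copied from the commit message above.
ttft_plain, ttft_kv_both = 11.97, 9.74
e2e_plain, e2e_kv_both = 23.48, 21.25

print(f"TTFT p90, kv_both vs plain: {pct_delta(ttft_plain, ttft_kv_both):+.1f}%")
print(f"E2E  p90, kv_both vs plain: {pct_delta(e2e_plain, e2e_kv_both):+.1f}%")
```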
Gahow Wang	31cf8c9b11	DR-fix A/B: env-gate hash sync drops slope from +81 to -0.7 μs/1k blocks
Adds an env-gated skip for the per-step `set(cache.keys())` walk in MooncakeConnectorScheduler.build_connector_meta() that was introduced in our own commit `a7df84b` (Direct RDMA read). Re-runs the cache_sweep A/B with three configs: plain (control), mooncake_both (baseline), and mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).
Files:
  apply_direct_read_fix.py       one-line env-gate patch (markered revert)
  run_drfix.sh                   orchestrator for plain + mooncake_both + drfix
  analyze.py                     extended to compare mooncake_both_drfix vs plain and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md                findings
  results/20260526_1543_drfix/   run artifacts
Headline:
  config              | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  --------------------|----------------------|---------------------
  mooncake_both       | +81.0                | 1 550 μs
  mooncake_both_drfix | -0.7 (≈ 0)           | 95 μs
  plain (control)     | -1.8 (≈ 0)           | 72 μs
build_meta p50 @ 16.6k blocks:
  mooncake_both       = 1 459 μs
  mooncake_both_drfix = 6 μs (residual loop bookkeeping)
worker get_finished p50:
  mooncake_both       = 178 μs (unchanged; this fix doesn't touch it)
  mooncake_both_drfix = 183 μs
The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at |cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within ±50 μs across the full cache range, which is noise-level. The slope goes from +81 to essentially zero. Worker-side get_finished (180 μs constant) is unchanged because the DR-fix touches scheduler.build_connector_meta only. That's the next target if we want to bring kv_both fully back to plain-level.
Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before:       build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
Confirms: the entire O(|cache|) slope was introduced by our own direct-RDMA-read implementation (commit `a7df84b`), not upstream Mooncake.
Production fix: gate the sync on the presence of any direct_read consumer, or replace per-step diff with an incremental delta listener fed by block_pool add/remove callbacks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 00:03:23 +08:00
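The "incremental delta listener" proposed as the production fix can be sketched as follows. This is an illustration only: `CacheDeltaTracker` and its method names are hypothetical, and the sketch assumes the block pool can invoke a callback on every add and remove, which the commit message proposes but does not show. The point is that each scheduler step then pays O(|delta|) rather than walking the whole cache.

```python
class CacheDeltaTracker:
    """Hypothetical tracker: accumulates net cache changes between scheduler steps."""

    def __init__(self):
        self._added = set()
        self._removed = set()

    def on_block_added(self, block_hash):
        # An add that follows a pending remove of the same hash nets out to no change.
        if block_hash in self._removed:
            self._removed.discard(block_hash)
        else:
            self._added.add(block_hash)

    def on_block_removed(self, block_hash):
        # A remove that follows a pending add of the same hash nets out to no change.
        if block_hash in self._added:
            self._added.discard(block_hash)
        else:
            self._removed.add(block_hash)

    def drain(self):
        """Return (added, removed) since the last drain and reset; cost is O(|delta|)."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


# Usage: the result matches a full set diff without touching the whole cache.
previous = {1, 2, 3}
current = set(previous)
tracker = CacheDeltaTracker()
for op, h in [("add", 4), ("remove", 2), ("add", 2), ("remove", 3)]:
    if op == "add":
        current.add(h)
        tracker.on_block_added(h)
    else:
        current.discard(h)
        tracker.on_block_removed(h)
added, removed = tracker.drain()
```

The cancellation logic relies on the callbacks being consistent with the cache (no add for a hash that is already present, no remove for one that is absent); a real integration would need to confirm that invariant.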
Gahow Wang	8829928fc5	Cache-size sweep: build_meta is O(|cache|), +85.6 μs / 1k blocks
Follow-up to Microbench 3 that finally tests H5 (cache-size dependence) and instruments worker-side connector callbacks the original patch missed.
Patch v2 (apply_step_timing_v2.py) adds:
  scheduler: `cache_size` field in engine_step.jsonl
  worker: `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
  uses BLOCK_BEGIN/END sentinels for safe multi-line revert (the original v1 patch survives this v2's apply/revert cycle)
Driver: continuous open-loop (1.5 req/s, 4096x256 random per req) that lets APC fill from 0 → ceiling within one vLLM lifetime so a single run produces the full cache_size sweep. Decode-only steps are filtered post-hoc to remove prefill-mix variance.
Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode steps per config):
  config         | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
  ---------------|------------------------|-----------------------------
  mooncake_both  | +85.6                  | 1528 μs (build_meta=1442, 94%)
  noop_connector | -0.8 (≈0)              | 79 μs
  plain          | +1.0 (≈0)              | 84 μs
Worker-side get_finished p50/p90/p99 (μs/step):
  mooncake_both: 180 / 257 / 333
  noop_connector: 0 / 0 / 2
H5 PASSES. mooncake_both step_duration scales linearly with |cache| because build_connector_meta walks set(cache.keys()) every step (`mooncake_connector.py:434-450`). plain and noop are flat.
The previously-uninstrumented get_finished() adds a constant 180 μs/step on top: two `run_coroutine_threadsafe(...).result()` blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`) fire every step even when no transfer is pending.
Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
  build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
  On ~7 ms decode forward → +15-20% TPOT per step.
This explains most of the trace-replay +25% TPOT p90 gap from single-instance per-step cost alone, leaving a smaller residual for multi-instance coupling than originally assumed.
Two clear fixes pointed out in REPORT.md:
  1. replace O(|cache|) per-step walk with incremental delta listener using block_pool's add/remove callbacks
  2. short-circuit get_finished() when both producer/consumer queues are empty in kv_both
Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr, .vllm.pid) are .gitignored; they re-derive from `bash run_all.sh`, and SUMMARY.md / per_config.json fully capture the conclusions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 23:34:21 +08:00
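The "slope (μs / 1k blocks)" figure in the sweep above is the kind of number an ordinary least-squares fit of step duration against cache size produces. A minimal sketch, assuming each record in engine_step.jsonl carries the `cache_size` and `step_duration_us` fields named in these commits; the helper names are hypothetical, and the repository's analyze.py may bin or filter differently.

```python
import json


def load_steps(path):
    """Read one JSON object per non-empty line (JSONL)."""
    with open(path) as fh:
        return [json.loads(line) for line in fh if line.strip()]


def slope_us_per_1k_blocks(steps):
    """Least-squares slope of step_duration_us vs. cache_size, per 1000 blocks."""
    xs = [s["cache_size"] / 1000.0 for s in steps]
    ys = [float(s["step_duration_us"]) for s in steps]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cache_size never varies; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic check: a perfectly linear workload recovers the slope it was built with.
synthetic = [
    {"cache_size": c, "step_duration_us": 80.0 + 85.6 * c / 1000.0}
    for c in range(0, 17000, 1000)
]
fitted = slope_us_per_1k_blocks(synthetic)
```

On real data the decode-only filter described in the commit would be applied before fitting, since prefill-mixed steps add variance unrelated to cache size.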
Gahow Wang	54de78eb11	Connector tax RESULTS.md: errata + run-to-run variance disclosure
The prior write-up presented one specific reading of the data as the headline without flagging methodology gaps. Three corrections:
  1. The "0% low-concurrency tax" comes from a single back-to-back mooncake_both_v2/plain_v2 rerun. The original Phase A pair showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2, a 40 percentage-point swing between two consecutive runs that the original write-up did not call out. The run-to-run noise floor is too high to claim "0%" at low concurrency.
  2. get_finished() was never instrumented. The patch only times step_duration_us and build_meta_us. "100% of per-step cost is build_meta" is an upper bound on what was timed, not a true decomposition.
  3. H5 (cache-size dependence) was the central hypothesis but was never tested in the prior run; random content kept APC near empty.
The +7-9% high-concurrency (single instance, 512x64, rate=8-16) and +17% 8-instance-saturated numbers are kept; they were measured with adequate sample sizes and are reproducible.
The follow-up sweep in cache_sweep/ tests H5 directly and revises the decomposition.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 23:33:01 +08:00
Gahow Wang	e3480f7d28	8-instance connector tax: +2% at non-saturated, +17% only at saturation
8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total:
  Rate=32 (non-saturated, thr=0.95-0.97): plain TTFT p90=64ms, mooncake_both=65ms → +2% (noise)
  Rate=64 (non-saturated, thr=0.96): plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise)
  Rate=128 (saturated, thr=0.70-0.71):
    plain TTFT p90=702ms, mooncake_both=822ms → +17%
    plain TTFT p50=339ms, mooncake_both=470ms → +39%
Conclusion: The elastic_migration_v2 +45% is a saturation artifact. Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's 1.4ms/step build_connector_meta overhead is completely masked by the scheduler-model async pipeline. The tax only manifests when the system is already saturated and queueing amplifies per-step differences.
For practical deployment: enabling kv_role=kv_both has effectively zero cost as long as the serving system stays within SLO capacity bounds.	2026-05-26 21:32:46 +08:00
Gahow Wang	c8ec73c548	Connector tax: high-concurrency confirms +7-9% tax, resolves trace-replay gap
High-concurrency test (512 input, 64 output, rates 4-32 req/s):
  Rate=8: plain TTFT p90=94ms, mooncake_both=102ms → +9% tax
  Rate=16: plain TTFT p90=144ms, mooncake_both=156ms → +8% tax
  Rate=32: both saturated at ~6.1s → no distinguishable difference
Low-concurrency back-to-back retest (4096 input, 256 output):
  mooncake_both_v2 vs plain_v2: tax is ≈0% (within noise) because scheduler's 1.4ms/step is hidden behind model forward.
Decomposition of trace-replay's +45%:
  +7-9% from build_connector_meta per-step cost (this microbench)
  +20-30% from multi-instance coupling amplification (not measurable here)
  remainder from large-cache O(|cache|) scaling (Phase B follow-up)
Also: bench_loop.py now emits mean/p50/p90/p99 for all three metrics.	2026-05-26 21:00:25 +08:00
Gahow Wang	a473c71cac	Connector tax Phase A: build_connector_meta is 1.4ms/step (the tax source)
Per-step timing from engine_step.jsonl definitively resolves H3:
  plain: 53 μs/step (p50)
  noop_connector: 69 μs/step (+16 μs = negligible framework cost)
  mooncake_producer: 1461 μs/step (build_connector_meta = 1386 μs)
  mooncake_both: 1452 μs/step (same as producer)
The substrate tax is NOT in the v1 framework; it's specifically in Mooncake's build_connector_meta() which walks set(cache.keys()) every scheduler step (O(|cache|) per step, E2 audit §6.5).
Accumulated per-request tax: 256 decode steps × 1.4ms = 358ms. Observed TTFT tax at rate=1.0: plain 378ms vs mooncake_both 422ms (+12%). At rate=2.0 (near saturation): +29%, approaching trace-replay's +45%.
Also fixes kill_vllm() to properly kill EngineCore subprocesses.	2026-05-26 19:33:15 +08:00
Gahow Wang	297fed6e73	Microbench 3 (connector_tax): infrastructure for KV connector substrate tax
Validates the elastic_migration_v2 finding that kv_role=kv_both adds TTFT p90 +45% even when PD-sep never fires. Replicates under single-instance, synthetic, open-loop workload to disambiguate mechanism cost from 8-instance feedback amplification.
Configurations (8): plain, noop_connector, mooncake_{producer,consumer,both}, nixl_both, lmcache_only, multi_mooncake_lmcache. Pre-flight verification gates risky configs (kv_consumer needs dummy bootstrap, multi-connector composition, NoOp custom class loading).
Workload: two-phase sweep
  Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
  Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})
Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit with step_duration_us and build_meta_us, which directly measures per-step substrate cost, not just user-visible TTFT/TPOT.
run_all.sh runs as 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot
Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures. Estimated runtime: 4-5.5 hours on idle dash0 H20.	2026-05-26 17:27:41 +08:00
Gahow Wang	06dd175441	Microbench 1 plots: prefill-decode interference heatmap + lines
plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps, cold prefill prompts) and produces:
  fig_interference_heatmap.png
    TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k.
  fig_interference_lines.png
    (a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed
    (b) Cold prefill TTFT vs P (interference window length)
Confirms B2 finding: cold prefill on the same worker stalls overlapping decodes for 14-214x baseline TPOT. The interference window grows linearly with P (from ~140ms at 2k to ~4.6s at 32k) and is essentially independent of decode batch size, because prefill compute time dominates.	2026-05-26 14:21:30 +08:00
Gahow Wang	72790ae6c1	PD-sep server-side profiling: vLLM patches + per-request breakdown
Instrumentation patches (microbench/patches/):
  - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
  - apply_patches.py: idempotent patch installer for mooncake_connector.py and scheduler.py, marks insertions with # PD_PROFILE_PATCH
  - analyze_events.py: joins per-process JSONL event logs by transfer_id into per-request phase durations
Seven events captured per request:
  D_get_num_matched → P_zmq_received → P_prefill_done → P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
Driver fix (microbench/lifecycle/driver.py): seed_prefix_cache now sends via the proxy URL so P and D both cache the seeded prefix with matching block hashes. Previously seeding D directly produced different block hashes than the proxy-routed measurement requests, making incremental transfer impossible.
Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute 620 ms median (95% of overhead)
  rdma_transfer 42 ms median (~71 Gbps effective)
  other overhead 10 ms median (dispatch + params + signal + promote)
Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.	2026-05-26 13:59:09 +08:00
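Joining per-process event logs by transfer_id, as analyze_events.py is described as doing, can be sketched like this. The record layout (`transfer_id`, `event`, `ts` keys) and the function name are assumptions made for illustration; only the seven event names come from the commit message. The sketch also assumes timestamps from the P and D processes share a comparable clock.

```python
from collections import defaultdict

# Event names in lifecycle order, as listed in the commit message.
EVENT_ORDER = [
    "D_get_num_matched", "P_zmq_received", "P_prefill_done",
    "P_rdma_start", "P_rdma_end", "D_recv_complete", "D_request_promoted",
]


def phase_durations(events):
    """Group events by transfer_id and return the gaps between consecutive events."""
    stamps_by_id = defaultdict(dict)
    for ev in events:
        stamps_by_id[ev["transfer_id"]][ev["event"]] = ev["ts"]

    durations = {}
    for transfer_id, stamps in stamps_by_id.items():
        if any(name not in stamps for name in EVENT_ORDER):
            continue  # skip requests whose lifecycle is incomplete
        durations[transfer_id] = {
            f"{a}->{b}": stamps[b] - stamps[a]
            for a, b in zip(EVENT_ORDER, EVENT_ORDER[1:])
        }
    return durations


# Usage with made-up timestamps in milliseconds (not measured data).
fake_ts = [0, 1, 621, 622, 664, 665, 670]
events = [
    {"transfer_id": "t1", "event": name, "ts": ts}
    for name, ts in zip(EVENT_ORDER, fake_ts)
]
events.append({"transfer_id": "t2", "event": "D_get_num_matched", "ts": 5})
result = phase_durations(events)
```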
Gahow Wang	f784e49c07	Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:
  1. Interference (corrected): cold prefill causes 14-214x TPOT p90 degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}). Earlier run had a prefix-cache bug (deterministic prompts hit cache after rep 0); fixed with uuid+time_ns unique prompts.
  2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy, measuring prefill→RDMA→decode startup overhead.
Key finding: offload wins at all P≥2048 operating points; transfer cost is 25-50% of interference cost even with bulk Mooncake.	2026-05-26 00:57:06 +08:00
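The prefix-cache bug fix mentioned above (uuid+time_ns unique prompts) works because prefix caching matches on the leading portion of a prompt, so a nonce at the very start keeps every repetition cold. A minimal sketch; the function name and prompt layout are hypothetical and not taken from the repository's driver.

```python
import time
import uuid


def make_unique_prompt(body: str) -> str:
    """Prepend a per-call nonce so no two prompts share a leading prefix."""
    # uuid4 alone is already unique; time_ns mirrors the uuid+time_ns scheme
    # described in the commit message.
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    return f"{nonce} {body}"


# Usage: identical bodies still produce prompts that differ from the first characters.
first = make_unique_prompt("Summarize the following document.")
second = make_unique_prompt("Summarize the following document.")
```

Placing the nonce at the end instead would leave the shared leading text cacheable, which is the failure mode the commit describes.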


61 Commits