agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	a0db3cbe77	Add leastwork_kappa decode-aware ablation (net-negative, documented) --policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok / HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 + kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax on a new prefill. Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%, E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted. Decode is too cheap in agentic (output p50~80) for the term to help; it just bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not decode interference. Kept in-tree as a documented ablation justifying LPWL's omission of any decode term; do not revive without a decode-heavy regime. See analysis/lpwl_5policy_600s.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 17:07:23 +08:00
Gahow Wang	160c29133d	Unified bench report: mean+TPS+per-worker GPU util, auto-captured scripts/bench_report.py is now the canonical analyzer: per run + per input- class it emits TTFT/TPOT/E2E mean+p50+p90+p99, decode/prefill TPS (aggregate and per-worker), APC, per-worker GPU util mean/max, and load-spread ratios. b3_isolated_policy.sh auto-captures the inputs for every run: gpu_util.csv (via gpu_monitor.sh, 5s, replay-window only) + bench_config.json (worker->GPU map); teardown stops the sampler. Future runs populate per-worker GPU util automatically. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:22 +08:00
Gahow Wang	d9046322c6	Add parameter-free LPWL routing policy (--policy leastwork) Least-Prefill-Work-Left: score = pending_prefill_tokens + max(0, input - cache_hit_here), pure argmin with (num_requests, round-robin) tie-break. Zero hyperparameters — derived from the agentic pattern: decode is cheap (I/O ~217x) so outstanding prefill-token-work is the only load worth modelling. Dropping LMetric's x num_requests factor (a) un-swallows the cache signal so affinity emerges with no gate, and (b) makes an idle-but- decoding host score `input` (its true marginal cost) instead of 0, removing the empty-batch degeneracy. Stick-vs-spill crossover is computed from real token-work, replacing overload_factor + cache_ratio gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:10 +08:00
Gahow Wang	67fcec7933	Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the LMetric fallback score) and a v3 anti-hotspot recent-migration penalty (effective_load = num_req + recent-migration count over a sliding window), preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by ~20%. Runners/analyzer for the b3 trace replay included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:52:44 +08:00
Gahow Wang	4cd71b6631	Working-set figure: extend left panel to ~50 nodes Include T=600s/1800s points so the diminishing-returns tail is visible: 14 -> 52 nodes buys only +6pp APC (74%->79.8%), still under the 80.4% ceiling that oracle/LRU reaches at 14 nodes. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 17:11:12 +08:00
Gahow Wang	2247d1de08	Working-set figure: right panel = W(t) time series Replace the (redundant) nodes-vs-T cost curve with the working-set W(t) over wall-clock time for T=2/30/300s. Shows footprint is steady (peak ~ median) after a short warm-up, so peak-based sizing is sound; the 300s curve hugs the 14-node ceiling throughout. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:31:26 +08:00
Gahow Wang	c94b2e237a	Working-set figure: linear node axes + benefit/cost split Drop log node axis (decade ticks were unreadable). Left = APC vs #nodes (linear), right = #nodes vs retention window T. Mark the 1-node budget crossing (~7s reuse, ~8% APC) and the 14-node oracle ceiling. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:24:15 +08:00
Gahow Wang	3b8be5bb61	Working-set figure: express footprint in node count, not GB Both axes now in "# nodes" (footprint / per-node KV pool) so the cluster-size implication is direct: 1-node budget line + 14-node oracle ceiling, instead of raw GB. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:16:00 +08:00
Gahow Wang	dae98c6472	Working-set sizing tool + GLM-5.1-FP8/B300 result Configurable KV working-set analyzer (GPU model x TP/PP/EP x model config.json with MLA/GQA auto x KV/weight dtype). Computes Denning W(T), oracle [first,last], and retain-forever footprints vs a per-replica KV pool, plus the APC captured at each retention window. GLM-5.1-FP8 (MLA, 43.9 KiB/token) on 1x B300 node (1528 GB KV pool): live KV fits trivially (~533 GB), but the full 80.4% APC ceiling needs ~14 nodes (oracle) -> long-tail reuse motivates DRAM offload, not HBM. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-28 16:03:25 +08:00
Gahow Wang	f739f7d461	Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy scripts/b3_isolated_policy.sh: Recognize unified_v3 as a kv_both-requiring policy; respect explicit KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both can run against either Mooncake or Nixl back-end). When Nixl is selected, skip the bootstrap-ports plumbing — Nixl uses its own UCX side-channel and the proxy forwards kv_transfer_params from the src response body instead of pre-baking engine_id/bootstrap_addr. scripts/cache_aware_proxy.py: - New unified_v3 policy (~250 lines): prefill stays on session-affinity host (preserves intra-session prefix-cache reuse), decode is migrated to a lower-load target when the affinity host is busy with concurrent decodes. KV transfer flows prefill_host → decode_target, opposite of v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy, v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity, v3_prefer_cache_target. cache_miss_audit found rotation hurts cross- turn locality (9.5% hit with vs ~80% without) so default v3_rotate_affinity=False. - New connector_type setting ("mooncake" \| "nixl") gating the PD-sep handshake form: mooncake uses pre-baked kv_transfer_params, nixl forwards them from the response body. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:05:19 +08:00
Gahow Wang	876d09db83	Add chatbot T_external CDF; overlay on f3a vs agentic User-requested comparison of inter-turn external gap distribution between the production agentic trace (Qwen3-Coder) and a production chatbot trace (qwen3-max chat). Both computed as T_external = next_turn.start_ms - prev_turn.end_ms on the same kind of pipeline (raw input + raw output join on request_id, session structure from the formatted trace's parent_chat_id chains). The chatbot trace lives as two files on dash0: input : bailian-trace/qwen-trace-260321-260327/qwen3-max-input-032309-032311.jsonl output : bailian-trace/qwen-trace-260321-260327/qwen3-max-output-032109-032711.jsonl The raw input has no session_id (uuid is per-record, user_id has only 4 distinct tenant values for 346 k requests). We recover session structure from the formatted file (qwen_chat_blksz_64_032309-032311.jsonl, which groups requests by parent_chat_id), matching each formatted record to a raw record by (timestamp, output_length) — prompt_token_num is anonymized to 0 in this trace, so we use generate_token_num as the join key. End time is derived from time_to_finish_token (ms duration) not the "time" string field (which is the log-write time, not request completion). Numbers (chatbot, 42 228 inter-turn gaps over 32 262 multi-turn sessions): p25 4.85 s p50 7.18 s p75 8.22 s p90 15.0 s p99 43 s 4% gaps < 1 s 29% < 5 s 78% < 10 s 98% < 30 s Compare to agentic (same metric, scripts/compute_inter_turn_gap_remote.py): p25 0.69 s p50 1.6 s p75 8.6 s p90 44 s p99 738 s 39% gaps < 1 s 67% < 5 s 77% < 10 s 87% < 30 s Distributions differ in shape, not just location: - Chatbot is tight, unimodal around 5–10 s (human interaction). - Agentic is bimodal: a sub-second autonomous tool-call mode (39 % < 1 s) plus a long-pause tail (13 % > 30 s, p99 = 738 s) for sessions where the operator steps away. - The sub-second tool-call mass is where dispatch coupling lives — those turns have W_turn ≫ T_external for any current scheduler. 
The earlier "chatbot has T_human ≈ 30 s" hand-wave was wrong empirically. The right framing for §2.3 is "agentic has a sub-second tool-call mode that chatbot doesn't", not "chatbot has think-time and agentic doesn't". Adds: - scripts/compute_inter_turn_gap_chatbot.py: dash0-side aggregator (raw input/output join + formatted alignment by ts + output_length) - analysis/characterization/data/chatbot_inter_turn_gap.json: CDF cache - scripts/plot_inter_turn_gap.py: overlays both curves on log-x Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 14:49:44 +08:00
Gahow Wang	03d8c5d0d1	Render 4 per-policy figures on b3_replay_20260527_0114 into figs/v2/ User-provided fresh run with five policies (lmetric, load_only, sticky, unified, plus a new unified_v2 variant). Reproduces the v1 set under figs/v2/ so we can A/B the same panels: f4a_apc_loss.png: APC bars per policy; f4c_per_worker_ttft.png: per-worker TTFT p90 panel per policy; f6_e2e_latency_bars.png: TTFT/TPOT/E2E p90 bars per policy; f6_e2e_latency_full_grid: mean/p50/p90/p99 × TTFT/TPOT/E2E grid. scripts/render_b3_figures_v2.py is a standalone driver that reads each policy's metrics.summary.json and breakdown.json directly from the run directory. The breakdown.json `routed_to` field is required to recover per-worker assignment because the new setup routes every request through a proxy (127.0.0.1:9300), so metrics.jsonl's endpoint_url no longer identifies the backend. Headline numbers, new vs v1: APC: v2: lmetric 57.2% / load_only 53.9% / sticky 77.7% / unified 78.7% / unified_v2 78.4%; v1: lmetric 56.9% / load_only 54.1% / sticky 77.2% / unified 79.4%. TTFT p90 (s): v2: lmetric 14.8 / load_only 20.1 / sticky 14.8 / unified 8.8 / unified_v2 10.1; v1: lmetric 15.7 / load_only 20.2 / sticky 18.0 / unified 7.3. E2E p90 (s): v2: lmetric 25.4 / load_only 33.9 / sticky 30.3 / unified 20.0 / unified_v2 24.1; v1: lmetric 24.8 / load_only 33.5 / sticky 34.6 / unified 18.0. Worker p90 (s, median/max): v2: lmetric 13.3/30.4 · load_only 21.3/29.2 · sticky 13.5/33.0 · unified 10.0/35.1 · unified_v2 8.6/34.2; v1: lmetric 13.9/31.3 · load_only 19.4/25.1 · sticky 20.3/55.4 · unified 10.3/37.7. Story is unchanged: unified dominates at p90 across TTFT/E2E and on median-worker latency; unified_v2 is competitive at p50 but slightly worse than unified at p90. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 13:52:17 +08:00
Gahow Wang	41232f49d3	Measure inter-turn T_external on the raw production trace; add f3a CDF The earlier conversation suggested agentic might "have no human think-time" and therefore live in a strict closed-loop regime. The user pushed back: tool calls also take time and might restore a chatbot-like buffer between turns. To resolve this, we go to the actual data. The previously-published per-record formatted trace only carries arrival timestamps, so an arrival-to-arrival diff conflates W_turn + T_external. The raw trace (/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/ 051315-051317-raw.jsonl on dash0) additionally carries request_end_time_ms, which lets us compute the pure inter-turn external gap T_external = next.request_ready_time_ms - prev.request_end_time_ms for each session's consecutive turn pair. Headline numbers (n = 783 k inter-turn gaps over 127 k multi-turn sessions): p25 = 0.69 s p50 = 1.6 s p75 = 8.6 s p90 = 44 s mean = 37 s (heavy long-tail; paused/abandoned sessions) 39 % of gaps < 1 s 67 % of gaps < 5 s 87 % of gaps < 30 s The bulk of the distribution is dominated by sub-second to a-few-seconds tool-call latencies. Under any current scheduler (e.g. unified TTFT p90 = 7.3 s, lmetric 15.7 s), W_turn is already at or above the 75th percentile of T_external, so dispatch coupling is the dominant regime for the majority of turns — not a corner case. This corrects the earlier conflated arrival-to-arrival "median gap 11 s" figure (which folded W_turn into T_external). The true T_external median is 1.6 s. 
Adds: - scripts/compute_inter_turn_gap_remote.py: dash0-side aggregator - analysis/characterization/data/agentic_inter_turn_gap.json: 500-point CDF cache + summary stats, scp'd back from dash0 - scripts/plot_inter_turn_gap.py: local figure renderer - figs/f3a_inter_turn_gap.png: log-x CDF with p25/p50/p75/p90 anchors and unified/lmetric TTFT p90 reference lines Next step (per user): pull a chatbot trace through the same pipeline and compare distributions side by side; this will let §2.3 stop hand-waving about "no think-time" and instead present the regime split empirically. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 12:37:32 +08:00
Gahow Wang	74e0c2157a	Add solo production-trace CDF figure (f2b_session_skew_prod.png) Single-curve variant of f2b — production trace only, no replay overlay and no uniform reference. Cleaner for boss-meeting/talk slides where the extra context is noise. The combined three-curve figure is unchanged. scripts/plot_session_skew_cdf.py: split into plot_combined + plot_production_solo helpers; one run emits both PNGs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:53:30 +08:00
Gahow Wang	1220da249c	f2b: regenerate CDF from production trace (1.3M sessions on dash0) Pulls 456 (rank%, cum%) sample points from the raw production trace at dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl, cached locally so the figure is reproducible without ssh access. Sampled anchors match the precomputed summary exactly: top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6% plus newly readable points: top 25% = 87.5%, top 50% = 96.0% Workload characterization is now consistent with the production distribution rather than the small replay subset. Replay window CDF kept as an overlay to show the same hockey-stick shape on the data §5 actually uses. - analysis/characterization/data/production_session_skew_cdf.json: cached sample points (29 KB), so the figure rebuilds locally - scripts/plot_session_skew_cdf.py: now plots from the cache + replay raw - MEETING.md / PAPER_OUTLINE.md: revert numbers to production trace, add top-25%/50% data points Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:41:53 +08:00
Gahow Wang	22c4aa58e4	f2b: replace top-1/5/10% bars with full CDF; align all docs to replay-trace numbers The previous f2b_session_skew.png was a 3-bar chart (top 1/5/10%) computed from the production trace summary (which is not present locally, only its precomputed JSON). The new figure is a continuous CDF of cumulative input-token mass vs session rank percentile, generated directly from the replay trace traces/w600_r0.0015_st30.jsonl so any percentile is readable. Headline numbers update accordingly: replay trace (n=274 sessions): top 1% = 24.3%, top 5% = 61.9%, top 10% = 75.8% production trace (n=1.3M): top 1% = 46.5%, top 5% = 66.5%, top 10% = 74.6% Both show extreme skew well above the y=x uniform reference; the replay trace is less extreme at top-1% because n=274 makes that bucket only ~3 sessions. We standardize §2/§3 narrative on the replay-trace numbers so motivation matches §5 evaluation; production numbers kept as a side note for context. - scripts/plot_session_skew_cdf.py: reproducible figure generator - MEETING.md / PAPER_OUTLINE.md: update narrative + caption Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 10:37:22 +08:00
Gahow Wang	3fdcec9c0f	Fix review P2s: lockfile, model path convention, trap robustness - Regenerate uv.lock after adding fastapi/uvicorn deps so uv sync --locked no longer fails - B3 scripts: default MODEL to $HOME/models/... matching documented convention and other launch scripts (repo has no models/ directory) - launch_elastic_p2p: append \|\| true to each trap command so set -e doesn't abort cleanup when jobs -p is empty and EngineCore orphans remain	2026-05-26 16:05:43 +08:00
Gahow Wang	645b067dd4	Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps Critical: - cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never decremented) and never managed d_inst.num_requests; fix media_type from application/json to text/event-stream for SSE stream High: - b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/.. - b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic generation from BASE_PORT and N_INSTANCES Medium: - analyze_breakdown: warn on stderr when records are skipped (was silent) - deploy_vllm_patches: fail-fast on SSH/SCP errors instead of continuing with empty VENV_SITE - pyproject.toml: declare fastapi and uvicorn as runtime dependencies - launch_elastic_p2p: kill EngineCore and proxy in trap handler to prevent GPU memory leaks on exit	2026-05-26 15:54:55 +08:00
Gahow Wang	0eb49dcc34	Fix NIXL multi-instance port conflict: per-instance SIDE_CHANNEL_PORT NIXL's _nixl_handshake_listener (vllm/distributed/kv_transfer/ kv_connector/v1/nixl_connector.py:700) binds a ZMQ ROUTER socket on the side_channel_port, which defaults to 5600. When 8 NIXL vLLMs launch concurrently on the same host all 8 race for tcp://localhost:5600; exactly one succeeds and the others silently hang in the listener thread with: zmq.error.ZMQError: Address already in use (addr='tcp://localhost:5600') The engines themselves never reach "Application startup complete" and the b3_isolated_policy.sh health-check times out. First observed when 7 of 8 inst_X.log files contained the ZMQ error and the 8th (by random ordering) was the one healthy instance. Fix: set VLLM_NIXL_SIDE_CHANNEL_PORT=$((5600 + i)) per instance in the NIXL launch branch. Each engine now gets a distinct handshake port (5600..5607 by default). Verified: all 8 instances now reach "Application startup complete" within the 360 s health budget. This is NIXL-specific; Mooncake uses VLLM_MOONCAKE_BOOTSTRAP_PORT which we were already varying per instance. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 15:09:16 +08:00
Gahow Wang	151bf33541	Add unified_nixl_both policy: NIXL connector isolation control Adds a NIXL-backed counterpart to unified_kv_both so we can attribute the kv_both substrate overhead measured in the elastic_migration_v2 section to either Mooncake-specific code or a generic v1-connector cost shared by all connectors. - scripts/cache_aware_proxy.py: register --policy unified_nixl_both. Picker is identical to unified (and unified_kv_both); routing decisions never go through the PD-sep branch. Differs only at the vLLM launch layer. - scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var (Mooncake\|Nixl), auto-set based on POLICY. NIXL launch path uses --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels). - Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s (180s -> 360s). Empirically NIXL needs ~100-150s per instance to initialize the UCX agent and register KV cache memory; 8 concurrent NIXL launches frequently overshoot the previous 180s budget. Mooncake is unaffected (still finishes well inside the new budget). The 8-vLLM unified_nixl_both first launch tripped the old timeout despite 7/8 instances reaching startup-complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 14:57:54 +08:00
Gahow Wang	95c8ef853c	Fix proxy shadow drift: actively reconcile against vLLM /metrics The proxy maintains shadow counters (num_requests, ongoing_tokens, pending_prefill_tokens, ongoing_decode_tokens) used by every routing picker. They are incremented in _handle_local_request and decremented in the generator's finally block. When the StreamingResponse generator never enters (client disconnect between proxy returning the response and Starlette starting iteration, or Starlette failing before iteration), the decrement never fires and the counter stays elevated forever. Over a multi-hour run the shadow accumulates "phantom" load on the affected instances and biases the router away from them. Concrete observation that prompted the fix: during the unified_kv_both B3 run, engine_0 sat at proxy num_requests=1 / ongoing_decode_tokens=80406 while vLLM's own /metrics reported num_running=0 num_waiting=0 and the GPU sat at 0% utilization. Every routing decision after that point believed engine_0 was busy with an 80k-token decode that did not exist. Fix: extend _reconcile_loop to actively poll each instance's /metrics every 30 s. If the proxy's num_requests has been higher than vLLM's (running + waiting) for two consecutive cycles (~60 s of stable drift), reduce the shadow to vLLM's truth. When vLLM is fully idle (running=0, waiting=0), zero ongoing_tokens, ongoing_decode_tokens, and pending_prefill_tokens as well. Two-cycle persistence avoids correcting transient mismatches where the proxy has just incremented for a new request that vLLM has not scheduled yet. A single ~30 s blip is not large enough to corrupt routing decisions; only persistent drift gets corrected. The previous _reconcile_loop only clamped negatives. Phantom positives are now caught and logged ("[reconcile] {url}: phantom drift ..."). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 11:29:02 +08:00
Gahow Wang	4b833d33b7	unified_v2.1: relax gates + add unified_kv_both isolation control v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The gates were too conservative; the v2-vs-v1 latency gap (TTFT p90 7.35 -> 8.96 s) is therefore probably attributable to kv_both always-on overhead, not to the PD-sep mechanism itself. v2.1 has two fixes plus an isolation control. Bug fix: - The "chosen has live decodes worth protecting" gate combined num_requests and ongoing_decode_tokens with AND, falling through when EITHER was small. Under agentic workloads each worker rarely stacks more than 1-2 concurrent requests, so the gate killed 84% of v2.0 candidates that reached it. Replace with a pure ongoing_decode_tokens == 0 check ("chosen_no_active_decode") — same semantic, much higher recall. Threshold relaxation (B2 microbench is the calibration source): - pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already at 8k, TTFT idx 12x — strictly worth migrating) - pd_sep_min_decodes_protected: 2 -> 1 - pd_sep_min_src_cache_tokens: 8000 -> 4000 - pd_sep_min_extra_cache_tokens: 4000 -> 2000 Isolation control: - New --policy unified_kv_both option. Uses the exact same picker as --policy unified but the vLLMs are launched in kv_role=kv_both (the same launch mode unified_v2 requires). PD-sep never fires. Compares against unified_v2 to attribute any v2 effect to the PD-sep branch alone, not the kv_both always-on overhead. - Both unified_kv_both and unified_v2 auto-enable kv_both launch in b3_isolated_policy.sh. Tests: - Updated the existing "chosen has no decodes" test for the new gate name and semantic. - All 24 proxy tests pass. Refs: window_1_results/v2_breakdown analysis (88.7% of candidates caught by old new_local_below_threshold; 84% of the remainder caught by the old few_decodes gate). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 10:40:57 +08:00
Gahow Wang	19f69a9d2e	unified_v2: selective per-request PD-sep via Mooncake (E3+E4) Adds a sixth routing policy --policy unified_v2 that wraps the existing unified hybrid picker with a selective PD-sep branch. When all of the following hold, a request is split prefill-on-src, decode-on-chosen via Mooncake kv_role=kv_both transfer: 1. new_local = input_length - chosen.cache_hit > 16k (B2 microbench shows same-worker TTFT idx >= 3x from this size up) 2. chosen has live decodes worth protecting (>= 2 in-flight) 3. some other instance holds materially more cache for this prefix (>= 8k tokens, and >= 4k more than chosen) 4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference) The cost model is the audit-blessed shape from E1's post-mortem: - gate on new_tokens (post-cache), NOT input_length (the old PUSH gate) - bind to a single transfer mechanism (kv_both peer-to-peer pull) - realistic RDMA cost as a function of bytes: 0.3s base + bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50) - both source and target decode counts considered E2 mechanism-level patches not yet applied (this commit is policy-only). Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request xfer timeout, 60s default) is implemented on the proxy side as an httpx per-chunk read timeout on the dst streaming call, so a stuck KV transfer fails the request instead of hanging for 600s. 
cache_aware_proxy.py: - Settings: kv_bytes_per_token, prefill_throughput_kv_both, rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs - estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s - estimate_same_worker_interference_s(new_tokens, num_decodes) reads off the B2 penalty curve in 4 bins - pick_instance_unified_v2: inherits unified, returns extra (src_inst, src_idx) tuple when PD-sep wins the cost compare - _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True, max_tokens=1), Mooncake xfer, decode-stream on dst with httpx Timeout(read=pd_sep_xfer_timeout_s) - --policy unified_v2 added to argparse choices - lifespan auto-runs init_prefill_bootstrap when policy is unified_v2 b3_isolated_policy.sh: - ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and --bootstrap-ports to the proxy Tests: 8 new unit tests cover the gating predicates and the cost estimators; all 32 proxy tests still pass. Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:25:45 +08:00
Gahow Wang	0e82612100	Fix B3 analysis bugs from subagent audit (median + percentile + sweep) Three fixes from the B3 audit: 1) joined_analysis.hotspot_index used sorted[n//2] as median, which returns the ~60th percentile for n=8 (even-length). Systematically under-states the hotspot index. Recomputed values: lmetric 2.238 -> 2.253 (+0.7%) load_only 1.140 -> 1.294 (+13.5%) sticky 2.349 -> 2.728 (+16.1%) unified 3.350 -> 3.667 (+9.5%) capped 1.937 -> 2.020 (+4.3%) Qualitative ranking preserved; "capped only modestly reduces hotspot" story holds with ~10% drop instead of the previously reported 13%. Added test_hotspot_index_uses_true_median_for_even_n to lock in the fix. 2) b3_analyze.sh's pct() helper used floor-indexed percentile sorted[int(p*(n-1))], inconsistent with metrics._percentile and joined_analysis._percentile which both use linear interpolation. Now matches. 3) b3_sweep.sh's capped step called run_policy "capped", but the proxy's argparse has no "capped" choice, so the hot-sweep variant would have crashed on this step. The actual capped data was produced via b3_isolated_policy.sh with --policy lmetric. Replace the broken inline call with an explicit launch_proxy lmetric + inline replayer block so the sweep script matches the data path it documents. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:08:37 +08:00
Gahow Wang	b7902061d1	Window 1 analysis: APC upper bound, B2 window-overlap, figure renderer Three CPU-only analysis pieces that turn raw Window 1 artifacts into publishable numbers and figures. scripts/compute_apc_upper_bound.py Block-level trie walk over hash_ids to compute the theoretical APC ceiling on a trace, decomposed into intra-session / any-session / shared-prefix-only. Gives a fixed reference for what each routing policy could possibly achieve. w600 result: 79.6% intra-session, 80.3% any-session, 0.1% shared-prefix. analysis/characterization/b2_sweep_analysis.py (rewrite) Previous version used joined_analysis.interference_index() which labeled overlap = "any prefill in any other request during this decode". With short-prompt decode load this is always true (everyone's prefill overlaps everyone else's decode); n_overlap was 239/240 even in the different-worker control. New version labels overlap iff the decode's [t_first_token, t_finish] intersects an actual large injection window, computed from the cell's "prefill"-tagged metric rows. Different-worker control now cleanly sits at idx ≈ 1.0, same-worker scales monotonically. analysis/characterization/render_window1_figures.py Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling / APC vs hotspot scatter / per-worker TTFT / failure breakdown, B2 TPOT and TTFT curves (overlap vs clean and idx), reuse decomposition, KV footprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:24:54 +08:00
Gahow Wang	b9f324f2e6	B2 interference driver: request return_token_ids + text fallback The first B2 run produced metrics with ttft_s=null/tpot_s=null for every decode request because the OpenAI-style payload did not set return_token_ids: true, and the parser only inspected choices[0].token_ids. With token_ids missing the loop skipped every chunk, so no per-token timestamps were captured and the aggregator returned interference_index=null on all 10 cells. Fix: - send return_token_ids: true in the payload (matches replayer.replay) - also accept text-delta chunks as token signals (fallback for servers that drop token_ids despite the flag) vLLM engine_state was fine; only the load-gen metric capture was broken. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 22:39:54 +08:00
Gahow Wang	df3249925b	B3 analyze: prefer per-policy engine_state over slicing shared dir The hot-sweep variant of B3 writes one shared engine_state across all policies; the isolated variant writes per-policy. Previously slice_engine_state.py was called unconditionally and would overwrite an isolated policy's real data with an empty slice (the isolated policy's run-window doesn't overlap with the shared dir's contents). Now we check the policy directory's engine_state for any non-empty engine_*.jsonl first; if present, use it directly; else slice from the shared one as before. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 22:19:43 +08:00
Gahow Wang	1d87082ca1	B3: cold-start isolated policy runner (clean APC per cell) scripts/b3_isolated_policy.sh wraps one policy run in a fresh 8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy -> replayer -> snapshot artifacts -> cleanup. Used when cross- policy APC contamination matters more than the ~25-min vLLM warmup overhead per policy. Counterpart to the existing b3_sweep.sh which keeps vLLM warm across all policies (faster but warm-cache; we found via the sticky pre-flight that contamination is < 1% on this trace, so b3_sweep.sh stays the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 20:33:44 +08:00
Gahow Wang	123a74a4b9	B3 report renderer: incremental markdown table from comparison JSON Reads b3_policy_comparison.json (produced by b3_analyze.sh) and emits a markdown report with three tables: headline latency + APC, mechanism indices (interference / hotspot / reuse), and slow-request cause breakdown. Rows for policies not yet present in the sweep are left as "pending" so the same renderer can be re-invoked as each policy finishes, producing an evolving report rather than waiting for the full sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 18:58:21 +08:00
Gahow Wang	92db1c4370	B3 post-run helpers: engine_state slicer + per-policy aggregator scripts/slice_engine_state.py filters a shared engine_*.jsonl by a [t_start_unix, t_end_unix] window. Needed because the patched scheduler appends to one file per engine across the whole sweep; per-policy analysis requires the per-policy slice. scripts/b3_analyze.sh drives the slice + joined_analysis loop for every policy directory in a completed sweep, then aggregates one row per policy (latency percentiles, APC, interference_index, hotspot_index, reuse fractions, failure-cause counts) into b3_policy_comparison.json. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 18:51:33 +08:00
Gahow Wang	e23128ad65	B2: PD-colo interference microbench harness + sweep aggregator scripts/b2_interference.py is the controlled microbench. It runs two coroutines against the open proxy bypass (direct vLLM endpoints): - decode_load: continuous short-prompt requests at fixed QPS into a designated decode instance, to keep it decode-saturated. - prefill_injections: N large one-token requests at fixed interval, pointed at either the same instance (same-worker variant) or a paired one (different-worker control). Each cell (variant × prefill_size) gets its own metrics.jsonl plus a run_window.json containing t_start_unix/t_end_unix. The shared engine_*.jsonl from the scheduler patch is sliced by that window in the aggregator. analysis/characterization/b2_sweep_analysis.py walks the cell tree, slices the per-worker step log by each cell's window, runs the A5 interference_index() against the slice, and emits a single b2_sweep_summary.json with one row per cell. This is what feeds the "interference vs uncached prefill size" figure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:51 +08:00
Gahow Wang	c6b7c3471b	B3: load_only + sticky policies, capped-trace builder, sweep driver Three additions land together because B3's whole point is comparing LMetric against meaningful controls. - scripts/cache_aware_proxy.py: two new --policy values. - load_only: pure min(num_requests) routing, no cache or affinity. The B3 control that strips locality so the LMetric-vs-load gap is legible. - sticky: first turn goes to min-load, subsequent turns ALWAYS return to the same instance, even under saturation. The B3 control that maxes out locality so the hot-spot cost is legible. - scripts/build_capped_trace.py: per-session turn cap (default 8). Generates the session-mass-equalized variant the TODO calls for so that hot-spot index can be re-measured with the heavy-tail removed. - scripts/b3_sweep.sh: orchestrates the 5-cell sweep. - GPU_INDICES makes it easy to skip a dead GPU. - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so usage.prompt_tokens_details.cached_tokens is populated. vLLM 0.18.1 omits the field by default and breaks the reuse-decomp pipeline; the smoke run surfaced this. - Trap kills EngineCore by name in addition to "vllm serve" — the parent dies first but the child holds GPU memory. Was the root cause of the 89 GB ghost on GPU 0 earlier today. - Proxy readiness is a polling loop, not a fixed sleep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:24 +08:00
Gahow Wang	5816aad731	A3: vLLM scheduler patch for step-level JSONL log When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line per scheduler step with t_unix, worker_id, prefill/decode token counts, n_running/n_waiting, preempted ids, and per-request phase labels. No-op when the env var is unset, so production engines are not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to each per-engine launch so step logs end up at engine_${i}.jsonl. Required by Batch 2 (PD-colo interference index) and Batch 5 (same-worker overlap attribution); engine /metrics polling cannot provide per-step granularity. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:11 +08:00
Gahow Wang	fe556b5d98	A2: proxy worker-state snapshot and request-id passthrough Honor incoming X-Request-Id so replayer metrics and proxy breakdown share a join key. Each route decision now captures session_id, the full per-worker candidate-score snapshot (ongoing/pending/num_requests/cached_blocks plus both linear and lmetric scores), the chosen score, and unix timestamps for first-token and done events. A separate _worker_state_log records one row per decision and is exposed via GET /worker_state; GET /worker_state/latest returns a live snapshot without recording it. Required by Batch 3 (session hot-spot proof) and Batch 5 (failure attribution); existing breakdown.json had no per-worker state at decision time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:01 +08:00
Gahow Wang	21ffb3d4f7	PD-sep matrix infrastructure: bench.sh pdsep mode + matrix driver Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5) in the PD-sep paper section. Three pieces: 1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an --eager flag to re-enable --enforce-eager for the cuda-graph ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and swaps the proxy command from --combined to --prefill/--decode. baseline and elastic flows are unchanged. 2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr; --with-eager doubles to ~5 h with the cuda-graph ablation. Skips completed runs, captures per-instance vLLM logs (needed for C3 step-level KV-utilization mining). 3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's observed 6P+2D 97% KV utilization. The marker lands on the model's predicted curve at p90 input, confirming the steady-state analysis. README updated with the run command, output layout, and the followup plotters that consume outputs/pd_matrix/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:47:33 +08:00
Gahow Wang	d71a111099	Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is net negative under agentic workloads" paper section: plot scripts for C1 (workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7 PDFs already rendered, and a README mapping candidate claims to required figures plus open re-run items. Removes --enforce-eager from bench.sh and all active launch scripts so cuda graphs are captured -- the prior methodology suppressed one of PD-sep's structural advantages (D-node fixed-shape decode). Legacy scripts under scripts/legacy/ are intentionally untouched as historical records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:24:16 +08:00
Gahow Wang	ac6534c3ff	Cleanup: retire dead PUSH path + extract hybrid picker - Delete unreachable best_needs_push block in _handle_combined and the four orphaned helpers (_handle_cached_prefill_offload, _handle_direct_read_offload, _query_bootstrap_hit, _get_bootstrap_client). Their only caller was the retired PUSH gate; see REPORT §3.9 errata for the rejected experiments (`cc6e562`, `4c583f2`). - Extract pick_instance_unified_hybrid as a pure function returning (chosen, idx, decision_dict). The decision dict carries the review #7 breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio, avg_num_requests, fallback_score, tie_break_used). - Add LMetric-fallback tie-breaker (primary score, then new_uncached, num_requests, round-robin) so new sessions don't all pin to inst 0 when BS=0 across the board. - Drop the lmetric-policy affinity write so --policy lmetric stays affinity-free per review #3. - Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio / --decode-iteration-s as [DEPRECATED] in --help; flags remain accepted so scripts/bench.sh and legacy launchers don't break. - Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already rejected this knob (within noise). Future sweeps should go via CLI. Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering affinity-hit, overload break, low-cache fallback, tie-break rotation, lmetric purity, and breakdown field shape. 19/19 pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:46:57 +08:00
Gahow Wang	255c8e6884	Hybrid routing: LMetric for LB + explicit affinity for high-cache sessions Replace the full unified cost model with a simpler hybrid: - If session has >50% cache on affinity instance AND instance not overloaded (num_requests <= avg * overload_factor) → stick to affinity - Otherwise → use LMetric (P × BS) for best load balance This combines LMetric's superior load balance with explicit session affinity for high-value sessions that have significant cache accumulation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 09:05:08 +08:00
Gahow Wang	4c583f2f1c	Revert relaxed gate + push_cost fix: 134 offloads destroyed performance PD-sep offload overhead (C queue + prefill + KV transfer + D schedule) far exceeds any load balance benefit. With relaxed gate, cost model triggered 134 offloads → E2E p90 went from 37s to 82s. The proven winning configuration is Unified routing in baseline mode (no Mooncake connector), which beats LMetric on E2E mean/p50/p90 purely through better routing (contention-aware + session affinity). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:38:59 +08:00
Gahow Wang	bf4469a150	Fix cost model: accurate push_cost + aligned hard gate 1. push_cost now models both C and D: max(c_cost, d_cost) where c_cost includes C's queue + prefill, d_cost includes D's queue + RDMA overhead. Old formula only had D's contention + RDMA. 2. Hard gate uses num_requests instead of ongoing_tokens, aligning with the contention-based cost model. 3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 01:01:03 +08:00
Gahow Wang	1d2148cf65	Remove second push_new gate that caused downgrade-to-cold-LOCAL After _push_allowed was relaxed, the cost model correctly chose push for high-cache sessions on overloaded instances. But a second gate at execution time (push_new < heavy_threshold) blocked the actual offload, downgrading to LOCAL on the target instance — which had no cache. Worse, session affinity was already updated to the target, so all subsequent turns also hit cold prefill. This was the root cause of relaxed gate's performance regression: affinity broken + push blocked = worst of both worlds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:42:31 +08:00
Gahow Wang	3ae99293fd	Relax _push_allowed: gate on request size, not cache savings The old gate blocked offload when push_new (= input - cache_hit) < 20K, which prevented migration of high-cache sessions — exactly the ones that benefit most. After PD-sep, the target receives full KV via RDMA and has the same cache as the source, so cache_hit is irrelevant to the offload decision. New gate: only check input_length >= heavy_threshold (request must be HEAVY) and max_offload_inflight (concurrency cap). Let the cost model decide whether the contention difference justifies migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:03:28 +08:00
Gahow Wang	cc6e5625bb	Revert Approach B (session migration): overhead exceeds LB benefit Reverts 3 commits: `e991960`, `5772149`, `5b1d360`. 57 migrations triggered but PD-sep overhead (C queue + KV transfer + D cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s. Migration mechanism needs fundamental rework before it can help. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 23:43:47 +08:00
Gahow Wang	5b1d36080a	Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg) The session migration path was calling _handle_cached_prefill_offload with swapped c_inst/d_inst and missing cache_hit parameter, causing TypeError on every migration attempt (13 of 41 errors in the test run). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 22:46:46 +08:00
Gahow Wang	5772149d36	Approach B v2: TTFT-based migration trigger Replace num_requests threshold with recent TTFT median as migration trigger. Track per-instance rolling TTFT (last 8 requests) and trigger migration when median > 5s (configurable). Target is the instance with lowest recent TTFT, requiring > 2x improvement to justify migration. This is more responsive than the instantaneous num_requests signal because TTFT directly measures the user-facing impact of contention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 21:54:06 +08:00
Gahow Wang	e9919605af	Approach B: session-level lazy migration trigger When a request arrives for a session on an overloaded instance, force migration if four conditions hold: 1. Instance busy: num_requests > avg * migration_request_factor (1.5x) 2. Session has cache value: cache_ratio > 50% 3. Request is HEAVY (>= heavy_threshold) 4. A meaningfully less-loaded target exists (num_requests gap > 2) This bypasses the cost model for migration decisions — the cost model's cache-inflated costs prevented migration even when instances had 150s queue times with 99% cache hit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:34:06 +08:00
Gahow Wang	e06de5144b	Approach A: contention-aware cost model with migration discount Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:24:27 +08:00
Gahow Wang	4b50c5a08d	Fix unified cost model: include decode load in queue + hard overload gate Two bugs caused elastic to concentrate load on cached instances (10x token imbalance vs 2.7x baseline): 1. _instance_cost queue only counted pending_prefill_tokens, missing ongoing_decode_tokens entirely — instances with 50 decoding requests appeared idle to the cost model. 2. Cache hits made overloaded instances look "cheap", creating a positive feedback loop: more sessions → more cache → lower cost → more routing. Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks affinity before the cost model runs, matching linear policy behavior. Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:25:02 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	657812f8c4	Add deploy_vllm_patches.sh: sync third_party/vllm patches to site-packages Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from third_party/vllm to the pip-installed vllm's site-packages. C extensions stay from the pip package; only Python files are overridden. Usage: bash scripts/deploy_vllm_patches.sh [HOST] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:59:52 +08:00
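
The hybrid rule described in commit 255c8e6884 (stick to the affinity instance when the session has >50% cache there and the instance is not overloaded, otherwise fall back to LMetric) can be sketched roughly as below. This is an illustration only, not the code in scripts/cache_aware_proxy.py. The `Worker` fields, the function names, and the reading of "P × BS" as (pending prefill tokens + new uncached tokens) × num_requests are assumptions. The tie-break is simplified to (score, num_requests, index) rather than the (score, new_uncached, num_requests, round-robin) order listed in ac6534c3ff.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # Hypothetical per-instance state; field names are illustrative.
    pending_prefill_tokens: int
    num_requests: int
    cached_tokens: int  # tokens of this session's prompt already cached here


def lmetric_score(w: Worker, input_len: int) -> int:
    # Assumed reading of "P x BS": prefill work times batch size.
    new_uncached = max(0, input_len - w.cached_tokens)
    return (w.pending_prefill_tokens + new_uncached) * w.num_requests


def pick_hybrid(workers, input_len, affinity_idx=None,
                overload_factor=2.0, cache_ratio_gate=0.5):
    """Return (chosen_index, decision_label)."""
    avg = sum(w.num_requests for w in workers) / len(workers)
    if affinity_idx is not None and input_len > 0:
        a = workers[affinity_idx]
        cache_ratio = min(a.cached_tokens, input_len) / input_len
        # Stick only if the cache is valuable AND the instance is not overloaded.
        if cache_ratio > cache_ratio_gate and a.num_requests <= avg * overload_factor:
            return affinity_idx, "affinity"
    # Fallback: argmin of the LMetric score, then fewer requests, then lower index.
    best = min(
        range(len(workers)),
        key=lambda i: (lmetric_score(workers[i], input_len),
                       workers[i].num_requests, i),
    )
    return best, "lmetric"
```

For example, `pick_hybrid([Worker(0, 2, 9000), Worker(0, 1, 0)], 10000, affinity_idx=0)` keeps the session on instance 0, while raising that instance's `num_requests` well above `avg * overload_factor` sends the request through the LMetric fallback instead.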


105 Commits