agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	a0db3cbe77	Add leastwork_kappa decode-aware ablation (net-negative, documented) --policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok / HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 + kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax on a new prefill. Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%, E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted. Decode is too cheap in agentic (output p50~80) for the term to help; it just bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not decode interference. Kept in-tree as a documented ablation justifying LPWL's omission of any decode term; do not revive without a decode-heavy regime. See analysis/lpwl_5policy_600s.md. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 17:07:23 +08:00
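A minimal sketch of the kappa score above, assuming the default is derived as KV bytes per token divided by HBM bandwidth and then by TPOT (this reading reproduces the stated 2.5e-6). The function names here are hypothetical and are not taken from the repository.

```python
# Constants quoted in the commit message (H20 + Qwen3-30B-A3B).
KV_BYTES_PER_TOKEN = 100e3   # ~100 KB of KV cache per token
HBM_BYTES_PER_S = 4e12       # ~4 TB/s HBM bandwidth
TPOT_S = 10e-3               # ~10 ms time per output token


def derive_kappa(kv_bytes=KV_BYTES_PER_TOKEN, hbm_bw=HBM_BYTES_PER_S, tpot=TPOT_S):
    """Assumed derivation: fraction of one decode step spent reading one token's KV."""
    return kv_bytes / hbm_bw / tpot


def leastwork_kappa_score(prefill_work, ongoing_decode_tokens, kappa=2.5e-6):
    """score = prefill_work * (1 + kappa * ongoing_decode_tokens), as in the commit."""
    return prefill_work * (1.0 + kappa * ongoing_decode_tokens)
```

With these numbers a host carrying 80,000 ongoing decode tokens pays a 20% tax on a new prefill, which illustrates how small the term is when outputs are short.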
Gahow Wang	d9046322c6	Add parameter-free LPWL routing policy (--policy leastwork) Least-Prefill-Work-Left: score = pending_prefill_tokens + max(0, input - cache_hit_here), pure argmin with (num_requests, round-robin) tie-break. Zero hyperparameters, derived from the agentic pattern: decode is cheap (I/O ~217x) so outstanding prefill-token-work is the only load worth modelling. Dropping LMetric's x num_requests factor (a) un-swallows the cache signal so affinity emerges with no gate, and (b) makes an idle-but-decoding host score `input` (its true marginal cost) instead of 0, removing the empty-batch degeneracy. Stick-vs-spill crossover is computed from real token-work, replacing overload_factor + cache_ratio gate. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 16:08:10 +08:00
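The LPWL score and tie-break described above can be sketched as follows. This is an illustration under stated assumptions, not the proxy's code: the `Instance` dataclass, `lpwl_score`, `pick_leastwork`, and the `rr_start` rotation offset are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical per-instance state; field names follow the commit text.
    pending_prefill_tokens: int
    num_requests: int
    cache_hit: int  # tokens of this request's prefix already cached on this instance


def lpwl_score(inst, input_len):
    """Outstanding prefill work plus the uncached part of the new request."""
    return inst.pending_prefill_tokens + max(0, input_len - inst.cache_hit)


def pick_leastwork(instances, input_len, rr_start=0):
    """Pure argmin; ties broken by num_requests, then by round-robin position."""
    n = len(instances)

    def key(i):
        inst = instances[i]
        return (lpwl_score(inst, input_len), inst.num_requests, (i - rr_start) % n)

    return min(range(n), key=key)
```

Note that an instance with no pending prefill and no cache scores `input_len` rather than 0, regardless of how many requests it is decoding, which is the behaviour the commit attributes to dropping the num_requests factor.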
Gahow Wang	67fcec7933	Unified-routing A+B ablation: decode-aware LMetric + v3 anti-hotspot cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the LMetric fallback score) and a v3 anti-hotspot recent-migration penalty (effective_load = num_req + recent-migration count over a sliding window), preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by ~20%. Runners/analyzer for the b3 trace replay included. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 11:52:44 +08:00
Gahow Wang	f739f7d461	Proxy/runner support for Nixl connector + unified_v3 (offload-decode) policy scripts/b3_isolated_policy.sh: Recognize unified_v3 as a kv_both-requiring policy; respect explicit KV_CONNECTOR=Nixl override (so unified_v2 / unified_v3 / unified_kv_both can run against either Mooncake or Nixl back-end). When Nixl is selected, skip the bootstrap-ports plumbing: Nixl uses its own UCX side-channel and the proxy forwards kv_transfer_params from the src response body instead of pre-baking engine_id/bootstrap_addr. scripts/cache_aware_proxy.py: - New unified_v3 policy (~250 lines): prefill stays on session-affinity host (preserves intra-session prefix-cache reuse), decode is migrated to a lower-load target when the affinity host is busy with concurrent decodes. KV transfer flows prefill_host → decode_target, opposite of v2. Knobs: v3_min_new_tokens, v3_min_prefill_decode_busy, v3_target_load_ratio, v3_min_load_gap, v3_rotate_affinity, v3_prefer_cache_target. cache_miss_audit found rotation hurts cross-turn locality (9.5% hit with rotation vs ~80% without) so default v3_rotate_affinity=False. - New connector_type setting ("mooncake" | "nixl") gating the PD-sep handshake form: mooncake uses pre-baked kv_transfer_params, nixl forwards them from the response body. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 22:05:19 +08:00
Gahow Wang	645b067dd4	Fix review bugs: PD-sep counter leaks, hardcoded paths, missing deps Critical: - cache_aware_proxy: _handle_pd_sep leaked p_inst.num_requests (never decremented) and never managed d_inst.num_requests; fix media_type from application/json to text/event-stream for SSE stream High: - b3_sweep/b3_isolated_policy/b3_analyze: replace hardcoded /home/admin/cpfs/wjh/ ROOT with script-relative $(dirname "$0")/.. - b3_analyze: replace hardcoded 8-port WORKER_MAP with dynamic generation from BASE_PORT and N_INSTANCES Medium: - analyze_breakdown: warn on stderr when records are skipped (was silent) - deploy_vllm_patches: fail-fast on SSH/SCP errors instead of continuing with empty VENV_SITE - pyproject.toml: declare fastapi and uvicorn as runtime dependencies - launch_elastic_p2p: kill EngineCore and proxy in trap handler to prevent GPU memory leaks on exit	2026-05-26 15:54:55 +08:00
Gahow Wang	151bf33541	Add unified_nixl_both policy: NIXL connector isolation control Adds a NIXL-backed counterpart to unified_kv_both so we can attribute the kv_both substrate overhead measured in the elastic_migration_v2 section to either Mooncake-specific code or a generic v1-connector cost shared by all connectors. - scripts/cache_aware_proxy.py: register --policy unified_nixl_both. Picker is identical to unified (and unified_kv_both); routing decisions never go through the PD-sep branch. Differs only at the vLLM launch layer. - scripts/b3_isolated_policy.sh: new KV_CONNECTOR env var (Mooncake|Nixl), auto-set based on POLICY. NIXL launch path uses --kv-transfer-config '{"kv_connector":"NixlConnector","kv_role":"kv_both"}' with no VLLM_MOONCAKE_BOOTSTRAP_PORT (NIXL uses UCX side-channels). - Health-check timeout: 90 iterations * 2s -> 180 iterations * 2s (180s -> 360s). Empirically NIXL needs ~100-150s per instance to initialize the UCX agent and register KV cache memory; 8 concurrent NIXL launches frequently overshoot the previous 180s budget. Mooncake is unaffected (still finishes well inside the new budget). The 8-vLLM unified_nixl_both first launch tripped the old timeout despite 7/8 instances reaching startup-complete. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 14:57:54 +08:00
Gahow Wang	95c8ef853c	Fix proxy shadow drift: actively reconcile against vLLM /metrics The proxy maintains shadow counters (num_requests, ongoing_tokens, pending_prefill_tokens, ongoing_decode_tokens) used by every routing picker. They are incremented in _handle_local_request and decremented in the generator's finally block. When the StreamingResponse generator never enters (client disconnect between proxy returning the response and Starlette starting iteration, or Starlette failing before iteration), the decrement never fires and the counter stays elevated forever. Over a multi-hour run the shadow accumulates "phantom" load on the affected instances and biases the router away from them. Concrete observation that prompted the fix: during the unified_kv_both B3 run, engine_0 sat at proxy num_requests=1 / ongoing_decode_tokens=80406 while vLLM's own /metrics reported num_running=0 num_waiting=0 and the GPU sat at 0% utilization. Every routing decision after that point believed engine_0 was busy with an 80k-token decode that did not exist. Fix: extend _reconcile_loop to actively poll each instance's /metrics every 30 s. If the proxy's num_requests has been higher than vLLM's (running + waiting) for two consecutive cycles (~60 s of stable drift), reduce the shadow to vLLM's truth. When vLLM is fully idle (running=0, waiting=0), zero ongoing_tokens, ongoing_decode_tokens, and pending_prefill_tokens as well. Two-cycle persistence avoids correcting transient mismatches where the proxy has just incremented for a new request that vLLM has not scheduled yet. A single ~30 s blip is not large enough to corrupt routing decisions; only persistent drift gets corrected. The previous _reconcile_loop only clamped negatives. Phantom positives are now caught and logged ("[reconcile] {url}: phantom drift ..."). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 11:29:02 +08:00
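The two-cycle persistence rule above can be sketched as a pure function over one instance's shadow state. This assumes the caller has already parsed running and waiting counts from vLLM's /metrics; the `Shadow` dataclass, the `drift_cycles` field, and `reconcile_once` are hypothetical names, not the proxy's actual ones.

```python
from dataclasses import dataclass


@dataclass
class Shadow:
    num_requests: int = 0
    ongoing_tokens: int = 0
    ongoing_decode_tokens: int = 0
    pending_prefill_tokens: int = 0
    drift_cycles: int = 0  # hypothetical: consecutive polls where shadow > vLLM truth


def reconcile_once(shadow, running, waiting, persist_cycles=2):
    """Apply one poll result; return True if the shadow was corrected."""
    truth = running + waiting
    if shadow.num_requests <= truth:
        shadow.drift_cycles = 0          # no phantom load, reset the streak
        return False
    shadow.drift_cycles += 1
    if shadow.drift_cycles < persist_cycles:
        return False                     # possibly transient, wait another cycle
    shadow.num_requests = truth          # persistent drift: adopt vLLM's count
    if truth == 0:                       # fully idle: clear the token counters too
        shadow.ongoing_tokens = 0
        shadow.ongoing_decode_tokens = 0
        shadow.pending_prefill_tokens = 0
    shadow.drift_cycles = 0
    return True
```

Called every 30 s, this corrects a phantom request only after roughly 60 s of stable mismatch, so a request the proxy has just counted but vLLM has not yet scheduled is left alone.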
Gahow Wang	4b833d33b7	unified_v2.1: relax gates + add unified_kv_both isolation control v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The gates were too conservative; the v2-vs-v1 latency gap (TTFT p90 7.35 -> 8.96 s) is therefore probably attributable to kv_both always-on overhead, not to the PD-sep mechanism itself. v2.1 has two fixes plus an isolation control. Bug fix: - The "chosen has live decodes worth protecting" gate combined num_requests and ongoing_decode_tokens with AND, falling through when EITHER was small. Under agentic workloads each worker rarely stacks more than 1-2 concurrent requests, so the gate killed 84% of v2.0 candidates that reached it. Replace with a pure ongoing_decode_tokens == 0 check ("chosen_no_active_decode") — same semantic, much higher recall. Threshold relaxation (B2 microbench is the calibration source): - pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already at 8k, TTFT idx 12x — strictly worth migrating) - pd_sep_min_decodes_protected: 2 -> 1 - pd_sep_min_src_cache_tokens: 8000 -> 4000 - pd_sep_min_extra_cache_tokens: 4000 -> 2000 Isolation control: - New --policy unified_kv_both option. Uses the exact same picker as --policy unified but the vLLMs are launched in kv_role=kv_both (the same launch mode unified_v2 requires). PD-sep never fires. Compares against unified_v2 to attribute any v2 effect to the PD-sep branch alone, not the kv_both always-on overhead. - Both unified_kv_both and unified_v2 auto-enable kv_both launch in b3_isolated_policy.sh. Tests: - Updated the existing "chosen has no decodes" test for the new gate name and semantic. - All 24 proxy tests pass. Refs: window_1_results/v2_breakdown analysis (88.7% of candidates caught by old new_local_below_threshold; 84% of the remainder caught by the old few_decodes gate). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 10:40:57 +08:00
Gahow Wang	19f69a9d2e	unified_v2: selective per-request PD-sep via Mooncake (E3+E4) Adds a sixth routing policy --policy unified_v2 that wraps the existing unified hybrid picker with a selective PD-sep branch. When all of the following hold, a request is split prefill-on-src, decode-on-chosen via Mooncake kv_role=kv_both transfer: 1. new_local = input_length - chosen.cache_hit > 16k (B2 microbench shows same-worker TTFT idx >= 3x from this size up) 2. chosen has live decodes worth protecting (>= 2 in-flight) 3. some other instance holds materially more cache for this prefix (>= 8k tokens, and >= 4k more than chosen) 4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference) The cost model is the audit-blessed shape from E1's post-mortem: - gate on new_tokens (post-cache), NOT input_length (the old PUSH gate) - bind to a single transfer mechanism (kv_both peer-to-peer pull) - realistic RDMA cost as a function of bytes: 0.3s base + bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50) - both source and target decode counts considered E2 mechanism-level patches not yet applied (this commit is policy-only). Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request xfer timeout, 60s default) is implemented on the proxy side as an httpx per-chunk read timeout on the dst streaming call, so a stuck KV transfer fails the request instead of hanging for 600s. cache_aware_proxy.py: - Settings: kv_bytes_per_token, prefill_throughput_kv_both, rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs - estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s - estimate_same_worker_interference_s(new_tokens, num_decodes) reads off the B2 penalty curve in 4 bins - pick_instance_unified_v2: inherits unified, returns extra (src_inst, src_idx) tuple when PD-sep wins the cost compare - _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True, max_tokens=1), Mooncake xfer, decode-stream on dst with httpx Timeout(read=pd_sep_xfer_timeout_s) - --policy unified_v2 added to argparse choices - lifespan auto-runs init_prefill_bootstrap when policy is unified_v2 b3_isolated_policy.sh: - ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and --bootstrap-ports to the proxy Tests: 8 new unit tests cover the gating predicates and the cost estimators; all 32 proxy tests still pass. Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:25:45 +08:00
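A minimal sketch of the byte-dependent transfer cost and the final cost comparison from the unified_v2 commit. It assumes GB means 1e9 bytes and that the interference estimates are supplied by the caller; `pd_sep_wins` and the parameter names are hypothetical, and the B2 penalty curve itself is not reproduced here.

```python
RDMA_BASE_OVERHEAD_S = 0.3      # fixed per-transfer overhead from the commit
RDMA_EFFECTIVE_GB_PER_S = 2.7   # effective bandwidth from the commit


def estimate_transfer_cost(num_bytes,
                           base_s=RDMA_BASE_OVERHEAD_S,
                           gb_per_s=RDMA_EFFECTIVE_GB_PER_S):
    """Seconds to move num_bytes of KV: base overhead plus bytes over bandwidth."""
    return base_s + num_bytes / (gb_per_s * 1e9)  # assumption: 1 GB = 1e9 bytes


def pd_sep_wins(src_interference_s, xfer_s, chosen_interference_s, margin_s=0.2):
    """Condition 4: splitting must beat staying on the chosen worker by the margin."""
    return src_interference_s + xfer_s + margin_s < chosen_interference_s
```

For example, a 2.7 GB transfer is estimated at about 1.3 s, so PD-sep only pays off when the chosen worker's interference penalty exceeds the source's by more than that plus the 0.2 s margin.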
Gahow Wang	c6b7c3471b	B3: load_only + sticky policies, capped-trace builder, sweep driver Three additions land together because B3's whole point is comparing LMetric against meaningful controls. - scripts/cache_aware_proxy.py: two new --policy values. - load_only: pure min(num_requests) routing, no cache or affinity. The B3 control that strips locality so the LMetric-vs-load gap is legible. - sticky: first turn goes to min-load, subsequent turns ALWAYS return to the same instance, even under saturation. The B3 control that maxes out locality so the hot-spot cost is legible. - scripts/build_capped_trace.py: per-session turn cap (default 8). Generates the session-mass-equalized variant the TODO calls for so that hot-spot index can be re-measured with the heavy-tail removed. - scripts/b3_sweep.sh: orchestrates the 5-cell sweep. - GPU_INDICES makes it easy to skip a dead GPU. - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so usage.prompt_tokens_details.cached_tokens is populated. vLLM 0.18.1 omits the field by default and breaks the reuse-decomp pipeline; the smoke run surfaced this. - Trap kills EngineCore by name in addition to "vllm serve" — the parent dies first but the child holds GPU memory. Was the root cause of the 89 GB ghost on GPU 0 earlier today. - Proxy readiness is a polling loop, not a fixed sleep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:24 +08:00
Gahow Wang	fe556b5d98	A2: proxy worker-state snapshot and request-id passthrough Honor incoming X-Request-Id so replayer metrics and proxy breakdown share a join key. Each route decision now captures session_id, the full per-worker candidate-score snapshot (ongoing/pending/num_requests/cached_blocks plus both linear and lmetric scores), the chosen score, and unix timestamps for first-token and done events. A separate _worker_state_log records one row per decision and is exposed via GET /worker_state; GET /worker_state/latest returns a live snapshot without recording it. Required by Batch 3 (session hot-spot proof) and Batch 5 (failure attribution); existing breakdown.json had no per-worker state at decision time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:01 +08:00
Gahow Wang	ac6534c3ff	Cleanup: retire dead PUSH path + extract hybrid picker - Delete unreachable best_needs_push block in _handle_combined and the four orphaned helpers (_handle_cached_prefill_offload, _handle_direct_read_offload, _query_bootstrap_hit, _get_bootstrap_client). Their only caller was the retired PUSH gate; see REPORT §3.9 errata for the rejected experiments (`cc6e562`, `4c583f2`). - Extract pick_instance_unified_hybrid as a pure function returning (chosen, idx, decision_dict). The decision dict carries the review #7 breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio, avg_num_requests, fallback_score, tie_break_used). - Add LMetric-fallback tie-breaker (primary score, then new_uncached, num_requests, round-robin) so new sessions don't all pin to inst 0 when BS=0 across the board. - Drop the lmetric-policy affinity write so --policy lmetric stays affinity-free per review #3. - Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio / --decode-iteration-s as [DEPRECATED] in --help; flags remain accepted so scripts/bench.sh and legacy launchers don't break. - Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already rejected this knob (within noise). Future sweeps should go via CLI. Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering affinity-hit, overload break, low-cache fallback, tie-break rotation, lmetric purity, and breakdown field shape. 19/19 pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:46:57 +08:00
Gahow Wang	255c8e6884	Hybrid routing: LMetric for LB + explicit affinity for high-cache sessions Replace the full unified cost model with a simpler hybrid: - If session has >50% cache on affinity instance AND instance not overloaded (num_requests <= avg * overload_factor) → stick to affinity - Otherwise → use LMetric (P × BS) for best load balance This combines LMetric's superior load balance with explicit session affinity for high-value sessions that have significant cache accumulation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 09:05:08 +08:00
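The stick-or-fall-back gate in the hybrid commit can be sketched as below. The LMetric scorer is passed in as a callable rather than reimplemented, since its exact form is not spelled out here; `hybrid_pick` and its parameters are hypothetical names.

```python
def hybrid_pick(affinity_idx, cache_ratio, num_requests, overload_factor, fallback):
    """Return (chosen_idx, reason).

    affinity_idx:    instance the session last used, or None for a new session
    cache_ratio:     cache_hit / input_length on the affinity instance
    num_requests:    per-instance in-flight request counts
    overload_factor: multiple of the average load tolerated before breaking affinity
    fallback:        zero-argument callable returning the load-balancing choice
    """
    if affinity_idx is not None and num_requests:
        avg = sum(num_requests) / len(num_requests)
        not_overloaded = num_requests[affinity_idx] <= avg * overload_factor
        if cache_ratio > 0.5 and not_overloaded:
            return affinity_idx, "affinity"
    return fallback(), "lmetric_fallback"
```

Keeping the fallback as a parameter makes the gate testable on its own and leaves the load-balancing score free to change.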
Gahow Wang	4c583f2f1c	Revert relaxed gate + push_cost fix: 134 offloads destroyed performance PD-sep offload overhead (C queue + prefill + KV transfer + D schedule) far exceeds any load balance benefit. With relaxed gate, cost model triggered 134 offloads → E2E p90 went from 37s to 82s. The proven winning configuration is Unified routing in baseline mode (no Mooncake connector), which beats LMetric on E2E mean/p50/p90 purely through better routing (contention-aware + session affinity). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:38:59 +08:00
Gahow Wang	bf4469a150	Fix cost model: accurate push_cost + aligned hard gate 1. push_cost now models both C and D: max(c_cost, d_cost) where c_cost includes C's queue + prefill, d_cost includes D's queue + RDMA overhead. Old formula only had D's contention + RDMA. 2. Hard gate uses num_requests instead of ongoing_tokens, aligning with the contention-based cost model. 3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 01:01:03 +08:00
Gahow Wang	1d2148cf65	Remove second push_new gate that caused downgrade-to-cold-LOCAL After _push_allowed was relaxed, the cost model correctly chose push for high-cache sessions on overloaded instances. But a second gate at execution time (push_new < heavy_threshold) blocked the actual offload, downgrading to LOCAL on the target instance — which had no cache. Worse, session affinity was already updated to the target, so all subsequent turns also hit cold prefill. This was the root cause of relaxed gate's performance regression: affinity broken + push blocked = worst of both worlds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:42:31 +08:00
Gahow Wang	3ae99293fd	Relax _push_allowed: gate on request size, not cache savings The old gate blocked offload when push_new (= input - cache_hit) < 20K, which prevented migration of high-cache sessions — exactly the ones that benefit most. After PD-sep, the target receives full KV via RDMA and has the same cache as the source, so cache_hit is irrelevant to the offload decision. New gate: only check input_length >= heavy_threshold (request must be HEAVY) and max_offload_inflight (concurrency cap). Let the cost model decide whether the contention difference justifies migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:03:28 +08:00
Gahow Wang	cc6e5625bb	Revert Approach B (session migration): overhead exceeds LB benefit Reverts 3 commits: `e991960`, `5772149`, `5b1d360`. 57 migrations triggered but PD-sep overhead (C queue + KV transfer + D cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s. Migration mechanism needs fundamental rework before it can help. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 23:43:47 +08:00
Gahow Wang	5b1d36080a	Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg) The session migration path was calling _handle_cached_prefill_offload with swapped c_inst/d_inst and missing cache_hit parameter, causing TypeError on every migration attempt (13 of 41 errors in the test run). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 22:46:46 +08:00
Gahow Wang	5772149d36	Approach B v2: TTFT-based migration trigger Replace num_requests threshold with recent TTFT median as migration trigger. Track per-instance rolling TTFT (last 8 requests) and trigger migration when median > 5s (configurable). Target is the instance with lowest recent TTFT, requiring > 2x improvement to justify migration. This is more responsive than the instantaneous num_requests signal because TTFT directly measures the user-facing impact of contention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 21:54:06 +08:00
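A rough sketch of the TTFT-based trigger, using the window of 8, the 5 s threshold, and the 2x improvement requirement stated above. `TtftTracker` and `migration_target` are hypothetical names, and treating an instance with no samples as having a 0 s median is an assumption of this sketch.

```python
from collections import deque
from statistics import median


class TtftTracker:
    """Rolling window of the most recent TTFT samples for one instance."""

    def __init__(self, window=8):
        self.samples = deque(maxlen=window)

    def record(self, ttft_s):
        self.samples.append(ttft_s)

    def recent(self):
        # Assumption: an instance with no history is treated as uncontended.
        return median(self.samples) if self.samples else 0.0


def migration_target(trackers, src_idx, trigger_s=5.0, min_improvement=2.0):
    """Return the index to migrate to, or None if migration is not justified."""
    src = trackers[src_idx].recent()
    if src <= trigger_s:
        return None
    others = [i for i in range(len(trackers)) if i != src_idx]
    if not others:
        return None
    best = min(others, key=lambda i: trackers[i].recent())
    # Require the source to be more than min_improvement times slower than the target.
    return best if src > min_improvement * trackers[best].recent() else None
```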
Gahow Wang	e9919605af	Approach B: session-level lazy migration trigger When a request arrives for a session on an overloaded instance, force migration if four conditions hold: 1. Instance busy: num_requests > avg * migration_request_factor (1.5x) 2. Session has cache value: cache_ratio > 50% 3. Request is HEAVY (>= heavy_threshold) 4. A meaningfully less-loaded target exists (num_requests gap > 2) This bypasses the cost model for migration decisions, because the cost model's cache-inflated costs prevented migration even when instances had 150s queue times with 99% cache hit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:34:06 +08:00
Gahow Wang	e06de5144b	Approach A: contention-aware cost model with migration discount Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:24:27 +08:00
Gahow Wang	4b50c5a08d	Fix unified cost model: include decode load in queue + hard overload gate Two bugs caused elastic to concentrate load on cached instances (10x token imbalance vs 2.7x baseline): 1. _instance_cost queue only counted pending_prefill_tokens, missing ongoing_decode_tokens entirely — instances with 50 decoding requests appeared idle to the cost model. 2. Cache hits made overloaded instances look "cheap", creating a positive feedback loop: more sessions → more cache → lower cost → more routing. Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks affinity before the cost model runs, matching linear policy behavior. Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:25:02 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	bf76273778	Add --offload-mode switch for ablation (direct_read vs cached_prefill) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:24:15 +08:00
Gahow Wang	cdf83493ab	Fix A+C: real cache sync + cached-prefill-on-C architecture A: Add /estimate_hit endpoint to bootstrap server for real-time cache probing. Proxy queries this before committing to PUSH, eliminating 24% zero-match PUSH requests (shadow cache divergence). C: Add _handle_cached_prefill_offload: C (cache source) does fast cached prefill → KV to Mooncake → D pulls and decodes. Replaces broken direct_read PUSH where D waited for RDMA transfer while occupying KV blocks without doing compute. Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:22:38 +08:00
Gahow Wang	97f4fe5164	Fix: rename inst->chosen in generate function (NameError crash) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:55:01 +08:00
Gahow Wang	5892739159	Add session affinity as soft preference in unified routing Without affinity, all cached requests route to the same instance (cache source always has lowest prefill cost), causing 149s queue. Fix: if the session's last instance has cost <= 2x the global best, use it (preserves cache locality). Only re-route when the affinity instance is significantly more expensive (overloaded). The 2x threshold is intentionally loose — it's not a hardcoded magic number but a "prefer locality unless clearly worse" heuristic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:37:58 +08:00
Gahow Wang	6b255fad91	Unified routing: single argmin(expected_latency) over all instances Replace two-phase routing (pick_instance → offload gate) with a single cost function evaluated per instance: latency(D) = queue(D) + prefill_time(D) + transfer_cost(D) - If D has local cache: prefill = (input - local_hit) / throughput - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma - Otherwise: prefill = input / throughput (cold) Choose argmin(latency). If the winner needs PUSH → trigger migration. Removed: - WARM/MEDIUM/HEAVY classification (no routing purpose) - heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio - Interference penalty magic number (0.3) - Separate pick_instance + offload gate stages Only 2 measured parameters remain: - prefill_throughput = 7000 tokens/s (H20 measured) - rdma_overhead_s = 0.1s (RDMA PUSH measured) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:21:34 +08:00
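The single-argmin cost function above can be sketched as follows, using the two measured parameters from the commit. The `Candidate` fields, the conversion of queue depth to seconds via the prefill throughput, and the function names are assumptions of this sketch.

```python
from dataclasses import dataclass

PREFILL_THROUGHPUT = 7000.0  # tokens/s, H20 measured (from the commit)
RDMA_OVERHEAD_S = 0.1        # seconds, RDMA PUSH measured (from the commit)


@dataclass
class Candidate:
    queue_tokens: int  # assumption: outstanding work, converted to time via throughput
    local_hit: int     # prefix tokens already cached on this instance
    push_hit: int      # prefix tokens obtainable by PUSH from the cache source (0 if none)


def expected_latency(c, input_len):
    """Return (latency_s, needs_push) for the cheaper of local prefill and PUSH."""
    queue_s = c.queue_tokens / PREFILL_THROUGHPUT
    # Local (or cold, when local_hit == 0) prefill of the uncached tokens.
    options = [(queue_s + (input_len - c.local_hit) / PREFILL_THROUGHPUT, False)]
    if c.push_hit > c.local_hit:
        push_s = (input_len - c.push_hit) / PREFILL_THROUGHPUT + RDMA_OVERHEAD_S
        options.append((queue_s + push_s, True))
    return min(options)


def pick_unified(candidates, input_len):
    """argmin of expected latency; returns (index, latency_s, needs_push)."""
    scored = [(*expected_latency(c, input_len), i) for i, c in enumerate(candidates)]
    latency, needs_push, idx = min(scored)
    return idx, latency, needs_push
```

In this form a busy cache source loses to an idle instance that can receive the prefix by PUSH once its queue time exceeds the transfer overhead.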
Gahow Wang	1cf03c6e79	Cost model: add interference penalty for co-located heavy prefill Old cost model: offload_cost = colocated_cost + RDMA_overhead, so offload was always 0.1s more expensive. Result: only 19/117 HEAVY offloaded. New: colocated_cost includes interference penalty when C_s has decode requests: penalty = prefill_time × min(num_requests, 3) × 0.3. Offload now wins when C_s has 1+ concurrent request. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 23:59:06 +08:00
Gahow Wang	a7514fc3d5	Fix retry syntax: async generator can't use return, use break+try/finally Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:37:32 +08:00
Gahow Wang	daeb95eca0	RDMA overhead 2.0→0.1s (direct read is raw memory, not scheduler flow) + retry on ConnectError to handle kv_both connection instability With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:33:10 +08:00
Gahow Wang	5c66f500fc	Fix offload gate: remove cache_gate for direct RDMA read, fix cost model The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%) because they were cold (cache_ratio=0). But with direct RDMA read, D reads C's cached blocks via RDMA regardless of cache ratio — the gate was protecting against the OLD flow (C does prefill + push). Also fixed cost model: offload_cost now reflects direct read reality: OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive) NEW: D_queue + RDMA_read + D_local_prefill(new_tokens) Offload wins when C_s queue > RDMA_overhead (~2s). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:01:43 +08:00
Gahow Wang	52a54e44af	proxy: split session_affinity per mode + vLLM patch self-check (M4, S2) - Replace the global session_affinity dict with two namespace-isolated ones (combined / prefill) so a session_id never indexes the wrong instance list across mode switches. Keep `session_affinity` as a read-only alias to the combined dict for any existing tooling. - Add a startup _verify_vllm_patch() that scans vllm.v1.core.sched.scheduler.Scheduler for the original `assert req_id in self.requests` line. If the patch was not re-applied after a vLLM upgrade we now print a loud warning at lifespan startup instead of dying mid-experiment on a KV-transfer abort race.	2026-05-23 21:12:56 +08:00
Gahow Wang	c843f2e3db	proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5) - Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton. __main__ now mutates SETTINGS so CLI overrides survive even when the module is imported as a library (e.g. by tests/) (D5). - Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS. - Add --cache-gate-ratio CLI flag and a real gate before the cost-model branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and fall back to colocated. cache_ratio is no longer a write-only field (B4). - P candidate selection penalises instances already running offloaded HEAVY prefills, so back-to-back HEAVY requests don't pile onto the same P (M2). - bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the proxy. - Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload penalty.	2026-05-23 21:11:17 +08:00
Gahow Wang	a7df84bd3b	Direct RDMA read: D reads cached KV from C's GPU without C's scheduler Complete implementation of direct RDMA read for KV cache migration: vLLM Mooncake connector (mooncake_connector.py): - PullReqMeta: add direct_read flag + block_hashes - MooncakeConnectorMetadata: add hash_table_updates/removals for scheduler->worker block hash sync - MooncakeConnectorScheduler: set_block_pool() to access BlockPool, build_connector_meta() computes hash table deltas each step, update_state_after_alloc() captures request block hashes for direct_read - MooncakeConnectorWorker: _start_direct_read() + _direct_read_single() implements D-side RDMA read via batch_transfer_sync_read, with HTTP query/unpin to C's bootstrap server Bootstrap server (mooncake_utils.py): - POST /query_blocks: look up block hashes, return block_ids + GPU layout - POST /unpin_blocks: release pin tracking - set_worker_kv_info(): register GPU addresses at init - update_hash_table(): receive scheduler deltas each step Scheduler (scheduler.py): - One-line hookup: pass block_pool to connector after KVCacheManager init Proxy (cache_aware_proxy.py): - _handle_direct_read_offload: sends request ONLY to D with direct_read=True + remote_bootstrap_addr. No request to C at all. - C's scheduler is completely uninvolved (0 GPU time on C) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:02:13 +08:00
Gahow Wang	020be9f444	proxy: real LRU for cached_blocks + shadow-state reconcile loop (M1, M5) M1: cached_blocks was a plain set with a "trim half via list slicing" eviction. CPython does not guarantee set iteration order, so the trim discarded an arbitrary half of the entries — completely unlike vLLM's LRU and a known contributor to the router's cache_hit estimate diverging from real APC. Replace with an OrderedDict-backed LRU: move_to_end on hits, popitem(last=False) on overflow. Capacity exposed as CACHE_CAPACITY_BLOCKS module constant (200000 by default). M5: streamed responses decrement load counters in their generator's finally block. If a client disconnects before consuming the body the generator is never entered and the decrement is lost, causing ongoing_tokens / num_requests / pending_prefill_tokens to drift negative under load. Add a 60s background reconcile_loop that clamps those counters at zero as a safety net. Started in lifespan, cancelled on shutdown. Does not replace proper vLLM exact-state syncing.	2026-05-23 21:00:35 +08:00
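The OrderedDict-backed LRU described in M1 can be sketched in a few lines. The class and method names are hypothetical; only `move_to_end` on hits, `popitem(last=False)` on overflow, and the 200000 default capacity come from the commit text.

```python
from collections import OrderedDict


class BlockLRU:
    """Shadow of cached block hashes with least-recently-used eviction."""

    def __init__(self, capacity=200_000):
        self.capacity = capacity
        self._blocks = OrderedDict()

    def touch(self, block_hash):
        """Record a use of block_hash; return True if it was already cached."""
        if block_hash in self._blocks:
            self._blocks.move_to_end(block_hash)  # mark as most recently used
            return True
        self._blocks[block_hash] = None
        if len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)      # evict the least recently used
        return False

    def __contains__(self, block_hash):
        return block_hash in self._blocks

    def __len__(self):
        return len(self._blocks)
```

Unlike trimming a plain set by slicing, eviction order here is deterministic and depends only on access recency.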
Gahow Wang	556f3011c6	proxy: remove dead state and broken fire-and-forget path (B1, D1) B1: _inst_cumulative_tokens was written by pick_instance but never read anywhere; delete the variable, global declaration, and per-call increment. Load is already tracked via inst.ongoing_tokens. D1: _send_prefill_async + the --fire-and-forget branch were unreachable in practice (no launch/bench script enabled the flag) and broken even if exercised: D-decode would fire before P registered the transfer_id, guaranteeing a Mooncake 502. Collapse _handle_pd_sep to its synchronous path and drop the CLI flag.	2026-05-23 20:56:11 +08:00
Gahow Wang	ea5149726c	Partial remote prefill: C_s exports cache, D computes new tokens locally vLLM Mooncake patch: - get_num_new_matched_tokens: support remote_num_tokens parameter for partial remote prefill (pull N tokens from remote, compute rest locally) - update_state_after_alloc: only allocate receive blocks for external portion Proxy _handle_heavy_offload rewrite: - Step 1: C_s exports ONLY cached blocks (truncated prompt, 0 compute) - Step 2: D pulls cached blocks + does local prefill for new tokens + decodes - C_s's blocks auto-freed by Mooncake delay_free after D confirms receipt This enables true session migration: C_s releases cache, D takes over. C_s's GPU is freed immediately (no compute), vs old approach where C_s had to do full prefill (1-15s GPU occupancy). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 20:04:13 +08:00
Gahow Wang	be273f7f27	Replace static offload gate with runtime cost model Old gate: cache_ratio >= 0.3 (static, only 14% of HEAVY triggered) New gate: offload when offload_cost < colocated_cost, where: colocated_cost = queue(C_s) + prefill(new_tokens) offload_cost = queue(P_idle) + prefill(P_tokens) + RDMA_overhead Key changes: - P is now least-loaded instance (not session-sticky C_s) - Gate considers C_s queue depth dynamically - Crossover: offload wins when C_s queue >= 38k tokens (~5.4s) - Cold HEAVY requests CAN be offloaded if C_s is busy enough - P accounting uses P's actual cache hit, not C_s's Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 19:42:33 +08:00
Gahow Wang	f5e45afd4e	Fix 4 elastic PS bugs: D accounting, offload cap, cache migration, prefix sync Bug 1+5: D instance had no accounting during prefill phase (7-11s window). Router saw D as idle, routing extra traffic that caused KV allocation failures. Fix: reserve D's ongoing_tokens+num_requests at offload decision time. Bug 7: No cap on concurrent offloads despite REPORT claiming MAX_OFFLOAD=4. Fix: add MAX_OFFLOAD_INFLIGHT=4 check before offloading. Bug 6: Session affinity migrated to D but proxy cache estimator wasn't updated for D. Future turns scored D as cache-cold. Fix: call d_inst.record_prefix(token_ids) after successful decode. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 15:55:11 +08:00
Gahow Wang	3594f7dce0	Fix LMetric routing: remove session affinity, align with OSDI'26 spec LMetric was incorrectly sharing session-sticky logic with Linear policy. Fixed to pure per-request routing: score = P_tokens × BS where P = pending_prefill + (input - cache_hit), BS = num_requests. Experiment result (200 req, fresh restart): Linear vs corrected LMetric show <2% difference on all metrics — LMetric's cache-hit estimation provides implicit soft affinity that preserves locality without explicit session stickiness. Also fix bench.sh missing cd (replayer module not found from non-project cwd) and rewrite run_lmetric_ab.sh as thin wrapper around bench.sh to eliminate duplicated launch/cleanup logic that broke under set -euo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 11:56:58 +08:00
Gahow Wang	85b230455e	H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU imbalance (~3.5-4x across all settings). Root cause: imbalance is from workload skew at session placement (turn 1), not from routing at turn 2+. H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x), and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded requests have bimodal transfer times (0.6s or 18-31s) that negate the routing benefit. Updated elastic_hypotheses.md with H7 results and next directions: higher load experiments where contention amplifies routing differences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 03:04:02 +08:00
Gahow Wang	3bc37cc6d5	PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix Experiments run: - Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise) - PS V1 (cold prefill): REJECTED — PS always slower than cached C - PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s) — PS bottleneck - V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal - H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance. - H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s) Key findings: - Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead - 92% of HEAVY are turn-1 cold → offloading cold requests always loses - GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT) - RDMA transfer negates the routing benefit for offloaded requests Code changes: - bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark) - cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing - mooncake_connector.py: elif→if fix (allow dual prefill+decode flags) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 02:14:37 +08:00
Gahow Wang	fc92410ec9	Invalidate prior A/B results + add proper experiment harness Prior cross-machine comparison (commit `1e86285`) was invalid: dash0 baseline used warm instances with residual KV cache, inflating TTFT by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start requests; WARM TTFT p90=3.3s vs fresh=0.26s. Fair same-machine comparison (both fresh restart on dash0): Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200 Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200 Elastic is WORSE due to Mooncake kv_both memory overhead. Changes: - REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata - pd_separation_analysis.md: update elastic TL;DR with correct numbers - cache_aware_proxy.py: fix double-decrement bugs in offload path, add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK) - bench.sh: standardized experiment harness with guaranteed GPU cleanup and fresh-state verification (nvidia-smi check before start) - run_elastic_stability_test.sh: two-phase elastic vs baseline test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:54:21 +08:00
Gahow Wang	e4fa56cb1e	LMetric routing policy (OSDI'26) + A/B results vs linear baseline Implement LMetric (P_tokens × BS multiplication score) from "Simple is Better" (Zhang et al., OSDI'26) as alternative routing policy for combined mode. Key changes: - cache_aware_proxy.py: add --policy {linear,lmetric} flag, track pending_prefill_tokens and num_requests per instance, /stats endpoint - run_lmetric_ab.sh: automated A/B script for fair comparison Results (200 req, fresh restart, same trace): Linear: TTFT50=1.086 TPOT90=0.077 E2E50=5.423 LMetric: TTFT50=1.099 TPOT90=0.073 E2E50=5.205 Delta: TTFT +1.2% TPOT -5.9% E2E -4.0% LMetric improves TPOT/E2E modestly through better load balancing, but routing policy headroom is limited vs elastic P2P offload (-44% E2E). TODO: vLLM → Redis → router pipeline for exact state ablation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 16:57:32 +08:00
Gahow Wang	2b0ac70ee7	Phase 1 milestone: system-level analysis + reproducible report - REPORT.md: self-contained milestone report covering baseline vs elastic setup, exact launch commands, benchmark params, results, log locations, and repo structure — sufficient for anyone to reproduce - analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown (KV cache hit ratio, per-class TTFT, GPU util paradox explanation) - scripts/cache_aware_proxy.py: round-robin P-instance selection replacing argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x) - scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 16:17:41 +08:00
Gahow Wang	76ee28a40f	Elastic P2P v4: error rate 25% -> 4%, TTFT p50 -12% (median-tail tradeoff) Fixed offload decision: removed p>=d gate (was blocking all offloads), added MAX_OFFLOAD_INFLIGHT=4 cap and p_saturated threshold. Result (200 req, fresh restart): Baseline: 99% success, TTFT=1.080/9.410, TPOT90=0.076, E2E=5.306 Elastic: 96% success, TTFT=0.946/15.843, TPOT90=0.077, E2E=5.717 Architectural tradeoff confirmed: - Median (p50) improves: D instances not disrupted by heavy prefill - Tail (p90) worsens: offloaded HEAVY requests pay KV transfer cost - TPOT unchanged: decode isolation is not the bottleneck To improve p90: need layerwise pipelined KV transfer (overlap with prefill compute) or smarter offload gating that avoids offloading the very largest requests (which have the longest prefill time and generate the most KV). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 15:08:16 +08:00
Gahow Wang	1d2eeb4925	Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080) Design: offload HEAVY prefill only when P instance is less loaded than D AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D for future KV reuse. External KV correctly registered in prefix cache. Result (67/200 processed, 75% success): TTFT p50: 0.551s (-49% vs baseline 1.080s) TTFT p90: 4.135s (vs baseline 9.410s, -56%) TPOT p90: 0.074s (same as baseline) E2E p50: 2.938s (-45% vs baseline 5.306s) 25% error rate from ReadTimeout on very large HEAVY requests queuing on P. Needs stricter elastic gate or higher timeout. But successful requests show significant improvement over both baseline and previous P2P. Also: added external_prefix_cache metrics tracking to replayer summary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 13:50:25 +08:00
Gahow Wang	1b9268ba4c	P2P prefill offload: TTFT p50 -13% but p90 +59% (median-vs-tail tradeoff) Fixed race condition in P instance selection (all going to inst_0). P2P design: HEAVY requests prefill on least-loaded OTHER instance, KV transfer via Mooncake, decode on session-sticky instance. Result (200 req, fresh restart, vs baseline): TTFT p50: 1.080 -> 0.939 (-13%) <- median improves (decode not disrupted) TTFT p90: 9.410 -> 14.987 (+59%) <- tail worsens (KV transfer on large req) TPOT p90: 0.076 -> 0.075 (-1%) <- unchanged (not the bottleneck) E2E p50: 5.306 -> 5.565 (+5%) <- slightly worse overall The P2P offload helps the common case (WARM/MEDIUM get lower TTFT because their instance isn't blocked by a heavy prefill) but hurts HEAVY requests (extra KV transfer latency). This is a median-vs-tail tradeoff. For SLOs targeting p50: P2P offload helps. For SLOs targeting p90/p99: baseline combined is better. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 12:28:24 +08:00
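
Note on commit 020be9f444 (M1): the message describes replacing a plain set with an OrderedDict-backed LRU, using move_to_end on hits and popitem(last=False) on overflow. The sketch below illustrates that pattern only. The class name `BlockLRU` and the method `touch` are hypothetical and are not taken from cache_aware_proxy.py; only `CACHE_CAPACITY_BLOCKS` and its 200000 default come from the commit message.

```python
from collections import OrderedDict

# Default capacity named in the commit message.
CACHE_CAPACITY_BLOCKS = 200_000


class BlockLRU:
    """Hypothetical stand-in for the proxy's cached_blocks structure."""

    def __init__(self, capacity=CACHE_CAPACITY_BLOCKS):
        self.capacity = capacity
        self._blocks = OrderedDict()  # block_hash -> None, oldest first

    def touch(self, block_hash):
        """Record a block as most recently used; return True on a hit."""
        hit = block_hash in self._blocks
        self._blocks[block_hash] = None
        self._blocks.move_to_end(block_hash)  # refresh recency
        while len(self._blocks) > self.capacity:
            self._blocks.popitem(last=False)  # evict least recently used
        return hit

    def __contains__(self, block_hash):
        return block_hash in self._blocks

    def __len__(self):
        return len(self._blocks)
```

Unlike slicing a list built from a set, eviction order here is deterministic: the entry that has gone longest without a `touch` is always the one dropped.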

58 Commits
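
Note on commit be273f7f27: the message states the runtime offload gate as "offload when offload_cost < colocated_cost", with colocated_cost = queue(C_s) + prefill(new_tokens) and offload_cost = queue(P_idle) + prefill(P_tokens) + RDMA_overhead. A minimal sketch of that comparison follows. It assumes both queue and prefill terms can be converted from tokens to seconds with a single prefill throughput; the function name, parameter names, and that linear conversion are assumptions for illustration, not the proxy's actual code.

```python
def should_offload(cs_queue_tokens, new_tokens,
                   p_queue_tokens, p_tokens,
                   rdma_overhead_s, prefill_tokens_per_s):
    """Return True when offloading is estimated to be cheaper.

    Assumption: queueing and prefill time both scale linearly with token
    count at `prefill_tokens_per_s`. All names here are hypothetical.
    """
    # Stay on the session's instance C_s: wait for its queue, then prefill
    # only the tokens that are not already cached there.
    colocated_cost = (cs_queue_tokens + new_tokens) / prefill_tokens_per_s
    # Move to the least-loaded instance P: wait for its queue, prefill the
    # tokens P does not have cached, and pay the KV transfer overhead.
    offload_cost = ((p_queue_tokens + p_tokens) / prefill_tokens_per_s
                    + rdma_overhead_s)
    return offload_cost < colocated_cost
```

With this shape, a request that is cold on P can still be offloaded once the queue on C_s is deep enough, which matches the behaviour the commit message describes; the specific crossover point depends on the throughput and overhead values supplied.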