agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	559faa1e26	B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops to 2.26x at 65k. The naive reading is "interference gets weaker for huge prefills"; the actual mechanism is a regime shift, and reading TPOT p90 alone is misleading. Three superimposed effects: 1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that chunked-prefill keeps interleaving decode steps, so overlapping decodes trickle tokens out at painful per-token rates. A 65k prefill is long enough that overlapping decodes are fully blocked for ~10s; once they break through, the injection is winding down and subsequent iterations run unobstructed. The cost lands on the TTFT clock (14s) instead of inflating TPOT. 2. Bimodal TPOT distribution. At 65k overlap, decodes split into "blocked entire prefill then normal rate" and "trickled slowly through prefill chunks". p99 sits on the second population and grows 59 -> 169.5 ms; p90 sits on the first and shrinks. 3. "Clean" stops being clean. With 4x ~10s injections in 60s, the 110 "clean" decodes at 65k are squeezed into 2-3s recovery pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (40%), shrinking the denominator of the ratio. window_1_results.md adds a new B2 subsection laying out the mechanism with the per-cell data table and the explicit reading rule: headline interference metric is TTFT idx (monotone); TPOT p99 is the right tail indicator; TPOT p90 alone is unsafe across regime shifts. Direct implication: TTFT and TPOT need separate SLO thresholds under PD-colo, because they measure costs from different points in the request lifecycle and the cost migration between them is workload-dependent. current_results/characterization_claim_matrix.md adds a new supported claim for the cost migration, listed against the existing B2 evidence. current_results/reviewer_risk_register.md adds a low-severity entry warning future readers off TPOT p90 alone. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 00:35:45 +08:00
Gahow Wang	4722883903	Audit package refresh: Window 1 supported claims + risk register Refresh the standing audit package now that B1' / B2 / B3 are complete. current_results/characterization_claim_matrix.md Flips seven entries from "not_yet_supported" / "partially_supported" to "supported" with pointers into window_1_results/. New entries cover per-session sequentiality, KV per request, real reuse decomposition, theoretical APC ceiling, the LMetric locality gap, Unified breaking the locality-vs-latency tradeoff, B2 causal interference proof, sticky's interference inflation, and the partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work). current_results/main_claim_allowed_runs.md New "Allowed For Routing-Policy Comparison" section pins the five B3 policy directories. New "Allowed For PD-colo Interference" section pins the B2 sweep. Legacy section retained for the pre-instrumentation 200/500/1000-req runs. current_results/reviewer_risk_register.md Marks the two old "high"-severity risks (sequentiality / reuse decomposition) as resolved; adds new entries for the APC contamination empirics, the b3_analyze.sh truncate-write bug that cost unified's interference index, the GPU-0 EngineCore ghost cleanup, the saturated-replay caveat for trace-timestamp dispatch, and the synthetic B2 decode workload. current_results/all_figures_index.md Adds the 8 new Window 1 figures alongside the existing 6 from the legacy summarize_runs run. current_results/reproduction_commands.sh Records the full B3 + B2 + figure pipeline. analysis/characterization_todo_for_interns.md Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only B4 and B5 remain (Window 2). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:25:27 +08:00
Gahow Wang	0c3220cbb8	Window 1 results: combined B1' + B2 + B3 report and artifacts analysis/characterization/window_1_results.md is the headline write-up for Window 1: workload characterization (KV per request, real reuse decomposition, APC theoretical ceilings), B3 5-policy sweep with per-policy interpretation, B2 same-vs-different-worker interference microbench with causal reading, and an explicit list of what Window 1 does not answer (deferred to B4 SRR sweep + B5 attribution). Under window_1_results/: - 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC upper bound, and the KV footprint - per-policy hotspot_index.json snapshots so render_window1_figures.py can plot per-worker TTFT p90 distributions - 8 PNG figures (figures/) covering the headline claims Three takeaways the figures pin down: 1) intra-session reuse dominates (93.2%), so session-affinity routing is the right primary lever 2) unified hybrid affinity hits 79.4% APC (97% of the 79.6% intra-session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s 3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-size variation; same-worker TTFT idx scales 2.15× -> 218×, which is the cleanest causal evidence for same-worker prefill-decode interference Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:25:09 +08:00
Gahow Wang	b7902061d1	Window 1 analysis: APC upper bound, B2 window-overlap, figure renderer Three CPU-only analysis pieces that turn raw Window 1 artifacts into publishable numbers and figures. scripts/compute_apc_upper_bound.py Block-level trie walk over hash_ids to compute the theoretical APC ceiling on a trace, decomposed into intra-session / any-session / shared-prefix-only. Gives a fixed reference for what each routing policy could possibly achieve. w600 result: 79.6% intra-session, 80.3% any-session, 0.1% shared-prefix. analysis/characterization/b2_sweep_analysis.py (rewrite) Previous version used joined_analysis.interference_index() which labeled overlap = "any prefill in any other request during this decode". With short-prompt decode load this is always true (everyone's prefill overlaps everyone else's decode); n_overlap was 239/240 even in the different-worker control. New version labels overlap iff the decode's [t_first_token, t_finish] intersects an actual large injection window, computed from the cell's "prefill"-tagged metric rows. Different-worker control now cleanly sits at idx ≈ 1.0, same-worker scales monotonically. analysis/characterization/render_window1_figures.py Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling / APC vs hotspot scatter / per-worker TTFT / failure breakdown, B2 TPOT and TTFT curves (overlap vs clean and idx), reuse decomposition, KV footprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:24:54 +08:00
Gahow Wang	b9f324f2e6	B2 interference driver: request return_token_ids + text fallback The first B2 run produced metrics with ttft_s=null/tpot_s=null for every decode request because the OpenAI-style payload did not set return_token_ids: true, and the parser only inspected choices[0].token_ids. With token_ids missing the loop skipped every chunk, so no per-token timestamps were captured and the aggregator returned interference_index=null on all 10 cells. Fix: - send return_token_ids: true in the payload (matches replayer.replay) - also accept text-delta chunks as token signals (fallback for servers that drop token_ids despite the flag) vLLM engine_state was fine; only the load-gen metric capture was broken. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 22:39:54 +08:00
Gahow Wang	df3249925b	B3 analyze: prefer per-policy engine_state over slicing shared dir The hot-sweep variant of B3 writes one shared engine_state across all policies; the isolated variant writes per-policy. Previously slice_engine_state.py was called unconditionally and would overwrite an isolated policy's real data with an empty slice (the isolated policy's run-window doesn't overlap with the shared dir's contents). Now we check the policy directory's engine_state for any non-empty engine_*.jsonl first; if present, use it directly; else slice from the shared one as before. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 22:19:43 +08:00
Gahow Wang	1d87082ca1	B3: cold-start isolated policy runner (clean APC per cell) scripts/b3_isolated_policy.sh wraps one policy run in a fresh 8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy -> replayer -> snapshot artifacts -> cleanup. Used when cross-policy APC contamination matters more than the ~25-min vLLM warmup overhead per policy. Counterpart to the existing b3_sweep.sh which keeps vLLM warm across all policies (faster but warm-cache; we found via the sticky pre-flight that contamination is < 1% on this trace, so b3_sweep.sh stays the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 20:33:44 +08:00
Gahow Wang	08530b3915	B3 policies: pseudocode reference for the five-policy sweep Documents each pick_instance_* function from cache_aware_proxy.py in pseudocode so the policy semantics can be cited without re-reading implementation details. Covers lmetric (main baseline), load_only (no cache / no affinity control), sticky (hard affinity control), unified (gated affinity + LMetric fallback), and capped (lmetric on a per-session turn-capped trace). Includes a decision matrix that maps each policy to whether it uses session affinity, cache awareness, load awareness, and overload break, plus a one-liner per control explaining what comparison isolates which factor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 19:57:02 +08:00
Gahow Wang	123a74a4b9	B3 report renderer: incremental markdown table from comparison JSON Reads b3_policy_comparison.json (produced by b3_analyze.sh) and emits a markdown report with three tables: headline latency + APC, mechanism indices (interference / hotspot / reuse), and slow-request cause breakdown. Rows for policies not yet present in the sweep are left as "pending" so the same renderer can be re-invoked as each policy finishes, producing an evolving report rather than waiting for the full sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 18:58:21 +08:00
Gahow Wang	92db1c4370	B3 post-run helpers: engine_state slicer + per-policy aggregator scripts/slice_engine_state.py filters a shared engine_*.jsonl by a [t_start_unix, t_end_unix] window. Needed because the patched scheduler appends to one file per engine across the whole sweep; per-policy analysis requires the per-policy slice. scripts/b3_analyze.sh drives the slice + joined_analysis loop for every policy directory in a completed sweep, then aggregates one row per policy (latency percentiles, APC, interference_index, hotspot_index, reuse fractions, failure-cause counts) into b3_policy_comparison.json. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 18:51:33 +08:00
Gahow Wang	e23128ad65	B2: PD-colo interference microbench harness + sweep aggregator scripts/b2_interference.py is the controlled microbench. It runs two coroutines against the open proxy bypass (direct vLLM endpoints): - decode_load: continuous short-prompt requests at fixed QPS into a designated decode instance, to keep it decode-saturated. - prefill_injections: N large one-token requests at fixed interval, pointed at either the same instance (same-worker variant) or a paired one (different-worker control). Each cell (variant × prefill_size) gets its own metrics.jsonl plus a run_window.json containing t_start_unix/t_end_unix. The shared engine_*.jsonl from the scheduler patch is sliced by that window in the aggregator. analysis/characterization/b2_sweep_analysis.py walks the cell tree, slices the per-worker step log by each cell's window, runs the A5 interference_index() against the slice, and emits a single b2_sweep_summary.json with one row per cell. This is what feeds the "interference vs uncached prefill size" figure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:51 +08:00
Gahow Wang	c6b7c3471b	B3: load_only + sticky policies, capped-trace builder, sweep driver Three additions land together because B3's whole point is comparing LMetric against meaningful controls. - scripts/cache_aware_proxy.py: two new --policy values. - load_only: pure min(num_requests) routing, no cache or affinity. The B3 control that strips locality so the LMetric-vs-load gap is legible. - sticky: first turn goes to min-load, subsequent turns ALWAYS return to the same instance, even under saturation. The B3 control that maxes out locality so the hot-spot cost is legible. - scripts/build_capped_trace.py: per-session turn cap (default 8). Generates the session-mass-equalized variant the TODO calls for so that hot-spot index can be re-measured with the heavy-tail removed. - scripts/b3_sweep.sh: orchestrates the 5-cell sweep. - GPU_INDICES makes it easy to skip a dead GPU. - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so usage.prompt_tokens_details.cached_tokens is populated. vLLM 0.18.1 omits the field by default and breaks the reuse-decomp pipeline; the smoke run surfaced this. - Trap kills EngineCore by name in addition to "vllm serve" — the parent dies first but the child holds GPU memory. Was the root cause of the 89 GB ghost on GPU 0 earlier today. - Proxy readiness is a polling loop, not a fixed sleep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:24 +08:00
Gahow Wang	763355b825	A5 fix: worker-id resolution and vLLM cmpl- rid stripping Smoke validation on dash0 surfaced three real bugs that broke interference and failure-attribution labels end-to-end: 1. endpoint_url in metrics is the proxy URL (e.g. http://h:9200); the vLLM worker URL lives in breakdown's routed_to. The interference index and label path were taking endpoint_url first, so every request looked routed to a non-existent worker and the overlap counter stayed at zero. 2. _normalize_worker hard-coded base port 8000, so a smoke run on port 9100 resolved to engine_1100 instead of engine_0. Added a --worker-map URL=engine_id CLI flag and _resolve_worker() that prefers the explicit map and falls back to the heuristic. 3. vLLM rewrites the per-step rid as cmpl-<proxy_id>-<i>-<hash>, so the str equality check between per_req rid and our proxy request_id never matched -> every prefill step looked like "other request prefill", which would have flipped overlap to 100%. Added _vllm_rid_matches() that strips the cmpl-/chatcmpl- prefix. After the fix, the same smoke run reports interference_index = 22.9 across 24 overlap / 6 clean requests on a single instance, which is the expected shape for serial dispatch into a cold engine. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:47:23 +08:00
Gahow Wang	cd82b8c2a2	PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined Captures 5 runs from the experiment matrix (combined-ca x3 seeds, pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl with cuda graphs enabled. The headline: combined-ca: TTFT p50 0.91s success 99.5% pdsep-4p4d: TTFT p50 62.8s success 52% (69x worse, half dropped) pdsep-6p2d: TTFT p50 51.1s success 68% (56x worse, third dropped) C2 (fig_c2): headline bars per config with error bars. C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep splits hit the memory wall, but the side differs by P:D ratio -- 4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures P-side). C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side prefill compute; D-side wait + first token is <=1.2s. The bottleneck is P-side prefill queueing, not D-side decode wait as the original analytical model assumed. system_analysis.md gains a Layer 5b that reconciles the analytical KV-wall model (which considered D-side only) with the empirical finding that the wall hits whichever side has fewer GPUs, and co-saturates both at extreme splits via D-side back-pressure. plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures. bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during this work but not used by the current matrix's data). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:23:52 +08:00
Gahow Wang	25445e3d18	A5: joined analysis with reuse decomp, interference, hot-spot, labels New analysis/characterization/joined_analysis.py joins replayer metrics.jsonl + proxy breakdown.json + worker_state.jsonl by request_id, plus engine_*.jsonl by worker_id, and emits: - joined.jsonl per-request merged record - reuse_decomposition.json real intra/cross/shared classification using session_id + hash_ids + cached_tokens - interference_index.json TPOT_p90(same-worker prefill overlap) / TPOT_p90(clean), per Batch 2 - hotspot_index.json max/median worker TTFT-p90, per Batch 3 - failure_label.jsonl per-slow-request cause label, per Batch 5 - failure_breakdown.json label histogram - window_summary.json SRR warmup/steady/drain aggregates Closes the analyzer side of Phase A; replaces the status: unavailable placeholders the existing scaffold emits when join sources are missing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:33 +08:00
Gahow Wang	f42c715ec1	A4: open-loop session-causal SRR loadgen New replayer/srr.py drives a Poisson session-arrival load against the existing proxy, with strict per-session turn sequentiality, explicit warmup/steady/drain windows, and per-arrival fresh session_id + request_id so APC/session-affinity counters are not contaminated by repeated draws from the trace pool. Writes window_summary.json with attempted/completed/errored split by window so latency tails can be read on the steady-state window only. Required by Batch 4 SRR sweep; trace-timestamp dispatch in replay.py cannot drive arrival rate independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:20 +08:00
Gahow Wang	5816aad731	A3: vLLM scheduler patch for step-level JSONL log When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line per scheduler step with t_unix, worker_id, prefill/decode token counts, n_running/n_waiting, preempted ids, and per-request phase labels. No-op when the env var is unset, so production engines are not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to each per-engine launch so step logs end up at engine_${i}.jsonl. Required by Batch 2 (PD-colo interference index) and Batch 5 (same-worker overlap attribution); engine /metrics polling cannot provide per-step granularity. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:11 +08:00
Gahow Wang	fe556b5d98	A2: proxy worker-state snapshot and request-id passthrough Honor incoming X-Request-Id so replayer metrics and proxy breakdown share a join key. Each route decision now captures session_id, the full per-worker candidate-score snapshot (ongoing/pending/num_requests/cached_blocks plus both linear and lmetric scores), the chosen score, and unix timestamps for first-token and done events. A separate _worker_state_log records one row per decision and is exposed via GET /worker_state; GET /worker_state/latest returns a live snapshot without recording it. Required by Batch 3 (session hot-spot proof) and Batch 5 (failure attribution); existing breakdown.json had no per-worker state at decision time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:01 +08:00
Gahow Wang	d57e338366	A1: replayer instrumentation for cross-process join RequestMetrics gains absolute unix timestamps (t_dispatch_unix, t_first_token_unix, t_finish_unix), the proxy_request_id, the chosen endpoint URL, and the trace hash_ids. Replayer sends X-Request-Id: <session_id>:<turn_id>:<chat_id>:<idx> so proxy breakdown rows can be joined to metrics by exact key. Required by Batch 0 (online sequentiality proof) and Batch 1 reuse decomposition; existing metrics.jsonl couldn't establish either. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:18:52 +08:00
Gahow Wang	e5761fa6f3	Characterization plan: progress snapshot + Claude work plan - Add Progress Snapshot table to the intern TODO so per-batch status (DONE / partial / blocked-on-instrumentation) is visible at a glance. - New analysis/claude_characterization_work_plan.md scopes the Phase A instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2 (B4+B5) on dash0, with locked decisions for model, topology, trace, SLO style, and GPU phasing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:18:41 +08:00
Gahow Wang	5ed6f6fe5b	Add characterization result figures	2026-05-25 15:15:10 +08:00
Gahow Wang	0f64fb3261	Add agentic workload characterization audit scaffold	2026-05-25 15:01:18 +08:00
Gahow Wang	21ffb3d4f7	PD-sep matrix infrastructure: bench.sh pdsep mode + matrix driver Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5) in the PD-sep paper section. Three pieces: 1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an --eager flag to re-enable --enforce-eager for the cuda-graph ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and swaps the proxy command from --combined to --prefill/--decode. baseline and elastic flows are unchanged. 2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr; --with-eager doubles to ~5 h with the cuda-graph ablation. Skips completed runs, captures per-instance vLLM logs (needed for C3 step-level KV-utilization mining). 3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's observed 6P+2D 97% KV utilization. The marker lands on the model's predicted curve at p90 input, confirming the steady-state analysis. README updated with the run command, output layout, and the followup plotters that consume outputs/pd_matrix/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:47:33 +08:00
Gahow Wang	4028c587b1	Paper section: system analysis + workload figures + KV-wall model Adds the system-level argument resolving the roofline/PD-sep paradox. Even at 95% cache reuse prefill stays compute-bound (the C6 roofline fact), yet PD separation regresses TTFT 72%. The new system_analysis.md walks through six layers showing why the roofline claim is necessary but not sufficient, with the falsifiable condition being decode-side KV memory budget: concurrent_decode * KV_per_req / (N_D * HBM_pool). For chatbot this ratio is << 1 at any layout; for agentic at p90+ context it goes >> 1 under 4P+4D and 6P+2D, predicting the empirical 97% decode KV occupancy. fig_kv_memory_wall.pdf visualizes the model with audit-able constants; fig_c1a/b ground the per-request KV-size inputs in the actual sampled trace (input p50=33.5k, p90=101k, intra-session reuse 79.2%). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:41:31 +08:00
Gahow Wang	d71a111099	Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is net negative under agentic workloads" paper section: plot scripts for C1 (workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7 PDFs already rendered, and a README mapping candidate claims to required figures plus open re-run items. Removes --enforce-eager from bench.sh and all active launch scripts so cuda graphs are captured -- the prior methodology suppressed one of PD-sep's structural advantages (D-node fixed-shape decode). Legacy scripts under scripts/legacy/ are intentionally untouched as historical records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:24:16 +08:00
Gahow Wang	6a27f75337	Docs: reconcile routing docs with current hybrid direction Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit `255c8e6`). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after `cc6e562` / 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:47:14 +08:00
Gahow Wang	ac6534c3ff	Cleanup: retire dead PUSH path + extract hybrid picker - Delete unreachable best_needs_push block in _handle_combined and the four orphaned helpers (_handle_cached_prefill_offload, _handle_direct_read_offload, _query_bootstrap_hit, _get_bootstrap_client). Their only caller was the retired PUSH gate; see REPORT §3.9 errata for the rejected experiments (`cc6e562`, `4c583f2`). - Extract pick_instance_unified_hybrid as a pure function returning (chosen, idx, decision_dict). The decision dict carries the review #7 breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio, avg_num_requests, fallback_score, tie_break_used). - Add LMetric-fallback tie-breaker (primary score, then new_uncached, num_requests, round-robin) so new sessions don't all pin to inst 0 when BS=0 across the board. - Drop the lmetric-policy affinity write so --policy lmetric stays affinity-free per review #3. - Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio / --decode-iteration-s as [DEPRECATED] in --help; flags remain accepted so scripts/bench.sh and legacy launchers don't break. - Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already rejected this knob (within noise). Future sweeps should go via CLI. Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering affinity-hit, overload break, low-cache fallback, tie-break rotation, lmetric purity, and breakdown field shape. 19/19 pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:46:57 +08:00
Gahow Wang	255c8e6884	Hybrid routing: LMetric for LB + explicit affinity for high-cache sessions Replace the full unified cost model with a simpler hybrid: - If session has >50% cache on affinity instance AND instance not overloaded (num_requests <= avg * overload_factor) → stick to affinity - Otherwise → use LMetric (P × BS) for best load balance This combines LMetric's superior load balance with explicit session affinity for high-value sessions that have significant cache accumulation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 09:05:08 +08:00
Gahow Wang	448361cf83	Update design doc: final results + review findings Unified routing (baseline mode) beats LMetric E2E mean/p50/p90. PD-sep offload consistently degrades performance (5-134 offloads tested). Independent review: fair comparison, no reward hacking, needs multi-run significance verification (running 3x paired test). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:48:18 +08:00
Gahow Wang	4c583f2f1c	Revert relaxed gate + push_cost fix: 134 offloads destroyed performance PD-sep offload overhead (C queue + prefill + KV transfer + D schedule) far exceeds any load balance benefit. With relaxed gate, cost model triggered 134 offloads → E2E p90 went from 37s to 82s. The proven winning configuration is Unified routing in baseline mode (no Mooncake connector), which beats LMetric on E2E mean/p50/p90 purely through better routing (contention-aware + session affinity). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:38:59 +08:00
Gahow Wang	bf4469a150	Fix cost model: accurate push_cost + aligned hard gate 1. push_cost now models both C and D: max(c_cost, d_cost) where c_cost includes C's queue + prefill, d_cost includes D's queue + RDMA overhead. Old formula only had D's contention + RDMA. 2. Hard gate uses num_requests instead of ongoing_tokens, aligning with the contention-based cost model. 3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 01:01:03 +08:00
Gahow Wang	1d2148cf65	Remove second push_new gate that caused downgrade-to-cold-LOCAL After _push_allowed was relaxed, the cost model correctly chose push for high-cache sessions on overloaded instances. But a second gate at execution time (push_new < heavy_threshold) blocked the actual offload, downgrading to LOCAL on the target instance — which had no cache. Worse, session affinity was already updated to the target, so all subsequent turns also hit cold prefill. This was the root cause of relaxed gate's performance regression: affinity broken + push blocked = worst of both worlds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:42:31 +08:00
Gahow Wang	3ae99293fd	Relax _push_allowed: gate on request size, not cache savings The old gate blocked offload when push_new (= input - cache_hit) < 20K, which prevented migration of high-cache sessions — exactly the ones that benefit most. After PD-sep, the target receives full KV via RDMA and has the same cache as the source, so cache_hit is irrelevant to the offload decision. New gate: only check input_length >= heavy_threshold (request must be HEAVY) and max_offload_inflight (concurrency cap). Let the cost model decide whether the contention difference justifies migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:03:28 +08:00
Gahow Wang	cc6e5625bb	Revert Approach B (session migration): overhead exceeds LB benefit Reverts 3 commits: `e991960`, `5772149`, `5b1d360`. 57 migrations triggered but PD-sep overhead (C queue + KV transfer + D cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s. Migration mechanism needs fundamental rework before it can help. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 23:43:47 +08:00
Gahow Wang	5b1d36080a	Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg) The session migration path was calling _handle_cached_prefill_offload with swapped c_inst/d_inst and missing cache_hit parameter, causing TypeError on every migration attempt (13 of 41 errors in the test run). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 22:46:46 +08:00
Gahow Wang	5772149d36	Approach B v2: TTFT-based migration trigger Replace num_requests threshold with recent TTFT median as migration trigger. Track per-instance rolling TTFT (last 8 requests) and trigger migration when median > 5s (configurable). Target is the instance with lowest recent TTFT, requiring > 2x improvement to justify migration. This is more responsive than the instantaneous num_requests signal because TTFT directly measures the user-facing impact of contention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 21:54:06 +08:00
Gahow Wang	45b82272c3	Add migration policy design doc with A/B experiment results Approach A (contention-aware cost model): TTFT p90 -52% vs baseline. Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 18:24:49 +08:00
Gahow Wang	e9919605af	Approach B: session-level lazy migration trigger When a request arrives for a session on an overloaded instance, force migration if four conditions hold: 1. Instance busy: num_requests > avg * migration_request_factor (1.5x) 2. Session has cache value: cache_ratio > 50% 3. Request is HEAVY (>= heavy_threshold) 4. A meaningfully less-loaded target exists (num_requests gap > 2) This bypasses the cost model for migration decisions — the cost model's cache-inflated costs prevented migration even when instances had 150s queue times with 99% cache hit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:34:06 +08:00
Gahow Wang	e06de5144b	Approach A: contention-aware cost model with migration discount Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:24:27 +08:00
Gahow Wang	e13391eeab	Evict migrated blocks from prefix cache after KV send completes After a session migrates from C to D via offload, C's blocks were freed to the LRU tail (most-recently-used position), making them the last to be evicted. Since the session won't return to C, these blocks are dead weight occupying cache capacity. Now capture block IDs before _free_blocks and call evict_blocks to remove them from the prefix cache hash table, so they can be reused sooner for active sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:56:34 +08:00
Gahow Wang	4b50c5a08d	Fix unified cost model: include decode load in queue + hard overload gate Two bugs caused elastic to concentrate load on cached instances (10x token imbalance vs 2.7x baseline): 1. _instance_cost queue only counted pending_prefill_tokens, missing ongoing_decode_tokens entirely — instances with 50 decoding requests appeared idle to the cost model. 2. Cache hits made overloaded instances look "cheap", creating a positive feedback loop: more sessions → more cache → lower cost → more routing. Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks affinity before the cost model runs, matching linear policy behavior. Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:25:02 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	cc4a9c91e7	Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash The standalone hash computation in estimate_hit produced different hashes than the hash_table (synced from scheduler). Root cause unclear (possibly pickle serialization differences or hash chain state). Fix: delegate to _lookup_by_tokens which is proven to work (push_blocks uses it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 12:41:53 +08:00
Gahow Wang	657812f8c4	Add deploy_vllm_patches.sh: sync third_party/vllm patches to site-packages Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from third_party/vllm to the pip-installed vllm's site-packages. C extensions stay from the pip package; only Python files are overridden. Usage: bash scripts/deploy_vllm_patches.sh [HOST] Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:59:52 +08:00
Gahow Wang	bf76273778	Add --offload-mode switch for ablation (direct_read vs cached_prefill) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:24:15 +08:00
Gahow Wang	cdf83493ab	Fix A+C: real cache sync + cached-prefill-on-C architecture A: Add /estimate_hit endpoint to bootstrap server for real-time cache probing. Proxy queries this before committing to PUSH, eliminating 24% zero-match PUSH requests (shadow cache divergence). C: Add _handle_cached_prefill_offload: C (cache source) does fast cached prefill → KV to Mooncake → D pulls and decodes. Replaces broken direct_read PUSH where D waited for RDMA transfer while occupying KV blocks without doing compute. Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:22:38 +08:00
Gahow Wang	2b9eae0d54	Report §3.9: Unified routing final results — TTFT -25%, E2E -7% 850/850, 0 errors. Single argmin(latency) with soft affinity. 116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL. TPOT p90 +15% tradeoff from kv_both overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 03:15:32 +08:00
Gahow Wang	97f4fe5164	Fix: rename inst->chosen in generate function (NameError crash) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:55:01 +08:00
Gahow Wang	5892739159	Add session affinity as soft preference in unified routing Without affinity, all cached requests route to the same instance (cache source always has lowest prefill cost), causing 149s queue. Fix: if the session's last instance has cost <= 2x the global best, use it (preserves cache locality). Only re-route when the affinity instance is significantly more expensive (overloaded). The 2x threshold is intentionally loose — it's not a hardcoded magic number but a "prefer locality unless clearly worse" heuristic. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:37:58 +08:00
Gahow Wang	6b255fad91	Unified routing: single argmin(expected_latency) over all instances Replace two-phase routing (pick_instance → offload gate) with a single cost function evaluated per instance: latency(D) = queue(D) + prefill_time(D) + transfer_cost(D) - If D has local cache: prefill = (input - local_hit) / throughput - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma - Otherwise: prefill = input / throughput (cold) Choose argmin(latency). If the winner needs PUSH → trigger migration. Removed: - WARM/MEDIUM/HEAVY classification (no routing purpose) - heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio - Interference penalty magic number (0.3) - Separate pick_instance + offload gate stages Only 2 measured parameters remain: - prefill_throughput = 7000 tokens/s (H20 measured) - rdma_overhead_s = 0.1s (RDMA PUSH measured) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 02:21:34 +08:00
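
The cost function described in commit 6b255fad91 can be sketched as follows. This is a minimal illustration, not the repository's code: the `Instance` dataclass, its field names (`queue_s`, `local_hit`, `push_hit`), and the `route` helper are hypothetical. Only the formula and the two constants (7000 tokens/s prefill throughput, 0.1 s RDMA overhead) come from the commit message.

```python
from dataclasses import dataclass

# The two measured parameters named in the commit message.
PREFILL_THROUGHPUT = 7000.0  # tokens/s (H20 measured)
RDMA_OVERHEAD_S = 0.1        # seconds (RDMA PUSH measured)


@dataclass
class Instance:
    # Hypothetical shape; the real proxy tracks instance state differently.
    name: str
    queue_s: float      # estimated queueing delay on this instance, seconds
    local_hit: int = 0  # prompt tokens already in this instance's cache
    push_hit: int = 0   # prompt tokens it could receive via PUSH


def expected_latency(inst: Instance, input_len: int) -> tuple[float, bool]:
    """Return (expected latency in seconds, whether a PUSH is needed)."""
    if inst.local_hit > 0:
        # Local cache: only the uncached suffix is prefilled.
        prefill = (input_len - inst.local_hit) / PREFILL_THROUGHPUT
        needs_push = False
    elif inst.push_hit > 0:
        # PUSH from the cache source: uncached suffix plus RDMA overhead.
        prefill = (input_len - inst.push_hit) / PREFILL_THROUGHPUT + RDMA_OVERHEAD_S
        needs_push = True
    else:
        # Cold: the whole prompt is prefilled.
        prefill = input_len / PREFILL_THROUGHPUT
        needs_push = False
    return inst.queue_s + prefill, needs_push


def route(instances: list[Instance], input_len: int) -> tuple[Instance, float, bool]:
    """Pick argmin(expected latency); the flag says whether to trigger migration."""
    best = min(instances, key=lambda i: expected_latency(i, input_len)[0])
    latency, needs_push = expected_latency(best, input_len)
    return best, latency, needs_push


if __name__ == "__main__":
    fleet = [
        Instance("A", queue_s=5.0, local_hit=21000),  # cache source, but busy
        Instance("B", queue_s=0.0, push_hit=21000),   # idle, can receive PUSH
        Instance("C", queue_s=0.5),                   # cold
    ]
    chosen, latency, needs_push = route(fleet, input_len=28000)
    print(chosen.name, round(latency, 2), needs_push)
```

In this made-up fleet the idle instance wins despite the RDMA overhead, because the queue on the cache source dominates. A pure argmin like this carries no preference for the session's previous instance, which is the gap the soft-affinity rule in 5892739159 addresses.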

131 Commits