agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	4b833d33b7	unified_v2.1: relax gates + add unified_kv_both isolation control v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The gates were too conservative; the v2-vs-v1 latency gap (TTFT p90 7.35 -> 8.96 s) is therefore probably attributable to kv_both always-on overhead, not to the PD-sep mechanism itself. v2.1 has two fixes plus an isolation control. Bug fix: - The "chosen has live decodes worth protecting" gate combined num_requests and ongoing_decode_tokens with AND, falling through when EITHER was small. Under agentic workloads each worker rarely stacks more than 1-2 concurrent requests, so the gate killed 84% of v2.0 candidates that reached it. Replace with a pure ongoing_decode_tokens == 0 check ("chosen_no_active_decode") — same semantic, much higher recall. Threshold relaxation (B2 microbench is the calibration source): - pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already at 8k, TTFT idx 12x — strictly worth migrating) - pd_sep_min_decodes_protected: 2 -> 1 - pd_sep_min_src_cache_tokens: 8000 -> 4000 - pd_sep_min_extra_cache_tokens: 4000 -> 2000 Isolation control: - New --policy unified_kv_both option. Uses the exact same picker as --policy unified but the vLLMs are launched in kv_role=kv_both (the same launch mode unified_v2 requires). PD-sep never fires. Compares against unified_v2 to attribute any v2 effect to the PD-sep branch alone, not the kv_both always-on overhead. - Both unified_kv_both and unified_v2 auto-enable kv_both launch in b3_isolated_policy.sh. Tests: - Updated the existing "chosen has no decodes" test for the new gate name and semantic. - All 24 proxy tests pass. Refs: window_1_results/v2_breakdown analysis (88.7% of candidates caught by old new_local_below_threshold; 84% of the remainder caught by the old few_decodes gate). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 10:40:57 +08:00
Gahow Wang	19f69a9d2e	unified_v2: selective per-request PD-sep via Mooncake (E3+E4) Adds a sixth routing policy --policy unified_v2 that wraps the existing unified hybrid picker with a selective PD-sep branch. When all of the following hold, a request is split prefill-on-src, decode-on-chosen via Mooncake kv_role=kv_both transfer: 1. new_local = input_length - chosen.cache_hit > 16k (B2 microbench shows same-worker TTFT idx >= 3x from this size up) 2. chosen has live decodes worth protecting (>= 2 in-flight) 3. some other instance holds materially more cache for this prefix (>= 8k tokens, and >= 4k more than chosen) 4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference) The cost model is the audit-blessed shape from E1's post-mortem: - gate on new_tokens (post-cache), NOT input_length (the old PUSH gate) - bind to a single transfer mechanism (kv_both peer-to-peer pull) - realistic RDMA cost as a function of bytes: 0.3s base + bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50) - both source and target decode counts considered E2 mechanism-level patches not yet applied (this commit is policy-only). Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request xfer timeout, 60s default) is implemented on the proxy side as an httpx per-chunk read timeout on the dst streaming call, so a stuck KV transfer fails the request instead of hanging for 600s. cache_aware_proxy.py: - Settings: kv_bytes_per_token, prefill_throughput_kv_both, rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs - estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s - estimate_same_worker_interference_s(new_tokens, num_decodes) reads off the B2 penalty curve in 4 bins - pick_instance_unified_v2: inherits unified, returns extra (src_inst, src_idx) tuple when PD-sep wins the cost compare - _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True, max_tokens=1), Mooncake xfer, decode-stream on dst with httpx Timeout(read=pd_sep_xfer_timeout_s) - --policy unified_v2 added to argparse choices - lifespan auto-runs init_prefill_bootstrap when policy is unified_v2 b3_isolated_policy.sh: - ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and --bootstrap-ports to the proxy Tests: 8 new unit tests cover the gating predicates and the cost estimators; all 32 proxy tests still pass. Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 09:25:45 +08:00
Gahow Wang	c63dc151a0	Agentic PD / Unified routing story plan draft User's 2026-05-25 draft aligning three threads (agentic-kv vLLM experiments, dash0 artifacts, agentic-pd-hybrid SGLang work) into a single story for the paper. Tracked so future iterations and review history are in version control. Co-Authored-By: Gahow Wang <chiahaco@gmail.com>	2026-05-26 01:12:42 +08:00
Gahow Wang	0881942cf3	Window 1 results: recompute with fixed metrics + reframe limitations After the B3 audit bug fixes (joined_analysis hotspot median + b3_analyze percentile interp), regenerate b3_policy_comparison.json and the per-policy hotspot_index.json from the same raw run on dash0 and re-render the three affected figures (apc-vs-hotspot, latency-bars, per-worker TTFT). Key number changes in window_1_results.md: - hotspot_index magnitudes corrected (all five policies; lmetric smallest delta at +0.7%, sticky largest at +16.1%) - "capped reduces hotspot 13%" -> "~10% (2.253 -> 2.020)" - TTFT/E2E/TPOT percentiles shift by <1% from floor->interp (unified TTFT p90 7.24 -> 7.35 s) Restructured "Caveats" into "Limitations (read this before quoting B3 numbers)": 1. Agentic dispatch coupling is by design — promoted from caveat to top-level methodology framing, tied to agentic_dispatch_coupling.md 2. B3 interference_index is binary (not size-graded) — added 3. Hot-sweep cache contamination (<1%) — kept 4. Unified interference unrecoverable — kept with explicit warning not to read unified's failure attribution as causal 5. w600 is a sample, not full trace — kept 6. Reuse decomposition is per-token in expectation — added current_results/characterization_claim_matrix.md updates: - The "heavy-tail not sole cause" claim now cites the corrected ~10% drop with the median bug noted - New supported claim: "B3 saturated-replay latency gaps include an agentic dispatch-coupling feedback term, which is intentional and matches production"; cited against agentic_dispatch_coupling.md. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:08:55 +08:00
Gahow Wang	0e82612100	Fix B3 analysis bugs from subagent audit (median + percentile + sweep) Three fixes from the B3 audit: 1) joined_analysis.hotspot_index used sorted[n//2] as median, which returns the ~60th percentile for n=8 (even-length). Systematically under-states the hotspot index. Recomputed values: lmetric 2.238 -> 2.253 (+0.7%) load_only 1.140 -> 1.294 (+13.5%) sticky 2.349 -> 2.728 (+16.1%) unified 3.350 -> 3.667 (+9.5%) capped 1.937 -> 2.020 (+4.3%) Qualitative ranking preserved; "capped only modestly reduces hotspot" story holds with ~10% drop instead of the previously reported 13%. Added test_hotspot_index_uses_true_median_for_even_n to lock in the fix. 2) b3_analyze.sh's pct() helper used floor-indexed percentile sorted[int(p*(n-1))], inconsistent with metrics._percentile and joined_analysis._percentile which both use linear interpolation. Now matches. 3) b3_sweep.sh's capped step called run_policy "capped", but the proxy's argparse has no "capped" choice, so the hot-sweep variant would have crashed on this step. The actual capped data was produced via b3_isolated_policy.sh with --policy lmetric. Replace the broken inline call with an explicit launch_proxy lmetric + inline replayer block so the sweep script matches the data path it documents. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:08:37 +08:00
Gahow Wang	8ac41a8684	Agentic dispatch coupling: trace-replay session-sequentiality is realistic The B3 audit flagged the trace replayer's "fire turn N+1 immediately if turn N is behind schedule" semantics as a potential benchmark crime, because under saturation the effective arrival process becomes policy-dependent (slow policy -> longer session lifetimes -> more concurrent in-flight -> harder system -> still slower). The audit called this dispatch slip. But in agentic workloads, turn N+1 is generated by a tool-call response or an autonomous-loop step, not by a human reading the previous reply. There is no inter-turn think-time. So the replayer's "no think-time, sequential within session, fire-immediately-when- ready" behavior is the correct model of agentic production, and the feedback amplification is a real property of production systems under saturation rather than an artifact of the replayer. The note (analysis/characterization/agentic_dispatch_coupling.md) lays out: - The dispatch rule and the apparent feedback loop - Why agentic workloads do not have user think-time - Application of Little's Law: slower policy carries higher concurrent in-flight load, so the policy x feedback gap is real, not artifact - Reframes B3 as the "production-replay" experiment and B4 as the orthogonal "controlled-load" experiment, complementary not hierarchical - Calls the feedback amplification itself out as a finding worth reporting (e.g. unified's ~2x latency-p90 gap over lmetric in B3 reflects both the routing improvement and the in-flight reduction) - Contrasts with chat workloads (human think-time partially breaks the feedback loop, agentic removes that floor) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 01:00:25 +08:00
Gahow Wang	f784e49c07	Microbench: prefill-decode interference + PD transfer lifecycle Two microbenchmarks quantifying the elastic offload decision: 1. Interference (corrected): cold prefill causes 14-214x TPOT p90 degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}). Earlier run had a prefix-cache bug (deterministic prompts hit cache after rep 0); fixed with uuid+time_ns unique prompts. 2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy, measuring prefill→RDMA→decode startup overhead. Key finding: offload wins at all P≥2048 operating points — transfer cost is 25-50% of interference cost even with bulk Mooncake.	2026-05-26 00:57:06 +08:00
Gahow Wang	559faa1e26	B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops to 2.26x at 65k. The naive reading is "interference gets weaker for huge prefills"; the actual mechanism is a regime shift, and reading TPOT p90 alone is misleading. Three superimposed effects: 1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that chunked-prefill keeps interleaving decode steps, so overlapping decodes trickle tokens out at painful per-token rates. A 65k prefill is long enough that overlapping decodes are fully blocked for ~10s; once they break through, the injection is winding down and subsequent iterations run unobstructed. The cost lands on the TTFT clock (14s) instead of inflating TPOT. 2. Bimodal TPOT distribution. At 65k overlap, decodes split into "blocked entire prefill then normal rate" and "trickled slowly through prefill chunks". p99 sits on the second population and grows 59 -> 169.5 ms; p90 sits on the first and shrinks. 3. "Clean" stops being clean. With 4x ~10s injections in 60s, the 110 "clean" decodes at 65k are squeezed into 2-3s recovery pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (40%), shrinking the denominator of the ratio. window_1_results.md adds a new B2 subsection laying out the mechanism with the per-cell data table and the explicit reading rule: headline interference metric is TTFT idx (monotone); TPOT p99 is the right tail indicator; TPOT p90 alone is unsafe across regime shifts. Direct implication: TTFT and TPOT need separate SLO thresholds under PD-colo, because they measure costs from different points in the request lifecycle and the cost migration between them is workload-dependent. current_results/characterization_claim_matrix.md adds a new supported claim for the cost migration, listed against the existing B2 evidence. current_results/reviewer_risk_register.md adds a low-severity entry warning future readers off TPOT p90 alone. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-26 00:35:45 +08:00
Gahow Wang	4722883903	Audit package refresh: Window 1 supported claims + risk register Refresh the standing audit package now that B1' / B2 / B3 are complete. current_results/characterization_claim_matrix.md Flips seven entries from "not_yet_supported" / "partially_supported" to "supported" with pointers into window_1_results/. New entries cover per-session sequentiality, KV per request, real reuse decomposition, theoretical APC ceiling, the LMetric locality gap, Unified breaking the locality-vs-latency tradeoff, B2 causal interference proof, sticky's interference inflation, and the partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work). current_results/main_claim_allowed_runs.md New "Allowed For Routing-Policy Comparison" section pins the five B3 policy directories. New "Allowed For PD-colo Interference" section pins the B2 sweep. Legacy section retained for the pre-instrumentation 200/500/1000-req runs. current_results/reviewer_risk_register.md Marks the two old "high"-severity risks (sequentiality / reuse decomposition) as resolved; adds new entries for the APC contamination empirics, the b3_analyze.sh truncate-write bug that cost unified's interference index, the GPU-0 EngineCore ghost cleanup, the saturated-replay caveat for trace-timestamp dispatch, and the synthetic B2 decode workload. current_results/all_figures_index.md Adds the 8 new Window 1 figures alongside the existing 6 from the legacy summarize_runs run. current_results/reproduction_commands.sh Records the full B3 + B2 + figure pipeline. analysis/characterization_todo_for_interns.md Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only B4 and B5 remain (Window 2). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:25:27 +08:00
Gahow Wang	0c3220cbb8	Window 1 results: combined B1' + B2 + B3 report and artifacts analysis/characterization/window_1_results.md is the headline write-up for Window 1: workload characterization (KV per request, real reuse decomposition, APC theoretical ceilings), B3 5-policy sweep with per-policy interpretation, B2 same-vs-different-worker interference microbench with causal reading, and an explicit list of what Window 1 does not answer (deferred to B4 SRR sweep + B5 attribution). Under window_1_results/: - 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC upper bound, and the KV footprint - per-policy hotspot_index.json snapshots so render_window1_figures.py can plot per-worker TTFT p90 distributions - 8 PNG figures (figures/) covering the headline claims Three takeaways the figures pin down: 1) intra-session reuse dominates (93.2%), so session-affinity routing is the right primary lever 2) unified hybrid affinity hits 79.4% APC (97% of the 79.6% intra- session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s 3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill- size variation; same-worker TTFT idx scales 2.15× -> 218×, which is the cleanest causal evidence for same-worker prefill-decode interference Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:25:09 +08:00
Gahow Wang	b7902061d1	Window 1 analysis: APC upper bound, B2 window-overlap, figure renderer Three CPU-only analysis pieces that turn raw Window 1 artifacts into publishable numbers and figures. scripts/compute_apc_upper_bound.py Block-level trie walk over hash_ids to compute the theoretical APC ceiling on a trace, decomposed into intra-session / any-session / shared-prefix-only. Gives a fixed reference for what each routing policy could possibly achieve. w600 result: 79.6% intra-session, 80.3% any-session, 0.1% shared-prefix. analysis/characterization/b2_sweep_analysis.py (rewrite) Previous version used joined_analysis.interference_index() which labeled overlap = "any prefill in any other request during this decode". With short-prompt decode load this is always true (everyone's prefill overlaps everyone else's decode); n_overlap was 239/240 even in the different-worker control. New version labels overlap iff the decode's [t_first_token, t_finish] intersects an actual large injection window, computed from the cell's "prefill"-tagged metric rows. Different-worker control now cleanly sits at idx ≈ 1.0, same-worker scales monotonically. analysis/characterization/render_window1_figures.py Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling / APC vs hotspot scatter / per-worker TTFT / failure breakdown, B2 TPOT and TTFT curves (overlap vs clean and idx), reuse decomposition, KV footprint. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 23:24:54 +08:00
Gahow Wang	b9f324f2e6	B2 interference driver: request return_token_ids + text fallback The first B2 run produced metrics with ttft_s=null/tpot_s=null for every decode request because the OpenAI-style payload did not set return_token_ids: true, and the parser only inspected choices[0].token_ids. With token_ids missing the loop skipped every chunk, so no per-token timestamps were captured and the aggregator returned interference_index=null on all 10 cells. Fix: - send return_token_ids: true in the payload (matches replayer.replay) - also accept text-delta chunks as token signals (fallback for servers that drop token_ids despite the flag) vLLM engine_state was fine; only the load-gen metric capture was broken. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 22:39:54 +08:00
Gahow Wang	df3249925b	B3 analyze: prefer per-policy engine_state over slicing shared dir The hot-sweep variant of B3 writes one shared engine_state across all policies; the isolated variant writes per-policy. Previously slice_engine_state.py was called unconditionally and would overwrite an isolated policy's real data with an empty slice (the isolated policy's run-window doesn't overlap with the shared dir's contents). Now we check the policy directory's engine_state for any non-empty engine_*.jsonl first; if present, use it directly; else slice from the shared one as before. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 22:19:43 +08:00
Gahow Wang	1d87082ca1	B3: cold-start isolated policy runner (clean APC per cell) scripts/b3_isolated_policy.sh wraps one policy run in a fresh 8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy -> replayer -> snapshot artifacts -> cleanup. Used when cross- policy APC contamination matters more than the ~25-min vLLM warmup overhead per policy. Counterpart to the existing b3_sweep.sh which keeps vLLM warm across all policies (faster but warm-cache; we found via the sticky pre-flight that contamination is < 1% on this trace, so b3_sweep.sh stays the default). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 20:33:44 +08:00
Gahow Wang	08530b3915	B3 policies: pseudocode reference for the five-policy sweep Documents each pick_instance_* function from cache_aware_proxy.py in pseudocode so the policy semantics can be cited without re-reading implementation details. Covers lmetric (main baseline), load_only (no cache / no affinity control), sticky (hard affinity control), unified (gated affinity + LMetric fallback), and capped (lmetric on a per-session turn-capped trace). Includes a decision matrix that maps each policy to whether it uses session affinity, cache awareness, load awareness, and overload break, plus a one-liner per control explaining what comparison isolates which factor. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 19:57:02 +08:00
Gahow Wang	123a74a4b9	B3 report renderer: incremental markdown table from comparison JSON Reads b3_policy_comparison.json (produced by b3_analyze.sh) and emits a markdown report with three tables: headline latency + APC, mechanism indices (interference / hotspot / reuse), and slow-request cause breakdown. Rows for policies not yet present in the sweep are left as "pending" so the same renderer can be re-invoked as each policy finishes, producing an evolving report rather than waiting for the full sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 18:58:21 +08:00
Gahow Wang	92db1c4370	B3 post-run helpers: engine_state slicer + per-policy aggregator scripts/slice_engine_state.py filters a shared engine_*.jsonl by a [t_start_unix, t_end_unix] window. Needed because the patched scheduler appends to one file per engine across the whole sweep; per-policy analysis requires the per-policy slice. scripts/b3_analyze.sh drives the slice + joined_analysis loop for every policy directory in a completed sweep, then aggregates one row per policy (latency percentiles, APC, interference_index, hotspot_index, reuse fractions, failure-cause counts) into b3_policy_comparison.json. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 18:51:33 +08:00
Gahow Wang	e23128ad65	B2: PD-colo interference microbench harness + sweep aggregator scripts/b2_interference.py is the controlled microbench. It runs two coroutines against the open proxy bypass (direct vLLM endpoints): - decode_load: continuous short-prompt requests at fixed QPS into a designated decode instance, to keep it decode-saturated. - prefill_injections: N large one-token requests at fixed interval, pointed at either the same instance (same-worker variant) or a paired one (different-worker control). Each cell (variant × prefill_size) gets its own metrics.jsonl plus a run_window.json containing t_start_unix/t_end_unix. The shared engine_*.jsonl from the scheduler patch is sliced by that window in the aggregator. analysis/characterization/b2_sweep_analysis.py walks the cell tree, slices the per-worker step log by each cell's window, runs the A5 interference_index() against the slice, and emits a single b2_sweep_summary.json with one row per cell. This is what feeds the "interference vs uncached prefill size" figure. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:51 +08:00
Gahow Wang	c6b7c3471b	B3: load_only + sticky policies, capped-trace builder, sweep driver Three additions land together because B3's whole point is comparing LMetric against meaningful controls. - scripts/cache_aware_proxy.py: two new --policy values. - load_only: pure min(num_requests) routing, no cache or affinity. The B3 control that strips locality so the LMetric-vs-load gap is legible. - sticky: first turn goes to min-load, subsequent turns ALWAYS return to the same instance, even under saturation. The B3 control that maxes out locality so the hot-spot cost is legible. - scripts/build_capped_trace.py: per-session turn cap (default 8). Generates the session-mass-equalized variant the TODO calls for so that hot-spot index can be re-measured with the heavy-tail removed. - scripts/b3_sweep.sh: orchestrates the 5-cell sweep. - GPU_INDICES makes it easy to skip a dead GPU. - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so usage.prompt_tokens_details.cached_tokens is populated. vLLM 0.18.1 omits the field by default and breaks the reuse-decomp pipeline; the smoke run surfaced this. - Trap kills EngineCore by name in addition to "vllm serve" — the parent dies first but the child holds GPU memory. Was the root cause of the 89 GB ghost on GPU 0 earlier today. - Proxy readiness is a polling loop, not a fixed sleep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 17:54:24 +08:00
Gahow Wang	763355b825	A5 fix: worker-id resolution and vLLM cmpl- rid stripping Smoke validation on dash0 surfaced three real bugs that broke interference and failure-attribution labels end-to-end: 1. endpoint_url in metrics is the proxy URL (e.g. http://h:9200); the vLLM worker URL lives in breakdown's routed_to. The interference index and label path were taking endpoint_url first, so every request looked routed to a non-existent worker and the overlap counter stayed at zero. 2. _normalize_worker hard-coded base port 8000, so a smoke run on port 9100 resolved to engine_1100 instead of engine_0. Added a --worker-map URL=engine_id CLI flag and _resolve_worker() that prefers the explicit map and falls back to the heuristic. 3. vLLM rewrites the per-step rid as cmpl-<proxy_id>-<i>-<hash>, so the str equality check between per_req rid and our proxy request_id never matched -> every prefill step looked like "other request prefill", which would have flipped overlap to 100%. Added _vllm_rid_matches() that strips the cmpl-/chatcmpl- prefix. After the fix, the same smoke run reports interference_index = 22.9 across 24 overlap / 6 clean requests on a single instance, which is the expected shape for serial dispatch into a cold engine. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:47:23 +08:00
Gahow Wang	cd82b8c2a2	PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined Captures 5 runs from the experiment matrix (combined-ca x3 seeds, pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl with cuda graphs enabled. The headline: combined-ca: TTFT p50 0.91s success 99.5% pdsep-4p4d: TTFT p50 62.8s success 52% (69x worse, half dropped) pdsep-6p2d: TTFT p50 51.1s success 68% (56x worse, third dropped) C2 (fig_c2): headline bars per config with error bars. C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep splits hit the memory wall, but the side differs by P:D ratio -- 4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures P-side). C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side prefill compute; D-side wait + first token is <=1.2s. The bottleneck is P-side prefill queueing, not D-side decode wait as the original analytical model assumed. system_analysis.md gains a Layer 5b that reconciles the analytical KV-wall model (which considered D-side only) with the empirical finding that the wall hits whichever side has fewer GPUs, and co-saturates both at extreme splits via D-side back-pressure. plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures. bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during this work but not used by the current matrix's data). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:23:52 +08:00
Gahow Wang	25445e3d18	A5: joined analysis with reuse decomp, interference, hot-spot, labels New analysis/characterization/joined_analysis.py joins replayer metrics.jsonl + proxy breakdown.json + worker_state.jsonl by request_id, plus engine_*.jsonl by worker_id, and emits: - joined.jsonl per-request merged record - reuse_decomposition.json real intra/cross/shared classification using session_id + hash_ids + cached_tokens - interference_index.json TPOT_p90(same-worker prefill overlap) / TPOT_p90(clean), per Batch 2 - hotspot_index.json max/median worker TTFT-p90, per Batch 3 - failure_label.jsonl per-slow-request cause label, per Batch 5 - failure_breakdown.json label histogram - window_summary.json SRR warmup/steady/drain aggregates Closes the analyzer side of Phase A; replaces the status: unavailable placeholders the existing scaffold emits when join sources are missing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:33 +08:00
Gahow Wang	f42c715ec1	A4: open-loop session-causal SRR loadgen New replayer/srr.py drives a Poisson session-arrival load against the existing proxy, with strict per-session turn sequentiality, explicit warmup/steady/drain windows, and per-arrival fresh session_id + request_id so APC/session-affinity counters are not contaminated by repeated draws from the trace pool. Writes window_summary.json with attempted/completed/errored split by window so latency tails can be read on the steady-state window only. Required by Batch 4 SRR sweep; trace-timestamp dispatch in replay.py cannot drive arrival rate independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:20 +08:00
Gahow Wang	5816aad731	A3: vLLM scheduler patch for step-level JSONL log When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line per scheduler step with t_unix, worker_id, prefill/decode token counts, n_running/n_waiting, preempted ids, and per-request phase labels. No-op when the env var is unset, so production engines are not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to each per-engine launch so step logs end up at engine_${i}.jsonl. Required by Batch 2 (PD-colo interference index) and Batch 5 (same-worker overlap attribution); engine /metrics polling cannot provide per-step granularity. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:11 +08:00
Gahow Wang	fe556b5d98	A2: proxy worker-state snapshot and request-id passthrough Honor incoming X-Request-Id so replayer metrics and proxy breakdown share a join key. Each route decision now captures session_id, the full per-worker candidate-score snapshot (ongoing/pending/num_requests /cached_blocks plus both linear and lmetric scores), the chosen score, and unix timestamps for first-token and done events. A separate _worker_state_log records one row per decision and is exposed via GET /worker_state; GET /worker_state/latest returns a live snapshot without recording it. Required by Batch 3 (session hot-spot proof) and Batch 5 (failure attribution); existing breakdown.json had no per-worker state at decision time. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:01 +08:00
Gahow Wang	d57e338366	A1: replayer instrumentation for cross-process join RequestMetrics gains absolute unix timestamps (t_dispatch_unix, t_first_token_unix, t_finish_unix), the proxy_request_id, the chosen endpoint URL, and the trace hash_ids. Replayer sends X-Request-Id: <session_id>:<turn_id>:<chat_id>:<idx> so proxy breakdown rows can be joined to metrics by exact key. Required by Batch 0 (online sequentiality proof) and Batch 1 reuse decomposition; existing metrics.jsonl couldn't establish either. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:18:52 +08:00
Gahow Wang	e5761fa6f3	Characterization plan: progress snapshot + Claude work plan - Add Progress Snapshot table to the intern TODO so per-batch status (DONE / partial / blocked-on-instrumentation) is visible at a glance. - New analysis/claude_characterization_work_plan.md scopes the Phase A instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2 (B4+B5) on dash0, with locked decisions for model, topology, trace, SLO style, and GPU phasing. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:18:41 +08:00
Gahow Wang	5ed6f6fe5b	Add characterization result figures	2026-05-25 15:15:10 +08:00
Gahow Wang	0f64fb3261	Add agentic workload characterization audit scaffold	2026-05-25 15:01:18 +08:00
Gahow Wang	21ffb3d4f7	PD-sep matrix infrastructure: bench.sh pdsep mode + matrix driver Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5) in the PD-sep paper section. Three pieces: 1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an --eager flag to re-enable --enforce-eager for the cuda-graph ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and swaps the proxy command from --combined to --prefill/--decode. baseline and elastic flows are unchanged. 2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr; --with-eager doubles to ~5 h with the cuda-graph ablation. Skips completed runs, captures per-instance vLLM logs (needed for C3 step-level KV-utilization mining). 3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's observed 6P+2D 97% KV utilization. The marker lands on the model's predicted curve at p90 input, confirming the steady-state analysis. README updated with the run command, output layout, and the followup plotters that consume outputs/pd_matrix/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:47:33 +08:00
Gahow Wang	4028c587b1	Paper section: system analysis + workload figures + KV-wall model Adds the system-level argument resolving the roofline/PD-sep paradox. Even at 95% cache reuse prefill stays compute-bound (the C6 roofline fact), yet PD separation regresses TTFT 72%. The new system_analysis.md walks through six layers showing why the roofline claim is necessary but not sufficient, with the falsifiable condition being decode-side KV memory budget: concurrent_decode * KV_per_req / (N_D * HBM_pool). For chatbot this ratio is << 1 at any layout; for agentic at p90+ context it goes >> 1 under 4P+4D and 6P+2D, predicting the empirical 97% decode KV occupancy. fig_kv_memory_wall.pdf visualizes the model with audit-able constants; fig_c1a/b ground the per-request KV-size inputs in the actual sampled trace (input p50=33.5k, p90=101k, intra-session reuse 79.2%). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:41:31 +08:00
Gahow Wang	d71a111099	Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is net negative under agentic workloads" paper section: plot scripts for C1 (workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7 PDFs already rendered, and a README mapping candidate claims to required figures plus open re-run items. Removes --enforce-eager from bench.sh and all active launch scripts so cuda graphs are captured -- the prior methodology suppressed one of PD-sep's structural advantages (D-node fixed-shape decode). Legacy scripts under scripts/legacy/ are intentionally untouched as historical records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:24:16 +08:00
Gahow Wang	6a27f75337	Docs: reconcile routing docs with current hybrid direction Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit `255c8e6`). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after `cc6e562` / 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:47:14 +08:00
Gahow Wang	ac6534c3ff	Cleanup: retire dead PUSH path + extract hybrid picker - Delete unreachable best_needs_push block in _handle_combined and the four orphaned helpers (_handle_cached_prefill_offload, _handle_direct_read_offload, _query_bootstrap_hit, _get_bootstrap_client). Their only caller was the retired PUSH gate; see REPORT §3.9 errata for the rejected experiments (`cc6e562`, `4c583f2`). - Extract pick_instance_unified_hybrid as a pure function returning (chosen, idx, decision_dict). The decision dict carries the review #7 breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio, avg_num_requests, fallback_score, tie_break_used). - Add LMetric-fallback tie-breaker (primary score, then new_uncached, num_requests, round-robin) so new sessions don't all pin to inst 0 when BS=0 across the board. - Drop the lmetric-policy affinity write so --policy lmetric stays affinity-free per review #3. - Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio / --decode-iteration-s as [DEPRECATED] in --help; flags remain accepted so scripts/bench.sh and legacy launchers don't break. - Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already rejected this knob (within noise). Future sweeps should go via CLI. Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering affinity-hit, overload break, low-cache fallback, tie-break rotation, lmetric purity, and breakdown field shape. 19/19 pass. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:46:57 +08:00
Gahow Wang	255c8e6884	Hybrid routing: LMetric for LB + explicit affinity for high-cache sessions Replace the full unified cost model with a simpler hybrid: - If session has >50% cache on affinity instance AND instance not overloaded (num_requests <= avg * overload_factor) → stick to affinity - Otherwise → use LMetric (P × BS) for best load balance This combines LMetric's superior load balance with explicit session affinity for high-value sessions that have significant cache accumulation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 09:05:08 +08:00
Gahow Wang	448361cf83	Update design doc: final results + review findings Unified routing (baseline mode) beats LMetric E2E mean/p50/p90. PD-sep offload consistently degrades performance (5-134 offloads tested). Independent review: fair comparison, no reward hacking, needs multi-run significance verification (running 3x paired test). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:48:18 +08:00
Gahow Wang	4c583f2f1c	Revert relaxed gate + push_cost fix: 134 offloads destroyed performance PD-sep offload overhead (C queue + prefill + KV transfer + D schedule) far exceeds any load balance benefit. With relaxed gate, cost model triggered 134 offloads → E2E p90 went from 37s to 82s. The proven winning configuration is Unified routing in baseline mode (no Mooncake connector), which beats LMetric on E2E mean/p50/p90 purely through better routing (contention-aware + session affinity). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 03:38:59 +08:00
Gahow Wang	bf4469a150	Fix cost model: accurate push_cost + aligned hard gate 1. push_cost now models both C and D: max(c_cost, d_cost) where c_cost includes C's queue + prefill, d_cost includes D's queue + RDMA overhead. Old formula only had D's contention + RDMA. 2. Hard gate uses num_requests instead of ongoing_tokens, aligning with the contention-based cost model. 3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 01:01:03 +08:00
Gahow Wang	1d2148cf65	Remove second push_new gate that caused downgrade-to-cold-LOCAL After _push_allowed was relaxed, the cost model correctly chose push for high-cache sessions on overloaded instances. But a second gate at execution time (push_new < heavy_threshold) blocked the actual offload, downgrading to LOCAL on the target instance — which had no cache. Worse, session affinity was already updated to the target, so all subsequent turns also hit cold prefill. This was the root cause of relaxed gate's performance regression: affinity broken + push blocked = worst of both worlds. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:42:31 +08:00
Gahow Wang	3ae99293fd	Relax _push_allowed: gate on request size, not cache savings The old gate blocked offload when push_new (= input - cache_hit) < 20K, which prevented migration of high-cache sessions — exactly the ones that benefit most. After PD-sep, the target receives full KV via RDMA and has the same cache as the source, so cache_hit is irrelevant to the offload decision. New gate: only check input_length >= heavy_threshold (request must be HEAVY) and max_offload_inflight (concurrency cap). Let the cost model decide whether the contention difference justifies migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-25 00:03:28 +08:00
Gahow Wang	cc6e5625bb	Revert Approach B (session migration): overhead exceeds LB benefit Reverts 3 commits: `e991960`, `5772149`, `5b1d360`. 57 migrations triggered but PD-sep overhead (C queue + KV transfer + D cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s. Migration mechanism needs fundamental rework before it can help. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 23:43:47 +08:00
Gahow Wang	5b1d36080a	Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg) The session migration path was calling _handle_cached_prefill_offload with swapped c_inst/d_inst and missing cache_hit parameter, causing TypeError on every migration attempt (13 of 41 errors in the test run). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 22:46:46 +08:00
Gahow Wang	5772149d36	Approach B v2: TTFT-based migration trigger Replace num_requests threshold with recent TTFT median as migration trigger. Track per-instance rolling TTFT (last 8 requests) and trigger migration when median > 5s (configurable). Target is the instance with lowest recent TTFT, requiring > 2x improvement to justify migration. This is more responsive than the instantaneous num_requests signal because TTFT directly measures the user-facing impact of contention. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 21:54:06 +08:00
Gahow Wang	45b82272c3	Add migration policy design doc with A/B experiment results Approach A (contention-aware cost model): TTFT p90 -52% vs baseline. Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 18:24:49 +08:00
Gahow Wang	e9919605af	Approach B: session-level lazy migration trigger When a request arrives for a session on an overloaded instance, force migration if three conditions hold: 1. Instance busy: num_requests > avg * migration_request_factor (1.5x) 2. Session has cache value: cache_ratio > 50% 3. Request is HEAVY (>= heavy_threshold) 4. A meaningfully less-loaded target exists (num_requests gap > 2) This bypasses the cost model for migration decisions — the cost model's cache-inflated costs prevented migration even when instances had 150s queue times with 99% cache hit. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:34:06 +08:00
Gahow Wang	e06de5144b	Approach A: contention-aware cost model with migration discount Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 17:24:27 +08:00
Gahow Wang	e13391eeab	Evict migrated blocks from prefix cache after KV send completes After a session migrates from C to D via offload, C's blocks were freed to the LRU tail (most-recently-used position), making them the last to be evicted. Since the session won't return to C, these blocks are dead weight occupying cache capacity. Now capture block IDs before _free_blocks and call evict_blocks to remove them from the prefix cache hash table, so they can be reused sooner for active sessions. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:56:34 +08:00
Gahow Wang	4b50c5a08d	Fix unified cost model: include decode load in queue + hard overload gate Two bugs caused elastic to concentrate load on cached instances (10x token imbalance vs 2.7x baseline): 1. _instance_cost queue only counted pending_prefill_tokens, missing ongoing_decode_tokens entirely — instances with 50 decoding requests appeared idle to the cost model. 2. Cache hits made overloaded instances look "cheap", creating a positive feedback loop: more sessions → more cache → lower cost → more routing. Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks affinity before the cost model runs, matching linear policy behavior. Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 16:25:02 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	cc4a9c91e7	Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash The standalone hash computation in estimate_hit produced different hashes than the hash_table (synced from scheduler). Root cause unclear (possibly pickle serialization differences or hash chain state). Fix: delegate to _lookup_by_tokens which is proven to work (push_blocks uses it). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 12:41:53 +08:00

1 2 3

138 Commits