Commit Graph

229 Commits

Author SHA1 Message Date
0c3220cbb8 Window 1 results: combined B1' + B2 + B3 report and artifacts
analysis/characterization/window_1_results.md is the headline write-up
for Window 1: workload characterization (KV per request, real reuse
decomposition, APC theoretical ceilings), B3 5-policy sweep with
per-policy interpretation, B2 same-vs-different-worker interference
microbench with causal reading, and an explicit list of what Window 1
does *not* answer (deferred to B4 SRR sweep + B5 attribution).

Under window_1_results/:
- 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC
  upper bound, and the KV footprint
- per-policy hotspot_index.json snapshots so render_window1_figures.py
  can plot per-worker TTFT p90 distributions
- 8 PNG figures (figures/) covering the headline claims

Three takeaways the figures pin down:
1) intra-session reuse dominates (93.2%), so session-affinity routing
   is the right primary lever
2) unified hybrid affinity hits 79.4% APC (97% of the 79.6% intra-
   session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s
3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-
   size variation; same-worker TTFT idx scales 2.15× -> 218×, which
   is the cleanest causal evidence for same-worker prefill-decode
   interference

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:09 +08:00
b7902061d1 Window 1 analysis: APC upper bound, B2 window-overlap, figure renderer
Three CPU-only analysis pieces that turn raw Window 1 artifacts into
publishable numbers and figures.

scripts/compute_apc_upper_bound.py
  Block-level trie walk over hash_ids to compute the theoretical APC
  ceiling on a trace, decomposed into intra-session / any-session /
  shared-prefix-only. Gives a fixed reference for what each routing
  policy could *possibly* achieve. w600 result: 79.6% intra-session,
  80.3% any-session, 0.1% shared-prefix.

analysis/characterization/b2_sweep_analysis.py (rewrite)
  Previous version used joined_analysis.interference_index() which
  labeled overlap = "any prefill in any other request during this
  decode". With short-prompt decode load this is always true
  (everyone's prefill overlaps everyone else's decode); n_overlap
  was 239/240 even in the different-worker control.

  New version labels overlap iff the decode's [t_first_token, t_finish]
  intersects an actual large *injection* window, computed from the
  cell's "prefill"-tagged metric rows. Different-worker control now
  cleanly sits at idx ≈ 1.0, same-worker scales monotonically.

analysis/characterization/render_window1_figures.py
  Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling
  / APC vs hotspot scatter / per-worker TTFT / failure breakdown,
  B2 TPOT and TTFT curves (overlap vs clean and idx), reuse
  decomposition, KV footprint.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:24:54 +08:00
b9f324f2e6 B2 interference driver: request return_token_ids + text fallback
The first B2 run produced metrics with ttft_s=null/tpot_s=null for
every decode request because the OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.

Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for
  servers that drop token_ids despite the flag)

vLLM engine_state was fine; only the load-gen metric capture was
broken.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 22:39:54 +08:00
df3249925b B3 analyze: prefer per-policy engine_state over slicing shared dir
The hot-sweep variant of B3 writes one shared engine_state across
all policies; the isolated variant writes per-policy. Previously
slice_engine_state.py was called unconditionally and would
overwrite an isolated policy's real data with an empty slice (the
isolated policy's run-window doesn't overlap with the shared dir's
contents).

Now we check the policy directory's engine_state for any non-empty
engine_*.jsonl first; if present, use it directly; else slice from
the shared one as before.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 22:19:43 +08:00
1d87082ca1 B3: cold-start isolated policy runner (clean APC per cell)
scripts/b3_isolated_policy.sh wraps one policy run in a fresh
8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy
-> replayer -> snapshot artifacts -> cleanup. Used when cross-
policy APC contamination matters more than the ~25-min vLLM
warmup overhead per policy.

Counterpart to the existing b3_sweep.sh which keeps vLLM warm
across all policies (faster but warm-cache; we found via the
sticky pre-flight that contamination is < 1% on this trace, so
b3_sweep.sh stays the default).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 20:33:44 +08:00
08530b3915 B3 policies: pseudocode reference for the five-policy sweep
Documents each pick_instance_* function from cache_aware_proxy.py in
pseudocode so the policy semantics can be cited without re-reading
implementation details. Covers lmetric (main baseline), load_only
(no cache / no affinity control), sticky (hard affinity control),
unified (gated affinity + LMetric fallback), and capped (lmetric on
a per-session turn-capped trace).

Includes a decision matrix that maps each policy to whether it uses
session affinity, cache awareness, load awareness, and overload
break, plus a one-liner per control explaining what comparison
isolates which factor.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 19:57:02 +08:00
123a74a4b9 B3 report renderer: incremental markdown table from comparison JSON
Reads b3_policy_comparison.json (produced by b3_analyze.sh) and emits
a markdown report with three tables: headline latency + APC,
mechanism indices (interference / hotspot / reuse), and slow-request
cause breakdown. Rows for policies not yet present in the sweep are
left as "pending" so the same renderer can be re-invoked as each
policy finishes, producing an evolving report rather than waiting
for the full sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:58:21 +08:00
92db1c4370 B3 post-run helpers: engine_state slicer + per-policy aggregator
scripts/slice_engine_state.py filters a shared engine_*.jsonl by a
[t_start_unix, t_end_unix] window. Needed because the patched
scheduler appends to one file per engine across the whole sweep;
per-policy analysis requires the per-policy slice.

scripts/b3_analyze.sh drives the slice + joined_analysis loop for
every policy directory in a completed sweep, then aggregates one row
per policy (latency percentiles, APC, interference_index,
hotspot_index, reuse fractions, failure-cause counts) into
b3_policy_comparison.json.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 18:51:33 +08:00
e23128ad65 B2: PD-colo interference microbench harness + sweep aggregator
scripts/b2_interference.py is the controlled microbench. It runs two
coroutines against the open proxy bypass (direct vLLM endpoints):

- decode_load: continuous short-prompt requests at fixed QPS into a
  designated decode instance, to keep it decode-saturated.
- prefill_injections: N large one-token requests at fixed interval,
  pointed at either the same instance (same-worker variant) or a
  paired one (different-worker control).

Each cell (variant × prefill_size) gets its own metrics.jsonl plus a
run_window.json containing t_start_unix/t_end_unix. The shared
engine_*.jsonl from the scheduler patch is sliced by that window in
the aggregator.

analysis/characterization/b2_sweep_analysis.py walks the cell tree,
slices the per-worker step log by each cell's window, runs the A5
interference_index() against the slice, and emits a single
b2_sweep_summary.json with one row per cell. This is what feeds the
"interference vs uncached prefill size" figure.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 17:54:51 +08:00
c6b7c3471b B3: load_only + sticky policies, capped-trace builder, sweep driver
Three additions land together because B3's whole point is comparing
LMetric against meaningful controls.

- scripts/cache_aware_proxy.py: two new --policy values.
  - load_only: pure min(num_requests) routing, no cache or affinity.
    The B3 control that strips locality so the LMetric-vs-load gap is
    legible.
  - sticky: first turn goes to min-load, subsequent turns ALWAYS
    return to the same instance, even under saturation. The B3
    control that maxes out locality so the hot-spot cost is legible.
- scripts/build_capped_trace.py: per-session turn cap (default 8).
  Generates the session-mass-equalized variant the TODO calls for so
  that hot-spot index can be re-measured with the heavy-tail removed.
- scripts/b3_sweep.sh: orchestrates the 5-cell sweep.
  - GPU_INDICES makes it easy to skip a dead GPU.
  - EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so
    usage.prompt_tokens_details.cached_tokens is populated. vLLM
    0.18.1 omits the field by default and breaks the reuse-decomp
    pipeline; the smoke run surfaced this.
  - Trap kills EngineCore by name in addition to "vllm serve" — the
    parent dies first but the child holds GPU memory. Was the root
    cause of the 89 GB ghost on GPU 0 earlier today.
  - Proxy readiness is a polling loop, not a fixed sleep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 17:54:24 +08:00
763355b825 A5 fix: worker-id resolution and vLLM cmpl- rid stripping
Smoke validation on dash0 surfaced three real bugs that broke
interference and failure-attribution labels end-to-end:

1. endpoint_url in metrics is the proxy URL (e.g. http://h:9200);
   the vLLM worker URL lives in breakdown's routed_to. The
   interference index and label path were taking endpoint_url first,
   so every request looked routed to a non-existent worker and the
   overlap counter stayed at zero.
2. _normalize_worker hard-coded base port 8000, so a smoke run on
   port 9100 resolved to engine_1100 instead of engine_0. Added a
   --worker-map URL=engine_id CLI flag and _resolve_worker() that
   prefers the explicit map and falls back to the heuristic.
3. vLLM rewrites the per-step rid as cmpl-<proxy_id>-<i>-<hash>, so
   the str equality check between per_req rid and our proxy
   request_id never matched -> every prefill step looked like
   "other request prefill", which would have flipped overlap to
   100%. Added _vllm_rid_matches() that strips the cmpl-/chatcmpl-
   prefix.

After the fix, the same smoke run reports interference_index = 22.9
across 24 overlap / 6 clean requests on a single instance, which is
the expected shape for serial dispatch into a cold engine.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:47:23 +08:00
cd82b8c2a2 PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined
Captures 5 runs from the experiment matrix (combined-ca x3 seeds,
pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl
with cuda graphs enabled. The headline:

  combined-ca:  TTFT p50 0.91s   success 99.5%
  pdsep-4p4d:   TTFT p50 62.8s   success 52%   (69x worse, half dropped)
  pdsep-6p2d:   TTFT p50 51.1s   success 68%   (56x worse, third dropped)

C2 (fig_c2): headline bars per config with error bars.
C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep
  splits hit the memory wall, but the side differs by P:D ratio --
  4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures
  P-side).
C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side
  prefill compute; D-side wait + first token is <=1.2s. The bottleneck
  is P-side prefill queueing, not D-side decode wait as the original
  analytical model assumed.

system_analysis.md gains a Layer 5b that reconciles the analytical
KV-wall model (which considered D-side only) with the empirical
finding that the wall hits whichever side has fewer GPUs, and
co-saturates both at extreme splits via D-side back-pressure.

plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.
bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during
this work but not used by the current matrix's data).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:23:52 +08:00
25445e3d18 A5: joined analysis with reuse decomp, interference, hot-spot, labels
New analysis/characterization/joined_analysis.py joins replayer
metrics.jsonl + proxy breakdown.json + worker_state.jsonl by
request_id, plus engine_*.jsonl by worker_id, and emits:

- joined.jsonl              per-request merged record
- reuse_decomposition.json  real intra/cross/shared classification
                            using session_id + hash_ids + cached_tokens
- interference_index.json   TPOT_p90(same-worker prefill overlap)
                            / TPOT_p90(clean), per Batch 2
- hotspot_index.json        max/median worker TTFT-p90, per Batch 3
- failure_label.jsonl       per-slow-request cause label, per Batch 5
- failure_breakdown.json    label histogram
- window_summary.json       SRR warmup/steady/drain aggregates

Closes the analyzer side of Phase A; replaces the
status: unavailable placeholders the existing scaffold emits when
join sources are missing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:33 +08:00
f42c715ec1 A4: open-loop session-causal SRR loadgen
New replayer/srr.py drives a Poisson session-arrival load against the
existing proxy, with strict per-session turn sequentiality, explicit
warmup/steady/drain windows, and per-arrival fresh session_id +
request_id so APC/session-affinity counters are not contaminated by
repeated draws from the trace pool. Writes window_summary.json with
attempted/completed/errored split by window so latency tails can be
read on the steady-state window only.

Required by Batch 4 SRR sweep; trace-timestamp dispatch in replay.py
cannot drive arrival rate independently.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:20 +08:00
5816aad731 A3: vLLM scheduler patch for step-level JSONL log
When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line
per scheduler step with t_unix, worker_id, prefill/decode token
counts, n_running/n_waiting, preempted ids, and per-request phase
labels. No-op when the env var is unset, so production engines are
not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to
each per-engine launch so step logs end up at engine_${i}.jsonl.

Required by Batch 2 (PD-colo interference index) and Batch 5
(same-worker overlap attribution); engine /metrics polling cannot
provide per-step granularity.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:11 +08:00
fe556b5d98 A2: proxy worker-state snapshot and request-id passthrough
Honor incoming X-Request-Id so replayer metrics and proxy breakdown
share a join key. Each route decision now captures session_id, the
full per-worker candidate-score snapshot (ongoing/pending/num_requests
/cached_blocks plus both linear and lmetric scores), the chosen score,
and unix timestamps for first-token and done events. A separate
_worker_state_log records one row per decision and is exposed via
GET /worker_state; GET /worker_state/latest returns a live snapshot
without recording it.

Required by Batch 3 (session hot-spot proof) and Batch 5 (failure
attribution); existing breakdown.json had no per-worker state at
decision time.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:19:01 +08:00
d57e338366 A1: replayer instrumentation for cross-process join
RequestMetrics gains absolute unix timestamps (t_dispatch_unix,
t_first_token_unix, t_finish_unix), the proxy_request_id, the chosen
endpoint URL, and the trace hash_ids. Replayer sends
X-Request-Id: <session_id>:<turn_id>:<chat_id>:<idx> so proxy
breakdown rows can be joined to metrics by exact key.

Required by Batch 0 (online sequentiality proof) and Batch 1 reuse
decomposition; existing metrics.jsonl couldn't establish either.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:18:52 +08:00
e5761fa6f3 Characterization plan: progress snapshot + Claude work plan
- Add Progress Snapshot table to the intern TODO so per-batch status
  (DONE / partial / blocked-on-instrumentation) is visible at a glance.
- New analysis/claude_characterization_work_plan.md scopes the Phase A
  instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2
  (B4+B5) on dash0, with locked decisions for model, topology, trace,
  SLO style, and GPU phasing.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:18:41 +08:00
5ed6f6fe5b Add characterization result figures 2026-05-25 15:15:10 +08:00
0f64fb3261 Add agentic workload characterization audit scaffold 2026-05-25 15:01:18 +08:00
21ffb3d4f7 PD-sep matrix infrastructure: bench.sh pdsep mode + matrix driver
Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5)
in the PD-sep paper section. Three pieces:

  1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an
     --eager flag to re-enable --enforce-eager for the cuda-graph
     ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and
     swaps the proxy command from --combined to --prefill/--decode.
     baseline and elastic flows are unchanged.

  2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix
     driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph
     x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr;
     --with-eager doubles to ~5 h with the cuda-graph ablation. Skips
     completed runs, captures per-instance vLLM logs (needed for C3
     step-level KV-utilization mining).

  3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's
     observed 6P+2D 97% KV utilization. The marker lands on the model's
     predicted curve at p90 input, confirming the steady-state analysis.

README updated with the run command, output layout, and the followup
plotters that consume outputs/pd_matrix/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:47:33 +08:00
4028c587b1 Paper section: system analysis + workload figures + KV-wall model
Adds the system-level argument resolving the roofline/PD-sep paradox.
Even at 95% cache reuse prefill stays compute-bound (the C6 roofline
fact), yet PD separation regresses TTFT 72%. The new system_analysis.md
walks through six layers showing why the roofline claim is necessary
but not sufficient, with the falsifiable condition being decode-side
KV memory budget: concurrent_decode * KV_per_req / (N_D * HBM_pool).

For chatbot this ratio is << 1 at any layout; for agentic at p90+
context it goes >> 1 under 4P+4D and 6P+2D, predicting the empirical
97% decode KV occupancy. fig_kv_memory_wall.pdf visualizes the model
with audit-able constants; fig_c1a/b ground the per-request KV-size
inputs in the actual sampled trace (input p50=33.5k, p90=101k,
intra-session reuse 79.2%).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:41:31 +08:00
d71a111099 Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts
Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is
net negative under agentic workloads" paper section: plot scripts for C1
(workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7
PDFs already rendered, and a README mapping candidate claims to required
figures plus open re-run items.

Removes --enforce-eager from bench.sh and all active launch scripts so
cuda graphs are captured -- the prior methodology suppressed one of
PD-sep's structural advantages (D-node fixed-shape decode). Legacy
scripts under scripts/legacy/ are intentionally untouched as historical
records.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 11:24:16 +08:00
6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00
ac6534c3ff Cleanup: retire dead PUSH path + extract hybrid picker
- Delete unreachable best_needs_push block in _handle_combined and the
  four orphaned helpers (_handle_cached_prefill_offload,
  _handle_direct_read_offload, _query_bootstrap_hit,
  _get_bootstrap_client). Their only caller was the retired PUSH gate;
  see REPORT §3.9 errata for the rejected experiments (cc6e562, 4c583f2).

- Extract pick_instance_unified_hybrid as a pure function returning
  (chosen, idx, decision_dict). The decision dict carries the review #7
  breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio,
  avg_num_requests, fallback_score, tie_break_used).

- Add LMetric-fallback tie-breaker (primary score, then new_uncached,
  num_requests, round-robin) so new sessions don't all pin to inst 0
  when BS=0 across the board.

- Drop the lmetric-policy affinity write so --policy lmetric stays
  affinity-free per review #3.

- Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio /
  --decode-iteration-s as [DEPRECATED] in --help; flags remain accepted
  so scripts/bench.sh and legacy launchers don't break.

- Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already
  rejected this knob (within noise). Future sweeps should go via CLI.

Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering
affinity-hit, overload break, low-cache fallback, tie-break rotation,
lmetric purity, and breakdown field shape. 19/19 pass.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:46:57 +08:00
255c8e6884 Hybrid routing: LMetric for LB + explicit affinity for high-cache sessions
Replace the full unified cost model with a simpler hybrid:
- If session has >50% cache on affinity instance AND instance not overloaded
  (num_requests <= avg * overload_factor) → stick to affinity
- Otherwise → use LMetric (P × BS) for best load balance

This combines LMetric's superior load balance with explicit session
affinity for high-value sessions that have significant cache accumulation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 09:05:08 +08:00
448361cf83 Update design doc: final results + review findings
Unified routing (baseline mode) beats LMetric E2E mean/p50/p90.
PD-sep offload consistently degrades performance (5-134 offloads tested).
Independent review: fair comparison, no reward hacking, needs multi-run
significance verification (running 3x paired test).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 03:48:18 +08:00
4c583f2f1c Revert relaxed gate + push_cost fix: 134 offloads destroyed performance
PD-sep offload overhead (C queue + prefill + KV transfer + D schedule)
far exceeds any load balance benefit. With relaxed gate, cost model
triggered 134 offloads → E2E p90 went from 37s to 82s.

The proven winning configuration is Unified routing in baseline mode
(no Mooncake connector), which beats LMetric on E2E mean/p50/p90
purely through better routing (contention-aware + session affinity).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 03:38:59 +08:00
bf4469a150 Fix cost model: accurate push_cost + aligned hard gate
1. push_cost now models both C and D: max(c_cost, d_cost) where
   c_cost includes C's queue + prefill, d_cost includes D's queue +
   RDMA overhead. Old formula only had D's contention + RDMA.
2. Hard gate uses num_requests instead of ongoing_tokens, aligning
   with the contention-based cost model.
3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 01:01:03 +08:00
1d2148cf65 Remove second push_new gate that caused downgrade-to-cold-LOCAL
After _push_allowed was relaxed, the cost model correctly chose push
for high-cache sessions on overloaded instances. But a second gate at
execution time (push_new < heavy_threshold) blocked the actual offload,
downgrading to LOCAL on the target instance — which had no cache.
Worse, session affinity was already updated to the target, so all
subsequent turns also hit cold prefill.

This was the root cause of relaxed gate's performance regression:
affinity broken + push blocked = worst of both worlds.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 00:42:31 +08:00
3ae99293fd Relax _push_allowed: gate on request size, not cache savings
The old gate blocked offload when push_new (= input - cache_hit) < 20K,
which prevented migration of high-cache sessions — exactly the ones
that benefit most. After PD-sep, the target receives full KV via RDMA
and has the same cache as the source, so cache_hit is irrelevant to
the offload decision.

New gate: only check input_length >= heavy_threshold (request must be
HEAVY) and max_offload_inflight (concurrency cap). Let the cost model
decide whether the contention difference justifies migration.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 00:03:28 +08:00
cc6e5625bb Revert Approach B (session migration): overhead exceeds LB benefit
Reverts 3 commits: e991960, 5772149, 5b1d360.

57 migrations triggered but PD-sep overhead (C queue + KV transfer + D
cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s.
Migration mechanism needs fundamental rework before it can help.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 23:43:47 +08:00
5b1d36080a Fix B2 migration: correct offload call signature (c_inst/d_inst order + cache_hit arg)
The session migration path was calling _handle_cached_prefill_offload
with swapped c_inst/d_inst and missing cache_hit parameter, causing
TypeError on every migration attempt (13 of 41 errors in the test run).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 22:46:46 +08:00
5772149d36 Approach B v2: TTFT-based migration trigger
Replace num_requests threshold with recent TTFT median as migration
trigger. Track per-instance rolling TTFT (last 8 requests) and trigger
migration when median > 5s (configurable). Target is the instance with
lowest recent TTFT, requiring > 2x improvement to justify migration.

This is more responsive than the instantaneous num_requests signal
because TTFT directly measures the user-facing impact of contention.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 21:54:06 +08:00
45b82272c3 Add migration policy design doc with A/B experiment results
Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 18:24:49 +08:00
e9919605af Approach B: session-level lazy migration trigger
When a request arrives for a session on an overloaded instance, force
migration if three conditions hold:
1. Instance busy: num_requests > avg * migration_request_factor (1.5x)
2. Session has cache value: cache_ratio > 50%
3. Request is HEAVY (>= heavy_threshold)
4. A meaningfully less-loaded target exists (num_requests gap > 2)

This bypasses the cost model for migration decisions — the cost model's
cache-inflated costs prevented migration even when instances had 150s
queue times with 99% cache hit.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 17:34:06 +08:00
e06de5144b Approach A: contention-aware cost model with migration discount
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 17:24:27 +08:00
e13391eeab Evict migrated blocks from prefix cache after KV send completes
After a session migrates from C to D via offload, C's blocks were freed
to the LRU tail (most-recently-used position), making them the last to
be evicted. Since the session won't return to C, these blocks are dead
weight occupying cache capacity.

Now capture block IDs before _free_blocks and call evict_blocks to
remove them from the prefix cache hash table, so they can be reused
sooner for active sessions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 16:56:34 +08:00
4b50c5a08d Fix unified cost model: include decode load in queue + hard overload gate
Two bugs caused elastic to concentrate load on cached instances (10x token
imbalance vs 2.7x baseline):

1. _instance_cost queue only counted pending_prefill_tokens, missing
   ongoing_decode_tokens entirely — instances with 50 decoding requests
   appeared idle to the cost model.

2. Cache hits made overloaded instances look "cheap", creating a positive
   feedback loop: more sessions → more cache → lower cost → more routing.
   Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
   affinity before the cost model runs, matching linear policy behavior.

Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 16:25:02 +08:00
9cebdb6b9b Fix multi-turn replay fidelity: track realized output tokens across all components
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.

- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
  to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
  prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
  track failed_recving_block_ids for error recovery

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 14:47:51 +08:00
cc4a9c91e7 Fix estimate_hit: reuse _lookup_by_tokens instead of reimplementing hash
The standalone hash computation in estimate_hit produced different hashes
than the hash_table (synced from scheduler). Root cause unclear (possibly
pickle serialization differences or hash chain state). Fix: delegate to
_lookup_by_tokens which is proven to work (push_blocks uses it).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 12:41:53 +08:00
657812f8c4 Add deploy_vllm_patches.sh: sync third_party/vllm patches to site-packages
Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from
third_party/vllm to the pip-installed vllm's site-packages. C extensions
stay from the pip package; only Python files are overridden.

Usage: bash scripts/deploy_vllm_patches.sh [HOST]

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:59:52 +08:00
bf76273778 Add --offload-mode switch for ablation (direct_read vs cached_prefill)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:24:15 +08:00
cdf83493ab Fix A+C: real cache sync + cached-prefill-on-C architecture
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
   probing. Proxy queries this before committing to PUSH, eliminating
   24% zero-match PUSH requests (shadow cache divergence).

C: Add _handle_cached_prefill_offload: C (cache source) does fast
   cached prefill → KV to Mooncake → D pulls and decodes.
   Replaces broken direct_read PUSH where D waited for RDMA transfer
   while occupying KV blocks without doing compute.

Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 11:22:38 +08:00
2b9eae0d54 Report §3.9: Unified routing final results — TTFT -25%, E2E -7%
850/850, 0 errors. Single argmin(latency) with soft affinity.
116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL.
TPOT p90 +15% tradeoff from kv_both overhead.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 03:15:32 +08:00
97f4fe5164 Fix: rename inst->chosen in generate function (NameError crash)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:55:01 +08:00
5892739159 Add session affinity as soft preference in unified routing
Without affinity, all cached requests route to the same instance
(cache source always has lowest prefill cost), causing 149s queue.

Fix: if the session's last instance has cost <= 2x the global best,
use it (preserves cache locality). Only re-route when the affinity
instance is significantly more expensive (overloaded).

The 2x threshold is intentionally loose — it's not a hardcoded magic
number but a "prefer locality unless clearly worse" heuristic.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:37:58 +08:00
6b255fad91 Unified routing: single argmin(expected_latency) over all instances
Replace two-phase routing (pick_instance → offload gate) with a single
cost function evaluated per instance:

  latency(D) = queue(D) + prefill_time(D) + transfer_cost(D)

  - If D has local cache: prefill = (input - local_hit) / throughput
  - If D can receive PUSH from cache source: prefill = (input - push_hit) / throughput + rdma
  - Otherwise: prefill = input / throughput (cold)

Choose argmin(latency). If the winner needs PUSH → trigger migration.

Removed:
- WARM/MEDIUM/HEAVY classification (no routing purpose)
- heavy_threshold, overload_factor, max_offload_inflight, cache_gate_ratio
- Interference penalty magic number (0.3)
- Separate pick_instance + offload gate stages

Only 2 measured parameters remain:
- prefill_throughput = 7000 tokens/s (H20 measured)
- rdma_overhead_s = 0.1s (RDMA PUSH measured)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 02:21:34 +08:00
1cd0a18e2c Report §3.8: Document direct KV cache migration architecture + bugs fixed
Complete documentation of bootstrap-triggered PUSH implementation:
hash table sync, token-based lookup, RDMA WRITE path, cost model,
PYTHONHASHSEED requirement, and all 6 bugs fixed during development.

Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s
(vs local cache 0.338s, +0.03s overhead).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:52:38 +08:00
8c267ec54e VERIFIED: Bootstrap-triggered PUSH works end-to-end!
Test results:
- 640/640 blocks matched and pushed (ret=0)
- External prefix cache hit rate: 80.0% on D
- Turn 2 TTFT: inst_0 (cached) = 0.338s, inst_1 (RDMA push) = 0.367s
- C's scheduler was NOT involved (0 GPU compute on C)

The complete direct KV cache migration pipeline is working:
D → /push_blocks → C bootstrap matches tokens → C RDMA WRITE → D GPU

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 01:50:37 +08:00