The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops
to 2.26x at 65k. The naive reading is "interference gets weaker for
huge prefills"; the actual mechanism is a regime shift, and reading
TPOT p90 alone is misleading.
Three superimposed effects:
1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that
chunked-prefill keeps interleaving decode steps, so overlapping
decodes trickle tokens out at painful per-token rates. A 65k
prefill is long enough that overlapping decodes are *fully*
blocked for ~10s; once they break through, the injection is
winding down and subsequent iterations run unobstructed. The
cost lands on the TTFT clock (14s) instead of inflating TPOT.
2. Bimodal TPOT distribution. At 65k overlap, decodes split into
"blocked entire prefill then normal rate" and "trickled slowly
through prefill chunks". p99 sits on the second population and
grows 59 -> 169.5 ms; p90 sits on the first and shrinks.
3. "Clean" stops being clean. With 4x ~10s injections in 60s, the
110 "clean" decodes at 65k are squeezed into 2-3s recovery
pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (+39%), inflating
the denominator of the ratio and pulling the idx down.
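A quick arithmetic check of effect 3, assuming idx is defined as
TPOT_p90(overlap) / TPOT_p90(clean). The 65k overlap p90 below is
back-derived from the reported 2.26x and 9.6 ms; it is not a separately
measured value, and the function name is illustrative.

```python
def interference_idx(overlap_p90_ms: float, clean_p90_ms: float) -> float:
    """Ratio of overlap TPOT p90 to clean TPOT p90."""
    return overlap_p90_ms / clean_p90_ms


# Back-derive the 65k overlap p90 from the reported idx and clean p90.
overlap_p90_65k = 2.26 * 9.6  # about 21.7 ms (derived, not measured)

# Same numerator, two denominators: the reported 9.6 ms clean baseline
# versus the earlier 6.9 ms clean baseline.
idx_reported = interference_idx(overlap_p90_65k, 9.6)
idx_fixed_baseline = interference_idx(overlap_p90_65k, 6.9)
```

With the clean baseline held at 6.9 ms the same overlap p90 would read
as roughly 3.1x, so the denominator shift explains part of the drop
from 7.89x but not all of it.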
window_1_results.md adds a new B2 subsection laying out the
mechanism with the per-cell data table and the explicit reading
rule: headline interference metric is TTFT idx (monotone); TPOT
p99 is the right tail indicator; TPOT p90 alone is unsafe across
regime shifts. Direct implication: TTFT and TPOT need separate
SLO thresholds under PD-colo, because they measure costs from
different points in the request lifecycle and the cost migration
between them is workload-dependent.
current_results/characterization_claim_matrix.md adds a new
supported claim for the cost migration, listed against the existing
B2 evidence. current_results/reviewer_risk_register.md adds a
low-severity entry warning future readers off TPOT p90 alone.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Refresh the standing audit package now that B1' / B2 / B3 are complete.
current_results/characterization_claim_matrix.md
Flips seven entries from "not_yet_supported" / "partially_supported"
to "supported" with pointers into window_1_results/. New entries
cover per-session sequentiality, KV per request, real reuse
decomposition, theoretical APC ceiling, the LMetric locality gap,
Unified breaking the locality-vs-latency tradeoff, B2 causal
interference proof, sticky's interference inflation, and the
partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
"not_yet_supported" (Window 2 work).
current_results/main_claim_allowed_runs.md
New "Allowed For Routing-Policy Comparison" section pins the five
B3 policy directories. New "Allowed For PD-colo Interference"
section pins the B2 sweep. Legacy section retained for the
pre-instrumentation 200/500/1000-req runs.
current_results/reviewer_risk_register.md
Marks the two old "high"-severity risks (sequentiality / reuse
decomposition) as resolved; adds new entries for the APC
contamination empirics, the b3_analyze.sh truncate-write bug that
cost unified's interference index, the GPU-0 EngineCore ghost
cleanup, the saturated-replay caveat for trace-timestamp dispatch,
and the synthetic B2 decode workload.
current_results/all_figures_index.md
Adds the 8 new Window 1 figures alongside the existing 6 from the
legacy summarize_runs run.
current_results/reproduction_commands.sh
Records the full B3 + B2 + figure pipeline.
analysis/characterization_todo_for_interns.md
Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
only B4 and B5 remain (Window 2).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
analysis/characterization/window_1_results.md is the headline write-up
for Window 1: workload characterization (KV per request, real reuse
decomposition, APC theoretical ceilings), B3 5-policy sweep with
per-policy interpretation, B2 same-vs-different-worker interference
microbench with causal reading, and an explicit list of what Window 1
does *not* answer (deferred to B4 SRR sweep + B5 attribution).
Under window_1_results/:
- 5 raw result JSONs from the B3 sweep, the B2 microbench, the APC
upper bound, and the KV footprint
- per-policy hotspot_index.json snapshots so render_window1_figures.py
can plot per-worker TTFT p90 distributions
- 8 PNG figures (figures/) covering the headline claims
Three takeaways the figures pin down:
1) intra-session reuse dominates (93.2%), so session-affinity routing
is the right primary lever
2) unified hybrid affinity hits 79.4% APC (99.7% of the 79.6% intra-
session ceiling) AND cuts TTFT p90 from lmetric's 15.6s to 7.24s
3) B2 different-worker control sits at idx ≈ 1.0 across 32× prefill-
size variation; same-worker TTFT idx scales 2.15× -> 218×, which
is the cleanest causal evidence for same-worker prefill-decode
interference
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three CPU-only analysis pieces that turn raw Window 1 artifacts into
publishable numbers and figures.
scripts/compute_apc_upper_bound.py
Block-level trie walk over hash_ids to compute the theoretical APC
ceiling on a trace, decomposed into intra-session / any-session /
shared-prefix-only. Gives a fixed reference for what each routing
policy could *possibly* achieve. w600 result: 79.6% intra-session,
80.3% any-session, 0.1% shared-prefix.
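Minimal sketch of a block-level trie walk, assuming each request
carries a session id and an ordered list of block hash_ids, and that a
block counts as reusable only while the whole prefix up to it was seen
earlier. The function names and tuple layout are illustrative, not the
script's actual interface; the shared-prefix-only split is omitted.

```python
def _match_and_insert(trie: dict, hash_ids: list) -> int:
    """Count the leading blocks already in the trie, then insert the path."""
    node = trie
    matched = 0
    still_matching = True
    for h in hash_ids:
        if still_matching and h in node:
            matched += 1
        else:
            still_matching = False
        node = node.setdefault(h, {})
    return matched


def apc_ceiling(requests):
    """requests: iterable of (session_id, hash_ids) in arrival order.

    Returns (intra_session_fraction, any_session_fraction) of blocks
    that an ideal prefix cache could have served.
    """
    any_trie = {}
    per_session = {}
    total = intra = any_session = 0
    for session_id, hash_ids in requests:
        total += len(hash_ids)
        intra += _match_and_insert(per_session.setdefault(session_id, {}), hash_ids)
        any_session += _match_and_insert(any_trie, hash_ids)
    if total == 0:
        return 0.0, 0.0
    return intra / total, any_session / total
```

On a toy trace, apc_ceiling([("a", [1, 2, 3]), ("a", [1, 2, 3, 4]),
("b", [1, 2, 9])]) gives 0.3 intra-session and 0.5 any-session.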
analysis/characterization/b2_sweep_analysis.py (rewrite)
Previous version used joined_analysis.interference_index(), which
labeled a decode as overlapped whenever any other request ran a
prefill during it. With short-prompt decode load this is always
true (everyone's prefill overlaps everyone else's decode);
n_overlap was 239/240 even in the different-worker control.
New version labels overlap iff the decode's [t_first_token, t_finish]
intersects an actual large *injection* window, computed from the
cell's "prefill"-tagged metric rows. Different-worker control now
cleanly sits at idx ≈ 1.0, same-worker scales monotonically.
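The new labeling rule reduces to a closed-interval intersection test.
A minimal sketch, assuming injection windows are available as
(start, end) unix-time pairs; the function name is illustrative and
not taken from b2_sweep_analysis.py.

```python
from typing import Iterable, Tuple


def overlaps_injection(
    t_first_token: float,
    t_finish: float,
    injection_windows: Iterable[Tuple[float, float]],
) -> bool:
    """True iff [t_first_token, t_finish] intersects any injection window."""
    return any(
        t_first_token <= win_end and win_start <= t_finish
        for win_start, win_end in injection_windows
    )
```

Decodes that finish before an injection starts, or start after it
ends, land in the clean population.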
analysis/characterization/render_window1_figures.py
Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling
/ APC vs hotspot scatter / per-worker TTFT / failure breakdown,
B2 TPOT and TTFT curves (overlap vs clean and idx), reuse
decomposition, KV footprint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first B2 run produced metrics with ttft_s=null/tpot_s=null for
every decode request because the OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.
Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for
servers that drop token_ids despite the flag)
vLLM engine_state was fine; only the load-gen metric capture was
broken.
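Sketch of the chunk test after the fix, assuming each SSE chunk has
already been decoded into a dict. token_ids is the field named above;
the "text" and "delta.content" fallbacks are the usual OpenAI-style
streaming fields, and the function name is illustrative.

```python
def chunk_has_token(chunk: dict) -> bool:
    """True if a streaming chunk carries at least one generated token."""
    choices = chunk.get("choices") or []
    if not choices:
        return False
    choice = choices[0]
    # Preferred signal: explicit token ids (needs return_token_ids: true).
    if choice.get("token_ids"):
        return True
    # Fallback: any non-empty text delta counts as a token event.
    text = choice.get("text") or (choice.get("delta") or {}).get("content")
    return bool(text)
```

The caller would stamp a per-token timestamp whenever this returns
True, so TTFT and TPOT no longer depend on token_ids being present.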
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The hot-sweep variant of B3 writes one shared engine_state across
all policies; the isolated variant writes per-policy. Previously
slice_engine_state.py was called unconditionally and would
overwrite an isolated policy's real data with an empty slice (the
isolated policy's run-window doesn't overlap with the shared dir's
contents).
Now we check the policy directory's engine_state for any non-empty
engine_*.jsonl first; if present, use it directly; else slice from
the shared one as before.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/b3_isolated_policy.sh wraps one policy run in a fresh
8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy
-> replayer -> snapshot artifacts -> cleanup. Used when cross-
policy APC contamination matters more than the ~25-min vLLM
warmup overhead per policy.
Counterpart to the existing b3_sweep.sh which keeps vLLM warm
across all policies (faster but warm-cache; we found via the
sticky pre-flight that contamination is < 1% on this trace, so
b3_sweep.sh stays the default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Documents each pick_instance_* function from cache_aware_proxy.py in
pseudocode so the policy semantics can be cited without re-reading
implementation details. Covers lmetric (main baseline), load_only
(no cache / no affinity control), sticky (hard affinity control),
unified (gated affinity + LMetric fallback), and capped (lmetric on
a per-session turn-capped trace).
Includes a decision matrix that maps each policy to whether it uses
session affinity, cache awareness, load awareness, and overload
break, plus a one-liner per control explaining what comparison
isolates which factor.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads b3_policy_comparison.json (produced by b3_analyze.sh) and emits
a markdown report with three tables: headline latency + APC,
mechanism indices (interference / hotspot / reuse), and slow-request
cause breakdown. Rows for policies not yet present in the sweep are
left as "pending" so the same renderer can be re-invoked as each
policy finishes, producing an evolving report rather than waiting
for the full sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/slice_engine_state.py filters a shared engine_*.jsonl by a
[t_start_unix, t_end_unix] window. Needed because the patched
scheduler appends to one file per engine across the whole sweep;
per-policy analysis requires the per-policy slice.
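The slicing step is a plain timestamp filter. A minimal sketch,
assuming each JSONL row carries the t_unix field written by the
scheduler patch; the function name and in-memory interface are
illustrative.

```python
import json


def slice_step_log(lines, t_start_unix: float, t_end_unix: float) -> list:
    """Keep the step rows whose t_unix falls inside the closed window."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines from interrupted appends
        row = json.loads(line)
        if t_start_unix <= row["t_unix"] <= t_end_unix:
            kept.append(row)
    return kept
```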
scripts/b3_analyze.sh drives the slice + joined_analysis loop for
every policy directory in a completed sweep, then aggregates one row
per policy (latency percentiles, APC, interference_index,
hotspot_index, reuse fractions, failure-cause counts) into
b3_policy_comparison.json.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/b2_interference.py is the controlled microbench. It runs two
coroutines that bypass the proxy and hit the vLLM endpoints directly:
- decode_load: continuous short-prompt requests at fixed QPS into a
designated decode instance, to keep it decode-saturated.
- prefill_injections: N large one-token requests at fixed interval,
pointed at either the same instance (same-worker variant) or a
paired one (different-worker control).
Each cell (variant × prefill_size) gets its own metrics.jsonl plus a
run_window.json containing t_start_unix/t_end_unix. The shared
engine_*.jsonl from the scheduler patch is sliced by that window in
the aggregator.
analysis/characterization/b2_sweep_analysis.py walks the cell tree,
slices the per-worker step log by each cell's window, runs the A5
interference_index() against the slice, and emits a single
b2_sweep_summary.json with one row per cell. This is what feeds the
"interference vs uncached prefill size" figure.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three additions land together because B3's whole point is comparing
LMetric against meaningful controls.
- scripts/cache_aware_proxy.py: two new --policy values.
- load_only: pure min(num_requests) routing, no cache or affinity.
The B3 control that strips locality so the LMetric-vs-load gap is
legible.
- sticky: first turn goes to min-load, subsequent turns ALWAYS
return to the same instance, even under saturation. The B3
control that maxes out locality so the hot-spot cost is legible.
- scripts/build_capped_trace.py: per-session turn cap (default 8).
Generates the session-mass-equalized variant the TODO calls for so
that hot-spot index can be re-measured with the heavy-tail removed.
- scripts/b3_sweep.sh: orchestrates the 5-cell sweep.
- GPU_INDICES makes it easy to skip a dead GPU.
- EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so
usage.prompt_tokens_details.cached_tokens is populated. vLLM
0.18.1 omits the field by default and breaks the reuse-decomp
pipeline; the smoke run surfaced this.
- Trap kills EngineCore by name in addition to "vllm serve" — the
parent dies first but the child holds GPU memory. Was the root
cause of the 89 GB ghost on GPU 0 earlier today.
- Proxy readiness is a polling loop, not a fixed sleep.
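The two control policies can be sketched as follows, assuming the proxy
tracks one num_requests counter per instance and a session-to-instance
map. The signatures are hypothetical and not the pick_instance_*
functions in cache_aware_proxy.py.

```python
from typing import Dict, List


def pick_load_only(num_requests: List[int]) -> int:
    """Pure min-load routing: ignore cache state and session history."""
    return min(range(len(num_requests)), key=num_requests.__getitem__)


def pick_sticky(session_id: str, num_requests: List[int],
                affinity: Dict[str, int]) -> int:
    """First turn goes to min-load; later turns always return there."""
    if session_id not in affinity:
        affinity[session_id] = pick_load_only(num_requests)
    return affinity[session_id]
```

Note that pick_sticky never re-checks load after the first turn, which
is what makes the hot-spot cost visible.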
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Smoke validation on dash0 surfaced three real bugs that broke
interference and failure-attribution labels end-to-end:
1. endpoint_url in metrics is the proxy URL (e.g. http://h:9200);
the vLLM worker URL lives in breakdown's routed_to. The
interference index and label path were taking endpoint_url first,
so every request looked routed to a non-existent worker and the
overlap counter stayed at zero.
2. _normalize_worker hard-coded base port 8000, so a smoke run on
port 9100 resolved to engine_1100 instead of engine_0. Added a
--worker-map URL=engine_id CLI flag and _resolve_worker() that
prefers the explicit map and falls back to the heuristic.
3. vLLM rewrites the per-step rid as cmpl-<proxy_id>-<i>-<hash>, so
the str equality check between per_req rid and our proxy
request_id never matched -> every prefill step looked like
"other request prefill", which would have flipped overlap to
100%. Added _vllm_rid_matches() that strips the cmpl-/chatcmpl-
prefix.
After the fix, the same smoke run reports interference_index = 22.9
across 24 overlap / 6 clean requests on a single instance, which is
the expected shape for serial dispatch into a cold engine.
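Sketch of fixes 2 and 3, assuming the rid shape
cmpl-<proxy_id>-<i>-<hash> described above and a port-offset
heuristic inferred from the 9100 -> engine_1100 symptom. The names
mirror the helpers mentioned in this message, but the bodies are
illustrative and not the committed code.

```python
_VLLM_PREFIXES = ("chatcmpl-", "cmpl-")


def vllm_rid_matches(vllm_rid: str, proxy_request_id: str) -> bool:
    """Compare a vLLM per-step rid against our proxy request id."""
    for prefix in _VLLM_PREFIXES:
        if vllm_rid.startswith(prefix):
            vllm_rid = vllm_rid[len(prefix):]
            break
    # After the prefix, vLLM appends "-<i>-<hash>" to the proxy id.
    return (vllm_rid == proxy_request_id
            or vllm_rid.startswith(proxy_request_id + "-"))


def resolve_worker(url: str, worker_map: dict, base_port: int = 8000) -> str:
    """Prefer the explicit URL map; fall back to the port heuristic."""
    if url in worker_map:
        return worker_map[url]
    port = int(url.rsplit(":", 1)[1].split("/")[0])
    return f"engine_{port - base_port}"
```

Without a map entry, resolve_worker("http://h:9100", {}) still returns
engine_1100, which is the original bug; the map is what corrects it.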
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Captures 5 runs from the experiment matrix (combined-ca x3 seeds,
pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl
with cuda graphs enabled. The headline:
combined-ca: TTFT p50 0.91s success 99.5%
pdsep-4p4d: TTFT p50 62.8s success 52% (69x worse, half dropped)
pdsep-6p2d: TTFT p50 51.1s success 68% (56x worse, third dropped)
C2 (fig_c2): headline bars per config with error bars.
C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep
splits hit the memory wall, but the side differs by P:D ratio --
4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures
P-side).
C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side
prefill compute; D-side wait + first token is <=1.2s. The bottleneck
is P-side prefill queueing, not D-side decode wait as the original
analytical model assumed.
system_analysis.md gains a Layer 5b that reconciles the analytical
KV-wall model (which considered D-side only) with the empirical
finding that the wall hits whichever side has fewer GPUs, and
co-saturates both at extreme splits via D-side back-pressure.
plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.
bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during
this work but not used by the current matrix's data).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New analysis/characterization/joined_analysis.py joins replayer
metrics.jsonl + proxy breakdown.json + worker_state.jsonl by
request_id, plus engine_*.jsonl by worker_id, and emits:
- joined.jsonl per-request merged record
- reuse_decomposition.json real intra/cross/shared classification
using session_id + hash_ids + cached_tokens
- interference_index.json TPOT_p90(same-worker prefill overlap)
/ TPOT_p90(clean), per Batch 2
- hotspot_index.json max/median worker TTFT-p90, per Batch 3
- failure_label.jsonl per-slow-request cause label, per Batch 5
- failure_breakdown.json label histogram
- window_summary.json SRR warmup/steady/drain aggregates
Closes the analyzer side of Phase A; replaces the
status: unavailable placeholders the existing scaffold emits when
join sources are missing.
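Minimal sketch of the hotspot index as max/median of per-worker
TTFT p90, assuming a nearest-rank p90 and (worker_id, ttft_s) input
pairs. The percentile method and function names are assumptions;
joined_analysis.py may compute them differently.

```python
import math
from collections import defaultdict
from statistics import median


def p90(values):
    """Nearest-rank 90th percentile of a non-empty sequence."""
    ordered = sorted(values)
    rank = math.ceil(0.9 * len(ordered))
    return ordered[rank - 1]


def hotspot_index(rows):
    """rows: iterable of (worker_id, ttft_s). Returns max/median worker p90."""
    per_worker = defaultdict(list)
    for worker_id, ttft_s in rows:
        per_worker[worker_id].append(ttft_s)
    worker_p90 = [p90(v) for v in per_worker.values()]
    return max(worker_p90) / median(worker_p90)
```

A value near 1.0 means no worker stands out; the interference index
uses the same ratio shape with overlap and clean TPOT p90 instead.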
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
New replayer/srr.py drives a Poisson session-arrival load against the
existing proxy, with strict per-session turn sequentiality, explicit
warmup/steady/drain windows, and per-arrival fresh session_id +
request_id so APC/session-affinity counters are not contaminated by
repeated draws from the trace pool. Writes window_summary.json with
attempted/completed/errored split by window so latency tails can be
read on the steady-state window only.
Required by Batch 4 SRR sweep; trace-timestamp dispatch in replay.py
cannot drive arrival rate independently.
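Sketch of the arrival process, assuming exponential inter-arrival
gaps at a fixed session rate and back-to-back warmup/steady/drain
windows. Function names, parameters, and the seeding scheme are
illustrative, not replayer/srr.py's interface.

```python
import random


def poisson_arrivals(rate_per_s: float, duration_s: float, seed: int = 0):
    """Session arrival offsets (seconds) for a Poisson process."""
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_s)  # exponential gap, mean 1/rate
        if t >= duration_s:
            return arrivals
        arrivals.append(t)


def window_of(t: float, warmup_s: float, steady_s: float) -> str:
    """Classify an arrival offset into warmup, steady, or drain."""
    if t < warmup_s:
        return "warmup"
    if t < warmup_s + steady_s:
        return "steady"
    return "drain"
```

Tagging each arrival with its window at dispatch time is what lets the
summary report latency tails for the steady window alone.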
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line
per scheduler step with t_unix, worker_id, prefill/decode token
counts, n_running/n_waiting, preempted ids, and per-request phase
labels. No-op when the env var is unset, so production engines are
not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to
each per-engine launch so step logs end up at engine_${i}.jsonl.
Required by Batch 2 (PD-colo interference index) and Batch 5
(same-worker overlap attribution); engine /metrics polling cannot
provide per-step granularity.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Honor incoming X-Request-Id so replayer metrics and proxy breakdown
share a join key. Each route decision now captures session_id, the
full per-worker candidate-score snapshot (ongoing/pending/num_requests
/cached_blocks plus both linear and lmetric scores), the chosen score,
and unix timestamps for first-token and done events. A separate
_worker_state_log records one row per decision and is exposed via
GET /worker_state; GET /worker_state/latest returns a live snapshot
without recording it.
Required by Batch 3 (session hot-spot proof) and Batch 5 (failure
attribution); existing breakdown.json had no per-worker state at
decision time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
RequestMetrics gains absolute unix timestamps (t_dispatch_unix,
t_first_token_unix, t_finish_unix), the proxy_request_id, the chosen
endpoint URL, and the trace hash_ids. Replayer sends
X-Request-Id: <session_id>:<turn_id>:<chat_id>:<idx> so proxy
breakdown rows can be joined to metrics by exact key.
Required by Batch 0 (online sequentiality proof) and Batch 1 reuse
decomposition; existing metrics.jsonl couldn't establish either.
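Sketch of the join key, assuming the four fields are joined with ":"
as shown above and that only session_id may itself contain a colon.
The helper names are hypothetical.

```python
def make_request_id(session_id, turn_id, chat_id, idx) -> str:
    """Build the X-Request-Id value sent by the replayer."""
    return f"{session_id}:{turn_id}:{chat_id}:{idx}"


def parse_request_id(request_id: str):
    """Split from the right so a colon inside session_id survives."""
    session_id, turn_id, chat_id, idx = request_id.rsplit(":", 3)
    return session_id, turn_id, chat_id, idx
```

All parsed fields come back as strings; callers that need integer turn
or idx values would convert them explicitly.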
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Add Progress Snapshot table to the intern TODO so per-batch status
(DONE / partial / blocked-on-instrumentation) is visible at a glance.
- New analysis/claude_characterization_work_plan.md scopes the Phase A
instrumentation tasks (A1-A5) plus Window 1 (B1'+B2+B3) and Window 2
(B4+B5) on dash0, with locked decisions for model, topology, trace,
SLO style, and GPU phasing.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5)
in the PD-sep paper section. Three pieces:
1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an
--eager flag to re-enable --enforce-eager for the cuda-graph
ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and
swaps the proxy command from --combined to --prefill/--decode.
baseline and elastic flows are unchanged.
2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix
driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph
x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr;
--with-eager doubles to ~5 h with the cuda-graph ablation. Skips
completed runs, captures per-instance vLLM logs (needed for C3
step-level KV-utilization mining).
3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's
observed 6P+2D 97% KV utilization. The marker lands on the model's
predicted curve at p90 input, confirming the steady-state analysis.
README updated with the run command, output layout, and the followup
plotters that consume outputs/pd_matrix/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the system-level argument resolving the roofline/PD-sep paradox.
Even at 95% cache reuse prefill stays compute-bound (the C6 roofline
fact), yet PD separation regresses TTFT 72%. The new system_analysis.md
walks through six layers showing why the roofline claim is necessary
but not sufficient, with the falsifiable condition being decode-side
KV memory budget: concurrent_decode * KV_per_req / (N_D * HBM_pool).
For chatbot this ratio is << 1 at any layout; for agentic at p90+
context it goes >> 1 under 4P+4D and 6P+2D, predicting the empirical
97% decode KV occupancy. fig_kv_memory_wall.pdf visualizes the model
with audit-able constants; fig_c1a/b ground the per-request KV-size
inputs in the actual sampled trace (input p50=33.5k, p90=101k,
intra-session reuse 79.2%).
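The falsifiable ratio can be evaluated directly. In the sketch below
every numeric input is a made-up placeholder chosen only to show the
<< 1 and >> 1 regimes; none of them is taken from the trace or from
system_analysis.md.

```python
def decode_kv_pressure(concurrent_decode: int, kv_per_req_gb: float,
                       n_decode_gpus: int, hbm_pool_gb: float) -> float:
    """concurrent_decode * KV_per_req / (N_D * HBM_pool)."""
    return concurrent_decode * kv_per_req_gb / (n_decode_gpus * hbm_pool_gb)


# Hypothetical inputs, for illustration only.
chatbot_4d = decode_kv_pressure(64, 0.25, 4, 60.0)   # small KV per request
agentic_4d = decode_kv_pressure(64, 5.0, 4, 60.0)    # large KV, 4 decode GPUs
agentic_2d = decode_kv_pressure(64, 5.0, 2, 60.0)    # large KV, 2 decode GPUs
```

A ratio above 1 means the decode side cannot hold the working set, and
halving N_D doubles the ratio.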
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is
net negative under agentic workloads" paper section: plot scripts for C1
(workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7
PDFs already rendered, and a README mapping candidate claims to required
figures plus open re-run items.
Removes --enforce-eager from bench.sh and all active launch scripts so
cuda graphs are captured -- the prior methodology suppressed one of
PD-sep's structural advantages (D-node fixed-shape decode). Legacy
scripts under scripts/legacy/ are intentionally untouched as historical
records.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).
- REPORT.md §1.1 / §3.9: add errata callout and section header noting
the "Final Design" framing was retired after cc6e562 / 4c583f2;
point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current
hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
tie-breaker), then a "What Was Retired" commit table, then the old
Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
LMetric isn't "neutralized by affinity constraints" (pure --policy
lmetric has no affinity at all); it converges to similar placements
because P_tokens includes new_uncached_tokens, giving it implicit
soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the
"DOESN'T work" summary, plus a footer cross-referencing the current
routing direction.
- analysis/unified_routing_fix_review.md: track this file (was
untracked); it is the review handoff cited from the updated docs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Delete unreachable best_needs_push block in _handle_combined and the
four orphaned helpers (_handle_cached_prefill_offload,
_handle_direct_read_offload, _query_bootstrap_hit,
_get_bootstrap_client). Their only caller was the retired PUSH gate;
see REPORT §3.9 errata for the rejected experiments (cc6e562, 4c583f2).
- Extract pick_instance_unified_hybrid as a pure function returning
(chosen, idx, decision_dict). The decision dict carries the review #7
breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio,
avg_num_requests, fallback_score, tie_break_used).
- Add LMetric-fallback tie-breaker (primary score, then new_uncached,
num_requests, round-robin) so new sessions don't all pin to inst 0
when BS=0 across the board.
- Drop the lmetric-policy affinity write so --policy lmetric stays
affinity-free per review #3.
- Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio /
--decode-iteration-s as [DEPRECATED] in --help; flags remain accepted
so scripts/bench.sh and legacy launchers don't break.
- Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already
rejected this knob (within noise). Future sweeps should go via CLI.
Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering
affinity-hit, overload break, low-cache fallback, tie-break rotation,
lmetric purity, and breakdown field shape. 19/19 pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the full unified cost model with a simpler hybrid:
- If session has >50% cache on affinity instance AND instance not overloaded
(num_requests <= avg * overload_factor) → stick to affinity
- Otherwise → use LMetric (P × BS) for best load balance
This combines LMetric's superior load balance with explicit session
affinity for high-value sessions that have significant cache accumulation.
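A minimal sketch of the two-branch rule, assuming BS is approximated by
num_requests, P is a per-instance prefill-token estimate, and
overload_factor defaults to 2.0. The signature and return shape are
illustrative, not the proxy's actual function.

```python
def pick_hybrid(affinity_idx, cache_ratio, num_requests, p_tokens,
                overload_factor=2.0):
    """Return (instance index, reason) under the hybrid rule."""
    avg = sum(num_requests) / len(num_requests)
    if (affinity_idx is not None
            and cache_ratio > 0.5
            and num_requests[affinity_idx] <= avg * overload_factor):
        return affinity_idx, "affinity"
    # LMetric fallback: minimise P x BS; ties resolve to the lowest index.
    scores = [p * bs for p, bs in zip(p_tokens, num_requests)]
    return min(range(len(scores)), key=scores.__getitem__), "lmetric"
```

With all-zero batch sizes every score ties at 0 and this sketch always
returns instance 0, so a real fallback needs an explicit tie-breaker.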
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PD-sep offload overhead (C queue + prefill + KV transfer + D schedule)
far exceeds any load balance benefit. With relaxed gate, cost model
triggered 134 offloads → E2E p90 went from 37s to 82s.
The proven winning configuration is Unified routing in baseline mode
(no Mooncake connector), which beats LMetric on E2E mean/p50/p90
purely through better routing (contention-aware + session affinity).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. push_cost now models both C and D: max(c_cost, d_cost) where
c_cost includes C's queue + prefill, d_cost includes D's queue +
RDMA overhead. Old formula only had D's contention + RDMA.
2. Hard gate uses num_requests instead of ongoing_tokens, aligning
with the contention-based cost model.
3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After _push_allowed was relaxed, the cost model correctly chose push
for high-cache sessions on overloaded instances. But a second gate at
execution time (push_new < heavy_threshold) blocked the actual offload,
downgrading to LOCAL on the target instance — which had no cache.
Worse, session affinity was already updated to the target, so all
subsequent turns also hit cold prefill.
This was the root cause of the relaxed gate's performance regression:
broken affinity + blocked push = worst of both worlds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old gate blocked offload when push_new (= input - cache_hit) < 20K,
which prevented migration of high-cache sessions — exactly the ones
that benefit most. After PD-sep, the target receives full KV via RDMA
and has the same cache as the source, so cache_hit is irrelevant to
the offload decision.
New gate: only check input_length >= heavy_threshold (request must be
HEAVY) and max_offload_inflight (concurrency cap). Let the cost model
decide whether the contention difference justifies migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverts 3 commits: e991960, 5772149, 5b1d360.
57 migrations triggered but PD-sep overhead (C queue + KV transfer + D
cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s.
Migration mechanism needs fundamental rework before it can help.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The session migration path was calling _handle_cached_prefill_offload
with swapped c_inst/d_inst and missing cache_hit parameter, causing
TypeError on every migration attempt (13 of 41 errors in the test run).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace num_requests threshold with recent TTFT median as migration
trigger. Track per-instance rolling TTFT (last 8 requests) and trigger
migration when median > 5s (configurable). Target is the instance with
lowest recent TTFT, requiring > 2x improvement to justify migration.
This is more responsive than the instantaneous num_requests signal
because TTFT directly measures the user-facing impact of contention.
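Sketch of the trigger, assuming one bounded deque of recent TTFT
samples per instance. The class and method names are hypothetical; the
window of 8, the 5s threshold, and the 2x improvement rule come from
the description above.

```python
from collections import defaultdict, deque
from statistics import median


class TtftTracker:
    def __init__(self, window: int = 8):
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, inst: int, ttft_s: float) -> None:
        self._recent[inst].append(ttft_s)

    def median(self, inst: int):
        samples = self._recent.get(inst)
        return median(samples) if samples else None

    def migration_target(self, src: int, threshold_s: float = 5.0,
                         min_improvement: float = 2.0):
        """Instance to migrate to, or None if no migration is justified."""
        src_med = self.median(src)
        if src_med is None or src_med <= threshold_s:
            return None
        candidates = {i: median(s) for i, s in self._recent.items()
                      if i != src and s}
        if not candidates:
            return None
        best = min(candidates, key=candidates.get)
        # Require the source median to exceed 2x the best candidate's.
        return best if src_med > min_improvement * candidates[best] else None
```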
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold — needs tuning.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a request arrives for a session on an overloaded instance, force
migration if four conditions hold:
1. Instance busy: num_requests > avg * migration_request_factor (1.5x)
2. Session has cache value: cache_ratio > 50%
3. Request is HEAVY (>= heavy_threshold)
4. A meaningfully less-loaded target exists (num_requests gap > 2)
This bypasses the cost model for migration decisions — the cost model's
cache-inflated costs prevented migration even when instances had 150s
queue times with 99% cache hit.
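The gate can be sketched as one predicate, assuming per-instance
num_requests counters and a caller-supplied heavy_threshold. The
function and parameter names are illustrative; the 1.5x factor, 50%
cache ratio, and gap of 2 come from the list above.

```python
def should_force_migrate(src_requests: int, all_requests: list,
                         cache_ratio: float, input_length: int,
                         heavy_threshold: int,
                         migration_request_factor: float = 1.5,
                         min_gap: int = 2) -> bool:
    """True when all migration conditions hold for the source instance."""
    avg = sum(all_requests) / len(all_requests)
    busy = src_requests > avg * migration_request_factor
    has_cache = cache_ratio > 0.5
    heavy = input_length >= heavy_threshold
    # The least-loaded instance must be meaningfully less busy.
    target_exists = src_requests - min(all_requests) > min_gap
    return busy and has_cache and heavy and target_exists
```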
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After a session migrates from C to D via offload, C's blocks were freed
to the LRU tail (most-recently-used position), making them the last to
be evicted. Since the session won't return to C, these blocks are dead
weight occupying cache capacity.
Now capture block IDs before _free_blocks and call evict_blocks to
remove them from the prefix cache hash table, so they can be reused
sooner for active sessions.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs caused elastic to concentrate load on cached instances (10x token
imbalance vs 2.7x baseline):
1. _instance_cost queue only counted pending_prefill_tokens, missing
ongoing_decode_tokens entirely — instances with 50 decoding requests
appeared idle to the cost model.
2. Cache hits made overloaded instances look "cheap", creating a positive
feedback loop: more sessions → more cache → lower cost → more routing.
Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
affinity before the cost model runs, matching linear policy behavior.
Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.
- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
track failed_recving_block_ids for error recovery
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The standalone hash computation in estimate_hit produced different hashes
than the hash_table (synced from scheduler). Root cause unclear (possibly
pickle serialization differences or hash chain state). Fix: delegate to
_lookup_by_tokens which is proven to work (push_blocks uses it).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from
third_party/vllm to the pip-installed vllm's site-packages. C extensions
stay from the pip package; only Python files are overridden.
Usage: bash scripts/deploy_vllm_patches.sh [HOST]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
probing. Proxy queries this before committing to PUSH, eliminating
24% zero-match PUSH requests (shadow cache divergence).
C: Add _handle_cached_prefill_offload: C (cache source) does fast
cached prefill → KV to Mooncake → D pulls and decodes.
Replaces broken direct_read PUSH where D waited for RDMA transfer
while occupying KV blocks without doing compute.
Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without affinity, all cached requests route to the same instance
(cache source always has lowest prefill cost), causing 149s queue.
Fix: if the session's last instance has cost <= 2x the global best,
use it (preserves cache locality). Only re-route when the affinity
instance is significantly more expensive (overloaded).
The 2x threshold is intentionally loose — it's not a hardcoded magic
number but a "prefer locality unless clearly worse" heuristic.
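Sketch of the affinity rule, assuming a per-instance cost list and the
index of the session's last instance. The function name and the slack
parameter are illustrative.

```python
def pick_with_affinity(costs, last_idx, slack: float = 2.0) -> int:
    """Stay on the last instance unless it costs more than slack x best."""
    best = min(range(len(costs)), key=costs.__getitem__)
    if last_idx is not None and costs[last_idx] <= slack * costs[best]:
        return last_idx
    return best
```

For costs [10, 6, 30] a session last seen on instance 0 stays put
(10 <= 12); with costs [10, 4, 30] it re-routes to instance 1.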
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>