v2.0 ran on B3 and triggered PD-sep only 2 / 1214 times (0.2%). The
gates were too conservative; the v2-vs-v1 latency gap (TTFT p90
7.35 -> 8.96 s) is therefore probably attributable to kv_both
always-on overhead, not to the PD-sep mechanism itself. v2.1 has two
fixes plus an isolation control.
Bug fix:
- The "chosen has live decodes worth protecting" gate combined
num_requests and ongoing_decode_tokens with AND, falling through
when EITHER was small. Under agentic workloads each worker rarely
stacks more than 1-2 concurrent requests, so the gate killed 84%
of v2.0 candidates that reached it. Replace with a pure
ongoing_decode_tokens == 0 check ("chosen_no_active_decode"):
same semantics, much higher recall.
Threshold relaxation (B2 microbench is the calibration source):
- pd_sep_min_new_tokens: 16000 -> 8000 (B2 TPOT idx 1.9x already
at 8k, TTFT idx 12x — strictly worth migrating)
- pd_sep_min_decodes_protected: 2 -> 1
- pd_sep_min_src_cache_tokens: 8000 -> 4000
- pd_sep_min_extra_cache_tokens: 4000 -> 2000
Isolation control:
- New --policy unified_kv_both option. Uses the exact same picker as
--policy unified but the vLLMs are launched in kv_role=kv_both
(the same launch mode unified_v2 requires). PD-sep never fires.
Compares against unified_v2 to attribute any v2 effect to the
PD-sep branch alone, not the kv_both always-on overhead.
- Both unified_kv_both and unified_v2 auto-enable kv_both launch in
b3_isolated_policy.sh.
Tests:
- Updated the existing "chosen has no decodes" test for the new
gate name and semantics.
- All 24 proxy tests pass.
Refs: window_1_results/v2_breakdown analysis (88.7% of candidates
caught by old new_local_below_threshold; 84% of the remainder
caught by the old few_decodes gate).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds a sixth routing policy --policy unified_v2 that wraps the
existing unified hybrid picker with a selective PD-sep branch.
When all of the following hold, a request is split prefill-on-src,
decode-on-chosen via Mooncake kv_role=kv_both transfer:
1. new_local = input_length - chosen.cache_hit > 16k
(B2 microbench shows same-worker TTFT idx >= 3x from this size up)
2. chosen has live decodes worth protecting (>= 2 in-flight)
3. some other instance holds materially more cache for this prefix
(>= 8k tokens, and >= 4k more than chosen)
4. cost(src_interference + RDMA xfer) + 0.2s margin < cost(chosen_interference)
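The four gates above can be read as one predicate. The following is a minimal sketch under the v2.0 thresholds listed here, not the proxy's actual code: the function name, the Candidate dataclass, and its field names are hypothetical, and the interference and transfer costs are passed in as precomputed seconds rather than modelled.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical per-instance view, used only for this sketch."""
    cache_hit: int     # cached prompt tokens this instance holds for the prefix
    live_decodes: int  # in-flight decode requests on this instance


def should_pd_sep(input_length, chosen, src,
                  chosen_interference_s, src_interference_s, xfer_s,
                  min_new_tokens=16_000, min_decodes=2,
                  min_src_cache=8_000, min_extra_cache=4_000,
                  margin_s=0.2):
    """Return True only when all four gates hold."""
    new_local = input_length - chosen.cache_hit
    if new_local <= min_new_tokens:                         # gate 1
        return False
    if chosen.live_decodes < min_decodes:                   # gate 2
        return False
    if src.cache_hit < min_src_cache:                       # gate 3, absolute
        return False
    if src.cache_hit - chosen.cache_hit < min_extra_cache:  # gate 3, relative
        return False
    # gate 4: splitting must beat staying local by at least the margin
    return src_interference_s + xfer_s + margin_s < chosen_interference_s
```

Ordering the cheap token-count checks before the cost comparison is a choice made for the sketch; the result does not depend on the order.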
The cost model is the audit-blessed shape from E1's post-mortem:
- gate on new_tokens (post-cache), NOT input_length (the old PUSH gate)
- bind to a single transfer mechanism (kv_both peer-to-peer pull)
- realistic RDMA cost as a function of bytes: 0.3s base +
bytes / 2.7 GB/s (calibrated against contention_16s_elastic p50)
- both source and target decode counts considered
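The RDMA term above is a fixed base plus a bandwidth term. A minimal sketch, assuming GB means 10^9 bytes; the parameter names and the tokens_to_bytes helper are illustrative, and kv_bytes_per_token must be supplied by the caller because no value is given here.

```python
def estimate_transfer_cost(num_bytes, base_overhead_s=0.3, effective_gb_per_s=2.7):
    """Seconds to move num_bytes of KV: fixed base plus bytes over bandwidth."""
    return base_overhead_s + num_bytes / (effective_gb_per_s * 1e9)


def tokens_to_bytes(num_tokens, kv_bytes_per_token):
    """Hypothetical helper: KV payload size for a given token count."""
    return num_tokens * kv_bytes_per_token
```

With these constants the base overhead dominates until the payload approaches roughly 0.8 GB, where the bandwidth term reaches 0.3 s.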
E2 mechanism-level patches not yet applied (this commit is policy-only).
Patches 6.2 / 6.3 / 6.5 remain on the table. Patch 6.6 (per-request
xfer timeout, 60s default) is implemented on the proxy side as an
httpx per-chunk read timeout on the dst streaming call, so a stuck
KV transfer fails the request instead of hanging for 600s.
cache_aware_proxy.py:
- Settings: kv_bytes_per_token, prefill_throughput_kv_both,
rdma_base_overhead_s, rdma_effective_gb_per_s, pd_sep_* gating knobs
- estimate_transfer_cost(bytes) replaces the constant rdma_overhead_s
- estimate_same_worker_interference_s(new_tokens, num_decodes) reads off
the B2 penalty curve in 4 bins
- pick_instance_unified_v2: inherits unified, returns extra
(src_inst, src_idx) tuple when PD-sep wins the cost compare
- _handle_combined_pd_sep_v2: prefill on src (do_remote_decode=True,
max_tokens=1), Mooncake xfer, decode-stream on dst with httpx
Timeout(read=pd_sep_xfer_timeout_s)
- --policy unified_v2 added to argparse choices
- lifespan auto-runs init_prefill_bootstrap when policy is unified_v2
b3_isolated_policy.sh:
- ENABLE_KV_BOTH env var, auto-set when POLICY=unified_v2, threads
kv_role=kv_both + VLLM_MOONCAKE_BOOTSTRAP_PORT to vllm and
--bootstrap-ports to the proxy
Tests: 8 new unit tests cover the gating predicates and the cost
estimators; all 32 proxy tests still pass.
Refs: E1 (PUSH post-mortem) + E2 (Mooncake audit) reports.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three fixes from the B3 audit:
1) joined_analysis.hotspot_index used sorted[n//2] as median, which
returns the ~60th percentile for n=8 (even-length). That element is
at or above the true median, so the hotspot index was systematically
understated. Recomputed values:
lmetric 2.238 -> 2.253 (+0.7%)
load_only 1.140 -> 1.294 (+13.5%)
sticky 2.349 -> 2.728 (+16.1%)
unified 3.350 -> 3.667 (+9.5%)
capped 1.937 -> 2.020 (+4.3%)
Qualitative ranking preserved; "capped only modestly reduces hotspot"
story holds with ~10% drop instead of the previously reported 13%.
Added test_hotspot_index_uses_true_median_for_even_n to lock in the
fix.
2) b3_analyze.sh's pct() helper used floor-indexed percentile
sorted[int(p*(n-1))], inconsistent with metrics._percentile and
joined_analysis._percentile which both use linear interpolation.
Now matches.
3) b3_sweep.sh's capped step called run_policy "capped", but the
proxy's argparse has no "capped" choice, so the hot-sweep variant
would have crashed on this step. The actual capped data was
produced via b3_isolated_policy.sh with --policy lmetric. Replace
the broken inline call with an explicit launch_proxy lmetric +
inline replayer block so the sweep script matches the data path
it documents.
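Fixes 1 and 2 are both indexing issues on a sorted list. The sketch below contrasts the old and new behaviour on an 8-element sample. The function names are illustrative, and the interpolation shown (between the two closest ranks at position p*(n-1)) is an assumed convention, not a copy of metrics._percentile.

```python
import statistics


def upper_middle(values):
    """Old behaviour: sorted[n // 2], the upper-middle element for even n."""
    s = sorted(values)
    return s[len(s) // 2]


def percentile_floor(values, p):
    """Old pct() behaviour: floor-indexed, no interpolation."""
    s = sorted(values)
    return s[int(p * (len(s) - 1))]


def percentile_linear(values, p):
    """Linear interpolation between the two closest ranks."""
    s = sorted(values)
    if not s:
        raise ValueError("percentile of empty sequence")
    pos = p * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


loads = [1, 2, 3, 4, 5, 6, 7, 8]
print(upper_middle(loads), statistics.median(loads))  # 5 4.5
print(percentile_floor(loads, 0.5), percentile_linear(loads, 0.5))  # 4 4.5
```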
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three CPU-only analysis pieces that turn raw Window 1 artifacts into
publishable numbers and figures.
scripts/compute_apc_upper_bound.py
Block-level trie walk over hash_ids to compute the theoretical APC
ceiling on a trace, decomposed into intra-session / any-session /
shared-prefix-only. Gives a fixed reference for what each routing
policy could *possibly* achieve. w600 result: 79.6% intra-session,
80.3% any-session, 0.1% shared-prefix.
analysis/characterization/b2_sweep_analysis.py (rewrite)
Previous version used joined_analysis.interference_index() which
labeled overlap = "any prefill in any other request during this
decode". With short-prompt decode load this is always true
(everyone's prefill overlaps everyone else's decode); n_overlap
was 239/240 even in the different-worker control.
New version labels overlap iff the decode's [t_first_token, t_finish]
intersects an actual large *injection* window, computed from the
cell's "prefill"-tagged metric rows. Different-worker control now
cleanly sits at idx ≈ 1.0, same-worker scales monotonically.
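The new labelling rule is a closed-interval intersection test. A minimal sketch, assuming injection windows are available as (start, end) pairs in unix seconds; how they are derived from the "prefill"-tagged rows is not shown, and both function names are illustrative.

```python
def intersects(a_start, a_end, b_start, b_end):
    """True when closed intervals [a_start, a_end] and [b_start, b_end] overlap."""
    return a_start <= b_end and b_start <= a_end


def label_overlap(t_first_token, t_finish, injection_windows):
    """Label a decode as overlapped iff it intersects any injection window."""
    return any(
        intersects(t_first_token, t_finish, w_start, w_end)
        for w_start, w_end in injection_windows
    )
```

A decode with no injection windows in its cell is always labelled clean, which is the behaviour the different-worker control relies on.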
analysis/characterization/render_window1_figures.py
Renders 8 PNGs from the result JSONs: B3 latency / APC vs ceiling
/ APC vs hotspot scatter / per-worker TTFT / failure breakdown,
B2 TPOT and TTFT curves (overlap vs clean and idx), reuse
decomposition, KV footprint.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The first B2 run produced metrics with ttft_s=null/tpot_s=null for
every decode request because the OpenAI-style payload did not set
return_token_ids: true, and the parser only inspected
choices[0].token_ids. With token_ids missing the loop skipped every
chunk, so no per-token timestamps were captured and the aggregator
returned interference_index=null on all 10 cells.
Fix:
- send return_token_ids: true in the payload (matches replayer.replay)
- also accept text-delta chunks as token signals (fallback for
servers that drop token_ids despite the flag)
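The two-step token signal can be sketched as below. This is an illustration, not the load generator's parser: the text and delta.content fields are assumed chunk shapes, and counting one token per non-empty text delta is a simplification, since a single chunk may carry several tokens.

```python
def chunk_token_count(chunk):
    """Return how many tokens a streamed chunk signals (0 if none)."""
    choices = chunk.get("choices") or []
    if not choices:
        return 0
    choice = choices[0]
    token_ids = choice.get("token_ids")
    if token_ids:
        # Preferred signal: exact token ids returned by the server.
        return len(token_ids)
    # Fallback: treat any non-empty text delta as one token event.
    text = choice.get("text") or (choice.get("delta") or {}).get("content")
    return 1 if text else 0
```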
vLLM engine_state was fine; only the load-gen metric capture was
broken.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The hot-sweep variant of B3 writes one shared engine_state across
all policies; the isolated variant writes per-policy. Previously
slice_engine_state.py was called unconditionally and would
overwrite an isolated policy's real data with an empty slice (the
isolated policy's run-window doesn't overlap with the shared dir's
contents).
Now we check the policy directory's engine_state for any non-empty
engine_*.jsonl first; if present, use it directly; else slice from
the shared one as before.
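The check described above amounts to a size test over the policy directory. A minimal sketch with hypothetical function names and return labels; the real script's directory layout and slicing call are not reproduced.

```python
from pathlib import Path


def has_real_engine_state(policy_engine_dir):
    """True if the directory holds at least one non-empty engine_*.jsonl."""
    d = Path(policy_engine_dir)
    if not d.is_dir():
        return False
    return any(p.stat().st_size > 0 for p in d.glob("engine_*.jsonl"))


def choose_engine_state_source(policy_engine_dir):
    """Prefer per-policy data; fall back to slicing the shared log."""
    if has_real_engine_state(policy_engine_dir):
        return "policy"
    return "shared_slice"
```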
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/b3_isolated_policy.sh wraps one policy run in a fresh
8-instance vLLM lifecycle: hard reset -> launch -> health -> proxy
-> replayer -> snapshot artifacts -> cleanup. Used when cross-
policy APC contamination matters more than the ~25-min vLLM
warmup overhead per policy.
Counterpart to the existing b3_sweep.sh which keeps vLLM warm
across all policies (faster but warm-cache; we found via the
sticky pre-flight that contamination is < 1% on this trace, so
b3_sweep.sh stays the default).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads b3_policy_comparison.json (produced by b3_analyze.sh) and emits
a markdown report with three tables: headline latency + APC,
mechanism indices (interference / hotspot / reuse), and slow-request
cause breakdown. Rows for policies not yet present in the sweep are
left as "pending" so the same renderer can be re-invoked as each
policy finishes, producing an evolving report rather than waiting
for the full sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/slice_engine_state.py filters a shared engine_*.jsonl by a
[t_start_unix, t_end_unix] window. Needed because the patched
scheduler appends to one file per engine across the whole sweep;
per-policy analysis requires the per-policy slice.
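Window slicing reduces to a timestamp filter over JSONL rows. A minimal sketch, assuming each row carries a t_unix field and that the window is inclusive on both ends; the function name is illustrative and file handling is left to the caller.

```python
import json


def slice_step_log(lines, t_start_unix, t_end_unix):
    """Keep the rows whose t_unix falls inside [t_start_unix, t_end_unix]."""
    kept = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in an append-only log
        row = json.loads(line)
        if t_start_unix <= row["t_unix"] <= t_end_unix:
            kept.append(row)
    return kept
```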
scripts/b3_analyze.sh drives the slice + joined_analysis loop for
every policy directory in a completed sweep, then aggregates one row
per policy (latency percentiles, APC, interference_index,
hotspot_index, reuse fractions, failure-cause counts) into
b3_policy_comparison.json.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
scripts/b2_interference.py is the controlled microbench. It runs two
coroutines against the open proxy bypass (direct vLLM endpoints):
- decode_load: continuous short-prompt requests at fixed QPS into a
designated decode instance, to keep it decode-saturated.
- prefill_injections: N large one-token requests at fixed interval,
pointed at either the same instance (same-worker variant) or a
paired one (different-worker control).
Each cell (variant × prefill_size) gets its own metrics.jsonl plus a
run_window.json containing t_start_unix/t_end_unix. The shared
engine_*.jsonl from the scheduler patch is sliced by that window in
the aggregator.
analysis/characterization/b2_sweep_analysis.py walks the cell tree,
slices the per-worker step log by each cell's window, runs the A5
interference_index() against the slice, and emits a single
b2_sweep_summary.json with one row per cell. This is what feeds the
"interference vs uncached prefill size" figure.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three additions land together because B3's whole point is comparing
LMetric against meaningful controls.
- scripts/cache_aware_proxy.py: two new --policy values.
- load_only: pure min(num_requests) routing, no cache or affinity.
The B3 control that strips locality so the LMetric-vs-load gap is
legible.
- sticky: first turn goes to min-load, subsequent turns ALWAYS
return to the same instance, even under saturation. The B3
control that maxes out locality so the hot-spot cost is legible.
- scripts/build_capped_trace.py: per-session turn cap (default 8).
Generates the session-mass-equalized variant the TODO calls for so
that hot-spot index can be re-measured with the heavy-tail removed.
- scripts/b3_sweep.sh: orchestrates the 5-cell sweep.
- GPU_INDICES makes it easy to skip a dead GPU.
- EXTRA_VLLM_ARGS defaults to --enable-prompt-tokens-details so
usage.prompt_tokens_details.cached_tokens is populated. vLLM
0.18.1 omits the field by default and breaks the reuse-decomp
pipeline; the smoke run surfaced this.
- Trap kills EngineCore by name in addition to "vllm serve" — the
parent dies first but the child holds GPU memory. Was the root
cause of the 89 GB ghost on GPU 0 earlier today.
- Proxy readiness is a polling loop, not a fixed sleep.
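The load_only and sticky controls described above can be sketched in a few lines. This is an illustration under assumptions: the Instance dataclass and function names are hypothetical, and ties in load_only go to the lowest index here, which the proxy may or may not do.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical stand-in for the proxy's per-instance state."""
    name: str
    num_requests: int = 0


def pick_load_only(instances):
    """Pure min(num_requests) routing: no cache or affinity input."""
    return min(range(len(instances)), key=lambda i: instances[i].num_requests)


def pick_sticky(instances, session_id, affinity):
    """First turn goes to min-load; later turns always return to that index."""
    if session_id not in affinity:
        affinity[session_id] = pick_load_only(instances)
    return affinity[session_id]
```

The sticky picker never re-reads load after the first turn, which is what lets it expose the hot-spot cost of maximal locality.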
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line
per scheduler step with t_unix, worker_id, prefill/decode token
counts, n_running/n_waiting, preempted ids, and per-request phase
labels. No-op when the env var is unset, so production engines are
not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to
each per-engine launch so step logs end up at engine_${i}.jsonl.
Required by Batch 2 (PD-colo interference index) and Batch 5
(same-worker overlap attribution); engine /metrics polling cannot
provide per-step granularity.
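The env-var gating can be sketched as follows. This is not the scheduler patch itself: only t_unix and worker_id are field names taken from the description above, the other keys are hypothetical, and a real scheduler would likely keep the file handle open instead of reopening it every step.

```python
import json
import os
import time


def emit_step(worker_id, n_prefill_tokens, n_decode_tokens, n_running, n_waiting):
    """Append one JSONL row per step; do nothing when the env var is unset."""
    path = os.environ.get("AGENTIC_STEP_LOG_PATH")
    if not path:
        return False  # no-op path: production engines are unaffected
    row = {
        "t_unix": time.time(),
        "worker_id": worker_id,
        "prefill_tokens": n_prefill_tokens,  # hypothetical key name
        "decode_tokens": n_decode_tokens,    # hypothetical key name
        "n_running": n_running,
        "n_waiting": n_waiting,
    }
    with open(path, "a") as f:
        f.write(json.dumps(row) + "\n")
    return True
```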
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Honor incoming X-Request-Id so replayer metrics and proxy breakdown
share a join key. Each route decision now captures session_id, the
full per-worker candidate-score snapshot (ongoing/pending/num_requests
/cached_blocks plus both linear and lmetric scores), the chosen score,
and unix timestamps for first-token and done events. A separate
_worker_state_log records one row per decision and is exposed via
GET /worker_state; GET /worker_state/latest returns a live snapshot
without recording it.
Required by Batch 3 (session hot-spot proof) and Batch 5 (failure
attribution); existing breakdown.json had no per-worker state at
decision time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5)
in the PD-sep paper section. Three pieces:
1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an
--eager flag to re-enable --enforce-eager for the cuda-graph
ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and
swaps the proxy command from --combined to --prefill/--decode.
baseline and elastic flows are unchanged.
2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix
driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph
x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr;
--with-eager doubles to ~5 h with the cuda-graph ablation. Skips
completed runs, captures per-instance vLLM logs (needed for C3
step-level KV-utilization mining).
3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's
observed 6P+2D 97% KV utilization. The marker lands on the model's
predicted curve at p90 input, confirming the steady-state analysis.
README updated with the run command, output layout, and the followup
plotters that consume outputs/pd_matrix/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is
net negative under agentic workloads" paper section: plot scripts for C1
(workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7
PDFs already rendered, and a README mapping candidate claims to required
figures plus open re-run items.
Removes --enforce-eager from bench.sh and all active launch scripts so
cuda graphs are captured -- the prior methodology suppressed one of
PD-sep's structural advantages (D-node fixed-shape decode). Legacy
scripts under scripts/legacy/ are intentionally untouched as historical
records.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
- Delete unreachable best_needs_push block in _handle_combined and the
four orphaned helpers (_handle_cached_prefill_offload,
_handle_direct_read_offload, _query_bootstrap_hit,
_get_bootstrap_client). Their only caller was the retired PUSH gate;
see REPORT §3.9 errata for the rejected experiments (cc6e562, 4c583f2).
- Extract pick_instance_unified_hybrid as a pure function returning
(chosen, idx, decision_dict). The decision dict carries the review #7
breakdown fields (decision, affinity_idx/chosen_idx, cache_hit/ratio,
avg_num_requests, fallback_score, tie_break_used).
- Add LMetric-fallback tie-breaker (primary score, then new_uncached,
num_requests, round-robin) so new sessions don't all pin to inst 0
when BS=0 across the board.
- Drop the lmetric-policy affinity write so --policy lmetric stays
affinity-free per review #3.
- Mark --max-offload-inflight / --offload-mode / --cache-gate-ratio /
--decode-iteration-s as [DEPRECATED] in --help; flags remain accepted
so scripts/bench.sh and legacy launchers don't break.
- Revert uncommitted overload_factor 2.0->1.5 default; H7 sweep already
rejected this knob (within noise). Future sweeps should go via CLI.
Tests: add 6 hybrid-policy tests in tests/test_proxy_pick.py covering
affinity-hit, overload break, low-cache fallback, tie-break rotation,
lmetric purity, and breakdown field shape. 19/19 pass.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Replace the full unified cost model with a simpler hybrid:
- If session has >50% cache on affinity instance AND instance not overloaded
(num_requests <= avg * overload_factor) → stick to affinity
- Otherwise → use LMetric (P × BS) for best load balance
This combines LMetric's superior load balance with explicit session
affinity for high-value sessions that have significant cache accumulation.
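The two-branch rule can be sketched as below. Assumptions are labelled: the Inst dataclass and function name are hypothetical, the P and BS terms are supplied by the caller because their exact definitions are not given here, and the overload_factor default of 2.0 is a placeholder.

```python
from dataclasses import dataclass


@dataclass
class Inst:
    """Hypothetical per-instance state for this sketch."""
    num_requests: int
    p: float   # the "P" term of the LMetric score, caller-defined
    bs: float  # the "BS" term of the LMetric score, caller-defined


def pick_hybrid(instances, affinity_idx, cache_ratio, overload_factor=2.0):
    """Stick to affinity when cache is valuable and load is sane, else LMetric."""
    avg = sum(i.num_requests for i in instances) / len(instances)
    if affinity_idx is not None and cache_ratio > 0.5:
        if instances[affinity_idx].num_requests <= avg * overload_factor:
            return affinity_idx, "affinity"
    best = min(range(len(instances)),
               key=lambda i: instances[i].p * instances[i].bs)
    return best, "lmetric"
```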
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PD-sep offload overhead (C queue + prefill + KV transfer + D schedule)
far exceeds any load-balance benefit. With the relaxed gate, the cost
model triggered 134 offloads and E2E p90 went from 37s to 82s.
The proven winning configuration is Unified routing in baseline mode
(no Mooncake connector), which beats LMetric on E2E mean/p50/p90
purely through better routing (contention-aware + session affinity).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1. push_cost now models both C and D: max(c_cost, d_cost) where
c_cost includes C's queue + prefill, d_cost includes D's queue +
RDMA overhead. Old formula only had D's contention + RDMA.
2. Hard gate uses num_requests instead of ongoing_tokens, aligning
with the contention-based cost model.
3. Fix migration_discount: min(cap, 5) instead of hardcoded min(cap, 3).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After _push_allowed was relaxed, the cost model correctly chose push
for high-cache sessions on overloaded instances. But a second gate at
execution time (push_new < heavy_threshold) blocked the actual offload,
downgrading to LOCAL on the target instance — which had no cache.
Worse, session affinity was already updated to the target, so all
subsequent turns also hit cold prefill.
This was the root cause of the relaxed gate's performance regression:
affinity broken + push blocked = worst of both worlds.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The old gate blocked offload when push_new (= input - cache_hit) < 20K,
which prevented migration of high-cache sessions — exactly the ones
that benefit most. After PD-sep, the target receives full KV via RDMA
and has the same cache as the source, so cache_hit is irrelevant to
the offload decision.
New gate: only check input_length >= heavy_threshold (request must be
HEAVY) and max_offload_inflight (concurrency cap). Let the cost model
decide whether the contention difference justifies migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reverts 3 commits: e991960, 5772149, 5b1d360.
57 migrations triggered but PD-sep overhead (C queue + KV transfer + D
cold start) caused HEAVY TTFT p90 to regress from 15.9s to 59.1s.
Migration mechanism needs fundamental rework before it can help.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The session migration path was calling _handle_cached_prefill_offload
with swapped c_inst/d_inst and missing cache_hit parameter, causing
TypeError on every migration attempt (13 of 41 errors in the test run).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace num_requests threshold with recent TTFT median as migration
trigger. Track per-instance rolling TTFT (last 8 requests) and trigger
migration when median > 5s (configurable). Target is the instance with
lowest recent TTFT, requiring > 2x improvement to justify migration.
This is more responsive than the instantaneous num_requests signal
because TTFT directly measures the user-facing impact of contention.
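The trigger can be sketched with a bounded deque per instance. The class and function names are illustrative, and skipping instances that have no samples yet is an assumption about behaviour the description does not cover.

```python
from collections import deque
from statistics import median


class RollingTTFT:
    """Keeps the last `window` TTFT samples for one instance."""

    def __init__(self, window=8):
        self._samples = deque(maxlen=window)

    def record(self, ttft_s):
        self._samples.append(ttft_s)

    def median(self):
        return median(self._samples) if self._samples else None


def pick_migration_target(trackers, src_idx, trigger_s=5.0, min_improvement=2.0):
    """Return the target index, or None when migration is not justified."""
    src_med = trackers[src_idx].median()
    if src_med is None or src_med <= trigger_s:
        return None
    candidates = [
        (t.median(), i)
        for i, t in enumerate(trackers)
        if i != src_idx and t.median() is not None
    ]
    if not candidates:
        return None
    best_med, best_idx = min(candidates)
    # Require the source median to exceed min_improvement times the target's.
    return best_idx if src_med > min_improvement * best_med else None
```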
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When a request arrives for a session on an overloaded instance, force
migration if four conditions hold:
1. Instance busy: num_requests > avg * migration_request_factor (1.5x)
2. Session has cache value: cache_ratio > 50%
3. Request is HEAVY (>= heavy_threshold)
4. A meaningfully less-loaded target exists (num_requests gap > 2)
This bypasses the cost model for migration decisions — the cost model's
cache-inflated costs prevented migration even when instances had 150s
queue times with 99% cache hit.
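The numbered conditions above combine into a single conjunction. A minimal sketch with illustrative names; heavy_threshold has no default because no value is given here, and the caller is assumed to pass the load of the least-loaded candidate target.

```python
def should_force_migrate(src_requests, avg_requests, cache_ratio,
                         input_length, best_target_requests, heavy_threshold,
                         migration_request_factor=1.5,
                         min_cache_ratio=0.5, min_request_gap=2):
    """True when every forced-migration condition holds."""
    busy = src_requests > avg_requests * migration_request_factor
    has_cache_value = cache_ratio > min_cache_ratio
    heavy = input_length >= heavy_threshold
    target_exists = src_requests - best_target_requests > min_request_gap
    return busy and has_cache_value and heavy and target_exists
```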
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs caused elastic to concentrate load on cached instances (10x token
imbalance vs 2.7x baseline):
1. _instance_cost queue only counted pending_prefill_tokens, missing
ongoing_decode_tokens entirely — instances with 50 decoding requests
appeared idle to the cost model.
2. Cache hits made overloaded instances look "cheap", creating a positive
feedback loop: more sessions → more cache → lower cost → more routing.
Added a hard gate (ongoing_tokens > avg * overload_factor) that breaks
affinity before the cost model runs, matching linear policy behavior.
Result: token imbalance 10.3x → 2.6x, TTFT p90 -37% vs baseline.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The replayer and proxy were building multi-turn prompts from trace tokens,
but the model generates different output tokens. Subsequent turns had wrong
prefix tokens, causing cache misses and invalid experimental measurements.
- replay.py: min_tokens=max_tokens for deterministic length, return_token_ids
to capture actual output, _apply_realized_prefix for next-turn correction
- proxy: extract output token_ids from SSE, record prompt+output as realized
prefix in shadow cache, extract _handle_local_request to deduplicate
- bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy
- mooncake_connector: only send prompt blocks (not stale output blocks),
track failed_recving_block_ids for error recovery
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Copies mooncake_connector.py, mooncake_utils.py, scheduler.py from
third_party/vllm to the pip-installed vllm's site-packages. C extensions
stay from the pip package; only Python files are overridden.
Usage: bash scripts/deploy_vllm_patches.sh [HOST]
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
A: Add /estimate_hit endpoint to bootstrap server for real-time cache
probing. Proxy queries this before committing to PUSH, eliminating
24% zero-match PUSH requests (shadow cache divergence).
C: Add _handle_cached_prefill_offload: C (cache source) does fast
cached prefill → KV to Mooncake → D pulls and decodes.
Replaces broken direct_read PUSH where D waited for RDMA transfer
while occupying KV blocks without doing compute.
Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Without affinity, all cached requests route to the same instance
(cache source always has lowest prefill cost), causing 149s queue.
Fix: if the session's last instance has cost <= 2x the global best,
use it (preserves cache locality). Only re-route when the affinity
instance is significantly more expensive (overloaded).
The 2x threshold is intentionally loose — it's not a hardcoded magic
number but a "prefer locality unless clearly worse" heuristic.
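The rule above can be sketched over a plain list of per-instance costs. The function name and the tolerance parameter are illustrative; the costs themselves come from the existing cost model and are not recomputed here.

```python
def pick_with_affinity(costs, affinity_idx, tolerance=2.0):
    """Prefer the session's last instance unless it is clearly more expensive."""
    best_idx = min(range(len(costs)), key=costs.__getitem__)
    if affinity_idx is not None and costs[affinity_idx] <= tolerance * costs[best_idx]:
        return affinity_idx
    return best_idx
```

For example, with costs [3.0, 2.0] and affinity on instance 0 the session stays put, while [5.0, 2.0] re-routes it to instance 1.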
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause confirmed: NONE_HASH = os.urandom(32) differs between
scheduler and bootstrap server even in the same process (init_none_hash
called separately by each import path). PYTHONHASHSEED makes it
deterministic: NONE_HASH = hash_fn(seed), same across all code paths.
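Why a differing root breaks every lookup can be shown with a toy hash chain. This sketch uses SHA-256 over a simple byte encoding and is not vLLM's hash function; chain_block_hashes and root_from_seed are hypothetical names.

```python
import hashlib


def chain_block_hashes(root, blocks):
    """Hash each block together with its parent hash, starting from root."""
    hashes = []
    parent = root
    for block in blocks:
        payload = parent + b"|" + ",".join(map(str, block)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


def root_from_seed(seed):
    """Deterministic root: the same seed gives the same root everywhere."""
    return hashlib.sha256(seed.encode()).digest()


blocks = [[1, 2, 3, 4], [5, 6, 7, 8]]
same_a = chain_block_hashes(root_from_seed("0"), blocks)
same_b = chain_block_hashes(root_from_seed("0"), blocks)
other = chain_block_hashes(root_from_seed("1"), blocks)
print(same_a == same_b)  # True
print(same_a[0] == other[0])  # False
```

Because every hash feeds the next one, a mismatch at the root propagates to all later blocks, so identical token content still produces disjoint hash sets.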
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause: each vLLM instance has a random NONE_HASH (os.urandom(32))
when PYTHONHASHSEED is not set. All block hashes are chained from
NONE_HASH, so D's hashes never match C's hashes.
Fix: C's bootstrap server now accepts token_ids and does the prefix
cache lookup locally using C's own hash function and block pool.
No cross-instance hash matching needed.
New flow: D sends prompt token_ids → C computes hashes on C's side →
C looks up in C's own BlockPool → returns block_ids.
Also: module-level _shared_block_pool for scheduler→bootstrap bridge,
prompt_token_ids passed through PullReqMeta, test script added.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Old cost model: offload_cost = colocated_cost + RDMA_overhead, so offload
was always 0.1s more expensive. Result: only 19/117 HEAVY offloaded.
New: colocated_cost includes interference penalty when C_s has decode
requests: penalty = prefill_time × min(num_requests, 3) × 0.3.
Offload now wins when C_s has 1+ concurrent request.
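The new comparison can be sketched as below. It assumes the offload side is still the shared base cost plus the 0.1 s RDMA overhead, with only the colocated side gaining the penalty; function and parameter names are illustrative.

```python
def interference_penalty_s(prefill_time_s, num_requests, cap=3, weight=0.3):
    """penalty = prefill_time x min(num_requests, cap) x weight."""
    return prefill_time_s * min(num_requests, cap) * weight


def offload_wins(base_cost_s, prefill_time_s, num_requests, rdma_overhead_s=0.1):
    """Compare colocated (base + penalty) against offload (base + RDMA)."""
    colocated = base_cost_s + interference_penalty_s(prefill_time_s, num_requests)
    offload = base_cost_s + rdma_overhead_s
    return offload < colocated
```

Under these assumptions a single concurrent request tips the decision once 0.3 x prefill_time exceeds 0.1 s, that is, for prefills longer than about a third of a second.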
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Retry on ConnectError to handle kv_both connection instability.
With RDMA_overhead=0.1s, offload triggers when C_s has just 700 tokens
pending (0.1s queue), vs 38k tokens (5.4s) with the old 2.0s estimate.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The cache_gate_ratio=0.3 check blocked 83/112 HEAVY requests (75%)
because they were cold (cache_ratio=0). But with direct RDMA read,
D reads C's cached blocks via RDMA regardless of cache ratio — the
gate was protecting against the OLD flow (C does prefill + push).
Also fixed cost model: offload_cost now reflects direct read reality:
OLD: P_queue + P_full_prefill + RDMA (P has no cache → expensive)
NEW: D_queue + RDMA_read + D_local_prefill(new_tokens)
Offload wins when C_s queue > RDMA_overhead (~2s).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Replace the global session_affinity dict with two namespace-isolated
ones (combined / prefill) so a session_id never indexes the wrong
instance list across mode switches. Keep `session_affinity` as a
read-only alias to the combined dict for any existing tooling.
- Add a startup _verify_vllm_patch() that scans
vllm.v1.core.sched.scheduler.Scheduler for the original
`assert req_id in self.requests` line. If the patch was not
re-applied after a vLLM upgrade we now print a loud warning at
lifespan startup instead of dying mid-experiment on a KV-transfer
abort race.
- Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/
MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/
CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton.
__main__ now mutates SETTINGS so CLI overrides survive even when the
module is imported as a library (e.g. by tests/) (D5).
- Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS.
- Add --cache-gate-ratio CLI flag and a real gate before the cost-model
branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and
fall back to colocated. cache_ratio is no longer a write-only field
(B4).
- P candidate selection penalises instances already running offloaded
HEAVY prefills, so back-to-back HEAVY requests don't pile onto the
same P (M2).
- bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the
proxy.
- Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload
penalty.
Complete implementation of direct RDMA read for KV cache migration:
vLLM Mooncake connector (mooncake_connector.py):
- PullReqMeta: add direct_read flag + block_hashes
- MooncakeConnectorMetadata: add hash_table_updates/removals for
scheduler->worker block hash sync
- MooncakeConnectorScheduler: set_block_pool() to access BlockPool,
build_connector_meta() computes hash table deltas each step,
update_state_after_alloc() captures request block hashes for direct_read
- MooncakeConnectorWorker: _start_direct_read() + _direct_read_single()
implements D-side RDMA read via batch_transfer_sync_read, with
HTTP query/unpin to C's bootstrap server
Bootstrap server (mooncake_utils.py):
- POST /query_blocks: look up block hashes, return block_ids + GPU layout
- POST /unpin_blocks: release pin tracking
- set_worker_kv_info(): register GPU addresses at init
- update_hash_table(): receive scheduler deltas each step
Scheduler (scheduler.py):
- One-line hookup: pass block_pool to connector after KVCacheManager init
Proxy (cache_aware_proxy.py):
- _handle_direct_read_offload: sends request ONLY to D with
direct_read=True + remote_bootstrap_addr. No request to C at all.
- C's scheduler is completely uninvolved (0 GPU time on C)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
M1: cached_blocks was a plain set with a "trim half via list slicing"
eviction. CPython does not guarantee set iteration order, so the trim
discarded an arbitrary half of the entries — completely unlike vLLM's
LRU and a known contributor to the router's cache_hit estimate
diverging from real APC. Replace with an OrderedDict-backed LRU:
move_to_end on hits, popitem(last=False) on overflow. Capacity exposed
as CACHE_CAPACITY_BLOCKS module constant (200000 by default).
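The OrderedDict pattern named above looks roughly like this. It is a minimal sketch, not the proxy's class: BlockLRU and touch are hypothetical names, and block hashes are treated as opaque hashable keys.

```python
from collections import OrderedDict


class BlockLRU:
    """Shadow cache of block hashes with least-recently-used eviction."""

    def __init__(self, capacity=200_000):
        self.capacity = capacity
        self._blocks = OrderedDict()

    def touch(self, block_hash):
        """Record a block; return True if it was already cached (a hit)."""
        hit = block_hash in self._blocks
        if hit:
            self._blocks.move_to_end(block_hash)  # mark as most recent
        else:
            self._blocks[block_hash] = None
            while len(self._blocks) > self.capacity:
                self._blocks.popitem(last=False)  # evict the oldest entry
        return hit

    def __contains__(self, block_hash):
        return block_hash in self._blocks

    def __len__(self):
        return len(self._blocks)
```

Unlike the set-based trim, eviction order here is fully determined by access order, so the estimate degrades the same way on every run.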
M5: streamed responses decrement load counters in their generator's
finally block. If a client disconnects before consuming the body the
generator is never entered and the decrement is lost, causing
ongoing_tokens / num_requests / pending_prefill_tokens to drift
negative under load. Add a 60s background reconcile_loop that clamps
those counters at zero as a safety net. Started in lifespan, cancelled
on shutdown. Does not replace proper vLLM exact-state syncing.
The hardcoded traces/sampled_1000req_seed42.jsonl no longer exists; switch
the default to the current sampled trace file w600_r0.0015_st30.jsonl and
let users override via --trace. Skip Part 4 cleanly when the file is
missing instead of relying on os.path.exists.
D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and
--max-inflight-sessions to the replayer, but those flags were removed when
the project moved to trace-driven dispatch. The scripts cannot run as-is.
D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a
handful of single-experiment run_*.sh point at /home/admin/cpfs paths,
deleted output directories, or a sampled trace file that no longer exists.
Keep them in scripts/legacy/ for historical reference; the scripts that
remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit,
analyze_eviction, compare_results, compute_roofline, sample_trace,
analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh,
gpu_monitor.sh, bench.sh) cover the current workflow.
Adds scripts/legacy/README.md to document the archival policy.
The default MODEL pointed at /home/admin/cpfs/... which never existed on
the public dev machines (other launch_*.sh and TODO.md use $HOME/models),
and the default TRACE pointed at traces/sampled_1000req_seed42.jsonl
which was deleted when the sampler moved to window+thin output. Update
both to the values the rest of the repo already standardized on.
B1: _inst_cumulative_tokens was written by pick_instance but never read
anywhere; delete the variable, global declaration, and per-call increment.
Load is already tracked via inst.ongoing_tokens.
D1: _send_prefill_async + the --fire-and-forget branch were unreachable
in practice (no launch/bench script enabled the flag) and broken even if
exercised: D-decode would fire before P registered the transfer_id,
guaranteeing a Mooncake 502. Collapse _handle_pd_sep to its synchronous
path and drop the CLI flag.
Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor
processes are cleaned up even when bench.sh is killed externally.
Also includes gpu_monitor in cleanup_gpu pattern matching.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>