- b3_isolated_policy.sh: HEALTH_MAX_TRIES now env-overridable (default 180 ->
360s unchanged); slow-node launches can pass HEALTH_MAX_TRIES=300 (600s) to
ride out a single-instance startup flake without aborting the whole arm.
- run_5policy_both_modes.sh: runs run_5policy_600s.sh twice on the SAME ttp
trace with REPLAY_DISPATCH_MODE={tracets,thinktime}, so the only variable is
dispatch mode. Outputs to outputs/policy5_600s_{mode}_<date>/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.
analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Standalone smoke tests validating KV-migration correctness paths before
trace replay: full migrate-cache, partial-prefill transfer, and a
NIXL-connector variant, each with a runner.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at
~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into
~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s)
+ ~45% control-plane GIL starvation during long prefills. Reproduced on a
fresh upstream venv (byte-identical transfer path) -> upstream/hardware
inherent, not our patch. Layerwise is the wrong lever; the tax is structural
on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6,
instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the
LMetric fallback score) and a v3 anti-hotspot recent-migration penalty
(effective_load = num_req + recent-migration count over a sliding window),
preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents
the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix
sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by
~20%. Runners/analyzer for the b3 trace replay included.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)
TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90: 23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
on current codebase — kv_both is now *faster* than plain at p90.
Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
the connector-mode delay_free_blocks extending cross-turn prefix
cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
O(|cache|) hash sync in build_connector_meta. Cache-sweep with
DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.
Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C
Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).
Files:
apply_direct_read_fix.py one-line env-gate patch (markered revert)
run_drfix.sh orchestrator for plain + mooncake_both + drfix
analyze.py extended to compare mooncake_both_drfix vs plain
and mooncake_both vs mooncake_both_drfix
REPORT_DRFIX.md findings
results/20260526_1543_drfix/ run artifacts
Headline:
config | slope (μs/1k blocks) | step_dur p50 @ 16.6k
----------------------|----------------------|---------------------
mooncake_both | +81.0 | 1 550 μs
mooncake_both_drfix | -0.7 (≈ 0) | 95 μs
plain (control) | -1.8 (≈ 0) | 72 μs
build_meta p50 @ 16.6k blocks:
mooncake_both = 1 459 μs
mooncake_both_drfix = 6 μs (residual loop bookkeeping)
worker get_finished p50:
mooncake_both = 178 μs (unchanged; this fix doesn't touch it)
mooncake_both_drfix = 183 μs
The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.
Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.
Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
→ 85% reduction in per-step connector cost
→ TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.
Patch v2 (apply_step_timing_v2.py) adds:
scheduler: `cache_size` field in engine_step.jsonl
worker: `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
uses BLOCK_BEGIN/END sentinels for safe multi-line revert
(the original v1 patch survives this v2's apply/revert cycle)
Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.
Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):
config | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
---------------|------------------------|-----------------------------
mooncake_both | +85.6 | 1528 μs (build_meta=1442, 94%)
noop_connector | -0.8 (≈0) | 79 μs
plain | +1.0 (≈0) | 84 μs
Worker-side get_finished p50/p90/p99 (μs/step):
mooncake_both: 180 / 257 / 333
noop_connector: 0 / 0 / 2
H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.
The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.
Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
On ~7 ms decode forward → +15-20% TPOT per step.
This explains most of the trace-replay +25% TPOT p90 gap from
single-instance per-step cost alone, leaving a smaller residual
for multi-instance coupling than originally assumed.
Two clear fixes pointed out in REPORT.md:
1. replace O(|cache|) per-step walk with incremental delta
listener using block_pool's add/remove callbacks
2. short-circuit get_finished() when both producer/consumer
queues are empty in kv_both
Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>