Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)
TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90: 23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
on the current codebase: kv_both is now *faster* than plain at p90.
Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and by
the connector-mode delay_free_blocks, which extends cross-turn prefix
cache hits on a trace with 93% intra-session reuse.
2. DR-fix removes another 22% from TTFT p90 by skipping the
O(|cache|) hash sync in build_connector_meta. Cache-sweep with
DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.
Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C
Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).
Files:
  apply_direct_read_fix.py      one-line env-gate patch (marked for clean revert)
  run_drfix.sh                  orchestrator for plain + mooncake_both + drfix
  analyze.py                    extended to compare mooncake_both_drfix vs plain
                                and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md               findings
  results/20260526_1543_drfix/  run artifacts
Headline:
  config              | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  --------------------|----------------------|---------------------
  mooncake_both       | +81.0                | 1 550 μs
  mooncake_both_drfix | -0.7 (≈ 0)           | 95 μs
  plain (control)     | -1.8 (≈ 0)           | 72 μs
build_meta p50 @ 16.6k blocks:
  mooncake_both       = 1 459 μs
  mooncake_both_drfix =     6 μs (residual loop bookkeeping)
worker get_finished p50:
  mooncake_both       = 178 μs (unchanged; this fix doesn't touch it)
  mooncake_both_drfix = 183 μs
The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.
Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.
Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before:       build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta     6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
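
The extrapolation above is plain arithmetic on the measured medians. A
minimal sketch that reproduces it (the helper and variable names are
illustrative only and do not come from analyze.py; the 7 ms decode step
is the approximate figure quoted above):

```python
DECODE_STEP_US = 7_000  # ~7 ms decode forward, as quoted in the text


def per_step_cost_us(build_meta_us: float, get_finished_us: float) -> float:
    """Connector cost per scheduler step: scheduler side + worker side."""
    return build_meta_us + get_finished_us


# Medians at |cache| ≈ 13k blocks, copied from the lines above.
before_us = per_step_cost_us(build_meta_us=1_060, get_finished_us=180)
after_us = per_step_cost_us(build_meta_us=6, get_finished_us=180)

reduction = 1 - after_us / before_us
inflation_before = before_us / DECODE_STEP_US
inflation_after = after_us / DECODE_STEP_US

print(f"before: {before_us / 1000:.2f} ms/step, after: {after_us / 1000:.2f} ms/step")
print(f"connector cost reduction: {reduction:.0%}")
print(f"TPOT inflation: {inflation_before:+.1%} -> {inflation_after:+.1%}")
```

This treats the connector cost as purely additive to the decode step,
which is the same simplification the estimate above makes.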
Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.
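
A minimal sketch of the incremental-delta idea, assuming the block pool
can call back on every add and remove. Whether block_pool exposes such
hooks is an assumption; `BlockDeltaTracker`, `on_block_added`,
`on_block_removed` and `drain` are hypothetical names, not vLLM or
Mooncake APIs. `full_diff` stands in for the current per-step
`set(cache.keys())` walk so the two can be compared:

```python
class BlockDeltaTracker:
    """Accumulates block-hash adds/removes between scheduler steps.

    Hypothetical helper: cost per step is O(number of changes),
    not O(|cache|).
    """

    def __init__(self):
        self._added = set()
        self._removed = set()

    def on_block_added(self, block_hash):
        # Re-adding a hash that was removed since the last drain is a no-op.
        if block_hash in self._removed:
            self._removed.discard(block_hash)
        else:
            self._added.add(block_hash)

    def on_block_removed(self, block_hash):
        # Removing a hash that was added since the last drain is a no-op.
        if block_hash in self._added:
            self._added.discard(block_hash)
        else:
            self._removed.add(block_hash)

    def drain(self):
        """Return (added, removed) since the previous drain and reset."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


def full_diff(prev_keys, cache):
    """Reference O(|cache|) diff, analogous to the per-step walk."""
    cur = set(cache.keys())
    return cur - prev_keys, prev_keys - cur, cur


# Drive both with the same sequence of cache mutations.
cache, tracker, prev = {}, BlockDeltaTracker(), set()
for h in ("h1", "h2", "h3"):
    cache[h] = object()
    tracker.on_block_added(h)
del cache["h2"]
tracker.on_block_removed("h2")

added_ref, removed_ref, prev = full_diff(prev, cache)
added_inc, removed_inc = tracker.drain()
print(added_inc == added_ref, removed_inc == removed_ref)
```

The sketch assumes callbacks are consistent (no double add or double
remove of the same hash) and are delivered on the scheduler thread; a
real implementation would have to guarantee both.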
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.
Patch v2 (apply_step_timing_v2.py) adds:
scheduler: `cache_size` field in engine_step.jsonl
worker: `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
uses BLOCK_BEGIN/END sentinels for safe multi-line revert
(the original v1 patch survives this v2's apply/revert cycle)
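
A minimal sketch of sentinel-delimited apply/revert, to illustrate why
a multi-line insertion can be removed without disturbing lines from an
earlier patch. The exact sentinel strings, the anchor matching and the
function names are assumptions made for this example, not the contents
of apply_step_timing_v2.py:

```python
BEGIN = "# BLOCK_BEGIN step_timing_v2"  # hypothetical sentinel text
END = "# BLOCK_END step_timing_v2"      # hypothetical sentinel text


def apply_patch(source: str, anchor: str, payload: list) -> str:
    """Insert payload lines after the first line equal to `anchor`."""
    if BEGIN in source:
        return source  # already applied: keep the operation idempotent
    out, inserted = [], False
    for line in source.splitlines():
        out.append(line)
        if not inserted and line.strip() == anchor:
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + BEGIN)
            out.extend(indent + p for p in payload)
            out.append(indent + END)
            inserted = True
    if not inserted:
        raise ValueError(f"anchor not found: {anchor!r}")
    return "\n".join(out) + "\n"


def revert_patch(source: str) -> str:
    """Drop everything between the sentinels, sentinels included."""
    out, skipping = [], False
    for line in source.splitlines():
        stripped = line.strip()
        if stripped == BEGIN:
            skipping = True
        elif stripped == END:
            skipping = False
        elif not skipping:
            out.append(line)
    return "\n".join(out) + "\n"


original = "def step():\n    run()\n    t1 = now()  # v1 patch line\n"
patched = apply_patch(original, "run()", ["t0 = now()", "log(t0)"])
restored = revert_patch(patched)
print(restored == original)
```

Because revert only removes what sits between the sentinels, the line
standing in for the v1 patch is untouched by the v2 cycle.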
Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.
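
A minimal sketch of how a slope in μs per 1k blocks can be derived from
such a sweep: keep decode-only steps, then fit step duration against
cache size by ordinary least squares. `cache_size` and
`step_duration_us` are the fields named in these notes; the
`is_decode_only` flag, the function name and the synthetic data are
assumptions, and the real analyze.py may bin or fit differently:

```python
def slope_us_per_1k_blocks(steps):
    """OLS slope of step_duration_us vs cache_size, per 1k blocks."""
    decode = [s for s in steps if s["is_decode_only"]]
    xs = [s["cache_size"] / 1000.0 for s in decode]
    ys = [float(s["step_duration_us"]) for s in decode]
    n = len(decode)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Synthetic records: 100 μs fixed cost plus 85 μs per 1k cached blocks.
steps = [
    {"cache_size": c, "step_duration_us": 100 + 85 * c / 1000, "is_decode_only": True}
    for c in range(0, 18_000, 1_000)
]
# One prefill-mixed step that the filter must drop.
steps.append({"cache_size": 9_000, "step_duration_us": 250_000, "is_decode_only": False})

slope = slope_us_per_1k_blocks(steps)
print(f"slope: {slope:+.1f} us per 1k blocks")
```

Without the decode-only filter, the single prefill-mixed record would
dominate the fit, which is the variance the post-hoc filtering removes.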
Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):
  config         | slope (μs/1k blocks) | step_dur p50 @ 16.6k blocks
  ---------------|----------------------|-------------------------------
  mooncake_both  | +85.6                | 1528 μs (build_meta=1442, 94%)
  noop_connector | -0.8 (≈0)            | 79 μs
  plain          | +1.0 (≈0)            | 84 μs
Worker-side get_finished p50/p90/p99 (μs/step):
mooncake_both: 180 / 257 / 333
noop_connector: 0 / 0 / 2
H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.
The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.
Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
On ~7 ms decode forward → +15-20% TPOT per step.
This explains most of the trace-replay +25% TPOT p90 gap from
single-instance per-step cost alone, leaving a smaller residual
for multi-instance coupling than originally assumed.
Two clear fixes pointed out in REPORT.md:
1. replace O(|cache|) per-step walk with incremental delta
listener using block_pool's add/remove callbacks
2. short-circuit get_finished() when both producer/consumer
queues are empty in kv_both
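
A minimal sketch of fix 2, assuming the connector can cheaply tell
whether any send or receive is outstanding. `FinishedPoller`, its
counters and `_collect` are hypothetical stand-ins, not the Mooncake
connector's real attributes; only `asyncio.run_coroutine_threadsafe`
is the standard-library call named above:

```python
import asyncio
import threading


class FinishedPoller:
    """Toy worker-side poller with an early exit when nothing is pending."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        threading.Thread(target=self._loop.run_forever, daemon=True).start()
        self.pending_sends = 0   # hypothetical: outstanding producer transfers
        self.pending_recvs = 0   # hypothetical: outstanding consumer transfers
        self.blocking_waits = 0  # instrumentation for this example only

    async def _collect(self, kind):
        # Stand-in for the real "fetch finished request ids" coroutine.
        return set()

    def get_finished(self):
        # Short-circuit: skip both cross-thread round trips when idle.
        if self.pending_sends == 0 and self.pending_recvs == 0:
            return set(), set()
        self.blocking_waits += 2
        sent = asyncio.run_coroutine_threadsafe(
            self._collect("send"), self._loop
        ).result()
        recv = asyncio.run_coroutine_threadsafe(
            self._collect("recv"), self._loop
        ).result()
        return sent, recv


poller = FinishedPoller()
idle_result = poller.get_finished()   # no blocking wait taken
poller.pending_recvs = 1
busy_result = poller.get_finished()   # takes both blocking waits
print(idle_result, busy_result, poller.blocking_waits)
```

The open question for a real patch is keeping the pending counters
accurate across threads; the sketch sidesteps that by mutating them
from the calling thread only.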
Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The prior write-up presented one specific reading of the data as
the headline without flagging methodology gaps. Three corrections:
1. The "0% low-concurrency tax" comes from a single back-to-back
mooncake_both_v2/plain_v2 rerun. The original Phase A pair
showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2,
a swing of roughly 30 to 55 percentage points (depending on the
metric) between two consecutive runs that the original write-up
did not call out. The run-to-run noise floor is too high to
claim "0%" at low concurrency.
2. get_finished() was never instrumented. The patch only times
step_duration_us and build_meta_us. "100% of per-step cost is
build_meta" is an upper bound on what was timed, not a true
decomposition.
3. H5 (cache-size dependence) was the central hypothesis but
was never tested in the prior run; random content kept APC
near empty.
The +7-9% high-concurrency (single instance, 512x64, rate=8-16)
and +17% 8-instance-saturated numbers are kept; they were
measured with adequate sample sizes and are reproducible.
The follow-up sweep in cache_sweep/ tests H5 directly and
revises the decomposition.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total:
  Rate=32 (non-saturated, thr=0.95-0.97):
    plain TTFT p90=64ms,  mooncake_both=65ms  → +2% (noise)
  Rate=64 (non-saturated, thr=0.96):
    plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise)
  Rate=128 (saturated, thr=0.70-0.71):
    plain TTFT p90=702ms, mooncake_both=822ms → +17%
    plain TTFT p50=339ms, mooncake_both=470ms → +39%
Conclusion: The elastic_migration_v2 +45% is a saturation artifact.
Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's
1.4ms/step build_connector_meta overhead is hidden by the asynchronous
scheduler/model pipeline. The tax only shows up once the system is
already saturated and queueing amplifies per-step differences.
For practical deployment: enabling kv_role=kv_both has effectively zero
cost as long as the serving system stays within SLO capacity bounds.
plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps,
cold prefill prompts) and produces:
fig_interference_heatmap.png
TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k.
fig_interference_lines.png
(a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed
(b) Cold prefill TTFT vs P (interference window length)
Confirms B2 finding: cold prefill on the same worker inflates the TPOT p90
of overlapping decodes to 14-214x baseline. The interference window grows
faster than linearly with P (from ~140ms at 2k to ~4.6s at 32k, i.e. ~33x
for a 16x longer prompt) and is essentially independent of decode batch
size: prefill compute time dominates.
Instrumentation patches (microbench/patches/):
- pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
- apply_patches.py: idempotent patch installer for mooncake_connector.py
and scheduler.py, marks insertions with # PD_PROFILE_PATCH
- analyze_events.py: joins per-process JSONL event logs by transfer_id
into per-request phase durations
Seven events captured per request:
D_get_num_matched → P_zmq_received → P_prefill_done →
P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted
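
A minimal sketch of the join that analyze_events.py is described as
doing: group JSONL records from all processes by transfer_id, then
subtract event timestamps to get phase durations. The record schema
(`transfer_id`, `event`, `ts_ms`), the event-pair-to-phase mapping and
the sample values are assumptions for illustration; only the event
names come from the list above:

```python
import json
from collections import defaultdict

# Assumed mapping from phase name to (start_event, end_event).
PHASES = {
    "prefill_compute": ("P_zmq_received", "P_prefill_done"),
    "rdma_transfer": ("P_rdma_start", "P_rdma_end"),
    "total": ("D_get_num_matched", "D_request_promoted"),
}


def join_events(lines):
    """Group event timestamps by transfer_id across all input lines."""
    by_id = defaultdict(dict)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        rec = json.loads(line)
        by_id[rec["transfer_id"]][rec["event"]] = rec["ts_ms"]
    return by_id


def phase_durations(events):
    """Durations for every phase whose two events were both captured."""
    return {
        phase: events[end] - events[start]
        for phase, (start, end) in PHASES.items()
        if start in events and end in events
    }


# Synthetic logs from a decode-side and a prefill-side process.
d_log = [
    '{"transfer_id": "t1", "event": "D_get_num_matched", "ts_ms": 0}',
    '{"transfer_id": "t1", "event": "D_recv_complete", "ts_ms": 120}',
    '{"transfer_id": "t1", "event": "D_request_promoted", "ts_ms": 125}',
]
p_log = [
    '{"transfer_id": "t1", "event": "P_zmq_received", "ts_ms": 5}',
    '{"transfer_id": "t1", "event": "P_prefill_done", "ts_ms": 105}',
    '{"transfer_id": "t1", "event": "P_rdma_start", "ts_ms": 108}',
    '{"transfer_id": "t1", "event": "P_rdma_end", "ts_ms": 118}',
]

durations = {
    tid: phase_durations(ev) for tid, ev in join_events(d_log + p_log).items()
}
print(durations)
```

Subtracting timestamps across processes only makes sense if P and D
share a clock (same host, or a synchronized time source); the sketch
assumes they do.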
Driver fix (microbench/lifecycle/driver.py):
seed_prefix_cache now sends via the proxy URL so P and D both cache
the seeded prefix with matching block hashes. Previously seeding D
directly produced different block hashes than the proxy-routed
measurement requests, making incremental transfer impossible.
Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute  620 ms median  (95% of overhead)
  rdma_transfer     42 ms median  (~71 Gbps effective)
  other overhead    10 ms median  (dispatch + params + signal + promote)
Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
Two microbenchmarks quantifying the elastic offload decision:
1. Interference (corrected): cold prefill causes 14-214x TPOT p90
degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
Earlier run had a prefix-cache bug (deterministic prompts hit cache
after rep 0); fixed with uuid+time_ns unique prompts.
2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
measuring prefill→RDMA→decode startup overhead.
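
A minimal sketch of the uuid+time_ns fix from item 1: prefix every
prompt with a fresh nonce so no repetition can hit the prefix cache.
The function name and filler text are illustrative, and word count only
approximates token count:

```python
import time
import uuid


def unique_prompt(num_words: int, filler: str = "hello") -> str:
    """Build a prompt whose first tokens differ on every call."""
    # The nonce goes first: prefix caching matches from the start of the
    # prompt, so a unique head keeps every later block cold as well.
    nonce = f"{uuid.uuid4().hex}-{time.time_ns()}"
    return nonce + " " + " ".join([filler] * num_words)


a = unique_prompt(8)
b = unique_prompt(8)
print(a.split()[0] != b.split()[0])
```

Placing the nonce at the end instead would leave the shared filler
prefix cacheable after the first repetition, which is the kind of
cache hit the earlier run suffered from.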
Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.