Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)
TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90: 23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)
Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
on current codebase — kv_both is now *faster* than plain at p90.
Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
the connector-mode delay_free_blocks extending cross-turn prefix
cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
O(|cache|) hash sync in build_connector_meta. Cache-sweep with
DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.
Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C
Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
5.7 KiB
Trace-replay re-test with DR-fix
Run: results/trace_replay_20260526_1652/
Trace: traces/w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions, 53.3 M tokens)
Topology: 8 × TP1 vLLM + cache_aware_proxy, Qwen3-Coder-30B-A3B-Instruct
Same trace, same proxy, same machine that produced the original
analysis/characterization/elastic_migration_v2/ paper.
TL;DR
The original elastic_migration_v2 paper claimed kv_role=kv_both (Mooncake) cost
TTFT p90 +45 % vs plain unified. That gap no longer exists. In a
same-day re-run on the same trace with the same 8-instance topology:
| metric | unified (plain) | unified_kv_both (baseline) | unified_kv_both_drfix |
|---|---|---|---|
| TTFT p90 | 11 971 ms | 9 744 ms (−18.6 % vs plain) | 7 584 ms (−36.6 % vs plain) |
| TPOT p90 | 20 ms | 22 ms (+10 %) | 18 ms (−10 %) |
| E2E p90 | 23 475 ms | 21 254 ms (−9.5 %) | 17 931 ms (−23.6 %) |
Two findings:
- The +45 % is gone. kv_both without any fix is now faster than
plain
unifiedat p90 (−18.6 %). Likely culprits in the commit chain since the elastic_migration_v2 paper:a7df84b(direct RDMA read),0500350(token-based lookup),08d5e12(NONE_HASH import fix), and especiallye3a1d70(switch from RDMA READ to bootstrap-triggered PUSH) which restructured the producer-side critical path. - DR-fix still helps. Disabling the O(|cache|) hash sync removes another 22 % from TTFT p90 (9.7 s → 7.6 s) and 16 % from E2E p90 (21.3 s → 17.9 s). The cache-sweep finding (+85 μs/1k blocks slope) translates into measurable p90/p99 wins under high APC + agentic session coupling.
How this changes the elastic_migration_v2 narrative
Original paper's four claims, re-checked today:
| original claim | today's status |
|---|---|
| "kv_role=kv_both costs TTFT p90 +45 % even without PD-sep" | OBSOLETE (now −18.6 % vs plain) |
| "Mooncake−NIXL gap of 7 pp is implementation cost" | NOT TESTED (NIXL not re-run here) |
| "PD-sep rarely fires (0.41 % trigger rate)" | unchanged — trace property |
| "When PD-sep fires, mechanism is 10-20× slower than model predicts" | NOT TESTED (v2 policy not re-run) |
The elastic_migration_v2 README should be marked as containing historical
data that is no longer reproducible on the current codebase. The story
ought to be re-cast as: "+45 % was a transient bug we fixed (whether
intentionally as part of e3a1d70 or accidentally), and the
remaining headroom (15-20 % p90) is recovered by the DR-fix."
Full per-metric A/B/C table
(10 s warmup discarded by the replayer; n=1214 each)
| metric | unified | unified_kv_both | drfix | mc vs plain | drfix vs plain | drfix vs mc |
|---|---|---|---|---|---|---|
| TTFT mean | 4 018 ms | 3 552 ms | 3 103 ms | −11.6 % | −22.8 % | −12.6 % |
| TTFT p50 | 500 ms | 501 ms | 485 ms | +0.2 % | −3.0 % | −3.2 % |
| TTFT p90 | 11 971 ms | 9 744 ms | 7 584 ms | −18.6 % | −36.6 % | −22.2 % |
| TTFT p99 | 46 695 ms | 42 432 ms | 41 883 ms | −9.1 % | −10.3 % | −1.3 % |
| TPOT mean | 15.3 ms | 14.4 ms | 14.0 ms | −5.9 % | −8.5 % | −2.8 % |
| TPOT p50 | 8.4 ms | 8.3 ms | 8.0 ms | −0.9 % | −3.9 % | −3.1 % |
| TPOT p90 | 19.6 ms | 21.6 ms | 17.7 ms | +10.0 % | −9.7 % | −17.9 % |
| TPOT p99 | 151.6 ms | 127.8 ms | 112.4 ms | −15.7 % | −25.9 % | −12.1 % |
| E2E mean | 8 180 ms | 7 967 ms | 7 184 ms | −2.6 % | −12.2 % | −9.8 % |
| E2E p50 | 1 942 ms | 1 995 ms | 1 806 ms | +2.7 % | −7.0 % | −9.5 % |
| E2E p90 | 23 475 ms | 21 254 ms | 17 931 ms | −9.5 % | −23.6 % | −15.6 % |
| E2E p99 | 73 709 ms | 76 630 ms | 71 958 ms | +4.0 % | −2.4 % | −6.1 % |
Why kv_both already beats plain (without DR-fix)
A connector-loaded vLLM has delay_free_blocks=True by default — block
eviction is deferred until the connector's bookkeeping signals it is
safe. On a 93 %-intra-session-reuse trace, this extends prefix-cache
hit windows across session turns, which more than compensates for the
per-step connector cost on the codebase as it exists today. With the
DR-fix removing the remaining O(|cache|) tax, the net swings strongly
positive.
This was also one of the explanations proposed in the cache_sweep report ("connector mode has higher effective cache utilisation") and is now confirmed at the trace-replay scale.
Reproducibility
bash microbench/connector_tax/cache_sweep/run_trace_replay_drfix.sh
Runtime: ~2.5 h on 8 × H20. The orchestrator applies CT_DR_FIX, runs the three policies serially (plain → mc baseline → mc drfix via env var), reverts the patch, and emits per-policy metrics.jsonl. Analyse with:
python microbench/connector_tax/cache_sweep/analyze_trace_replay.py \\
--root microbench/connector_tax/cache_sweep/results/trace_replay_20260526_1652
Files
trace_replay_20260526_1652/
├── trace_replay_summary.json — machine-readable per-config TTFT/TPOT/E2E
├── unified/ — plain control
│ ├── metrics.jsonl — per-request timings (1214 rows)
│ ├── metrics.summary.json — replayer's own summary
│ ├── breakdown.json — proxy per-decision metadata
│ ├── stats.json — proxy aggregate counters
│ └── run_window.json — t_start/t_end + policy + trace
├── unified_kv_both/ — Mooncake kv_both, hash sync ON
└── unified_kv_both_drfix/ — Mooncake kv_both, hash sync OFF (env-gated)
Heavy artifacts (engine_state/, vllm logs, replayer.log, proxy.log)
are .gitignored — re-derive with run_trace_replay_drfix.sh.