agentic-kvc/microbench/connector_tax/cache_sweep/REPORT_TRACE_REPLAY.md
Gahow Wang ef9e0102ec Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)

TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90:  23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)

Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
   on current codebase — kv_both is now *faster* than plain at p90.
   Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
   the connector-mode delay_free_blocks extending cross-turn prefix
   cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
   O(|cache|) hash sync in build_connector_meta. Cache-sweep with
   DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.

Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C

Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:50 +08:00


Trace-replay re-test with DR-fix

Run: results/trace_replay_20260526_1652/
Trace: traces/w600_r0.0015_st30.jsonl (1214 reqs, 274 sessions, 53.3 M tokens)
Topology: 8 × TP1 vLLM + cache_aware_proxy, Qwen3-Coder-30B-A3B-Instruct

Same trace, same proxy, and same machine that produced the original analysis/characterization/elastic_migration_v2/ paper.

TL;DR

The original elastic_migration_v2 paper claimed kv_role=kv_both (Mooncake) cost TTFT p90 +45 % vs plain unified. That gap no longer exists. In a same-day re-run on the same trace with the same 8-instance topology:

| metric | unified (plain) | unified_kv_both (baseline) | unified_kv_both_drfix |
|---|---|---|---|
| TTFT p90 | 11 971 ms | 9 744 ms (−18.6 % vs plain) | 7 584 ms (−36.6 % vs plain) |
| TPOT p90 | 20 ms | 22 ms (+10 %) | 18 ms (−10 %) |
| E2E p90 | 23 475 ms | 21 254 ms (−9.5 %) | 17 931 ms (−23.6 %) |

Two findings:

  1. The +45 % is gone. kv_both without any fix is now faster than plain unified at TTFT p90 (−18.6 %). Likely contributors in the commit chain since the elastic_migration_v2 paper: a7df84b (direct RDMA read), 0500350 (token-based lookup), 08d5e12 (NONE_HASH import fix), and especially e3a1d70 (switch from RDMA READ to bootstrap-triggered PUSH), which restructured the producer-side critical path.
  2. DR-fix still helps. Disabling the O(|cache|) hash sync removes another 22 % from TTFT p90 (9.7 s → 7.6 s) and 16 % from E2E p90 (21.3 s → 17.9 s). The cache-sweep finding (+85 μs/1k blocks slope) translates into measurable p90 wins under high APC + agentic session coupling; the p99 gain over kv_both is smaller (−1.3 % TTFT, −6.1 % E2E).
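The slope figure quoted in finding 2 is a cost per 1k cached blocks. A minimal sketch of how such a slope could be fitted, assuming an ordinary least-squares regression of per-step connector overhead against cached-block count (the report does not state the fitting method). The function name and the data points are hypothetical; the points are synthetic and lie on an exact line, so they are not measurements from the cache sweep.

```python
def slope_us_per_1k_blocks(blocks, cost_us):
    """Least-squares slope of cost (us) vs. cached blocks, scaled to us per 1k blocks."""
    n = len(blocks)
    if n < 2 or n != len(cost_us):
        raise ValueError("need two or more paired samples")
    mean_x = sum(blocks) / n
    mean_y = sum(cost_us) / n
    sxx = sum((x - mean_x) ** 2 for x in blocks)
    if sxx == 0:
        raise ValueError("all cache sizes are identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(blocks, cost_us))
    return sxy / sxx * 1000.0


# Synthetic illustration only: a fixed 200 us base cost plus 0.085 us per block.
blocks = [0, 10_000, 20_000, 40_000]
cost_us = [200.0 + 0.085 * b for b in blocks]
slope = slope_us_per_1k_blocks(blocks, cost_us)
print(f"slope: {slope:+.1f} us/1k blocks")
```

By construction the fit recovers the 85 μs/1k blocks used to generate the points. A near-zero slope after the DR-fix would mean per-step cost no longer grows with cache size.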

How this changes the elastic_migration_v2 narrative

Original paper's four claims, re-checked today:

| original claim | today's status |
|---|---|
| "kv_role=kv_both costs TTFT p90 +45 % even without PD-sep" | OBSOLETE (now −18.6 % vs plain) |
| "Mooncake-NIXL gap of 7 pp is implementation cost" | NOT TESTED (NIXL not re-run here) |
| "PD-sep rarely fires (0.41 % trigger rate)" | unchanged (trace property) |
| "When PD-sep fires, mechanism is 10-20× slower than model predicts" | NOT TESTED (v2 policy not re-run) |

The elastic_migration_v2 README should be marked as containing historical data that is no longer reproducible on the current codebase. The story ought to be re-cast as: "+45 % was a transient bug we fixed (whether intentionally as part of e3a1d70 or accidentally), and the remaining headroom (15-20 % p90) is recovered by the DR-fix."

Full per-metric A/B/C table

(10 s warmup discarded by the replayer; n=1214 each)

| metric | unified | unified_kv_both | drfix | mc vs plain | drfix vs plain | drfix vs mc |
|---|---|---|---|---|---|---|
| TTFT mean | 4 018 ms | 3 552 ms | 3 103 ms | −11.6 % | −22.8 % | −12.6 % |
| TTFT p50 | 500 ms | 501 ms | 485 ms | +0.2 % | −3.0 % | −3.2 % |
| TTFT p90 | 11 971 ms | 9 744 ms | 7 584 ms | −18.6 % | −36.6 % | −22.2 % |
| TTFT p99 | 46 695 ms | 42 432 ms | 41 883 ms | −9.1 % | −10.3 % | −1.3 % |
| TPOT mean | 15.3 ms | 14.4 ms | 14.0 ms | −5.9 % | −8.5 % | −2.8 % |
| TPOT p50 | 8.4 ms | 8.3 ms | 8.0 ms | −0.9 % | −3.9 % | −3.1 % |
| TPOT p90 | 19.6 ms | 21.6 ms | 17.7 ms | +10.0 % | −9.7 % | −17.9 % |
| TPOT p99 | 151.6 ms | 127.8 ms | 112.4 ms | −15.7 % | −25.9 % | −12.1 % |
| E2E mean | 8 180 ms | 7 967 ms | 7 184 ms | −2.6 % | −12.2 % | −9.8 % |
| E2E p50 | 1 942 ms | 1 995 ms | 1 806 ms | +2.7 % | −7.0 % | −9.5 % |
| E2E p90 | 23 475 ms | 21 254 ms | 17 931 ms | −9.5 % | −23.6 % | −15.6 % |
| E2E p99 | 73 709 ms | 76 630 ms | 71 958 ms | +4.0 % | −2.4 % | −6.1 % |
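The percentage columns in the table above are relative changes against the named baseline, so a negative value means lower latency. A minimal sketch of that arithmetic for the TTFT p90 row; the helper name `pct_delta` is illustrative and not taken from analyze_trace_replay.py.

```python
def pct_delta(value: float, baseline: float) -> float:
    """Relative change of `value` versus `baseline`, in percent (negative = faster)."""
    return (value - baseline) / baseline * 100.0


# TTFT p90 in milliseconds, copied from the table above.
ttft_p90 = {
    "unified": 11971,
    "unified_kv_both": 9744,
    "unified_kv_both_drfix": 7584,
}

mc_vs_plain = pct_delta(ttft_p90["unified_kv_both"], ttft_p90["unified"])
drfix_vs_plain = pct_delta(ttft_p90["unified_kv_both_drfix"], ttft_p90["unified"])
drfix_vs_mc = pct_delta(ttft_p90["unified_kv_both_drfix"], ttft_p90["unified_kv_both"])

print(f"mc vs plain:    {mc_vs_plain:+.1f} %")     # -18.6 %
print(f"drfix vs plain: {drfix_vs_plain:+.1f} %")  # -36.6 %
print(f"drfix vs mc:    {drfix_vs_mc:+.1f} %")     # -22.2 %
```

The two-step deltas do not add: −18.6 % followed by −22.2 % compounds to −36.6 %, because each percentage is taken against a different baseline.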

Why kv_both already beats plain (without DR-fix)

A connector-loaded vLLM has delay_free_blocks=True by default — block eviction is deferred until the connector's bookkeeping signals it is safe. On a 93 %-intra-session-reuse trace, this extends prefix-cache hit windows across session turns, which more than compensates for the per-step connector cost on the codebase as it exists today. With the DR-fix removing the remaining O(|cache|) tax, the net swings strongly positive.

This was also one of the explanations proposed in the cache_sweep report ("connector mode has higher effective cache utilisation") and is now confirmed at the trace-replay scale.

Reproducibility

bash microbench/connector_tax/cache_sweep/run_trace_replay_drfix.sh

Runtime: ~2.5 h on 8 × H20. The orchestrator applies CT_DR_FIX, runs the three policies serially (plain → mc baseline → mc drfix via env var), reverts the patch, and emits per-policy metrics.jsonl. Analyse with:

python microbench/connector_tax/cache_sweep/analyze_trace_replay.py \
       --root microbench/connector_tax/cache_sweep/results/trace_replay_20260526_1652
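For a quick sanity check without the analyser, the mean and tail percentiles can be recomputed directly from a per-policy metrics.jsonl. The sketch below is not the code in analyze_trace_replay.py. It assumes one JSON object per line and a numeric field named `ttft_ms`, which is a hypothetical name; substitute whatever key the replayer actually writes. It uses the nearest-rank percentile definition, so results may differ slightly from an interpolating implementation.

```python
import json
import math
from pathlib import Path


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with at least q % of samples <= it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def load_field(path, field):
    """Collect one numeric field from every non-empty line of a JSONL file."""
    samples = []
    with Path(path).open() as fh:
        for line in fh:
            line = line.strip()
            if line:
                samples.append(float(json.loads(line)[field]))
    return samples


def summarize(values):
    """Mean plus the p50/p90/p99 statistics reported in the tables."""
    return {
        "mean": sum(values) / len(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
    }
```

Usage would look like `summarize(load_field("unified/metrics.jsonl", "ttft_ms"))`, repeated for each of the three policy directories.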

Files

trace_replay_20260526_1652/
├── trace_replay_summary.json       — machine-readable per-config TTFT/TPOT/E2E
├── unified/                        — plain control
│   ├── metrics.jsonl               — per-request timings (1214 rows)
│   ├── metrics.summary.json        — replayer's own summary
│   ├── breakdown.json              — proxy per-decision metadata
│   ├── stats.json                  — proxy aggregate counters
│   └── run_window.json             — t_start/t_end + policy + trace
├── unified_kv_both/                — Mooncake kv_both, hash sync ON
└── unified_kv_both_drfix/          — Mooncake kv_both, hash sync OFF (env-gated)

Heavy artifacts (engine_state/, vllm logs, replayer.log, proxy.log) are .gitignored — re-derive with run_trace_replay_drfix.sh.