61 Commits

Author SHA1 Message Date
ef9e0102ec Connector tax: trace-replay confirms +45% kv_both penalty is gone; DR-fix adds 22% more
Re-runs the elastic_migration_v2 trace (w600 r0.0015 st30, 1214 reqs,
274 sessions, 8×TP1 vLLM + cache_aware_proxy) with three configs:
- plain unified
- unified + Mooncake kv_both
- unified + Mooncake kv_both + DR-fix (env-gated O(|cache|) hash sync removal)

TTFT p90: 11.97 s → 9.74 s (−18.6%) → 7.58 s (−36.6% vs plain)
E2E p90:  23.48 s → 21.25 s (−9.5%) → 17.93 s (−23.6% vs plain)

Two findings:
1. The "+45% kv_both penalty" claim from elastic_migration_v2 is OBSOLETE
   on current codebase — kv_both is now *faster* than plain at p90.
   Likely fixed by e3a1d70 (RDMA-READ → bootstrap PUSH refactor) and
   the connector-mode delay_free_blocks extending cross-turn prefix
   cache hits on a 93%-intra-session-reuse trace.
2. DR-fix removes another 22% from TTFT p90 by skipping the
   O(|cache|) hash sync in build_connector_meta. Cache-sweep with
   DR-fix shows slope drops from +94.5 to +2.3 μs/1k blocks.

Adds:
- run_trace_replay_drfix.sh: A/B/C harness (env CT_DR_FIX gates patch)
- analyze_trace_replay.py: TTFT/TPOT/E2E delta analysis
- REPORT_TRACE_REPLAY.md: summary + reproduction
- results/20260526_1627_drfix/: cache-sweep with DR-fix
- results/trace_replay_20260526_1652/: full trace-replay A/B/C

Implication for EAR paper: the kv_both substrate is no longer the
bottleneck blocking session migration. The prior 4 migration reverts
were dominated by transfer overhead that has now been characterized
and (partially) removed.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 09:13:50 +08:00
31cf8c9b11 DR-fix A/B: env-gate hash sync drops slope from +81 to -0.7 μs/1k blocks
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).

Files:
  apply_direct_read_fix.py  one-line env-gate patch (markered revert)
  run_drfix.sh              orchestrator for plain + mooncake_both + drfix
  analyze.py                extended to compare mooncake_both_drfix vs plain
                            and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md           findings
  results/20260526_1543_drfix/ run artifacts

Headline:

  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7  (≈ 0)          |    95 μs
  plain (control)       | -1.8  (≈ 0)          |    72 μs

  build_meta p50 @ 16.6k blocks:
    mooncake_both        = 1 459 μs
    mooncake_both_drfix  =     6 μs    (residual loop bookkeeping)

  worker get_finished p50:
    mooncake_both        = 178 μs    (unchanged; this fix doesn't touch it)
    mooncake_both_drfix  = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.

Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 00:03:23 +08:00
8829928fc5 Cache-size sweep: build_meta is O(|cache|), +85.6 μs / 1k blocks
Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.

Patch v2 (apply_step_timing_v2.py) adds:
  scheduler: `cache_size` field in engine_step.jsonl
  worker:    `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
  uses BLOCK_BEGIN/END sentinels for safe multi-line revert
  (the original v1 patch survives this v2's apply/revert cycle)

Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.

Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):

  config         | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
  ---------------|------------------------|-----------------------------
  mooncake_both  | +85.6                  | 1528 μs (build_meta=1442, 94%)
  noop_connector | -0.8 (≈0)              |  79 μs
  plain          | +1.0 (≈0)              |  84 μs

  Worker-side get_finished p50/p90/p99 (μs/step):
    mooncake_both:  180 / 257 / 333
    noop_connector:   0 /   0 /   2

H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.

The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.

Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
  build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
  On ~7 ms decode forward → +15-20% TPOT per step.
  This explains most of the trace-replay +25% TPOT p90 gap from
  single-instance per-step cost alone, leaving a smaller residual
  for multi-instance coupling than originally assumed.

Two clear fixes pointed out in REPORT.md:
  1. replace O(|cache|) per-step walk with incremental delta
     listener using block_pool's add/remove callbacks
  2. short-circuit get_finished() when both producer/consumer
     queues are empty in kv_both

Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 23:34:21 +08:00
54de78eb11 Connector tax RESULTS.md: errata + run-to-run variance disclosure
The prior write-up presented one specific reading of the data as
the headline without flagging methodology gaps. Three corrections:

1. The "0% low-concurrency tax" comes from a single back-to-back
   mooncake_both_v2/plain_v2 rerun. The original Phase A pair
   showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2
   — a 40 percentage-point swing between two consecutive runs
   that the original write-up did not call out. The run-to-run
   noise floor is too high to claim "0%" at low concurrency.

2. get_finished() was never instrumented. The patch only times
   step_duration_us and build_meta_us. "100% of per-step cost is
   build_meta" is an upper bound on what was timed, not a true
   decomposition.

3. H5 (cache-size dependence) was the central hypothesis but
   was never tested in the prior run; random content kept APC
   near empty.

The +7-9% high-concurrency (single instance, 512x64, rate=8-16)
and +17% 8-instance-saturated numbers are kept; they were
measured with adequate sample sizes and are reproducible.

The follow-up sweep in cache_sweep/ tests H5 directly and
revises the decomposition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 23:33:01 +08:00
e3480f7d28 8-instance connector tax: +2% at non-saturated, +17% only at saturation
8×TP1 + load_only proxy, shape 512×64, rates 32/64/128 req/s total:

  Rate=32 (non-saturated, thr=0.95-0.97):
    plain TTFT p90=64ms,  mooncake_both=65ms  → +2% (noise)
  Rate=64 (non-saturated, thr=0.96):
    plain TTFT p90=114ms, mooncake_both=107ms → -6% (noise)
  Rate=128 (saturated, thr=0.70-0.71):
    plain TTFT p90=702ms, mooncake_both=822ms → +17%
    plain TTFT p50=339ms, mooncake_both=470ms → +39%

Conclusion: The elastic_migration_v2 +45% is a saturation artifact.
Under SLO-compliant load (TTFT<10s, thr_ratio>0.9), mooncake_both's
1.4ms/step build_connector_meta overhead is completely masked by the
scheduler-model async pipeline. The tax only manifests when the system
is already saturated and queueing amplifies per-step differences.

For practical deployment: enabling kv_role=kv_both has effectively zero
cost as long as the serving system stays within SLO capacity bounds.
2026-05-26 21:32:46 +08:00
c8ec73c548 Connector tax: high-concurrency confirms +7-9% tax, resolves trace-replay gap
High-concurrency test (512 input, 64 output, rates 4-32 req/s):
  Rate=8:  plain TTFT p90=94ms, mooncake_both=102ms → +9% tax
  Rate=16: plain TTFT p90=144ms, mooncake_both=156ms → +8% tax
  Rate=32: both saturated at ~6.1s → no distinguishable difference

Low-concurrency back-to-back retest (4096 input, 256 output):
  mooncake_both_v2 vs plain_v2: tax is ≈0% (within noise)
  because scheduler's 1.4ms/step is hidden behind model forward.

Decomposition of trace-replay's +45%:
  +7-9% from build_connector_meta per-step cost (this microbench)
  +20-30% from multi-instance coupling amplification (not measurable here)
  remainder from large-cache O(|cache|) scaling (Phase B follow-up)

Also: bench_loop.py now emits mean/p50/p90/p99 for all three metrics.
2026-05-26 21:00:25 +08:00
a473c71cac Connector tax Phase A: build_connector_meta is 1.4ms/step (the tax source)
Per-step timing from engine_step.jsonl definitively resolves H3:
  plain:            53 μs/step (p50)
  noop_connector:   69 μs/step (+16 μs = negligible framework cost)
  mooncake_producer: 1461 μs/step (build_connector_meta = 1386 μs)
  mooncake_both:    1452 μs/step (same as producer)

The substrate tax is NOT in the v1 framework — it's specifically in
Mooncake's build_connector_meta() which walks set(cache.keys()) every
scheduler step (O(|cache|) per step, E2 audit §6.5).

Accumulated per-request tax: 256 decode steps × 1.4ms = 358ms.
Observed TTFT tax at rate=1.0: plain 378ms vs mooncake_both 422ms (+12%).
At rate=2.0 (near saturation): +29%, approaching trace-replay's +45%.

Also fixes kill_vllm() to properly kill EngineCore subprocesses.
2026-05-26 19:33:15 +08:00
297fed6e73 Microbench 3 (connector_tax): infrastructure for KV connector substrate tax
Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.

Configurations (8):
  plain, noop_connector, mooncake_{producer,consumer,both},
  nixl_both, lmcache_only, multi_mooncake_lmcache.

Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).

Workload: two-phase sweep
  Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
  Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})

Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.

run_all.sh runs as 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot

Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
2026-05-26 17:27:41 +08:00
06dd175441 Microbench 1 plots: prefill-decode interference heatmap + lines
plot_interference.py reads the interference sweep summary (4 D × 4 P × 3 reps,
cold prefill prompts) and produces:

  fig_interference_heatmap.png
    TPOT p90 interference index over (D, P): 14x at D=8 P=2k → 214x at D=1 P=32k.

  fig_interference_lines.png
    (a) TPOT p90 during prefill vs P, log-y, one line per D + baseline dashed
    (b) Cold prefill TTFT vs P (interference window length)

Confirms B2 finding: cold prefill on the same worker stalls overlapping
decodes for 14-214x baseline TPOT. The interference window grows linearly
with P (from ~140ms at 2k to ~4.6s at 32k) and is essentially independent
of decode batch size — prefill compute time dominates.
2026-05-26 14:21:30 +08:00
72790ae6c1 PD-sep server-side profiling: vLLM patches + per-request breakdown
Instrumentation patches (microbench/patches/):
  - pd_profile.py: shared event emitter (VLLM_PD_PROFILE_LOG env var)
  - apply_patches.py: idempotent patch installer for mooncake_connector.py
    and scheduler.py, marks insertions with # PD_PROFILE_PATCH
  - analyze_events.py: joins per-process JSONL event logs by transfer_id
    into per-request phase durations

Seven events captured per request:
  D_get_num_matched → P_zmq_received → P_prefill_done →
  P_rdma_start → P_rdma_end → D_recv_complete → D_request_promoted

Driver fix (microbench/lifecycle/driver.py):
  seed_prefix_cache now sends via the proxy URL so P and D both cache
  the seeded prefix with matching block hashes. Previously seeding D
  directly produced different block hashes than the proxy-routed
  measurement requests, making incremental transfer impossible.

Real breakdown (fig_breakdown_real.png, server_breakdown.csv, n=93):
  prefill_compute  620 ms median (95% of overhead)
  rdma_transfer     42 ms median (~71 Gbps effective)
  other overhead    10 ms median (dispatch + params + signal + promote)

Mooncake transfer is NOT the bottleneck. Even with bulk RDMA the
transfer cost is <10% of prefill cost for Qwen3-30B-A3B on H20.
2026-05-26 13:59:09 +08:00
f784e49c07 Microbench: prefill-decode interference + PD transfer lifecycle
Two microbenchmarks quantifying the elastic offload decision:

1. Interference (corrected): cold prefill causes 14-214x TPOT p90
   degradation on same-worker decode (D∈{1,2,4,8} × P∈{2k,8k,16k,32k}).
   Earlier run had a prefix-cache bug (deterministic prompts hit cache
   after rep 0); fixed with uuid+time_ns unique prompts.

2. Transfer lifecycle: PD-sep TTFT breakdown via Mooncake proxy,
   measuring prefill→RDMA→decode startup overhead.

Key finding: offload wins at all P≥2048 operating points —
transfer cost is 25-50% of interference cost even with bulk Mooncake.
2026-05-26 00:57:06 +08:00