Files

Gahow Wang 31cf8c9b11 DR-fix A/B: env-gate hash sync drops slope from +81 to -0.7 μs/1k blocks

Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).

Files:
  apply_direct_read_fix.py  one-line env-gate patch (markered revert)
  run_drfix.sh              orchestrator for plain + mooncake_both + drfix
  analyze.py                extended to compare mooncake_both_drfix vs plain
                            and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md           findings
  results/20260526_1543_drfix/ run artifacts

Headline:

  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7  (≈ 0)          |    95 μs
  plain (control)       | -1.8  (≈ 0)          |    72 μs

  build_meta p50 @ 16.6k blocks:
    mooncake_both        = 1 459 μs
    mooncake_both_drfix  =     6 μs    (residual loop bookkeeping)

  worker get_finished p50:
    mooncake_both        = 178 μs    (unchanged; this fix doesn't touch it)
    mooncake_both_drfix  = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.

Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 00:03:23 +08:00

5.1 KiB

Raw Blame History

DR-fix A/B: hash-sync skip eliminates the O(|cache|) slope

Run: results/20260526_1543_drfix/ Compares three configs in a single orchestrated run (same vLLM process lifecycle order, same machine, same patch stack):

config	what it does
`plain`	no kv connector — control
`mooncake_both`	`kv_role=kv_both`, hash sync ON (baseline)
`mooncake_both_drfix`	same launcher, but `VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1`

The DR-fix patch (apply_direct_read_fix.py) replaces the line if self._block_pool is not None: at mooncake_connector.py:433 with an env-gated variant that lets us A/B that single code path without recompiling vLLM. The set(self._block_pool.cache.keys()) walk is the only thing it gates.

Headline

config	slope (μs / 1k blocks)	step_dur p50 @ \|cache\|=16.6k	build_meta p50
mooncake_both	+81.0	1 550 μs	1 459 μs
mooncake_both_drfix	−0.7 (≈ 0)	95 μs	6 μs
plain (control)	−1.8 (≈ 0)	72 μs	0

The DR-fix kills the slope. mooncake_both_drfix's curve overlays plain across the full |cache| range — see figure.png.

Savings at the cache ceiling

At |cache| ≈ 16.6 k blocks (where the prior run sat for ~ 80 % of its lifetime, and where the trace-replay run with APC≈79 % would sit on a same-config H20):

component	baseline mooncake_both	drfix	saved
`build_connector_meta` p50	1 459 μs	6 μs	−1 453 μs (−99.6 %)
total step_duration p50	1 550 μs	95 μs	−1 455 μs (−94 %)
worker `get_finished` p50	178 μs	183 μs	unchanged (this fix doesn't touch it)
worker `start_load_kv` p50	2 μs	2 μs	unchanged

So the patch did exactly what the source-code reading predicted: the O(|cache|) walk was the entire scheduler-side cost, and turning it off recovers all of it. get_finished overhead is untouched — that's a separate fix candidate.

Throughput (sanity check, not the focus)

config	requests completed in 241 s	effective rate
plain	322	1.34 req/s
mooncake_both	365	1.51 req/s
mooncake_both_drfix	384	1.59 req/s

Note the plain run had a transient inflight spike (t+90s inflight=15) that other configs did not — this is Poisson-arrival variance, not a real ordering. The per-step measurements (n ≥ 15 k decode steps per config) are far more reliable than the request-count totals for comparing across configs.

Slope decomposition at each cache bin

bin	cache mid	plain p50	mooncake_both p50	mooncake_both_drfix p50	drfix tax vs plain
1	2 629	71	—	85	+14 μs
2	4 382	124	655	94	−30 μs
3	6 135	134	809	121	−13 μs
4	7 888	90	1 157	101	+11 μs
5	9 640	134	981	150	+16 μs
6	11 393	109	1 052	160	+51 μs
7	13 146	124	1 228	158	+34 μs
8	14 899	128	1 298	132	+4 μs
9	16 652	72	1 550	95	+23 μs

mooncake_both_drfix sits within ±50 μs of plain at every bin — that's noise-level. The mooncake_both column rises monotonically with bin, drfix doesn't. This is the cleanest possible "ablation".

What this means for the trace-replay 45 %

The prior cache_sweep extrapolation said at |cache|≈13 k blocks (APC≈79 %) the per-step cost is ~ 1.24 ms (1 060 μs build_meta + 180 μs get_finished). With the DR fix:

build_meta (drfix) ≈   6 μs    ← reduced from ~1 060 μs
get_finished       ≈ 180 μs    ← unchanged
total              ≈ 186 μs

So the DR fix alone takes the per-step connector cost from ~1.24 ms to ~0.19 ms — an 85 % reduction. On a ~7 ms decode step that's TPOT inflation dropping from +18 % to +3 %.

If we also fix the get_finished constant (the second fix candidate listed in REPORT.md), per-step cost goes to plain-level ~0 — recovering the entire substrate tax in kv_both mode.

Reproducibility

cd microbench/connector_tax/cache_sweep
bash run_drfix.sh         # ~22 min on H20

The orchestrator applies v1+v2+DR_FIX patches, runs the three configs sequentially (the third with the env var set), reverts all patches on exit, and produces SUMMARY.md + figure.png.

Implications

The +85 μs / 1k blocks slope was 100 % from our own a7df84b direct-RDMA-read implementation, not Mooncake's upstream design. Disabling it via env var fully recovers the tax.
Direct-read is opt-in by request: the synthetic workload here never sets direct_read=True, so the hash sync was doing no useful work. Production should gate the sync on direct_read_consumers_present, or do it incrementally via block_pool callbacks rather than per-step diff.
Worker get_finished is the next target: still 180 μs/step constant in both mooncake_both and mooncake_both_drfix. Caused by two run_coroutine_threadsafe(...).result() blocking waits in kv_both mode even when both queues are empty.

5.1 KiB Raw Blame History Unescape Escape