Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced in
our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep A/B
with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).

Files:
  apply_direct_read_fix.py      one-line env-gate patch (markered revert)
  run_drfix.sh                  orchestrator for plain + mooncake_both + drfix
  analyze.py                    extended to compare mooncake_both_drfix vs plain
                                and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md               findings
  results/20260526_1543_drfix/  run artifacts

Headline:
  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7 (≈ 0)           | 95 μs
  plain (control)       | -1.8 (≈ 0)           | 72 μs

  build_meta p50 @ 16.6k blocks:
    mooncake_both       = 1 459 μs
    mooncake_both_drfix = 6 μs (residual loop bookkeeping)

  worker get_finished p50:
    mooncake_both       = 178 μs (unchanged; this fix doesn't touch it)
    mooncake_both_drfix = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within about
±50 μs across the full cache range, which is noise-level. The slope goes
from +81 to essentially zero.

Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before:       build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream Mooncake.

Production fix: gate the sync on the presence of any direct_read
consumer, or replace per-step diff with an incremental delta listener
fed by block_pool add/remove callbacks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

DR-fix A/B: hash-sync skip eliminates the O(|cache|) slope
Run: results/20260526_1543_drfix/
Compares three configs in a single orchestrated run (same vLLM
process lifecycle order, same machine, same patch stack):
| config | what it does |
|---|---|
| `plain` | no kv connector (control) |
| `mooncake_both` | kv_role=kv_both, hash sync ON (baseline) |
| `mooncake_both_drfix` | same launcher, but `VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1` |
The DR-fix patch (`apply_direct_read_fix.py`) replaces the line
`if self._block_pool is not None:` at
`mooncake_connector.py:433` with an env-gated variant that lets us
A/B that single code path without recompiling vLLM. The
`set(self._block_pool.cache.keys())` walk is the only thing it gates.
Headline
| config | slope (μs / 1k blocks) | step_dur p50 @ \|cache\|=16.6k | build_meta p50 |
|---|---|---|---|
| mooncake_both | +81.0 | 1 550 μs | 1 459 μs |
| mooncake_both_drfix | −0.7 (≈ 0) | 95 μs | 6 μs |
| plain (control) | −1.8 (≈ 0) | 72 μs | 0 |
The DR-fix kills the slope. mooncake_both_drfix's curve overlays
plain across the full |cache| range — see figure.png.
Savings at the cache ceiling
At |cache| ≈ 16.6 k blocks (where the prior run sat for ~ 80 % of its lifetime, and where the trace-replay run with APC≈79 % would sit on a same-config H20):
| component | baseline mooncake_both | drfix | saved |
|---|---|---|---|
| `build_connector_meta` p50 | 1 459 μs | 6 μs | −1 453 μs (−99.6 %) |
| total step_duration p50 | 1 550 μs | 95 μs | −1 455 μs (−94 %) |
| worker `get_finished` p50 | 178 μs | 183 μs | unchanged (this fix doesn't touch it) |
| worker `start_load_kv` p50 | 2 μs | 2 μs | unchanged |
So the patch did exactly what the source-code reading
predicted: the O(|cache|) walk was the entire scheduler-side cost,
and turning it off recovers all of it. get_finished overhead is
untouched — that's a separate fix candidate.
Throughput (sanity check, not the focus)
| config | requests completed in 241 s | effective rate |
|---|---|---|
| plain | 322 | 1.34 req/s |
| mooncake_both | 365 | 1.51 req/s |
| mooncake_both_drfix | 384 | 1.59 req/s |
Note that the plain run had a transient in-flight spike (inflight=15 at
t+90 s) that the other configs did not. This is Poisson-arrival variance,
so the throughput ordering above is not meaningful. The per-step
measurements (n ≥ 15 k decode steps per config) are far more reliable
than the request-count totals for comparing configs.
Slope decomposition at each cache bin
| bin | cache mid (blocks) | plain p50 (μs) | mooncake_both p50 (μs) | mooncake_both_drfix p50 (μs) | drfix tax vs plain |
|---|---|---|---|---|---|
| 1 | 2 629 | 71 | — | 85 | +14 μs |
| 2 | 4 382 | 124 | 655 | 94 | −30 μs |
| 3 | 6 135 | 134 | 809 | 121 | −13 μs |
| 4 | 7 888 | 90 | 1 157 | 101 | +11 μs |
| 5 | 9 640 | 134 | 981 | 150 | +16 μs |
| 6 | 11 393 | 109 | 1 052 | 160 | +51 μs |
| 7 | 13 146 | 124 | 1 228 | 158 | +34 μs |
| 8 | 14 899 | 128 | 1 298 | 132 | +4 μs |
| 9 | 16 652 | 72 | 1 550 | 95 | +23 μs |
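mooncake_both_drfix sits within roughly ±50 μs of plain at every bin (largest gap: +51 μs at bin 6), which is noise-level. The mooncake_both column climbs from 655 μs to 1 550 μs as the cache grows (with a dip at bin 5), while drfix shows no such trend. This is about as clean an ablation as this setup allows.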
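The "drfix tax vs plain" column can be recomputed directly from the two p50 columns. The short check below copies the values from the table above; the variable names are illustrative only.

```python
# p50 step durations in μs, copied from the table above (bins 1..9).
plain_p50 = [71, 124, 134, 90, 134, 109, 124, 128, 72]
drfix_p50 = [85, 94, 121, 101, 150, 160, 158, 132, 95]

# Per-bin tax: drfix minus plain.
deltas = [d - p for d, p in zip(drfix_p50, plain_p50)]

# Bin with the largest absolute gap (bins are 1-indexed in the table).
worst_bin, worst_delta = max(
    enumerate(deltas, start=1), key=lambda item: abs(item[1])
)
mean_delta = sum(deltas) / len(deltas)

print(deltas)  # [14, -30, -13, 11, 16, 51, 34, 4, 23]
print(f"largest gap: {worst_delta:+d} μs at bin {worst_bin}")  # +51 μs at bin 6
print(f"mean gap: {mean_delta:+.1f} μs")  # +12.2 μs
```

The gaps change sign from bin to bin and do not grow with cache size, which is what one would expect from noise rather than a residual O(|cache|) term.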
What this means for the trace-replay 45 %
The prior cache_sweep extrapolation said at |cache|≈13 k blocks (APC≈79 %) the per-step cost is ~ 1.24 ms (1 060 μs build_meta + 180 μs get_finished). With the DR fix:
build_meta (drfix) ≈ 6 μs ← reduced from ~1 060 μs
get_finished ≈ 180 μs ← unchanged
total ≈ 186 μs
So the DR fix alone takes the per-step connector cost from ~1.24 ms to ~0.19 ms — an 85 % reduction. On a ~7 ms decode step that's TPOT inflation dropping from +18 % to +3 %.
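The arithmetic behind these figures can be reproduced in a few lines. The inputs are the numbers quoted above; the helper name and the definition of TPOT inflation as connector cost divided by decode-step time are assumptions chosen to match the quoted percentages.

```python
def connector_cost_us(build_meta_us, get_finished_us):
    """Per-step connector cost: scheduler-side plus worker-side, in μs."""
    return build_meta_us + get_finished_us


DECODE_STEP_US = 7_000  # ~7 ms decode step, as quoted in the report

before_us = connector_cost_us(build_meta_us=1_060, get_finished_us=180)
after_us = connector_cost_us(build_meta_us=6, get_finished_us=180)

reduction = 1 - after_us / before_us
inflation_before = before_us / DECODE_STEP_US
inflation_after = after_us / DECODE_STEP_US

print(f"before: {before_us} μs/step, after: {after_us} μs/step")  # 1240, 186
print(f"reduction: {reduction:.0%}")                              # 85%
print(f"TPOT inflation: {inflation_before:.1%} -> {inflation_after:.1%}")  # 17.7% -> 2.7%
```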
If we also fix the get_finished constant (the second fix
candidate listed in REPORT.md), per-step cost goes to plain-level
~0 — recovering the entire substrate tax in kv_both mode.
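One way the `get_finished` constant might be removed, assuming it comes from synchronous cross-thread waits on an asyncio loop (as the Implications section states), is a cheap emptiness check before any round trip. The sketch below is hypothetical: `WorkerSketch`, `mark_finished`, and the `_pending` counter are illustrative names, not the connector's actual API. Only `asyncio.run_coroutine_threadsafe(...).result()` mirrors the call named in this report.

```python
import asyncio
import threading


class WorkerSketch:
    """Hypothetical worker whose finished-request sets live on a loop thread."""

    def __init__(self):
        self._loop = asyncio.new_event_loop()
        self._thread = threading.Thread(target=self._loop.run_forever, daemon=True)
        self._thread.start()
        self._sent = set()   # only mutated on the loop thread
        self._recv = set()   # only mutated on the loop thread
        self._pending = 0    # guarded by _lock
        self._lock = threading.Lock()
        self.blocking_waits = 0  # instrumentation for this example only

    async def _drain(self, target):
        done = set(target)
        target.clear()
        return done

    def mark_finished(self, req_id, sending):
        target = self._sent if sending else self._recv
        # Schedule the add first, then bump the counter, so a concurrent
        # get_finished can never reset the counter and miss this request.
        self._loop.call_soon_threadsafe(target.add, req_id)
        with self._lock:
            self._pending += 1

    def get_finished(self):
        with self._lock:
            if self._pending == 0:
                return set(), set()  # fast path: no cross-thread wait
            self._pending = 0
        results = []
        for target in (self._sent, self._recv):
            self.blocking_waits += 1
            fut = asyncio.run_coroutine_threadsafe(self._drain(target), self._loop)
            results.append(fut.result())
        return results[0], results[1]

    def close(self):
        self._loop.call_soon_threadsafe(self._loop.stop)
        self._thread.join()
        self._loop.close()


worker = WorkerSketch()
empty_result = worker.get_finished()      # nothing pending: no blocking wait
waits_when_empty = worker.blocking_waits
worker.mark_finished("req-1", sending=True)
busy_result = worker.get_finished()       # two blocking waits, as before
waits_when_busy = worker.blocking_waits
worker.close()
```

The counter can over-report (costing one unnecessary round trip) but never under-report, so the fast path should not drop a finished request under these assumptions.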
Reproducibility
cd microbench/connector_tax/cache_sweep
bash run_drfix.sh # ~22 min on H20
The orchestrator applies v1+v2+DR_FIX patches, runs the three configs sequentially (the third with the env var set), reverts all patches on exit, and produces SUMMARY.md + figure.png.
Implications
- The +81 μs / 1k blocks slope came entirely from our own a7df84b direct-RDMA-read implementation, not from Mooncake's upstream design. Disabling it via the env var fully recovers the tax.
- Direct-read is opt-in per request: the synthetic workload here never sets `direct_read=True`, so the hash sync was doing no useful work. Production should gate the sync on `direct_read_consumers_present`, or do it incrementally via block_pool callbacks rather than a per-step diff.
- Worker `get_finished` is the next target: it is still a constant 180 μs/step in both mooncake_both and mooncake_both_drfix. It is caused by two `run_coroutine_threadsafe(...).result()` blocking waits in `kv_both` mode, even when both queues are empty.
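The incremental alternative to the per-step diff could look roughly like the sketch below. It is a hypothetical design, not existing vLLM or Mooncake code: `HashDeltaTracker`, `on_block_added`, `on_block_removed`, and `take_delta` are made-up names, and it assumes the block pool can invoke a callback exactly once per cache insertion and eviction.

```python
class HashDeltaTracker:
    """Hypothetical listener fed by block-pool add/remove callbacks.

    Work per step is proportional to the number of cache changes,
    not to the total cache size.
    """

    def __init__(self):
        self._added = set()
        self._removed = set()

    def on_block_added(self, block_hash):
        # A hash evicted and re-cached within one step nets out to no change.
        if block_hash in self._removed:
            self._removed.discard(block_hash)
        else:
            self._added.add(block_hash)

    def on_block_removed(self, block_hash):
        # A hash cached and evicted within one step nets out to no change.
        if block_hash in self._added:
            self._added.discard(block_hash)
        else:
            self._removed.add(block_hash)

    def take_delta(self):
        """Return (added, removed) since the last call and reset the tracker."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


tracker = HashDeltaTracker()

# Step 1: two blocks enter the cache.
tracker.on_block_added("a")
tracker.on_block_added("b")
step1 = tracker.take_delta()

# Step 2: "a" is evicted and re-cached, "c" is cached and evicted.
tracker.on_block_removed("a")
tracker.on_block_added("a")
tracker.on_block_added("c")
tracker.on_block_removed("c")
step2 = tracker.take_delta()

# Step 3: "b" is evicted.
tracker.on_block_removed("b")
step3 = tracker.take_delta()

print(step1, step2, step3)
```

A scheduler using this would call `take_delta()` once per step in place of the full `set(cache.keys())` snapshot, and could skip even that when no direct-read consumer is present.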
