Files
agentic-kvc/microbench/connector_tax/cache_sweep/REPORT_DRFIX.md
Gahow Wang 31cf8c9b11 DR-fix A/B: env-gate hash sync drops slope from +81 to -0.7 μs/1k blocks
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).

Files:
  apply_direct_read_fix.py  one-line env-gate patch (markered revert)
  run_drfix.sh              orchestrator for plain + mooncake_both + drfix
  analyze.py                extended to compare mooncake_both_drfix vs plain
                            and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md           findings
  results/20260526_1543_drfix/ run artifacts

Headline:

  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7  (≈ 0)          |    95 μs
  plain (control)       | -1.8  (≈ 0)          |    72 μs

  build_meta p50 @ 16.6k blocks:
    mooncake_both        = 1 459 μs
    mooncake_both_drfix  =     6 μs    (residual loop bookkeeping)

  worker get_finished p50:
    mooncake_both        = 178 μs    (unchanged; this fix doesn't touch it)
    mooncake_both_drfix  = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range — that's noise-level. The slope
goes from +81 to essentially zero.

Worker-side get_finished (180 μs constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before: build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 00:03:23 +08:00

5.1 KiB
Raw Blame History

DR-fix A/B: hash-sync skip eliminates the O(|cache|) slope

Run: results/20260526_1543_drfix/ Compares three configs in a single orchestrated run (same vLLM process lifecycle order, same machine, same patch stack):

config what it does
plain no kv connector — control
mooncake_both kv_role=kv_both, hash sync ON (baseline)
mooncake_both_drfix same launcher, but VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1

The DR-fix patch (apply_direct_read_fix.py) replaces the line if self._block_pool is not None: at mooncake_connector.py:433 with an env-gated variant that lets us A/B that single code path without recompiling vLLM. The set(self._block_pool.cache.keys()) walk is the only thing it gates.

Headline

config slope (μs / 1k blocks) step_dur p50 @ |cache|=16.6k build_meta p50
mooncake_both +81.0 1 550 μs 1 459 μs
mooncake_both_drfix 0.7 (≈ 0) 95 μs 6 μs
plain (control) 1.8 (≈ 0) 72 μs 0

The DR-fix kills the slope. mooncake_both_drfix's curve overlays plain across the full |cache| range — see figure.png.

figure

Savings at the cache ceiling

At |cache| ≈ 16.6 k blocks (where the prior run sat for ~ 80 % of its lifetime, and where the trace-replay run with APC≈79 % would sit on a same-config H20):

component baseline mooncake_both drfix saved
build_connector_meta p50 1 459 μs 6 μs 1 453 μs (99.6 %)
total step_duration p50 1 550 μs 95 μs 1 455 μs (94 %)
worker get_finished p50 178 μs 183 μs unchanged (this fix doesn't touch it)
worker start_load_kv p50 2 μs 2 μs unchanged

So the patch did exactly what the source-code reading predicted: the O(|cache|) walk was the entire scheduler-side cost, and turning it off recovers all of it. get_finished overhead is untouched — that's a separate fix candidate.

Throughput (sanity check, not the focus)

config requests completed in 241 s effective rate
plain 322 1.34 req/s
mooncake_both 365 1.51 req/s
mooncake_both_drfix 384 1.59 req/s

Note the plain run had a transient inflight spike (t+90s inflight=15) that other configs did not — this is Poisson-arrival variance, not a real ordering. The per-step measurements (n ≥ 15 k decode steps per config) are far more reliable than the request-count totals for comparing across configs.

Slope decomposition at each cache bin

bin cache mid plain p50 mooncake_both p50 mooncake_both_drfix p50 drfix tax vs plain
1 2 629 71 85 +14 μs
2 4 382 124 655 94 30 μs
3 6 135 134 809 121 13 μs
4 7 888 90 1 157 101 +11 μs
5 9 640 134 981 150 +16 μs
6 11 393 109 1 052 160 +51 μs
7 13 146 124 1 228 158 +34 μs
8 14 899 128 1 298 132 +4 μs
9 16 652 72 1 550 95 +23 μs

mooncake_both_drfix sits within ±50 μs of plain at every bin — that's noise-level. The mooncake_both column rises monotonically with bin, drfix doesn't. This is the cleanest possible "ablation".

What this means for the trace-replay 45 %

The prior cache_sweep extrapolation said at |cache|≈13 k blocks (APC≈79 %) the per-step cost is ~ 1.24 ms (1 060 μs build_meta + 180 μs get_finished). With the DR fix:

build_meta (drfix) ≈   6 μs    ← reduced from ~1 060 μs
get_finished       ≈ 180 μs    ← unchanged
total              ≈ 186 μs

So the DR fix alone takes the per-step connector cost from ~1.24 ms to ~0.19 ms — an 85 % reduction. On a ~7 ms decode step that's TPOT inflation dropping from +18 % to +3 %.

If we also fix the get_finished constant (the second fix candidate listed in REPORT.md), per-step cost goes to plain-level ~0 — recovering the entire substrate tax in kv_both mode.

Reproducibility

cd microbench/connector_tax/cache_sweep
bash run_drfix.sh         # ~22 min on H20

The orchestrator applies v1+v2+DR_FIX patches, runs the three configs sequentially (the third with the env var set), reverts all patches on exit, and produces SUMMARY.md + figure.png.

Implications

  1. The +85 μs / 1k blocks slope was 100 % from our own a7df84b direct-RDMA-read implementation, not Mooncake's upstream design. Disabling it via env var fully recovers the tax.
  2. Direct-read is opt-in by request: the synthetic workload here never sets direct_read=True, so the hash sync was doing no useful work. Production should gate the sync on direct_read_consumers_present, or do it incrementally via block_pool callbacks rather than per-step diff.
  3. Worker get_finished is the next target: still 180 μs/step constant in both mooncake_both and mooncake_both_drfix. Caused by two run_coroutine_threadsafe(...).result() blocking waits in kv_both mode even when both queues are empty.