Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced
in our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep
A/B with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).
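
For reviewers, the env gate can be pictured roughly as below. This is a
minimal sketch, not the actual patch: only the variable name
VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC and the `set(cache.keys())` walk
come from this commit. The helper names, the `published` argument, and
the exact "== 1" parsing are assumptions made for illustration.

```python
import os

# Env var name is taken from this commit; everything else is hypothetical.
_DISABLE_ENV = "VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC"


def direct_read_sync_disabled(environ=os.environ) -> bool:
    """True when the per-step direct-read sync should be skipped."""
    return environ.get(_DISABLE_ENV, "0") == "1"


def sync_direct_read_keys(cache: dict, published: set, environ=os.environ) -> set:
    """Stand-in for the sync step inside build_connector_meta().

    Returns the key set that would be published this step. With the gate
    set, the O(|cache|) walk is skipped and the previous set is reused.
    """
    if direct_read_sync_disabled(environ):
        return published
    return set(cache.keys())  # the per-step walk that this commit gates
```

With the variable unset the walk still runs every step, so the default
behaviour is unchanged and only the drfix config opts out.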

Files:
  apply_direct_read_fix.py        one-line env-gate patch (markered revert)
  run_drfix.sh                    orchestrator for plain + mooncake_both + drfix
  analyze.py                      extended to compare mooncake_both_drfix vs plain
                                  and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md                 findings
  results/20260526_1543_drfix/    run artifacts

Headline:
  config               | slope (μs/1k blocks) | step_dur p50 @ 16.6k blocks
  ---------------------|----------------------|----------------------------
  mooncake_both        | +81.0                | 1 550 μs
  mooncake_both_drfix  | -0.7 (≈ 0)           | 95 μs
  plain (control)      | -1.8 (≈ 0)           | 72 μs

build_meta p50 @ 16.6k blocks:
  mooncake_both       = 1 459 μs
  mooncake_both_drfix =     6 μs (residual loop bookkeeping)

worker get_finished p50:
  mooncake_both       = 178 μs (unchanged; this fix doesn't touch it)
  mooncake_both_drfix = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side build_meta
cost at |cache|=16.6k blocks (1 459 μs down to 6 μs). drfix's per-bin
step_dur tracks plain within ±50 μs across the full cache range, which
is noise-level. The slope goes from +81 μs/1k blocks to essentially
zero.

Worker-side get_finished (≈180 μs, constant) is unchanged because the
DR-fix touches only MooncakeConnectorScheduler.build_connector_meta().
That's the next target if we want to bring kv_both fully back to
plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before:       build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix: build_meta     6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step
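
The arithmetic behind the extrapolation can be re-checked with the
short script below. All inputs are the figures quoted in this message.
The 7 ms decode step is the stated assumption, not a new measurement,
and the constant names are made up for illustration.

```python
# Figures copied from this commit message (microseconds).
BUILD_META_BEFORE_US = 1060   # build_meta at |cache|≈13k, before the fix
BUILD_META_AFTER_US = 6       # build_meta with the DR-fix
GET_FINISHED_US = 180         # worker-side cost, unchanged by the fix
DECODE_STEP_US = 7000         # assumed 7 ms decode step

before_us = BUILD_META_BEFORE_US + GET_FINISHED_US
after_us = BUILD_META_AFTER_US + GET_FINISHED_US

# Fraction of per-step connector cost removed by the fix.
reduction = 1 - after_us / before_us

# Connector cost as a fraction of one decode step (TPOT inflation).
inflation_before = before_us / DECODE_STEP_US
inflation_after = after_us / DECODE_STEP_US

print(f"before: {before_us} us/step, after: {after_us} us/step")
print(f"reduction: {reduction:.0%}")
print(f"TPOT inflation: {inflation_before:.1%} -> {inflation_after:.1%}")
```

This gives 1 240 μs and 186 μs per step, an 85% reduction, and inflation
of 17.7% and 2.7%, matching the rounded ~+18% and ~+3% above.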

This confirms that the entire O(|cache|) slope was introduced by our
own direct-RDMA-read implementation (commit a7df84b), not by upstream
Mooncake. Production fix: gate the sync on the presence of any
direct_read consumer, or replace the per-step diff with an incremental
delta listener fed by block_pool add/remove callbacks.
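
One possible shape for the incremental delta listener is sketched
below. It is an illustration only: the class and method names are
hypothetical, it is not the vLLM block_pool or Mooncake API, and it
assumes the pool reports every add and remove exactly once.

```python
class DirectReadKeyTracker:
    """Hypothetical tracker that accumulates key deltas between steps."""

    def __init__(self):
        self._added = set()
        self._removed = set()

    def on_block_added(self, key):
        # Re-adding a key removed earlier in the same step nets to no change.
        if key in self._removed:
            self._removed.discard(key)
        else:
            self._added.add(key)

    def on_block_removed(self, key):
        # Removing a key added earlier in the same step nets to no change.
        if key in self._added:
            self._added.discard(key)
        else:
            self._removed.add(key)

    def drain(self):
        """Return (added, removed) since the last drain, then reset."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed
```

build_connector_meta() would then call drain() once per step, so the
per-step cost scales with the number of changed blocks rather than with
|cache|.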
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>