31cf8c9b1156390eddbe7f37746822fbb7f04152
Adds an env-gated skip for the per-step `set(cache.keys())` walk in
MooncakeConnectorScheduler.build_connector_meta() that was introduced in
our own commit a7df84b (Direct RDMA read). Re-runs the cache_sweep A/B
with three configs: plain (control), mooncake_both (baseline), and
mooncake_both_drfix (VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC=1).

Files:
  apply_direct_read_fix.py       one-line env-gate patch (markered revert)
  run_drfix.sh                   orchestrator for plain + mooncake_both + drfix
  analyze.py                     extended to compare mooncake_both_drfix vs
                                 plain and mooncake_both vs mooncake_both_drfix
  REPORT_DRFIX.md                findings
  results/20260526_1543_drfix/   run artifacts

Headline:

  config                | slope (μs/1k blocks) | step_dur p50 @ 16.6k
  ----------------------|----------------------|---------------------
  mooncake_both         | +81.0                | 1 550 μs
  mooncake_both_drfix   | -0.7 (≈ 0)           | 95 μs
  plain (control)       | -1.8 (≈ 0)           | 72 μs

build_meta p50 @ 16.6k blocks:
  mooncake_both       = 1 459 μs
  mooncake_both_drfix = 6 μs (residual loop bookkeeping)

worker get_finished p50:
  mooncake_both       = 178 μs (unchanged; this fix doesn't touch it)
  mooncake_both_drfix = 183 μs

The fix recovers 1 453 μs (99.6 %) of the scheduler-side cost at
|cache|=16.6k blocks. drfix's per-bin step_dur tracks plain within
±50 μs across the full cache range, which is noise-level. The slope goes
from +81 to essentially zero.

Worker-side get_finished (≈180 μs, constant) is unchanged because the
DR-fix touches scheduler.build_connector_meta only. That's the next
target if we want to bring kv_both fully back to plain-level.

Extrapolation to trace-replay (|cache|≈13k, APC≈79%):
  before:        build_meta 1 060 μs + get_finished 180 μs = 1.24 ms/step
  after DR-fix:  build_meta 6 μs + get_finished 180 μs = ~0.19 ms/step
  → 85% reduction in per-step connector cost
  → TPOT inflation drops from ~+18% to ~+3% on a 7 ms decode step

Confirms: the entire O(|cache|) slope was introduced by our own
direct-RDMA-read implementation (commit a7df84b), not upstream Mooncake.
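
For readers without the patch at hand, the shape of the env gate can be
sketched roughly as below. This is a minimal illustration, not the actual
patch: only the variable name VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC and the
`set(cache.keys())` walk come from this commit. The class SchedulerSketch,
the helper direct_read_sync_disabled(), the `_prev_keys` field, and the
added/removed meta layout are hypothetical, and reading the variable on
every call (rather than once at init) is an assumption.

```python
import os

# Name taken from this commit; everything else below is illustrative.
ENV_FLAG = "VLLM_MOONCAKE_DISABLE_DIRECT_READ_SYNC"


def direct_read_sync_disabled() -> bool:
    """Return True when the env gate asks us to skip the per-step sync."""
    return os.environ.get(ENV_FLAG, "0") == "1"


class SchedulerSketch:
    """Hypothetical stand-in for the scheduler-side connector."""

    def __init__(self, cache: dict):
        self.cache = cache          # block key -> block, assumed dict-like
        self._prev_keys = set()     # key snapshot from the previous step

    def build_connector_meta(self) -> dict:
        meta = {"added": [], "removed": []}
        if direct_read_sync_disabled():
            # Gate on: skip the O(|cache|) walk entirely.
            return meta
        # Gate off: full key walk plus set diff, O(|cache|) on every step.
        current = set(self.cache.keys())
        meta["added"] = sorted(current - self._prev_keys)
        meta["removed"] = sorted(self._prev_keys - current)
        self._prev_keys = current
        return meta
```

With the gate set, the per-step cost no longer depends on |cache|, which is
consistent with the flat slope measured for mooncake_both_drfix; the price is
that no direct-read delta is published at all while the gate is on.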

Production fix: gate the sync on the presence of any direct_read
consumer, or replace the per-step diff with an incremental delta listener
fed by block_pool add/remove callbacks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>