Gahow Wang 8829928fc5 Cache-size sweep: build_meta is O(|cache|), +85.6 μs / 1k blocks
Follow-up to Microbench 3 that finally tests H5 (cache-size
dependence) and instruments worker-side connector callbacks the
original patch missed.

Patch v2 (apply_step_timing_v2.py) adds:
  scheduler: `cache_size` field in engine_step.jsonl
  worker:    `get_finished_us` + `start_load_kv_us` in worker_step.r0.jsonl
  uses BLOCK_BEGIN/END sentinels for safe multi-line revert
  (the original v1 patch survives this v2's apply/revert cycle)

Driver: continuous open-loop (1.5 req/s, 4096x256 random per req)
that lets APC fill from 0 → ceiling within one vLLM lifetime so a
single run produces the full cache_size sweep. Decode-only steps
are filtered post-hoc to remove prefill-mix variance.

Findings (H20 96GB, ceiling reached ~17.5k blocks; n=15-18k decode
steps per config):

  config         | slope (μs / 1k blocks) | step_dur p50 @ |cache|=16.6k
  ---------------|------------------------|-----------------------------
  mooncake_both  | +85.6                  | 1528 μs (build_meta=1442, 94%)
  noop_connector | -0.8 (≈0)              |  79 μs
  plain          | +1.0 (≈0)              |  84 μs

  Worker-side get_finished p50/p90/p99 (μs/step):
    mooncake_both:  180 / 257 / 333
    noop_connector:   0 /   0 /   2

H5 PASSES. mooncake_both step_duration scales linearly with |cache|
because build_connector_meta walks set(cache.keys()) every step
(`mooncake_connector.py:434-450`). plain and noop are flat.

The previously-uninstrumented get_finished() adds a constant
180 μs/step on top — two `run_coroutine_threadsafe(...).result()`
blocking waits in kv_both mode (`mooncake_connector.py:1107-1137`)
fire every step even when no transfer is pending.

Trace-replay reconciliation (APC ≈ 79% → |cache| ≈ 13k blocks):
  build_meta @ 13k ≈ 1060 μs + get_finished ≈ 180 μs = 1.24 ms/step
  On ~7 ms decode forward → +15-20% TPOT per step.
  This explains most of the trace-replay +25% TPOT p90 gap from
  single-instance per-step cost alone, leaving a smaller residual
  for multi-instance coupling than originally assumed.

Two clear fixes pointed out in REPORT.md:
  1. replace O(|cache|) per-step walk with incremental delta
     listener using block_pool's add/remove callbacks
  2. short-circuit get_finished() when both producer/consumer
     queues are empty in kv_both

Heavy raw artifacts (engine_step.jsonl, vllm_stdout/stderr,
.vllm.pid) are .gitignored — they re-derive from `bash run_all.sh`
and SUMMARY.md / per_config.json fully capture the conclusions.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 23:34:21 +08:00
Description
No description provided
48 MiB
Languages
Python 82.9%
Shell 17.1%