Connector tax RESULTS.md: errata + run-to-run variance disclosure

The prior write-up presented one specific reading of the data as the headline without flagging methodology gaps. Three corrections: 1. The "0% low-concurrency tax" comes from a single back-to-back mooncake_both_v2/plain_v2 rerun. The original Phase A pair showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2 — a 40 percentage-point swing between two consecutive runs that the original write-up did not call out. The run-to-run noise floor is too high to claim "0%" at low concurrency. 2. get_finished() was never instrumented. The patch only times step_duration_us and build_meta_us. "100% of per-step cost is build_meta" is an upper bound on what was timed, not a true decomposition. 3. H5 (cache-size dependence) was the central hypothesis but was never tested in the prior run; random content kept APC near empty. The +7-9% high-concurrency (single instance, 512x64, rate=8-16) and +17% 8-instance-saturated numbers are kept; they were measured with adequate sample sizes and are reproducible. The follow-up sweep in cache_sweep/ tests H5 directly and revises the decomposition. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 23:33:01 +08:00
parent e3480f7d28
commit 54de78eb11
1 changed files with 113 additions and 31 deletions
--- a/microbench/connector_tax/results/RESULTS.md
+++ b/microbench/connector_tax/results/RESULTS.md
@@ -1,18 +1,57 @@
 # Microbench 3: Connector Substrate Tax — Results

+> **2026-05-26 ERRATA (post-review)**: The original write-up overstated
+> what this microbench had measured. Three things to call out before
+> reading the rest:
+> 1. **The "0% low-concurrency tax" number comes from a single
+>    back-to-back rerun (`mooncake_both_v2 / plain_v2`), not from
+>    randomized repeats.** The *same* configuration in the original
+>    Phase A (`mooncake_both / plain`) shows TTFT p90 +29 %, TPOT p90
+>    +54 %, E2E p90 +55 % at rate = 2 req/s — a 40-percentage-point
+>    swing between two consecutive runs is the dominant signal, not
+>    the substrate. See "Run-to-run variance" below.
+> 2. **`get_finished()` was never instrumented.** The patch in
+>    `patches/apply_step_timing.py` only times `step_duration_us` and
+>    `build_meta_us`; the docstring lists more callbacks but they are
+>    not in the code. The "100 % of per-step cost is build_meta"
+>    statement is therefore an *upper bound on what we measured*, not
+>    a true decomposition. `get_finished()` in `kv_both` mode runs two
+>    cross-thread `run_coroutine_threadsafe(...).result()` blocking
+>    waits every step (`mooncake_connector.py:1107-1137`) and is a
+>    plausible second contributor.
+> 3. **H5 (cache-size dependence) is untested.** The hypothesis that
+>    `set(self._block_pool.cache.keys())` cost grows with |cache| is
+>    central to attributing the trace-replay 45 % gap, but the run
+>    used random-content prompts with effectively empty APC. The
+>    cache-size sweep in `cache_sweep/` is what actually tests this.
+>
+> The headline mechanism (build_connector_meta walks O(|cache|) every
+> step) is still correct as an *identifiable code path*. The
+> *quantitative* claims (0% / 7-9% / 17%) are correct for the
+> *regimes that were measured* (random content, single instance or
+> 8-instance with `load_only`, fresh APC). Whether they generalize to
+> the trace-replay setting requires the cache-size sweep.
+
 ## Executive Summary

 The `build_connector_meta()` in MooncakeConnector adds **1.4ms per scheduler
-step** (measured via engine_step.jsonl instrumentation). However, this overhead
-only manifests as user-visible latency degradation under **high decode
-concurrency** (8+ concurrent requests with short forward steps). Under low
-concurrency, vLLM's scheduler-model async pipeline completely hides the cost.
+step** (measured via engine_step.jsonl instrumentation) on a *cold* APC.
+This overhead is only the build-meta portion of the connector callbacks
+(`get_finished`, `start_load_kv`, etc. were not instrumented). Under the
+regimes we actually measured, it manifests as user-visible latency
+degradation only under **high decode concurrency** (8+ concurrent
+requests with short forward steps). Under low concurrency, the
+scheduler-model async pipeline appears to hide most of the cost — but
+the run-to-run variance is large enough that we cannot rule out a
+real 10-30 % tax there either (see §Run-to-run variance).

-| Regime | Substrate Tax (TTFT p90) | Mechanism |
-|--------|--------------------------|-----------|
-| Low concurrency (0.5-2 req/s, 4k input) | **~0%** (undetectable) | Scheduler runs during model forward; 1.4ms << forward step time |
-| High concurrency (8-16 req/s, 512 input) | **+7-9%** | Multiple short decode steps; scheduler per-step cost becomes visible |
-| 8-instance trace-replay (elastic_migration_v2) | **+45%** | High concurrency + multi-instance coupling amplification |
+| Regime | TTFT-p90 tax (mooncake_both vs plain) | Caveat |
+|--------|---------------------------------------|--------|
+| Low conc, 4096×256, rate≤2 (v1 run) | +12 % (r=1) / +29 % (r=2) | First-shot data; APC near-empty |
+| Low conc, 4096×256, rate≤2 (v2 rerun) | −12 % (r=1) / −10 % (r=2) | Back-to-back rerun; sign flips |
+| High conc, 512×64, rate=8-16 (single instance) | **+7-9 %** | Most reproducible; n≥395 per cell |
+| 8-inst load_only, 512×64, rate=128 (saturated) | **+17 %** | Throughput dropped to 0.70 |
+| 8-inst agentic trace-replay (elastic_migration_v2) | **+45 %** | APC ≈ 79 %, session-coupled — *not yet reproduced* |

 ---

@@ -33,6 +72,28 @@ The vLLM v1 framework dispatch itself (noop_connector) adds only +16μs.

 ---

+## Run-to-run variance (4096 × 256)
+
+We have two back-to-back pairs of runs at the same shape, same rates,
+same seed methodology. They disagree by 40 percentage points:
+
+| rate | metric | v1 (plain → mooncake_both) | v2 (plain_v2 → mooncake_both_v2) |
+|---|---|---|---|
+| 0.5 | TTFT p90 tax | −8 % | −12 % |
+| 1.0 | TTFT p90 tax | **+12 %** | **−12 %** |
+| 2.0 | TTFT p90 tax | **+29 %** | **−10 %** |
+| 2.0 | TPOT p90 tax | **+54 %** | **−23 %** |
+| 2.0 | E2E  p90 tax | **+55 %** | **−23 %** |
+
+Both v1 and v2 used 200 completed-request floors; v1 ran configs
+serially with full GPU release between, v2 ran the two configs
+back-to-back without restart. Neither has CI bars. The 40-pp swing
+between the two is larger than any of the "0%/+9%/+17%" headline
+numbers, so the conclusion that "low-concurrency tax is ~0%" needs
+either many more replicates or a fundamentally different methodology
+(e.g. controlled |cache|; see `cache_sweep/`). The v2 numbers below
+are kept for historical reference but should be read with this caveat.
+
 ## Low-Concurrency Results (4096 input, 256 output)

 Back-to-back fresh runs (mooncake_both_v2 first, plain_v2 second):
@@ -104,37 +165,58 @@ SLO-compliant.

 ---

-## Reconciliation with Trace-Replay (+45%)
+## Reconciliation with Trace-Replay (+45%) — what we *do* and *don't* know

 The trace-replay claim (elastic_migration_v2 §Result 1) measured
-TTFT p90 +45% with 8 instances, saturated agentic coupling.
+TTFT p90 +45% with 8 instances, saturated agentic coupling, APC≈79%.

-Our microbench decomposes the +45%:
+What this microbench established:

-| Factor | Contribution | Evidence |
-|--------|-------------|----------|
-| `build_connector_meta` per-step cost | **+7-9%** | High-concurrency single-instance test |
-| Large cache amplifies O(\|cache\|) walk | likely 2-3× | Per-step grows with cache size (not yet measured) |
-| Multi-instance coupling amplification | remaining ~20-30% | 8-instance scheduling feedback cascades |
+| Factor | Status | Evidence |
+|--------|--------|----------|
+| `build_connector_meta` adds ~1.4 ms/step on a *near-empty* APC | measured | `engine_step.jsonl`, mooncake_both vs plain |
+| Tax surfaces at high decode concurrency (single instance, 512×64) | +7-9 % | rate=8/16 cells, n≥395 per cell |
+| 8-instance load_only at saturation | +17 % | 8inst_mooncake @ rate=128, thr_p=0.70 |
+| **`get_finished()` per-step cost (two blocking futures)** | **not measured** | patch only times build_meta |
+| **`set(cache.keys())` cost scaling with \|cache\|** | **not measured** | random content → APC ≈ empty in all cells |
+| **Agentic session structure (high reuse + tight cache pressure)** | **not measured** | synthetic open-loop has no sessions |
+| Multi-instance scheduler coupling beyond load_only | not measured | only `load_only` proxy tested |
+
+The honest reconciliation is: the +7-9 % single-instance and +17 %
+8-instance saturated tax are real and small; the gap to +45 % is
+hypothesised to come from (a) the O(|cache|) walk at APC≈79 %,
+(b) the un-instrumented `get_finished()` cost, and (c)
+agentic-coupling effects we have not yet replicated synthetically.
+The `cache_sweep/` experiment tests (a) directly.

 ---

-## Conclusions
+## Conclusions (revised)

-1. **`build_connector_meta` is the tax source**: 1.4ms/step, 100%
-   from Mooncake's `set(cache.keys())` walk. vLLM framework itself
-   costs only 16μs/step.
+1. **`build_connector_meta` is *a* tax source**: ≈1.4 ms/step on a
+   near-empty APC. Whether it is *the* source depends on the
+   un-measured `get_finished()` cost. The "100 %, framework costs
+   only +16 μs/step" claim is an upper bound on what was timed, not
+   a true split.

-2. **Tax is concurrency-dependent**: zero at low concurrency (scheduler
-   hidden behind forward), +7-9% at high concurrency (scheduler on
-   critical path).
+2. **Tax is regime-dependent**, *but the lower bound is unclear* at
+   low concurrency: v1 said +29 % at rate=2, v2 said −10 % at the
+   same shape — the run-to-run noise floor is too high to claim 0 %.
+   High-concurrency (+7-9 %) and 8-instance-saturated (+17 %) are
+   more reproducible.

-3. **Trace-replay's +45% includes coupling amplification**: single-instance
-   accounts for 7-9%; the rest is multi-instance cascade.
+3. **Trace-replay's +45 % is plausible but not yet decomposed.** We
+   have not yet exercised the regime that drives it (APC≈79 % cache,
+   agentic session structure). `cache_sweep/` adds (a). (b) and (c)
+   are open.

-4. **Fixable**: Replace O(|cache|) per-step walk with incremental delta
-   tracking → eliminates the 1.4ms/step entirely.
+4. **Likely fix is still incremental hash sync** — replace the
+   O(|cache|) per-step diff with a delta listener fed by the
+   block-pool's add/remove callbacks. Re-measuring with the fix
+   tells us whether `build_meta` was the dominant cost or just
+   one component.

-5. **SLO impact at production rates**: At rate=16 req/s, tax adds 12ms
-   to TTFT p90 (156ms vs 144ms) and 2ms to TPOT p90 (29.7 vs 27.8ms).
-   Well within typical SLO budgets.
+5. **Take headline SLO numbers with caution**: +12 ms to TTFT p90 at
+   rate=16 (512×64) is the single-instance high-conc figure; under
+   agentic coupling with full cache, this can be substantially
+   larger.