Connector tax RESULTS.md: errata + run-to-run variance disclosure

The prior write-up presented one specific reading of the data as
the headline without flagging methodology gaps. Three corrections:

1. The "0% low-concurrency tax" comes from a single back-to-back
   mooncake_both_v2/plain_v2 rerun. The original Phase A pair
   showed TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2
   — a 40 percentage-point swing between two consecutive runs
   that the original write-up did not call out. The run-to-run
   noise floor is too high to claim "0%" at low concurrency.

2. get_finished() was never instrumented. The patch only times
   step_duration_us and build_meta_us. "100% of per-step cost is
   build_meta" is an upper bound on what was timed, not a true
   decomposition.

3. H5 (cache-size dependence) was the central hypothesis but
   was never tested in the prior run; random content kept APC
   near empty.

The +7-9% high-concurrency (single instance, 512x64, rate=8-16)
and +17% 8-instance-saturated numbers are kept; they were
measured with adequate sample sizes and are reproducible.

The follow-up sweep in cache_sweep/ tests H5 directly and
revises the decomposition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
2026-05-26 23:33:01 +08:00
parent e3480f7d28
commit 54de78eb11

View File

@@ -1,18 +1,57 @@
# Microbench 3: Connector Substrate Tax — Results
> **2026-05-26 ERRATA (post-review)**: The original write-up overstated
> what this microbench had measured. Three things to call out before
> reading the rest:
> 1. **The "0% low-concurrency tax" number comes from a single
> back-to-back rerun (`mooncake_both_v2 / plain_v2`), not from
> randomized repeats.** The *same* configuration in the original
> Phase A (`mooncake_both / plain`) shows TTFT p90 +29 %, TPOT p90
> +54 %, E2E p90 +55 % at rate = 2 req/s — a 40-percentage-point
> swing between two consecutive runs is the dominant signal, not
> the substrate. See "Run-to-run variance" below.
> 2. **`get_finished()` was never instrumented.** The patch in
> `patches/apply_step_timing.py` only times `step_duration_us` and
> `build_meta_us`; the docstring lists more callbacks but they are
> not in the code. The "100 % of per-step cost is build_meta"
> statement is therefore an *upper bound on what we measured*, not
> a true decomposition. `get_finished()` in `kv_both` mode runs two
> cross-thread `run_coroutine_threadsafe(...).result()` blocking
> waits every step (`mooncake_connector.py:1107-1137`) and is a
> plausible second contributor.
> 3. **H5 (cache-size dependence) is untested.** The hypothesis that
> `set(self._block_pool.cache.keys())` cost grows with |cache| is
> central to attributing the trace-replay 45 % gap, but the run
> used random-content prompts with effectively empty APC. The
> cache-size sweep in `cache_sweep/` is what actually tests this.
>
> The headline mechanism (build_connector_meta walks O(|cache|) every
> step) is still correct as an *identifiable code path*. The
> *quantitative* claims (0% / 7-9% / 17%) are correct for the
> *regimes that were measured* (random content, single instance or
> 8-instance with `load_only`, fresh APC). Whether they generalize to
> the trace-replay setting requires the cache-size sweep.
## Executive Summary
The `build_connector_meta()` in MooncakeConnector adds **1.4ms per scheduler
step** (measured via engine_step.jsonl instrumentation). However, this overhead
only manifests as user-visible latency degradation under **high decode
concurrency** (8+ concurrent requests with short forward steps). Under low
concurrency, vLLM's scheduler-model async pipeline completely hides the cost.
step** (measured via engine_step.jsonl instrumentation) on a *cold* APC.
This overhead is only the build-meta portion of the connector callbacks
(`get_finished`, `start_load_kv`, etc. were not instrumented). Under the
regimes we actually measured, it manifests as user-visible latency
degradation only under **high decode concurrency** (8+ concurrent
requests with short forward steps). Under low concurrency, the
scheduler-model async pipeline appears to hide most of the cost — but
the run-to-run variance is large enough that we cannot rule out a
real 10-30 % tax there either (see §Run-to-run variance).
| Regime | Substrate Tax (TTFT p90) | Mechanism |
|--------|--------------------------|-----------|
| Low concurrency (0.5-2 req/s, 4k input) | **~0%** (undetectable) | Scheduler runs during model forward; 1.4ms << forward step time |
| High concurrency (8-16 req/s, 512 input) | **+7-9%** | Multiple short decode steps; scheduler per-step cost becomes visible |
| 8-instance trace-replay (elastic_migration_v2) | **+45%** | High concurrency + multi-instance coupling amplification |
| Regime | TTFT-p90 tax (mooncake_both vs plain) | Caveat |
|--------|---------------------------------------|--------|
| Low conc, 4096×256, rate≤2 (v1 run) | +12 % (r=1) / +29 % (r=2) | First-shot data; APC near-empty |
| Low conc, 4096×256, rate≤2 (v2 rerun) | 12 % (r=1) / 10 % (r=2) | Back-to-back rerun; sign flips |
| High conc, 512×64, rate=8-16 (single instance) | **+7-9 %** | Most reproducible; n≥395 per cell |
| 8-inst load_only, 512×64, rate=128 (saturated) | **+17 %** | Throughput dropped to 0.70 |
| 8-inst agentic trace-replay (elastic_migration_v2) | **+45 %** | APC ≈ 79 %, session-coupled — *not yet reproduced* |
---
@@ -33,6 +72,28 @@ The vLLM v1 framework dispatch itself (noop_connector) adds only +16μs.
---
## Run-to-run variance (4096 × 256)
We have two back-to-back pairs of runs at the same shape, same rates,
same seed methodology. They disagree by 40 percentage points:
| rate | metric | v1 (plain → mooncake_both) | v2 (plain_v2 → mooncake_both_v2) |
|---|---|---|---|
| 0.5 | TTFT p90 tax | 8 % | 12 % |
| 1.0 | TTFT p90 tax | **+12 %** | **12 %** |
| 2.0 | TTFT p90 tax | **+29 %** | **10 %** |
| 2.0 | TPOT p90 tax | **+54 %** | **23 %** |
| 2.0 | E2E p90 tax | **+55 %** | **23 %** |
Both v1 and v2 used 200 completed-request floors; v1 ran configs
serially with full GPU release between, v2 ran the two configs
back-to-back without restart. Neither has CI bars. The 40-pp swing
between the two is larger than any of the "0%/+9%/+17%" headline
numbers, so the conclusion that "low-concurrency tax is ~0%" needs
either many more replicates or a fundamentally different methodology
(e.g. controlled |cache|; see `cache_sweep/`). The v2 numbers below
are kept for historical reference but should be read with this caveat.
## Low-Concurrency Results (4096 input, 256 output)
Back-to-back fresh runs (mooncake_both_v2 first, plain_v2 second):
@@ -104,37 +165,58 @@ SLO-compliant.
---
## Reconciliation with Trace-Replay (+45%)
## Reconciliation with Trace-Replay (+45%) — what we *do* and *don't* know
The trace-replay claim (elastic_migration_v2 §Result 1) measured
TTFT p90 +45% with 8 instances, saturated agentic coupling.
TTFT p90 +45% with 8 instances, saturated agentic coupling, APC79%.
Our microbench decomposes the +45%:
What this microbench established:
| Factor | Contribution | Evidence |
|--------|-------------|----------|
| `build_connector_meta` per-step cost | **+7-9%** | High-concurrency single-instance test |
| Large cache amplifies O(\|cache\|) walk | likely 2-3× | Per-step grows with cache size (not yet measured) |
| Multi-instance coupling amplification | remaining ~20-30% | 8-instance scheduling feedback cascades |
| Factor | Status | Evidence |
|--------|--------|----------|
| `build_connector_meta` adds ~1.4 ms/step on a *near-empty* APC | measured | `engine_step.jsonl`, mooncake_both vs plain |
| Tax surfaces at high decode concurrency (single instance, 512×64) | +7-9 % | rate=8/16 cells, n395 per cell |
| 8-instance load_only at saturation | +17 % | 8inst_mooncake @ rate=128, thr_p=0.70 |
| **`get_finished()` per-step cost (two blocking futures)** | **not measured** | patch only times build_meta |
| **`set(cache.keys())` cost scaling with \|cache\|** | **not measured** | random content APC empty in all cells |
| **Agentic session structure (high reuse + tight cache pressure)** | **not measured** | synthetic open-loop has no sessions |
| Multi-instance scheduler coupling beyond load_only | not measured | only `load_only` proxy tested |
The honest reconciliation is: the +7-9 % single-instance and +17 %
8-instance saturated tax are real and small; the gap to +45 % is
hypothesised to come from (a) the O(|cache|) walk at APC79 %,
(b) the un-instrumented `get_finished()` cost, and (c)
agentic-coupling effects we have not yet replicated synthetically.
The `cache_sweep/` experiment tests (a) directly.
---
## Conclusions
## Conclusions (revised)
1. **`build_connector_meta` is the tax source**: 1.4ms/step, 100%
from Mooncake's `set(cache.keys())` walk. vLLM framework itself
costs only 16μs/step.
1. **`build_connector_meta` is *a* tax source**: 1.4 ms/step on a
near-empty APC. Whether it is *the* source depends on the
un-measured `get_finished()` cost. The "100 %, framework costs
only +16 μs/step" claim is an upper bound on what was timed, not
a true split.
2. **Tax is concurrency-dependent**: zero at low concurrency (scheduler
hidden behind forward), +7-9% at high concurrency (scheduler on
critical path).
2. **Tax is regime-dependent**, *but the lower bound is unclear* at
low concurrency: v1 said +29 % at rate=2, v2 said 10 % at the
same shape the run-to-run noise floor is too high to claim 0 %.
High-concurrency (+7-9 %) and 8-instance-saturated (+17 %) are
more reproducible.
3. **Trace-replay's +45% includes coupling amplification**: single-instance
accounts for 7-9%; the rest is multi-instance cascade.
3. **Trace-replay's +45 % is plausible but not yet decomposed.** We
have not yet exercised the regime that drives it (APC79 % cache,
agentic session structure). `cache_sweep/` adds (a). (b) and (c)
are open.
4. **Fixable**: Replace O(|cache|) per-step walk with incremental delta
tracking eliminates the 1.4ms/step entirely.
4. **Likely fix is still incremental hash sync** replace the
O(|cache|) per-step diff with a delta listener fed by the
block-pool's add/remove callbacks. Re-measuring with the fix
tells us whether `build_meta` was the dominant cost or just
one component.
5. **SLO impact at production rates**: At rate=16 req/s, tax adds 12ms
to TTFT p90 (156ms vs 144ms) and 2ms to TPOT p90 (29.7 vs 27.8ms).
Well within typical SLO budgets.
5. **Take headline SLO numbers with caution**: +12 ms to TTFT p90 at
rate=16 (512×64) is the single-instance high-conc figure; under
agentic coupling with full cache, this can be substantially
larger.