Connector tax RESULTS.md: errata + run-to-run variance disclosure
The prior write-up presented one specific reading of the data as the
headline without flagging methodology gaps. Three corrections:

1. The "0% low-concurrency tax" comes from a single back-to-back
   mooncake_both_v2/plain_v2 rerun. The original Phase A pair showed
   TTFT p90 +29%, TPOT p90 +54%, E2E p90 +55% at rate=2, a 40
   percentage-point swing between two consecutive runs that the
   original write-up did not call out. The run-to-run noise floor is
   too high to claim "0%" at low concurrency.

2. get_finished() was never instrumented. The patch only times
   step_duration_us and build_meta_us. "100% of per-step cost is
   build_meta" is an upper bound on what was timed, not a true
   decomposition.

3. H5 (cache-size dependence) was the central hypothesis but was never
   tested in the prior run; random content kept APC near empty.

The +7-9% high-concurrency (single instance, 512x64, rate=8-16) and +17%
8-instance-saturated numbers are kept; they were measured with adequate
sample sizes and are reproducible. The follow-up sweep in cache_sweep/
tests H5 directly and revises the decomposition.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -1,18 +1,57 @@
 # Microbench 3: Connector Substrate Tax - Results

+> **2026-05-26 ERRATA (post-review)**: The original write-up overstated
+> what this microbench had measured. Three things to call out before
+> reading the rest:
+> 1. **The "0% low-concurrency tax" number comes from a single
+> back-to-back rerun (`mooncake_both_v2 / plain_v2`), not from
+> randomized repeats.** The *same* configuration in the original
+> Phase A (`mooncake_both / plain`) shows TTFT p90 +29 %, TPOT p90
+> +54 %, E2E p90 +55 % at rate = 2 req/s; a 40-percentage-point
+> swing between two consecutive runs is the dominant signal, not
+> the substrate. See "Run-to-run variance" below.
+> 2. **`get_finished()` was never instrumented.** The patch in
+> `patches/apply_step_timing.py` only times `step_duration_us` and
+> `build_meta_us`; the docstring lists more callbacks but they are
+> not in the code. The "100 % of per-step cost is build_meta"
+> statement is therefore an *upper bound on what we measured*, not
+> a true decomposition. `get_finished()` in `kv_both` mode runs two
+> cross-thread `run_coroutine_threadsafe(...).result()` blocking
+> waits every step (`mooncake_connector.py:1107-1137`) and is a
+> plausible second contributor.
+> 3. **H5 (cache-size dependence) is untested.** The hypothesis that
+> `set(self._block_pool.cache.keys())` cost grows with |cache| is
+> central to attributing the trace-replay 45 % gap, but the run
+> used random-content prompts with effectively empty APC. The
+> cache-size sweep in `cache_sweep/` is what actually tests this.
+>
+> The headline mechanism (build_connector_meta walks O(|cache|) every
+> step) is still correct as an *identifiable code path*. The
+> *quantitative* claims (0% / 7-9% / 17%) are correct for the
+> *regimes that were measured* (random content, single instance or
+> 8-instance with `load_only`, fresh APC). Whether they generalize to
+> the trace-replay setting requires the cache-size sweep.
+
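The second errata item concerns per-step time spent blocked on cross-thread futures. As a rough illustration of how such a wait could be timed, here is a minimal sketch using only the standard library. It is not the connector's code: `fake_poll`, `timed_blocking_wait`, and the `get_finished_us` field name are hypothetical stand-ins, and the 1 ms sleep is an arbitrary placeholder for real work.

```python
import asyncio
import threading
import time


def start_background_loop() -> asyncio.AbstractEventLoop:
    """Run an event loop on a daemon thread (stand-in for a connector I/O thread)."""
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


async def fake_poll(delay_s: float) -> int:
    # Hypothetical placeholder for the coroutine the caller would wait on.
    await asyncio.sleep(delay_s)
    return 0


def timed_blocking_wait(loop: asyncio.AbstractEventLoop, delay_s: float) -> int:
    """Return the microseconds the calling thread spent blocked on one future."""
    start = time.perf_counter_ns()
    asyncio.run_coroutine_threadsafe(fake_poll(delay_s), loop).result()
    return (time.perf_counter_ns() - start) // 1_000


loop = start_background_loop()
# Two blocking waits per "step", mirroring the pattern described in the errata.
get_finished_us = sum(timed_blocking_wait(loop, 0.001) for _ in range(2))
record = {"get_finished_us": get_finished_us}  # hypothetical field name
loop.call_soon_threadsafe(loop.stop)
print(record)
```

Logging a field like this next to `build_meta_us` would let the two costs be compared per step instead of attributing everything that was timed to build-meta.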
 ## Executive Summary

 The `build_connector_meta()` in MooncakeConnector adds **1.4ms per scheduler
-step** (measured via engine_step.jsonl instrumentation). However, this overhead
-only manifests as user-visible latency degradation under **high decode
-concurrency** (8+ concurrent requests with short forward steps). Under low
-concurrency, vLLM's scheduler-model async pipeline completely hides the cost.
+step** (measured via engine_step.jsonl instrumentation) on a *cold* APC.
+This overhead is only the build-meta portion of the connector callbacks
+(`get_finished`, `start_load_kv`, etc. were not instrumented). Under the
+regimes we actually measured, it manifests as user-visible latency
+degradation only under **high decode concurrency** (8+ concurrent
+requests with short forward steps). Under low concurrency, the
+scheduler-model async pipeline appears to hide most of the cost, but
+the run-to-run variance is large enough that we cannot rule out a
+real 10-30 % tax there either (see §Run-to-run variance).

-| Regime | Substrate Tax (TTFT p90) | Mechanism |
-|--------|--------------------------|-----------|
-| Low concurrency (0.5-2 req/s, 4k input) | **~0%** (undetectable) | Scheduler runs during model forward; 1.4ms << forward step time |
-| High concurrency (8-16 req/s, 512 input) | **+7-9%** | Multiple short decode steps; scheduler per-step cost becomes visible |
-| 8-instance trace-replay (elastic_migration_v2) | **+45%** | High concurrency + multi-instance coupling amplification |
+| Regime | TTFT-p90 tax (mooncake_both vs plain) | Caveat |
+|--------|---------------------------------------|--------|
+| Low conc, 4096×256, rate≤2 (v1 run) | +12 % (r=1) / +29 % (r=2) | First-shot data; APC near-empty |
+| Low conc, 4096×256, rate≤2 (v2 rerun) | −12 % (r=1) / −10 % (r=2) | Back-to-back rerun; sign flips |
+| High conc, 512×64, rate=8-16 (single instance) | **+7-9 %** | Most reproducible; n≥395 per cell |
+| 8-inst load_only, 512×64, rate=128 (saturated) | **+17 %** | Throughput dropped to 0.70 |
+| 8-inst agentic trace-replay (elastic_migration_v2) | **+45 %** | APC ≈ 79 %, session-coupled; *not yet reproduced* |

 ---

@@ -33,6 +72,28 @@ The vLLM v1 framework dispatch itself (noop_connector) adds only +16μs.
 ---

+## Run-to-run variance (4096 × 256)
+
+We have two back-to-back pairs of runs at the same shape, same rates,
+same seed methodology. They disagree by about 40 percentage points on
+TTFT p90 at rate = 2, and by more on TPOT and E2E p90:
+
+| rate | metric | v1 (plain → mooncake_both) | v2 (plain_v2 → mooncake_both_v2) |
+|---|---|---|---|
+| 0.5 | TTFT p90 tax | −8 % | −12 % |
+| 1.0 | TTFT p90 tax | **+12 %** | **−12 %** |
+| 2.0 | TTFT p90 tax | **+29 %** | **−10 %** |
+| 2.0 | TPOT p90 tax | **+54 %** | **−23 %** |
+| 2.0 | E2E p90 tax | **+55 %** | **−23 %** |
+
+Both v1 and v2 used 200 completed-request floors; v1 ran configs
+serially with full GPU release between, v2 ran the two configs
+back-to-back without restart. Neither has CI bars. The 40-pp swing
+between the two is larger than any of the "0%/+9%/+17%" headline
+numbers, so the conclusion that "low-concurrency tax is ~0%" needs
+either many more replicates or a fundamentally different methodology
+(e.g. controlled |cache|; see `cache_sweep/`). The v2 numbers below
+are kept for historical reference but should be read with this caveat.
+
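To make the variance argument concrete, the sketch below computes the v1-to-v2 swing in percentage points from the rounded table values, and shows one way a bootstrap interval on a p90 tax could be attached to a single pair of runs. The latency samples are synthetic and generated only for illustration; the helper names (`p90`, `tax_pct`, `bootstrap_tax_ci`) are assumptions, not part of the benchmark harness. A bootstrap of this kind captures within-run sampling noise only, so it complements rather than replaces repeated runs.

```python
import random


def p90(samples):
    """Nearest-rank style p90 on a list of latencies."""
    ordered = sorted(samples)
    return ordered[int(0.9 * (len(ordered) - 1))]


def tax_pct(baseline, treated):
    """Relative change of p90(treated) over p90(baseline), in percent."""
    return 100.0 * (p90(treated) / p90(baseline) - 1.0)


def bootstrap_tax_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the p90 tax of one run pair."""
    rng = random.Random(seed)
    taxes = sorted(
        tax_pct(rng.choices(baseline, k=len(baseline)),
                rng.choices(treated, k=len(treated)))
        for _ in range(n_boot)
    )
    return taxes[int(alpha / 2 * n_boot)], taxes[int((1 - alpha / 2) * n_boot) - 1]


# Swing between v1 and v2 at rate = 2, taken from the rounded table values.
swing_pp = {
    "TTFT p90": 29 - (-10),
    "TPOT p90": 54 - (-23),
    "E2E p90": 55 - (-23),
}

# Synthetic latencies (NOT measured data): both configs drawn from the same
# distribution, 200 requests each, to show how wide the interval can be.
rng = random.Random(1)
plain = [rng.lognormvariate(0.0, 0.5) for _ in range(200)]
mooncake = [rng.lognormvariate(0.0, 0.5) for _ in range(200)]
low, high = bootstrap_tax_ci(plain, mooncake)
print(swing_pp, f"tax CI: [{low:.1f} %, {high:.1f} %]")
```

If the interval from a single 200-request pair already spans tens of percent, a point estimate such as "0 %" or "+29 %" carries little information on its own.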
 ## Low-Concurrency Results (4096 input, 256 output)

 Back-to-back fresh runs (mooncake_both_v2 first, plain_v2 second):
@@ -104,37 +165,58 @@ SLO-compliant.

 ---

-## Reconciliation with Trace-Replay (+45%)
+## Reconciliation with Trace-Replay (+45%): what we *do* and *don't* know

 The trace-replay claim (elastic_migration_v2 §Result 1) measured
-TTFT p90 +45% with 8 instances, saturated agentic coupling.
+TTFT p90 +45% with 8 instances, saturated agentic coupling, APC≈79%.

-Our microbench decomposes the +45%:
+What this microbench established:

-| Factor | Contribution | Evidence |
-|--------|-------------|----------|
-| `build_connector_meta` per-step cost | **+7-9%** | High-concurrency single-instance test |
-| Large cache amplifies O(\|cache\|) walk | likely 2-3× | Per-step grows with cache size (not yet measured) |
-| Multi-instance coupling amplification | remaining ~20-30% | 8-instance scheduling feedback cascades |
+| Factor | Status | Evidence |
+|--------|--------|----------|
+| `build_connector_meta` adds ~1.4 ms/step on a *near-empty* APC | measured | `engine_step.jsonl`, mooncake_both vs plain |
+| Tax surfaces at high decode concurrency (single instance, 512×64) | +7-9 % | rate=8/16 cells, n≥395 per cell |
+| 8-instance load_only at saturation | +17 % | 8inst_mooncake @ rate=128, thr_p=0.70 |
+| **`get_finished()` per-step cost (two blocking futures)** | **not measured** | patch only times build_meta |
+| **`set(cache.keys())` cost scaling with \|cache\|** | **not measured** | random content → APC ≈ empty in all cells |
+| **Agentic session structure (high reuse + tight cache pressure)** | **not measured** | synthetic open-loop has no sessions |
+| Multi-instance scheduler coupling beyond load_only | not measured | only `load_only` proxy tested |

+The honest reconciliation is: the +7-9 % single-instance and +17 %
+8-instance saturated tax are real and small; the gap to +45 % is
+hypothesised to come from (a) the O(|cache|) walk at APC≈79 %,
+(b) the un-instrumented `get_finished()` cost, and (c)
+agentic-coupling effects we have not yet replicated synthetically.
+The `cache_sweep/` experiment tests (a) directly.
+
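Hypothesis (a) says the cost of snapshotting the cache keys grows with the number of cached blocks. The following is a minimal sketch of how that scaling could be checked in isolation; it is not the script in `cache_sweep/`, and a plain `dict` keyed by integers stands in for the block-pool cache, which is an assumption about its shape.

```python
import time


def time_keys_snapshot(n_entries: int, repeats: int = 5) -> float:
    """Best-of-N wall time, in microseconds, to build set(cache.keys())."""
    # A plain dict of ints stands in for the block-pool cache (assumption).
    cache = {block_hash: None for block_hash in range(n_entries)}
    best_ns = None
    for _ in range(repeats):
        start = time.perf_counter_ns()
        snapshot = set(cache.keys())
        elapsed = time.perf_counter_ns() - start
        assert len(snapshot) == n_entries
        best_ns = elapsed if best_ns is None else min(best_ns, elapsed)
    return best_ns / 1_000


sweep = [(n, time_keys_snapshot(n)) for n in (1_000, 10_000, 100_000)]
for n_entries, micros in sweep:
    print(f"|cache|={n_entries:>7}  set(keys) ~ {micros:.1f} us")
```

Absolute numbers from a toy dict will not match the engine, but the growth with size is the property the hypothesis depends on, and the same timing wrapper could be applied around the real call.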
 ---

-## Conclusions
+## Conclusions (revised)

-1. **`build_connector_meta` is the tax source**: 1.4ms/step, 100%
-from Mooncake's `set(cache.keys())` walk. vLLM framework itself
-costs only 16μs/step.
+1. **`build_connector_meta` is *a* tax source**: ≈1.4 ms/step on a
+near-empty APC. Whether it is *the* source depends on the
+un-measured `get_finished()` cost. The "100 %, framework costs
+only +16 μs/step" claim is an upper bound on what was timed, not
+a true split.

-2. **Tax is concurrency-dependent**: zero at low concurrency (scheduler
-hidden behind forward), +7-9% at high concurrency (scheduler on
-critical path).
+2. **Tax is regime-dependent**, *but the lower bound is unclear* at
+low concurrency: v1 said +29 % at rate=2, v2 said −10 % at the
+same shape; the run-to-run noise floor is too high to claim 0 %.
+High-concurrency (+7-9 %) and 8-instance-saturated (+17 %) are
+more reproducible.

-3. **Trace-replay's +45% includes coupling amplification**: single-instance
-accounts for 7-9%; the rest is multi-instance cascade.
+3. **Trace-replay's +45 % is plausible but not yet decomposed.** We
+have not yet exercised the regime that drives it (APC≈79 % cache,
+agentic session structure). `cache_sweep/` adds (a). (b) and (c)
+are open.

-4. **Fixable**: Replace O(|cache|) per-step walk with incremental delta
-tracking → eliminates the 1.4ms/step entirely.
+4. **Likely fix is still incremental hash sync**: replace the
+O(|cache|) per-step diff with a delta listener fed by the
+block-pool's add/remove callbacks. Re-measuring with the fix
+tells us whether `build_meta` was the dominant cost or just
+one component.

-5. **SLO impact at production rates**: At rate=16 req/s, tax adds 12ms
-to TTFT p90 (156ms vs 144ms) and 2ms to TPOT p90 (29.7 vs 27.8ms).
-Well within typical SLO budgets.
+5. **Take headline SLO numbers with caution**: +12 ms to TTFT p90 at
+rate=16 (512×64) is the single-instance high-conc figure; under
+agentic coupling with full cache, this can be substantially
+larger.
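Conclusion 4 proposes replacing the per-step full walk with a delta listener. The sketch below contrasts the two approaches on a toy cache. `DeltaTracker`, `on_add`, `on_remove`, and `full_diff` are hypothetical names for illustration; the real block pool's callback interface is not shown in this write-up, and the sketch assumes callbacks fire only on actual insertions and evictions.

```python
class DeltaTracker:
    """Accumulates net key changes between drains, in O(changes) per step."""

    def __init__(self):
        self._added = set()
        self._removed = set()

    def on_add(self, key):
        # A re-add of a key removed earlier in the same step cancels out.
        if key in self._removed:
            self._removed.discard(key)
        else:
            self._added.add(key)

    def on_remove(self, key):
        # Removing a key added earlier in the same step cancels out.
        if key in self._added:
            self._added.discard(key)
        else:
            self._removed.add(key)

    def drain(self):
        """Return (added, removed) since the last drain and reset."""
        added, removed = self._added, self._removed
        self._added, self._removed = set(), set()
        return added, removed


def full_diff(previous_keys, cache):
    """Baseline: O(|cache|) snapshot and set difference every step."""
    current = set(cache.keys())
    return current - previous_keys, previous_keys - current


# Toy cache and a step's worth of mutations, mirrored into the tracker.
cache = {block_hash: None for block_hash in range(5)}
previous_keys = set(cache.keys())
tracker = DeltaTracker()


def add(block_hash):
    cache[block_hash] = None
    tracker.on_add(block_hash)


def remove(block_hash):
    del cache[block_hash]
    tracker.on_remove(block_hash)


add(10); remove(1); add(11); remove(11); remove(2); add(2)
walk_result = full_diff(previous_keys, cache)
delta_result = tracker.drain()
print(walk_result, delta_result)
```

Both paths report the same net change, but the tracker's work scales with the number of mutations in the step rather than with the cache size, which is the property the proposed fix relies on.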