Gahow Wang 297fed6e73 Microbench 3 (connector_tax): infrastructure for KV connector substrate tax
Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.

Configurations (8):
  plain, noop_connector, mooncake_{producer,consumer,both},
  nixl_both, lmcache_only, multi_mooncake_lmcache.

Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).

Workload: two-phase sweep
  Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
  Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})

Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.

run_all.sh runs as 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot

Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
2026-05-26 17:27:41 +08:00

# Microbench 3: KV Connector Substrate Tax (revision 2)
## Goal
Validate the headline claim from
`analysis/characterization/elastic_migration_v2/README.md` Result 1:
> Switching the vLLM launch from plain to `kv_role=kv_both` without ever
> triggering PD-sep already costs **TTFT p90 +45%, TPOT p90 +25%,
> hotspot index +19%**.
That claim was measured on 8 instances with a 1214-request real-trace
replay under saturated coupling. We replicate it with **single-instance,
synthetic, open-loop** workload so we can:
1. Disambiguate **vLLM-v1-framework cost** from
**connector-implementation cost** by including a no-op connector.
2. **Validate (or refute) the agentic-coupling amplification** claim:
if single-instance synthetic numbers ≈ 8-instance trace numbers
(38–45%), the coupling is not the main cause. If single-instance is
much smaller, then the 8-instance saturated coupling does most of
the damage.
3. Make the result **reproducible and auditable**: every run dumps full
raw artifacts + manifest entry + a re-run script.
---
## Hypotheses (revised based on elastic_migration_v2 prior)
The headline trace-replay numbers from elastic_migration_v2 are our
**prior**, not an open question:
```
trace replay, 8 instances, agentic dispatch coupling, saturated:
plain TTFT p90 = 7.35 s
NIXL TTFT p90 = ~10.1 s (+38%)
Mooncake_both = 10.67 s (+45%)
```
The microbench validates / refutes / refines these:
| ID | Hypothesis | Falsifier |
|---|---|---|
| **H1: Substrate tax persists at single instance / synthetic load** | Single-instance Mooncake_both TTFT p90 is ≥ 10% higher than plain at the reference rate | If <10%, the trace-replay tax is dominated by 8-instance feedback coupling, not connector machinery |
| **H2: NIXL-vs-Mooncake gap is mechanism-side, not coupling-side** | Single-instance numbers preserve the ~7 pp gap (NIXL tax < Mooncake tax by 5–10 pp) | If the gap shrinks or inverts, it was a coupling artifact |
| **H3: Framework-vs-implementation split** | `noop_connector` (v1 framework only, all hooks return no-op) tax is < 50% of Mooncake_both tax | Lets us attribute cost between vLLM's connector dispatch loop and the specific connector's per-step work |
| **H4: MultiConnector tax is additive** | tax(Mooncake+LMCache) ≈ tax(Mooncake) + tax(LMCache), within 30% | If super-additive: cross-connector interference; if sub-additive: some shared per-step cost is amortized |
| **H5: Tax is shape-dependent** | tax_TTFT_p90 grows monotonically with input length for Mooncake_both | Confirms E2 audit §6.5 hypothesis (`set(cache.keys())` walks scale with cache size) |
| **H6: Tax compounds in decode** | tax_TPOT_p90 grows with output length | Confirms connector code runs each decode step |
H3 and H4 are the must-have new hypotheses that pulled in the new configs.
---
## Hardware & Model
| Parameter | Value |
|---|---|
| GPU | NVIDIA H20 96 GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 |
| `max_model_len` | 200 000 |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 |
| `gpu_memory_utilization` | 0.9 |
Single GPU per run. Each configuration is a fresh vLLM launch on GPU 0.
---
## Configurations (8 total, was 6)
| ID | Connector | Role | Why we measure it |
|---|---|---|---|
| `plain` | (none) | n/a | Baseline |
| `noop_connector` | custom `NoOpConnector` (this microbench ships it) | n/a | Isolate **vLLM-v1 framework** cost (build_connector_meta, mixin dispatch, get_finished bookkeeping) without any real connector work (see Note 1) |
| `mooncake_producer` | MooncakeConnector | `kv_producer` | Isolate P-side stack |
| `mooncake_consumer` | MooncakeConnector | `kv_consumer` | Isolate D-side stack; pre-flight gated, see §Pre-flight |
| `mooncake_both` | MooncakeConnector | `kv_both` | The README claim |
| `nixl_both` | NIXLConnector | `kv_both` | Connector-specific vs framework cost |
| `lmcache_only` | `LMCacheConnectorV1` | n/a | NEW: gives H4 a denominator |
| `multi_mooncake_lmcache` | MultiConnector(Mooncake `kv_both` + `LMCacheConnectorV1`) | mixed | Stacked-connector check (gated by pre-flight) |
**Note 1 — noop_connector (we ship it, not the vLLM-bundled one)**:
The vLLM-shipped `ExampleConnector` is NOT a true no-op: it
implements a debug-grade disk KV cache: stores match metadata,
serializes safetensors per-layer in `save_kv_layer`, etc. (see
`third_party/vllm/.../example_connector.py:345`,
`example_connector.py:250`,
`kv_transfer_utils.py:49`). Using it would conflate framework cost
with disk-I/O + per-layer save cost.
Instead we ship `microbench/connector_tax/tools/noop_connector.py`
that subclasses `KVConnectorBase_V1` and returns no-op for **every**
hook:
```python
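# Import path as in vLLM's v1 connector package; adjust if the pinned version differs.
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1,
    KVConnectorMetadata,
)

class NoOpConnector(KVConnectorBase_V1):
    """Every hook returns immediately, so only framework dispatch cost remains."""
    def get_num_new_matched_tokens(self, req, num_computed): return 0, False
    def update_state_after_alloc(self, *_args, **_kw): pass
    def build_connector_meta(self, scheduler_output): return KVConnectorMetadata()
    def request_finished(self, *_args, **_kw): return False, None
    def start_load_kv(self, *_args, **_kw): pass
    def wait_for_layer_load(self, *_args, **_kw): pass
    def save_kv_layer(self, *_args, **_kw): pass
    def wait_for_save(self): pass
    def get_finished(self, *_args, **_kw): return None, None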
```
vLLM loads it via:
```
--kv-transfer-config '{
  "kv_connector_module_path":
    "microbench.connector_tax.tools.noop_connector:NoOpConnector",
  "kv_role": "kv_both"
}'
```
`PYTHONPATH` is set in `launch_noop_connector.sh` so vLLM can resolve
the dotted import path.
If `noop_connector` overhead is close to 0, all substrate tax is in
connector implementations. If `noop_connector` overhead is around 30% of
Mooncake_both tax or more, vLLM's framework dispatch alone explains a
meaningful slice.
---
## Pre-flight Verification (NEW — gates risky configs)
Two configs depend on infrastructure we can't take for granted. Run
verification scripts BEFORE the main bench. Skip the config (and record
SKIP in manifest) if it fails.
### `verify_kv_consumer.sh`
1. Start a dummy bootstrap process (`tools/dummy_bootstrap.py`).
2. Launch vLLM with `kv_role=kv_consumer` pointing at the dummy.
3. Curl `/v1/models`: must return 200 with the model id.
4. Send one short request (`max_tokens=4`) without `kv_transfer_params`:
must return 200 in <30 s.
If steps 3 or 4 fail, the config is unrunnable and we drop it. We do
not try harder; the trace-replay paper does not promise consumer-only
single-instance numbers.
### `verify_multi_connector.sh`
1. Launch vLLM with `MultiConnector(MooncakeConnector kv_both,
LMCacheConnectorV1)`.
2. Send 5 sequential requests, `max_tokens=32`, random content.
3. All 5 must complete in <60 s.
4. Verify no engine crashes: `vllm:engine_core_failed_total == 0` from
`/metrics`.
If any check fails, drop the config and mark SKIP (Manifest column:
"why skipped").
### `verify_noop_connector.sh`
1. Launch with noop_connector active (loaded via `kv_connector_module_path`).
2. Send 5 sequential requests, `max_tokens=32`.
3. Verify all 5 return 200 in <30 s and no engine crash.
This one is unlikely to fail but the verification is the same.
The verification scripts produce `verify_<config>.log` under
`results/preflight/` and a `preflight_status.json` summarizing
skip-or-include decisions for the manifest.
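The engine-crash check in `verify_multi_connector.sh` reads a counter from `/metrics`. A minimal Python sketch of that parse follows, assuming the endpoint returns Prometheus text exposition without trailing timestamps. `read_counter` and the sample scrape (including its label names and help text) are hypothetical; only the metric names come from this document.

```python
def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of `name` across label sets in a Prometheus text scrape."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        metric, _, value = line.rpartition(" ")
        # Drop an optional {label="..."} suffix to get the bare metric name.
        if metric.split("{", 1)[0] == name:
            total += float(value)
    return total


# Hypothetical scrape; labels and help text are illustrative only.
sample = """# HELP vllm:engine_core_failed_total illustrative help text
# TYPE vllm:engine_core_failed_total counter
vllm:engine_core_failed_total{model_name="m"} 0.0
vllm:num_requests_waiting{model_name="m"} 3.0
"""

engine_ok = read_counter(sample, "vllm:engine_core_failed_total") == 0
print("engine_ok:", engine_ok)
```

A metric that is absent from the scrape sums to 0 here, so a real check would probably also want to confirm that the metric name appears at all.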
---
## Workload (revised)
**Open-loop, fixed-rate, randomized content, two-phase sweep,
data-driven saturation criteria.**
### Phase A — rate sweep (find saturation per config)
| Parameter | Value |
|---|---|
| Input length | 4096 tokens (random per request) |
| Output length | 256 tokens (`max_tokens=256`, `ignore_eos=True`) |
| Send rates | {0.5, 1, 2, 4, 8, 16, 32} req/s (added 0.5 for low-end calibration) |
| Duration per cell | `max(60 s, time_to_min_completed)` + 10 s warmup |
| Min completed per cell | 200 requests |
| Inflight cap | 256 (drop excess to log) |
**Why min_completed = 200**: at p90, the margin-of-error of a Monte
Carlo percentile estimate from N samples is ≈ 1.65 √(0.9 × 0.1 / N).
For N=200 this is ~3.5% absolute, ~10% relative — acceptable. For
N=30 (which 0.5 req/s × 60 s gives) it's ~28% relative, useless for
saturation detection. So at low rates the cell automatically extends:
0.5 req/s → ≥ 400 s, 1 req/s → ≥ 200 s. At 4 req/s and above the
60-second floor dominates.
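The absolute margin quoted above can be reproduced with a few lines of Python. This is a sketch of the formula as written (normal approximation, z = 1.65); `rank_margin` is a hypothetical helper name. It gives the margin in quantile rank, not in seconds, and it does not recompute the relative figures, which are taken from the text as-is.

```python
import math


def rank_margin(p: float, n: int, z: float = 1.65) -> float:
    """Margin of error of an empirical p-quantile's rank from n samples."""
    return z * math.sqrt(p * (1.0 - p) / n)


for n in (30, 200):
    print(f"N={n}: +/- {rank_margin(0.9, n):.4f}")
```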
Updated Phase A duration per config (rounded):
| Rate (req/s) | Duration (s) | Note |
|---|---|---|
| 0.5 | 410 | extended to hit 200 completed |
| 1 | 210 | extended |
| 2 | 110 | extended slightly |
| 4 | 70 | floor |
| 8 | 70 | floor |
| 16 | 70 | floor |
| 32 | 70 | floor |
| **sum per config (excl. 90 s vLLM warmup)** | **~1010 s** | |
Total Phase A: 8 configs × (90 s vLLM warmup + 1010 s of cells +
60 s GPU release) = 8 × 1160 s ≈ **155 min**.
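The duration table and the Phase A total follow from the cell rule. The sketch below assumes an unsaturated cell completes requests at the send rate, so `time_to_min_completed` is `min_completed / rate`; `cell_duration_s` is a hypothetical helper, not the `bench_loop.py` implementation.

```python
import math

RATES = [0.5, 1, 2, 4, 8, 16, 32]


def cell_duration_s(rate, min_completed=200, floor_s=60, warmup_s=10):
    """max(60 s, time to reach min_completed) plus the 10 s per-cell warmup."""
    measure_s = max(floor_s, math.ceil(min_completed / rate))
    return measure_s + warmup_s


cells = {rate: cell_duration_s(rate) for rate in RATES}
cells_total_s = sum(cells.values())
per_config_s = 90 + cells_total_s + 60  # vLLM warmup + cells + GPU release
phase_a_min = 8 * per_config_s / 60

print(cells)
print(cells_total_s, per_config_s, round(phase_a_min))
```

A saturated cell completes fewer requests per second than it is sent, so its real duration can exceed these figures.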
### Saturation criteria (data-driven, was hardcoded inflight>8)
A config is **saturated at rate r** if **any** of:
1. `effective_throughput(r) / r < 0.95` — vLLM can't keep up
2. `num_requests_waiting p50 (from /metrics) > 1` — vLLM has visible queue
3. `TTFT p90 (r) / TTFT p90 (r/2) > 1.5` — TTFT inflating super-linearly
The **per-config saturation rate** is the lowest r that triggers ≥ 1
criterion. We log which criterion fired so reviewers can disagree.
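The three criteria can be sketched in Python as follows. The function names, flag labels, and the per-cell dictionary keys are hypothetical, as are the numbers in the example; criterion 3 is skipped at the lowest rate because no `r/2` cell exists there.

```python
from typing import Optional


def saturation_flags(rate, throughput, waiting_p50, ttft_p90,
                     ttft_p90_half_rate: Optional[float]) -> list:
    """Return the names of the criteria that fire for one cell."""
    flags = []
    if throughput / rate < 0.95:
        flags.append("throughput_ratio")   # criterion 1
    if waiting_p50 > 1:
        flags.append("queue_depth")        # criterion 2
    if ttft_p90_half_rate is not None and ttft_p90 / ttft_p90_half_rate > 1.5:
        flags.append("ttft_growth")        # criterion 3
    return flags


def saturation_rate(cells: dict):
    """Lowest rate at which any criterion fires, with the flags that fired."""
    for rate in sorted(cells):
        cell = cells[rate]
        prev = cells.get(rate / 2)
        flags = saturation_flags(
            rate, cell["throughput"], cell["waiting_p50"], cell["ttft_p90"],
            prev["ttft_p90"] if prev else None,
        )
        if flags:
            return rate, flags
    return None, []


# Illustrative numbers only, not measurements.
example = {
    0.5: {"throughput": 0.5, "waiting_p50": 0, "ttft_p90": 0.30},
    1:   {"throughput": 1.0, "waiting_p50": 0, "ttft_p90": 0.32},
    2:   {"throughput": 2.0, "waiting_p50": 0, "ttft_p90": 0.35},
    4:   {"throughput": 3.9, "waiting_p50": 0, "ttft_p90": 0.60},
    8:   {"throughput": 5.0, "waiting_p50": 4, "ttft_p90": 2.00},
}
print(saturation_rate(example))
```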
### Reference rate selection (revised)
We define **two reference rates** for Phase B, both computed from
Phase A data:
```
ref_safe = max rate where ALL 8 configs are NOT saturated
ref_load = max rate where plain is NOT saturated
(some other configs may be saturated here)
```
`ref_safe` measures the **pure substrate per-step tax** under no
queueing.
`ref_load` measures **the tax in the regime closer to deployment** —
where plain is happily under-loaded but Mooncake is starting to hurt.
The gap `tax(ref_load) - tax(ref_safe)` is the **non-linear queueing
amplification** of the substrate tax. This is exactly the effect the
reviewer worried about and now we measure it explicitly instead of
ignoring it.
Both rates are reported. The headline number we cite is `ref_safe`
because it's the cleanest decomposition. The `ref_load` numbers tell
us how much worse the tax gets near saturation.
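One way to turn per-config saturation rates into the two reference rates is sketched below. It assumes a config counts as saturated at every rate at or above its saturation rate, which is an interpretation of the definition rather than something stated here; `pick_reference_rates` and the example values are hypothetical.

```python
RATES = [0.5, 1, 2, 4, 8, 16, 32]


def pick_reference_rates(rates, sat_rate):
    """sat_rate maps config -> lowest saturated rate, or None if never saturated."""
    def highest_below(limit):
        ok = [r for r in rates if limit is None or r < limit]
        return max(ok) if ok else None

    limits = [s for s in sat_rate.values() if s is not None]
    ref_safe = highest_below(min(limits) if limits else None)
    ref_load = highest_below(sat_rate["plain"])
    return ref_safe, ref_load


# Illustrative saturation rates only.
example = {"plain": 16, "noop_connector": 16, "mooncake_both": 8, "nixl_both": 8}
print(pick_reference_rates(RATES, example))
```

When every config saturates at the same rate the two values coincide and the optional `ref_load` pass adds nothing.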
### Phase B — shape sweep (substrate tax across length regimes)
| Parameter | Value |
|---|---|
| Send rate | `ref_safe` (one value, single rate to keep cost bounded) |
| Input lengths | {512, 4096, 32768} tokens |
| Output lengths | {64, 256, 1024} tokens (32 promoted to 64 — see Note 2) |
| Duration per cell | `max(60 s, time_to_min_completed)` + 10 s warmup |
| Min completed per cell | 200 requests |
| Cartesian shapes | 3 × 3 = 9 |
The same min-completed extension applies. If `ref_safe ≥ 4 req/s`,
each cell hits the 60 s floor and per-config Phase B cell time is
9 × 70 s = 630 s. If `ref_safe = 2 req/s`, cells extend to 110 s and
per-config cell time is 9 × 110 s = 990 s.
Total Phase B (worst case, `ref_safe = 2`): 8 configs × (90 s warmup
+ 990 s of cells + 60 s GPU release) ≈ **152 min**. Best case
(`ref_safe ≥ 4`): 8 × (90 + 630 + 60) ≈ **104 min**.
If after Phase A we find `ref_load` differs meaningfully from
`ref_safe`, we add a small Phase B' run on `ref_load` for the 4
high-priority configs (plain, mooncake_both, nixl_both, lmcache_only)
on 3 representative shapes (512/256, 4096/256, 32768/256). That is
4 configs × 3 shapes × 70 s ≈ 14 min, controlled trade-off.
**Note 2 — output 64 instead of 32**: with 32 output tokens TPOT is
estimated from 31 inter-token intervals — too few samples for stable
p90. Bumping to 64 gives 63 samples, comfortable for percentile
estimation. The output=32 regime is also less common in agentic
deployments where a tool result frame is rarely <64 tokens.
### Common settings
| Parameter | Value |
|---|---|
| `temperature` | 0 (deterministic) |
| `ignore_eos` | True (force exact output length) |
| Content | random UUID + hash per request, zero prefix cache hit |
| Concurrent inflight cap | 256 |
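The open-loop discipline with an inflight cap can be sketched with `asyncio`. This is not the `bench_loop.py` implementation: `open_loop` and `fake_send` are hypothetical, and a real sender would issue a streaming HTTP request instead of sleeping. The point is that send times depend only on the clock, never on completions, and that requests over the cap are counted and dropped rather than queued.

```python
import asyncio


async def open_loop(send, rate, duration_s, inflight_cap=256):
    """Fire send(index, inflight_at_send) at a fixed rate; drop when over the cap."""
    loop = asyncio.get_running_loop()
    start = loop.time()
    inflight = set()
    sent = dropped = 0
    for k in range(round(rate * duration_s)):
        delay = start + k / rate - loop.time()  # schedule from the clock only
        if delay > 0:
            await asyncio.sleep(delay)
        if len(inflight) >= inflight_cap:
            dropped += 1                        # logged, never retried
            continue
        task = asyncio.create_task(send(k, len(inflight)))
        inflight.add(task)
        task.add_done_callback(inflight.discard)
        sent += 1
    await asyncio.gather(*inflight)
    return sent, dropped


async def fake_send(index, inflight_at_send):
    await asyncio.sleep(0.1)  # stand-in for one request's lifetime


sent, dropped = asyncio.run(
    open_loop(fake_send, rate=50, duration_s=0.2, inflight_cap=1)
)
print(sent, dropped)
```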
---
## Metrics (revised — adds A3 step-level engine_state)
### Client-side (per-request, JSONL)
Same as before: `t_send_ns`, `t_first_token_ns`, `t_last_token_ns`,
`prompt_tokens`, `completion_tokens`, `inflight_at_send`.
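A sketch of how these fields could yield TTFT and TPOT. It assumes TPOT is the mean inter-token interval, i.e. decode time divided by `completion_tokens - 1`. The nearest-rank percentile is one common definition and may differ from what `analyze.py` uses; the helper names are hypothetical.

```python
import math


def ttft_s(rec):
    """Time to first token, in seconds."""
    return (rec["t_first_token_ns"] - rec["t_send_ns"]) / 1e9


def tpot_s(rec):
    """Mean time per output token over the inter-token intervals."""
    intervals = rec["completion_tokens"] - 1
    if intervals <= 0:
        return None  # a single-token response has no interval
    return (rec["t_last_token_ns"] - rec["t_first_token_ns"]) / 1e9 / intervals


def percentile(values, q):
    """Nearest-rank percentile for integer q in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative record, not a measurement.
rec = {
    "t_send_ns": 0,
    "t_first_token_ns": 500_000_000,
    "t_last_token_ns": 3_650_000_000,
    "prompt_tokens": 4096,
    "completion_tokens": 64,
    "inflight_at_send": 3,
}
print(ttft_s(rec), tpot_s(rec))
```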
### Server-side `/metrics` sampling (1 Hz)
Captured into `metrics_<cfg>_<phase>_<cell>.jsonl`. Same fields as
prior version.
### Step-level timing instrumentation (NEW — we ship the patch)
The reviewer correctly noted that the existing A3 step log
(`third_party/vllm/.../scheduler.py:953`) only records per-step token
counts and request lists, **not** step duration or per-callback
timing. So we cannot just turn on AGENTIC_STEP_LOG_PATH and get
Figure 6/7's "direct evidence" — that data does not exist yet.
This microbench ships its own scheduler timing patch at
`microbench/connector_tax/patches/scheduler_step_timing.py`, modelled
on the idempotent `microbench/patches/apply_patches.py` we wrote for
Microbench 2. It uses the same `_pd_profile.py` emit pattern.
The patch instruments:
1. `Scheduler.schedule()` entry → `t_step_enter` (perf_counter_ns)
2. `Scheduler.schedule()` exit → `t_step_exit`
3. Around `connector.build_connector_meta(scheduler_output)`
→ `build_meta_us`
4. Around `connector.get_finished(...)` call
(in `_update_from_output` / mixin)
→ `get_finished_us`
5. Around `connector.start_load_kv(...)` (in the worker mixin
`_get_kv_connector_output`)
→ `start_load_kv_us` (worker-side; emitted from worker process)
Each step emits one JSONL record:
```json
{
"t_ns": <step_enter perf_counter_ns>,
"step_id": <monotonic int>,
"step_duration_us": <step_exit - step_enter>,
"build_meta_us": <build_connector_meta duration>,
"get_finished_us": <connector get_finished duration>,
"start_load_kv_us": <worker start_load_kv; null on scheduler-only proc>,
"num_running": <int>,
"num_waiting": <int>,
"prefill_tokens": <int>,
"decode_tokens": <int>
}
```
Output goes to `AGENTIC_STEP_LOG_PATH` (one file per process; we use
`engine_step_<phase>_<cell>.jsonl` paths from launch scripts).
Apply / revert is idempotent — same `# CONNECTOR_TAX_PATCH` marker
strategy as Microbench 2.
```
microbench/connector_tax/patches/
├── _step_profile.py # the emitter (ported from _pd_profile)
├── scheduler_step_timing.py # patch installer / reverter
└── apply.sh # invoked by run_all.sh; revert at end
```
**Fallback if the patch fails to apply on a future vLLM version**:
the bench drops to client-side TTFT/TPOT only. Figures 6 (per-step
CDF) and 7 (decomposition stack) are not produced; the manifest
records `step_timing_available=false`. The other figures and the
H1 / H2 / H4 headline numbers do not depend on this patch, so the
bench is still useful in fallback mode.
### Derived (post-processing)
For each (config, rate-or-shape) cell after warmup:
- TTFT/TPOT/E2E p50/p90/p99
- `effective_throughput`, `requested_throughput`, throughput_ratio
- `saturation_flag` (which criterion, if any, triggered)
- (when `step_timing_available=true`):
- `step_duration_us` p50/p90
- `build_meta_us` p50/p90
- `get_finished_us` p50/p90
- `start_load_kv_us` p50/p90 (worker-process file)
- `connector_total_us` p50/p90 (sum of the 3 callback timings)
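The step-level aggregates could be computed from the JSONL records roughly as follows. The field names come from the record schema above. `summarize_steps` is hypothetical, it treats a null `start_load_kv_us` as 0, and it does not show how worker-side records would be joined with scheduler-side ones.

```python
import json
import math

CALLBACKS = ("build_meta_us", "get_finished_us", "start_load_kv_us")


def connector_total_us(record):
    """Sum of the three callback timings; null or missing counts as 0 (assumption)."""
    return sum(record.get(key) or 0 for key in CALLBACKS)


def nearest_rank(values, q):
    ordered = sorted(values)
    return ordered[max(1, math.ceil(q * len(ordered) / 100)) - 1]


def summarize_steps(lines):
    """Aggregate engine_step JSONL lines into p50/p90 summaries."""
    records = [json.loads(line) for line in lines if line.strip()]
    durations = [r["step_duration_us"] for r in records]
    totals = [connector_total_us(r) for r in records]
    return {
        "step_duration_us": {q: nearest_rank(durations, q) for q in (50, 90)},
        "connector_total_us": {q: nearest_rank(totals, q) for q in (50, 90)},
    }


# Illustrative records only.
lines = [
    json.dumps({"step_duration_us": d, "build_meta_us": b,
                "get_finished_us": g, "start_load_kv_us": None})
    for d, b, g in [(100, 10, 1), (200, 20, 2), (300, 30, 3), (400, 40, 4)]
]
summary = summarize_steps(lines)
print(summary)
```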
### Substrate tax definition
```
tax_TTFT_p90(X, ref) = TTFT_p90(X, ref) / TTFT_p90(plain, ref) - 1
tax_TPOT_p90(X, ref) = TPOT_p90(X, ref) / TPOT_p90(plain, ref) - 1
tax_step_p50(X) = step_duration_us p50 (X) - step_duration_us p50 (plain)
tax_callback_p50(X) = connector_total_us p50 (X) # plain has no callbacks
```
`tax_step` is the **gross** per-step penalty (any cause).
`tax_callback` is the **callback-attributable** penalty (sum of the
three measured connector hooks). The difference
`tax_step - tax_callback` is "step-time overhead not attributable to instrumented
callbacks" — block-pool walks, scheduler-state churn, etc. Reporting
both lets reviewers see whether our instrumentation accounts for the
full cost.
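The definitions translate directly into code. In the sketch below the function names are hypothetical; the first example uses the trace-replay prior quoted in the Hypotheses section, and the step-level numbers are illustrative only.

```python
def relative_tax(value, plain):
    """tax_TTFT_p90 / tax_TPOT_p90: relative increase over plain at the same ref rate."""
    return value / plain - 1.0


def step_tax_us(step_p50, plain_step_p50):
    """Gross per-step penalty in microseconds."""
    return step_p50 - plain_step_p50


def unattributed_us(step_p50, plain_step_p50, connector_total_p50):
    """Step-time overhead not explained by the instrumented callbacks."""
    return step_tax_us(step_p50, plain_step_p50) - connector_total_p50


# Prior from elastic_migration_v2: Mooncake_both 10.67 s vs plain 7.35 s.
prior = relative_tax(10.67, 7.35)
print(f"{prior:+.0%}")

# Illustrative step-level numbers, not measurements.
print(unattributed_us(step_p50=1500, plain_step_p50=1200, connector_total_p50=220))
```

Percentiles do not add, so the residual computed from p50 values is an indication of missing attribution rather than an exact accounting.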
---
## Auditability & Reproducibility Plan
### Run artifacts (per config × phase × cell)
```
microbench/connector_tax/results/
  <date>_<config>/
    config.json                        # parameters used
    launch.sh                          # exact vLLM launch command
    vllm_stdout.log                    # full vLLM stdout
    vllm_stderr.log                    # full vLLM stderr
    requests_<phase>_<cell>.jsonl
    metrics_<phase>_<cell>.jsonl
    engine_step_<phase>_<cell>.jsonl   # if A3 active
    summary.json                       # per-cell percentiles
    env.txt                            # pip freeze, vLLM SHA, GPU info
  preflight/
    verify_kv_consumer.log
    verify_multi_connector.log
    verify_noop_connector.log
    preflight_status.json              # which configs are SKIP'd and why
```
### Manifest
`microbench/connector_tax/MANIFEST.md` lists every run with date,
vLLM version + git SHA, Mooncake version, NIXL version, LMCache
version, GPU id (`nvidia-smi -L`), config name, launch command, result
directory, A3-active flag, and skip-status (with reason).
### Re-run script
`microbench/connector_tax/run_all.sh` runs in **five barrier stages** (0–4).
Phase A across all configs must finish before Phase B can pick a
reference rate.
**Stage 0 — Pre-flight + patch:**
1. Run `verify_kv_consumer.sh`, `verify_multi_connector.sh`, and
`verify_noop_connector.sh`. Persist `preflight_status.json`.
2. Apply `microbench/connector_tax/patches/scheduler_step_timing.py`
to the active vLLM. Record `step_timing_available=true|false`
in the manifest based on whether the patch applied cleanly.
**Stage 1 — Phase A (all configs, randomized order):**
For each non-SKIP config:
1. `launch_<config>.sh` → wait for `/v1/models`.
2. `bench_loop.py --rates 0.5,1,2,4,8,16,32 --shape 4096,256
--duration 60 --min-completed 200`.
3. Kill vLLM, wait 60 s for GPU release.
4. Append manifest row.
After **all** configs have finished Stage 1:
**Stage 2 — Reference rate selection (CPU only):**
1. Compute saturation flags from each cell using the data-driven
criteria.
2. Choose `ref_safe` = max rate where ALL configs that completed
Phase A are not saturated.
3. Choose `ref_load` = max rate where `plain` is not saturated.
4. Persist `reference_rates.json`.
**Stage 3 — Phase B (all configs, randomized order):**
For each non-SKIP config:
1. `launch_<config>.sh` → wait for ready.
2. `bench_loop.py --rate <ref_safe> --shapes 512x64,512x256,
...,32768x1024 --duration 60 --min-completed 200`.
3. (If `ref_load != ref_safe`) Run Phase B' for priority configs
(plain, mooncake_both, nixl_both, lmcache_only) on shapes
{512x256, 4096x256, 32768x256} at `ref_load`.
4. Kill vLLM, wait 60 s, append manifest row.
**Stage 4 — Patch revert + analysis:**
1. Revert the scheduler_step_timing patch.
2. `analyze.py --root results/`.
3. `plot_connector_tax.py`.
A reviewer with a fresh checkout runs:
```
cd microbench/connector_tax
bash run_all.sh
```
and gets the figures + manifest + raw artifacts. The script is
re-runnable: any stage can be skipped via `--skip-stage N` if the
artifacts exist.
### Determinism notes
Same as previous: temperature=0 + ignore_eos give shape determinism;
content varies per request via seeded UUID. We do not promise
bit-exact reproducibility, only distribution-level reproducibility.
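A seeded per-request tag could be produced as below. This is a sketch under assumptions: `request_tag` is hypothetical, and the actual scheme in `bench_loop.py` may differ. Expanding the tag into a prompt of an exact token length requires the model's tokenizer and is not shown.

```python
import hashlib
import random
import uuid


def request_tag(seed: int, index: int) -> str:
    """Deterministic UUID + hash for request `index` under a given run seed."""
    rng = random.Random(f"{seed}:{index}")           # string seeds are stable across runs
    uid = uuid.UUID(int=rng.getrandbits(128), version=4)
    digest = hashlib.sha256(uid.bytes).hexdigest()
    return f"{uid} {digest}"


print(request_tag(7, 0))
print(request_tag(7, 1))
```

Re-running with the same seed reproduces the same sequence of tags, while distinct indices give distinct content so that no two requests share a prefix.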
### Updated runtime estimate (was 1.5–2 h, **now 4–5.5 h**)
| Phase | Time |
|---|---|
| Pre-flight (3 verify scripts) | 15 min |
| Phase A: 8 configs × (90 s warmup + 1010 s cells + 60 s GPU clear) | 155 min |
| Phase A → ref_safe selection (CPU) | <1 min |
| Phase B (best, `ref_safe ≥ 4`): 8 × (90 + 630 + 60) | 104 min |
| Phase B (worst, `ref_safe = 2`): 8 × (90 + 990 + 60) | 152 min |
| Optional Phase B' (4 configs × 3 shapes × ≥70 s + 4 × 90 s warmup) | 20 min |
| Analysis + figures | 5 min |
| **Total (best case)** | **~5 h** |
| **Total (worst case)** | **~5.5 h** |
This is honest. The reviewer's earlier estimate of 2.5–3 h
underestimated how long low-rate cells must run to give stable p90.
---
## Analysis & Figures
### Figure 1: TTFT p90 vs send rate, per configuration (Phase A)
Same as before, now 8 lines plus saturation markers (× per criterion).
### Figure 2: TPOT p90 vs send rate, per configuration (Phase A)
Same.
### Figure 3: Achieved throughput vs requested (Phase A)
Same, 8 lines + y=x reference + saturation knee annotations.
### Figure 4: Substrate tax bar (TTFT p90 + TPOT p90)
At `ref_safe` and `ref_load`, side-by-side bars per non-plain config.
Shows:
- Pure tax (`ref_safe`)
- Tax + non-linear queueing (`ref_load`)
- The gap is the **coupling amplification** the reviewer flagged.
### Figure 5: Shape-dependent tax heatmap (Phase B)
3×3 heatmap (input × output) of tax_TTFT_p90 for each non-plain
config, laid out in a row: noop_connector, mooncake_both, nixl_both,
lmcache_only, mooncake_producer, multi_mooncake_lmcache (6 heatmaps),
plus mooncake_consumer as a seventh unless pre-flight dropped it.
### Figure 6: Per-step latency CDF, ref_safe rate (Phase A)
X = step duration (μs), Y = CDF, line per config. **The most direct
visualization of "what each step costs."** Shipped only if A3 step log
is available.
### Figure 7: Tax decomposition stack
For each non-plain config at ref_safe, stacked bar:
- "framework cost" estimated = tax(noop_connector)
- "implementation cost" estimated = tax(this config) - tax(noop_connector)
If `noop_connector` doesn't run (we'd document why), we drop this
figure and report tax as a single number per config.
### Figure 8: H4 additivity check
3-bar group: tax(mooncake_both), tax(lmcache_only), tax(multi). The
sum-of-first-two compared against multi visualizes additivity.
---
## Risks & Mitigations (revised)
| Risk | Impact | Mitigation |
|---|---|---|
| `kv_consumer` won't start with dummy bootstrap | Skip the config | Pre-flight; documented SKIP in manifest |
| `multi_mooncake_lmcache` crashes engine | Skip the config | Pre-flight |
| NIXL not installed | Skip nixl_both | Tolerant; warn + continue |
| LMCache not installed | Skip lmcache_only AND multi config | Tolerant; warn + continue |
| GPU thermal drift across 3+ h | Skews late configs | Run order randomized; consider running twice on different days and reporting both |
| Open-loop blow-up at 32 req/s | Memory blowup | Inflight cap 256, drop with logged counter |
| Cold-start of first request | Inflates mean TTFT | 10 s warmup discarded |
| `scheduler_step_timing` patch fails to apply on a future vLLM version | Lose Figures 6 and 7 | Document `step_timing_available=false` in manifest; H1/H2/H4 still report from client-side TTFT/TPOT |
| `noop_connector` import fails (PYTHONPATH or class signature) | Lose Figures 7 + H3 falsifier | Pre-flight `verify_noop_connector.sh` catches this; report SKIP in manifest |
---
## Success Criteria (revised)
1. **H1 falsifiable**: tax_TTFT_p90 for `mooncake_both` at `ref_safe`
is reported. We accept the prior (≈45%) if measurement is in
[25%, 60%]; we **revise the prior** if outside.
2. **H2 testable**: NIXL-vs-Mooncake gap at `ref_safe` is reported.
The trace-replay difference was ~7 pp. We document agreement /
disagreement.
3. **H3 disambiguated**: tax(noop_connector) at `ref_safe` is
reported. We label substrate tax as
"framework-cost-dominated" if noop_connector ≥ 50% of
mooncake_both tax, "implementation-cost-dominated" if < 30%.
4. **H4 additivity**: |tax(multi) - (tax(mooncake_both) +
tax(lmcache_only))| / tax(multi) ≤ 0.30 → linear.
5. **H5 + H6 directional**: report whether tax_TTFT_p90 grows with
input and tax_TPOT_p90 grows with output (sign + magnitude).
6. **All artifacts present**: every config that ran has the 6 file
types; every SKIP config has a reason in `preflight_status.json`.
7. **Bench finishes < 6 h** wall clock on idle dash0
(Phase A + Phase B + optional Phase B' combined; reflects min-completed
extension at low rates).
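Criteria 1, 3 and 4 are mechanical enough to sketch in Python. The function names and example values are hypothetical; the thresholds come from the list above. The 30–50% band in criterion 3 has no label in this document, so the sketch returns "unlabelled" there, and it assumes the taxes used as denominators are non-zero.

```python
def h1_verdict(tax_mooncake_both):
    """Accept the ~45% prior if the measured tax lands in [25%, 60%]."""
    return "accept prior" if 0.25 <= tax_mooncake_both <= 0.60 else "revise prior"


def h3_label(tax_noop, tax_mooncake_both):
    """Classify by the share of Mooncake_both tax that noop_connector reproduces."""
    share = tax_noop / tax_mooncake_both
    if share >= 0.50:
        return "framework-cost-dominated"
    if share < 0.30:
        return "implementation-cost-dominated"
    return "unlabelled"  # band between 30% and 50% is not named in the criteria


def h4_is_linear(tax_multi, tax_mooncake_both, tax_lmcache_only, tol=0.30):
    """Relative deviation of the stacked tax from the sum of its parts."""
    deviation = abs(tax_multi - (tax_mooncake_both + tax_lmcache_only))
    return deviation / tax_multi <= tol


# Illustrative taxes only, not measurements.
print(h1_verdict(0.45))
print(h3_label(0.05, 0.40))
print(h4_is_linear(0.50, 0.40, 0.15))
```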
---
## Out of Scope
- Multi-node Mooncake (RDMA over actual network).
- Patching Mooncake or vLLM to optimize the substrate (the point of
this microbench is to measure baseline cost as shipped).
- Varying `chunk_size`, `max_num_seqs`, or other vLLM scheduler
parameters; fixed at trace-replay defaults.
- Chunk-boundary effects (input ∈ {8192, 16384}). The reviewer noted
this is a real follow-up but adding it doubles Phase B runtime.
Documented as a follow-up if Phase B shows shape-dependent tax that
can't be explained by total token count.
---
## Cross-references
- `analysis/characterization/elastic_migration_v2/README.md` — the
trace-replay paper this microbench validates / refutes.
- `microbench/interference/` — Microbench 1 (B2 same-worker
interference; complementary).
- `microbench/lifecycle/` — Microbench 2 (PD-sep transfer breakdown;
uses different vLLM patches).
- `microbench/patches/` — `_pd_profile.py` template if A3 fallback
is needed.
---
## Files
```
microbench/connector_tax/
├── DESIGN.md                          # this file
├── MANIFEST.md                        # filled per run
├── tools/
│   ├── noop_connector.py              # custom NoOpConnector for H3
│   ├── dummy_bootstrap.py             # for kv_consumer pre-flight
│   ├── verify_kv_consumer.sh
│   ├── verify_multi_connector.sh
│   └── verify_noop_connector.sh
├── patches/
│   ├── _step_profile.py               # event emitter (ports _pd_profile)
│   ├── scheduler_step_timing.py       # idempotent install/revert
│   └── apply.sh                       # invoked by run_all.sh
├── launch/
│   ├── launch_plain.sh
│   ├── launch_noop_connector.sh
│   ├── launch_mooncake_producer.sh
│   ├── launch_mooncake_consumer.sh
│   ├── launch_mooncake_both.sh
│   ├── launch_nixl_both.sh
│   ├── launch_lmcache_only.sh
│   └── launch_multi_mooncake_lmcache.sh
├── bench_loop.py                      # open-loop loadgen (--min-completed)
├── metrics_sampler.py                 # /metrics scraper
├── analyze.py                         # raw → percentiles + saturation flags
├── plot_connector_tax.py              # all figures
├── run_all.sh                         # 5-stage barrier orchestrator
├── results/<date>_<config>/           # per-run artifacts
└── results/preflight/                 # pre-flight verification
```