Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.
Configurations (8):
plain, noop_connector, mooncake_{producer,consumer,both},
nixl_both, lmcache_only, multi_mooncake_lmcache.
Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).
Workload: two-phase sweep
Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})
Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.
run_all.sh runs as 5-stage barrier:
0 pre-flight + apply patch
1 Phase A all configs
2 pick ref_safe / ref_load
3 Phase B all configs
4 revert patch + analyze + plot
Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
Microbench 3: KV Connector Substrate Tax (revision 2)
Goal
Validate the headline claim from
analysis/characterization/elastic_migration_v2/README.md Result 1:
Switching the vLLM launch from plain to kv_role=kv_both without ever triggering PD-sep already costs TTFT p90 +45%, TPOT p90 +25%, hotspot index +19%.
That claim was measured on 8 instances with a 1214-request real-trace replay under saturated coupling. We replicate it with single-instance, synthetic, open-loop workload so we can:
- Disambiguate vLLM-v1-framework cost from connector-implementation cost by including a no-op connector.
- Validate (or refute) the agentic-coupling amplification claim: if single-instance synthetic numbers ≈ 8-instance trace numbers (38–45%), the coupling is not the main cause. If single-instance is much smaller, then the 8-instance saturated coupling does most of the damage.
- Make the result reproducible and auditable: every run dumps full raw artifacts + manifest entry + a re-run script.
Hypotheses (revised based on elastic_migration_v2 prior)
The headline trace-replay numbers from elastic_migration_v2 are our prior, not an open question:
trace replay, 8 instances, agentic dispatch coupling, saturated:
plain TTFT p90 = 7.35 s
NIXL TTFT p90 = ~10.1 s (+38%)
Mooncake_both = 10.67 s (+45%)
The microbench validates / refutes / refines these:
| Hypothesis | Prediction | Falsifier / interpretation |
|---|---|---|
| H1: Substrate tax persists at single instance / synthetic load | Single-instance Mooncake_both TTFT p90 is ≥ 10% higher than plain at the reference rate | If <10% → trace-replay tax is dominated by 8-instance feedback coupling, not connector machinery |
| H2: NIXL-vs-Mooncake gap is mechanism-side, not coupling-side | Single-instance numbers preserve the ~7 pp gap (NIXL tax < Mooncake tax by 5–10 pp) | If gap shrinks/inverts → the gap was a coupling artifact |
| H3: Framework-vs-implementation split | noop_connector (v1 framework only, all hooks return no-op) tax is < 50% of Mooncake_both tax | Lets us attribute cost between vLLM's connector dispatch loop and the specific connector's per-step work |
| H4: MultiConnector tax is additive | tax(Mooncake+LMCache) ≈ tax(Mooncake) + tax(LMCache), within 30% | If super-additive → cross-connector interference; if sub-additive → some shared per-step cost is amortized |
| H5: Tax is shape-dependent | tax_TTFT_p90 grows monotonically with input length for Mooncake_both | Confirms E2 audit §6.5 hypothesis (set(cache.keys()) walks scale with cache size) |
| H6: Tax compounds in decode | tax_TPOT_p90 grows with output length | Confirms connector code runs each decode step |
H3 and H4 are the must-have new hypotheses that pulled in the new configs.
Hardware & Model
| Parameter | Value |
|---|---|
| GPU | NVIDIA H20 96 GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 |
| max_model_len | 200 000 |
| enable_prefix_caching | true |
| enable_chunked_prefill | true |
| max_num_batched_tokens | 8192 |
| gpu_memory_utilization | 0.9 |
Single GPU per run. Each configuration is a fresh vLLM launch on GPU 0.
Configurations (8 total, was 6)
| ID | Connector | Role | Why we measure it |
|---|---|---|---|
| plain | (none) | - | Baseline |
| noop_connector | custom NoOpConnector (this microbench ships it) | n/a | Isolate vLLM-v1 framework cost (build_connector_meta, mixin dispatch, get_finished bookkeeping) without any real connector work; see Note 1 |
| mooncake_producer | MooncakeConnector | kv_producer | Isolate P-side stack |
| mooncake_consumer | MooncakeConnector | kv_consumer | Isolate D-side stack (pre-flight gated, see §Pre-flight) |
| mooncake_both | MooncakeConnector | kv_both | The README claim |
| nixl_both | NIXLConnector | kv_both | Connector-specific vs framework cost |
| lmcache_only | LMCacheConnectorV1 | n/a | NEW: gives H4 a denominator |
| multi_mooncake_lmcache | MultiConnector(Mooncake kv_both + LMCacheConnectorV1) | mixed | Stacked-connector check (gated by pre-flight) |
Note 1 — noop_connector (we ship it, not the vLLM-bundled one):
The vLLM-shipped ExampleConnector is NOT a true no-op — it
implements a debug-grade disk KV cache: stores match metadata,
serializes safetensors per-layer in save_kv_layer, etc. (see
third_party/vllm/.../example_connector.py:345,
example_connector.py:250,
kv_transfer_utils.py:49). Using it would conflate framework cost
with disk-I/O + per-layer save cost.
Instead we ship microbench/connector_tax/tools/noop_connector.py
that subclasses KVConnectorBase_V1 and returns no-op for every
hook:
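# Import path as in upstream vLLM; adjust if the vendored copy differs.
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1, KVConnectorMetadata)

class NoOpConnector(KVConnectorBase_V1):
    # Scheduler-side hooks: never match cached tokens, never hold blocks.
    def get_num_new_matched_tokens(self, req, num_computed): return 0, False
    def update_state_after_alloc(self, *_args, **_kw): pass
    def build_connector_meta(self, scheduler_output): return KVConnectorMetadata()
    def request_finished(self, *_args, **_kw): return False, None
    # Worker-side hooks: no loads, no saves, nothing ever finishes asynchronously.
    def start_load_kv(self, *_args, **_kw): pass
    def wait_for_layer_load(self, *_args, **_kw): pass
    def save_kv_layer(self, *_args, **_kw): pass
    def wait_for_save(self): pass
    def get_finished(self, *_args, **_kw): return None, None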
vLLM loads it via:
--kv-transfer-config '{
  "kv_connector_module_path":
    "microbench.connector_tax.tools.noop_connector:NoOpConnector",
  "kv_role": "kv_both"
}'
PYTHONPATH is set in launch_noop_connector.sh so vLLM can resolve
the dotted import path.
If noop_connector overhead ≈ 0 → all substrate tax is in connector
implementations. If noop_connector overhead ≈ 30% of Mooncake_both
tax → vLLM's framework dispatch alone explains a meaningful slice.
Pre-flight Verification (NEW — gates risky configs)
Two configs depend on infrastructure we can't take for granted. Run verification scripts BEFORE the main bench. Skip the config (and record SKIP in manifest) if it fails.
verify_kv_consumer.sh
1. Start a dummy bootstrap process (tools/dummy_bootstrap.py).
2. Launch vLLM with kv_role=kv_consumer pointing at the dummy.
3. Curl /v1/models: must return 200 with the model id.
4. Send one short request (max_tokens=4) without kv_transfer_params: must return 200 in <30 s.
If steps 3 or 4 fail, the config is unrunnable and we drop it. We do not try harder; the trace-replay paper does not promise consumer-only single-instance numbers.
verify_multi_connector.sh
1. Launch vLLM with MultiConnector(MooncakeConnector kv_both, LMCacheConnectorV1).
2. Send 5 sequential requests, max_tokens=32, random content.
3. All 5 must complete in <60 s.
4. Verify no engine crashes: vllm:engine_core_failed_total == 0 from /metrics.
If any check fails, drop the config and mark SKIP (Manifest column: "why skipped").
verify_noop_connector.sh
1. Launch with noop_connector active (loaded via kv_connector_module_path).
2. Send 5 sequential requests, max_tokens=32.
3. Verify all 5 return 200 in <30 s and no engine crash.
This one is unlikely to fail but the verification is the same.
The verification scripts produce verify_<config>.log under
results/preflight/ and a preflight_status.json summarizing
skip-or-include decisions for the manifest.
Workload (revised)
Open-loop, fixed-rate, randomized content, two-phase sweep, data-driven saturation criteria.
Phase A — rate sweep (find saturation per config)
| Parameter | Value |
|---|---|
| Input length | 4096 tokens (random per request) |
| Output length | 256 tokens (max_tokens=256, ignore_eos=True) |
| Send rates | {0.5, 1, 2, 4, 8, 16, 32} req/s (added 0.5 for low-end calibration) |
| Duration per cell | max(60 s, time_to_min_completed) + 10 s warmup |
| Min completed per cell | 200 requests |
| Inflight cap | 256 (drop excess to log) |
Why min_completed = 200: at p90, the margin-of-error of a Monte Carlo percentile estimate from N samples is ≈ 1.65 √(0.9 × 0.1 / N). For N=200 this is ~3.5% absolute, ~10% relative — acceptable. For N=30 (which 0.5 req/s × 60 s gives) it's ~28% relative, useless for saturation detection. So at low rates the cell automatically extends: 0.5 req/s → ≥ 400 s, 1 req/s → ≥ 200 s. At 4 req/s and above the 60-second floor dominates.
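The sample-size argument and the resulting cell lengths can be checked with a few lines. This is a minimal sketch, not the bench_loop.py implementation: the function names are hypothetical, the margin uses the normal approximation quoted above (z = 1.65), and time_to_min_completed is approximated as min_completed / rate, which assumes the server keeps up with the send rate.

```python
import math


def percentile_rank_margin(q: float, n: int, z: float = 1.65) -> float:
    """Normal-approximation margin of error on the rank of the q-quantile
    estimated from n samples (absolute, as a fraction of the distribution)."""
    return z * math.sqrt(q * (1.0 - q) / n)


def cell_duration_s(rate: float, min_completed: int = 200,
                    floor_s: float = 60.0, warmup_s: float = 10.0) -> float:
    """max(floor, time to send min_completed requests) plus warmup.

    Assumption: completions arrive at the send rate (no saturation).
    """
    return max(floor_s, min_completed / rate) + warmup_s


RATES = (0.5, 1, 2, 4, 8, 16, 32)
durations = {rate: cell_duration_s(rate) for rate in RATES}
```

Under these assumptions percentile_rank_margin(0.9, 200) is about 0.035, and the durations come out as 410 s, 210 s and 110 s for the three lowest rates, then 70 s for every rate from 4 req/s up.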
Updated Phase A duration per config (rounded):
| Rate (req/s) | Duration (s) | Note |
|---|---|---|
| 0.5 | 410 | extended to hit 200 completed |
| 1 | 210 | extended |
| 2 | 110 | extended slightly |
| 4 | 70 | floor |
| 8 | 70 | floor |
| 16 | 70 | floor |
| 32 | 70 | floor |
| sum per config | ~1010 s | includes the 10 s per-cell warmup; excludes the 90 s vLLM launch warmup |
Total Phase A: 8 configs × (90 s vLLM warmup + 1010 s of cells + 60 s GPU release) = 8 × 1160 s ≈ 155 min.
Saturation criteria (data-driven, was hardcoded inflight>8)
A config is saturated at rate r if any of:
- effective_throughput(r) / r < 0.95: vLLM can't keep up
- num_requests_waiting p50 (from /metrics) > 1: vLLM has a visible queue
- TTFT p90 (r) / TTFT p90 (r/2) > 1.5: TTFT inflating super-linearly
The per-config saturation rate is the lowest r that triggers ≥ 1 criterion. We log which criterion fired so reviewers can disagree.
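The three criteria translate directly into a per-cell check. The sketch below is illustrative rather than the analyze.py code: the function names, the criterion labels, and the per-cell dictionary keys are assumptions, and the TTFT-inflation test is simply skipped when no cell exists at r/2 (as for the lowest rate).

```python
from typing import Dict, List, Optional, Tuple


def saturation_reasons(rate: float, effective_throughput: float,
                       waiting_p50: float, ttft_p90: float,
                       ttft_p90_half_rate: Optional[float]) -> List[str]:
    """Return the name of every criterion that fires for one cell."""
    reasons = []
    if effective_throughput / rate < 0.95:
        reasons.append("throughput_ratio")
    if waiting_p50 > 1:
        reasons.append("queue_depth")
    # Skipped when there is no cell at rate / 2 (e.g. the lowest rate).
    if ttft_p90_half_rate is not None and ttft_p90 / ttft_p90_half_rate > 1.5:
        reasons.append("ttft_inflation")
    return reasons


def saturation_rate(cells: Dict[float, dict]) -> Tuple[Optional[float], List[str]]:
    """Lowest rate with at least one firing criterion, plus which ones fired."""
    for rate in sorted(cells):
        cell = cells[rate]
        half = cells.get(rate / 2)
        reasons = saturation_reasons(
            rate, cell["effective_throughput"], cell["waiting_p50"],
            cell["ttft_p90"], half["ttft_p90"] if half else None)
        if reasons:
            return rate, reasons
    return None, []
```

Returning the list of fired criteria alongside the rate keeps the "which criterion fired" record that reviewers need in order to disagree with a saturation call.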
Reference rate selection (revised)
We define two reference rates for Phase B, both computed from Phase A data:
ref_safe = max rate where ALL 8 configs are NOT saturated
ref_load = max rate where plain is NOT saturated
(some other configs may be saturated here)
ref_safe measures the pure substrate per-step tax under no
queueing.
ref_load measures the tax in the regime closer to deployment —
where plain is happily under-loaded but Mooncake is starting to hurt.
The gap tax(ref_load) − tax(ref_safe) is the non-linear queueing
amplification of the substrate tax. This is exactly the effect the
reviewer worried about and now we measure it explicitly instead of
ignoring it.
Both rates are reported. The headline number we cite is ref_safe
because it's the cleanest decomposition. The ref_load numbers tell
us how much worse the tax gets near saturation.
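Given each config's saturation rate from Phase A, both reference rates fall out of one helper. This is a sketch under a stated assumption, not the Stage 2 script itself: it treats saturation as monotone (a config saturated at rate r is also saturated above r), and the function name and input layout are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence

RATES = (0.5, 1, 2, 4, 8, 16, 32)


def reference_rates(sat_rate: Dict[str, Optional[float]],
                    rates: Sequence[float] = RATES) -> Dict[str, Optional[float]]:
    """sat_rate maps config -> lowest saturated rate (None = never saturated).

    Assumption: a config saturated at rate r stays saturated above r, so its
    unsaturated rates are exactly those strictly below its saturation rate.
    """
    limits = {cfg: math.inf if s is None else s for cfg, s in sat_rate.items()}

    def highest_below(limit: float) -> Optional[float]:
        ok = [r for r in rates if r < limit]
        return max(ok) if ok else None

    return {
        "ref_safe": highest_below(min(limits.values())),  # all configs unsaturated
        "ref_load": highest_below(limits["plain"]),       # only plain unsaturated
    }
```

For example, if plain saturates at 16 req/s and mooncake_both at 8 req/s, the helper yields ref_safe = 4 and ref_load = 8. If Phase A data turns out to be non-monotone, the per-rate saturation flags should be used directly instead.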
Phase B — shape sweep (substrate tax across length regimes)
| Parameter | Value |
|---|---|
| Send rate | ref_safe (one value, single rate to keep cost bounded) |
| Input lengths | {512, 4096, 32768} tokens |
| Output lengths | {64, 256, 1024} tokens (32 promoted to 64 — see Note 2) |
| Duration per cell | max(60 s, time_to_min_completed) + 10 s warmup |
| Min completed per cell | 200 requests |
| Cartesian shapes | 3 × 3 = 9 |
The same min-completed extension applies. If ref_safe ≥ 4 req/s,
each cell hits the 60 s floor and per-config Phase B cell time is
9 × 70 s = 630 s. If ref_safe = 2 req/s, cells extend to 110 s and
per-config cell time is 9 × 110 s = 990 s.
Total Phase B (worst case, ref_safe = 2): 8 configs × (90 s warmup
+ 990 s of cells + 60 s GPU release) ≈ 152 min. Best case
(ref_safe ≥ 4): 8 × (90 + 630 + 60) ≈ 104 min.
If after Phase A we find ref_load differs meaningfully from
ref_safe, we add a small Phase B' run on ref_load for the 4
high-priority configs (plain, mooncake_both, nixl_both, lmcache_only)
on 3 representative shapes (512/256, 4096/256, 32768/256). That is
4 configs × 3 shapes × 70 s ≈ 14 min, controlled trade-off.
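The phase totals quoted in this section follow one formula, shown here as a small sketch so the arithmetic can be re-derived if the cell counts change. The helper name is hypothetical; the 90 s launch warmup and 60 s GPU release are the values used in the text.

```python
def phase_minutes(n_configs: int, cells_s: float,
                  launch_warmup_s: float = 90.0,
                  gpu_release_s: float = 60.0) -> float:
    """Wall-clock minutes for one phase: every config pays launch warmup,
    its measurement cells, and the GPU release wait."""
    return n_configs * (launch_warmup_s + cells_s + gpu_release_s) / 60.0


phase_a = phase_minutes(8, 1010)            # rate sweep, 7 cells per config
phase_b_best = phase_minutes(8, 9 * 70)     # ref_safe >= 4: 70 s per shape
phase_b_worst = phase_minutes(8, 9 * 110)   # ref_safe = 2: 110 s per shape
phase_b_prime_cells = 4 * 3 * 70 / 60.0     # cells only, 4 configs x 3 shapes
```

These evaluate to roughly 155, 104, 152 and 14 minutes respectively.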
Note 2 — output 64 instead of 32: with 32 output tokens TPOT is estimated from 31 inter-token intervals — too few samples for stable p90. Bumping to 64 gives 63 samples, comfortable for percentile estimation. The output=32 regime is also less common in agentic deployments where a tool result frame is rarely <64 tokens.
Common settings
| Parameter | Value |
|---|---|
| temperature | 0 (deterministic) |
| ignore_eos | True (force exact output length) |
| Content | random UUID + hash per request, zero prefix cache hit |
| Concurrent inflight cap | 256 |
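What "open-loop with an inflight cap" means in practice: send times are fixed in advance by the rate, and a request that would exceed the cap is dropped rather than delayed. The asyncio sketch below illustrates that policy only; it is not bench_loop.py, and open_loop and the send callback are hypothetical names.

```python
import asyncio
from typing import Any, Callable, Coroutine, Tuple


async def open_loop(send: Callable[[int], Coroutine[Any, Any, None]],
                    rate: float, n_requests: int,
                    inflight_cap: int = 256) -> Tuple[int, int]:
    """Fire request i at start + i / rate regardless of earlier completions.

    Returns (sent, dropped). A request is dropped, not delayed, when the
    inflight cap is reached, so the send schedule never slips.
    """
    loop = asyncio.get_running_loop()
    start = loop.time()
    inflight = set()
    sent = dropped = 0
    for i in range(n_requests):
        delay = start + i / rate - loop.time()
        await asyncio.sleep(max(0.0, delay))
        if len(inflight) >= inflight_cap:
            dropped += 1  # a real loadgen would also log the drop
            continue
        task = asyncio.create_task(send(i))
        inflight.add(task)
        task.add_done_callback(inflight.discard)
        sent += 1
    await asyncio.gather(*inflight)  # drain outstanding requests
    return sent, dropped
```

A closed-loop generator would instead wait for a completion before sending the next request, which hides queueing delay from TTFT; the open-loop schedule keeps that delay visible.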
Metrics (revised — adds A3 step-level engine_state)
Client-side (per-request, JSONL)
Same as before: t_send_ns, t_first_token_ns, t_last_token_ns,
prompt_tokens, completion_tokens, inflight_at_send.
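The per-request fields are enough to derive TTFT, TPOT and E2E. The following sketch shows one plausible derivation and is not taken from analyze.py: it assumes TPOT is the mean inter-token time of a request (decode time divided by completion_tokens − 1) and uses a nearest-rank percentile; both helper names are hypothetical.

```python
from typing import Dict, List, Optional


def request_latencies(rec: Dict[str, int]) -> Dict[str, Optional[float]]:
    """Per-request TTFT, TPOT and E2E in seconds from one JSONL record."""
    n_out = rec["completion_tokens"]
    decode_ns = rec["t_last_token_ns"] - rec["t_first_token_ns"]
    return {
        "ttft_s": (rec["t_first_token_ns"] - rec["t_send_ns"]) / 1e9,
        # n_out tokens give n_out - 1 inter-token intervals.
        "tpot_s": decode_ns / (n_out - 1) / 1e9 if n_out > 1 else None,
        "e2e_s": (rec["t_last_token_ns"] - rec["t_send_ns"]) / 1e9,
    }


def percentile(values: List[float], p: int) -> float:
    """Nearest-rank percentile (p in 1..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, -(-p * len(ordered) // 100))  # integer ceil(p * n / 100)
    return ordered[rank - 1]
```

Cell-level p50/p90/p99 would then be percentile(...) over the post-warmup requests of that cell.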
Server-side /metrics sampling (1 Hz)
Captured into metrics_<cfg>_<phase>_<cell>.jsonl. Same fields as
prior version.
Step-level timing instrumentation (NEW — we ship the patch)
The reviewer correctly noted that the existing A3 step log
(third_party/vllm/.../scheduler.py:953) only records per-step token
counts and request lists, not step duration or per-callback
timing. So we cannot just turn on AGENTIC_STEP_LOG_PATH and get
Figure 6/7's "direct evidence" — that data does not exist yet.
This microbench ships its own scheduler timing patch at
microbench/connector_tax/patches/scheduler_step_timing.py, modelled
on the idempotent microbench/patches/apply_patches.py we wrote for
Microbench 2. It uses the same _pd_profile.py emit pattern.
The patch instruments:
- Scheduler.schedule() entry → t_step_enter (perf_counter_ns)
- Scheduler.schedule() exit → t_step_exit
- Around connector.build_connector_meta(scheduler_output) → build_meta_us
- Around connector.get_finished(...) call (in _update_from_output / mixin) → get_finished_us
- Around connector.start_load_kv(...) (in the worker mixin _get_kv_connector_output) → start_load_kv_us (worker-side; emitted from worker process)
Each step emits one JSONL record:
{
  "t_ns": <step_enter perf_counter_ns>,
  "step_id": <monotonic int>,
  "step_duration_us": <step_exit - step_enter>,
  "build_meta_us": <build_connector_meta duration>,
  "get_finished_us": <connector get_finished duration>,
  "start_load_kv_us": <worker start_load_kv; null on scheduler-only proc>,
  "num_running": <int>,
  "num_waiting": <int>,
  "prefill_tokens": <int>,
  "decode_tokens": <int>
}
Output goes to AGENTIC_STEP_LOG_PATH (one file per process; we use
engine_step_<phase>_<cell>.jsonl paths from launch scripts).
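To make the record format concrete, here is a minimal sketch of an emitter that produces one such JSONL line per step. It is illustrative only and is not the shipped _step_profile.py: StepRecorder and its methods are hypothetical names, and the sink is an in-memory buffer rather than the AGENTIC_STEP_LOG_PATH file.

```python
import io
import json
import time
from contextlib import contextmanager


class StepRecorder:
    """Hypothetical emitter: one JSONL record per scheduler step."""

    CALLBACK_FIELDS = ("build_meta_us", "get_finished_us", "start_load_kv_us")

    def __init__(self, sink):
        self._sink = sink          # any object with .write(str)
        self._step_id = 0
        self._record = None

    @contextmanager
    def step(self, **counters):
        """Time one step; counters carries num_running, num_waiting, etc."""
        record = {"t_ns": time.perf_counter_ns(), "step_id": self._step_id}
        record.update(dict.fromkeys(self.CALLBACK_FIELDS))  # null until timed
        record.update(counters)
        self._record = record
        try:
            yield self
        finally:
            elapsed_ns = time.perf_counter_ns() - record["t_ns"]
            record["step_duration_us"] = elapsed_ns / 1000
            self._sink.write(json.dumps(record) + "\n")
            self._step_id += 1
            self._record = None

    @contextmanager
    def timed(self, field):
        """Time one connector callback inside the current step."""
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            self._record[field] = (time.perf_counter_ns() - start) / 1000


sink = io.StringIO()
recorder = StepRecorder(sink)
with recorder.step(num_running=2, num_waiting=0,
                   prefill_tokens=4096, decode_tokens=2) as rec:
    with rec.timed("build_meta_us"):
        pass  # connector.build_connector_meta(scheduler_output) would run here
```

Callbacks that are never timed in a process stay null, which matches the start_load_kv_us convention for the scheduler-only process.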
Apply / revert is idempotent — same # CONNECTOR_TAX_PATCH marker
strategy as Microbench 2.
microbench/connector_tax/patches/
├── _step_profile.py # the emitter (ported from _pd_profile)
├── scheduler_step_timing.py # patch installer / reverter
└── apply.sh # invoked by run_all.sh; revert at end
Fallback if the patch fails to apply on a future vLLM version:
the bench drops to client-side TTFT/TPOT only. Figures 6 (per-step
CDF) and 7 (decomposition stack) are not produced; the manifest
records step_timing_available=false. The other figures and the
H1 / H2 / H4 headline numbers do not depend on this patch, so the
bench is still useful in fallback mode.
Derived (post-processing)
For each (config, rate-or-shape) cell after warmup:
- TTFT/TPOT/E2E p50/p90/p99
- effective_throughput, requested_throughput, throughput_ratio
- saturation_flag (which criterion, if any, triggered)
- (when step_timing_available=true):
  - step_duration_us p50/p90
  - build_meta_us p50/p90
  - get_finished_us p50/p90
  - start_load_kv_us p50/p90 (worker-process file)
  - connector_total_us p50/p90 (sum of the 3 callback timings)
Substrate tax definition
tax_TTFT_p90(X, ref) = TTFT_p90(X, ref) / TTFT_p90(plain, ref) - 1
tax_TPOT_p90(X, ref) = TPOT_p90(X, ref) / TPOT_p90(plain, ref) - 1
tax_step_p50(X) = step_duration_us p50 (X) - step_duration_us p50 (plain)
tax_callback_p50(X) = connector_total_us p50 (X) # plain has no callbacks
tax_step is the gross per-step penalty (any cause).
tax_callback is the callback-attributable penalty (sum of the
three measured connector hooks). The difference tax_step − tax_callback is "step-time overhead not attributable to instrumented
callbacks" — block-pool walks, scheduler-state churn, etc. Reporting
both lets reviewers see whether our instrumentation accounts for the
full cost.
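The definitions above reduce to two small helpers. This is a sketch with hypothetical function names; the step-level numbers in the usage note are made up for illustration, while the TTFT example reuses the trace-replay prior quoted earlier (7.35 s for plain, 10.67 s for Mooncake_both).

```python
from typing import Dict


def relative_tax(metric_cfg: float, metric_plain: float) -> float:
    """tax = metric(X) / metric(plain) - 1, used for TTFT p90 and TPOT p90."""
    return metric_cfg / metric_plain - 1.0


def step_tax_us(step_p50_cfg: float, step_p50_plain: float,
                connector_total_p50_cfg: float) -> Dict[str, float]:
    """Gross per-step tax, callback-attributable tax, and the remainder."""
    tax_step = step_p50_cfg - step_p50_plain
    tax_callback = connector_total_p50_cfg  # plain has no callbacks
    return {
        "tax_step_us": tax_step,
        "tax_callback_us": tax_callback,
        "unattributed_us": tax_step - tax_callback,
    }
```

relative_tax(10.67, 7.35) is about 0.45, the +45% prior. With illustrative step medians of 1200 µs (config) and 1000 µs (plain) and 150 µs of measured callbacks, 50 µs of the 200 µs step tax would remain unattributed.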
Auditability & Reproducibility Plan
Run artifacts (per config × phase × cell)
microbench/connector_tax/results/
    <date>_<config>/
        config.json                         # parameters used
        launch.sh                           # exact vLLM launch command
        vllm_stdout.log                     # full vLLM stdout
        vllm_stderr.log                     # full vLLM stderr
        requests_<phase>_<cell>.jsonl
        metrics_<phase>_<cell>.jsonl
        engine_step_<phase>_<cell>.jsonl    # if A3 active
        summary.json                        # per-cell percentiles
        env.txt                             # pip freeze, vLLM SHA, GPU info
    preflight/
        verify_kv_consumer.log
        verify_multi_connector.log
        verify_noop_connector.log
        preflight_status.json               # which configs are SKIP'd and why
Manifest
microbench/connector_tax/MANIFEST.md lists every run with date,
vLLM version + git SHA, Mooncake version, NIXL version, LMCache
version, GPU id (nvidia-smi -L), config name, launch command, result
directory, A3-active flag, and skip-status (with reason).
Re-run script
microbench/connector_tax/run_all.sh runs in five barrier stages (0 to 4).
Phase A must finish across all configs before the reference rates for
Phase B can be picked.
Stage 0 — Pre-flight + patch:
- Run verify_kv_consumer.sh, verify_multi_connector.sh, and verify_noop_connector.sh. Persist preflight_status.json.
- Apply microbench/connector_tax/patches/scheduler_step_timing.py to the active vLLM. Record step_timing_available=true|false in the manifest based on whether the patch applied cleanly.
Stage 1 — Phase A (all configs, randomized order): For each non-SKIP config:
- launch_<config>.sh → wait for /v1/models.
- bench_loop.py --rates 0.5,1,2,4,8,16,32 --shape 4096,256 --duration 60 --min-completed 200.
- Kill vLLM, wait 60 s for GPU release.
- Append manifest row.
After all configs have finished Stage 1:
Stage 2 — Reference rate selection (CPU only):
- Compute saturation flags from each cell using the data-driven criteria.
- Choose ref_safe = max rate where ALL configs that completed Phase A are not saturated.
- Choose ref_load = max rate where plain is not saturated.
- Persist reference_rates.json.
Stage 3 — Phase B (all configs, randomized order): For each non-SKIP config:
- launch_<config>.sh → wait for ready.
- bench_loop.py --rate <ref_safe> --shapes 512x64,512x256, ...,32768x1024 --duration 60 --min-completed 200.
- (If ref_load != ref_safe) Run Phase B' for priority configs (plain, mooncake_both, nixl_both, lmcache_only) on shapes {512x256, 4096x256, 32768x256} at ref_load.
- Kill vLLM, wait 60 s, append manifest row.
Stage 4 — Patch revert + analysis:
- Revert the scheduler_step_timing patch.
- analyze.py --root results/.
- plot_connector_tax.py.
A reviewer with a fresh checkout runs:
cd microbench/connector_tax
bash run_all.sh
and gets the figures + manifest + raw artifacts. The script is
re-runnable: any stage can be skipped via --skip-stage N if the
artifacts exist.
Determinism notes
Same as previous: temperature=0 + ignore_eos give shape determinism; content varies per request via seeded UUID. We do not promise bit-exact reproducibility, only distribution-level reproducibility.
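One way to get per-request content that is unique yet reproducible from a seed is sketched below. It is an assumption about how the seeded UUID could be generated, not the bench_loop.py code: nonce_stream is a hypothetical helper, and placing the nonce at the very start of the prompt (so no two prompts share a prefix) is likewise assumed.

```python
import hashlib
import random
import uuid
from typing import List


def nonce_stream(seed: int, count: int) -> List[str]:
    """Return `count` UUID+hash nonces that depend only on `seed`."""
    rng = random.Random(seed)
    nonces = []
    for _ in range(count):
        # Build the UUID from seeded bits instead of uuid.uuid4(), which
        # draws from the OS entropy source and cannot be replayed.
        u = uuid.UUID(int=rng.getrandbits(128), version=4)
        digest = hashlib.sha256(u.bytes).hexdigest()[:16]
        nonces.append(f"{u}-{digest}")
    return nonces
```

Re-running with the same seed reproduces the same request contents, while latency distributions still vary from run to run, which is the distribution-level reproducibility promised above.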
Updated runtime estimate (was 1.5–2 h, now 4–5.5 h)
| Phase | Time |
|---|---|
| Pre-flight (3 verify scripts) | 15 min |
| Phase A: 8 configs × (90 s warmup + 1010 s cells + 60 s GPU clear) | 155 min |
| Phase A → ref_safe selection (CPU) | <1 min |
| Phase B (best, ref_safe ≥ 4): 8 × (90 + 630 + 60) | 104 min |
| Phase B (worst, ref_safe = 2): 8 × (90 + 990 + 60) | 152 min |
| Optional Phase B' (4 configs × 3 shapes × ≥70 s + 4 × 90 s warmup) | 20 min |
| Analysis + figures | 5 min |
| Total (best case) | ~4.7 h without Phase B', ~5 h with it |
| Total (worst case) | ~5.5 h without Phase B', ~5.8 h with it |
This is honest. The reviewer's earlier estimate of 2.5–3 h underestimated how long low-rate cells must run to give stable p90.
Analysis & Figures
Figure 1: TTFT p90 vs send rate, per configuration (Phase A)
Same as before, now 8 lines plus saturation markers (× per criterion).
Figure 2: TPOT p90 vs send rate, per configuration (Phase A)
Same.
Figure 3: Achieved throughput vs requested (Phase A)
Same, 8 lines + y=x reference + saturation knee annotations.
Figure 4: Substrate tax bar (TTFT p90 + TPOT p90)
At ref_safe and ref_load, side-by-side bars per non-plain config.
Shows:
- Pure tax (ref_safe)
- Tax + non-linear queueing (ref_load)
- The gap is the coupling amplification the reviewer flagged.
Figure 5: Shape-dependent tax heatmap (Phase B)
3×3 heatmap (input × output) of tax_TTFT_p90 for each non-plain config, laid out in a row: noop_connector, mooncake_both, nixl_both, lmcache_only, mooncake_producer, multi_mooncake_lmcache (6 heatmaps), plus mooncake_consumer as a seventh if it passed pre-flight.
Figure 6: Per-step latency CDF, ref_safe rate (Phase A)
X = step duration (μs), Y = CDF, line per config. The most direct visualization of "what each step costs." Shipped only if A3 step log is available.
Figure 7: Tax decomposition stack
For each non-plain config at ref_safe, stacked bar:
- "framework cost" estimated = tax(noop_connector)
- "implementation cost" estimated = tax(this config) − tax(noop_connector)
If noop_connector doesn't run (we'd document why), we drop this
figure and report tax as a single number per config.
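The stacked bar is a two-term split, sketched here together with the H3 labelling rule from the Success Criteria section (noop tax ≥ 50% of mooncake_both tax means framework-dominated, < 30% means implementation-dominated). The function names and the "mixed" label for the band in between are hypothetical.

```python
from typing import Dict


def decompose_tax(tax_config: float, tax_noop: float) -> Dict[str, float]:
    """Split a config's tax into framework and implementation estimates."""
    return {"framework": tax_noop, "implementation": tax_config - tax_noop}


def h3_label(tax_noop: float, tax_mooncake_both: float) -> str:
    """Classify the substrate tax by the noop share of mooncake_both tax."""
    if tax_mooncake_both <= 0:
        raise ValueError("mooncake_both tax must be positive to take a share")
    share = tax_noop / tax_mooncake_both
    if share >= 0.50:
        return "framework-cost-dominated"
    if share < 0.30:
        return "implementation-cost-dominated"
    return "mixed"  # hypothetical label for the 30-50% band
```

With illustrative taxes of 0.05 for noop_connector and 0.40 for mooncake_both, the share is 12.5% and the label is "implementation-cost-dominated".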
Figure 8: H4 additivity check
3-bar group: tax(mooncake_both), tax(lmcache_only), tax(multi). The sum-of-first-two compared against multi visualizes additivity.
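The additivity check behind this figure is a single ratio, using the H4 criterion from the Success Criteria section (relative gap ≤ 0.30). A minimal sketch with hypothetical function names; the taxes in the usage note are illustrative values, not measurements.

```python
def additivity_gap(tax_multi: float, tax_mooncake: float,
                   tax_lmcache: float) -> float:
    """|tax(multi) - (tax(mooncake_both) + tax(lmcache_only))| / tax(multi)."""
    if tax_multi <= 0:
        raise ValueError("tax(multi) must be positive to normalise the gap")
    return abs(tax_multi - (tax_mooncake + tax_lmcache)) / tax_multi


def classify_additivity(tax_multi: float, tax_mooncake: float,
                        tax_lmcache: float, tolerance: float = 0.30) -> str:
    """Return 'additive', 'super-additive' or 'sub-additive'."""
    if additivity_gap(tax_multi, tax_mooncake, tax_lmcache) <= tolerance:
        return "additive"
    if tax_multi > tax_mooncake + tax_lmcache:
        return "super-additive"  # points at cross-connector interference
    return "sub-additive"        # some shared per-step cost is amortized
```

For instance, taxes of 0.50 (multi), 0.40 (mooncake_both) and 0.15 (lmcache_only) give a gap of 0.10 and count as additive.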
Risks & Mitigations (revised)
| Risk | Impact | Mitigation |
|---|---|---|
| kv_consumer won't start with dummy bootstrap | Skip the config | Pre-flight; documented SKIP in manifest |
| multi_mooncake_lmcache crashes engine | Skip the config | Pre-flight |
| NIXL not installed | Skip nixl_both | Tolerant; warn + continue |
| LMCache not installed | Skip lmcache_only AND multi config | Tolerant; warn + continue |
| GPU thermal drift across 3+ h | Skews late configs | Run order randomized; consider running twice on different days and reporting both |
| Open-loop blow-up at 32 req/s | Memory blowup | Inflight cap 256, drop with logged counter |
| Cold-start of first request | Inflates mean TTFT | 10 s warmup discarded |
| scheduler_step_timing patch fails to apply on a future vLLM version | Lose Figures 6 and 7 | Document step_timing_available=false in manifest; H1/H2/H4 still report from client-side TTFT/TPOT |
| noop_connector import fails (PYTHONPATH or class signature) | Lose Figure 7 + H3 falsifier | Pre-flight verify_noop_connector.sh catches this; report SKIP in manifest |
Success Criteria (revised)
- H1 falsifiable: tax_TTFT_p90 for mooncake_both at ref_safe is reported. We accept the prior (≈45%) if the measurement is in [25%, 60%]; we revise the prior if it falls outside.
- H2 testable: NIXL-vs-Mooncake gap at ref_safe is reported. The trace-replay difference was ~7 pp. We document agreement / disagreement.
- H3 disambiguated: tax(noop_connector) at ref_safe is reported. We label substrate tax as "framework-cost-dominated" if noop_connector tax is ≥ 50% of mooncake_both tax, "implementation-cost-dominated" if < 30%.
- H4 additivity: |tax(multi) − (tax(mooncake_both) + tax(lmcache_only))| / tax(multi) ≤ 0.30 → linear.
- H5 + H6 directional: report whether tax_TTFT_p90 grows with input and tax_TPOT_p90 grows with output (sign + magnitude).
- All artifacts present: every config that ran has every file type listed under Run artifacts (engine_step_*.jsonl only when step timing is available); every SKIP config has a reason in preflight_status.json.
- Bench finishes < 6 h wall clock on idle dash0 (Phase A + Phase B + optional Phase B' combined; reflects min-completed extension at low rates).
Out of Scope
- Multi-node Mooncake (RDMA over actual network).
- Patching Mooncake or vLLM to optimize the substrate (the point of this microbench is to measure baseline cost as shipped).
- Varying chunk_size, max_num_seqs, or other vLLM scheduler parameters; fixed at trace-replay defaults.
- Chunk-boundary effects (input ∈ {8192, 16384}). The reviewer noted this is a real follow-up but adding it doubles Phase B runtime. Documented as a follow-up if Phase B shows shape-dependent tax that can't be explained by total token count.
Cross-references
- analysis/characterization/elastic_migration_v2/README.md: the trace-replay paper this microbench validates / refutes.
- microbench/interference/: Microbench 1 (B2 same-worker interference; complementary).
- microbench/lifecycle/: Microbench 2 (PD-sep transfer breakdown; uses different vLLM patches).
- microbench/patches/: _pd_profile.py template if A3 fallback is needed.
Files
microbench/connector_tax/
├── DESIGN.md # this file
├── MANIFEST.md # filled per run
├── tools/
│ ├── noop_connector.py # custom NoOpConnector for H3
│ ├── dummy_bootstrap.py # for kv_consumer pre-flight
│ ├── verify_kv_consumer.sh
│ ├── verify_multi_connector.sh
│ └── verify_noop_connector.sh
├── patches/
│ ├── _step_profile.py # event emitter (ports _pd_profile)
│ ├── scheduler_step_timing.py # idempotent install/revert
│ └── apply.sh # invoked by run_all.sh
├── launch/
│ ├── launch_plain.sh
│ ├── launch_noop_connector.sh
│ ├── launch_mooncake_producer.sh
│ ├── launch_mooncake_consumer.sh
│ ├── launch_mooncake_both.sh
│ ├── launch_nixl_both.sh
│ ├── launch_lmcache_only.sh
│ └── launch_multi_mooncake_lmcache.sh
├── bench_loop.py # open-loop loadgen (--min-completed)
├── metrics_sampler.py # /metrics scraper
├── analyze.py # raw → percentiles + saturation flags
├── plot_connector_tax.py # all figures
├── run_all.sh # 5-stage barrier orchestrator
├── results/<date>_<config>/ # per-run artifacts
└── results/preflight/ # pre-flight verification