
Gahow Wang 297fed6e73 Microbench 3 (connector_tax): infrastructure for KV connector substrate tax

Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.

Configurations (8):
  plain, noop_connector, mooncake_{producer,consumer,both},
  nixl_both, lmcache_only, multi_mooncake_lmcache.

Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).

Workload: two-phase sweep
  Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
  Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})

Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.

run_all.sh runs as 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot

Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.

2026-05-26 17:27:41 +08:00


Microbench 3: KV Connector Substrate Tax (revision 2)

Goal

Validate the headline claim from analysis/characterization/elastic_migration_v2/README.md Result 1:

Switching the vLLM launch from plain to kv_role=kv_both without ever triggering PD-sep already costs TTFT p90 +45%, TPOT p90 +25%, hotspot index +19%.

That claim was measured on 8 instances with a 1214-request real-trace replay under saturated coupling. We replicate it with a single-instance, synthetic, open-loop workload so we can:

Disambiguate vLLM-v1-framework cost from connector-implementation cost by including a no-op connector.
Validate (or refute) the agentic-coupling amplification claim: if single-instance synthetic numbers ≈ 8-instance trace numbers (38–45%), the coupling is not the main cause. If single-instance is much smaller, then the 8-instance saturated coupling does most of the damage.
Make the result reproducible and auditable: every run dumps full raw artifacts + manifest entry + a re-run script.

Hypotheses (revised based on elastic_migration_v2 prior)

The headline trace-replay numbers from elastic_migration_v2 are our prior, not an open question:

trace replay, 8 instances, agentic dispatch coupling, saturated:
  plain TTFT p90 = 7.35 s
  NIXL  TTFT p90 = ~10.1 s   (+38%)
  Mooncake_both  = 10.67 s   (+45%)

The microbench validates / refutes / refines these:

ID	Hypothesis	Falsifier
H1: Substrate tax persists at single instance / synthetic load	Single-instance Mooncake_both TTFT p90 is ≥ 10% higher than plain at the reference rate	If <10% → trace-replay tax is dominated by 8-instance feedback coupling, not connector machinery
H2: NIXL-vs-Mooncake gap is mechanism-side, not coupling-side	Single-instance numbers preserve the ~7 pp gap (NIXL tax < Mooncake tax by 5–10 pp)	If gap shrinks/inverts → the gap was a coupling artifact
H3: Framework-vs-implementation split	`noop_connector` (v1 framework only, all hooks return no-op) tax is < 50% of Mooncake_both tax	Lets us attribute cost between vLLM's connector dispatch loop and the specific connector's per-step work
H4: MultiConnector tax is additive	tax(Mooncake+LMCache) ≈ tax(Mooncake) + tax(LMCache), within 30%	If super-additive → cross-connector interference; if sub-additive → some shared per-step cost is amortized
H5: Tax is shape-dependent	tax_TTFT_p90 grows monotonically with input length for Mooncake_both	Confirms E2 audit §6.5 hypothesis (`set(cache.keys())` walks scale with cache size)
H6: Tax compounds in decode	tax_TPOT_p90 grows with output length	Confirms connector code runs each decode step

H3 and H4 are the must-have new hypotheses that pulled in the new configs.

Hardware & Model

Parameter	Value
GPU	NVIDIA H20 96 GB × 1 (single instance)
Model	Qwen3-Coder-30B-A3B-Instruct
TP	1
`max_model_len`	200 000
`enable_prefix_caching`	true
`enable_chunked_prefill`	true
`max_num_batched_tokens`	8192
`gpu_memory_utilization`	0.9

Single GPU per run. Each configuration is a fresh vLLM launch on GPU 0.

Configurations (8 total, was 6)

ID	Connector	Role	Why we measure it
`plain`	(none)	—	Baseline
`noop_connector`	custom `NoOpConnector` (this microbench ships it)	n/a	Isolate vLLM-v1 framework cost (build_connector_meta, mixin dispatch, get_finished bookkeeping) without any real connector work — see Note 1
`mooncake_producer`	MooncakeConnector	`kv_producer`	Isolate P-side stack
`mooncake_consumer`	MooncakeConnector	`kv_consumer`	Isolate D-side stack — pre-flight gated, see §Pre-flight
`mooncake_both`	MooncakeConnector	`kv_both`	The README claim
`nixl_both`	NIXLConnector	`kv_both`	Connector-specific vs framework cost
`lmcache_only`	`LMCacheConnectorV1`	n/a	NEW — gives H4 a denominator
`multi_mooncake_lmcache`	MultiConnector(Mooncake `kv_both` + `LMCacheConnectorV1`)	mixed	Stacked-connector check (gated by pre-flight)

Note 1 — noop_connector (we ship it, not the vLLM-bundled one): The vLLM-shipped ExampleConnector is NOT a true no-op — it implements a debug-grade disk KV cache: stores match metadata, serializes safetensors per-layer in save_kv_layer, etc. (see third_party/vllm/.../example_connector.py:345, example_connector.py:250, kv_transfer_utils.py:49). Using it would conflate framework cost with disk-I/O + per-layer save cost.

Instead we ship microbench/connector_tax/tools/noop_connector.py that subclasses KVConnectorBase_V1 and returns no-op for every hook:

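# Import path as in upstream vLLM v1; adjust if the pinned vLLM tree differs.
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1, KVConnectorMetadata)

class NoOpConnector(KVConnectorBase_V1):
    """Implements every v1 connector hook as a no-op."""
    # Scheduler-side hooks
    def get_num_new_matched_tokens(self, req, num_computed): return 0, False
    def update_state_after_alloc(self, *_args, **_kw): pass
    def build_connector_meta(self, scheduler_output): return KVConnectorMetadata()
    def request_finished(self, *_args, **_kw): return False, None
    # Worker-side hooks
    def start_load_kv(self, *_args, **_kw): pass
    def wait_for_layer_load(self, *_args, **_kw): pass
    def save_kv_layer(self, *_args, **_kw): pass
    def wait_for_save(self): pass
    def get_finished(self, *_args, **_kw): return None, None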

vLLM loads it via:

--kv-transfer-config '{
    "kv_connector_module_path":
        "microbench.connector_tax.tools.noop_connector:NoOpConnector",
    "kv_role": "kv_both"
}'

PYTHONPATH is set in launch_noop_connector.sh so vLLM can resolve the dotted import path.

If noop_connector overhead ≈ 0 → all substrate tax is in connector implementations. If noop_connector overhead ≈ 30% of Mooncake_both tax → vLLM's framework dispatch alone explains a meaningful slice.

Pre-flight Verification (NEW — gates risky configs)

Two configs depend on infrastructure we can't take for granted. Run verification scripts BEFORE the main bench. Skip the config (and record SKIP in manifest) if it fails.

`verify_kv_consumer.sh`

1. Start a dummy bootstrap process (tools/dummy_bootstrap.py).
2. Launch vLLM with kv_role=kv_consumer pointing at the dummy.
3. Curl /v1/models, which must return 200 with the model id.
4. Send one short request (max_tokens=4) without kv_transfer_params, which must return 200 in <30 s.

If steps 3 or 4 fail, the config is unrunnable and we drop it. We do not try harder; the trace-replay paper does not promise consumer-only single-instance numbers.

`verify_multi_connector.sh`

Launch vLLM with MultiConnector(MooncakeConnector kv_both, LMCacheConnectorV1).
Send 5 sequential requests, max_tokens=32, random content.
All 5 must complete in <60 s.
Verify no engine crashes: vllm:engine_core_failed_total == 0 from /metrics.

If any check fails, drop the config and mark SKIP (Manifest column: "why skipped").

`verify_noop_connector.sh`

Launch with noop_connector active (loaded via kv_connector_module_path).
Send 5 sequential requests, max_tokens=32.
Verify all 5 return 200 in <30 s and no engine crash.

This one is unlikely to fail but the verification is the same.

The verification scripts produce verify_<config>.log under results/preflight/ and a preflight_status.json summarizing skip-or-include decisions for the manifest.
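A minimal sketch of how the three verification outcomes could be folded into preflight_status.json. The JSON schema (`decision`, `reason`, `log`) and the function name `build_preflight_status` are assumptions made for illustration; the text above only fixes the file name and its skip-or-include purpose.

```python
import json

# Which config each verification script gates (taken from the sections above).
GATED_CONFIGS = {
    "verify_kv_consumer": "mooncake_consumer",
    "verify_multi_connector": "multi_mooncake_lmcache",
    "verify_noop_connector": "noop_connector",
}


def build_preflight_status(results):
    """results maps script name -> (passed: bool, reason: str | None).

    A script with no recorded result is treated as a failure, so the
    config it gates is skipped rather than silently included.
    """
    status = {}
    for script, config in GATED_CONFIGS.items():
        passed, reason = results.get(script, (False, "verification did not run"))
        status[config] = {
            "decision": "INCLUDE" if passed else "SKIP",
            "reason": None if passed else reason,
            "log": f"results/preflight/{script}.log",
        }
    return status


example = build_preflight_status({
    "verify_kv_consumer": (False, "/v1/models did not return 200"),
    "verify_multi_connector": (True, None),
    "verify_noop_connector": (True, None),
})
print(json.dumps(example, indent=2))
```

Treating a missing result as SKIP keeps the manifest conservative: a config only runs if its verification positively passed.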

Workload (revised)

Open-loop, fixed-rate, randomized content, two-phase sweep, data-driven saturation criteria.

Phase A — rate sweep (find saturation per config)

Parameter	Value
Input length	4096 tokens (random per request)
Output length	256 tokens (`max_tokens=256`, `ignore_eos=True`)
Send rates	{0.5, 1, 2, 4, 8, 16, 32} req/s (added 0.5 for low-end calibration)
Duration per cell	`max(60 s, time_to_min_completed)` + 10 s warmup
Min completed per cell	200 requests
Inflight cap	256 (drop excess to log)

Why min_completed = 200: at p90, the margin-of-error of a Monte Carlo percentile estimate from N samples is ≈ 1.65 √(0.9 × 0.1 / N). For N=200 this is ~3.5% absolute, ~10% relative — acceptable. For N=30 (which 0.5 req/s × 60 s gives) it's ~28% relative, useless for saturation detection. So at low rates the cell automatically extends: 0.5 req/s → ≥ 400 s, 1 req/s → ≥ 200 s. At 4 req/s and above the 60-second floor dominates.
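The arithmetic behind this choice can be checked with a short sketch. It evaluates the margin-of-error formula quoted above (in quantile-rank terms) and the cell length implied by the 200-request minimum. The function names are illustrative, and the relative-error figures in the text depend on the latency distribution, so they are not derived here.

```python
import math


def percentile_moe(p: float, n: int, z: float = 1.65) -> float:
    """Margin of error of an empirical p-quantile from n samples: z * sqrt(p(1-p)/n)."""
    return z * math.sqrt(p * (1.0 - p) / n)


def min_cell_seconds(rate: float, min_completed: int = 200, floor_s: float = 60.0) -> float:
    """Cell length (excluding the 10 s warmup) needed to complete min_completed requests."""
    return max(floor_s, min_completed / rate)


for n in (30, 200):
    print(f"N={n}: margin of error at p90 = {percentile_moe(0.9, n):.3f}")

for rate in (0.5, 1, 2, 4, 8):
    print(f"{rate} req/s -> {min_cell_seconds(rate):.0f} s")
```

At N=200 the margin is about 0.035 and at N=30 about 0.090, matching the absolute figures in the paragraph above.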

Updated Phase A duration per config (rounded):

Rate (req/s)	Duration (s)	Note
0.5	410	extended to hit 200 completed
1	210	extended
2	110	extended slightly
4	70	floor
8	70	floor
16	70	floor
32	70	floor
sum per config (incl. 10 s per-cell warmup, excl. vLLM launch warmup)	~1010 s

Total Phase A: 8 configs × (90 s vLLM warmup + 1010 s of cells + 60 s GPU release) = 8 × 1160 s ≈ 155 min.

Saturation criteria (data-driven, was hardcoded inflight>8)

A config is saturated at rate r if any of:

effective_throughput(r) / r < 0.95 — vLLM can't keep up
num_requests_waiting p50 (from /metrics) > 1 — vLLM has visible queue
TTFT p90 (r) / TTFT p90 (r/2) > 1.5 — TTFT inflating super-linearly

The per-config saturation rate is the lowest r that triggers ≥ 1 criterion. We log which criterion fired so reviewers can disagree.
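A minimal sketch of the three criteria, assuming each Phase A cell has already been reduced to a small dict. The field names (`effective_throughput`, `waiting_p50`, `ttft_p90`) and criterion labels are assumptions for illustration; the thresholds are the ones listed above.

```python
def saturation_criteria(cells, rate):
    """Return the list of criteria that fire for one config at one send rate.

    cells maps send rate (req/s) -> per-cell summary dict.
    """
    cell = cells[rate]
    fired = []
    if cell["effective_throughput"] / rate < 0.95:
        fired.append("throughput_ratio")
    if cell["waiting_p50"] > 1:
        fired.append("queue_depth")
    half = cells.get(rate / 2)  # absent for the lowest rate in the sweep
    if half is not None and cell["ttft_p90"] / half["ttft_p90"] > 1.5:
        fired.append("ttft_inflation")
    return fired


def saturation_rate(cells):
    """Lowest rate that triggers at least one criterion, plus which ones fired."""
    for rate in sorted(cells):
        fired = saturation_criteria(cells, rate)
        if fired:
            return rate, fired
    return None, []


# Made-up numbers, for illustration only.
example_cells = {
    1: {"effective_throughput": 1.0, "waiting_p50": 0, "ttft_p90": 0.40},
    2: {"effective_throughput": 2.0, "waiting_p50": 0, "ttft_p90": 0.45},
    4: {"effective_throughput": 3.9, "waiting_p50": 0, "ttft_p90": 0.80},
    8: {"effective_throughput": 5.0, "waiting_p50": 12, "ttft_p90": 3.00},
}
print(saturation_rate(example_cells))
```

Returning the fired criteria alongside the rate is what lets a reviewer see, per config, which signal drove the saturation decision.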

Reference rate selection (revised)

We define two reference rates for Phase B, both computed from Phase A data:

ref_safe = max rate where ALL 8 configs are NOT saturated
ref_load = max rate where plain is NOT saturated
           (some other configs may be saturated here)

ref_safe measures the pure substrate per-step tax under no queueing.

ref_load measures the tax in the regime closer to deployment — where plain is happily under-loaded but Mooncake is starting to hurt. The gap tax(ref_load) − tax(ref_safe) is the non-linear queueing amplification of the substrate tax. This is exactly the effect the reviewer worried about and now we measure it explicitly instead of ignoring it.

Both rates are reported. The headline number we cite is ref_safe because it's the cleanest decomposition. The ref_load numbers tell us how much worse the tax gets near saturation.
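The selection rule can be sketched as follows, assuming each config has been reduced to its saturation rate (the lowest rate that triggered a criterion, or None if it never saturated). Treating every rate below that value as unsaturated is a simplifying assumption; the function and argument names are illustrative.

```python
def pick_reference_rates(rates, saturation_rates):
    """rates: the Phase A sweep; saturation_rates: config -> lowest saturated rate or None."""

    def unsaturated(rate, sat):
        return sat is None or rate < sat

    safe = [r for r in rates
            if all(unsaturated(r, sat) for sat in saturation_rates.values())]
    load = [r for r in rates if unsaturated(r, saturation_rates["plain"])]
    return {
        "ref_safe": max(safe) if safe else None,
        "ref_load": max(load) if load else None,
    }


# Made-up saturation rates, for illustration only.
sweep = [0.5, 1, 2, 4, 8, 16, 32]
print(pick_reference_rates(sweep, {
    "plain": 16,
    "noop_connector": 16,
    "mooncake_both": 8,
    "nixl_both": 8,
}))
```

With these illustrative inputs the sketch yields ref_safe = 4 and ref_load = 8, so the two rates differ and the optional ref_load comparison would be worth running.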

Phase B — shape sweep (substrate tax across length regimes)

Parameter	Value
Send rate	`ref_safe` (one value, single rate to keep cost bounded)
Input lengths	{512, 4096, 32768} tokens
Output lengths	{64, 256, 1024} tokens (32 promoted to 64 — see Note 2)
Duration per cell	`max(60 s, time_to_min_completed)` + 10 s warmup
Min completed per cell	200 requests
Cartesian shapes	3 × 3 = 9

The same min-completed extension applies. If ref_safe ≥ 4 req/s, each cell hits the 60 s floor and per-config Phase B cell time is 9 × 70 s = 630 s. If ref_safe = 2 req/s, cells extend to 110 s and per-config cell time is 9 × 110 s = 990 s.

Total Phase B (worst case, ref_safe = 2): 8 configs × (90 s warmup + 990 s of cells + 60 s GPU release) ≈ 152 min. Best case (ref_safe ≥ 4): 8 × (90 + 630 + 60) ≈ 104 min.

If after Phase A we find ref_load differs meaningfully from ref_safe, we add a small Phase B' run on ref_load for the 4 high-priority configs (plain, mooncake_both, nixl_both, lmcache_only) on 3 representative shapes (512/256, 4096/256, 32768/256). That is 4 configs × 3 shapes × 70 s ≈ 14 min of cell time (about 20 min once 4 × 90 s of vLLM warmup is included), a controlled trade-off.

Note 2 — output 64 instead of 32: with 32 output tokens TPOT is estimated from 31 inter-token intervals — too few samples for stable p90. Bumping to 64 gives 63 samples, comfortable for percentile estimation. The output=32 regime is also less common in agentic deployments where a tool result frame is rarely <64 tokens.

Common settings

Parameter	Value
`temperature`	0 (deterministic)
`ignore_eos`	True (force exact output length)
Content	random UUID + hash per request, zero prefix cache hit
Concurrent inflight cap	256
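A minimal sketch of an open-loop sender with an inflight cap, to make the "fixed rate regardless of completions, drop above the cap" behaviour concrete. This is not bench_loop.py: `open_loop`, `make_prompt_seed`, and the fake `send` coroutine are hypothetical, and a real sender would issue an HTTP request where the fake one sleeps.

```python
import asyncio
import hashlib
import time
import uuid


def make_prompt_seed() -> str:
    """Random UUID plus its hash, so no two prompts share a cacheable prefix."""
    u = uuid.uuid4().hex
    return u + hashlib.sha256(u.encode()).hexdigest()


async def open_loop(send, rate, duration_s, inflight_cap=256):
    """Fire send() on a fixed schedule; never wait for earlier requests to finish."""
    n_ticks = round(rate * duration_s)
    interval = 1.0 / rate
    inflight = set()
    sent = dropped = 0
    start = time.perf_counter()
    for i in range(n_ticks):
        # Sleep until the i-th scheduled send time (also yields to running tasks).
        await asyncio.sleep(max(0.0, start + i * interval - time.perf_counter()))
        if len(inflight) >= inflight_cap:
            dropped += 1  # over the cap: count the drop instead of queueing
            continue
        task = asyncio.create_task(send(make_prompt_seed()))
        inflight.add(task)
        task.add_done_callback(inflight.discard)
        sent += 1
    if inflight:
        await asyncio.gather(*inflight)
    return {"sent": sent, "dropped": dropped}


async def fake_send(seed):
    await asyncio.sleep(0.3)  # stand-in for a slow server


# Cap of 1 with a slow server: the first request occupies the only slot.
print(asyncio.run(open_loop(fake_send, rate=50, duration_s=0.2, inflight_cap=1)))
```

Because the send schedule is computed from the start time rather than from the previous completion, a slow server does not reduce the offered load; it only raises the inflight count until the cap starts dropping.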

Metrics (revised — adds A3 step-level engine_state)

Client-side (per-request, JSONL)

Same as before: t_send_ns, t_first_token_ns, t_last_token_ns, prompt_tokens, completion_tokens, inflight_at_send.
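A sketch of how the per-request fields map to TTFT, TPOT, and E2E. The nearest-rank percentile is an assumption (the document does not name a percentile method), and the example record is made up.

```python
import math


def request_latencies(rec):
    """Derive per-request latencies (seconds) from one client-side JSONL record."""
    ttft = (rec["t_first_token_ns"] - rec["t_send_ns"]) / 1e9
    e2e = (rec["t_last_token_ns"] - rec["t_send_ns"]) / 1e9
    n = rec["completion_tokens"]
    # n output tokens give n - 1 inter-token intervals.
    tpot = ((rec["t_last_token_ns"] - rec["t_first_token_ns"]) / 1e9 / (n - 1)
            if n > 1 else None)
    return {"ttft_s": ttft, "tpot_s": tpot, "e2e_s": e2e}


def percentile(values, p):
    """Nearest-rank percentile, p in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


example = {
    "t_send_ns": 0,
    "t_first_token_ns": 500_000_000,
    "t_last_token_ns": 5_600_000_000,
    "prompt_tokens": 4096,
    "completion_tokens": 256,
    "inflight_at_send": 3,
}
print(request_latencies(example))
print(percentile(range(1, 11), 90))
```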

Server-side `/metrics` sampling (1 Hz)

Captured into metrics_<cfg>_<phase>_<cell>.jsonl. Same fields as prior version.

Step-level timing instrumentation (NEW — we ship the patch)

The reviewer correctly noted that the existing A3 step log (third_party/vllm/.../scheduler.py:953) only records per-step token counts and request lists, not step duration or per-callback timing. So we cannot just turn on AGENTIC_STEP_LOG_PATH and get Figure 6/7's "direct evidence" — that data does not exist yet.

This microbench ships its own scheduler timing patch at microbench/connector_tax/patches/scheduler_step_timing.py, modelled on the idempotent microbench/patches/apply_patches.py we wrote for Microbench 2. It uses the same _pd_profile.py emit pattern.

The patch instruments:

Scheduler.schedule() entry → t_step_enter (perf_counter_ns)
Scheduler.schedule() exit → t_step_exit
Around connector.build_connector_meta(scheduler_output) → build_meta_us
Around connector.get_finished(...) call (in _update_from_output / mixin) → get_finished_us
Around connector.start_load_kv(...) (in the worker mixin _get_kv_connector_output) → start_load_kv_us (worker-side; emitted from worker process)

Each step emits one JSONL record:

{
  "t_ns": <step_enter perf_counter_ns>,
  "step_id": <monotonic int>,
  "step_duration_us": <step_exit - step_enter>,
  "build_meta_us":    <build_connector_meta duration>,
  "get_finished_us":  <connector get_finished duration>,
  "start_load_kv_us": <worker start_load_kv; null on scheduler-only proc>,
  "num_running":      <int>,
  "num_waiting":      <int>,
  "prefill_tokens":   <int>,
  "decode_tokens":    <int>
}

Output goes to AGENTIC_STEP_LOG_PATH (one file per process; we use engine_step_<phase>_<cell>.jsonl paths from launch scripts).
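The record layout above can be produced with a small context-manager emitter. The sketch below is not the shipped patch or `_step_profile.py`; `StepRecorder` and its methods are hypothetical, and the demo writes to a temporary file rather than to AGENTIC_STEP_LOG_PATH.

```python
import json
import os
import tempfile
import time
from contextlib import contextmanager


class StepRecorder:
    """Hypothetical emitter: one JSONL record per scheduler step."""

    CALLBACKS = ("build_meta_us", "get_finished_us", "start_load_kv_us")

    def __init__(self, path):
        self._fh = open(path, "a", encoding="utf-8")
        self._step_id = 0

    @contextmanager
    def step(self, **counters):
        t_enter = time.perf_counter_ns()
        record = {"t_ns": t_enter, "step_id": self._step_id}
        record.update(dict.fromkeys(self.CALLBACKS))  # null until measured

        @contextmanager
        def timed(name):
            t0 = time.perf_counter_ns()
            try:
                yield
            finally:
                record[name] = (time.perf_counter_ns() - t0) / 1_000

        try:
            yield timed
        finally:
            record["step_duration_us"] = (time.perf_counter_ns() - t_enter) / 1_000
            record.update(counters)
            self._fh.write(json.dumps(record) + "\n")
            self._fh.flush()
            self._step_id += 1

    def close(self):
        self._fh.close()


# Demo: in a real run the path would come from AGENTIC_STEP_LOG_PATH.
log_path = os.path.join(tempfile.mkdtemp(), "engine_step_demo.jsonl")
recorder = StepRecorder(log_path)
with recorder.step(num_running=3, num_waiting=0,
                   prefill_tokens=4096, decode_tokens=2) as timed:
    with timed("build_meta_us"):
        time.sleep(0.001)  # stand-in for connector.build_connector_meta(...)
recorder.close()

with open(log_path, encoding="utf-8") as fh:
    first = json.loads(fh.readline())
print(first)
```

Callbacks that never run in a given process stay null, which matches the `start_load_kv_us` convention for the scheduler-only process.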

Apply / revert is idempotent — same # CONNECTOR_TAX_PATCH marker strategy as Microbench 2.

microbench/connector_tax/patches/
├── _step_profile.py         # the emitter (ported from _pd_profile)
├── scheduler_step_timing.py # patch installer / reverter
└── apply.sh                 # invoked by run_all.sh; revert at end

Fallback if the patch fails to apply on a future vLLM version: the bench drops to client-side TTFT/TPOT only. Figures 6 (per-step CDF) and 7 (decomposition stack) are not produced; the manifest records step_timing_available=false. The other figures and the H1 / H2 / H4 headline numbers do not depend on this patch, so the bench is still useful in fallback mode.

Derived (post-processing)

For each (config, rate-or-shape) cell after warmup:

TTFT/TPOT/E2E p50/p90/p99
effective_throughput, requested_throughput, throughput_ratio
saturation_flag (which criterion, if any, triggered)
(when step_timing_available=true):
- step_duration_us p50/p90
- build_meta_us p50/p90
- get_finished_us p50/p90
- start_load_kv_us p50/p90 (worker-process file)
- connector_total_us p50/p90 (sum of the 3 callback timings)

Substrate tax definition

tax_TTFT_p90(X, ref) = TTFT_p90(X, ref) / TTFT_p90(plain, ref) - 1
tax_TPOT_p90(X, ref) = TPOT_p90(X, ref) / TPOT_p90(plain, ref) - 1
tax_step_p50(X)      = step_duration_us p50 (X) - step_duration_us p50 (plain)
tax_callback_p50(X)  = connector_total_us p50 (X)   # plain has no callbacks

tax_step is the gross per-step penalty (any cause). tax_callback is the callback-attributable penalty (sum of the three measured connector hooks). The difference tax_step − tax_callback is "step-time overhead not attributable to instrumented callbacks" — block-pool walks, scheduler-state churn, etc. Reporting both lets reviewers see whether our instrumentation accounts for the full cost.
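As a worked check of these definitions, the sketch below applies the relative-tax formula to the trace-replay prior quoted in the Hypotheses section. The step and callback values at the end are made-up numbers, used only to show how the unattributed remainder is formed; function names are illustrative.

```python
def relative_tax(metric_x: float, metric_plain: float) -> float:
    """tax = metric(X) / metric(plain) - 1, used for TTFT p90 and TPOT p90."""
    return metric_x / metric_plain - 1.0


def unattributed_us(step_p50_x: float, step_p50_plain: float,
                    connector_total_p50_x: float) -> float:
    """tax_step - tax_callback: step time not explained by the instrumented hooks."""
    tax_step = step_p50_x - step_p50_plain
    tax_callback = connector_total_p50_x  # plain has no callbacks
    return tax_step - tax_callback


# Trace-replay prior: plain 7.35 s, Mooncake_both 10.67 s, NIXL ~10.1 s.
print(f"Mooncake_both: {relative_tax(10.67, 7.35):+.1%}")
print(f"NIXL:          {relative_tax(10.1, 7.35):+.1%}")

# Hypothetical per-step numbers (microseconds), for illustration only.
print(unattributed_us(step_p50_x=900.0, step_p50_plain=600.0,
                      connector_total_p50_x=220.0))
```

The Mooncake_both figure reproduces the +45% headline; the NIXL figure lands near the quoted +38% because the 10.1 s input is itself approximate.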

Auditability & Reproducibility Plan

Run artifacts (per config × phase × cell)

microbench/connector_tax/results/
  <date>_<config>/
    config.json                  # parameters used
    launch.sh                    # exact vLLM launch command
    vllm_stdout.log              # full vLLM stdout
    vllm_stderr.log              # full vLLM stderr
    requests_<phase>_<cell>.jsonl
    metrics_<phase>_<cell>.jsonl
    engine_step_<phase>_<cell>.jsonl  # if A3 active
    summary.json                 # per-cell percentiles
    env.txt                      # pip freeze, vLLM SHA, GPU info
  preflight/
    verify_kv_consumer.log
    verify_multi_connector.log
    verify_noop_connector.log
    preflight_status.json        # which configs are SKIP'd and why

Manifest

microbench/connector_tax/MANIFEST.md lists every run with date, vLLM version + git SHA, Mooncake version, NIXL version, LMCache version, GPU id (nvidia-smi -L), config name, launch command, result directory, A3-active flag, and skip-status (with reason).

Re-run script

microbench/connector_tax/run_all.sh runs in five barrier stages (0–4). Phase A across all configs must finish before Phase B can pick a reference rate.

Stage 0 — Pre-flight + patch:

Run verify_kv_consumer.sh, verify_multi_connector.sh, and verify_noop_connector.sh. Persist preflight_status.json.
Apply microbench/connector_tax/patches/scheduler_step_timing.py to the active vLLM. Record step_timing_available=true|false in the manifest based on whether the patch applied cleanly.

Stage 1 — Phase A (all configs, randomized order): For each non-SKIP config:

launch_<config>.sh → wait for /v1/models.
bench_loop.py --rates 0.5,1,2,4,8,16,32 --shape 4096,256 --duration 60 --min-completed 200.
Kill vLLM, wait 60 s for GPU release.
Append manifest row.

After all configs have finished Stage 1:

Stage 2 — Reference rate selection (CPU only):

Compute saturation flags from each cell using the data-driven criteria.
Choose ref_safe = max rate where ALL configs that completed Phase A are not saturated.
Choose ref_load = max rate where plain is not saturated.
Persist reference_rates.json.

Stage 3 — Phase B (all configs, randomized order): For each non-SKIP config:

launch_<config>.sh → wait for ready.
bench_loop.py --rate <ref_safe> --shapes 512x64,512x256, ...,32768x1024 --duration 60 --min-completed 200.
(If ref_load != ref_safe) Run Phase B' for priority configs (plain, mooncake_both, nixl_both, lmcache_only) on shapes {512x256, 4096x256, 32768x256} at ref_load.
Kill vLLM, wait 60 s, append manifest row.

Stage 4 — Patch revert + analysis:

Revert the scheduler_step_timing patch.
analyze.py --root results/.
plot_connector_tax.py.

A reviewer with a fresh checkout runs:

cd microbench/connector_tax
bash run_all.sh

and gets the figures + manifest + raw artifacts. The script is re-runnable: any stage can be skipped via --skip-stage N if the artifacts exist.

Determinism notes

Same as previous: temperature=0 + ignore_eos give shape determinism; content varies per request via seeded UUID. We do not promise bit-exact reproducibility, only distribution-level reproducibility.

Updated runtime estimate (was 1.5–2 h, now 4–5.5 h)

Phase	Time
Pre-flight (3 verify scripts)	15 min
Phase A: 8 configs × (90 s warmup + 1010 s cells + 60 s GPU clear)	155 min
Phase A → ref_safe selection (CPU)	<1 min
Phase B (best, `ref_safe ≥ 4`): 8 × (90 + 630 + 60)	104 min
Phase B (worst, `ref_safe = 2`): 8 × (90 + 990 + 60)	152 min
Optional Phase B' (4 configs × 3 shapes × ≥70 s + 4 × 90 s warmup)	20 min
Analysis + figures	5 min
Total (best case)	~4.7 h (~5 h with optional Phase B')
Total (worst case)	~5.5 h (~5.8 h with optional Phase B')

This is honest. The reviewer's earlier estimate of 2.5–3 h underestimated how long low-rate cells must run to give stable p90.
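The budget can be re-derived from the per-phase figures in the table. The sketch below is a plain arithmetic check; it reports each total with and without the optional Phase B', and it counts reference-rate selection as a full minute.

```python
PHASE_A_CELLS_S = [410, 210, 110, 70, 70, 70, 70]  # per-rate durations from Phase A
N_CONFIGS = 8
VLLM_WARMUP_S = 90
GPU_RELEASE_S = 60


def phase_minutes(cell_seconds_total, n_configs=N_CONFIGS):
    """Wall-clock minutes for one phase across all configs."""
    return n_configs * (VLLM_WARMUP_S + cell_seconds_total + GPU_RELEASE_S) / 60


phase_a = phase_minutes(sum(PHASE_A_CELLS_S))
phase_b_best = phase_minutes(9 * 70)    # ref_safe >= 4: every cell at the floor
phase_b_worst = phase_minutes(9 * 110)  # ref_safe = 2: cells extended to 110 s
phase_b_prime = (4 * 3 * 70 + 4 * VLLM_WARMUP_S) / 60
fixed = 15 + 1 + 5  # pre-flight, rate selection (upper bound), analysis


def total_hours(phase_b, with_b_prime):
    extra = phase_b_prime if with_b_prime else 0
    return (fixed + phase_a + phase_b + extra) / 60


for label, phase_b in (("best", phase_b_best), ("worst", phase_b_worst)):
    print(f"{label}: {total_hours(phase_b, False):.2f} h without Phase B', "
          f"{total_hours(phase_b, True):.2f} h with it")
```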

Analysis & Figures

Figure 1: TTFT p90 vs send rate, per configuration (Phase A)

Same as before, now 8 lines plus saturation markers (× per criterion).

Figure 2: TPOT p90 vs send rate, per configuration (Phase A)

Same.

Figure 3: Achieved throughput vs requested (Phase A)

Same, 8 lines + y=x reference + saturation knee annotations.

Figure 4: Substrate tax bar (TTFT p90 + TPOT p90)

At ref_safe and ref_load, side-by-side bars per non-plain config. Shows:

Pure tax (ref_safe)
Tax + non-linear queueing (ref_load)
The gap is the coupling amplification the reviewer flagged.

Figure 5: Shape-dependent tax heatmap (Phase B)

3×3 heatmap (input × output) of tax_TTFT_p90 for each non-plain config, laid out in a row: noop_connector, mooncake_both, nixl_both, lmcache_only, mooncake_producer, multi_mooncake_lmcache (6 heatmaps), plus mooncake_consumer as a seventh if pre-flight did not drop it.

Figure 6: Per-step latency CDF, ref_safe rate (Phase A)

X = step duration (μs), Y = CDF, line per config. The most direct visualization of "what each step costs." Shipped only if A3 step log is available.

Figure 7: Tax decomposition stack

For each non-plain config at ref_safe, stacked bar:

"framework cost" estimated = tax(noop_connector)
"implementation cost" estimated = tax(this config) − tax(noop_connector)

If noop_connector doesn't run (we'd document why), we drop this figure and report tax as a single number per config.

Figure 8: H4 additivity check

3-bar group: tax(mooncake_both), tax(lmcache_only), tax(multi). The sum-of-first-two compared against multi visualizes additivity.

Risks & Mitigations (revised)

Risk	Impact	Mitigation
`kv_consumer` won't start with dummy bootstrap	Skip the config	Pre-flight; documented SKIP in manifest
`multi_mooncake_lmcache` crashes engine	Skip the config	Pre-flight
NIXL not installed	Skip nixl_both	Tolerant; warn + continue
LMCache not installed	Skip lmcache_only AND multi config	Tolerant; warn + continue
GPU thermal drift across 3+ h	Skews late configs	Run order randomized; consider running twice on different days and reporting both
Open-loop blow-up at 32 req/s	Memory blowup	Inflight cap 256, drop with logged counter
Cold-start of first request	Inflates mean TTFT	10 s warmup discarded
`scheduler_step_timing` patch fails to apply on a future vLLM version	Lose Figures 6 and 7	Document `step_timing_available=false` in manifest; H1/H2/H4 still report from client-side TTFT/TPOT
`noop_connector` import fails (PYTHONPATH or class signature)	Lose Figures 7 + H3 falsifier	Pre-flight `verify_noop_connector.sh` catches this; report SKIP in manifest

Success Criteria (revised)

H1 falsifiable: tax_TTFT_p90 for mooncake_both at ref_safe is reported. We accept the prior (≈45%) if measurement is in [25%, 60%]; we revise the prior if outside.
H2 testable: NIXL-vs-Mooncake gap at ref_safe is reported. The trace-replay difference was ~7 pp. We document agreement / disagreement.
H3 disambiguated: tax(noop_connector) at ref_safe is reported. We label substrate tax as "framework-cost-dominated" if noop_connector ≥ 50% of mooncake_both tax, "implementation-cost-dominated" if < 30%.
H4 additivity: |tax(multi) − (tax(mooncake_both) + tax(lmcache_only))| / tax(multi) ≤ 0.30 → linear.
H5 + H6 directional: report whether tax_TTFT_p90 grows with input and tax_TPOT_p90 grows with output (sign + magnitude).
All artifacts present: every config that ran has the 6 file types; every SKIP config has a reason in preflight_status.json.
Bench finishes < 6 h wall clock on idle dash0 (Phase A + Phase B + optional Phase B' combined; reflects min-completed extension at low rates).
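The H3 and H4 criteria above reduce to two small checks, sketched here with made-up tax values. The function names are illustrative, and the sketch returns None for a noop share between 30% and 50%, since the criteria leave that band unlabeled.

```python
def h3_label(tax_noop: float, tax_mooncake_both: float):
    """Classify the substrate tax by the share explained by the no-op connector."""
    share = tax_noop / tax_mooncake_both
    if share >= 0.50:
        return "framework-cost-dominated"
    if share < 0.30:
        return "implementation-cost-dominated"
    return None  # band between the two thresholds is not labeled


def h4_additivity(tax_multi: float, tax_mooncake_both: float,
                  tax_lmcache_only: float, tolerance: float = 0.30):
    """Relative deviation of the stacked tax from the sum of its parts."""
    deviation = abs(tax_multi - (tax_mooncake_both + tax_lmcache_only)) / tax_multi
    return deviation, deviation <= tolerance


# Made-up tax values, for illustration only.
print(h3_label(tax_noop=0.05, tax_mooncake_both=0.45))
print(h4_additivity(tax_multi=0.60, tax_mooncake_both=0.45, tax_lmcache_only=0.10))
```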

Out of Scope

Multi-node Mooncake (RDMA over actual network).
Patching Mooncake or vLLM to optimize the substrate (the point of this microbench is to measure baseline cost as shipped).
Varying chunk_size, max_num_seqs, or other vLLM scheduler parameters; fixed at trace-replay defaults.
chunk-boundary effects (input ∈ {8192, 16384}). The reviewer noted this is a real follow-up but adding it doubles Phase B runtime. Documented as a follow-up if Phase B shows shape-dependent tax that can't be explained by total token count.

Cross-references

analysis/characterization/elastic_migration_v2/README.md — the trace-replay paper this microbench validates / refutes.
microbench/interference/ — Microbench 1 (B2 same-worker interference; complementary).
microbench/lifecycle/ — Microbench 2 (PD-sep transfer breakdown; uses different vLLM patches).
microbench/patches/ — _pd_profile.py template if A3 fallback is needed.

Files

microbench/connector_tax/
├── DESIGN.md                       # this file
├── MANIFEST.md                     # filled per run
├── tools/
│   ├── noop_connector.py           # custom NoOpConnector for H3
│   ├── dummy_bootstrap.py          # for kv_consumer pre-flight
│   ├── verify_kv_consumer.sh
│   ├── verify_multi_connector.sh
│   └── verify_noop_connector.sh
├── patches/
│   ├── _step_profile.py            # event emitter (ports _pd_profile)
│   ├── scheduler_step_timing.py    # idempotent install/revert
│   └── apply.sh                    # invoked by run_all.sh
├── launch/
│   ├── launch_plain.sh
│   ├── launch_noop_connector.sh
│   ├── launch_mooncake_producer.sh
│   ├── launch_mooncake_consumer.sh
│   ├── launch_mooncake_both.sh
│   ├── launch_nixl_both.sh
│   ├── launch_lmcache_only.sh
│   └── launch_multi_mooncake_lmcache.sh
├── bench_loop.py                   # open-loop loadgen (--min-completed)
├── metrics_sampler.py              # /metrics scraper
├── analyze.py                      # raw → percentiles + saturation flags
├── plot_connector_tax.py           # all figures
├── run_all.sh                      # 5-stage barrier orchestrator (stages 0–4)
├── results/<date>_<config>/        # per-run artifacts
└── results/preflight/              # pre-flight verification
