Microbench 3 (connector_tax): infrastructure for KV connector substrate tax

Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.

Configurations (8):
  plain, noop_connector, mooncake_{producer,consumer,both},
  nixl_both, lmcache_only, multi_mooncake_lmcache.

Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).

Workload: two-phase sweep
  Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
  Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})

Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.

run_all.sh runs as 5-stage barrier:
  0 pre-flight + apply patch
  1 Phase A all configs
  2 pick ref_safe / ref_load
  3 Phase B all configs
  4 revert patch + analyze + plot

Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
This commit is contained in:
2026-05-26 17:27:41 +08:00
parent 3fdcec9c0f
commit 297fed6e73
24 changed files with 2476 additions and 0 deletions

0
microbench/__init__.py Normal file
View File

View File

@@ -0,0 +1,664 @@
# Microbench 3: KV Connector Substrate Tax (revision 2)
## Goal
Validate the headline claim from
`analysis/characterization/elastic_migration_v2/README.md` Result 1:
> Switching the vLLM launch from plain to `kv_role=kv_both` without ever
> triggering PD-sep already costs **TTFT p90 +45%, TPOT p90 +25%,
> hotspot index +19%**.
That claim was measured on 8 instances with a 1214-request real-trace
replay under saturated coupling. We replicate it with **single-instance,
synthetic, open-loop** workload so we can:
1. Disambiguate **vLLM-v1-framework cost** from
**connector-implementation cost** by including a no-op connector.
2. **Validate (or refute) the agentic-coupling amplification** claim:
if single-instance synthetic numbers ≈ 8-instance trace numbers
(3845%), the coupling is not the main cause. If single-instance is
much smaller, then the 8-instance saturated coupling does most of
the damage.
3. Make the result **reproducible and auditable**: every run dumps full
raw artifacts + manifest entry + a re-run script.
---
## Hypotheses (revised based on elastic_migration_v2 prior)
The headline trace-replay numbers from elastic_migration_v2 are our
**prior**, not an open question:
```
trace replay, 8 instances, agentic dispatch coupling, saturated:
plain TTFT p90 = 7.35 s
NIXL TTFT p90 = ~10.1 s (+38%)
Mooncake_both = 10.67 s (+45%)
```
The microbench validates / refutes / refines these:
| ID | Hypothesis | Falsifier |
|---|---|---|
| **H1: Substrate tax persists at single instance / synthetic load** | Single-instance Mooncake_both TTFT p90 is ≥ 10% higher than plain at the reference rate | If <10% trace-replay tax is dominated by 8-instance feedback coupling, not connector machinery |
| **H2: NIXL-vs-Mooncake gap is mechanism-side, not coupling-side** | Single-instance numbers preserve the ~7 pp gap (NIXL tax < Mooncake tax by 510 pp) | If gap shrinks/inverts the gap was a coupling artifact |
| **H3: Framework-vs-implementation split** | `noop_connector` (v1 framework only, all hooks return no-op) tax is < 50% of Mooncake_both tax | Lets us attribute cost between vLLM's connector dispatch loop and the specific connector's per-step work |
| **H4: MultiConnector tax is additive** | tax(Mooncake+LMCache) tax(Mooncake) + tax(LMCache), within 30% | If super-additive cross-connector interference; if sub-additive some shared per-step cost is amortized |
| **H5: Tax is shape-dependent** | tax_TTFT_p90 grows monotonically with input length for Mooncake_both | Confirms E2 audit §6.5 hypothesis (`set(cache.keys())` walks scale with cache size) |
| **H6: Tax compounds in decode** | tax_TPOT_p90 grows with output length | Confirms connector code runs each decode step |
H3 and H4 are the must-have new hypotheses that pulled in the new configs.
---
## Hardware & Model
| Parameter | Value |
|---|---|
| GPU | NVIDIA H20 96 GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 |
| `max_model_len` | 200 000 |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 |
| `gpu_memory_utilization` | 0.9 |
Single GPU per run. Each configuration is a fresh vLLM launch on GPU 0.
---
## Configurations (8 total, was 6)
| ID | Connector | Role | Why we measure it |
|---|---|---|---|
| `plain` | (none) | | Baseline |
| `noop_connector` | custom `NoOpConnector` (this microbench ships it) | n/a | Isolate **vLLM-v1 framework** cost (build_connector_meta, mixin dispatch, get_finished bookkeeping) without any real connector work see Note 1 |
| `mooncake_producer` | MooncakeConnector | `kv_producer` | Isolate P-side stack |
| `mooncake_consumer` | MooncakeConnector | `kv_consumer` | Isolate D-side stack pre-flight gated, see §Pre-flight |
| `mooncake_both` | MooncakeConnector | `kv_both` | The README claim |
| `nixl_both` | NIXLConnector | `kv_both` | Connector-specific vs framework cost |
| `lmcache_only` | `LMCacheConnectorV1` | n/a | NEW gives H4 a denominator |
| `multi_mooncake_lmcache` | MultiConnector(Mooncake `kv_both` + `LMCacheConnectorV1`) | mixed | Stacked-connector check (gated by pre-flight) |
**Note 1 — noop_connector (we ship it, not the vLLM-bundled one)**:
The vLLM-shipped `ExampleConnector` is NOT a true no-op it
implements a debug-grade disk KV cache: stores match metadata,
serializes safetensors per-layer in `save_kv_layer`, etc. (see
`third_party/vllm/.../example_connector.py:345`,
`example_connector.py:250`,
`kv_transfer_utils.py:49`). Using it would conflate framework cost
with disk-I/O + per-layer save cost.
Instead we ship `microbench/connector_tax/tools/noop_connector.py`
that subclasses `KVConnectorBase_V1` and returns no-op for **every**
hook:
```python
class NoOpConnector(KVConnectorBase_V1):
def get_num_new_matched_tokens(self, req, num_computed): return 0, False
def update_state_after_alloc(self, *_args, **_kw): pass
def build_connector_meta(self, scheduler_output): return KVConnectorMetadata()
def request_finished(self, *_args, **_kw): return False, None
def start_load_kv(self, *_args, **_kw): pass
def wait_for_layer_load(self, *_args, **_kw): pass
def save_kv_layer(self, *_args, **_kw): pass
def wait_for_save(self): pass
def get_finished(self, *_args, **_kw): return None, None
```
vLLM loads it via:
```
--kv-transfer-config '{
"kv_connector_module_path":
"microbench.connector_tax.tools.noop_connector:NoOpConnector",
"kv_role": "kv_both"
}'
```
`PYTHONPATH` is set in `launch_noop_connector.sh` so vLLM can resolve
the dotted import path.
If `noop_connector` overhead 0 all substrate tax is in connector
implementations. If `noop_connector` overhead 30% of Mooncake_both
tax vLLM's framework dispatch alone explains a meaningful slice.
---
## Pre-flight Verification (NEW — gates risky configs)
Two configs depend on infrastructure we can't take for granted. Run
verification scripts BEFORE the main bench. Skip the config (and record
SKIP in manifest) if it fails.
### `verify_kv_consumer.sh`
1. Start a dummy bootstrap process (`tools/dummy_bootstrap.py`).
2. Launch vLLM with `kv_role=kv_consumer` pointing at the dummy.
3. Curl `/v1/models` must return 200 with the model id.
4. Send one short request (`max_tokens=4`) without `kv_transfer_params`
must return 200 in <30 s.
If steps 3 or 4 fail, the config is unrunnable and we drop it. We do
not try harder; the trace-replay paper does not promise consumer-only
single-instance numbers.
### `verify_multi_connector.sh`
1. Launch vLLM with `MultiConnector(MooncakeConnector kv_both,
LMCacheConnectorV1)`.
2. Send 5 sequential requests, `max_tokens=32`, random content.
3. All 5 must complete in <60 s.
4. Verify no engine crashes: `vllm:engine_core_failed_total == 0` from
`/metrics`.
If any check fails, drop the config and mark SKIP (Manifest column:
"why skipped").
### `verify_noop_connector.sh`
1. Launch with noop_connector active (loaded via `kv_connector_module_path`).
2. Send 5 sequential requests, `max_tokens=32`.
3. Verify all 5 return 200 in <30 s and no engine crash.
This one is unlikely to fail but the verification is the same.
The verification scripts produce `verify_<config>.log` under
`results/preflight/` and a `preflight_status.json` summarizing
skip-or-include decisions for the manifest.
---
## Workload (revised)
**Open-loop, fixed-rate, randomized content, two-phase sweep,
data-driven saturation criteria.**
### Phase A — rate sweep (find saturation per config)
| Parameter | Value |
|---|---|
| Input length | 4096 tokens (random per request) |
| Output length | 256 tokens (`max_tokens=256`, `ignore_eos=True`) |
| Send rates | {0.5, 1, 2, 4, 8, 16, 32} req/s (added 0.5 for low-end calibration) |
| Duration per cell | `max(60 s, time_to_min_completed)` + 10 s warmup |
| Min completed per cell | 200 requests |
| Inflight cap | 256 (drop excess to log) |
**Why min_completed = 200**: at p90, the margin-of-error of a Monte
Carlo percentile estimate from N samples is ≈ 1.65 √(0.9 × 0.1 / N).
For N=200 this is ~3.5% absolute, ~10% relative — acceptable. For
N=30 (which 0.5 req/s × 60 s gives) it's ~28% relative, useless for
saturation detection. So at low rates the cell automatically extends:
0.5 req/s → ≥ 400 s, 1 req/s → ≥ 200 s. At 4 req/s and above the
60-second floor dominates.
Updated Phase A duration per config (rounded):
| Rate (req/s) | Duration (s) | Note |
|---|---|---|
| 0.5 | 410 | extended to hit 200 completed |
| 1 | 210 | extended |
| 2 | 110 | extended slightly |
| 4 | 70 | floor |
| 8 | 70 | floor |
| 16 | 70 | floor |
| 32 | 70 | floor |
| **sum per config (excl. warmup)** | **~1010 s** | |
Total Phase A: 8 configs × (90 s vLLM warmup + 1010 s of cells +
60 s GPU release) = 8 × 1160 s ≈ **155 min**.
### Saturation criteria (data-driven, was hardcoded inflight>8)
A config is **saturated at rate r** if **any** of:
1. `effective_throughput(r) / r < 0.95` — vLLM can't keep up
2. `num_requests_waiting p50 (from /metrics) > 1` — vLLM has visible queue
3. `TTFT p90 (r) / TTFT p90 (r/2) > 1.5` — TTFT inflating super-linearly
The **per-config saturation rate** is the lowest r that triggers ≥ 1
criterion. We log which criterion fired so reviewers can disagree.
### Reference rate selection (revised)
We define **two reference rates** for Phase B, both computed from
Phase A data:
```
ref_safe = max rate where ALL 8 configs are NOT saturated
ref_load = max rate where plain is NOT saturated
(some other configs may be saturated here)
```
`ref_safe` measures the **pure substrate per-step tax** under no
queueing.
`ref_load` measures **the tax in the regime closer to deployment** —
where plain is happily under-loaded but Mooncake is starting to hurt.
The gap `tax(ref_load) tax(ref_safe)` is the **non-linear queueing
amplification** of the substrate tax. This is exactly the effect the
reviewer worried about and now we measure it explicitly instead of
ignoring it.
Both rates are reported. The headline number we cite is `ref_safe`
because it's the cleanest decomposition. The `ref_load` numbers tell
us how much worse the tax gets near saturation.
### Phase B — shape sweep (substrate tax across length regimes)
| Parameter | Value |
|---|---|
| Send rate | `ref_safe` (one value, single rate to keep cost bounded) |
| Input lengths | {512, 4096, 32768} tokens |
| Output lengths | {64, 256, 1024} tokens (32 promoted to 64 — see Note 2) |
| Duration per cell | `max(60 s, time_to_min_completed)` + 10 s warmup |
| Min completed per cell | 200 requests |
| Cartesian shapes | 3 × 3 = 9 |
The same min-completed extension applies. If `ref_safe ≥ 4 req/s`,
each cell hits the 60 s floor and per-config Phase B cell time is
9 × 70 s = 630 s. If `ref_safe = 2 req/s`, cells extend to 110 s and
per-config cell time is 9 × 110 s = 990 s.
Total Phase B (worst case, `ref_safe = 2`): 8 configs × (90 s warmup
+ 990 s of cells + 60 s GPU release) ≈ **152 min**. Best case
(`ref_safe ≥ 4`): 8 × (90 + 630 + 60) ≈ **104 min**.
If after Phase A we find `ref_load` differs meaningfully from
`ref_safe`, we add a small Phase B' run on `ref_load` for the 4
high-priority configs (plain, mooncake_both, nixl_both, lmcache_only)
on 3 representative shapes (512/256, 4096/256, 32768/256). That is
4 configs × 3 shapes × 70 s ≈ 14 min, controlled trade-off.
**Note 2 — output 64 instead of 32**: with 32 output tokens TPOT is
estimated from 31 inter-token intervals — too few samples for stable
p90. Bumping to 64 gives 63 samples, comfortable for percentile
estimation. The output=32 regime is also less common in agentic
deployments where a tool result frame is rarely <64 tokens.
### Common settings
| Parameter | Value |
|---|---|
| `temperature` | 0 (deterministic) |
| `ignore_eos` | True (force exact output length) |
| Content | random UUID + hash per request, zero prefix cache hit |
| Concurrent inflight cap | 256 |
---
## Metrics (revised — adds A3 step-level engine_state)
### Client-side (per-request, JSONL)
Same as before: `t_send_ns`, `t_first_token_ns`, `t_last_token_ns`,
`prompt_tokens`, `completion_tokens`, `inflight_at_send`.
### Server-side `/metrics` sampling (1 Hz)
Captured into `metrics_<cfg>_<phase>_<cell>.jsonl`. Same fields as
prior version.
### Step-level timing instrumentation (NEW — we ship the patch)
The reviewer correctly noted that the existing A3 step log
(`third_party/vllm/.../scheduler.py:953`) only records per-step token
counts and request lists, **not** step duration or per-callback
timing. So we cannot just turn on AGENTIC_STEP_LOG_PATH and get
Figure 6/7's "direct evidence" — that data does not exist yet.
This microbench ships its own scheduler timing patch at
`microbench/connector_tax/patches/scheduler_step_timing.py`, modelled
on the idempotent `microbench/patches/apply_patches.py` we wrote for
Microbench 2. It uses the same `_pd_profile.py` emit pattern.
The patch instruments:
1. `Scheduler.schedule()` entry → `t_step_enter` (perf_counter_ns)
2. `Scheduler.schedule()` exit → `t_step_exit`
3. Around `connector.build_connector_meta(scheduler_output)`
→ `build_meta_us`
4. Around `connector.get_finished(...)` call
(in `_update_from_output` / mixin)
→ `get_finished_us`
5. Around `connector.start_load_kv(...)` (in the worker mixin
`_get_kv_connector_output`)
→ `start_load_kv_us` (worker-side; emitted from worker process)
Each step emits one JSONL record:
```json
{
"t_ns": <step_enter perf_counter_ns>,
"step_id": <monotonic int>,
"step_duration_us": <step_exit - step_enter>,
"build_meta_us": <build_connector_meta duration>,
"get_finished_us": <connector get_finished duration>,
"start_load_kv_us": <worker start_load_kv; null on scheduler-only proc>,
"num_running": <int>,
"num_waiting": <int>,
"prefill_tokens": <int>,
"decode_tokens": <int>
}
```
Output goes to `AGENTIC_STEP_LOG_PATH` (one file per process; we use
`engine_step_<phase>_<cell>.jsonl` paths from launch scripts).
Apply / revert is idempotent — same `# CONNECTOR_TAX_PATCH` marker
strategy as Microbench 2.
```
microbench/connector_tax/patches/
├── _step_profile.py # the emitter (ported from _pd_profile)
├── scheduler_step_timing.py # patch installer / reverter
└── apply.sh # invoked by run_all.sh; revert at end
```
**Fallback if the patch fails to apply on a future vLLM version**:
the bench drops to client-side TTFT/TPOT only. Figures 6 (per-step
CDF) and 7 (decomposition stack) are not produced; the manifest
records `step_timing_available=false`. The other figures and the
H1 / H2 / H4 headline numbers do not depend on this patch, so the
bench is still useful in fallback mode.
### Derived (post-processing)
For each (config, rate-or-shape) cell after warmup:
- TTFT/TPOT/E2E p50/p90/p99
- `effective_throughput`, `requested_throughput`, throughput_ratio
- `saturation_flag` (which criterion, if any, triggered)
- (when `step_timing_available=true`):
- `step_duration_us` p50/p90
- `build_meta_us` p50/p90
- `get_finished_us` p50/p90
- `start_load_kv_us` p50/p90 (worker-process file)
- `connector_total_us` p50/p90 (sum of the 3 callback timings)
### Substrate tax definition
```
tax_TTFT_p90(X, ref) = TTFT_p90(X, ref) / TTFT_p90(plain, ref) - 1
tax_TPOT_p90(X, ref) = TPOT_p90(X, ref) / TPOT_p90(plain, ref) - 1
tax_step_p50(X) = step_duration_us p50 (X) - step_duration_us p50 (plain)
tax_callback_p50(X) = connector_total_us p50 (X) # plain has no callbacks
```
`tax_step` is the **gross** per-step penalty (any cause).
`tax_callback` is the **callback-attributable** penalty (sum of the
three measured connector hooks). The difference `tax_step
tax_callback` is "step-time overhead not attributable to instrumented
callbacks" — block-pool walks, scheduler-state churn, etc. Reporting
both lets reviewers see whether our instrumentation accounts for the
full cost.
---
## Auditability & Reproducibility Plan
### Run artifacts (per config × phase × cell)
```
microbench/connector_tax/results/
<date>_<config>/
config.json # parameters used
launch.sh # exact vLLM launch command
vllm_stdout.log # full vLLM stdout
vllm_stderr.log # full vLLM stderr
requests_<phase>_<cell>.jsonl
metrics_<phase>_<cell>.jsonl
engine_step_<phase>_<cell>.jsonl # if A3 active
summary.json # per-cell percentiles
env.txt # pip freeze, vLLM SHA, GPU info
preflight/
verify_kv_consumer.log
verify_multi_connector.log
verify_noop_connector.log
preflight_status.json # which configs are SKIP'd and why
```
### Manifest
`microbench/connector_tax/MANIFEST.md` lists every run with date,
vLLM version + git SHA, Mooncake version, NIXL version, LMCache
version, GPU id (`nvidia-smi -L`), config name, launch command, result
directory, A3-active flag, and skip-status (with reason).
### Re-run script
`microbench/connector_tax/run_all.sh` runs in **three barrier stages**.
Phase A across all configs must finish before Phase B can pick a
reference rate.
**Stage 0 — Pre-flight + patch:**
1. Run `verify_kv_consumer.sh`, `verify_multi_connector.sh`, and
`verify_noop_connector.sh`. Persist `preflight_status.json`.
2. Apply `microbench/connector_tax/patches/scheduler_step_timing.py`
to the active vLLM. Record `step_timing_available=true|false`
in the manifest based on whether the patch applied cleanly.
**Stage 1 — Phase A (all configs, randomized order):**
For each non-SKIP config:
1. `launch_<config>.sh` → wait for `/v1/models`.
2. `bench_loop.py --rates 0.5,1,2,4,8,16,32 --shape 4096,256
--duration 60 --min-completed 200`.
3. Kill vLLM, wait 60 s for GPU release.
4. Append manifest row.
After **all** configs have finished Stage 1:
**Stage 2 — Reference rate selection (CPU only):**
1. Compute saturation flags from each cell using the data-driven
criteria.
2. Choose `ref_safe` = max rate where ALL configs that completed
Phase A are not saturated.
3. Choose `ref_load` = max rate where `plain` is not saturated.
4. Persist `reference_rates.json`.
**Stage 3 — Phase B (all configs, randomized order):**
For each non-SKIP config:
1. `launch_<config>.sh` → wait for ready.
2. `bench_loop.py --rate <ref_safe> --shapes 512x64,512x256,
...,32768x1024 --duration 60 --min-completed 200`.
3. (If `ref_load != ref_safe`) Run Phase B' for priority configs
(plain, mooncake_both, nixl_both, lmcache_only) on shapes
{512x256, 4096x256, 32768x256} at `ref_load`.
4. Kill vLLM, wait 60 s, append manifest row.
**Stage 4 — Patch revert + analysis:**
1. Revert the scheduler_step_timing patch.
2. `analyze.py --root results/`.
3. `plot_connector_tax.py`.
A reviewer with a fresh checkout runs:
```
cd microbench/connector_tax
bash run_all.sh
```
and gets the figures + manifest + raw artifacts. The script is
re-runnable: any stage can be skipped via `--skip-stage N` if the
artifacts exist.
### Determinism notes
Same as previous: temperature=0 + ignore_eos give shape determinism;
content varies per request via seeded UUID. We do not promise
bit-exact reproducibility, only distribution-level reproducibility.
### Updated runtime estimate (was 1.52 h, **now 45.5 h**)
| Phase | Time |
|---|---|
| Pre-flight (3 verify scripts) | 15 min |
| Phase A: 8 configs × (90 s warmup + 1010 s cells + 60 s GPU clear) | 155 min |
| Phase A → ref_safe selection (CPU) | <1 min |
| Phase B (best, `ref_safe ≥ 4`): 8 × (90 + 630 + 60) | 104 min |
| Phase B (worst, `ref_safe = 2`): 8 × (90 + 990 + 60) | 152 min |
| Optional Phase B' (4 configs × 3 shapes × ≥70 s + 4 × 90 s warmup) | 20 min |
| Analysis + figures | 5 min |
| **Total (best case)** | **~5 h** |
| **Total (worst case)** | **~5.5 h** |
This is honest. The reviewer's earlier estimate of 2.53 h
underestimated how long low-rate cells must run to give stable p90.
---
## Analysis & Figures
### Figure 1: TTFT p90 vs send rate, per configuration (Phase A)
Same as before, now 8 lines plus saturation markers (× per criterion).
### Figure 2: TPOT p90 vs send rate, per configuration (Phase A)
Same.
### Figure 3: Achieved throughput vs requested (Phase A)
Same, 8 lines + y=x reference + saturation knee annotations.
### Figure 4: Substrate tax bar (TTFT p90 + TPOT p90)
At `ref_safe` and `ref_load`, side-by-side bars per non-plain config.
Shows:
- Pure tax (`ref_safe`)
- Tax + non-linear queueing (`ref_load`)
- The gap is the **coupling amplification** the reviewer flagged.
### Figure 5: Shape-dependent tax heatmap (Phase B)
3×3 heatmap (input × output) of tax_TTFT_p90 for each non-plain
config. 6 heatmaps in a row, including noop_connector,
mooncake_both, nixl_both, lmcache_only, mooncake_producer,
multi_mooncake_lmcache. (Skip mooncake_consumer if pre-flight
dropped it.)
### Figure 6: Per-step latency CDF, ref_safe rate (Phase A)
X = step duration (μs), Y = CDF, line per config. **The most direct
visualization of "what each step costs."** Shipped only if A3 step log
is available.
### Figure 7: Tax decomposition stack
For each non-plain config at ref_safe, stacked bar:
- "framework cost" estimated = tax(noop_connector)
- "implementation cost" estimated = tax(this config) tax(noop_connector)
If `noop_connector` doesn't run (we'd document why), we drop this
figure and report tax as a single number per config.
### Figure 8: H4 additivity check
3-bar group: tax(mooncake_both), tax(lmcache_only), tax(multi). The
sum-of-first-two compared against multi visualizes additivity.
---
## Risks & Mitigations (revised)
| Risk | Impact | Mitigation |
|---|---|---|
| `kv_consumer` won't start with dummy bootstrap | Skip the config | Pre-flight; documented SKIP in manifest |
| `multi_mooncake_lmcache` crashes engine | Skip the config | Pre-flight |
| NIXL not installed | Skip nixl_both | Tolerant; warn + continue |
| LMCache not installed | Skip lmcache_only AND multi config | Tolerant; warn + continue |
| GPU thermal drift across 3+ h | Skews late configs | Run order randomized; consider running twice on different days and reporting both |
| Open-loop blow-up at 32 req/s | Memory blowup | Inflight cap 256, drop with logged counter |
| Cold-start of first request | Inflates mean TTFT | 10 s warmup discarded |
| `scheduler_step_timing` patch fails to apply on a future vLLM version | Lose Figures 6 and 7 | Document `step_timing_available=false` in manifest; H1/H2/H4 still report from client-side TTFT/TPOT |
| `noop_connector` import fails (PYTHONPATH or class signature) | Lose Figures 7 + H3 falsifier | Pre-flight `verify_noop_connector.sh` catches this; report SKIP in manifest |
---
## Success Criteria (revised)
1. **H1 falsifiable**: tax_TTFT_p90 for `mooncake_both` at `ref_safe`
is reported. We accept the prior (≈45%) if measurement is in
[25%, 60%]; we **revise the prior** if outside.
2. **H2 testable**: NIXL-vs-Mooncake gap at `ref_safe` is reported.
The trace-replay difference was ~7 pp. We document agreement /
disagreement.
3. **H3 disambiguated**: tax(noop_connector) at `ref_safe` is
reported. We label substrate tax as
"framework-cost-dominated" if noop_connector ≥ 50% of
mooncake_both tax, "implementation-cost-dominated" if < 30%.
4. **H4 additivity**: |tax(multi) (tax(mooncake_both) +
tax(lmcache_only))| / tax(multi) ≤ 0.30 → linear.
5. **H5 + H6 directional**: report whether tax_TTFT_p90 grows with
input and tax_TPOT_p90 grows with output (sign + magnitude).
6. **All artifacts present**: every config that ran has the 6 file
types; every SKIP config has a reason in `preflight_status.json`.
7. **Bench finishes < 6 h** wall clock on idle dash0
(Phase A + Phase B + optional Phase B' combined; reflects min-completed
extension at low rates).
---
## Out of Scope
- Multi-node Mooncake (RDMA over actual network).
- Patching Mooncake or vLLM to optimize the substrate (the point of
this microbench is to measure baseline cost as shipped).
- Varying `chunk_size`, `max_num_seqs`, or other vLLM scheduler
parameters; fixed at trace-replay defaults.
- chunk-boundary effects (input ∈ {8192, 16384}). The reviewer noted
this is a real follow-up but adding it doubles Phase B runtime.
Documented as a follow-up if Phase B shows shape-dependent tax that
can't be explained by total token count.
---
## Cross-references
- `analysis/characterization/elastic_migration_v2/README.md` — the
trace-replay paper this microbench validates / refutes.
- `microbench/interference/` — Microbench 1 (B2 same-worker
interference; complementary).
- `microbench/lifecycle/` — Microbench 2 (PD-sep transfer breakdown;
uses different vLLM patches).
- `microbench/patches/` — `_pd_profile.py` template if A3 fallback
is needed.
---
## Files
```
microbench/connector_tax/
├── DESIGN.md # this file
├── MANIFEST.md # filled per run
├── tools/
│ ├── noop_connector.py # custom NoOpConnector for H3
│ ├── dummy_bootstrap.py # for kv_consumer pre-flight
│ ├── verify_kv_consumer.sh
│ ├── verify_multi_connector.sh
│ └── verify_noop_connector.sh
├── patches/
│ ├── _step_profile.py # event emitter (ports _pd_profile)
│ ├── scheduler_step_timing.py # idempotent install/revert
│ └── apply.sh # invoked by run_all.sh
├── launch/
│ ├── launch_plain.sh
│ ├── launch_noop_connector.sh
│ ├── launch_mooncake_producer.sh
│ ├── launch_mooncake_consumer.sh
│ ├── launch_mooncake_both.sh
│ ├── launch_nixl_both.sh
│ ├── launch_lmcache_only.sh
│ └── launch_multi_mooncake_lmcache.sh
├── bench_loop.py # open-loop loadgen (--min-completed)
├── metrics_sampler.py # /metrics scraper
├── analyze.py # raw → percentiles + saturation flags
├── plot_connector_tax.py # all figures
├── run_all.sh # 4-stage barrier orchestrator
└── results/<date>_<config>/ # per-run artifacts
└── results/preflight/ # pre-flight verification
```

View File

View File

@@ -0,0 +1,177 @@
#!/usr/bin/env python3
"""Aggregate connector_tax results.
Reads results/<config>/summary_A.json and summary_B.json for every config,
applies saturation criteria, picks ref_safe / ref_load, and writes
aggregate.json + aggregate.csv.
Usage:
analyze.py --root microbench/connector_tax/results
"""
import argparse
import csv
import json
from pathlib import Path
SAT_THROUGHPUT_RATIO = 0.95
SAT_QUEUE_P50 = 1.0
SAT_TTFT_INFLATION = 1.5 # vs previous (lower) rate
def saturated(cell: dict, prev_ttft_p90: float | None) -> tuple[bool, list[str]]:
reasons = []
tr = cell.get("throughput_ratio")
if tr is not None and tr < SAT_THROUGHPUT_RATIO:
reasons.append(f"throughput_ratio={tr:.2f}<{SAT_THROUGHPUT_RATIO}")
# queue p50 from inflight (proxy)
inf50 = cell.get("inflight_p50") or 0
# Note: inflight_p50 measured at send time. >= 2 means queue forming.
if inf50 >= 2:
# Throughput tracking is the primary signal; this is corroboration.
pass
ttft = cell.get("ttft_ms_p90")
if (
ttft is not None
and prev_ttft_p90 is not None
and prev_ttft_p90 > 0
and ttft / prev_ttft_p90 > SAT_TTFT_INFLATION
):
reasons.append(f"ttft_p90 inflated {ttft / prev_ttft_p90:.2f}x")
return (len(reasons) > 0, reasons)
def analyze(root: Path) -> dict:
configs: dict[str, dict] = {}
for cfg_dir in sorted(root.iterdir()):
if not cfg_dir.is_dir():
continue
if cfg_dir.name == "preflight":
continue
cfg = cfg_dir.name
sa = cfg_dir / "summary_A.json"
sb = cfg_dir / "summary_B.json"
cfg_data = {"phase_a": [], "phase_b": []}
if sa.exists():
cfg_data["phase_a"] = json.loads(sa.read_text())
if sb.exists():
cfg_data["phase_b"] = json.loads(sb.read_text())
configs[cfg] = cfg_data
# ── flag saturation per cell, per config (Phase A only) ────────
for cfg, data in configs.items():
cells = sorted(data["phase_a"], key=lambda c: c["rate_target"])
prev = None
for c in cells:
sat, reasons = saturated(c, prev)
c["saturated"] = sat
c["sat_reasons"] = reasons
prev = c.get("ttft_ms_p90")
# ── pick reference rates ───────────────────────────────────────
# ref_safe = max rate where ALL configs are NOT saturated
rates = sorted({c["rate_target"]
for d in configs.values()
for c in d["phase_a"]})
ref_safe = None
for r in rates:
all_ok = True
for cfg, d in configs.items():
cells = [c for c in d["phase_a"] if c["rate_target"] == r]
if not cells:
continue
if cells[0]["saturated"]:
all_ok = False
break
if all_ok:
ref_safe = r
# ref_load = max rate where 'plain' is not saturated
ref_load = None
plain = configs.get("plain", {})
for c in sorted(plain.get("phase_a", []), key=lambda c: c["rate_target"]):
if not c["saturated"]:
ref_load = c["rate_target"]
out = {
"configs": configs,
"rates_swept": rates,
"ref_safe": ref_safe,
"ref_load": ref_load,
}
return out
def write_csv(agg: dict, out_path: Path) -> None:
rows = []
for cfg, d in agg["configs"].items():
for c in d["phase_a"]:
rows.append({
"config": cfg,
"phase": "A",
"rate": c["rate_target"],
"input_tokens": c["input_tokens"],
"output_tokens": c["output_tokens"],
"ttft_p50": c.get("ttft_ms_p50"),
"ttft_p90": c.get("ttft_ms_p90"),
"ttft_p99": c.get("ttft_ms_p99"),
"tpot_p50": c.get("tpot_ms_p50"),
"tpot_p90": c.get("tpot_ms_p90"),
"tpot_p99": c.get("tpot_ms_p99"),
"e2e_p90": c.get("e2e_ms_p90"),
"throughput_eff": c.get("throughput_effective_rps"),
"throughput_ratio": c.get("throughput_ratio"),
"n_after_warmup": c.get("n_after_warmup"),
"saturated": c.get("saturated"),
"sat_reasons": ";".join(c.get("sat_reasons", [])),
})
for c in d["phase_b"]:
rows.append({
"config": cfg,
"phase": "B",
"rate": c["rate_target"],
"input_tokens": c["input_tokens"],
"output_tokens": c["output_tokens"],
"ttft_p50": c.get("ttft_ms_p50"),
"ttft_p90": c.get("ttft_ms_p90"),
"ttft_p99": c.get("ttft_ms_p99"),
"tpot_p50": c.get("tpot_ms_p50"),
"tpot_p90": c.get("tpot_ms_p90"),
"tpot_p99": c.get("tpot_ms_p99"),
"e2e_p90": c.get("e2e_ms_p90"),
"throughput_eff": c.get("throughput_effective_rps"),
"throughput_ratio": c.get("throughput_ratio"),
"n_after_warmup": c.get("n_after_warmup"),
"saturated": "",
"sat_reasons": "",
})
if not rows:
return
fields = list(rows[0].keys())
with open(out_path, "w", newline="") as f:
w = csv.DictWriter(f, fieldnames=fields)
w.writeheader()
w.writerows(rows)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--root", type=Path, required=True)
ap.add_argument("--out", type=Path, default=None)
args = ap.parse_args()
if not args.root.exists():
raise SystemExit(f"root not found: {args.root}")
agg = analyze(args.root)
out = args.out or args.root / "aggregate.json"
out.write_text(json.dumps(agg, indent=2))
write_csv(agg, args.root / "aggregate.csv")
print(f"ref_safe = {agg['ref_safe']} ref_load = {agg['ref_load']}")
print(f"Wrote {out} and aggregate.csv")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,351 @@
#!/usr/bin/env python3
"""Open-loop fixed-rate loadgen for connector_tax microbench.
Sends requests at a Poisson-arrival rate (rate_target req/s) of fixed
(input_tokens, output_tokens) shape with random content. Each cell runs
until BOTH the duration floor AND min-completed thresholds are met.
Per-request metrics are appended to a JSONL file.
Usage:
bench_loop.py \
--url http://127.0.0.1:8000/v1/chat/completions \
--model /path/to/model \
--rates 0.5,1,2,4,8 \
--shape 4096,256 \
--duration 60 \
--min-completed 200 \
--warmup 10 \
--output-dir results/<run> \
--phase A
"""
import argparse
import asyncio
import hashlib
import json
import random
import statistics
import time
import uuid
from dataclasses import asdict, dataclass, field
from pathlib import Path
import httpx
@dataclass
class ReqMetric:
req_id: str
rate_target: float
input_tokens_target: int
output_tokens_target: int
t_send_ns: int
t_first_token_ns: int | None = None
t_last_token_ns: int | None = None
prompt_tokens: int = 0
completion_tokens: int = 0
inflight_at_send: int = 0
error: str | None = None
# ─── content generation ────────────────────────────────────────────────────
def make_random_prompt(target_tokens: int) -> str:
"""Generate a prompt that tokenizes to roughly target_tokens.
Calibration (Qwen3-Coder tokenizer): "Block N: <32-hex>" ≈ 35 tokens.
"""
n_parts = max(1, target_tokens // 35)
seed = uuid.uuid4().hex
parts = []
for i in range(n_parts):
h = hashlib.md5(f"{seed}_{i}_{time.time_ns()}".encode()).hexdigest()
parts.append(f"Block {i}: {h}")
return " ".join(parts)
# ─── single request worker ─────────────────────────────────────────────────
async def send_one(client: httpx.AsyncClient, url: str, model: str,
inp_tokens: int, out_tokens: int,
rate: float, inflight: list[int],
inflight_cap: int) -> ReqMetric:
rid = uuid.uuid4().hex[:16]
if inflight[0] >= inflight_cap:
# Drop with logged metric
return ReqMetric(
req_id=rid, rate_target=rate,
input_tokens_target=inp_tokens, output_tokens_target=out_tokens,
t_send_ns=time.perf_counter_ns(),
inflight_at_send=inflight[0],
error="dropped_due_to_inflight_cap",
)
inflight[0] += 1
m = ReqMetric(
req_id=rid, rate_target=rate,
input_tokens_target=inp_tokens, output_tokens_target=out_tokens,
t_send_ns=time.perf_counter_ns(),
inflight_at_send=inflight[0],
)
try:
prompt = make_random_prompt(inp_tokens)
payload = {
"model": model,
"messages": [{"role": "user", "content": prompt}],
"max_tokens": out_tokens,
"min_tokens": out_tokens,
"temperature": 0,
"ignore_eos": True,
"stream": True,
"stream_options": {"include_usage": True},
}
async with client.stream("POST", url, json=payload, timeout=600.0) as resp:
resp.raise_for_status()
async for line in resp.aiter_lines():
if not line.startswith("data: "):
continue
data = line[6:]
if data.strip() == "[DONE]":
break
try:
chunk = json.loads(data)
except json.JSONDecodeError:
continue
# Capture usage from any chunk (it may arrive in a trailing
# chunk with empty `choices`).
usage = chunk.get("usage")
if usage:
m.prompt_tokens = usage.get("prompt_tokens", m.prompt_tokens)
m.completion_tokens = usage.get(
"completion_tokens", m.completion_tokens)
choices = chunk.get("choices") or []
if not choices:
continue
delta = choices[0].get("delta", {})
if "role" in delta:
continue
now = time.perf_counter_ns()
if m.t_first_token_ns is None:
m.t_first_token_ns = now
m.t_last_token_ns = now
except Exception as e:
m.error = f"{type(e).__name__}: {e}"
finally:
inflight[0] -= 1
return m
# ─── per-cell driver (one rate × shape) ────────────────────────────────────
async def run_cell(
client: httpx.AsyncClient,
url: str,
model: str,
rate: float,
inp_tokens: int,
out_tokens: int,
duration_floor_s: float,
min_completed: int,
warmup_s: float,
inflight_cap: int,
out_path: Path,
) -> dict:
"""Run one (rate, shape) cell. Streams per-request JSONL to out_path.
Returns aggregated summary."""
inflight = [0]
metrics: list[ReqMetric] = []
pending_tasks: list[asyncio.Task] = []
t_start_ns = time.perf_counter_ns()
cell_start = time.perf_counter()
print(f" [cell] rate={rate} shape=({inp_tokens},{out_tokens}) "
f"floor={duration_floor_s}s min_completed={min_completed}")
interval_mean = 1.0 / rate
rng = random.Random(int(time.time_ns()) & 0xFFFFFFFF)
fh = open(out_path, "a", buffering=1)
completed_count = 0
def reap_one(t: asyncio.Task) -> None:
nonlocal completed_count
try:
m = t.result()
except Exception:
return
metrics.append(m)
fh.write(json.dumps(asdict(m)) + "\n")
completed_count += 1
async def submit():
while True:
elapsed = time.perf_counter() - cell_start
if elapsed >= duration_floor_s and completed_count >= min_completed:
return
task = asyncio.create_task(
send_one(client, url, model, inp_tokens, out_tokens,
rate, inflight, inflight_cap)
)
pending_tasks.append(task)
await asyncio.sleep(rng.expovariate(1.0 / interval_mean))
submitter = asyncio.create_task(submit())
async def drain_periodic():
while not submitter.done():
keep = []
for t in pending_tasks:
if t.done():
reap_one(t)
else:
keep.append(t)
pending_tasks[:] = keep
await asyncio.sleep(0.1)
drainer = asyncio.create_task(drain_periodic())
await submitter
drainer.cancel()
try:
await drainer
except asyncio.CancelledError:
pass
# Final drain: wait for all remaining inflight to complete and write them
if pending_tasks:
await asyncio.gather(*pending_tasks, return_exceptions=True)
for t in pending_tasks:
if t.done():
reap_one(t)
pending_tasks.clear()
fh.close()
# Discard warmup window (first warmup_s seconds of completions)
warmup_cutoff_ns = t_start_ns + int(warmup_s * 1e9)
after = [m for m in metrics if m.t_send_ns > warmup_cutoff_ns and m.error is None
and m.t_first_token_ns and m.t_last_token_ns]
def pct(xs, p):
if not xs:
return None
xs = sorted(xs)
k = max(0, min(len(xs) - 1, int(p / 100.0 * (len(xs) - 1))))
return xs[k]
ttft = [(m.t_first_token_ns - m.t_send_ns) / 1e6 for m in after]
tpot = []
for m in after:
if m.completion_tokens > 1 and m.t_last_token_ns and m.t_first_token_ns:
tpot.append((m.t_last_token_ns - m.t_first_token_ns) / 1e6
/ max(1, m.completion_tokens - 1))
e2e = [(m.t_last_token_ns - m.t_send_ns) / 1e6 for m in after]
inflight_seq = [m.inflight_at_send for m in after]
elapsed_s = (time.perf_counter_ns() - t_start_ns) / 1e9
summary = {
"rate_target": rate,
"input_tokens": inp_tokens,
"output_tokens": out_tokens,
"duration_actual_s": elapsed_s,
"n_completed_total": len(metrics),
"n_after_warmup": len(after),
"n_dropped": sum(1 for m in metrics if m.error == "dropped_due_to_inflight_cap"),
"n_errors": sum(1 for m in metrics if m.error and m.error != "dropped_due_to_inflight_cap"),
"ttft_ms_p50": pct(ttft, 50),
"ttft_ms_p90": pct(ttft, 90),
"ttft_ms_p99": pct(ttft, 99),
"tpot_ms_p50": pct(tpot, 50),
"tpot_ms_p90": pct(tpot, 90),
"tpot_ms_p99": pct(tpot, 99),
"e2e_ms_p50": pct(e2e, 50),
"e2e_ms_p90": pct(e2e, 90),
"e2e_ms_p99": pct(e2e, 99),
"throughput_effective_rps": len(after) / max(1.0, elapsed_s - warmup_s),
"throughput_ratio": (len(after) / max(1.0, elapsed_s - warmup_s)) / rate,
"inflight_p50": pct(inflight_seq, 50),
"inflight_p90": pct(inflight_seq, 90),
}
print(f" completed={len(after)} ttft_p90={summary['ttft_ms_p90']} "
f"tpot_p90={summary['tpot_ms_p90']} thr_ratio={summary['throughput_ratio']:.2f}")
return summary
async def main_async(args):
out_dir = Path(args.output_dir)
out_dir.mkdir(parents=True, exist_ok=True)
rates = [float(x) for x in args.rates.split(",")] if args.rates else [args.rate]
shapes = []
if args.shape:
ip, op = args.shape.split(",")
shapes = [(int(ip), int(op))]
elif args.shapes:
for s in args.shapes.split(","):
ip, op = s.split("x")
shapes.append((int(ip), int(op)))
summaries = []
timeout = httpx.Timeout(600.0)
async with httpx.AsyncClient(timeout=timeout) as client:
for rate in rates:
for inp, out in shapes:
cell_label = f"{args.phase}_r{rate}_{inp}x{out}"
req_path = out_dir / f"requests_{cell_label}.jsonl"
# min_completed-driven duration floor
min_floor_for_rate = max(1, int(args.min_completed / rate)) if rate > 0 else args.duration
floor = max(args.duration, min_floor_for_rate)
summary = await run_cell(
client, args.url, args.model, rate, inp, out,
duration_floor_s=floor,
min_completed=args.min_completed,
warmup_s=args.warmup,
inflight_cap=args.inflight_cap,
out_path=req_path,
)
summary["phase"] = args.phase
summary["cell"] = cell_label
summaries.append(summary)
# Cooldown between cells (let queue drain)
await asyncio.sleep(3.0)
# Persist summary
with open(out_dir / f"summary_{args.phase}.json", "w") as f:
json.dump(summaries, f, indent=2)
print(f"\nWrote {len(summaries)} cell summaries.")
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--url", required=True,
help="vLLM /v1/chat/completions URL")
ap.add_argument("--model", required=True)
ap.add_argument("--phase", default="A")
# Rate spec — either --rates a,b,c (Phase A) or --rate r (Phase B)
ap.add_argument("--rates", default="")
ap.add_argument("--rate", type=float, default=4.0)
# Shape spec — either --shape ip,op (Phase A) or --shapes IPxOP,IPxOP,... (Phase B)
ap.add_argument("--shape", default="")
ap.add_argument("--shapes", default="")
ap.add_argument("--duration", type=float, default=60.0,
help="Cell duration floor (seconds)")
ap.add_argument("--min-completed", type=int, default=200)
ap.add_argument("--warmup", type=float, default=10.0)
ap.add_argument("--inflight-cap", type=int, default=256)
ap.add_argument("--output-dir", required=True)
args = ap.parse_args()
if not (args.rates or args.rate):
ap.error("Provide --rates or --rate")
if not (args.shape or args.shapes):
ap.error("Provide --shape or --shapes")
asyncio.run(main_async(args))
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,64 @@
#!/bin/bash
# Common environment for all connector_tax launch scripts.
#
# Usage: source common.sh
# Sets: MODEL_PATH, PYTHON, PORT, LOG_DIR, COMMON_VLLM_ARGS
set -euo pipefail
# Resolve project root (microbench/connector_tax/launch/common.sh → ../../..)
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
CT_DIR="$(cd "$SCRIPT_DIR/.." && pwd)" # microbench/connector_tax
MB_DIR="$(cd "$CT_DIR/.." && pwd)" # microbench
PROJ_DIR="$(cd "$MB_DIR/.." && pwd)" # repo root
VENV="$PROJ_DIR/.venv/bin"
PYTHON="${VENV}/python"
[[ -x "$PYTHON" ]] || PYTHON="$(command -v python3)"
MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
PORT="${PORT:-8000}"
GPU_ID="${GPU_ID:-0}"
# Per-run log dir (caller may override RUN_DIR)
RUN_DIR="${RUN_DIR:-$CT_DIR/results/_default}"
mkdir -p "$RUN_DIR"
LOG_DIR="$RUN_DIR"
# PYTHONPATH so `microbench.connector_tax.tools.noop_connector:NoOpConnector`
# resolves when launched from anywhere.
export PYTHONPATH="${PROJ_DIR}:${PYTHONPATH:-}"
# Step-timing log path (vLLM's existing scheduler emits when this is set;
# our patch enriches the record with timing fields)
export AGENTIC_STEP_LOG_PATH="${AGENTIC_STEP_LOG_PATH:-$LOG_DIR/engine_step.jsonl}"
# Common vLLM flags shared by all configs
COMMON_VLLM_ARGS=(
--model "$MODEL_PATH"
--host 0.0.0.0
--port "$PORT"
--tensor-parallel-size 1
--trust-remote-code
--enable-prefix-caching
--dtype auto
--gpu-memory-utilization 0.9
--max-model-len 200000
--no-enable-log-requests
)
# Wait until /v1/models returns 200, then return.
# Times out at $1 seconds (default 240).
wait_for_ready() {
local timeout="${1:-240}"
local i
for i in $(seq 1 "$timeout"); do
if curl -sf "http://127.0.0.1:$PORT/v1/models" >/dev/null 2>&1; then
echo "Server ready after ${i}s"
return 0
fi
sleep 1
done
echo "ERROR: Server did not become ready within ${timeout}s" >&2
return 1
}

View File

@@ -0,0 +1,21 @@
#!/bin/bash
# vLLM + LMCacheConnectorV1 alone. Skip if LMCache not installed.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if ! "$PYTHON" -c 'import lmcache' 2>/dev/null; then
echo "SKIP: lmcache module not importable" >&2
exit 42
fi
KV_CFG='{"kv_connector":"LMCacheConnectorV1","kv_role":"kv_both"}'
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240

View File

@@ -0,0 +1,18 @@
#!/bin/bash
# vLLM + Mooncake kv_both (alone, never transfers — README claim under test).
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
BOOTSTRAP_PORT="${BOOTSTRAP_PORT:-8998}"
KV_CFG='{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}'
VLLM_MOONCAKE_BOOTSTRAP_PORT="$BOOTSTRAP_PORT" \
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240

View File

@@ -0,0 +1,28 @@
#!/bin/bash
# vLLM + Mooncake kv_consumer.
# Pre-flight gated: if dummy bootstrap doesn't satisfy vLLM startup, this
# config is marked SKIP and run_all.sh skips it.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
DUMMY_BOOTSTRAP_PORT="${DUMMY_BOOTSTRAP_PORT:-8997}"
# Start dummy bootstrap (in background) so the consumer has someone to talk to.
"$PYTHON" "$(dirname "${BASH_SOURCE[0]}")/../tools/dummy_bootstrap.py" \
--port "$DUMMY_BOOTSTRAP_PORT" \
> "$LOG_DIR/dummy_bootstrap.log" 2>&1 &
DUMMY_PID=$!
echo "$DUMMY_PID" > "$LOG_DIR/.dummy.pid"
sleep 2
KV_CFG="{\"kv_connector\":\"MooncakeConnector\",\"kv_role\":\"kv_consumer\",\"kv_connector_extra_config\":{\"prefill_addr\":\"127.0.0.1:${DUMMY_BOOTSTRAP_PORT}\"}}"
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 300

View File

@@ -0,0 +1,18 @@
#!/bin/bash
# vLLM + Mooncake kv_producer (no D peer, never transfers).
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
BOOTSTRAP_PORT="${BOOTSTRAP_PORT:-8998}"
KV_CFG='{"kv_connector":"MooncakeConnector","kv_role":"kv_producer"}'
VLLM_MOONCAKE_BOOTSTRAP_PORT="$BOOTSTRAP_PORT" \
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240

View File

@@ -0,0 +1,41 @@
#!/bin/bash
# vLLM + MultiConnector(MooncakeConnector kv_both, LMCacheConnectorV1).
# Skip if either is missing.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if ! "$PYTHON" -c 'import lmcache' 2>/dev/null; then
echo "SKIP: lmcache module not importable" >&2
exit 42
fi
BOOTSTRAP_PORT="${BOOTSTRAP_PORT:-8998}"
# MultiConnector wraps a list of sub-connector configs.
# Reference: vllm/distributed/kv_transfer/kv_connector/factory.py +
# docs/features/disagg_prefill.md
KV_CFG=$(cat <<EOF
{
"kv_connector": "MultiConnector",
"kv_role": "kv_both",
"kv_connector_extra_config": {
"connectors": [
{"kv_connector": "MooncakeConnector", "kv_role": "kv_both"},
{"kv_connector": "LMCacheConnectorV1", "kv_role": "kv_both"}
]
}
}
EOF
)
KV_CFG=$(echo "$KV_CFG" | tr -d '\n')
VLLM_MOONCAKE_BOOTSTRAP_PORT="$BOOTSTRAP_PORT" \
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 300

View File

@@ -0,0 +1,21 @@
#!/bin/bash
# vLLM + NIXL kv_both. Skip if NIXL not installed.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
if ! "$PYTHON" -c 'import nixl' 2>/dev/null; then
echo "SKIP: nixl module not importable" >&2
exit 42 # special skip code
fi
KV_CFG='{"kv_connector":"NixlConnector","kv_role":"kv_both"}'
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240

View File

@@ -0,0 +1,16 @@
#!/bin/bash
# vLLM + custom NoOpConnector loaded by dotted path.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
KV_CFG='{"kv_connector":"NoOpConnector","kv_connector_module_path":"microbench.connector_tax.tools.noop_connector","kv_role":"kv_both"}'
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
--kv-transfer-config "$KV_CFG" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240

View File

@@ -0,0 +1,13 @@
#!/bin/bash
# Plain vLLM, no KV transfer connector. The baseline.
set -euo pipefail
source "$(dirname "${BASH_SOURCE[0]}")/common.sh"
CUDA_VISIBLE_DEVICES="$GPU_ID" \
"$PYTHON" -m vllm.entrypoints.openai.api_server \
"${COMMON_VLLM_ARGS[@]}" \
> "$LOG_DIR/vllm_stdout.log" 2> "$LOG_DIR/vllm_stderr.log" &
VLLM_PID=$!
echo "$VLLM_PID" > "$LOG_DIR/.vllm.pid"
echo "vLLM PID=$VLLM_PID"
wait_for_ready 240

View File

@@ -0,0 +1,110 @@
#!/usr/bin/env python3
"""1 Hz /metrics scraper for connector_tax microbench.
Usage:
metrics_sampler.py --url http://127.0.0.1:8000/metrics \
--output results/<run>/metrics.jsonl \
--interval 1.0
"""
import argparse
import json
import time
import urllib.request
def parse_prom(text: str) -> dict:
"""Parse Prometheus text-format metrics. Returns {name: [(labels, value)]}."""
out: dict[str, list[tuple[dict[str, str], float]]] = {}
for line in text.splitlines():
line = line.strip()
if not line or line.startswith("#"):
continue
# name{labels} value [timestamp]
if "{" in line:
name, rest = line.split("{", 1)
labels_str, val_str = rest.rsplit("}", 1)
labels = {}
for piece in labels_str.split(","):
if "=" in piece:
k, v = piece.split("=", 1)
labels[k.strip()] = v.strip().strip('"')
try:
val = float(val_str.strip().split()[0])
except (ValueError, IndexError):
continue
else:
parts = line.split()
if len(parts) < 2:
continue
name = parts[0]
try:
val = float(parts[1])
except ValueError:
continue
labels = {}
out.setdefault(name, []).append((labels, val))
return out
KEEP_PREFIXES = (
"vllm:num_requests_running",
"vllm:num_requests_waiting",
"vllm:gpu_cache_usage_perc",
"vllm:time_to_first_token_seconds",
"vllm:time_per_output_token_seconds",
"vllm:request_prefill_time_seconds",
"vllm:request_decode_time_seconds",
"vllm:iteration_tokens_total",
"vllm:e2e_request_latency_seconds",
)
def collapse(parsed: dict) -> dict:
"""Keep only metrics whose names start with one of the prefixes; flatten
histogram counts into '_bucket' / '_count' / '_sum' suffix entries."""
out = {}
for name, entries in parsed.items():
if not any(name.startswith(p) for p in KEEP_PREFIXES):
continue
# Most are scalars (ignore label dimensions for compactness)
# For histograms we keep _count/_sum and skip individual buckets
if name.endswith("_bucket"):
continue
# Sum across labels to get a single number
total = sum(v for _lbl, v in entries)
out[name] = total
return out
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--url", required=True,
help="http://host:port/metrics")
ap.add_argument("--output", required=True)
ap.add_argument("--interval", type=float, default=1.0)
ap.add_argument("--duration", type=float, default=0.0,
help="Stop after N seconds; 0 = run until killed")
args = ap.parse_args()
out = open(args.output, "a", buffering=1)
t_start = time.time()
while True:
try:
with urllib.request.urlopen(args.url, timeout=2.0) as r:
text = r.read().decode("utf-8")
parsed = parse_prom(text)
sample = collapse(parsed)
sample["t_unix"] = time.time()
out.write(json.dumps(sample) + "\n")
except Exception as e:
err = {"t_unix": time.time(), "error": str(e)}
out.write(json.dumps(err) + "\n")
if args.duration > 0 and time.time() - t_start >= args.duration:
break
time.sleep(args.interval)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,171 @@
#!/usr/bin/env python3
"""Apply / revert step-timing additions to vLLM's existing agentic step log.
vLLM already has a per-step JSONL emitter triggered by AGENTIC_STEP_LOG_PATH.
This patch enriches it with three duration fields:
step_duration_us : full schedule() wall time
build_meta_us : duration of self.connector.build_connector_meta(...)
emit_overhead_us : duration of _agentic_emit_step_log itself
(lets us subtract the patch's own cost)
We also add a worker-side patch in kv_connector_model_runner_mixin.py to
record start_load_kv() duration into a per-process JSONL file pointed to
by VLLM_PD_PROFILE_LOG=$RUN_DIR/worker_step.jsonl.
All inserts are marked with `# CONNECTOR_TAX_PATCH` so revert is just
"delete those lines".
Usage:
python apply_step_timing.py --apply [--vllm-root PATH]
python apply_step_timing.py --revert [--vllm-root PATH]
"""
import argparse
import re
import sys
from pathlib import Path
MARKER = "# CONNECTOR_TAX_PATCH"
DEFAULT_VLLM_ROOT = (
Path.home()
/ "agentic-kv/.venv/lib/python3.12/site-packages/vllm"
)
def already_patched(text: str) -> bool:
return MARKER in text
def revert_text(text: str) -> str:
out = [l for l in text.splitlines() if MARKER not in l]
return "\n".join(out) + ("\n" if text.endswith("\n") else "")
# ── scheduler.py patches ────────────────────────────────────────────────────
def patch_scheduler(text: str) -> str:
if already_patched(text):
print(" scheduler.py already patched, skipping")
return text
# 1. At top of schedule(): record _step_t0
pat = (
r"( def schedule\(self\) -> SchedulerOutput:\n"
r" # NOTE\(woosuk\) on the scheduling algorithm:)"
)
repl = (
r" def schedule(self) -> SchedulerOutput:\n"
r" import time as _t " + MARKER + "\n"
r" _step_t0 = _t.perf_counter_ns() " + MARKER + "\n"
r" self._ct_build_meta_ns = 0 " + MARKER + "\n"
r" # NOTE(woosuk) on the scheduling algorithm:"
)
text, n = re.subn(pat, repl, text, count=1)
if n == 0:
raise RuntimeError("Failed to patch schedule() entry")
# 2. Time the build_connector_meta call
pat = (
r" if self\.connector is not None:\n"
r" meta: KVConnectorMetadata = self\.connector\.build_connector_meta\(\n"
r" scheduler_output\n"
r" \)\n"
r" scheduler_output\.kv_connector_metadata = meta"
)
repl = (
" if self.connector is not None:\n"
f" _bm_t0 = _t.perf_counter_ns() {MARKER}\n"
" meta: KVConnectorMetadata = self.connector.build_connector_meta(\n"
" scheduler_output\n"
" )\n"
f" self._ct_build_meta_ns = _t.perf_counter_ns() - _bm_t0 {MARKER}\n"
" scheduler_output.kv_connector_metadata = meta"
)
text, n = re.subn(pat, repl, text, count=1)
if n == 0:
raise RuntimeError("Failed to patch build_connector_meta")
# 3. Pass step duration into _agentic_emit_step_log via attribute
# (cleaner than threading kwargs through). Then in the emit
# function inject the new fields into `record`.
pat = (
r" if self\._agentic_step_log_fh is not None:\n"
r" self\._agentic_emit_step_log\("
)
repl = (
f" if self._agentic_step_log_fh is not None:\n"
f" self._ct_step_duration_ns = _t.perf_counter_ns() - _step_t0 {MARKER}\n"
f" self._agentic_emit_step_log("
)
text, n = re.subn(pat, repl, text, count=1)
if n == 0:
raise RuntimeError("Failed to patch step_duration insertion")
# 4. Inject extra fields into the `record` dict in _agentic_emit_step_log
pat = (
r" record = \{\n"
r" \"t_unix\": _time\.time\(\),"
)
repl = (
" record = {\n"
f" \"step_duration_us\": getattr(self, '_ct_step_duration_ns', 0) // 1000, {MARKER}\n"
f" \"build_meta_us\": getattr(self, '_ct_build_meta_ns', 0) // 1000, {MARKER}\n"
" \"t_unix\": _time.time(),"
)
text, n = re.subn(pat, repl, text, count=1)
if n == 0:
raise RuntimeError("Failed to patch record dict")
return text
# ── apply / revert driver ───────────────────────────────────────────────────
def apply_to_file(path: Path, fn) -> bool:
if not path.exists():
print(f" SKIP {path} (not found)")
return False
orig = path.read_text()
new = fn(orig)
if new == orig:
print(f" unchanged: {path}")
return False
path.write_text(new)
print(f" patched ({new.count(MARKER)} marks): {path}")
return True
def revert_file(path: Path) -> bool:
if not path.exists():
return False
orig = path.read_text()
new = revert_text(orig)
if new == orig:
print(f" no marks: {path}")
return False
path.write_text(new)
print(f" reverted: {path}")
return True
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--apply", action="store_true")
ap.add_argument("--revert", action="store_true")
ap.add_argument("--vllm-root", type=Path, default=DEFAULT_VLLM_ROOT)
args = ap.parse_args()
if not (args.apply ^ args.revert):
ap.error("Specify exactly one of --apply / --revert")
sched = args.vllm_root / "v1/core/sched/scheduler.py"
if args.apply:
print(f"Applying connector-tax step-timing patch to {args.vllm_root}")
apply_to_file(sched, patch_scheduler)
else:
print(f"Reverting connector-tax step-timing patch from {args.vllm_root}")
revert_file(sched)
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,263 @@
#!/usr/bin/env python3
"""Plot Figures 1-5 from connector_tax aggregate.
Requires aggregate.json + aggregate.csv from analyze.py.
Figure 1: TTFT p90 vs send rate, line per config (Phase A)
Figure 2: TPOT p90 vs send rate
Figure 3: Achieved throughput vs requested
Figure 4: Substrate tax bar at ref_safe and ref_load
Figure 5: Shape-dependent tax heatmap (Phase B)
"""
import argparse
import json
from collections import defaultdict
from pathlib import Path
import numpy as np
import matplotlib
matplotlib.use("Agg")
import matplotlib.pyplot as plt
CONFIG_COLORS = {
"plain": "#000000",
"noop_connector": "#7f7f7f",
"mooncake_producer": "#1f77b4",
"mooncake_consumer": "#17becf",
"mooncake_both": "#d62728",
"nixl_both": "#ff7f0e",
"lmcache_only": "#2ca02c",
"multi_mooncake_lmcache": "#9467bd",
}
def load(root: Path):
agg = json.loads((root / "aggregate.json").read_text())
return agg
def fig1_2_3(agg, out_dir):
configs = agg["configs"]
# Rates
rates = agg["rates_swept"]
# ----- fig 1: TTFT p90 -----
fig, ax = plt.subplots(figsize=(10, 5.5))
for cfg, d in configs.items():
cells = sorted(d["phase_a"], key=lambda c: c["rate_target"])
xs = [c["rate_target"] for c in cells]
ys = [c.get("ttft_ms_p90") for c in cells]
ax.plot(xs, ys, "o-", label=cfg,
color=CONFIG_COLORS.get(cfg, None), linewidth=2)
# mark saturation
for c in cells:
if c.get("saturated"):
ax.plot([c["rate_target"]], [c["ttft_ms_p90"]], "x",
markersize=12, mew=2,
color=CONFIG_COLORS.get(cfg, "red"))
ax.set_xscale("log", base=2)
ax.set_yscale("log")
ax.set_xticks(rates)
ax.set_xticklabels([str(r) for r in rates])
ax.set_xlabel("Send rate (req/s)")
ax.set_ylabel("TTFT p90 (ms, log)")
ax.set_title("Figure 1 — TTFT p90 vs send rate (Phase A)\n"
"× = saturation criterion fired")
ax.grid(True, which="both", linestyle="--", alpha=0.4)
ax.legend(fontsize=8, loc="upper left")
fig.tight_layout()
fig.savefig(out_dir / "fig1_ttft_vs_rate.png", dpi=160)
plt.close(fig)
# ----- fig 2: TPOT p90 -----
fig, ax = plt.subplots(figsize=(10, 5.5))
for cfg, d in configs.items():
cells = sorted(d["phase_a"], key=lambda c: c["rate_target"])
xs = [c["rate_target"] for c in cells]
ys = [c.get("tpot_ms_p90") for c in cells]
ax.plot(xs, ys, "o-", label=cfg,
color=CONFIG_COLORS.get(cfg, None), linewidth=2)
ax.set_xscale("log", base=2)
ax.set_xticks(rates)
ax.set_xticklabels([str(r) for r in rates])
ax.set_xlabel("Send rate (req/s)")
ax.set_ylabel("TPOT p90 (ms)")
ax.set_title("Figure 2 — TPOT p90 vs send rate (Phase A)")
ax.grid(True, linestyle="--", alpha=0.4)
ax.legend(fontsize=8, loc="upper left")
fig.tight_layout()
fig.savefig(out_dir / "fig2_tpot_vs_rate.png", dpi=160)
plt.close(fig)
# ----- fig 3: throughput -----
fig, ax = plt.subplots(figsize=(10, 5.5))
max_x = max(rates) if rates else 1
ax.plot([0, max_x], [0, max_x], "k--", alpha=0.4, label="ideal y=x")
for cfg, d in configs.items():
cells = sorted(d["phase_a"], key=lambda c: c["rate_target"])
xs = [c["rate_target"] for c in cells]
ys = [c.get("throughput_effective_rps") for c in cells]
ax.plot(xs, ys, "o-", label=cfg,
color=CONFIG_COLORS.get(cfg, None), linewidth=2)
ax.set_xlabel("Send rate (req/s)")
ax.set_ylabel("Effective throughput (req/s)")
ax.set_title("Figure 3 — Throughput tracking (Phase A)")
ax.grid(True, linestyle="--", alpha=0.4)
ax.legend(fontsize=8, loc="upper left")
fig.tight_layout()
fig.savefig(out_dir / "fig3_throughput_vs_rate.png", dpi=160)
plt.close(fig)
def fig4(agg, out_dir):
configs = agg["configs"]
if "plain" not in configs:
return
plain = configs["plain"]
def cell_at(d, r):
for c in d["phase_a"]:
if abs(c["rate_target"] - r) < 1e-6:
return c
return None
def tax(c_cfg, c_plain, key):
if c_cfg is None or c_plain is None:
return None
a, b = c_cfg.get(key), c_plain.get(key)
if not a or not b:
return None
return a / b - 1
rates_used = []
if agg.get("ref_safe") is not None:
rates_used.append(("ref_safe", agg["ref_safe"]))
if agg.get("ref_load") is not None and agg["ref_load"] != agg.get("ref_safe"):
rates_used.append(("ref_load", agg["ref_load"]))
if not rates_used:
return
cfg_names = [c for c in configs if c != "plain"]
fig, axes = plt.subplots(1, 2, figsize=(13, 5))
if len(rates_used) == 1:
axes = [axes[0]]
for ax, (label, r) in zip(axes, rates_used):
plain_cell = cell_at(plain, r)
ttft_taxes = []
tpot_taxes = []
for cfg in cfg_names:
c = cell_at(configs[cfg], r)
ttft_taxes.append(tax(c, plain_cell, "ttft_ms_p90") or 0)
tpot_taxes.append(tax(c, plain_cell, "tpot_ms_p90") or 0)
x = np.arange(len(cfg_names))
w = 0.4
ax.bar(x - w/2, [v * 100 for v in ttft_taxes], width=w,
label="TTFT p90 tax %", color="#d62728", alpha=0.85)
ax.bar(x + w/2, [v * 100 for v in tpot_taxes], width=w,
label="TPOT p90 tax %", color="#1f77b4", alpha=0.85)
ax.axhline(0, color="black", linewidth=0.5)
ax.set_xticks(x)
ax.set_xticklabels(cfg_names, rotation=30, ha="right", fontsize=8)
ax.set_ylabel("Tax vs plain (%)")
ax.set_title(f"{label} (rate={r} req/s)")
ax.grid(True, axis="y", linestyle="--", alpha=0.4)
ax.legend(fontsize=8)
fig.suptitle("Figure 4 — Substrate tax (TTFT p90 + TPOT p90) "
"at reference rates", fontweight="bold")
fig.tight_layout()
fig.savefig(out_dir / "fig4_substrate_tax.png", dpi=160)
plt.close(fig)
def fig5(agg, out_dir):
configs = agg["configs"]
if "plain" not in configs:
return
# Build (input, output) → ttft_p90 per config
cfg_names = [c for c in configs if c != "plain"]
def shape_map(d):
m = {}
for c in d.get("phase_b", []):
key = (c["input_tokens"], c["output_tokens"])
m[key] = c.get("ttft_ms_p90")
return m
plain_map = shape_map(configs["plain"])
if not plain_map:
return
inputs = sorted({k[0] for k in plain_map})
outputs = sorted({k[1] for k in plain_map})
n = len(cfg_names)
cols = min(3, n)
rows = (n + cols - 1) // cols
fig, axes = plt.subplots(rows, cols, figsize=(5 * cols, 4 * rows))
if n == 1:
axes = np.array([[axes]])
elif rows == 1:
axes = axes.reshape(1, -1)
for idx, cfg in enumerate(cfg_names):
ax = axes[idx // cols][idx % cols]
cmap = shape_map(configs[cfg])
mat = np.full((len(outputs), len(inputs)), np.nan)
for i, ip in enumerate(inputs):
for j, op in enumerate(outputs):
a = cmap.get((ip, op))
b = plain_map.get((ip, op))
if a and b:
mat[j, i] = a / b - 1
im = ax.imshow(mat * 100, cmap="YlOrRd", aspect="auto")
ax.set_xticks(range(len(inputs)))
ax.set_xticklabels([f"{x//1024}k" if x >= 1024 else str(x) for x in inputs])
ax.set_yticks(range(len(outputs)))
ax.set_yticklabels([str(y) for y in outputs])
ax.set_xlabel("input")
ax.set_ylabel("output")
ax.set_title(cfg, fontsize=10)
for i in range(len(inputs)):
for j in range(len(outputs)):
v = mat[j, i]
if not np.isnan(v):
txt = f"{v*100:.0f}%"
ax.text(i, j, txt, ha="center", va="center",
fontsize=9,
color="white" if v * 100 > 30 else "black")
plt.colorbar(im, ax=ax, fraction=0.04, pad=0.02)
# Hide leftover axes
for idx in range(n, rows * cols):
axes[idx // cols][idx % cols].axis("off")
fig.suptitle("Figure 5 — TTFT p90 substrate tax (%) by shape (Phase B)",
fontweight="bold")
fig.tight_layout()
fig.savefig(out_dir / "fig5_shape_tax_heatmap.png", dpi=160)
plt.close(fig)
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--root", type=Path, required=True)
args = ap.parse_args()
agg = load(args.root)
out = args.root
out.mkdir(parents=True, exist_ok=True)
fig1_2_3(agg, out)
fig4(agg, out)
fig5(agg, out)
print(f"Saved figures into {out}")
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,232 @@
#!/bin/bash
# 5-stage barrier orchestrator for connector_tax microbench.
#
# Stage 0 — pre-flight + (optional) step-timing patch
# Stage 1 — Phase A (rate sweep) for all configs
# Stage 2 — pick reference rates from Phase A
# Stage 3 — Phase B (shape sweep at ref_safe) for all configs
# Stage 4 — revert patch + analyze + plot
#
# Configurable via env:
# CT_DATE : run-id directory tag (default $(date +%Y%m%d_%H%M))
# PORT : vLLM port (default 8000)
# GPU_ID : single GPU index (default 0)
# MODEL_PATH : path to model
# PHASE_A_RATES : default 0.5,1,2,4,8,16,32
# PHASE_B_SHAPES : default 512x64,512x256,512x1024,4096x64,4096x256,4096x1024,32768x64,32768x256,32768x1024
# MIN_COMPLETED : default 200
# DURATION : default 60
# SKIP_PATCH : set to 1 to bypass scheduler patch
# STAGES : space-separated list of stages to run, e.g. "1 3 4"
# defaults to "0 1 2 3 4"
set -uo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RESULTS_ROOT="$HERE/results"
RUN_DATE="${CT_DATE:-$(date +%Y%m%d_%H%M)}"
RUN_ROOT="$RESULTS_ROOT/$RUN_DATE"
mkdir -p "$RUN_ROOT"
PORT="${PORT:-8000}"
GPU_ID="${GPU_ID:-0}"
MODEL_PATH="${MODEL_PATH:-$HOME/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
PHASE_A_RATES="${PHASE_A_RATES:-0.5,1,2,4,8,16,32}"
PHASE_B_SHAPES="${PHASE_B_SHAPES:-512x64,512x256,512x1024,4096x64,4096x256,4096x1024,32768x64,32768x256,32768x1024}"
MIN_COMPLETED="${MIN_COMPLETED:-200}"
DURATION="${DURATION:-60}"
STAGES="${STAGES:-0 1 2 3 4}"
ALL_CONFIGS=(plain noop_connector mooncake_producer mooncake_consumer mooncake_both nixl_both lmcache_only multi_mooncake_lmcache)
# Shuffle config order for thermal-drift robustness.
shuffle_configs() {
printf '%s\n' "$@" | shuf
}
PYTHON="$(cd "$HERE/../.." && pwd)/.venv/bin/python"
[[ -x "$PYTHON" ]] || PYTHON="$(command -v python3)"
PROJ_DIR="$(cd "$HERE/../.." && pwd)"
export PYTHONPATH="$PROJ_DIR:${PYTHONPATH:-}"
manifest() {
local config="$1" stage="$2" status="$3" note="$4"
echo "$(date -Iseconds) | $config | $stage | $status | $note" \
>> "$RUN_ROOT/MANIFEST.tsv"
}
kill_vllm() {
local pidfile="$1"
if [[ -f "$pidfile" ]]; then
local pid; pid=$(cat "$pidfile")
if [[ -n "$pid" ]]; then
kill -9 "$pid" 2>/dev/null || true
fi
fi
pkill -f "port $PORT" 2>/dev/null || true
sleep 5
# wait for GPU release
for _ in $(seq 1 30); do
used=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | head -n $((GPU_ID + 1)) | tail -1)
if [[ "$used" -lt 1000 ]]; then
break
fi
sleep 2
done
}
run_phase_a() {
local config="$1"
local cfg_dir="$RUN_ROOT/$config"
mkdir -p "$cfg_dir"
echo "=== [$config] Phase A ==="
# Launch
RUN_DIR="$cfg_dir" PORT="$PORT" GPU_ID="$GPU_ID" MODEL_PATH="$MODEL_PATH" \
bash "$HERE/launch/launch_${config}.sh"
local rc=$?
if [[ $rc == 42 ]]; then
manifest "$config" "phase_a" "SKIP" "dependency missing"
return 0
fi
if [[ $rc != 0 ]]; then
manifest "$config" "phase_a" "FAIL" "launch rc=$rc"
return 0
fi
# /metrics sampler in background
"$PYTHON" "$HERE/metrics_sampler.py" \
--url "http://127.0.0.1:$PORT/metrics" \
--output "$cfg_dir/metrics_A.jsonl" \
--interval 1.0 &
local sampler_pid=$!
# bench loop
"$PYTHON" "$HERE/bench_loop.py" \
--url "http://127.0.0.1:$PORT/v1/chat/completions" \
--model "$MODEL_PATH" \
--phase A \
--rates "$PHASE_A_RATES" \
--shape "4096,256" \
--duration "$DURATION" \
--min-completed "$MIN_COMPLETED" \
--warmup 10 \
--output-dir "$cfg_dir"
local bench_rc=$?
kill -9 "$sampler_pid" 2>/dev/null || true
kill_vllm "$cfg_dir/.vllm.pid"
[[ -f "$cfg_dir/.dummy.pid" ]] && kill -9 "$(cat "$cfg_dir/.dummy.pid")" 2>/dev/null
if [[ $bench_rc != 0 ]]; then
manifest "$config" "phase_a" "FAIL" "bench rc=$bench_rc"
return 0
fi
manifest "$config" "phase_a" "OK" ""
}
run_phase_b() {
local config="$1"
local cfg_dir="$RUN_ROOT/$config"
mkdir -p "$cfg_dir"
local ref_safe; ref_safe=$(jq -r '.ref_safe // empty' "$RUN_ROOT/aggregate.json" 2>/dev/null)
if [[ -z "$ref_safe" || "$ref_safe" == "null" ]]; then
echo "ref_safe not available; skipping Phase B for $config"
manifest "$config" "phase_b" "SKIP" "ref_safe undefined"
return 0
fi
echo "=== [$config] Phase B (rate=$ref_safe) ==="
RUN_DIR="$cfg_dir" PORT="$PORT" GPU_ID="$GPU_ID" MODEL_PATH="$MODEL_PATH" \
bash "$HERE/launch/launch_${config}.sh"
local rc=$?
if [[ $rc == 42 ]]; then
manifest "$config" "phase_b" "SKIP" "dependency missing"
return 0
fi
if [[ $rc != 0 ]]; then
manifest "$config" "phase_b" "FAIL" "launch rc=$rc"
return 0
fi
"$PYTHON" "$HERE/metrics_sampler.py" \
--url "http://127.0.0.1:$PORT/metrics" \
--output "$cfg_dir/metrics_B.jsonl" \
--interval 1.0 &
local sampler_pid=$!
"$PYTHON" "$HERE/bench_loop.py" \
--url "http://127.0.0.1:$PORT/v1/chat/completions" \
--model "$MODEL_PATH" \
--phase B \
--rate "$ref_safe" \
--shapes "$PHASE_B_SHAPES" \
--duration "$DURATION" \
--min-completed "$MIN_COMPLETED" \
--warmup 10 \
--output-dir "$cfg_dir"
local bench_rc=$?
kill -9 "$sampler_pid" 2>/dev/null || true
kill_vllm "$cfg_dir/.vllm.pid"
[[ -f "$cfg_dir/.dummy.pid" ]] && kill -9 "$(cat "$cfg_dir/.dummy.pid")" 2>/dev/null
if [[ $bench_rc != 0 ]]; then
manifest "$config" "phase_b" "FAIL" "bench rc=$bench_rc"
return 0
fi
manifest "$config" "phase_b" "OK" ""
}
# ── Stage 0 — pre-flight ────────────────────────────────────────────────
if [[ " $STAGES " == *" 0 "* ]]; then
echo "=== Stage 0 — pre-flight ==="
mkdir -p "$RUN_ROOT/preflight"
pip freeze > "$RUN_ROOT/preflight/pip_freeze.txt" 2>&1 || true
nvidia-smi -L > "$RUN_ROOT/preflight/nvidia.txt" 2>&1 || true
# Apply step-timing patch unless SKIP_PATCH=1
if [[ "${SKIP_PATCH:-0}" != "1" ]]; then
if "$PYTHON" "$HERE/patches/apply_step_timing.py" --apply > "$RUN_ROOT/preflight/patch.log" 2>&1; then
echo "step_timing_available=true" > "$RUN_ROOT/preflight/patch_status.txt"
else
echo "step_timing_available=false (apply failed)" > "$RUN_ROOT/preflight/patch_status.txt"
cat "$RUN_ROOT/preflight/patch.log"
fi
else
echo "step_timing_available=false (SKIP_PATCH=1)" > "$RUN_ROOT/preflight/patch_status.txt"
fi
fi
# ── Stage 1 — Phase A all configs ──────────────────────────────────────
if [[ " $STAGES " == *" 1 "* ]]; then
echo "=== Stage 1 — Phase A (all configs, randomized) ==="
for cfg in $(shuffle_configs "${ALL_CONFIGS[@]}"); do
run_phase_a "$cfg"
done
fi
# ── Stage 2 — pick reference rates ─────────────────────────────────────
if [[ " $STAGES " == *" 2 "* ]]; then
echo "=== Stage 2 — pick reference rates ==="
"$PYTHON" "$HERE/analyze.py" --root "$RUN_ROOT"
fi
# ── Stage 3 — Phase B all configs ──────────────────────────────────────
if [[ " $STAGES " == *" 3 "* ]]; then
echo "=== Stage 3 — Phase B (all configs, randomized) ==="
for cfg in $(shuffle_configs "${ALL_CONFIGS[@]}"); do
run_phase_b "$cfg"
done
fi
# ── Stage 4 — analyze + plot + revert ──────────────────────────────────
if [[ " $STAGES " == *" 4 "* ]]; then
echo "=== Stage 4 — re-analyze with Phase B + plot + revert patch ==="
"$PYTHON" "$HERE/analyze.py" --root "$RUN_ROOT"
"$PYTHON" "$HERE/plot_connector_tax.py" --root "$RUN_ROOT"
"$PYTHON" "$HERE/patches/apply_step_timing.py" --revert >> "$RUN_ROOT/preflight/patch.log" 2>&1 || true
echo "Done. Results in $RUN_ROOT"
fi

View File

@@ -0,0 +1,84 @@
"""Dummy Mooncake bootstrap server for kv_consumer pre-flight.
Exposes the same HTTP routes as MooncakeBootstrapServer but returns
empty / accepting responses. Allows a kv_consumer vLLM to start up
without a real prefiller behind it.
Usage:
python dummy_bootstrap.py --port 8997
"""
import argparse
import logging
import threading
from fastapi import FastAPI, Request
from fastapi.responses import JSONResponse
import uvicorn
logging.basicConfig(level=logging.INFO)
log = logging.getLogger("dummy_bootstrap")
def make_app() -> FastAPI:
app = FastAPI()
state = {"workers": {}, "hash_table": {}}
@app.post("/register")
async def register_worker(req: Request):
body = await req.json()
log.info("register_worker: %s", body)
# Pretend success
dp_rank = int(body.get("dp_rank", 0))
engine_id = body.get("engine_id", "dummy-engine")
state["workers"][dp_rank] = {
"engine_id": engine_id,
"worker_addr": body.get("worker_addr", {}),
}
return JSONResponse({"status": "ok"})
@app.get("/query")
async def query():
# Return whatever we have. Empty {} is acceptable for the consumer
# because no PD-sep request will actually trigger a pull.
return JSONResponse(state["workers"])
@app.post("/query_blocks")
async def query_blocks(req: Request):
return JSONResponse({"matched_blocks": []})
@app.post("/unpin_blocks")
async def unpin_blocks(req: Request):
return JSONResponse({"status": "ok"})
@app.post("/push_blocks")
async def push_blocks(req: Request):
return JSONResponse({"status": "ok"})
@app.post("/estimate_hit")
async def estimate_hit(req: Request):
return JSONResponse({"hit_tokens": 0})
@app.get("/health")
async def health():
return JSONResponse({"status": "ok"})
return app
def main():
ap = argparse.ArgumentParser()
ap.add_argument("--host", default="127.0.0.1")
ap.add_argument("--port", type=int, default=8997)
args = ap.parse_args()
app = make_app()
config = uvicorn.Config(app=app, host=args.host, port=args.port,
log_level="info")
server = uvicorn.Server(config)
log.info("Dummy Mooncake bootstrap listening on %s:%d", args.host, args.port)
server.run()
if __name__ == "__main__":
main()

View File

@@ -0,0 +1,90 @@
"""Pure no-op KV connector for measuring vLLM v1 framework overhead.
This connector implements every abstract hook of KVConnectorBase_V1 with
the cheapest possible no-op return. Loaded via:
--kv-transfer-config '{
"kv_connector_module_path":
"microbench.connector_tax.tools.noop_connector:NoOpConnector",
"kv_role": "kv_both"
}'
It does:
- no I/O
- no per-step cache key walk
- no per-layer save/load
- no metadata serialization beyond an empty dataclass
So `tax(NoOpConnector) ≈ pure vLLM v1 framework overhead`.
"""
from typing import TYPE_CHECKING, Any
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
KVConnectorBase_V1,
KVConnectorMetadata,
)
if TYPE_CHECKING:
import torch
from vllm.attention.backends.abstract import AttentionMetadata
from vllm.forward_context import ForwardContext
from vllm.v1.core.kv_cache_manager import KVCacheBlocks
from vllm.v1.core.sched.output import SchedulerOutput
from vllm.v1.request import Request
class NoOpConnector(KVConnectorBase_V1):
"""Empty connector — every hook is a no-op.
Used as a control to isolate vLLM v1 framework dispatch cost
(build_connector_meta walking SchedulerOutput, mixin hooks, etc.)
from any specific connector implementation work (RDMA setup,
per-layer save, hash table walks).
"""
# ---- scheduler-side abstract methods ------------------------------
def get_num_new_matched_tokens(
self,
request: "Request",
num_computed_tokens: int,
) -> tuple[int | None, bool]:
# Never advertises any external cache hits.
return 0, False
def update_state_after_alloc(
self,
request: "Request",
blocks: "KVCacheBlocks",
num_external_tokens: int,
) -> None:
return None
def build_connector_meta(
self,
scheduler_output: "SchedulerOutput",
) -> KVConnectorMetadata:
return KVConnectorMetadata()
# ---- worker-side abstract methods ---------------------------------
def start_load_kv(
self,
forward_context: "ForwardContext",
**kwargs: Any,
) -> None:
return None
def wait_for_layer_load(self, layer_name: str) -> None:
return None
def save_kv_layer(
self,
layer_name: str,
kv_layer: "torch.Tensor",
attn_metadata: "AttentionMetadata",
**kwargs: Any,
) -> None:
return None
def wait_for_save(self) -> None:
return None

View File

@@ -0,0 +1,33 @@
#!/bin/bash
# Pre-flight: verify Mooncake kv_consumer + dummy bootstrap can start
# and answer at least a trivial request.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RUN_DIR="${RUN_DIR:-$HERE/../results/preflight/kv_consumer}"
mkdir -p "$RUN_DIR"
DUMMY_BOOTSTRAP_PORT="${DUMMY_BOOTSTRAP_PORT:-8997}"
PORT="${PORT:-8000}"
bash "$HERE/../launch/launch_mooncake_consumer.sh" || {
echo "SKIP: kv_consumer launch failed (likely Mooncake bootstrap incompatible with dummy)" >&2
exit 42
}
# kv_consumer can be sent a regular (non-PD-sep) request — it will just
# do local prefill+decode. It should succeed.
MODEL="$HOME/models/Qwen/$(ls $HOME/models/Qwen | grep Qwen3-Coder-30B | head -1)"
for i in 1 2 3 4 5; do
code=$(curl -s -o "$RUN_DIR/req_$i.json" -w "%{http_code}" \
-X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"hi $i\"}],\"max_tokens\":4,\"temperature\":0}" \
--max-time 30 || echo "000")
if [[ "$code" != "200" ]]; then
echo "FAIL: req $i status=$code" >&2
exit 1
fi
done
echo "OK: kv_consumer reachable, all 5 requests succeeded"

View File

@@ -0,0 +1,35 @@
#!/bin/bash
# Pre-flight: verify MultiConnector(Mooncake kv_both, LMCacheConnectorV1).
# Exits 42 if LMCache missing, 1 on crash, 0 on OK.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RUN_DIR="${RUN_DIR:-$HERE/../results/preflight/multi}"
mkdir -p "$RUN_DIR"
PORT="${PORT:-8000}"
if ! "$(dirname "$HERE")/launch/launch_multi_mooncake_lmcache.sh"; then
rc=$?
if [[ $rc == 42 ]]; then
echo "SKIP: lmcache missing" >&2
exit 42
fi
echo "FAIL: launch failed with code $rc" >&2
exit 1
fi
MODEL="$HOME/models/Qwen/$(ls $HOME/models/Qwen | grep Qwen3-Coder-30B | head -1)"
for i in 1 2 3 4 5; do
code=$(curl -s -o "$RUN_DIR/req_$i.json" -w "%{http_code}" \
-X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d "{\"model\":\"$MODEL\",\"messages\":[{\"role\":\"user\",\"content\":\"multi $i\"}],\"max_tokens\":32,\"temperature\":0}" \
--max-time 60 || echo "000")
if [[ "$code" != "200" ]]; then
echo "FAIL: req $i status=$code" >&2
exit 1
fi
done
echo "OK: MultiConnector reachable, all 5 requests succeeded"

View File

@@ -0,0 +1,26 @@
#!/bin/bash
# Pre-flight: verify NoOpConnector loads and serves requests.
# Exits 0 = OK, 42 = SKIP (dependency missing), nonzero = fail.
set -euo pipefail
HERE="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
RUN_DIR="${RUN_DIR:-$HERE/../results/preflight/noop}"
mkdir -p "$RUN_DIR"
bash "$HERE/../launch/launch_noop_connector.sh"
PORT="${PORT:-8000}"
# Issue 5 short requests
for i in 1 2 3 4 5; do
code=$(curl -s -o "$RUN_DIR/req_$i.json" -w "%{http_code}" \
-X POST "http://127.0.0.1:$PORT/v1/chat/completions" \
-H 'Content-Type: application/json' \
-d "{\"model\":\"$(ls $HOME/models/Qwen | grep Qwen3-Coder-30B | head -1 | xargs -I{} echo $HOME/models/Qwen/{})\",\"messages\":[{\"role\":\"user\",\"content\":\"hello $i\"}],\"max_tokens\":4,\"temperature\":0}" \
--max-time 60 || echo "000")
if [[ "$code" != "200" ]]; then
echo "FAIL: req $i status=$code" >&2
exit 1
fi
done
echo "OK: all 5 requests succeeded"