Validates the elastic_migration_v2 finding that kv_role=kv_both adds
TTFT p90 +45% even when PD-sep never fires. Replicates under
single-instance, synthetic, open-loop workload to disambiguate
mechanism cost from 8-instance feedback amplification.
Configurations (8):
plain, noop_connector, mooncake_{producer,consumer,both},
nixl_both, lmcache_only, multi_mooncake_lmcache.
Pre-flight verification gates risky configs (kv_consumer needs dummy
bootstrap, multi-connector composition, NoOp custom class loading).
Workload: two-phase sweep
Phase A: rate {0.5..32} req/s × shape (4096, 256), saturation criteria
Phase B: ref_safe rate × cartesian (input ∈ {512,4k,32k}, output ∈ {64,256,1024})
Step-timing patch enriches vLLM's existing AGENTIC_STEP_LOG_PATH emit
with step_duration_us and build_meta_us — directly measures per-step
substrate cost, not just user-visible TTFT/TPOT.
run_all.sh runs as a 5-stage barrier:
0 pre-flight + apply patch
1 Phase A all configs
2 pick ref_safe / ref_load
3 Phase B all configs
4 revert patch + analyze + plot
Outputs aggregate.{json,csv}, MANIFEST.tsv, and 5 figures.
Estimated runtime: 4-5.5 hours on idle dash0 H20.
665 lines
26 KiB
Markdown
# Microbench 3: KV Connector Substrate Tax (revision 2)

## Goal

Validate the headline claim from
`analysis/characterization/elastic_migration_v2/README.md` Result 1:

> Switching the vLLM launch from plain to `kv_role=kv_both` without ever
> triggering PD-sep already costs **TTFT p90 +45%, TPOT p90 +25%,
> hotspot index +19%**.

That claim was measured on 8 instances with a 1214-request real-trace
replay under saturated coupling. We replicate it with a **single-instance,
synthetic, open-loop** workload so we can:

1. Disambiguate **vLLM-v1-framework cost** from
   **connector-implementation cost** by including a no-op connector.
2. **Validate (or refute) the agentic-coupling amplification** claim:
   if single-instance synthetic numbers ≈ 8-instance trace numbers
   (38–45%), the coupling is not the main cause. If single-instance is
   much smaller, then the 8-instance saturated coupling does most of
   the damage.
3. Make the result **reproducible and auditable**: every run dumps full
   raw artifacts + manifest entry + a re-run script.

---

## Hypotheses (revised based on elastic_migration_v2 prior)

The headline trace-replay numbers from elastic_migration_v2 are our
**prior**, not an open question:

```
trace replay, 8 instances, agentic dispatch coupling, saturated:
  plain TTFT p90  =  7.35 s
  NIXL TTFT p90   = ~10.1 s (+38%)
  Mooncake_both   = 10.67 s (+45%)
```

The microbench validates / refutes / refines these:

| Hypothesis | Prediction | Falsifier / interpretation |
|---|---|---|
| **H1: Substrate tax persists at single instance / synthetic load** | Single-instance Mooncake_both TTFT p90 is ≥ 10% higher than plain at the reference rate | If <10% → trace-replay tax is dominated by 8-instance feedback coupling, not connector machinery |
| **H2: NIXL-vs-Mooncake gap is mechanism-side, not coupling-side** | Single-instance numbers preserve the ~7 pp gap (NIXL tax < Mooncake tax by 5–10 pp) | If gap shrinks/inverts → the gap was a coupling artifact |
| **H3: Framework-vs-implementation split** | `noop_connector` (v1 framework only, all hooks return no-op) tax is < 50% of Mooncake_both tax | Lets us attribute cost between vLLM's connector dispatch loop and the specific connector's per-step work |
| **H4: MultiConnector tax is additive** | tax(Mooncake+LMCache) ≈ tax(Mooncake) + tax(LMCache), within 30% | If super-additive → cross-connector interference; if sub-additive → some shared per-step cost is amortized |
| **H5: Tax is shape-dependent** | tax_TTFT_p90 grows monotonically with input length for Mooncake_both | Confirms E2 audit §6.5 hypothesis (`set(cache.keys())` walks scale with cache size) |
| **H6: Tax compounds in decode** | tax_TPOT_p90 grows with output length | Confirms connector code runs each decode step |

H3 and H4 are the must-have new hypotheses that pulled in the new configs.

---

## Hardware & Model

| Parameter | Value |
|---|---|
| GPU | NVIDIA H20 96 GB × 1 (single instance) |
| Model | Qwen3-Coder-30B-A3B-Instruct |
| TP | 1 |
| `max_model_len` | 200 000 |
| `enable_prefix_caching` | true |
| `enable_chunked_prefill` | true |
| `max_num_batched_tokens` | 8192 |
| `gpu_memory_utilization` | 0.9 |

Single GPU per run. Each configuration is a fresh vLLM launch on GPU 0.

---

## Configurations (8 total, was 6)

| ID | Connector | Role | Why we measure it |
|---|---|---|---|
| `plain` | (none) | n/a | Baseline |
| `noop_connector` | custom `NoOpConnector` (this microbench ships it) | n/a | Isolate **vLLM-v1 framework** cost (build_connector_meta, mixin dispatch, get_finished bookkeeping) without any real connector work (see Note 1) |
| `mooncake_producer` | MooncakeConnector | `kv_producer` | Isolate P-side stack |
| `mooncake_consumer` | MooncakeConnector | `kv_consumer` | Isolate D-side stack (pre-flight gated, see §Pre-flight) |
| `mooncake_both` | MooncakeConnector | `kv_both` | The README claim |
| `nixl_both` | NIXLConnector | `kv_both` | Connector-specific vs framework cost |
| `lmcache_only` | `LMCacheConnectorV1` | n/a | NEW: supplies the LMCache-alone term for H4 |
| `multi_mooncake_lmcache` | MultiConnector(Mooncake `kv_both` + `LMCacheConnectorV1`) | mixed | Stacked-connector check (gated by pre-flight) |

**Note 1: noop_connector (we ship it, not the vLLM-bundled one)**:
The vLLM-shipped `ExampleConnector` is NOT a true no-op. It
implements a debug-grade disk KV cache: it stores match metadata,
serializes safetensors per layer in `save_kv_layer`, etc. (see
`third_party/vllm/.../example_connector.py:345`,
`example_connector.py:250`,
`kv_transfer_utils.py:49`). Using it would conflate framework cost
with disk-I/O + per-layer save cost.

Instead we ship `microbench/connector_tax/tools/noop_connector.py`,
which subclasses `KVConnectorBase_V1` and returns a no-op for **every**
hook:

```python
# Import path as in upstream vLLM v1; adjust if the vendored tree differs.
from vllm.distributed.kv_transfer.kv_connector.v1.base import (
    KVConnectorBase_V1,
    KVConnectorMetadata,
)

class NoOpConnector(KVConnectorBase_V1):
    # Scheduler-side hooks
    def get_num_new_matched_tokens(self, req, num_computed): return 0, False
    def update_state_after_alloc(self, *_args, **_kw): pass
    def build_connector_meta(self, scheduler_output): return KVConnectorMetadata()
    def request_finished(self, *_args, **_kw): return False, None
    # Worker-side hooks
    def start_load_kv(self, *_args, **_kw): pass
    def wait_for_layer_load(self, *_args, **_kw): pass
    def save_kv_layer(self, *_args, **_kw): pass
    def wait_for_save(self): pass
    def get_finished(self, *_args, **_kw): return None, None
```

vLLM loads it via:

```
--kv-transfer-config '{
  "kv_connector_module_path":
    "microbench.connector_tax.tools.noop_connector:NoOpConnector",
  "kv_role": "kv_both"
}'
```

`PYTHONPATH` is set in `launch_noop_connector.sh` so vLLM can resolve
the dotted import path.

If `noop_connector` overhead ≈ 0 → all substrate tax is in connector
implementations. If `noop_connector` overhead ≈ 30% of Mooncake_both
tax → vLLM's framework dispatch alone explains a meaningful slice.

---

## Pre-flight Verification (NEW: gates risky configs)

Three configs depend on infrastructure we can't take for granted. Run
the verification scripts BEFORE the main bench. If a script fails, skip
its config and record SKIP in the manifest.

### `verify_kv_consumer.sh`

1. Start a dummy bootstrap process (`tools/dummy_bootstrap.py`).
2. Launch vLLM with `kv_role=kv_consumer` pointing at the dummy.
3. Curl `/v1/models`; it must return 200 with the model id.
4. Send one short request (`max_tokens=4`) without `kv_transfer_params`;
   it must return 200 in <30 s.

If step 3 or 4 fails, the config is unrunnable and we drop it. We do
not try harder; the trace-replay paper does not promise consumer-only
single-instance numbers.

### `verify_multi_connector.sh`

1. Launch vLLM with `MultiConnector(MooncakeConnector kv_both,
   LMCacheConnectorV1)`.
2. Send 5 sequential requests, `max_tokens=32`, random content.
3. All 5 must complete in <60 s.
4. Verify no engine crashes: `vllm:engine_core_failed_total == 0` from
   `/metrics`.

If any check fails, drop the config and mark SKIP (Manifest column:
"why skipped").

### `verify_noop_connector.sh`

1. Launch with noop_connector active (loaded via `kv_connector_module_path`).
2. Send 5 sequential requests, `max_tokens=32`.
3. Verify all 5 return 200 in <30 s and no engine crash.

This one is unlikely to fail, but it goes through the same verification.

The verification scripts produce `verify_<config>.log` under
`results/preflight/` and a `preflight_status.json` summarizing
skip-or-include decisions for the manifest.

---

## Workload (revised)

**Open-loop, fixed-rate, randomized content, two-phase sweep,
data-driven saturation criteria.**

### Phase A: rate sweep (find saturation per config)

| Parameter | Value |
|---|---|
| Input length | 4096 tokens (random per request) |
| Output length | 256 tokens (`max_tokens=256`, `ignore_eos=True`) |
| Send rates | {0.5, 1, 2, 4, 8, 16, 32} req/s (added 0.5 for low-end calibration) |
| Duration per cell | `max(60 s, time_to_min_completed)` + 10 s warmup |
| Min completed per cell | 200 requests |
| Inflight cap | 256 (excess requests are dropped and logged) |

**Why min_completed = 200**: at p90, the margin of error (90% confidence,
normal approximation) of a Monte Carlo percentile estimate from N samples
is ≈ 1.65 √(0.9 × 0.1 / N) in percentile rank.
For N=200 this is ~3.5% absolute, ~10% relative, which is acceptable. For
N=30 (which 0.5 req/s × 60 s gives) it's ~28% relative, useless for
saturation detection. So at low rates the cell automatically extends:
0.5 req/s → ≥ 400 s, 1 req/s → ≥ 200 s, 2 req/s → ≥ 100 s. At 4 req/s
and above the 60-second floor dominates.
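
The sizing rule above can be checked with a few lines of Python. This is a minimal sketch, not the bench's actual code: the function names are hypothetical, and the window formula assumes the cell is unsaturated so that completions keep pace with sends.

```python
import math


def p90_rank_margin(n: int, z: float = 1.65, q: float = 0.9) -> float:
    """Normal-approximation margin of error on the percentile rank for n samples."""
    return z * math.sqrt(q * (1.0 - q) / n)


def cell_window_s(rate: float, min_completed: int = 200, floor_s: float = 60.0) -> float:
    """Measurement window (excluding warmup): long enough to reach
    min_completed requests at the given send rate, never below the floor."""
    return max(floor_s, min_completed / rate)


for rate in (0.5, 1, 2, 4):
    print(rate, cell_window_s(rate), round(p90_rank_margin(200), 3))
```

The per-cell 10 s warmup is added on top of the window returned by `cell_window_s`.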

Updated Phase A duration per config (rounded; each duration includes the
10 s per-cell warmup):

| Rate (req/s) | Duration (s) | Note |
|---|---|---|
| 0.5 | 410 | extended to hit 200 completed |
| 1 | 210 | extended |
| 2 | 110 | extended slightly |
| 4 | 70 | floor |
| 8 | 70 | floor |
| 16 | 70 | floor |
| 32 | 70 | floor |
| **sum per config (excl. vLLM launch warmup)** | **~1010 s** | |

Total Phase A: 8 configs × (90 s vLLM warmup + 1010 s of cells +
60 s GPU release) = 8 × 1160 s ≈ **155 min**.

### Saturation criteria (data-driven, was hardcoded inflight>8)

A config is **saturated at rate r** if **any** of:

1. `effective_throughput(r) / r < 0.95`: vLLM can't keep up
2. `num_requests_waiting` p50 (from `/metrics`) > 1: vLLM has a visible queue
3. `TTFT p90 (r) / TTFT p90 (r/2) > 1.5`: TTFT grows by more than 1.5×
   when the rate doubles (not evaluated at 0.5 req/s, the lowest rate,
   where r/2 was not measured)

The **per-config saturation rate** is the lowest r that triggers ≥ 1
criterion. We log which criterion fired so reviewers can disagree.
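
A minimal sketch of how the three criteria could be evaluated from per-cell summaries. The function names, the criterion labels, and the `(throughput, waiting_p50, ttft_p90)` tuple layout are assumptions made for illustration; `analyze.py` may organize this differently.

```python
def saturation_reasons(rate, throughput, waiting_p50, ttft_p90, ttft_p90_half=None):
    """Return the labels of the criteria that fire for one Phase A cell."""
    reasons = []
    if throughput / rate < 0.95:          # criterion 1: cannot keep up
        reasons.append("throughput_ratio")
    if waiting_p50 > 1:                   # criterion 2: visible queue
        reasons.append("queue_depth")
    # criterion 3 needs the cell at half the rate; skipped when it is missing
    if ttft_p90_half is not None and ttft_p90 / ttft_p90_half > 1.5:
        reasons.append("ttft_inflation")
    return reasons


def saturation_rate(cells):
    """cells maps rate -> (throughput, waiting_p50, ttft_p90).
    Returns (lowest saturated rate, reasons), or (None, []) if none saturates."""
    for rate in sorted(cells):
        throughput, waiting_p50, ttft_p90 = cells[rate]
        half = cells.get(rate / 2)
        reasons = saturation_reasons(
            rate, throughput, waiting_p50, ttft_p90, half[2] if half else None
        )
        if reasons:
            return rate, reasons
    return None, []
```

Returning the list of labels rather than a boolean keeps the "which criterion fired" information that the manifest records.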

### Reference rate selection (revised)

We define **two reference rates** for Phase B, both computed from
Phase A data:

```
ref_safe = max rate where ALL non-SKIP configs are NOT saturated
ref_load = max rate where plain is NOT saturated
           (some other configs may be saturated here)
```

`ref_safe` measures the **pure substrate per-step tax** under no
queueing.

`ref_load` measures **the tax in the regime closer to deployment**,
where plain is comfortably under-loaded but Mooncake is starting to hurt.
The gap `tax(ref_load) − tax(ref_safe)` is the **non-linear queueing
amplification** of the substrate tax. This is exactly the effect the
reviewer worried about, and we now measure it explicitly instead of
ignoring it.

Both rates are reported. The headline number we cite is `ref_safe`
because it's the cleanest decomposition. The `ref_load` numbers tell
us how much worse the tax gets near saturation.
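
The selection rule can be sketched as follows. The function name and input layout are hypothetical, and the sketch assumes saturation is monotone in rate (a config saturated at rate r is treated as saturated at every higher rate), which the source does not state explicitly.

```python
def pick_reference_rates(saturation_rate_by_config, rates):
    """saturation_rate_by_config maps config -> lowest saturated rate,
    or None if the config never saturated in Phase A.
    Returns (ref_safe, ref_load); either may be None if no rate qualifies."""

    def unsaturated(config, rate):
        sat = saturation_rate_by_config[config]
        return sat is None or rate < sat

    safe = [r for r in rates if all(unsaturated(c, r) for c in saturation_rate_by_config)]
    load = [r for r in rates if unsaturated("plain", r)]
    return (max(safe) if safe else None, max(load) if load else None)


# Illustrative values only, not measured data.
example = {"plain": 16, "mooncake_both": 8, "nixl_both": None}
print(pick_reference_rates(example, [0.5, 1, 2, 4, 8, 16, 32]))
```

Only configs that completed Phase A should appear in the input, so a SKIP'd config cannot drag `ref_safe` down.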

### Phase B: shape sweep (substrate tax across length regimes)

| Parameter | Value |
|---|---|
| Send rate | `ref_safe` (one value, single rate to keep cost bounded) |
| Input lengths | {512, 4096, 32768} tokens |
| Output lengths | {64, 256, 1024} tokens (32 promoted to 64, see Note 2) |
| Duration per cell | `max(60 s, time_to_min_completed)` + 10 s warmup |
| Min completed per cell | 200 requests |
| Cartesian shapes | 3 × 3 = 9 |

The same min-completed extension applies. If `ref_safe ≥ 4 req/s`,
each cell hits the 60 s floor and per-config Phase B cell time is
9 × 70 s = 630 s. If `ref_safe = 2 req/s`, cells extend to 110 s and
per-config cell time is 9 × 110 s = 990 s.

Total Phase B (worst case, `ref_safe = 2`): 8 configs × (90 s warmup
+ 990 s of cells + 60 s GPU release) ≈ **152 min**. Best case
(`ref_safe ≥ 4`): 8 × (90 + 630 + 60) ≈ **104 min**.

If after Phase A we find `ref_load` differs meaningfully from
`ref_safe`, we add a small Phase B' run on `ref_load` for the 4
high-priority configs (plain, mooncake_both, nixl_both, lmcache_only)
on 3 representative shapes (512/256, 4096/256, 32768/256). That is
4 configs × 3 shapes × 70 s ≈ 14 min of cells, a controlled trade-off.

**Note 2: output 64 instead of 32**: with 32 output tokens TPOT is
estimated from 31 inter-token intervals, too few samples for a stable
p90. Bumping to 64 gives 63 samples, comfortable for percentile
estimation. The output=32 regime is also less common in agentic
deployments where a tool result frame is rarely <64 tokens.

### Common settings

| Parameter | Value |
|---|---|
| `temperature` | 0 (deterministic) |
| `ignore_eos` | True (force exact output length) |
| Content | random UUID + hash per request, zero prefix cache hit |
| Concurrent inflight cap | 256 |

---

## Metrics (revised: adds A3 step-level engine_state)

### Client-side (per-request, JSONL)

Same as before: `t_send_ns`, `t_first_token_ns`, `t_last_token_ns`,
`prompt_tokens`, `completion_tokens`, `inflight_at_send`.
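
As an illustration of how the per-request fields turn into latency metrics, here is a small sketch. It assumes TPOT is the mean inter-token interval after the first token (consistent with counting `completion_tokens − 1` intervals); the function name and output keys are hypothetical.

```python
def latency_metrics(rec: dict) -> dict:
    """Derive TTFT, TPOT and E2E (seconds) from one client-side JSONL record."""
    ttft_s = (rec["t_first_token_ns"] - rec["t_send_ns"]) / 1e9
    e2e_s = (rec["t_last_token_ns"] - rec["t_send_ns"]) / 1e9
    intervals = rec["completion_tokens"] - 1  # 64 tokens -> 63 intervals
    decode_s = (rec["t_last_token_ns"] - rec["t_first_token_ns"]) / 1e9
    tpot_s = decode_s / intervals if intervals > 0 else None
    return {"ttft_s": ttft_s, "tpot_s": tpot_s, "e2e_s": e2e_s}


# Illustrative record, not measured data.
record = {
    "t_send_ns": 0,
    "t_first_token_ns": 2_000_000_000,
    "t_last_token_ns": 3_260_000_000,
    "prompt_tokens": 4096,
    "completion_tokens": 64,
    "inflight_at_send": 3,
}
print(latency_metrics(record))
```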

### Server-side `/metrics` sampling (1 Hz)

Captured into `metrics_<cfg>_<phase>_<cell>.jsonl`. Same fields as
prior version.

### Step-level timing instrumentation (NEW: we ship the patch)

The reviewer correctly noted that the existing A3 step log
(`third_party/vllm/.../scheduler.py:953`) only records per-step token
counts and request lists, **not** step duration or per-callback
timing. So we cannot just turn on AGENTIC_STEP_LOG_PATH and get
the "direct evidence" for Figures 6 and 7; that data does not exist yet.

This microbench ships its own scheduler timing patch at
`microbench/connector_tax/patches/scheduler_step_timing.py`, modelled
on the idempotent `microbench/patches/apply_patches.py` we wrote for
Microbench 2. It uses the same `_pd_profile.py` emit pattern.

The patch instruments:

1. `Scheduler.schedule()` entry → `t_step_enter` (perf_counter_ns)
2. `Scheduler.schedule()` exit → `t_step_exit`
3. Around `connector.build_connector_meta(scheduler_output)`
   → `build_meta_us`
4. Around `connector.get_finished(...)` call
   (in `_update_from_output` / mixin)
   → `get_finished_us`
5. Around `connector.start_load_kv(...)` (in the worker mixin
   `_get_kv_connector_output`)
   → `start_load_kv_us` (worker-side; emitted from worker process)

Each step emits one JSONL record:

```json
{
  "t_ns": <step_enter perf_counter_ns>,
  "step_id": <monotonic int>,
  "step_duration_us": <step_exit - step_enter>,
  "build_meta_us": <build_connector_meta duration>,
  "get_finished_us": <connector get_finished duration>,
  "start_load_kv_us": <worker start_load_kv; null on scheduler-only proc>,
  "num_running": <int>,
  "num_waiting": <int>,
  "prefill_tokens": <int>,
  "decode_tokens": <int>
}
```

Output goes to `AGENTIC_STEP_LOG_PATH` (one file per process; we use
`engine_step_<phase>_<cell>.jsonl` paths from launch scripts).
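
To make the record format concrete, the sketch below shows one way an emitter could collect callback timings and append one JSONL line per step. `StepRecorder` and `demo_step` are hypothetical names, not the contents of `_step_profile.py`, and the real patch wraps vLLM scheduler and worker code rather than a stand-in function.

```python
import json
import time
from contextlib import contextmanager


class StepRecorder:
    """Hypothetical emitter: gathers per-step timings, appends one JSONL record."""

    def __init__(self, path):
        self.path = path
        self.step_id = 0
        self.timings_us = {}

    @contextmanager
    def timed(self, field):
        """Time the wrapped block and store the duration (microseconds) under field."""
        start = time.perf_counter_ns()
        try:
            yield
        finally:
            self.timings_us[field] = (time.perf_counter_ns() - start) // 1000

    def emit(self, t_enter_ns, **counters):
        """Write the record for the current step and reset per-step state."""
        record = {"t_ns": t_enter_ns, "step_id": self.step_id,
                  **self.timings_us, **counters}
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
        self.step_id += 1
        self.timings_us = {}


def demo_step(recorder):
    """Stand-in for one scheduler step with a single timed connector callback."""
    t_enter = time.perf_counter_ns()
    with recorder.timed("step_duration_us"):
        with recorder.timed("build_meta_us"):
            pass  # where connector.build_connector_meta(scheduler_output) would run
    recorder.emit(t_enter, num_running=0, num_waiting=0)
```

In a real run the path would come from `AGENTIC_STEP_LOG_PATH`, with one file per process so that scheduler and worker records never interleave.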

Apply / revert is idempotent, using the same `# CONNECTOR_TAX_PATCH`
marker strategy as Microbench 2.

```
microbench/connector_tax/patches/
├── _step_profile.py            # the emitter (ported from _pd_profile)
├── scheduler_step_timing.py    # patch installer / reverter
└── apply.sh                    # invoked by run_all.sh; revert at end
```

**Fallback if the patch fails to apply on a future vLLM version**:
the bench drops to client-side TTFT/TPOT only. Figures 6 (per-step
CDF) and 7 (decomposition stack) are not produced; the manifest
records `step_timing_available=false`. The other figures and the
H1 / H2 / H4 headline numbers do not depend on this patch, so the
bench is still useful in fallback mode.

### Derived (post-processing)

For each (config, rate-or-shape) cell after warmup:

- TTFT/TPOT/E2E p50/p90/p99
- `effective_throughput`, `requested_throughput`, throughput_ratio
- `saturation_flag` (which criterion, if any, triggered)
- (when `step_timing_available=true`):
  - `step_duration_us` p50/p90
  - `build_meta_us` p50/p90
  - `get_finished_us` p50/p90
  - `start_load_kv_us` p50/p90 (worker-process file)
  - `connector_total_us` p50/p90 (sum of the 3 callback timings)

### Substrate tax definition

```
tax_TTFT_p90(X, ref) = TTFT_p90(X, ref) / TTFT_p90(plain, ref) - 1
tax_TPOT_p90(X, ref) = TPOT_p90(X, ref) / TPOT_p90(plain, ref) - 1
tax_step_p50(X)      = step_duration_us p50 (X) - step_duration_us p50 (plain)
tax_callback_p50(X)  = connector_total_us p50 (X)   # plain has no callbacks
```

`tax_step` is the **gross** per-step penalty (any cause).
`tax_callback` is the **callback-attributable** penalty (sum of the
three measured connector hooks). The difference `tax_step −
tax_callback` is the step-time overhead not attributable to instrumented
callbacks: block-pool walks, scheduler-state churn, etc. Reporting
both lets reviewers see whether our instrumentation accounts for the
full cost.
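
The definitions above translate directly into code. In this sketch the function names are hypothetical; the TTFT inputs in the usage line are the trace-replay prior quoted earlier in this document, while the step-level numbers are placeholders, not measurements.

```python
def relative_tax(metric_x: float, metric_plain: float) -> float:
    """tax_TTFT_p90 / tax_TPOT_p90: relative increase over plain at the same ref rate."""
    return metric_x / metric_plain - 1.0


def step_tax_breakdown(step_p50_x, step_p50_plain, connector_total_p50_x):
    """Split the gross per-step penalty (µs) into callback-attributable and unattributed parts."""
    tax_step = step_p50_x - step_p50_plain
    tax_callback = connector_total_p50_x  # plain has no callbacks
    return {
        "tax_step_us": tax_step,
        "tax_callback_us": tax_callback,
        "unattributed_us": tax_step - tax_callback,
    }


print(round(relative_tax(10.67, 7.35), 2))   # trace-replay prior for Mooncake_both
print(step_tax_breakdown(1200, 1000, 150))   # placeholder step timings
```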

---

## Auditability & Reproducibility Plan

### Run artifacts (per config × phase × cell)

```
microbench/connector_tax/results/
  <date>_<config>/
    config.json                        # parameters used
    launch.sh                          # exact vLLM launch command
    vllm_stdout.log                    # full vLLM stdout
    vllm_stderr.log                    # full vLLM stderr
    requests_<phase>_<cell>.jsonl
    metrics_<phase>_<cell>.jsonl
    engine_step_<phase>_<cell>.jsonl   # if A3 active
    summary.json                       # per-cell percentiles
    env.txt                            # pip freeze, vLLM SHA, GPU info
  preflight/
    verify_kv_consumer.log
    verify_multi_connector.log
    verify_noop_connector.log
    preflight_status.json              # which configs are SKIP'd and why
```

### Manifest

`microbench/connector_tax/MANIFEST.md` lists every run with date,
vLLM version + git SHA, Mooncake version, NIXL version, LMCache
version, GPU id (`nvidia-smi -L`), config name, launch command, result
directory, A3-active flag, and skip-status (with reason).

### Re-run script

`microbench/connector_tax/run_all.sh` runs in **five barrier stages**
(0–4). Phase A across all configs must finish before Phase B can pick a
reference rate.

**Stage 0 (pre-flight + patch):**
1. Run `verify_kv_consumer.sh`, `verify_multi_connector.sh`, and
   `verify_noop_connector.sh`. Persist `preflight_status.json`.
2. Apply `microbench/connector_tax/patches/scheduler_step_timing.py`
   to the active vLLM. Record `step_timing_available=true|false`
   in the manifest based on whether the patch applied cleanly.

**Stage 1 (Phase A, all configs, randomized order):**
For each non-SKIP config:
1. `launch_<config>.sh` → wait for `/v1/models`.
2. `bench_loop.py --rates 0.5,1,2,4,8,16,32 --shape 4096,256
   --duration 60 --min-completed 200`.
3. Kill vLLM, wait 60 s for GPU release.
4. Append manifest row.

After **all** configs have finished Stage 1:

**Stage 2 (reference rate selection, CPU only):**
1. Compute saturation flags from each cell using the data-driven
   criteria.
2. Choose `ref_safe` = max rate where ALL configs that completed
   Phase A are not saturated.
3. Choose `ref_load` = max rate where `plain` is not saturated.
4. Persist `reference_rates.json`.

**Stage 3 (Phase B, all configs, randomized order):**
For each non-SKIP config:
1. `launch_<config>.sh` → wait for ready.
2. `bench_loop.py --rate <ref_safe> --shapes 512x64,512x256,
   ...,32768x1024 --duration 60 --min-completed 200`.
3. (If `ref_load != ref_safe`) Run Phase B' for priority configs
   (plain, mooncake_both, nixl_both, lmcache_only) on shapes
   {512x256, 4096x256, 32768x256} at `ref_load`.
4. Kill vLLM, wait 60 s, append manifest row.

**Stage 4 (patch revert + analysis):**
1. Revert the scheduler_step_timing patch.
2. `analyze.py --root results/`.
3. `plot_connector_tax.py`.

A reviewer with a fresh checkout runs:

```
cd microbench/connector_tax
bash run_all.sh
```

and gets the figures + manifest + raw artifacts. The script is
re-runnable: any stage can be skipped via `--skip-stage N` if the
artifacts exist.

### Determinism notes

Same as previous: temperature=0 + ignore_eos give shape determinism;
content varies per request via seeded UUID. We do not promise
bit-exact reproducibility, only distribution-level reproducibility.
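
One possible way to get per-request content that is unique yet reproducible from a seed is sketched below. The exact construction used by `bench_loop.py` is not given here, so the function name, the seed format, and the choice of SHA-256 are all assumptions.

```python
import hashlib
import random
import uuid


def request_prefix(seed: int, index: int) -> str:
    """Return a 'UUID + hash' string that is unique per request index but
    reproducible for a given seed. Placing it at the very start of the
    prompt means no two requests share a leading token run."""
    rng = random.Random(f"{seed}:{index}")          # independent stream per request
    uid = uuid.UUID(int=rng.getrandbits(128), version=4)
    digest = hashlib.sha256(uid.bytes).hexdigest()
    return f"{uid} {digest}"


print(request_prefix(seed=7, index=0))
```

Because the prefix differs from the first characters onward, prefix caching has nothing to match, which is what the "zero prefix cache hit" setting requires.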

### Updated runtime estimate (was 1.5–2 h, **now 4–5.5 h**)

| Phase | Time |
|---|---|
| Pre-flight (3 verify scripts) | 15 min |
| Phase A: 8 configs × (90 s warmup + 1010 s cells + 60 s GPU clear) | 155 min |
| Phase A → ref_safe selection (CPU) | <1 min |
| Phase B (best, `ref_safe ≥ 4`): 8 × (90 + 630 + 60) | 104 min |
| Phase B (worst, `ref_safe = 2`): 8 × (90 + 990 + 60) | 152 min |
| Optional Phase B' (4 configs × 3 shapes × ≥70 s + 4 × 90 s warmup) | 20 min |
| Analysis + figures | 5 min |
| **Total (best case)** | **~4.7 h (~5 h with Phase B')** |
| **Total (worst case)** | **~5.5 h (~5.8 h with Phase B')** |

This estimate is honest: the reviewer's earlier estimate of 2.5–3 h
underestimated how long low-rate cells must run to give stable p90.
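
The rows of the runtime table can be re-derived with a short calculation. This is only an arithmetic sketch; the constants are copied from the tables in this document and the variable names are hypothetical.

```python
PHASE_A_CELLS_S = [410, 210, 110, 70, 70, 70, 70]  # per-rate durations from the Phase A table


def phase_minutes(n_configs, cells_s, launch_warmup_s=90, gpu_release_s=60):
    """Wall-clock minutes for one phase across all configs."""
    return n_configs * (launch_warmup_s + cells_s + gpu_release_s) / 60


phase_a = phase_minutes(8, sum(PHASE_A_CELLS_S))
phase_b_best = phase_minutes(8, 9 * 70)     # ref_safe >= 4 req/s
phase_b_worst = phase_minutes(8, 9 * 110)   # ref_safe = 2 req/s
fixed = 15 + 1 + 5                          # pre-flight, rate selection, analysis (minutes)
phase_b_prime = 20                          # optional Phase B' (minutes)

total_best_h = (fixed + phase_a + phase_b_best) / 60
total_worst_h = (fixed + phase_a + phase_b_worst + phase_b_prime) / 60
print(round(phase_a), round(phase_b_best), round(phase_b_worst))
print(round(total_best_h, 1), round(total_worst_h, 1))
```

Keeping the calculation next to the table makes it easy to re-check the estimate whenever a rate, a floor, or the config count changes.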

---

## Analysis & Figures

### Figure 1: TTFT p90 vs send rate, per configuration (Phase A)

Same as before, now 8 lines plus saturation markers (× per criterion).

### Figure 2: TPOT p90 vs send rate, per configuration (Phase A)

Same.

### Figure 3: Achieved throughput vs requested (Phase A)

Same, 8 lines + y=x reference + saturation knee annotations.

### Figure 4: Substrate tax bar (TTFT p90 + TPOT p90)

At `ref_safe` and `ref_load`, side-by-side bars per non-plain config.
Shows:
- Pure tax (`ref_safe`)
- Tax + non-linear queueing (`ref_load`)
- The gap is the **non-linear queueing amplification** the reviewer flagged.

### Figure 5: Shape-dependent tax heatmap (Phase B)

3×3 heatmap (input × output) of tax_TTFT_p90 for each non-plain
config, laid out in a row: noop_connector, mooncake_both, nixl_both,
lmcache_only, mooncake_producer, multi_mooncake_lmcache (6 heatmaps),
plus mooncake_consumer as a seventh unless pre-flight dropped it.

### Figure 6: Per-step latency CDF, ref_safe rate (Phase A)

X = step duration (μs), Y = CDF, line per config. **The most direct
visualization of "what each step costs."** Shipped only if the A3 step
log is available.

### Figure 7: Tax decomposition stack

For each non-plain config at ref_safe, stacked bar:
- "framework cost" estimated = tax(noop_connector)
- "implementation cost" estimated = tax(this config) − tax(noop_connector)

If `noop_connector` doesn't run (we'd document why), we drop this
figure and report tax as a single number per config.
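
The stacked-bar values reduce to a subtraction per config. A minimal sketch, assuming the tax values are already computed at `ref_safe`; the function name is hypothetical and the numbers in the usage line are placeholders, not results.

```python
def decompose_tax(tax_config: float, tax_noop: float) -> dict:
    """Split one config's tax into the framework share (what noop_connector
    already pays) and the implementation share (everything beyond that)."""
    return {"framework": tax_noop, "implementation": tax_config - tax_noop}


print(decompose_tax(tax_config=0.40, tax_noop=0.10))  # placeholder values
```

A negative implementation share would mean the config measured cheaper than `noop_connector`, which is worth flagging as noise rather than plotting as a negative bar.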

### Figure 8: H4 additivity check

3-bar group: tax(mooncake_both), tax(lmcache_only), tax(multi).
Comparing the sum of the first two against multi visualizes additivity.
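
The additivity comparison can be expressed as a relative gap against the 30% tolerance from H4. The function names and the three-way labels below are an illustrative sketch, and the inputs in the usage line are placeholders, not measurements.

```python
def additivity_gap(tax_multi, tax_mooncake_both, tax_lmcache_only):
    """Relative deviation of the stacked-connector tax from the sum of its parts."""
    return abs(tax_multi - (tax_mooncake_both + tax_lmcache_only)) / tax_multi


def classify_additivity(tax_multi, tax_mooncake_both, tax_lmcache_only, tol=0.30):
    """Label the H4 outcome: additive within tolerance, else super- or sub-additive."""
    if additivity_gap(tax_multi, tax_mooncake_both, tax_lmcache_only) <= tol:
        return "additive"
    expected = tax_mooncake_both + tax_lmcache_only
    return "super-additive" if tax_multi > expected else "sub-additive"


print(classify_additivity(0.60, 0.45, 0.10))  # placeholder values
```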

---

## Risks & Mitigations (revised)

| Risk | Impact | Mitigation |
|---|---|---|
| `kv_consumer` won't start with dummy bootstrap | Skip the config | Pre-flight; documented SKIP in manifest |
| `multi_mooncake_lmcache` crashes engine | Skip the config | Pre-flight |
| NIXL not installed | Skip nixl_both | Tolerant; warn + continue |
| LMCache not installed | Skip lmcache_only AND multi config | Tolerant; warn + continue |
| GPU thermal drift across 3+ h | Skews late configs | Run order randomized; consider running twice on different days and reporting both |
| Open-loop blow-up at 32 req/s | Memory blowup | Inflight cap 256, drop with logged counter |
| Cold-start of first request | Inflates mean TTFT | 10 s warmup discarded |
| `scheduler_step_timing` patch fails to apply on a future vLLM version | Lose Figures 6 and 7 | Document `step_timing_available=false` in manifest; H1/H2/H4 still report from client-side TTFT/TPOT |
| `noop_connector` import fails (PYTHONPATH or class signature) | Lose Figure 7 and the H3 falsifier | Pre-flight `verify_noop_connector.sh` catches this; report SKIP in manifest |

---

## Success Criteria (revised)

1. **H1 falsifiable**: tax_TTFT_p90 for `mooncake_both` at `ref_safe`
   is reported. We accept the prior (≈45%) if measurement is in
   [25%, 60%]; we **revise the prior** if outside.
2. **H2 testable**: NIXL-vs-Mooncake gap at `ref_safe` is reported.
   The trace-replay difference was ~7 pp. We document agreement /
   disagreement.
3. **H3 disambiguated**: tax(noop_connector) at `ref_safe` is
   reported. We label substrate tax as
   "framework-cost-dominated" if noop_connector ≥ 50% of
   mooncake_both tax, "implementation-cost-dominated" if < 30%.
4. **H4 additivity**: |tax(multi) − (tax(mooncake_both) +
   tax(lmcache_only))| / tax(multi) ≤ 0.30 → linear.
5. **H5 + H6 directional**: report whether tax_TTFT_p90 grows with
   input and tax_TPOT_p90 grows with output (sign + magnitude).
6. **All artifacts present**: every config that ran has every file
   listed under Run artifacts; every SKIP config has a reason in
   `preflight_status.json`.
7. **Bench finishes < 6 h** wall clock on idle dash0
   (Phase A + Phase B + optional Phase B' combined; reflects min-completed
   extension at low rates).

---

## Out of Scope

- Multi-node Mooncake (RDMA over actual network).
- Patching Mooncake or vLLM to optimize the substrate (the point of
  this microbench is to measure baseline cost as shipped).
- Varying `chunk_size`, `max_num_seqs`, or other vLLM scheduler
  parameters; fixed at trace-replay defaults.
- Chunk-boundary effects (input ∈ {8192, 16384}). The reviewer noted
  this is a real follow-up, but adding it doubles Phase B runtime.
  Documented as a follow-up if Phase B shows shape-dependent tax that
  can't be explained by total token count.

---

## Cross-references

- `analysis/characterization/elastic_migration_v2/README.md`: the
  trace-replay paper this microbench validates / refutes.
- `microbench/interference/`: Microbench 1 (B2 same-worker
  interference; complementary).
- `microbench/lifecycle/`: Microbench 2 (PD-sep transfer breakdown;
  uses different vLLM patches).
- `microbench/patches/`: `_pd_profile.py` template if A3 fallback
  is needed.

---

## Files

```
microbench/connector_tax/
├── DESIGN.md                          # this file
├── MANIFEST.md                        # filled per run
├── tools/
│   ├── noop_connector.py              # custom NoOpConnector for H3
│   ├── dummy_bootstrap.py             # for kv_consumer pre-flight
│   ├── verify_kv_consumer.sh
│   ├── verify_multi_connector.sh
│   └── verify_noop_connector.sh
├── patches/
│   ├── _step_profile.py               # event emitter (ports _pd_profile)
│   ├── scheduler_step_timing.py       # idempotent install/revert
│   └── apply.sh                       # invoked by run_all.sh
├── launch/
│   ├── launch_plain.sh
│   ├── launch_noop_connector.sh
│   ├── launch_mooncake_producer.sh
│   ├── launch_mooncake_consumer.sh
│   ├── launch_mooncake_both.sh
│   ├── launch_nixl_both.sh
│   ├── launch_lmcache_only.sh
│   └── launch_multi_mooncake_lmcache.sh
├── bench_loop.py                      # open-loop loadgen (--min-completed)
├── metrics_sampler.py                 # /metrics scraper
├── analyze.py                         # raw → percentiles + saturation flags
├── plot_connector_tax.py              # all figures
├── run_all.sh                         # 5-stage barrier orchestrator
├── results/<date>_<config>/           # per-run artifacts
└── results/preflight/                 # pre-flight verification
```