- plot_pd_crossover.py fig_conc: lead with request-completion % (the honest
collapse signal; latency percentiles count successes only), then mean-E2E /
TPS; note PD-capped/colo-uncapped in the title.
- add microbench/fresh_setup/gpu_monitor.sh (referenced by the committed
mb5_run_gpu.sh:73 for per-GPU util collection).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reuse and concurrency axes redone with proper controlled variables, plus
the orchestration used to run them on dash0:
- run_reuse_fixed.sh: hold REAL prefill work (delta) constant, vary only
cached prefix -> reuse = C/(C+U). Supersedes old fig1 (which held
input=8192 and sliced prefix out, confounding "more reuse" with "less
prefill").
- run_conc.sh: agentic-corner config (in=32768, delta=512, reuse=0.984,
out=128) that exposes PD's structural KV-transfer tax. Supersedes old fig3.
- run_campaign{,2,3}.sh, backfill_d2048o128.sh: serial campaign drivers
(strictly one driver at a time), out=128 sweeps, PD wall-cap for
collapse-draining high-reuse arms, and flaked-arm backfill.
- mb5_run_gpu.sh: per-config bring-up / replay / teardown orchestrator.
- plot_pd_crossover.py: render the reuse_compare figures from fig_agg dumps.
- fig_agg.py: tolerate null stats from fully-collapsed arms (0 successes
write the stat keys as null; `dict.get(k, {})` returns null, not {}).
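
The null-stats pitfall in the last bullet can be reproduced in a few
lines. This is a minimal sketch, not the actual fig_agg.py code; the key
names (`ttft_ms`, `e2e_ms`, `p50`) and the helper `stat_field` are
hypothetical.

```python
import json

def stat_field(record, key, field, default=None):
    """Return record[key][field], tolerating a missing or null stat block."""
    # `or {}` covers both a missing key and an explicit JSON null.
    stats = record.get(key) or {}
    return stats.get(field, default)

# A fully collapsed arm (0 successes) writes the stat key as null.
collapsed = json.loads('{"ttft_ms": null, "e2e_ms": {"p50": 1200.0}}')

# The key exists, so the {} default is never used and None comes back.
naive = collapsed.get("ttft_ms", {})
print(naive)                                     # None
print(stat_field(collapsed, "ttft_ms", "p50"))   # None
print(stat_field(collapsed, "e2e_ms", "p50"))    # 1200.0
```

The `or {}` form also turns other falsy values into an empty dict, which
is harmless here because a stat block is either a dict or null.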
Data: fig1_reuse_fixed.json, fig1_reuse_d{1024,2048}_o128.json
Figs: reuse_compare_AB.png, reuse_compare_ABC.png
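
As a worked illustration of the reuse = C/(C+U) definition used by
run_reuse_fixed.sh: with delta (U) held constant, only the cached prefix
(C) moves the ratio. The function name is hypothetical; of the input
sizes, only the 32768/512 pair comes from the run_conc.sh config above,
the others are illustrative.

```python
def reuse_ratio(cached_tokens, delta_tokens):
    """reuse = C / (C + U): fraction of the input served from cached prefix."""
    return cached_tokens / (cached_tokens + delta_tokens)

DELTA = 512  # real prefill work, held constant across arms

for total_input in (1024, 8192, 32768):
    cached = total_input - DELTA
    print(total_input, reuse_ratio(cached, DELTA))

# The agentic-corner config: in=32768, delta=512.
print(round(reuse_ratio(32768 - 512, 512), 3))   # 0.984
```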
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Follow-up to the LMetric sweep: rerun with --policy linear (cache-aware
load + sticky session affinity, the cache_aware_proxy default) and cap
each PD-disagg arm at 2x the colo bench wall (SIGTERM bench.sh once cap
is exceeded; the cleanup trap clears vLLM and proxy; capped runs lack
metrics.summary.json so the analysis script computes from raw
metrics.jsonl).
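
The wall cap described above can be sketched as follows. This is an
illustration under assumptions, not the campaign driver itself: the
helper name, the grace period, and the stand-in child process are all
hypothetical, and the real cap would be 2x a measured colo wall rather
than a fraction of a second.

```python
import signal
import subprocess
import sys

def run_with_wall_cap(cmd, cap_s, grace_s=10.0):
    """Run cmd; send SIGTERM once cap_s elapses. Returns (returncode, capped)."""
    proc = subprocess.Popen(cmd)
    try:
        return proc.wait(timeout=cap_s), False
    except subprocess.TimeoutExpired:
        # SIGTERM (not SIGKILL) so the child's cleanup trap can still run.
        proc.send_signal(signal.SIGTERM)
        try:
            return proc.wait(timeout=grace_s), True
        except subprocess.TimeoutExpired:
            proc.kill()  # last resort if the child ignores SIGTERM
            return proc.wait(), True

colo_wall_s = 0.2  # stand-in; a real colo wall is hundreds of seconds
sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
returncode, capped = run_with_wall_cap(sleeper, cap_s=2 * colo_wall_s)
print(capped)
```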
Headline: the success-rate ceiling is policy-invariant.
  arm         linear (capped at 2x)       lmetric (uncapped)
  colo        807/807 = 100%,  964s       807/807 = 100%,  1021s
  pd6 (6:2)   472/807 =  58%, 2280s ⊗     474/807 =  59%,  3325s
  pd4 (4:4)   349/807 =  43%, 2281s ⊗     348/807 =  43%,  6850s
  pd2 (2:6)   176/807 =  22%, 2280s ⊗     180/807 =  22%, 19275s
Routing affects only how much wall is wasted timing out unreachable
requests at 600s each. Linear hits the same ceiling in 2280s as
LMetric does in 3300-19000s. This *strengthens* the §5 D-pool
capacity-ceiling thesis -- the cap is structural, not a routing
artifact.
Artifacts:
analysis/v2/fig4r_linear.json -- 4-arm linear summary
analysis/v2/PD_DISAGG_LMETRIC.md -- extended with wall-cap section
figs/v2/fig4_linear_vs_lmetric.png -- 3-panel side-by-side comparison
microbench/fresh_setup/plot_fig4_linear_vs_lmetric.py
Anchor experiment for the clean-stack PD comparison using the canonical
cache-aware proxy with --policy lmetric (scripts/bench.sh harness). Two
traces x four arms = eight runs on dash1.
Headline: with the right routing baseline (LMetric), PD-colo holds 100%
completion on both traces while every static PD-disagg ratio fails
(14-65% completion), and the failure mode rotates with the split --
no static partition has a working operating point on this workload.
LMetric improves colo dramatically (TTFT p50 1.0s vs original §3 RR
7.0s; 7x) but does NOT rescue PD-disagg, confirming the bottleneck is
structural (D-pool admission + multi-turn KV accumulation), not routing.
Completion matrix:
  arm         first600s   full
  colo        100%        100%
  pd6 (6:2)   58.7%       65.3%   (decode-bound)
  pd4 (4:4)   43.1%       43.9%   (both bottlenecks)
  pd2 (2:6)   22.3%       13.9%   (prefill-bound)
The original §3 RR "100% PD completion" appears to be a measurement
artifact of e13391e: producer-KV eviction acted as a relief valve,
letting more requests squeeze under the 600s timeout at the (uncosted)
price of cross-turn re-prefill. With the eviction off, PD-disagg is
worse than §3 advertised, not better.
Artifacts:
analysis/v2/fig4l_lmetric.json -- 8-arm summary data
analysis/v2/PD_DISAGG_LMETRIC.md -- writeup + reproduce recipe
figs/v2/fig4_lmetric_pd_vs_colo.png -- 4-panel comparison figure
microbench/fresh_setup/plot_fig4l_lmetric.py -- plot script
Adds dated, non-destructive correction notes to the contaminated PD-vs-colo
artifacts after the producer-eviction bug (`evict_blocks(sent_block_ids)` on
`finished_sending`, deployed over the "fresh" pip vLLM by
`scripts/deploy_vllm_patches.sh`) was found and gated behind
`VLLM_EVICT_SENT_BLOCKS` (default off).
  PD_DISAGG_RESULTS.md
      top CORRECTION banner + §6 RETRACTED marker. §6 (session-affinity
      hot-pin) was an `e13391e` artifact under controlled concurrency;
      §3 RR, §4 TPOT win, §5 D-pool ceiling, §5.1 consumer crash stand.
  RESULTS_SUMMARY.md §4
      confirm+refine note: clean ablation confirms the D-pool capacity
      thesis and adds regime-dependence.
  pd_separation_analysis.md
      scoped caution: thesis confirmed; flags only reuse-dependent
      figures for cross-check (this study used a different stack).
  figs/mb5/CORRECTION.md
      flags mb5_producer_hotspot.png as retracted; §3 RR and §5 D-pool
      figures stand.
- b3_isolated_policy.sh: HEALTH_MAX_TRIES is now env-overridable (default
unchanged at 180 tries = 360s); slow-node launches can pass
HEALTH_MAX_TRIES=300 (600s) to ride out a single-instance startup flake
without aborting the whole arm.
- run_5policy_both_modes.sh: runs run_5policy_600s.sh twice on the SAME ttp
trace with REPLAY_DISPATCH_MODE={tracets,thinktime}, so the only variable is
dispatch mode. Outputs to outputs/policy5_600s_{mode}_<date>/.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gen_synthetic_trace.py --mode regular: maximally-regular multi-turn trace
(fixed prefix/delta/turns, constant arrivals, zero session skew) to isolate
the structural PD cost (per-turn full-context transfer + P/D capacity split)
from the skew/hot-pin artifact.
analysis/crossover/: SLO-goodput PD_advantage sweeps bracketing the
prefill<->decode bottleneck axis (D1 grow input -> prefill-bound; D2 grow
output -> decode-bound). figs/crossover_pd_advantage.png shows the crossover
(y=1) with the agentic operating region annotated.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600
trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical
APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder,
lower-locality regime; whitelisted alongside the parent anonymized trace.
analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on
the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero
knobs — TTFT p90 7983ms vs tuned A+B 11562 (-31%), E2E p90 -16%, best request
balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is
E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B
was tuned on full w600 yet is beaten by the knob-free policy on this regime.
Includes the run_5policy_600s.sh repro driver.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Per-port vllm:prefix_cache_{queries,hits}_total -> instance_apc.txt. For PD
this is the only honest reuse signal: producer ports show cross-turn prefix
hits, while the consumer's per-request cached_tokens just counts transferred KV.
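
A minimal sketch of turning the two counters into a per-port hit ratio,
assuming the usual Prometheus text exposition format. The parser is
deliberately simple, the helper names are hypothetical, and the sample
values are invented for illustration.

```python
def counter_total(metrics_text, name):
    """Sum all samples of one counter in Prometheus text exposition format."""
    total = 0.0
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        sample, _, value = line.rpartition(" ")
        metric = sample.split("{", 1)[0]  # drop the label set, if any
        if metric == name:
            total += float(value)
    return total

def prefix_hit_ratio(metrics_text):
    queries = counter_total(metrics_text, "vllm:prefix_cache_queries_total")
    hits = counter_total(metrics_text, "vllm:prefix_cache_hits_total")
    return hits / queries if queries else 0.0

# Invented sample scrape from one port.
sample = """\
# TYPE vllm:prefix_cache_queries_total counter
vllm:prefix_cache_queries_total{model_name="m"} 1000.0
vllm:prefix_cache_hits_total{model_name="m"} 250.0
"""
print(prefix_hit_ratio(sample))   # 0.25
```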
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Real-time engine state is NOT the routing lever. Across 6 policies × es0/es1,
real state reshuffles 44-76% of decisions but never beats the champion
(unified+A+B, p90 7.62s). The effect's SIGN is set by reactivity: one-shot
placement (sticky) HELPS -26%; per-request affinity-dominated is a wash;
per-request pure-load (lmetric +17%, load_only +27%) HURTS via herding (stale
shadow was a dampener). Feed verified fresh (median 25ms, <=92ms during
prefills). Prior shadow-state results stand. ES_ABLATION_RESULTS.md has the
table + mechanism; run_full_ablation.sh / fresh_sampler.py / cmp_es.py are the
harness.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Brings the worktree-mooncake-layerwise line (layerwise Mooncake connector,
write-mode proxy, real engine-state feed + eff_ accessors, mb7 microbench,
v3 trace re-profile, A/B x migration matrix runner) into main so the repo
is self-contained for these experiments. Disjoint paths
(microbench/connector_tax/layerwise/*) => clean merge.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
gen_synthetic_trace (vanilla Poisson, zero prefix reuse — the regime where
PD-disagg is expected to win), mutate_trace (morph reuse/burst/skew toward
the agentic regime), and plot_crossover. Emits the replayer's JSONL schema.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Standalone smoke tests validating KV-migration correctness paths before
trace replay: full migrate-cache, partial-prefill transfer, and a
NIXL-connector variant, each with a runner.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
MIGRATION_TRANSFER_COST.md: under real load, migration KV transfer runs at
~3 GB/s vs ~10 GB/s idle. Decomposed (instruments + MB6 microbench) into
~55% RDMA-actual (HBM/PCIe contention with running kernels: 7.6->4.0 GB/s)
+ ~45% control-plane GIL starvation during long prefills. Reproduced on a
fresh upstream venv (byte-identical transfer path) -> upstream/hardware
inherent, not our patch. Layerwise is the wrong lever; the tax is structural
on a loaded agentic cluster. Includes mb6_transfer_under_load + run_mb6,
instrument_dst_migration/mooncake, and the dst/transfer decomposition analyzers.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
cache_aware_proxy: add lmetric_decode_weight (decode-load penalty in the
LMetric fallback score) and a v3 anti-hotspot recent-migration penalty
(effective_load = num_req + recent-migration count over a sliding window),
preventing back-to-back migration clustering. UNIFIED_ABLATION.md documents
the A (overload_factor=1.3) + B' (decode-weight, max(num_req,1)) + RaceFix
sweep: A+B'+RaceFix reaches TTFT p90 7770ms, beating v3 PD-sep migration by
~20%. Runners/analyzer for the b3 trace replay included.
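
The recent-migration penalty can be sketched as a sliding window of
timestamps per instance. This is an illustration of the
effective_load = num_req + recent-migration count idea only; the class
name and the 30 s window are assumptions, not the cache_aware_proxy code.

```python
from collections import deque

class MigrationPenalty:
    """effective_load = num_req + migrations received in the last window_s."""

    def __init__(self, window_s=30.0):  # window length is a made-up default
        self.window_s = window_s
        self._recent = deque()          # arrival times of recent migrations

    def record_migration(self, now):
        self._recent.append(now)

    def effective_load(self, num_req, now):
        # Drop migrations that have aged out of the sliding window.
        while self._recent and now - self._recent[0] > self.window_s:
            self._recent.popleft()
        return num_req + len(self._recent)

inst = MigrationPenalty(window_s=30.0)
inst.record_migration(now=0.0)
inst.record_migration(now=5.0)
print(inst.effective_load(num_req=3, now=10.0))   # 5
print(inst.effective_load(num_req=3, now=32.0))   # 4
print(inst.effective_load(num_req=3, now=40.0))   # 3
```

An instance that just received migrations looks busier for one window,
which steers the next migration elsewhere.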
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Direct per-producer KV-pool evidence for the session-affinity backfire.
At the same 4P+4D ratio:
- round-robin: 4 producers within 1pp of each other (spread 0pp, CV 0.01)
- session-affinity: spread 49pp (one producer ~93%, another 45%; CV 0.25)
A 25x jump in producer load imbalance — heavy multi-turn sessions
concentrate onto single producers, the same hot-pinning pathology as
sticky routing in the colocated §3.3 study.
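
For reference, the two imbalance numbers quoted above (spread in
percentage points and CV) can be computed as below. The utilization
lists are illustrative, not the measured per-producer data, and the use
of population standard deviation is an assumption about how CV was
derived.

```python
from statistics import mean, pstdev

def imbalance(utilization_pct):
    """Return (spread in percentage points, coefficient of variation)."""
    spread_pp = max(utilization_pct) - min(utilization_pct)
    cv = pstdev(utilization_pct) / mean(utilization_pct)
    return spread_pp, cv

# Illustrative values only.
balanced = [70.0, 70.5, 69.5, 70.0]
skewed = [93.0, 45.0, 60.0, 62.0]
print(imbalance(balanced))
print(imbalance(skewed))
```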
plot_producer_hotspot.py: reduce (numpy, per-producer KV timeline from
snapshots, runs on the serving host) + plot (matplotlib, 2-panel rr vs
session comparison) — same two-stage pattern as aggregate_mb5.py.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Swept session-affinity P routing (MB5_P_ROUTING=session) across all
four ratios on the metrics-fixed stack. Findings:
- Strictly worse than round-robin at every ratio. 4P+4D: round-robin
100% vs session-affinity 36% completion.
- Success DECREASES monotonically as decode capacity grows
(6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the
"session prefill is faster so it needs more D" hypothesis.
- GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster
stalls on KV-transfer/admission coordination, not compute. This is
the deepest anti-PD argument: paid-for hardware does nothing while
requests pile up; colocation keeps every GPU busy.
- Mechanism: session-affinity pins heavy multi-turn sessions onto
single producers (producer hot-pinning, same pathology as sticky
routing in the colocated §3.3 study); fewer producers -> worse
concentration -> the monotonic decline. Failed transfers also pin
producer KV (kv_load_failure_policy=fail), compounding to deadlock.
Verdict: neither ratio tuning nor routing policy rescues static
PD-disagg for this agentic workload — the failure is structural.
mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
InstanceState.eff_{num_requests,pending_prefill,ongoing_decode,ongoing_tokens}
= max(shadow, real) when feed fresh (fixes 30s-stale under-count, keeps
in-flight RaceFix), plus real-only r_max_prefill_remaining / r_kv_used_frac.
Wired into load_only, lmetric, sticky, unified(_kv_both), unified_v3, and
snapshot logging. Feed off => identical to before. run_v3_trace.sh gains ES=1
toggle (always deploys enhanced proxy); run_ablation_es.sh runs each config
ES0-vs-ES1 to test whether real state changes policy performance/ranking.
All unit-tested without GPU.
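
The eff_ accessor rule can be sketched as below: take the larger of the
shadow and real counts while the feed is fresh, otherwise fall back to
the shadow count. The class name, field names, and the 1 s freshness
threshold are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class InstanceView:
    shadow_num_requests: int           # proxy-side counter (in-flight aware)
    real_num_requests: Optional[int]   # from the engine-state feed, if any
    feed_age_s: Optional[float]        # seconds since the last feed sample

    def eff_num_requests(self, max_age_s: float = 1.0) -> int:
        fresh = (
            self.real_num_requests is not None
            and self.feed_age_s is not None
            and self.feed_age_s <= max_age_s
        )
        if not fresh:
            return self.shadow_num_requests  # feed off or stale: old behavior
        return max(self.shadow_num_requests, self.real_num_requests)

print(InstanceView(2, 5, 0.03).eff_num_requests())       # 5
print(InstanceView(4, 1, 0.03).eff_num_requests())       # 4
print(InstanceView(2, 5, 12.0).eff_num_requests())       # 2
print(InstanceView(2, None, None).eff_num_requests())    # 2
```

Taking the max keeps requests the proxy has dispatched but the engine
has not yet reported, while still correcting a shadow under-count.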
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
vLLM scheduler publishes real state (running/waiting, KV free, and the
max-in-progress-prefill signal /metrics lacks) to a tmpfs/redis store ~20Hz;
router reads it and avoids GIL-stall (mid-large-prefill) + KV-capacity-wall
targets, using real load over 30s-stale shadow counters. Components:
engine_state.py (canonical+reader), instrument_engine_state.py (scheduler
patch, file/redis writer), migration_target.py (scorer), proxy wiring
(--engine-state-uri, off=unchanged). All unit-tested without GPU; not yet
run live. See P2_ENGINE_STATE.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
1213/1214 success; matched migrations (4 common) improved -2.6 to -7.2s,
scaling with prefill hidden behind transfer. Trace-level TTFT p90 -6% / p99
-5% (modest: migrations are 2% of reqs and partly queue-bound). Confirms
layer-wise removes the transfer half of migration overhead but not the
control-plane/queue residual. DESIGN.md updated with results.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Scheduler tracks per-producer block_ids (accumulated from scheduler_output)
and emits per-step LWSendMeta with cumulative computed_tokens. Worker
lw_wait_for_save records a CUDA event per step and enqueues progress; the
sender loop drains that queue and ships, in order, only blocks that are
computed, wanted by the destination, and not yet shipped (correct under
chunked prefill). State is kept per transfer, so concurrent transfers are
safe. Keeps the v1 single-transfer version as reference.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
mb7 with background decode load (8/instance). Critical-path transfer overhead
stays ~constant ~90ms for layerwise vs 158/239/749ms baseline (up to 7.9x at
32k), prefill not slowed, KV correct. Confirms the overlap holds on busy
instances. DESIGN.md updated with idle-vs-load table + the two blockers
(chunk-safety, concurrent-transfer safety) that the full 1200-req trace needs.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Implements per-layer KV push during prefill (write mode) on vLLM's
MooncakeConnector, env-gated by MOONCAKE_LAYERWISE=1. 2-instance microbench
(mb7) shows correctness (KV lands, cached==prompt) and that the transfer is
hidden behind prefill compute: critical-path overhead drops from O(KV size)
(123/202/529ms for 8k/16k/32k) to a flat ~58ms (2-9x), with no prefill
slowdown, on idle instances. Caveats: idle-only, chunked-prefill disabled,
single concurrent transfer — see DESIGN.md.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Root cause of the 6P+2D run-to-run collapse (rep1 100%, rep2 56%,
rep3 80%, session-routing 6.6%): not load-shedding, but a consumer
EngineCore crash.
Failure chain observed in the consumer logs:
1. D-pool fills to ~97% (decode-side capacity ceiling, the H1 story)
2. a large request's KV transfer fails: "Mooncake transfer engine
returned -1" (112k-token request, pool full)
3. scheduler fails the request (kv_load_failure_policy=fail)
4. PromptTokenStats.local_cache_hit = num_cached + recomputed -
num_external_computed goes NEGATIVE (external transfer exceeded
cached count)
5. loggers.record() calls Counter.inc(negative) -> prometheus raises
"Counters can only be incremented by non-negative amounts."
6. EngineCore dies -> every subsequent request fails (the cliff:
all successes in the first ~110s, zero after)
This turns ONE failed request into a total config collapse, and is
what made the round-robin 6P+2D reps look randomly variable.
Fix: clamp the three per-source prompt-token counts to >= 0 in
loggers.record() before they hit Counter.inc(). Pure insertion,
revertible via the existing sentinel mechanism. Lets a transfer
failure stay a single failed request instead of killing the engine,
so routing arms can be compared on equal footing.
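
The clamp can be illustrated with a stand-in counter. This is a sketch
of the idea only: `StrictCounter` mimics the non-negative check quoted
in step 5, the function name is hypothetical, the token counts are
invented, and only one of the three per-source counts is shown.

```python
class StrictCounter:
    """Stand-in for a Prometheus-style counter that rejects negative increments."""

    def __init__(self):
        self.value = 0.0

    def inc(self, amount):
        if amount < 0:
            raise ValueError(
                "Counters can only be incremented by non-negative amounts."
            )
        self.value += amount

def record_local_cache_hit(counter, num_cached, recomputed, num_external_computed):
    local_cache_hit = num_cached + recomputed - num_external_computed
    # Clamp before the counter sees it, so a failed transfer cannot raise here.
    counter.inc(max(local_cache_hit, 0))
    return local_cache_hit

counter = StrictCounter()
raw = record_local_cache_hit(
    counter, num_cached=1024, recomputed=0, num_external_computed=4096
)
print(raw, counter.value)   # -3072 0.0
```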
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
aggregate_mb5.py:
- Split the cluster KV timeline by role (P-pool vs D-pool) using a
PID->role map parsed from vllm_logs filenames. The cluster average
hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool
is actually pegged at ~100% while prefill idles at ~30%.
- Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving
host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced
(matplotlib) renders locally. matplotlib import is now lazy.
- New plot_role_split figure + p/d peak/steady columns in the CSV.
PD_DISAGG_RESULTS.md: consolidated writeup with figures inline.
Verdict: no static P:D ratio beats 8C colocation. The binding
constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D,
P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays
elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner,
the MB1 phase-isolation benefit is real) but loses TTFT and sheds
load. Round-robin P routing also zeroes prefix-cache reuse; a
session-affinity re-run of 6P+2D is in flight to test the fix.
Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization,
mb5_latency_compare + mb5_summary.csv.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The upstream mooncake_connector_proxy round-robins both P and D
selection. For agentic multi-turn sessions this destroys prefix-cache
reuse on the producer side — every turn of a session lands on a
different P, so the prefix-cache hit ratio collapses to 0 (observed in
the 6P+2D round-robin baseline) and every turn re-prefills from
scratch, piling extra load on the P pool.
Add an env-gated routing mode so the same proxy serves both arms of a
clean A/B:
MB5_P_ROUTING=rr round-robin (default, = upstream behavior)
MB5_P_ROUTING=session consistent md5 hash on X-Session-Id -> same
producer for all turns of a session
Decode side stays round-robin (load balance) in both modes — decode
KV is freshly transferred per turn, so D gains nothing from affinity
but everything from even load spreading.
mb5_launch.sh threads MB5_P_ROUTING through to the proxy and logs the
active mode. Default path is byte-for-byte the old behavior, so an
in-flight round-robin sweep is unaffected if this is redeployed.
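
A minimal sketch of the session-routing idea: hash the X-Session-Id with
md5 and index into the producer list. The function name and endpoints
are hypothetical, and the exact digest-to-index mapping in the proxy may
differ.

```python
import hashlib

def pick_producer(session_id, producers):
    """Map a session id to one producer, stable across processes and restarts."""
    digest = hashlib.md5(session_id.encode("utf-8")).digest()
    return producers[int.from_bytes(digest[:8], "big") % len(producers)]

producers = ["p0:8000", "p1:8001", "p2:8002", "p3:8003"]  # hypothetical endpoints
for turn in range(3):
    # Every turn of the same session maps to the same producer.
    print(turn, pick_producer("session-42", producers))
```

Unlike Python's built-in `hash()`, md5 is not salted per process, so the
mapping survives a proxy restart. This modulo form remaps most sessions
if the producer count changes, which is fine for a fixed-ratio run.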
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Phase 0 infrastructure (vendored proxy, dual-file vLLM patcher,
per-instance + cross-config plotters) is fully assembled and
smoke-validated. Sweep RUN_TAG=20260527_164040 (4 configs × 3 reps
on w600) is running on dash1.
Also realigned the figure list with what `aggregate_mb5.py`
actually produces (mb5_kv_timeline, mb5_peak_utilization,
mb5_latency_compare, mb5_summary.csv).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reads sweep root + tag, for each (config, rep):
- merges per-PID snapshots into cluster-wide KV timeline (carry-forward
for PIDs without a sample in the bin)
- computes peak (max) and steady-state (10-90% median) pool utilization
- pulls latency p50/p90/p99 from replay_metrics.summary.json
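
The carry-forward merge in the first bullet can be sketched as below.
The function name and sample format are assumptions; in particular,
using the last sample inside each bin (rather than a mean or max) is one
possible choice, not necessarily what aggregate_mb5.py does.

```python
def merge_carry_forward(per_pid, bin_s, n_bins):
    """Cluster-wide used-block timeline from per-PID (t_seconds, used) samples.

    A PID with no sample in a bin contributes its last known value; a PID
    that has not reported yet contributes 0.
    """
    timeline = [0] * n_bins
    for samples in per_pid.values():
        ordered = sorted(samples)
        idx, last = 0, 0
        for b in range(n_bins):
            bin_end = (b + 1) * bin_s
            # Advance to the latest sample that falls before this bin ends.
            while idx < len(ordered) and ordered[idx][0] < bin_end:
                last = ordered[idx][1]
                idx += 1
            timeline[b] += last
    return timeline

per_pid = {
    101: [(0.2, 10), (1.5, 30), (2.1, 20)],
    202: [(0.4, 5), (2.7, 15)],   # no sample in bin 1: carry 5 forward
}
print(merge_carry_forward(per_pid, bin_s=1.0, n_bins=3))   # [15, 35, 35]
```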
Produces 4 outputs in --out-dir:
- mb5_kv_timeline.png — N-panel cluster KV % over time, one panel per
config, faint per-rep lines + bold median
- mb5_peak_utilization.png — bar chart (peak vs steady) with ±std error bars
- mb5_latency_compare.png — bar chart p50/p90/p99 e2e latency per config
- mb5_summary.csv — flat per-(config, rep) table for the writeup
Validated on 4P+4D × 20-req smoke:
4P+4D rep1: peak=12.8% steady=10.7% peak_wait=1
p50=1.3s p90=10.5s p99=17.1s (vs. <1s for 8C — expected gap).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three independent bugs were blocking PD-disagg smoke; each fix is
isolated so the next PD experiment doesn't re-hit them.
1. mb5_launch.sh
- stop_all() also kills mb5_pd_proxy.py (our vendored copy),
not just the upstream filename, and asserts ports 8000-8007 +
PROXY_PORT are free before launching — stale proxies were
silently passing the readiness check.
- Proxy readiness uses a generic "any HTTP response" probe;
mooncake_connector_proxy only exposes /v1/completions so
/v1/models 404 is expected.
2. mb5_pd_proxy.py (vendored from third_party so deploy.sh ships it)
- Force min_tokens=1 on the prefill leg. Clients that set
min_tokens == max_tokens (our replayer does) collide with
vLLM's min_tokens<=max_tokens check after the proxy caps
max_tokens=1.
3. instrument_kv_snapshot.py
- Adds a second patch target: initialize
MooncakeConnectorWorker.bootstrap_server = None in __init__.
vLLM 0.18.1 only sets it under the is_kv_producer branch, so
kv_consumer hits AttributeError as soon as the first remote
prefill request lands.
- apply/revert refactored to iterate over (path, patches) pairs.
plot_kv_pool_timeline.py also handles snapshot files that never
captured a running request (would otherwise IndexError on an empty
stackplot input).
Smoke: 4P+4D × 20 reqs → 20/20 success, mean 3.9s, p99 17s (latency well
above the 8C baseline); 8 PIDs all writing snapshots (601 total).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two bugs caught by 8C smoke:
mb5_launch.sh
Prefixing the command line with ${env_bp_arg} doesn't work: bash treats
VAR=val as an environment assignment only when it appears literally in
the parsed command, not when it is produced by variable expansion (the
expanded word is run as a command name instead).
Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as a literal, defaulting
to 9999 when the caller passed no port (consumer mode ignores the var so
the placeholder is harmless).
mb5_run.sh
replayer's actual CLI flags are --trace / --output / --endpoint /
--model, not the --*-path / --*-name variants I had. Plus dash1
has no `bc`; compute wall_clock_s via python instead.
Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs
end-to-end in ~30 s:
- 8 vLLM kv_both instances on GPU 0-7 come up
- replayer round-robins 20 reqs across them
- MB5 instrumentation captures 8 snapshot files (one per EngineCore
PID), ranging 7-139 snapshots each = ~10 Hz throttle works
- plot_kv_pool_timeline.py renders the stacked-area + queue-depth
chart cleanly (figs/mb5_smoke/*.png)
Pipeline validated. Ready for the real PD-ratio sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three new files to drive the PD ratio sweep + per-request KV occupancy
capture, plus a deploy.sh update so the patched replayer rides along
to the fresh-venv host.
mb5_launch.sh
One script handles all four configs we plan to sweep:
CONFIG=8C / 6P+2D / 4P+4D / 2P+6D
- For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer
talks to them via the existing comma-separated round-robin in
replayer/replay.py — no proxy.
- For PD configs: kv_role=kv_producer for the P pool (with
VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool,
routed by the official vLLM example
third_party/vllm/examples/online_serving/disaggregated_serving/
mooncake_connector/mooncake_connector_proxy.py — no policy choice
made by us, per user instruction to use the standard recipe.
- Applies instrument_kv_snapshot.py before launching so every
EngineCore writes its per-step KV snapshot to
$RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl
- Reverts the patch on stop.
- Emits ENDPOINTS= line on stdout for the orchestrator to read.
mb5_run.sh
For each CONFIG × rep: launch, replay w600 trace via the existing
replayer, capture wall-clock, tear down, cool down 10 s. Defaults:
CONFIGS="8C 6P+2D 4P+4D 2P+6D"
REPS=3
TRACE=traces/w600_r0.0015_st30.jsonl
All artefacts go under $FRESH_ROOT/mb5_runs/${RUN_TAG}_${config}_rep${rep}/
(vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt).
plot_kv_pool_timeline.py
Reads one or more mb5_kv_snapshot_pid*.jsonl files and renders a
stacked-area chart per file:
x = wall-clock since first snapshot
y = KV block count, stacked by per-request contribution
overlay: pool-total ceiling, 90% line, waiting-queue depth subplot
Bands are colored by a deterministic hash of request_id so individual
requests are visually tractable across the run.
This is the figure the user asked for — turns headline "PD-disagg is
10× worse" into a system-level picture of *where* the KV pool is
blocked, when, and by which requests.
deploy.sh
Also tar-syncs the local replayer/ dir to
/home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can
`python -m replayer` against the patched (trace_span_s/amplification)
version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The §3.2 H1 (D-pool capacity wall) argument needs system-level evidence,
not just headline latency. This patch lets us record, every ~100 ms,
the exact composition of each vLLM instance's KV pool:
- total / free / used block counts
- for each RUNNING request: blocks held, computed tokens, prompt tokens
- for each WAITING request: prompt tokens, status
Hook: inside Scheduler.schedule() right before the return. Per-request
blocks come from coordinator.single_type_managers[*].req_to_blocks
(vLLM 0.18.1's own per-request bookkeeping; no new tracking layer).
Throttled by MB5_PERIOD_MS env var (default 100 ms = 10 Hz) so a
13-min trace replay produces ~8 k snapshots per instance instead of
~80 k unthrottled.
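
The throttle arithmetic can be checked with a small simulation. This is
a sketch, not the instrumentation patch: the class name is hypothetical,
and the 10 ms scheduler step is an assumption chosen to match the
~80 k unthrottled figure.

```python
import time

class SnapshotThrottle:
    """Allow at most one snapshot per period_ms (clock returns milliseconds)."""

    def __init__(self, period_ms=100, clock=lambda: time.monotonic() * 1000.0):
        self.period_ms = period_ms
        self.clock = clock
        self._last = None

    def should_emit(self):
        now = self.clock()
        if self._last is not None and now - self._last < self.period_ms:
            return False
        self._last = now
        return True

# Simulate a 13-minute replay whose scheduler steps every 10 ms.
step_ms = 10
steps = 13 * 60 * 1000 // step_ms
clock_ms = [0]  # simulated integer clock, avoids float drift
throttle = SnapshotThrottle(period_ms=100, clock=lambda: clock_ms[0])
emitted = 0
for i in range(steps):
    clock_ms[0] = i * step_ms
    emitted += throttle.should_emit()
print(steps, emitted)   # 78000 7800
```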
Output: $MB5_LOG_DIR/mb5_kv_snapshot_pid<pid>.jsonl
(default MB5_LOG_DIR=/tmp). One file per EngineCore PID.
Apply/revert idempotent, same pattern as instrument_mooncake.py.
Markers: # MB5_INSTRUMENT_START / # MB5_INSTRUMENT_END.
Validated on dash1 venv: apply → py_compile ok → revert → py_compile ok.
With this in place we can build the stacked-area "KV pool composition
over time" figure the user asked for: x = wall-clock, y = block count,
colored bands = per-request portions. Comparing 8C colo vs 4P+4D
on the same trace will directly show whether (and when) the D pool
hits its ceiling — turning "PD-disagg is X× worse" into "PD-disagg
is X× worse BECAUSE these specific requests at this specific time
filled the pool and forced this queue depth".
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
We don't have paper-grade evidence yet that PD-disagg fails on agentic
workloads. MB1+MB2 corrected accounting puts phase-isolation cost-benefit
on PD-disagg's side; the only direct support is a colleague's single data
point on a patched dash0 build (TTFT p50 62×, success 52%) and the f4b
geometric capacity argument. To close §3.2 properly we need fresh-venv
empirical replication PLUS system-level instrumentation that tells the
reviewer *which* component is the bottleneck — not just headline
latency.
This document tracks the four candidate failure hypotheses (H1 D-pool
capacity, H2 static-partition mismatch, H3 cache reuse + P-pool hotspot,
H4 end-to-end throughput loss), their current evidence status, and the
phased experiment plan to address each.
Key findings already recorded:
- Phase 0 TODO 0.1 (find standard PD-disagg deployment) is done — vLLM
ships an official example at
examples/online_serving/disaggregated_serving/mooncake_connector/
with a kv_producer+kv_consumer launcher and a Mooncake-aware proxy
that supports arbitrary P:D ratios via env vars. Per user direction,
we will NOT polish PD-disagg policy ourselves; we use the official
recipe as the "PD-disagg" baseline in §3.2 / §5.2.
- Phase 1 (MB5+3 combined: PD ratio sweep with D-pool occupancy logging)
is the critical path. Designed to either confirm H1 with system
breakdown evidence (D-pool ≥ 90% for ≥ 30% of trace + queue depth
spike) or falsify it (some ratio matches 8C colo, in which case §3.2
needs rewriting).
- D-pool occupancy timeline is the single most important new
instrumentation — turns "PD-disagg is 10× worse" into "PD-disagg is
10× worse BECAUSE the D pool sits at >90% for X% of the trace".
Configurations to run on dash1 8-GPU first:
8C (colo baseline), 6P+2D, 4P+4D, 2P+6D × 3 reps × w600 trace.
Open question still in the doc: vLLM 0.18.1 had an AttributeError on
self.bootstrap_server in kv_consumer mode when we hit it during MB2
sanity; likely the issue was bad kv_transfer_params from our side
(missing transfer_id, wrong field names), which we have since fixed.
Official proxy uses the same handshake we now have, so it should
just work. If not, single-line patch to initialize self.bootstrap_server
= None for consumer mode.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
The §3.2 cost-vs-benefit math in commits 029821c (MB1 plot +
pd_cost_vs_benefit.png) and abde010 (RESULTS_SUMMARY.md) was wrong.
What was wrong:
I framed PD-disagg's max phase-isolation benefit as "≤ decode duration
of the new request (~50–200 ms)" — implicitly treating the benefit as
per-request and bounded by that request's own decode. The correct
accounting is per-prefill-event across all stalled streams:
benefit_per_prefill = D × T_prefill × (1 − TPOT_baseline/TPOT_during)
≈ D × T_prefill
which follows from the chunked-prefill math (each of L/N chunks slows
D ongoing decode steps from ~10 ms to t ms, summing to D × T_prefill).
Plug MB1 + MB2 numbers in:
prefill size | T_prefill | T_transfer | D=8 benefit | cost/benefit
2k tok | 0.14 s | 8 ms | 1.1 s | 0.7 %
33k tok | 4.5 s | 320 ms | 36 s | 0.9 %
125k tok | 57 s | 1.9 s | 456 s | 0.4 %
On the phase-isolation axis alone, PD-disagg WINS by 100×–250× — the
opposite of what the deleted figure showed.
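
The table above can be recomputed from its T_prefill and T_transfer
columns using the approximation benefit ≈ D × T_prefill. The function
name is hypothetical; the inputs are the table's own values.

```python
def cost_benefit(d_streams, t_prefill_s, t_transfer_s):
    """Return (aggregate stall avoided in s, transfer cost / benefit)."""
    benefit_s = d_streams * t_prefill_s   # benefit ~= D x T_prefill
    return benefit_s, t_transfer_s / benefit_s

rows = [("2k", 0.14, 0.008), ("33k", 4.5, 0.320), ("125k", 57.0, 1.9)]
for label, t_prefill, t_transfer in rows:
    benefit, ratio = cost_benefit(8, t_prefill, t_transfer)
    print(f"{label:>5} tok  benefit={benefit:6.1f} s  cost/benefit={ratio:.1%}")
```

The ratios come out at 0.4-0.9 %, i.e. benefit exceeds cost by roughly
two orders of magnitude on this axis.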
The actual dominant reason static PD-disagg fails on agentic workloads is
the D-side KV pool capacity wall (figs/f4b_pdsep_kv_wall.png) -- p99
single-request KV is 11.5 GiB, per-D-instance pool is 38 GiB, so 4P+4D
halves system decode capacity. Colleague's 4P+4D experiment showed
TTFT p50 62× worse and success rate 99.5% → 52%, driven by pool
overflow + queueing, not by transfer latency.
Changes (all touched files explicitly listed; no `git add -u`):
- figs/pd_cost_vs_benefit.png : DELETED (figure built on wrong math)
- microbench/fresh_setup/plot_mb1.py : drop the pd_cost_vs_benefit
function; keep mb1_interference.png and update its title to note
per-prefill aggregate stall = D × T_prefill (not capped by decode)
- figs/mb1_interference.png : regenerated, no misleading band annotation
- analysis/mb1/README.md : Summary block rewritten ("what MB1 measures";
no more "max benefit = decode duration" claim); §3.2 implications
section replaced with the corrected per-prefill-event table; explicit
⚠ Correction note documents what was wrong
- analysis/mb2/README.md : Summary block + §3.2 implications section
rewritten the same way; ⚠ Correction note links to RESULTS_SUMMARY §4
- RESULTS_SUMMARY.md §4 + §6 : §4 reordered to lead with the D-side
capacity argument (the real failure mode), MB1/MB2 demoted from
"kill-shot for PD-disagg" to "supporting context inputs to a
cost-benefit table that actually favors PD-disagg on this axis";
§6 paper-claims list reordered to remove the wrong "PD-disagg loses
on cost-vs-benefit" claim and replace with the corrected ones
PAPER_OUTLINE.md and MEETING.md were checked and never picked up this
specific wrong claim — they already (correctly) frame §3.2 around the
D-side KV memory wall.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Single-GPU bench on dash1 GPU 0 (vanilla vLLM 0.18.1, chunked-prefill on,
no kv_connector). 3 decode batch sizes × 5 prefill sizes × 3 reps.
Method recap (driver: microbench/interference/driver.py, repurposed):
- Pin D streaming decode requests at constant max_tokens
- Inject one prefill-only request (max_tokens=1) of varying input length
- Bin decode-stream token timestamps into "during prefill" vs baseline
- Headline metric: effective per-stream TPOT during the prefill burst,
= prefill_ttft / (num_tokens_during_prefill / D), i.e. the average
time each decode stream takes per token during the burst.
The p50 of inter-token intervals is deceptive (chunked-prefill makes
most intervals look normal); the burst-average gives the true cost.
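A minimal sketch of the headline metric above, assuming each decode stream's token arrival times are available as a list of Unix timestamps. The function and argument names are hypothetical and are not taken from driver.py; the example numbers are made up.

```python
from typing import Sequence


def effective_tpot_ms(
    stream_token_times: Sequence[Sequence[float]],
    prefill_start: float,
    prefill_end: float,
) -> float:
    """Burst-average per-stream TPOT (ms) while one prefill is in flight.

    stream_token_times holds one list of token timestamps (seconds) per
    decode stream; prefill_end - prefill_start plays the role of
    prefill_ttft.
    """
    num_streams = len(stream_token_times)
    if num_streams == 0:
        raise ValueError("need at least one decode stream")
    # Bin: count decode tokens that landed inside the prefill window.
    tokens_during = sum(
        1
        for times in stream_token_times
        for t in times
        if prefill_start <= t < prefill_end
    )
    if tokens_during == 0:
        return float("inf")  # decode produced nothing during the burst
    prefill_ttft_ms = (prefill_end - prefill_start) * 1000.0
    # prefill_ttft / (num_tokens_during_prefill / D)
    return prefill_ttft_ms / (tokens_during / num_streams)


# Made-up example: D=2 streams, 0.6 s prefill, 3 tokens per stream inside it.
streams = [[0.05, 0.25, 0.45, 0.70], [0.10, 0.30, 0.50, 0.75]]
tpot = effective_tpot_ms(streams, 0.0, 0.6)
print(f"effective TPOT during burst: {tpot:.1f} ms")
```

Dividing by the baseline TPOT of the same D then gives the penalty factor reported in the results.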
Results (D=8 row, the most agentic-realistic case):
P (tokens) | prefill_ttft | per-stream TPOT during prefill | penalty
      2048 |       143 ms |                          32 ms |      4×
      8192 |       583 ms |                         114 ms |     15×
     32768 |      4520 ms |                         388 ms |     52×
     65536 |     15615 ms |                         757 ms |     99×
    131072 |     56991 ms |                        1419 ms |    183×
Baseline TPOT at D=8: ~7.7 ms. So during a 131k-token prefill burst
each ongoing decode is running ~183× slower (i.e. essentially halted)
for ~57 seconds.
§3.2 implication: PD-disagg's promised phase-isolation benefit per
agentic request is bounded by the decode duration, which is 50–200 ms
for tool-call output. MB2 says the KV-transfer cost of PD-disagg
is 300 ms – 10 s for agentic-size requests. Cost > benefit for every
KV size above ~80 MiB (well below trace mean 192 MiB).
The new figs/pd_cost_vs_benefit.png overlays MB1 benefit ceiling
(50–200 ms band, capped by decode) onto MB2 transfer cost curve and
marks the agentic-distribution waypoints (trace mean, p90, p95, p99)
on the x-axis. Across the entire agentic distribution, the cost curve
sits above the benefit band.
Adds:
- microbench/fresh_setup/mb1_launch.sh: single-GPU vLLM launcher (no
kv_connector, default chunked_prefill=on, max_num_batched_tokens=8192)
- microbench/fresh_setup/mb1_driver.py: copy of the existing
microbench/interference/driver.py for cpfs deployment
- microbench/fresh_setup/analyze_mb1.py: aggregator emitting
per-(D, P) effective-TPOT-during + max PD-disagg-benefit table
- microbench/fresh_setup/plot_mb1.py: mb1 standalone +
pd_cost_vs_benefit headline figure
- analysis/mb1/summary.csv: 45 raw rows from the sweep
- analysis/mb1/breakdown.json: per-(D, P) aggregate
- analysis/mb1/README.md: persistent doc
- figs/mb1_interference.png: effective TPOT during prefill, one line per D
- figs/pd_cost_vs_benefit.png: §3.2 headline (cost > benefit everywhere)
Caveats noted in README:
- chunk_tokens=8192 only; Sarathi-Serve's smaller chunks would
interleave decode more aggressively. Chunk-size sensitivity is
flagged as next run.
- D ≤ 8; higher D may saturate or shrink the penalty further.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sweep on dash1 GPU 0 → dash2 GPU 0 over 200 Gbps RoCE.
remote_bootstrap_addr=http://172.27.123.142:8998. Same 9-size × 5-rep
config as the 2026-05-27 intra-node run.
Per-size pure_transfer (p50) stays within about 4% of the intra-node
numbers at every size except the bimodal 64k point (about 8.5%):
size        intra p50 (ms)   inter p50 (ms)
512 tok            5.3              5.2
2048 tok          20.6             20.0
8192 tok          83.7             80.9
32k tok          320.9            309.6
64k tok         1895             1734      (bimodal in both)
128k tok        2835             2818      (bimodal in both)
=> Mooncake's batch_transfer_sync_write **does not use NVLink** for
intra-node peers; both paths go through the 200 Gbps RDMA NIC, so the
NIC path (not the GPU interconnect) sets the ceiling. The ~9.7 GB/s
steady-state ceiling and the high-variance regime at 6+ GiB are
identical across topologies.
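The bandwidth figures can be re-derived from the p50 table with a short sketch. It assumes the 98304 B per-token KV size quoted for Qwen3-Coder-30B-A3B elsewhere in this log, reads "32k" as 32768 tokens, and reports decimal GB/s; the helper names are hypothetical.

```python
KV_BYTES_PER_TOKEN = 98304  # per-token KV size quoted in the MB2 notes


def bandwidth_gb_per_s(tokens: int, transfer_ms: float) -> float:
    """Effective bandwidth in decimal GB/s for one KV transfer."""
    return tokens * KV_BYTES_PER_TOKEN / (transfer_ms / 1000.0) / 1e9


def rel_diff_pct(intra_ms: float, inter_ms: float) -> float:
    """Inter-node p50 relative to intra-node p50, in percent."""
    return (inter_ms - intra_ms) / intra_ms * 100.0


# (tokens, intra p50 ms, inter p50 ms) for the steady-state rows.
ROWS = [
    (512, 5.3, 5.2),
    (2048, 20.6, 20.0),
    (8192, 83.7, 80.9),
    (32768, 320.9, 309.6),
]

for tokens, intra, inter in ROWS:
    print(
        f"{tokens:>6} tok  intra {bandwidth_gb_per_s(tokens, intra):5.2f} GB/s"
        f"  inter {bandwidth_gb_per_s(tokens, inter):5.2f} GB/s"
        f"  diff {rel_diff_pct(intra, inter):+.1f}%"
    )
```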
Operational implication for §3.2: PD-disaggregation does not get
cheaper by co-locating P and D on the same node. Every routed request
pays the same ~10 GB/s ceiling for KV transfer no matter where it
lands, so topology cannot buy back the transfer cost.
Caveat: B's receive_kv events did not log on dash2 — `MB2_LOG_DIR`
env var did not propagate through vLLM's EngineCore subprocess on
the consumer host (cat /proc/$ENGINE_PID/environ is empty on dash2
for that var, but the producer host on dash1 worked). For this run
pure_transfer numbers are from A's send_blocks alone; full rx_total
breakdown is not available, but pure_transfer is the dominant term.
Adds:
- analyze_mb2_send_only.py — analyzer that works from A's send_blocks
alone when B's receive_kv events are absent
- plot_mb2_compare.py — overlay intra vs inter on the same axes
- plot_mb2.py — tolerate the `rows`-less send-only schema
- figs/mb2_transfer_{time,bw}_inter.png — inter-node single-curve
- figs/mb2_transfer_{time,bw}_compare.png — intra vs inter overlay
- analysis/mb2/A_inter_kvboth.jsonl, inter_kvboth_client.json,
inter_kvboth_breakdown.json
- analysis/mb2/README.md — Summary block updated to reference both
paths, dated 2026-05-27 run-log entry appended with the full table
and the topology-independence framing
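A sketch of what a send-only aggregation can look like, assuming one JSON object per line with the `event`, `total_bytes` and `duration_s` fields of the send_blocks schema. This is an illustration, not the code in analyze_mb2_send_only.py; the function name, output keys and sample values are hypothetical.

```python
import json
from collections import defaultdict
from statistics import median
from typing import Dict, Iterable


def aggregate_send_only(jsonl_lines: Iterable[str]) -> Dict[int, dict]:
    """Group send_blocks events by payload size and summarise durations."""
    by_size = defaultdict(list)
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        if event.get("event") != "send_blocks":
            continue  # ignore receive_kv or any other event type
        by_size[event["total_bytes"]].append(event["duration_s"])

    summary = {}
    for size, durations in sorted(by_size.items()):
        p50 = median(durations)
        summary[size] = {
            "n": len(durations),
            "ms_p50": p50 * 1000.0,
            "ms_min": min(durations) * 1000.0,
            "ms_max": max(durations) * 1000.0,
            # Bandwidth at the p50 duration, in decimal GB/s.
            "gbps_p50": size / p50 / 1e9,
        }
    return summary


# Made-up sample: three 768 MiB sends plus one event that is skipped.
sample = [
    '{"event": "send_blocks", "total_bytes": 805306368, "duration_s": 0.080}',
    '{"event": "send_blocks", "total_bytes": 805306368, "duration_s": 0.085}',
    '{"event": "send_blocks", "total_bytes": 805306368, "duration_s": 0.090}',
    '{"event": "receive_kv", "duration_s": 0.1}',
]
summary = aggregate_send_only(sample)
print(summary)
```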
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Adds the pieces needed to run the producer on dash1 and the consumer on
dash2 with the same shared cpfs venv:
start_vllm_single.sh
INSTANCE / GPU / PORT / BP / MASTER / ROLE env vars; brings up ONE
vLLM instance and applies the mooncake instrumentation patch. The
patch is idempotent: the venv is cpfs-shared, so the first invocation
applies it and the second is a no-op. Per-instance MB2_LOG_DIR keeps producer/consumer
events separate even though both directories live on the same cpfs
path visible to both hosts.
mb2_kv_transfer.py
New --src-host / --dst-host args. Defaults stay 127.0.0.1 for
backward-compat with the intra-node sweep. /v1/completions URLs and
/query URLs now use the supplied hosts. remote_bootstrap_addr is
built as http://<src_host>:<src_bp> so the consumer's
do_remote_prefill request carries a routable address.
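A minimal sketch of the host/port plumbing described above, using argparse. The flag names and defaults come from these commit messages; the function names are hypothetical and the real client may structure this differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="MB2 KV-transfer client (sketch)")
    # Defaults keep the intra-node sweep working without extra flags.
    parser.add_argument("--src-host", default="127.0.0.1")
    parser.add_argument("--dst-host", default="127.0.0.1")
    # /query lives on the mooncake bootstrap port, not the vLLM HTTP port.
    parser.add_argument("--src-bp", type=int, default=8998)
    parser.add_argument("--dst-bp", type=int, default=8999)
    return parser


def remote_bootstrap_addr(src_host: str, src_bp: int) -> str:
    """Routable address the consumer uses to reach the producer."""
    return f"http://{src_host}:{src_bp}"


# Inter-node example: the producer address used in the dash1 -> dash2 sweep.
args = build_parser().parse_args(["--src-host", "172.27.123.142"])
addr = remote_bootstrap_addr(args.src_host, args.src_bp)
print(addr)
```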
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Full sweep result on dash1 GPU 0+1 with vanilla vLLM 0.18.1 +
mooncake-transfer-engine 0.3.11, kv_both connector. Per-stage decomposition
via the instrumentation patch (analyze_mb2.py pairs A's send_blocks with
B's receive_kv enter/finish by time window).
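One way to pair events by time window is sketched below. It assumes each event carries `t_start_unix` and `duration_s` and that a receive window encloses its matching send window; the `slack_s` tolerance for clock skew and the function name are assumptions of this sketch, and analyze_mb2.py may pair differently.

```python
from __future__ import annotations

from typing import Iterable


def pair_by_time_window(
    sends: Iterable[dict], recvs: Iterable[dict], slack_s: float = 0.05
) -> list[tuple[dict, dict]]:
    """Pair each send_blocks event with the receive_kv event enclosing it."""
    pairs = []
    unused = sorted(recvs, key=lambda e: e["t_start_unix"])
    for send in sorted(sends, key=lambda e: e["t_start_unix"]):
        s0 = send["t_start_unix"]
        s1 = s0 + send["duration_s"]
        for i, recv in enumerate(unused):
            r0 = recv["t_start_unix"] - slack_s
            r1 = recv["t_start_unix"] + recv["duration_s"] + slack_s
            if r0 <= s0 and s1 <= r1:
                # Each receive event is consumed by at most one send.
                pairs.append((send, unused.pop(i)))
                break
    return pairs


# Made-up events: the first send falls inside the receive window,
# the second has no receive event near it and stays unpaired.
sends = [
    {"event": "send_blocks", "t_start_unix": 100.010, "duration_s": 0.080},
    {"event": "send_blocks", "t_start_unix": 200.000, "duration_s": 0.020},
]
recvs = [{"event": "receive_kv", "t_start_unix": 100.000, "duration_s": 0.100}]
pairs = pair_by_time_window(sends, recvs)
for send, recv in pairs:
    overhead_ms = (recv["duration_s"] - send["duration_s"]) * 1000.0
    print(f"paired; receive-side overhead ~ {overhead_ms:.1f} ms")
```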
Steady-state (1k..32k tokens, 96 MiB..3 GiB KV):
pure_transfer ≈ size / 9.7 GB/s
rx_overhead ≈ 2–3 ms (ZMQ handshake + P-side setup)
bandwidth ≈ 9.6–10.1 GB/s, very stable
Large-size regime (65k..131k tokens, 6..12 GiB):
p50 bandwidth collapses to 3.4–4.5 GB/s
max bandwidth still hits ~9.7 GB/s (some runs achieve it)
p99 agentic request (11.5 GiB) lands here
Implication for §3.2 PD-disaggregation cost argument:
median agentic decode = 50–200 ms (tool-call JSON output)
median agentic-tail KV transfer (p99 11.5 GiB):
best case (9.7 GB/s) ≈ 1.19 s
observed range 1.5 – 10 s
⇒ KV transfer takes 8–100× longer than the decode it enables.
This is intra-node — the lower-bound transfer cost. Inter-node RDMA
will be slower; that's MB2 phase 2.
Adds:
- analyze_mb2.py: pair A.send_blocks ↔ B.receive_kv by time window;
per-size aggregation (n, ms_p50, ms_min/max, GB/s_p50/max)
- plot_mb2.py: log-log transfer-time chart + bandwidth-vs-size chart
- analysis/mb2/A_intra_kvboth.jsonl, B_intra_kvboth.jsonl: raw events
(51 + 102 events including the sanity preamble)
- analysis/mb2/intra_kvboth_breakdown.json: paired and aggregated
- figs/mb2_transfer_time_intra.png, figs/mb2_transfer_bw_intra.png
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit closes the loop on the fresh-venv MB2 path. Three corrections
on top of the previous scaffold made the bench fire successfully on
dash1 GPU 0+1 with kv_both connector roles:
1. Re-target instrumentation patch to vLLM's shipped MooncakeConnector
(vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py).
The mooncake-package's own mooncake_connector_v1.py turned out not to
be the implementation vLLM 0.18.1 loads — the
'{"kv_connector": "MooncakeConnector"}' config picks up the vLLM-shipped
one. Patches go at _send_blocks (P-side) and receive_kv_from_single_worker
(D-side, async, both entry and FINISH branch).
2. /query lives on the mooncake bootstrap port, not the vLLM HTTP port.
Add --src-bp / --dst-bp args; default 8998 / 8999.
3. kv_transfer_params schema for the vanilla connector:
do_remote_decode → {transfer_id}
do_remote_prefill → {transfer_id, remote_engine_id, remote_bootstrap_addr}
where remote_bootstrap_addr must include the http:// scheme. The dash0
smoke_test_migrate_cache.py was written for the patched build, which
used a different field-name set (remote_host, remote_port,
remote_block_ids); those are rejected here.
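The schema in item 3 can be sketched as two small builders. The field names come from the text above; passing do_remote_decode / do_remote_prefill as boolean flags inside the same dict, the `check_params` guard, and the "engine-A" placeholder are assumptions of this sketch, not confirmed connector behaviour.

```python
import uuid

# Field names used by the patched build; the vanilla connector rejects them.
LEGACY_FIELDS = {"remote_host", "remote_port", "remote_block_ids"}


def remote_decode_params(transfer_id: str) -> dict:
    """Step 1 on the producer: compute KV and hold it for a later pull."""
    return {"do_remote_decode": True, "transfer_id": transfer_id}


def remote_prefill_params(
    transfer_id: str, remote_engine_id: str, src_host: str, src_bp: int = 8998
) -> dict:
    """Step 2 on the consumer: pull KV from the producer."""
    return {
        "do_remote_prefill": True,
        "transfer_id": transfer_id,
        "remote_engine_id": remote_engine_id,
        # The http:// scheme is required; a bare host:port is not enough.
        "remote_bootstrap_addr": f"http://{src_host}:{src_bp}",
    }


def check_params(params: dict) -> None:
    """Client-side guard (hypothetical helper) against the old field set."""
    stale = LEGACY_FIELDS & params.keys()
    if stale:
        raise ValueError(f"legacy fields not accepted: {sorted(stale)}")
    addr = params.get("remote_bootstrap_addr")
    if addr is not None and not addr.startswith("http://"):
        raise ValueError("remote_bootstrap_addr must include http://")


tid = uuid.uuid4().hex
step1 = remote_decode_params(tid)
# "engine-A" is a placeholder; the real id would come from the producer.
step2 = remote_prefill_params(tid, "engine-A", "127.0.0.1")
check_params(step1)
check_params(step2)
print(step1, step2, sep="\n")
```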
Also discovered (and worked around): vLLM 0.18.1 with kv_role=kv_consumer
raises AttributeError on `self.bootstrap_server` because that attribute
is only assigned conditionally inside `if not self.is_kv_consumer`. We
sidestep by running kv_both for the microbench — transfer mechanics are
identical (same batch_transfer_sync_write call); the role gate only
affects which request types each instance accepts. For the §5 strict
PD-disagg baseline we'll need either to fix this bug or to front the
pair with a role-aware proxy.
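The failure pattern is easy to reproduce in isolation. The classes below are stand-ins written for illustration, not vLLM's code: they only mirror the "attribute assigned inside `if not self.is_kv_consumer`" shape described above, and the `shutdown` method is a hypothetical place where the attribute gets read.

```python
class ToyConnector:
    """Stand-in with the same conditional-assignment pattern."""

    def __init__(self, kv_role: str) -> None:
        self.is_kv_consumer = kv_role == "kv_consumer"
        if not self.is_kv_consumer:
            # Only kv_producer / kv_both instances ever get this attribute.
            self.bootstrap_server = object()

    def shutdown(self) -> str:
        # Unconditional read: raises AttributeError for kv_consumer.
        return "stopped" if self.bootstrap_server is not None else "noop"


class FixedToyConnector(ToyConnector):
    """One possible fix: always define the attribute before the branch."""

    def __init__(self, kv_role: str) -> None:
        self.bootstrap_server = None
        super().__init__(kv_role)


for role in ("kv_both", "kv_consumer"):
    try:
        print(role, "->", ToyConnector(role).shutdown())
    except AttributeError as exc:
        print(role, "-> AttributeError:", exc)

print("fixed kv_consumer ->", FixedToyConnector("kv_consumer").shutdown())
```

Running with kv_both, as the microbench does, never takes the branch that skips the assignment, which is why the workaround holds.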
Sanity smoke (3 sizes × 2 repeats, dash1 GPU 0+1, kv_both intra-node):
input (tok)  KV (MiB)  send_blocks_ms (P)  receive_kv_ms (D)  client_step2_ms
        512        48                5–23               7–33            18–91
       2048       192                  21                 23               37
       8192       768                  85                 88              110
=> intra-node bandwidth ~9 GB/s on the actual transfer for 768 MiB,
which is well below NVLink p2p; likely PCIe-staged. Worth verifying.
Next step (in flight): full sweep 512..128k tokens × 5 repeats with
the per-stage analyzer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
start_vllm_pair.sh
ROLE_A / ROLE_B env vars (default kv_producer / kv_consumer for strict
PD-disagg). Override to kv_both for the kv_both control. The role is
injected into --kv-transfer-config so vLLM imposes the role restriction.
mb2_kv_transfer.py
--skip-verify flag drops step 3 (the plain completion sanity-check on
the destination), required when the dst is kv_consumer-only since a
kv_consumer instance refuses to serve a request without
do_remote_prefill. The transfer-time itself is still measured from
step 2 (do_remote_prefill on the consumer).
Also: per-step client-side wall-clock timestamps (t_step1_client_unix,
t_step2_client_unix, t_step2_end_unix) are now captured so the
post-hoc breakdown analyzer can join with the per-instance JSONL logs
on absolute time.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Per-stage breakdown of "step 2" (the B-side do_remote_prefill) requires
vLLM/mooncake-internal timing — we cannot infer it from black-box HTTP
E2E. This commit adds the four pieces to do that breakdown:
instrument_mooncake.py
apply / revert / check patches on mooncake_connector_v1.py to emit
structured JSONL transfer events at two key sites:
send_blocks (P-side, on batch_transfer_sync_write):
{event, remote_session, total_bytes, duration_s, t_start_unix,
ret, tp_rank, t_log_unix}
receive_kv (D-side, on the ZMQ-driven pull request):
{event, path, local_req_ids, remote_req_ids, duration_s,
t_start_unix, tp_rank, t_log_unix}
All injected code is bracketed by `# MB2_INSTRUMENT_START/END` so the
--revert pass is a single regex scan. Apply-revert round-trip
validated on dash1 (PATCHED → py_compile ok → revert → CLEAN → ok).
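A sketch of how a marker-bracketed revert can work as a single regex pass. Only the `# MB2_INSTRUMENT_START/END` marker names come from the text above; the regex, helper names and sample source are assumptions and are not copied from instrument_mooncake.py.

```python
import re

# Match from a START marker line through the matching END marker line,
# including the trailing newline, so the surrounding code is left intact.
MARKED_BLOCK = re.compile(
    r"^[ \t]*# MB2_INSTRUMENT_START\b.*?^[ \t]*# MB2_INSTRUMENT_END\b[^\n]*\n?",
    re.DOTALL | re.MULTILINE,
)


def is_patched(source: str) -> bool:
    """Check pass: does the file still contain injected code?"""
    return "# MB2_INSTRUMENT_START" in source


def revert(source: str) -> str:
    """Revert pass: drop every bracketed block in one scan."""
    return MARKED_BLOCK.sub("", source)


patched = (
    "def send(x):\n"
    "    # MB2_INSTRUMENT_START\n"
    "    t0 = time.time()\n"
    "    # MB2_INSTRUMENT_END\n"
    "    return x\n"
)
clean = revert(patched)
print(clean)
```

Because revert only touches bracketed regions, running it on an already clean file is a no-op, which keeps the apply/revert round-trip safe to repeat.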
start_vllm_pair.sh (updated)
- Picks up instrument_mooncake.py via SCRIPT_DIR.
- On `start`: applies patch before launching the two vLLM instances.
- On `stop` (or trap exit): reverts patch.
- Sets per-instance MB2_LOG_DIR = $FRESH_ROOT/mb2_transfer_logs/{A,B}/
so send-side and receive-side events land in cleanly separated dirs.
deploy.sh
tar-over-ssh sync of microbench/fresh_setup/ → cpfs
/home/admin/cpfs/wjh/agentic-kv-fresh/scripts/ so dash1 / dash2 see
the same scripts (dash{1,2} don't have rsync; tar pipe works).
The mb2_kv_transfer.py client still uses black-box E2E timing — the
next commit will teach it to ingest the per-instance JSONL logs to
produce the 4-way breakdown (queueing / setup / transfer / decode).
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Two new files prepare measurement of T_transfer(KV_size, network_path),
the gap §3.2's PD-disagg cost argument has had since day one.
microbench/fresh_setup/start_vllm_pair.sh
start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B)
with --kv-transfer-config '{"kv_connector":"MooncakeConnector",
"kv_role":"kv_both"}' running off the fresh venv (vanilla wheel +
vanilla mooncake 0.3.11, NOT the dash0 patched build). GPU IDs and
ports are env-overridable so the same script drives the intra-node
pair (GPU_A=0 GPU_B=1 on one host) and the inter-node pair (GPU_A=0
on dash1, GPU_B=0 on dash2 — launched per host separately).
microbench/fresh_setup/mb2_kv_transfer.py
Three-step measurement borrowed from
connector_tax/.../smoke_test_migrate_cache.py:
1. do_remote_decode on A (compute & cache KV; max_tokens=1)
2. do_remote_prefill on B (pull KV from A — this is the timed step)
3. plain completion on B (sanity check: cached_tokens ≈ prompt len)
Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5
repeats each; reports mean / p50 / p90 transfer time and a per-size
raw log. Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the upper
end is a ≈ 6 GiB transfer: the same order of magnitude as the p99
11.5 GiB from §2 but still below it (the model's max_model_len of
200000 caps the absolute upper bound).
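The size arithmetic can be checked with a few lines. The 98304 B per-token figure is from the text; the model-shape breakdown (48 layers, 4 KV heads, head dim 128, 2-byte elements) is an assumption added here to show where such a number can come from, and should be verified against the model config.

```python
KV_BYTES_PER_TOKEN = 98304  # figure quoted in the text

# Assumed model shape (not stated in this log): K and V, 48 layers,
# 4 KV heads, head_dim 128, 2 bytes per element.
ASSUMED_SHAPE_BYTES = 2 * 48 * 4 * 128 * 2


def kv_gib(tokens: int) -> float:
    """KV-cache size in GiB for a prompt of the given length."""
    return tokens * KV_BYTES_PER_TOKEN / 2**30


SWEEP = [512, 1024, 2048, 4096, 8192, 16384, 32768, 65536]

print("shape-derived bytes/token:", ASSUMED_SHAPE_BYTES)
for n in SWEEP:
    print(f"{n:>6} tokens -> {kv_gib(n) * 1024:7.0f} MiB")
```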
What we will NOT learn from this design:
- Bandwidth saturation when the system is loaded (single-request bench)
- vLLM-internal scheduling overhead vs pure transfer (the timed step
folds them together — but for the §3.2 argument that's the right
"what does PD-disagg actually pay" number)
Intentionally not committed yet: an orchestrator that loops over
intra-/inter-node configs. We start manual on dash1 intra-node to
verify the measurement is sane before scaling out.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Install script lives in microbench/fresh_setup/install.sh. Single shared
venv at /home/admin/cpfs/wjh/agentic-kv-fresh/.venv (cpfs is mounted at
the same path on dash0/1/2 so one install serves all three).
vllm : 0.18.1 (official wheel)
mooncake-transfer-engine: 0.3.11.post1
Smoke-tested on dash1 + dash2: imports succeed, kv_transfer module
resolves. This venv is the vanilla reference for all subsequent
microbench / PD-disagg experiments — not the dash0 patched build that
carries the connector_tax fix.
The script defines proxyOn inline (ipads 127.0.0.1:11235) so it works
under non-interactive ssh (~/.bashrc proxyOn is interactive-only).
Sets -eo pipefail (not -u) because venv activation references unset
PS1-like vars under -u.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>