Per-port vllm:prefix_cache_{queries,hits}_total -> instance_apc.txt. For PD
this is the only honest reuse signal: producer ports show cross-turn prefix
hits, while the consumer's per-request cached_tokens just counts transferred KV.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Two bugs caught by 8C smoke:
mb5_launch.sh
${env_bp_arg} expanded as a literal command line prefix doesn't work
when env_bp_arg is itself a variable — bash only treats VAR=val as
an env assignment if it sees the literal in the parsed command, not
after expansion. Fix: always export VLLM_MOONCAKE_BOOTSTRAP_PORT as
a literal, defaulting to 9999 when caller passed no port (consumer
mode ignores the var so the placeholder is harmless).
mb5_run.sh
replayer's actual CLI flags are --trace / --output / --endpoint /
--model, not the --*-path / --*-name variants I had. Plus dash1
has no `bc`; compute wall_clock_s via python instead.
Both fixed; 8C smoke (CONFIG=8C REPS=1 REQUEST_LIMIT=20) now runs
end-to-end in ~30 s:
- 8 vLLM kv_both instances on GPU 0-7 come up
- replayer round-robins 20 reqs across them
- MB5 instrumentation captures 8 snapshot files (one per EngineCore
PID), ranging 7-139 snapshots each = ~10 Hz throttle works
- plot_kv_pool_timeline.py renders the stacked-area + queue-depth
chart cleanly (figs/mb5_smoke/*.png)
Pipeline validated. Ready for the real PD-ratio sweep.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Three new files to drive the PD ratio sweep + per-request KV occupancy
capture, plus a deploy.sh update so the patched replayer rides along
to the fresh-venv host.
mb5_launch.sh
One script handles all four configs we plan to sweep:
CONFIG=8C / 6P+2D / 4P+4D / 2P+6D
- For 8C: 8 vLLM instances with kv_role=kv_both on GPU 0-7. Replayer
talks to them via the existing comma-separated round-robin in
replayer/replay.py — no proxy.
- For PD configs: kv_role=kv_producer for the P pool (with
VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool,
routed by the official vLLM example
third_party/vllm/examples/online_serving/disaggregated_serving/
mooncake_connector/mooncake_connector_proxy.py — no policy choice
made by us, per user instruction to use the standard recipe.
- Applies instrument_kv_snapshot.py before launching so every
EngineCore writes its per-step KV snapshot to
$RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl
- Reverts the patch on stop.
- Emits ENDPOINTS= line on stdout for the orchestrator to read.
mb5_run.sh
For each CONFIG × rep: launch, replay w600 trace via the existing
replayer, capture wall-clock, tear down, cool down 10 s. Defaults:
CONFIGS="8C 6P+2D 4P+4D 2P+6D"
REPS=3
TRACE=traces/w600_r0.0015_st30.jsonl
All artefacts go under $FRESH_ROOT/mb5_runs/$RUN_TAG_${config}_rep${rep}/
(vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt).
plot_kv_pool_timeline.py
Reads one or more mb5_kv_snapshot_pid*.jsonl files and renders a
stacked-area chart per file:
x = wall-clock since first snapshot
y = KV block count, stacked by per-request contribution
overlay: pool-total ceiling, 90% line, waiting-queue depth subplot
Bands are colored by a deterministic hash of request_id so individual
requests are visually tractable across the run.
This is the figure the user asked for — turns headline "PD-disagg is
10× worse" into a system-level picture of *where* the KV pool is
blocked, when, and by which requests.
deploy.sh
Also tar-syncs the local replayer/ dir to
/home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can
`python -m replayer` against the patched (trace_span_s/amplification)
version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>