A per-stage breakdown of "step 2" (the B-side do_remote_prefill) requires
timing from inside vLLM/Mooncake; it cannot be inferred from black-box
HTTP E2E measurements. This commit adds the pieces needed for that
breakdown:
instrument_mooncake.py
Applies, reverts, or checks a patch to mooncake_connector_v1.py that
emits structured JSONL transfer events at two key sites:
send_blocks (P-side, on batch_transfer_sync_write):
{event, remote_session, total_bytes, duration_s, t_start_unix,
ret, tp_rank, t_log_unix}
receive_kv (D-side, on the ZMQ-driven pull request):
{event, path, local_req_ids, remote_req_ids, duration_s,
t_start_unix, tp_rank, t_log_unix}
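As a rough illustration of the event shape above, the injected logging
could look something like the sketch below. This is not the code the
patch injects: the helper name `emit_transfer_event`, its signature, and
the per-rank/per-PID file naming are all assumptions made for the
example. Only the field names come from the lists above.

```python
import json
import os
import tempfile
import time


def emit_transfer_event(log_dir, event, t_start_unix, duration_s, tp_rank, **fields):
    """Append one JSON object per line (JSONL) under log_dir; return the file path.

    Hypothetical helper: the name, signature, and file naming are assumptions.
    """
    record = {
        "event": event,
        **fields,  # site-specific fields, e.g. remote_session, total_bytes, ret
        "duration_s": duration_s,
        "t_start_unix": t_start_unix,
        "tp_rank": tp_rank,
        "t_log_unix": time.time(),  # when the record was written, not when the transfer ran
    }
    os.makedirs(log_dir, exist_ok=True)
    # Assumed naming: one file per TP rank and PID so writers never share a file.
    path = os.path.join(log_dir, f"transfer_rank{tp_rank}_pid{os.getpid()}.jsonl")
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return path


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as demo_dir:
        t0 = time.time()
        time.sleep(0.01)  # stand-in for the real transfer call
        p = emit_transfer_event(
            demo_dir, "send_blocks", t0, time.time() - t0, tp_rank=0,
            remote_session="example-session", total_bytes=1 << 20, ret=0,
        )
        with open(p, encoding="utf-8") as f:
            print(f.read())
```

Writing one self-contained JSON object per line keeps the logs
appendable and lets a later reader parse them line by line.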
All injected code is bracketed by `# MB2_INSTRUMENT_START/END` so the
--revert pass is a single regex scan. Apply-revert round-trip
validated on dash1 (PATCHED → py_compile ok → revert → CLEAN → ok).
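The marker-based revert described above can be sketched as follows.
This is a minimal illustration, not the script's actual code: it assumes
the markers are the literal comment lines `# MB2_INSTRUMENT_START` and
`# MB2_INSTRUMENT_END`, each on its own line, and the function names
`is_patched` and `revert` are hypothetical.

```python
import re

# Assumed literal marker lines; the real script may format them differently.
START = "# MB2_INSTRUMENT_START"
END = "# MB2_INSTRUMENT_END"

# Match from the start of a START marker line (including indentation)
# through the end of the nearest following END marker line.
_BLOCK_RE = re.compile(
    r"^[ \t]*" + re.escape(START) + r".*?^[ \t]*" + re.escape(END) + r"[^\n]*\n?",
    re.DOTALL | re.MULTILINE,
)


def is_patched(source: str) -> bool:
    """Return True if any instrumentation marker is present."""
    return START in source


def revert(source: str) -> str:
    """Remove every marker-bracketed block in one regex pass."""
    return _BLOCK_RE.sub("", source)


if __name__ == "__main__":
    patched = (
        "def f():\n"
        "    x = 1\n"
        "    # MB2_INSTRUMENT_START\n"
        "    log(x)\n"
        "    # MB2_INSTRUMENT_END\n"
        "    return x\n"
    )
    print(revert(patched))
```

Because the non-greedy match stops at the first END marker, several
independent injected blocks in one file are each removed separately.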
start_vllm_pair.sh (updated)
- Picks up instrument_mooncake.py via SCRIPT_DIR.
- On `start`: applies patch before launching the two vLLM instances.
- On `stop` (or trap exit): reverts patch.
- Sets per-instance MB2_LOG_DIR = $FRESH_ROOT/mb2_transfer_logs/{A,B}/
so send-side and receive-side events land in cleanly separated dirs.
deploy.sh
Syncs microbench/fresh_setup/ to cpfs via tar over ssh, into
/home/admin/cpfs/wjh/agentic-kv-fresh/scripts/, so dash1 and dash2 see
the same scripts (dash{1,2} have no rsync, but a tar pipe works).
The mb2_kv_transfer.py client still uses black-box E2E timing; the
next commit will teach it to ingest the per-instance JSONL logs and
produce the 4-way breakdown (queueing / setup / transfer / decode).
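For orientation, reading the per-instance logs might start from
something like the sketch below. It only loads events and aggregates
`duration_s` per event type; it does not compute the 4-way breakdown,
whose exact definition is left to the next commit. The function names
and the `*.jsonl` file pattern are assumptions.

```python
import glob
import json
import os
from collections import defaultdict


def load_events(log_dir):
    """Read every JSONL record under log_dir (assumed pattern: *.jsonl)."""
    events = []
    for path in sorted(glob.glob(os.path.join(log_dir, "*.jsonl"))):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line:  # skip blank lines
                    events.append(json.loads(line))
    return events


def summarize(events):
    """Group records by their 'event' field and aggregate duration_s."""
    durations = defaultdict(list)
    for e in events:
        durations[e["event"]].append(e["duration_s"])
    return {
        name: {"count": len(d), "total_s": sum(d), "mean_s": sum(d) / len(d)}
        for name, d in durations.items()
    }
```

Calling `summarize(load_events(...))` once on the A directory and once
on the B directory would give send-side and receive-side totals
separately.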
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>