agentic-kvc

gahow/agentic-kvc

Commit Graph

Author

SHA1

Message

Date

Gahow Wang

e9abd70c8d

MB5 driver: launcher, orchestrator, KV-pool timeline plotter

Three new files to drive the PD ratio sweep + per-request KV occupancy
capture, plus a deploy.sh update so the patched replayer rides along
to the fresh-venv host.

mb5_launch.sh
  One script handles all four configs we plan to sweep:
    CONFIG=8C / 6P+2D / 4P+4D / 2P+6D
  - For 8C: 8 vLLM instances with kv_role=kv_both on GPUs 0-7. Replayer
    talks to them via the existing comma-separated round-robin in
    replayer/replay.py — no proxy.
  - For PD configs: kv_role=kv_producer for the P pool (with
    VLLM_MOONCAKE_BOOTSTRAP_PORT) + kv_role=kv_consumer for the D pool,
    routed by the official vLLM example
    third_party/vllm/examples/online_serving/disaggregated_serving/
    mooncake_connector/mooncake_connector_proxy.py — no policy choice
    made by us, per user instruction to use the standard recipe.
  - Applies instrument_kv_snapshot.py before launching so every
    EngineCore writes its per-step KV snapshot to
    $RUN_ROOT/kv_snapshots/mb5_kv_snapshot_pid<pid>.jsonl
  - Reverts the patch on stop.
  - Emits ENDPOINTS= line on stdout for the orchestrator to read.
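The CONFIG-to-role mapping described above can be sketched in Python as follows. This is an illustration only, not the logic in mb5_launch.sh: the names `Instance` and `plan_instances` are hypothetical, and placing the producers on the lowest GPU indices is an assumption the commit message does not confirm.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    gpu: int
    kv_role: str  # "kv_both", "kv_producer", or "kv_consumer"


def plan_instances(config: str, num_gpus: int = 8) -> list[Instance]:
    """Map a CONFIG string such as "8C" or "6P+2D" to one instance per GPU."""
    colocated = re.fullmatch(r"(\d+)C", config)
    if colocated:
        counts = [(int(colocated.group(1)), "kv_both")]
    else:
        pd = re.fullmatch(r"(\d+)P\+(\d+)D", config)
        if pd is None:
            raise ValueError(f"unrecognised CONFIG: {config!r}")
        counts = [
            (int(pd.group(1)), "kv_producer"),
            (int(pd.group(2)), "kv_consumer"),
        ]
    total = sum(n for n, _ in counts)
    if total != num_gpus:
        raise ValueError(f"{config!r} uses {total} GPUs, expected {num_gpus}")
    # Assumption: producers take the lowest GPU indices, consumers the rest.
    roles = [role for n, role in counts for _ in range(n)]
    return [Instance(gpu=i, kv_role=role) for i, role in enumerate(roles)]
```

For example, `plan_instances("6P+2D")` would put kv_producer on GPUs 0-5 and kv_consumer on GPUs 6-7 under the ordering assumption above.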

mb5_run.sh
  For each CONFIG × rep: launch, replay w600 trace via the existing
  replayer, capture wall-clock, tear down, cool down 10 s. Defaults:
    CONFIGS="8C 6P+2D 4P+4D 2P+6D"
    REPS=3
    TRACE=traces/w600_r0.0015_st30.jsonl
  All artefacts go under $FRESH_ROOT/mb5_runs/${RUN_TAG}_${config}_rep${rep}/
  (vllm_logs/, kv_snapshots/, replay_metrics.jsonl, wall_clock_s.txt).
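As a rough sketch of the sweep layout, the default CONFIGS and REPS give 4 × 3 = 12 runs, each with its own artefact directory. The helper `run_dirs` below is hypothetical, and numbering reps from 1 is an assumption; the commit message does not say where the orchestrator starts counting.

```python
from itertools import product
from pathlib import Path

DEFAULT_CONFIGS = ("8C", "6P+2D", "4P+4D", "2P+6D")


def run_dirs(fresh_root, run_tag, configs=DEFAULT_CONFIGS, reps=3):
    """List the per-run artefact directories for a CONFIG x rep sweep."""
    root = Path(fresh_root) / "mb5_runs"
    return [
        root / f"{run_tag}_{config}_rep{rep}"
        # Assumption: reps are numbered from 1.
        for config, rep in product(configs, range(1, reps + 1))
    ]
```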

plot_kv_pool_timeline.py
  Reads one or more mb5_kv_snapshot_pid*.jsonl files and renders a
  stacked-area chart per file:
    x = wall-clock since first snapshot
    y = KV block count, stacked by per-request contribution
    overlay: pool-total ceiling, 90% line, waiting-queue depth subplot
  Bands are colored by a deterministic hash of request_id so individual
  requests can be tracked visually across the run.
  This is the figure the user asked for: it turns the headline "PD-disagg
  is 10× worse" into a system-level picture of *where* the KV pool is
  blocked, when, and by which requests.
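One way to get a stable per-request color is sketched below. It is not taken from plot_kv_pool_timeline.py: the function name `request_color` and the choice of SHA-256 mapped onto an HSV hue are assumptions. Python's built-in `hash()` is avoided because string hashes are salted per process by default, so it would not give the same color across runs.

```python
import colorsys
import hashlib


def request_color(request_id: str) -> tuple[float, float, float]:
    """Return an RGB triple in [0, 1] that depends only on request_id."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Use the first 4 bytes of the digest as a hue in [0, 1).
    hue = int.from_bytes(digest[:4], "big") / 2**32
    # Fixed saturation and value keep the bands readable side by side.
    return colorsys.hsv_to_rgb(hue, 0.65, 0.9)
```

The returned tuple can be passed directly as a matplotlib color, so the same request gets the same band color in every chart.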

deploy.sh
  Also tar-syncs the local replayer/ dir to
  /home/admin/cpfs/wjh/agentic-kv-fresh/replayer/ so mb5_run.sh can
  `python -m replayer` against the patched (trace_span_s/amplification)
  version, not the older copy under /home/admin/cpfs/wjh/agentic-kv/.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 23:02:57 +08:00

Gahow Wang

efdcf3c555

MB2: per-stage instrumentation patch + launcher integration

Per-stage breakdown of "step 2" (the B-side do_remote_prefill) requires
vLLM/mooncake-internal timing — we cannot infer it from black-box HTTP
E2E. This commit adds the four pieces to do that breakdown:

instrument_mooncake.py
  apply / revert / check patches on mooncake_connector_v1.py to emit
  structured JSONL transfer events at two key sites:

    send_blocks (P-side, on batch_transfer_sync_write):
      {event, remote_session, total_bytes, duration_s, t_start_unix,
       ret, tp_rank, t_log_unix}
    receive_kv (D-side, on the ZMQ-driven pull request):
      {event, path, local_req_ids, remote_req_ids, duration_s,
       t_start_unix, tp_rank, t_log_unix}

  All injected code is bracketed by `# MB2_INSTRUMENT_START/END` so the
  --revert pass is a single regex scan. Apply-revert round-trip
  validated on dash1 (PATCHED → py_compile ok → revert → CLEAN → ok).
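The marker-bracketed revert can be sketched as below. This is a minimal illustration, not the code in instrument_mooncake.py: it assumes the markers are the whole-line comments `# MB2_INSTRUMENT_START` and `# MB2_INSTRUMENT_END`, and the function names `revert` and `check` are hypothetical.

```python
import re

START = "# MB2_INSTRUMENT_START"
END = "# MB2_INSTRUMENT_END"

# Match from the start-marker line through the end-marker line, inclusive.
_BLOCK = re.compile(
    r"^[ \t]*" + re.escape(START) + r".*?^[ \t]*" + re.escape(END) + r"[^\n]*\n?",
    re.DOTALL | re.MULTILINE,
)


def revert(source: str) -> str:
    """Strip every instrumented block, leaving the original code."""
    return _BLOCK.sub("", source)


def check(source: str) -> str:
    """Report whether the source still carries injected code."""
    return "PATCHED" if START in source else "CLEAN"
```

Bracketing each injection this way means the revert does not need to know what was injected, only where it starts and ends.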

start_vllm_pair.sh (updated)
  - Picks up instrument_mooncake.py via SCRIPT_DIR.
  - On `start`: applies patch before launching the two vLLM instances.
  - On `stop` (or trap exit): reverts patch.
  - Sets per-instance MB2_LOG_DIR = $FRESH_ROOT/mb2_transfer_logs/{A,B}/
    so send-side and receive-side events land in cleanly separated dirs.
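A hedged sketch of how an injected hook might append one JSONL event to the per-instance directory follows. The helper `log_event`, the fallback to the current directory, and the file name `mb2_events_pid<pid>.jsonl` are all hypothetical; only the MB2_LOG_DIR variable and the field names come from the commit message.

```python
import json
import os
import time
from pathlib import Path


def log_event(event: str, t_start_unix: float, duration_s: float, **fields) -> dict:
    """Append one structured transfer event to this process's JSONL log."""
    record = {
        "event": event,
        "t_start_unix": t_start_unix,
        "duration_s": duration_s,
        **fields,  # e.g. total_bytes, ret, tp_rank
        "t_log_unix": time.time(),
    }
    # Assumption: fall back to the current directory if MB2_LOG_DIR is unset.
    log_dir = Path(os.environ.get("MB2_LOG_DIR", "."))
    log_dir.mkdir(parents=True, exist_ok=True)
    # Hypothetical file name; one file per process avoids interleaved writes.
    path = log_dir / f"mb2_events_pid{os.getpid()}.jsonl"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
    return record
```

With MB2_LOG_DIR set differently for instances A and B, send_blocks and receive_kv events end up in separate directories without the hook needing to know which side it runs on.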

deploy.sh
  tar-over-ssh sync of microbench/fresh_setup/ → cpfs
  /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/ so dash1 / dash2 see
  the same scripts (dash{1,2} don't have rsync; tar pipe works).

The mb2_kv_transfer.py client still uses black-box E2E timing — the
next commit will teach it to ingest the per-instance JSONL logs to
produce the 4-way breakdown (queueing / setup / transfer / decode).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 18:12:44 +08:00

2 Commits