Gahow Wang 91673f1fb8 MB2: working end-to-end intra-node KV transfer microbench
This commit closes the loop on the fresh-venv MB2 path. Three corrections
on top of the previous scaffold made the bench fire successfully on
dash1 GPU 0+1 with kv_both connector roles:

1. Re-target instrumentation patch to vLLM's shipped MooncakeConnector
   (vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py).
   The mooncake package's own mooncake_connector_v1.py is not the
   implementation vLLM 0.18.1 loads: the
   '{"kv_connector": "MooncakeConnector"}' config resolves to the
   vLLM-shipped one. Patches go at _send_blocks (P-side) and
   receive_kv_from_single_worker (D-side, async; both the entry and the
   FINISH branch).

2. /query lives on the mooncake bootstrap port, not the vLLM HTTP port.
   Add --src-bp / --dst-bp args; default 8998 / 8999.

3. kv_transfer_params schema for the vanilla connector:
     do_remote_decode  → {transfer_id}
     do_remote_prefill → {transfer_id, remote_engine_id, remote_bootstrap_addr}
   where remote_bootstrap_addr must include the http:// scheme. The dash0
   smoke_test_migrate_cache.py was written for the patched build, which
   used a different field-name set (remote_host, remote_port,
   remote_block_ids); those are rejected here.
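
A minimal sketch of the two payloads from item 3. The helper names, the
example ids, and the address are hypothetical, and passing the
do_remote_* flag itself as a boolean key is an assumption; only the field
names and the http:// requirement come from the notes above.

```python
def decode_side_params(transfer_id: str) -> dict:
    """Params for the request that sets do_remote_decode."""
    # Assumption: the flag is sent as a boolean key next to transfer_id.
    return {"do_remote_decode": True, "transfer_id": transfer_id}


def prefill_side_params(transfer_id: str,
                        remote_engine_id: str,
                        remote_bootstrap_addr: str) -> dict:
    """Params for the request that sets do_remote_prefill."""
    # The notes say the address must carry the http:// scheme.
    if not remote_bootstrap_addr.startswith("http://"):
        raise ValueError(
            f"remote_bootstrap_addr needs http://, got {remote_bootstrap_addr!r}"
        )
    return {
        "do_remote_prefill": True,  # assumption, see above
        "transfer_id": transfer_id,
        "remote_engine_id": remote_engine_id,
        "remote_bootstrap_addr": remote_bootstrap_addr,
    }


if __name__ == "__main__":
    # Placeholder values for illustration only.
    print(decode_side_params("xfer-0"))
    print(prefill_side_params("xfer-0", "engine-0", "http://127.0.0.1:8998"))
```

Failing fast on a missing scheme in the client keeps the error next to
the caller instead of surfacing later as a connector-side rejection.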

Also discovered (and worked around): vLLM 0.18.1 with kv_role=kv_consumer
raises AttributeError on `self.bootstrap_server` because that attribute
is only assigned conditionally inside `if not self.is_kv_consumer`. We
sidestep by running kv_both for the microbench — transfer mechanics are
identical (same batch_transfer_sync_write call); the role gate only
affects which request types each instance accepts. For the §5 strict
PD-disagg baseline we'll need to either fix this bug or front the pair
with a role-aware proxy.
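
The failure mode above can be reproduced in isolation. This is a sketch
with hypothetical stand-in classes, not vLLM's code; only the attribute
name, the `is_kv_consumer` gate, and the resulting AttributeError are
taken from the notes. One possible fix, assumed here, is to assign the
attribute unconditionally.

```python
class ConnectorSketch:
    """Hypothetical stand-in showing the conditional-assignment bug."""

    def __init__(self, is_kv_consumer: bool):
        self.is_kv_consumer = is_kv_consumer
        if not self.is_kv_consumer:
            # Only non-consumer roles ever get the attribute.
            self.bootstrap_server = object()  # placeholder for a real server

    def touch_bootstrap(self):
        # Any unguarded access fails for a pure consumer.
        return self.bootstrap_server


class ConnectorSketchFixed(ConnectorSketch):
    """Same logic, but the attribute always exists."""

    def __init__(self, is_kv_consumer: bool):
        self.bootstrap_server = None  # default before the role gate runs
        super().__init__(is_kv_consumer)


if __name__ == "__main__":
    try:
        ConnectorSketch(is_kv_consumer=True).touch_bootstrap()
    except AttributeError as exc:
        print("buggy consumer:", exc)
    print("fixed consumer:", ConnectorSketchFixed(True).touch_bootstrap())
```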

Sanity smoke (3 sizes × 2 repeats, dash1 GPU 0+1, kv_both intra-node):
  input_tok  KV-MiB  send_blocks_ms (P)  receive_kv_ms (D)  client_step2_ms
        512      48                5–23               7–33            18–91
       2048     192                  21                 23               37
       8192     768                  85                 88              110
=> ~9 GB/s effective intra-node bandwidth on the transfer itself
   (768 MiB in 85–88 ms), which is well below NVLink p2p; likely
   PCIe-staged. Worth verifying.
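
The bandwidth figure follows from the table by simple arithmetic. A small
sketch, where the function name is hypothetical and the sizes and timings
are the single-valued rows from the sanity smoke:

```python
MIB = 1024 ** 2  # bytes per MiB


def bandwidth_gb_per_s(size_mib: float, elapsed_ms: float) -> float:
    """Effective bandwidth in decimal GB/s for size_mib moved in elapsed_ms."""
    return size_mib * MIB / (elapsed_ms / 1e3) / 1e9


if __name__ == "__main__":
    # (input tokens, KV MiB, send_blocks_ms on P, receive_kv_ms on D)
    rows = [(2048, 192, 21, 23), (8192, 768, 85, 88)]
    for tokens, mib, p_ms, d_ms in rows:
        print(f"{tokens:>5} tok: P {bandwidth_gb_per_s(mib, p_ms):.2f} GB/s, "
              f"D {bandwidth_gb_per_s(mib, d_ms):.2f} GB/s")
```

Both rows land at roughly 9 GB/s, so the rate does not appear to improve
between 192 MiB and 768 MiB.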

Next step (in flight): full sweep 512..128k tokens × 5 repeats with
the per-stage analyzer.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-27 18:53:25 +08:00