91673f1fb82a97e2670f61e2c0da057da6daf8f1
This commit closes the loop on the fresh-venv MB2 path. Three corrections
on top of the previous scaffold made the bench fire successfully on
dash1 GPU 0+1 with kv_both connector roles:
1. Re-target instrumentation patch to vLLM's shipped MooncakeConnector
(vllm/distributed/kv_transfer/kv_connector/v1/mooncake/mooncake_connector.py).
The mooncake-package's own mooncake_connector_v1.py turned out not to
be the implementation vLLM 0.18.1 loads — the
'{"kv_connector": "MooncakeConnector"}' config picks up the vLLM-shipped
one. Patches go at _send_blocks (P-side) and receive_kv_from_single_worker
(D-side, async, both entry and FINISH branch).
2. /query lives on the mooncake bootstrap port, not the vLLM HTTP port.
Add --src-bp / --dst-bp args; default 8998 / 8999.
3. kv_transfer_params schema for the vanilla connector:
do_remote_decode → {transfer_id}
do_remote_prefill → {transfer_id, remote_engine_id, remote_bootstrap_addr}
where remote_bootstrap_addr must include the http:// scheme. The dash0
smoke_test_migrate_cache.py was written for the patched build, which
used a different field-name set (remote_host, remote_port,
remote_block_ids); those are rejected here.
Also discovered (and worked around): vLLM 0.18.1 with kv_role=kv_consumer
raises AttributeError on `self.bootstrap_server` because that attribute
is only assigned conditionally inside `if not self.is_kv_consumer`. We
sidestep by running kv_both for the microbench — transfer mechanics are
identical (same batch_transfer_sync_write call); the role gate only
affects which request types each instance accepts. For §5 strict PD-disagg
baseline we'll need either to fix this bug or front the pair with a
role-aware proxy.
Sanity smoke (3 sizes × 2 repeats, dash1 GPU 0+1, kv_both intra-node):
input KV-MiB send_blocks_ms (P) receive_kv_ms (D) client_step2_ms
512 48 5–23 7–33 18–91
2048 192 21 23 37
8192 768 85 88 110
=> intra-node bandwidth ~9 GB/s on the actual transfer for 768 MiB,
which is well below NVLink p2p; likely PCIe-staged. Worth verifying.
Next step (in flight): full sweep 512..128k tokens × 5 repeats with
the per-stage analyzer.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%