74374226183261ca6989a799ed81ba828af8105d
Two new files prepare measurement of T_transfer(KV_size, network_path),
the gap §3.2's PD-disagg cost argument has had since day one.
microbench/fresh_setup/start_vllm_pair.sh
start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B)
with --kv-transfer-config '{"kv_connector":"MooncakeConnector",
"kv_role":"kv_both"}' running off the fresh venv (vanilla wheel +
vanilla mooncake 0.3.11, NOT the dash0 patched build). GPU IDs and
ports are env-overridable so the same script drives the intra-node
pair (GPU_A=0 GPU_B=1 on one host) and the inter-node pair (GPU_A=0
on dash1, GPU_B=0 on dash2 — launched per host separately).
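
For illustration only, the env-override logic can be sketched in Python (the committed launcher is a shell script and may differ). GPU_A / GPU_B and the connector JSON come from the description above; the PORT_A / PORT_B variable names, the default ports, and the build_launch helper are assumptions made for this sketch, and MooncakeConnector may need further settings that are not shown here.

```python
import json
import os

# Connector settings quoted from the commit description.
KV_TRANSFER_CONFIG = {"kv_connector": "MooncakeConnector", "kv_role": "kv_both"}


def build_launch(name, model, env=os.environ):
    """Return (argv, extra_env) for instance `name`, which is "A" or "B"."""
    default_gpu = {"A": "0", "B": "1"}[name]
    default_port = {"A": "8100", "B": "8200"}[name]  # hypothetical defaults
    gpu = env.get(f"GPU_{name}", default_gpu)
    port = env.get(f"PORT_{name}", default_port)  # PORT_* names are assumed
    argv = [
        "vllm", "serve", model,
        "--port", port,
        "--kv-transfer-config", json.dumps(KV_TRANSFER_CONFIG),
    ]
    # The instance is pinned to one GPU through CUDA_VISIBLE_DEVICES.
    return argv, {"CUDA_VISIBLE_DEVICES": gpu}
```

For the inter-node pair, each host would call this once with its own overrides, e.g. GPU_B=0 on dash2.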
microbench/fresh_setup/mb2_kv_transfer.py
Three-step measurement borrowed from
connector_tax/.../smoke_test_migrate_cache.py:
1. do_remote_decode on A (compute & cache KV; max_tokens=1)
2. do_remote_prefill on B (pull KV from A — this is the timed step)
3. plain completion on B (sanity check: cached_tokens ≈ prompt len)
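
A minimal sketch of one repetition of the three steps, not the code in mb2_kv_transfer.py. It assumes the OpenAI-compatible /v1/completions endpoint, that the connector flags travel in a kv_transfer_params request field and that A's response returns parameters to forward to B, and that cached_tokens is reported under usage.prompt_tokens_details. The post callable (url, payload) -> dict is a hypothetical stand-in for whatever HTTP client is used.

```python
import time


def measure_transfer(post, url_a, url_b, model, prompt):
    """Run the three steps once; return (transfer_seconds, cached_tokens)."""
    base = {"model": model, "prompt": prompt, "max_tokens": 1}

    # Step 1: A computes and caches the KV for the prompt.
    resp_a = post(
        f"{url_a}/v1/completions",
        {**base, "kv_transfer_params": {"do_remote_decode": True}},
    )

    # Step 2 (timed): B pulls the KV from A instead of prefilling locally.
    # Forwarding A's returned parameters to B is an assumption of this sketch.
    params = dict(resp_a.get("kv_transfer_params") or {})
    params["do_remote_prefill"] = True
    start = time.perf_counter()
    post(f"{url_b}/v1/completions", {**base, "kv_transfer_params": params})
    elapsed = time.perf_counter() - start

    # Step 3: plain completion on B; a cache hit shows up as cached_tokens.
    resp_c = post(f"{url_b}/v1/completions", base)
    details = resp_c.get("usage", {}).get("prompt_tokens_details") or {}
    return elapsed, details.get("cached_tokens", 0)
```

As noted further down, the elapsed time includes scheduling and HTTP overhead on B, not only the wire transfer.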
Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5
repeats each; reports mean / p50 / p90 transfer time and a per-size
raw log. Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the largest
size transfers ≈ 6 GiB. That is below the p99 of 11.5 GiB from §2, so
the sweep covers roughly the lower half of that range (the model's
max_model_len of 200000 caps the absolute upper bound).
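
The size arithmetic and the reported statistics can be checked with a short sketch. It assumes the sweep labels are powers of two (1k = 1024 tokens) and uses a nearest-rank percentile; the benchmark's actual percentile definition is not stated here, and the helper names are illustrative.

```python
import math
import statistics

KV_BYTES_PER_TOKEN = 98304  # Qwen3-Coder-30B-A3B, from the commit description
SWEEP_TOKENS = [512 * 2**i for i in range(8)]  # 512 .. 65536, assuming 1k = 1024


def kv_bytes(tokens):
    """KV cache size in bytes for a prompt of `tokens` tokens."""
    return tokens * KV_BYTES_PER_TOKEN


def percentile(samples, p):
    """Nearest-rank percentile; with 5 repeats, p90 is simply the maximum."""
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def summarize(samples):
    """Mean / p50 / p90 of one size's repeated transfer times."""
    return {
        "mean": statistics.fmean(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
    }


if __name__ == "__main__":
    for tokens in SWEEP_TOKENS:
        print(f"{tokens:>6} tokens -> {kv_bytes(tokens) / 2**30:.3f} GiB")
```

With only 5 repeats per size, p90 under this definition equals the slowest run, so it is better read as a worst-case marker than as a tail estimate.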
What we will NOT learn from this design:
- Bandwidth saturation when the system is loaded (single-request bench)
- vLLM-internal scheduling overhead vs pure transfer (the timed step
folds them together — but for the §3.2 argument that's the right
"what does PD-disagg actually pay" number)
Intentionally not committed yet: an orchestrator that loops over
intra-/inter-node configs. We start manual on dash1 intra-node to
verify the measurement is sane before scaling out.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>