agentic-kvc

Gahow Wang 7437422618 MB2 scaffolding: launch script for vLLM pair + KV-transfer-time client

Two new files prepare measurement of T_transfer(KV_size, network_path),
the gap §3.2's PD-disagg cost argument has had since day one.

microbench/fresh_setup/start_vllm_pair.sh
  start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B)
  with --kv-transfer-config '{"kv_connector":"MooncakeConnector",
  "kv_role":"kv_both"}' running off the fresh venv (vanilla wheel +
  vanilla mooncake 0.3.11, NOT the dash0 patched build). GPU IDs and
  ports are env-overridable so the same script drives the intra-node
  pair (GPU_A=0 GPU_B=1 on one host) and the inter-node pair (GPU_A=0
  on dash1, GPU_B=0 on dash2 — launched per host separately).
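
A minimal sketch of how the env-overridable launch could be assembled, written
in Python rather than shell for illustration. GPU_A / GPU_B and the connector
JSON come from the description above; PORT_A / PORT_B, the default ports, the
model placeholder, and the build_launch helper are hypothetical, and the real
script may pass more flags than shown here.

```python
import json
import os

# Connector settings quoted from the commit message.
KV_TRANSFER_CONFIG = {"kv_connector": "MooncakeConnector", "kv_role": "kv_both"}


def build_launch(side, env=None, model="MODEL_PLACEHOLDER"):
    """Return (extra_env, argv) for one instance; side is "A" or "B"."""
    env = os.environ if env is None else env
    # GPU_A / GPU_B are named in the commit message. PORT_A / PORT_B and
    # every default value below are hypothetical.
    default_gpu = {"A": "0", "B": "1"}[side]
    default_port = {"A": "8100", "B": "8200"}[side]
    gpu = env.get(f"GPU_{side}", default_gpu)
    port = env.get(f"PORT_{side}", default_port)
    argv = [
        "vllm", "serve", model,
        "--port", port,
        "--kv-transfer-config", json.dumps(KV_TRANSFER_CONFIG),
    ]
    # Pin the instance to one GPU through the process environment.
    return {"CUDA_VISIBLE_DEVICES": gpu}, argv
```

For the inter-node case the same helper would be called once per host, each
time with the GPU variable for that side set to 0.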

microbench/fresh_setup/mb2_kv_transfer.py
  Three-step measurement borrowed from
  connector_tax/.../smoke_test_migrate_cache.py:
    1. do_remote_decode  on A   (compute & cache KV; max_tokens=1)
    2. do_remote_prefill on B   (pull KV from A — this is the timed step)
    3. plain completion on B    (sanity check: cached_tokens ≈ prompt len)
  Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5
  repeats each; reports mean / p50 / p90 transfer time and a per-size
  raw log. Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the upper
  end of the sweep is a ≈ 6 GiB transfer: inside the range seen in §2 but
  short of its p99 of 11.5 GiB (the model's max_model_len of 200000 caps
  the absolute upper bound).
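
The size arithmetic can be checked with a short sketch. It assumes "1k" means
1024 tokens, so the sweep is 512 doubled seven times; the per-token figure and
max_model_len are the ones quoted above, and kv_bytes is an illustrative helper.

```python
BYTES_PER_TOKEN = 98304  # per-token KV size quoted in the commit message
GIB = 1024 ** 3

# Assumption: "1k" means 1024 tokens, so the sweep is 512 * 2**i.
SWEEP_TOKENS = [512 * 2 ** i for i in range(8)]  # 512 up to 65536


def kv_bytes(tokens):
    """KV cache size in bytes for a prompt of the given token count."""
    return tokens * BYTES_PER_TOKEN


for n in SWEEP_TOKENS:
    print(f"{n:>6} tokens -> {kv_bytes(n) / GIB:.3f} GiB")

# Size implied by the max_model_len cap of 200000 tokens.
print(f"max_model_len -> {kv_bytes(200000) / GIB:.2f} GiB")
```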

What we will NOT learn from this design:
  - Bandwidth saturation when the system is loaded (single-request bench)
  - vLLM-internal scheduling overhead vs pure transfer (the timed step
    folds them together — but for the §3.2 argument that's the right
    "what does PD-disagg actually pay" number)

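The second caveat follows from where the timer sits. A minimal sketch of the
three-step measurement, assuming the three requests are supplied as callables
(decode_on_a, prefill_on_b, check_on_b are hypothetical stand-ins for the HTTP
calls, not the client's actual functions) and assuming nearest-rank
percentiles, which may differ from what mb2_kv_transfer.py reports:

```python
import math
import statistics
import time


def measure_once(decode_on_a, prefill_on_b, check_on_b):
    """Run the three steps; return the seconds spent in step 2 only."""
    handle = decode_on_a()        # step 1: compute and cache KV on A
    t0 = time.perf_counter()
    prefill_on_b(handle)          # step 2: B pulls KV from A (timed)
    elapsed = time.perf_counter() - t0
    check_on_b()                  # step 3: sanity check, untimed
    return elapsed


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def summarize(samples):
    """Mean, p50 and p90 of one sweep point's repeats."""
    return {
        "mean": statistics.mean(samples),
        "p50": percentile(samples, 50),
        "p90": percentile(samples, 90),
    }
```

Because the timer wraps the whole step-2 request, anything B does between
accepting the request and answering it lands in the same number as the
transfer. With only 5 repeats per size, a nearest-rank p90 equals the slowest
repeat.
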
Intentionally not committed yet: an orchestrator that loops over
intra-/inter-node configs. We start manually on dash1 intra-node to
verify the measurement is sane before scaling out.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-27 17:47:04 +08:00