MB2 scaffolding: launch script for vLLM pair + KV-transfer-time client

Two new files prepare the measurement of T_transfer(KV_size, network_path),
the gap that §3.2's PD-disagg cost argument has had since day one.

microbench/fresh_setup/start_vllm_pair.sh
  start | status | stop two vLLM 0.18.1 instances on local GPUs (A, B) with
  --kv-transfer-config '{"kv_connector":"MooncakeConnector", "kv_role":"kv_both"}',
  running off the fresh venv (vanilla wheel + vanilla mooncake 0.3.11, NOT
  the dash0 patched build). GPU IDs and ports are env-overridable so the
  same script drives the intra-node pair (GPU_A=0 GPU_B=1 on one host) and
  the inter-node pair (GPU_A=0 on dash1, GPU_B=0 on dash2, launched
  separately on each host).

microbench/fresh_setup/mb2_kv_transfer.py
  Three-step measurement borrowed from
  connector_tax/.../smoke_test_migrate_cache.py:
    1. do_remote_decode on A (compute & cache KV; max_tokens=1)
    2. do_remote_prefill on B (pull KV from A; this is the timed step)
    3. plain completion on B (sanity check: cached_tokens ≈ prompt len)
  Sweeps input_tokens ∈ {512, 1k, 2k, 4k, 8k, 16k, 32k, 64k} with 5 repeats
  each; reports mean / p50 / p90 transfer time and a per-size raw log.
  Per-token KV is 98304 B (Qwen3-Coder-30B-A3B), so the upper end of the
  sweep is a ≈ 6 GiB transfer. That sits inside the range bounded by the
  11.5 GiB p99 from §2, though below the p99 itself (the model's
  max_model_len of 200000 caps the absolute upper bound).

What we will NOT learn from this design:
  - Bandwidth saturation when the system is loaded (single-request bench)
  - vLLM-internal scheduling overhead vs pure transfer (the timed step
    folds them together, but for the §3.2 argument that's the right
    "what does PD-disagg actually pay" number)

Intentionally not committed yet: an orchestrator that loops over
intra-/inter-node configs. We start manually on dash1 intra-node to verify
the measurement is sane before scaling out.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
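
The size figures in the message can be checked with a few lines of arithmetic. The sketch below is illustrative only: the layer count, KV-head count, head dimension and 2-byte element size are assumptions (they reproduce the 98304 B per-token figure but are not stated in this commit), and the helper names are hypothetical.

```python
import math

# Assumed model config (NOT stated in the commit): 48 layers, 4 KV heads,
# head_dim 128, 2-byte (bf16) KV-cache elements.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2

# K and V each hold NUM_KV_HEADS * HEAD_DIM elements per layer per token.
KV_BYTES_PER_TOKEN = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM

GIB = 1024 ** 3


def kv_gib(n_tokens: int) -> float:
    """KV-cache size in GiB for a prompt of n_tokens tokens."""
    return n_tokens * KV_BYTES_PER_TOKEN / GIB


def tokens_for_gib(size_gib: float) -> int:
    """Smallest prompt length whose KV cache is at least size_gib GiB."""
    return math.ceil(size_gib * GIB / KV_BYTES_PER_TOKEN)


if __name__ == "__main__":
    print(KV_BYTES_PER_TOKEN)          # 98304
    print(kv_gib(65536))               # 6.0
    print(round(kv_gib(200000), 2))    # 18.31
    print(tokens_for_gib(11.5))        # 125611
```

Under these assumptions the 64k end of the sweep is exactly 6 GiB, the 11.5 GiB p99 corresponds to roughly 125.6k tokens, and max_model_len 200000 bounds a single transfer at about 18.3 GiB.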
2026-05-27 17:47:04 +08:00
parent 0a63de5bcf
commit 7437422618
2 changed files with 309 additions and 0 deletions
--- a/microbench/fresh_setup/mb2_kv_transfer.py
+++ b/microbench/fresh_setup/mb2_kv_transfer.py
@@ -0,0 +1,204 @@
 #!/usr/bin/env python3
 """MB2: measure KV transfer time between two vLLM instances over Mooncake.
 Pattern (adapted from microbench/connector_tax/cache_sweep/smoke_test_migrate_cache.py):
  1. Prefill on A:  do_remote_decode with max_tokens=1  (A computes & caches KV)
  2. Pull to B:     do_remote_prefill on B with kv_transfer_params from step 1
                    (this is the operation that performs the KV transfer)
  3. Verify:        send a follow-up to B; cached_tokens should be close to
                    the prompt length (>= 90 % counts as "the KV landed on B")
 We time step 2 — that gives us E2E "transfer + B's prefill check" latency.
 By sweeping input_length we trace T_transfer(KV_size).
 The follow-up step gives us a sanity check (correctness) but isn't timed.
 """
 from __future__ import annotations
 import argparse
 import asyncio
 import json
 import statistics
 import time
 import uuid
 from pathlib import Path
 import httpx
 MODEL_PATH = "/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct"
 async def get_engine_id(client: httpx.AsyncClient, port: int) -> str:
    r = await client.get(f"http://127.0.0.1:{port}/query")
    r.raise_for_status()
    data = r.json()
    return data["0"]["engine_id"]
 async def completion(
    client: httpx.AsyncClient,
    port: int,
    prompt_token_ids: list[int],
    max_tokens: int,
    kv_transfer_params: dict | None = None,
 ) -> tuple[float, dict]:
    payload = {
        "model": MODEL_PATH,
        "prompt": prompt_token_ids,
        "max_tokens": max_tokens,
        # The original expression evaluated to 1 on both branches.
        "min_tokens": 1,
        "temperature": 0.0,
        "stream": False,
    }
    if kv_transfer_params:
        payload["kv_transfer_params"] = kv_transfer_params
    t0 = time.perf_counter()
    r = await client.post(
        f"http://127.0.0.1:{port}/v1/completions",
        json=payload, timeout=600.0,
    )
    elapsed_s = time.perf_counter() - t0
    r.raise_for_status()
    return elapsed_s, r.json()
 def synth_prompt(rng_seed: int, n_tokens: int) -> list[int]:
    """Deterministic token-id sequence, far enough from special tokens."""
    import random
    rng = random.Random(rng_seed)
    return [rng.randint(100, 150000) for _ in range(n_tokens)]
 async def measure_one(
    client: httpx.AsyncClient,
    src_port: int, dst_port: int,
    src_eid: str, dst_eid: str,
    input_tokens: int,
    rng_seed: int,
 ) -> dict:
    prompt = synth_prompt(rng_seed, input_tokens)
    session = uuid.uuid4().hex
    # Step 1: prefill on A. max_tokens=1 ensures KV is cached but no real decode work.
    t_prefill_s, prefill_resp = await completion(
        client, src_port, prompt, max_tokens=1,
        kv_transfer_params={
            "do_remote_decode": True,
            "remote_block_ids": None,
            "remote_engine_id": src_eid,
            "remote_host": "127.0.0.1",
            "remote_port": src_port,
        },
    )
    src_kvp = prefill_resp.get("kv_transfer_params") or {}
    # Step 2: pull from A to B (the transfer step we time)
    t_transfer_s, pull_resp = await completion(
        client, dst_port, prompt, max_tokens=1,
        kv_transfer_params={
            "do_remote_prefill": True,
            "remote_block_ids": src_kvp.get("remote_block_ids"),
            "remote_engine_id": src_eid,
            "remote_host": "127.0.0.1",
            "remote_port": src_kvp.get("remote_port", src_port),
        },
    )
    # Step 3: follow-up, no kv_transfer_params — should hit B's cache fully
    t_followup_s, follow_resp = await completion(
        client, dst_port, prompt, max_tokens=1,
    )
    usage = (follow_resp.get("usage") or {})
    details = usage.get("prompt_tokens_details") or {}
    cached_followup = details.get("cached_tokens", 0) or usage.get("cached_tokens", 0)
    return {
        "input_tokens": input_tokens,
        "session": session,
        "t_prefill_s": t_prefill_s,
        "t_transfer_s": t_transfer_s,
        "t_followup_s": t_followup_s,
        "cached_followup": cached_followup,
        "ok": cached_followup >= input_tokens * 0.9,  # ≥90 % cached = transfer succeeded
    }
 async def main_async(args: argparse.Namespace) -> None:
    sizes_str = args.sizes
    sizes = [int(s) for s in sizes_str.split(",")]
    repeats = args.repeats
    src_port, dst_port = args.src_port, args.dst_port
    limits = httpx.Limits(max_connections=10, max_keepalive_connections=10)
    async with httpx.AsyncClient(limits=limits, trust_env=False) as client:
        src_eid = await get_engine_id(client, src_port)
        dst_eid = await get_engine_id(client, dst_port)
        print(f"[mb2] src_eid={src_eid[:16]}...  dst_eid={dst_eid[:16]}...")
        results = []
        for sz in sizes:
            for r in range(repeats):
                row = await measure_one(
                    client, src_port, dst_port, src_eid, dst_eid,
                    input_tokens=sz, rng_seed=sz * 1000 + r,
                )
                print(f"  size={sz:>6}  rep={r}  "
                      f"transfer={row['t_transfer_s']*1000:7.1f}ms  "
                      f"followup_cached={row['cached_followup']}/{sz}  "
                      f"ok={row['ok']}")
                results.append(row)
    # Summarise per-size
    summary = []
    for sz in sizes:
        ts = [r["t_transfer_s"] for r in results if r["input_tokens"] == sz and r["ok"]]
        if not ts:
            continue
        summary.append({
            "input_tokens": sz,
            "n_ok": len(ts),
            "transfer_s_mean": statistics.mean(ts),
            "transfer_s_p50": statistics.median(ts),
            # With fewer than 10 OK samples (e.g. the default 5 repeats) this
            # falls back to the max, so "p90" equals "max" in that case.
            "transfer_s_p90": statistics.quantiles(ts, n=10)[-1] if len(ts) >= 10 else max(ts),
            "transfer_s_min": min(ts),
            "transfer_s_max": max(ts),
        })
    out = {
        "model": MODEL_PATH,
        "kv_bytes_per_token": 98304,
        "src_port": src_port,
        "dst_port": dst_port,
        "config_label": args.label,
        "raw": results,
        "summary": summary,
    }
    Path(args.out).write_text(json.dumps(out, indent=2))
    print(f"[mb2] wrote {args.out}")
    for s in summary:
        sz = s["input_tokens"]
        kv_mib = sz * 98304 / 1024 / 1024
        print(f"  {sz:>6} tok ({kv_mib:>7.1f} MiB KV): "
              f"mean {s['transfer_s_mean']*1000:7.1f} ms · "
              f"p50 {s['transfer_s_p50']*1000:7.1f} · "
              f"p90 {s['transfer_s_p90']*1000:7.1f} "
              f"(n_ok={s['n_ok']})")
 def main() -> None:
    p = argparse.ArgumentParser()
    p.add_argument("--src-port", type=int, default=8000)
    p.add_argument("--dst-port", type=int, default=8001)
    p.add_argument(
        "--sizes",
        default="512,1024,2048,4096,8192,16384,32768,65536",
        help="Comma-separated input_token sizes to sweep",
    )
    p.add_argument("--repeats", type=int, default=5)
    p.add_argument("--label", default="intra-node",
                   help="Label written into the output (e.g. intra-node / inter-node)")
    p.add_argument("--out", default="mb2_result.json")
    args = p.parse_args()
    asyncio.run(main_async(args))
 if __name__ == "__main__":
    main()
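
Once a run has written its JSON output, the per-size transfer times can be turned into an effective bandwidth. The sketch below is a hypothetical post-processing helper, not part of this commit; it relies only on the `kv_bytes_per_token` and `summary` fields that `main_async` writes, and the name `effective_bandwidth` is an assumption. Because the timed step also includes vLLM scheduling overhead, the figure it reports understates the raw transfer throughput.

```python
import json
from pathlib import Path

GIB = 1024 ** 3


def effective_bandwidth(result: dict) -> list[dict]:
    """Per-size effective bandwidth in GiB/s, based on the p50 transfer time."""
    bytes_per_token = result["kv_bytes_per_token"]
    rows = []
    for s in result["summary"]:
        kv_bytes = s["input_tokens"] * bytes_per_token
        rows.append({
            "input_tokens": s["input_tokens"],
            "kv_gib": kv_bytes / GIB,
            "gib_per_s": kv_bytes / GIB / s["transfer_s_p50"],
        })
    return rows


if __name__ == "__main__":
    path = Path("mb2_result.json")  # default --out of mb2_kv_transfer.py
    if path.exists():
        for row in effective_bandwidth(json.loads(path.read_text())):
            print(f"{row['input_tokens']:>6} tok  "
                  f"{row['kv_gib']:6.2f} GiB  {row['gib_per_s']:6.2f} GiB/s")
```

Using the p50 rather than the mean keeps a single slow repeat from dominating the small-n (5 repeats) estimate.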
--- a/microbench/fresh_setup/start_vllm_pair.sh
+++ b/microbench/fresh_setup/start_vllm_pair.sh
@@ -0,0 +1,105 @@
 #!/usr/bin/env bash
 # Start 2 vLLM instances with Mooncake kv_connector (kv_both) for MB2.
 #
 # Default config: both on local GPU 0 and 1 (intra-node A/B test).
 # Override via the GPU_A / GPU_B (and MODEL / LOGS_DIR) env vars.
 #
 # This uses the FRESH venv at /home/admin/cpfs/wjh/agentic-kv-fresh/.venv
 # (vanilla vllm 0.18.1 + vanilla mooncake-transfer-engine 0.3.11), NOT
 # the dash0 patched build.
 #
 # Usage:
 #   GPU_A=0 GPU_B=1 bash microbench/fresh_setup/start_vllm_pair.sh
 #   bash microbench/fresh_setup/start_vllm_pair.sh status
 #   bash microbench/fresh_setup/start_vllm_pair.sh stop
 set -eo pipefail
 FRESH_ROOT="/home/admin/cpfs/wjh/agentic-kv-fresh"
 VENV="${FRESH_ROOT}/.venv"
 MODEL="${MODEL:-/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct}"
 LOGS_DIR="${LOGS_DIR:-${FRESH_ROOT}/mb2_logs}"
 mkdir -p "${LOGS_DIR}"
 GPU_A="${GPU_A:-0}"
 GPU_B="${GPU_B:-1}"
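 # Ports are env-overridable, as the commit message states.
 PORT_A="${PORT_A:-8000}"
 PORT_B="${PORT_B:-8001}"
 BP_A="${BP_A:-8998}"
 BP_B="${BP_B:-8999}"
 MASTER_A="${MASTER_A:-29500}"
 MASTER_B="${MASTER_B:-29501}"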
 stop_all() {
    pkill -9 -f "vllm serve" 2>/dev/null || true
    pkill -9 -f "EngineCore" 2>/dev/null || true
    sleep 2
 }
 case "${1:-start}" in
    stop)
        stop_all
        exit 0
        ;;
    status)
        for p in "${PORT_A}" "${PORT_B}"; do
            if curl -sf "http://127.0.0.1:${p}/health" >/dev/null 2>&1; then
                echo "port ${p}: UP"
            else
                echo "port ${p}: DOWN"
            fi
        done
        exit 0
        ;;
    start)
        ;;
    *)
        echo "Unknown command: $1 (expected start | status | stop)" >&2; exit 1;;
 esac
 stop_all
 source "${VENV}/bin/activate"
 launch() {
    local idx="$1" gpu="$2" port="$3" bp="$4" master="$5"
    echo "[mb2] launching instance ${idx} on GPU ${gpu}, port ${port}, bp ${bp}"
    PYTHONHASHSEED=42 \
    VLLM_MOONCAKE_BOOTSTRAP_PORT="${bp}" \
    CUDA_VISIBLE_DEVICES="${gpu}" \
    MASTER_PORT="${master}" \
    nohup vllm serve "${MODEL}" \
        --host 0.0.0.0 --port "${port}" \
        --tensor-parallel-size 1 \
        --trust-remote-code --enable-prefix-caching \
        --dtype auto --gpu-memory-utilization 0.9 \
        --max-model-len 200000 \
        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
        --enable-prompt-tokens-details \
        > "${LOGS_DIR}/vllm_${idx}_gpu${gpu}.log" 2>&1 &
    disown
 }
 launch A "${GPU_A}" "${PORT_A}" "${BP_A}" "${MASTER_A}"
 sleep 3
 launch B "${GPU_B}" "${PORT_B}" "${BP_B}" "${MASTER_B}"
 echo "[mb2] waiting for both /health endpoints..."
 for port in "${PORT_A}" "${PORT_B}"; do
    tries=0
    while ! curl -sf "http://127.0.0.1:${port}/health" >/dev/null 2>&1; do
        tries=$((tries+1))
        if [ ${tries} -gt 180 ]; then
            echo "[mb2] FATAL port ${port} did not come up in 6 min"
            tail -40 "${LOGS_DIR}/vllm_"*"_gpu"*".log" || true
            exit 1
        fi
        sleep 2
    done
    echo "  port=${port} ready"
 done
 echo "[mb2] both instances UP"
 echo "  A: 127.0.0.1:${PORT_A} (GPU ${GPU_A}, bp ${BP_A})"
 echo "  B: 127.0.0.1:${PORT_B} (GPU ${GPU_B}, bp ${BP_B})"
 echo "  logs: ${LOGS_DIR}"
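
The readiness loop above could also live in the Python client, so that a sweep never starts against a half-started pair. A minimal sketch, assuming the same `/health` endpoint and roughly the same 180 × 2 s budget as the shell script; `http_health`, `wait_until_ready` and the injectable `probe` / `sleep` parameters are hypothetical names introduced here, not part of this commit.

```python
import http.client
import time
import urllib.request
from typing import Callable


def http_health(port: int) -> bool:
    """True if GET /health on the local port answers with HTTP 200."""
    try:
        with urllib.request.urlopen(
            f"http://127.0.0.1:{port}/health", timeout=2.0
        ) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_ready(
    ports: list[int],
    probe: Callable[[int], bool] = http_health,
    max_tries: int = 180,
    interval_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until every port passes probe; raise TimeoutError otherwise."""
    for port in ports:
        for _ in range(max_tries):
            if probe(port):
                break
            sleep(interval_s)
        else:
            # The inner loop ran out of tries without a successful probe.
            raise TimeoutError(f"port {port} did not come up")
```

Calling `wait_until_ready([8000, 8001])` before the sweep would mirror the shell loop; passing `probe` and `sleep` explicitly keeps the function testable without running servers.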