agentic-kvc/v2/exp_a_tier_latency/run_rdma.sh
Gahow Wang dc8e6dd5a8 v2 exp(a): add remote KV-store (RDMA) tier
Extends the hit-latency microbench to a 4th tier: a remote global-KV-store
hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector
instances (run_rdma.sh); for each prefix length, instance B serves the
request by pulling instance A's cached prefix over RDMA (do_remote_prefill,
via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the
timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering
GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with
context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA
15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win
over recompute (the blog's 46x) but pays the NIC tax (~5-7 GB/s effective)
and sits a tier below local CPU and two below GPU -- reinforcing
GPU-hit-first. README + figure updated to four tiers.
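
The ratios and the "NIC tax" figure above can be sanity-checked with a few lines of shell arithmetic. This is a minimal sketch, not part of the experiment: the helper names are hypothetical, and the KV layout (48 layers, 4 KV heads, head dim 128, 2-byte elements) is an assumption about the model config that the commit does not state. Because the inputs are the rounded p50 values quoted above, the recomputed ratios can differ slightly from the quoted ones, which were presumably derived from unrounded measurements.

```bash
#!/bin/bash
# Sanity-check the 64k tier ratios and the effective RDMA bandwidth.
# All helper names are illustrative; none of them come from the repository.

# ratio A B -> A/B with one decimal place
ratio() { awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f", a / b }'; }

# kv_bytes_per_token LAYERS KV_HEADS HEAD_DIM DTYPE_BYTES
# The leading factor 2 accounts for storing both K and V.
kv_bytes_per_token() { echo $(( 2 * $1 * $2 * $3 * $4 )); }

# effective_gbps BYTES SECONDS -> GB/s (decimal GB) with one decimal place
effective_gbps() { awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f", n / s / 1e9 }'; }

# Rounded TTFT p50 values at 64k tokens, in seconds, as quoted above.
GPU=0.11 CPU=0.27 RDMA=0.97 MISS=15.2

echo "miss/RDMA $(ratio "$MISS" "$RDMA")x"
echo "RDMA/CPU  $(ratio "$RDMA" "$CPU")x"
echo "CPU/GPU   $(ratio "$CPU" "$GPU")x"

# ASSUMPTION: 48 layers, 4 KV heads, head_dim 128, 2-byte dtype.
per_token=$(kv_bytes_per_token 48 4 128 2)
total=$(( per_token * 65536 ))
echo "KV for 64k tokens: $total bytes"
echo "effective pull rate: $(effective_gbps "$total" "$RDMA") GB/s"
```

Under those assumptions the 64k prefix is roughly 6.4 GB of KV, so a 0.97 s pull lands inside the ~5-7 GB/s range quoted above. The timed value is a TTFT, so it also includes non-transfer overhead, and the true wire rate would be somewhat higher.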

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 12:48:37 +08:00

#!/bin/bash
# Exp (a) 4th tier: remote global-KV-store hit over RDMA (Mooncake).
# Two kv_both MooncakeConnector instances (GPU0=src, GPU1=dst). For each prefix
# length: src prefills+caches the KV, dst serves the request by PULLING that KV
# over RDMA (do_remote_prefill) instead of recomputing -> that pull time is the
# remote-store hit latency. Mirrors the Mooncake-Store blog mechanism.
set -uo pipefail
cd /home/admin/cpfs/wjh/agentic-kv || exit 1
PY=.venv/bin/python
MODEL=/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
OUT=v2/exp_a_tier_latency/results
mkdir -p "$OUT"
PIDS=()
launch() {  # $1 gpu, $2 http port, $3 bootstrap port, $4 master port
  VLLM_MOONCAKE_BOOTSTRAP_PORT=$3 MASTER_PORT=$4 CUDA_VISIBLE_DEVICES=$1 VLLM_LOGGING_LEVEL=WARNING \
    "$PY" -m vllm.entrypoints.openai.api_server --model "$MODEL" \
    --host 0.0.0.0 --port "$2" --tensor-parallel-size 1 --trust-remote-code \
    --enable-prefix-caching --enforce-eager --dtype auto --max-model-len 70000 \
    --gpu-memory-utilization 0.9 \
    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
    > "$OUT/vllm_rdma_$2.log" 2>&1 &
  PIDS+=($!)
}
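teardown() {
  # Ask the API servers started by launch() to exit cleanly first.
  for p in "${PIDS[@]:-}"; do kill -TERM "$p" 2>/dev/null; done
  sleep 6
  # Then force-kill leftover engine workers. Note: this matches every
  # VLLM::EngineCore process on the host, not only the ones started here.
  for p in $(pgrep -f "VLLM::EngineCore"); do kill -9 "$p" 2>/dev/null; done
  sleep 3
}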
trap teardown EXIT
echo ">>> launch 2 kv_both instances (GPU0:8000/bp8998, GPU1:8001/bp8999)"
launch 0 8000 8998 29550
launch 1 8001 8999 29551
for port in 8000 8001; do
  echo -n " wait health $port..."
  timeout 900 bash -c "until curl -sf http://127.0.0.1:$port/health >/dev/null 2>&1; do sleep 5; done" \
    && echo " ok" || { echo " FAIL"; tail -25 "$OUT/vllm_rdma_$port.log"; exit 1; }
done
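for bp in 8998 8999; do
  # Abort if a bootstrap port never answers; without -e a timeout here would be ignored.
  timeout 180 bash -c "until curl -s http://127.0.0.1:$bp/query >/dev/null 2>&1; do sleep 2; done" \
    || { echo " FAIL: bootstrap port $bp not ready"; exit 1; }
done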
echo " bootstrap ports ready."
sleep 3
"$PY" microbench/fresh_setup/mb2_kv_transfer.py \
  --src-host 127.0.0.1 --dst-host 127.0.0.1 \
  --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
  --sizes 1024,2048,4096,8192,16384,32768,65536 --repeats 11 \
  --label rdma-intra-node --out "$OUT/rdma.json" \
  || { echo "=== exp (a) RDMA tier FAILED ==="; exit 1; }
echo "=== exp (a) RDMA tier DONE ==="