v2 exp(a): add remote KV-store (RDMA) tier

Extends the hit-latency microbench to a 4th tier: a remote global-KV-store
hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector
instances (run_rdma.sh); for each prefix length, instance B serves the
request by pulling instance A's cached prefix over RDMA (do_remote_prefill,
via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the
timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering
GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with
context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA
15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win
over recompute (up to 15.8x here; the blog reports 46x) but pays the NIC
tax (~5-7 GB/s effective), so it sits a tier below local CPU and two below
GPU, which reinforces GPU-hit-first. README + figure updated to four tiers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 12:48:37 +08:00
parent ad754cfe0b
commit dc8e6dd5a8
5 changed files with 1137 additions and 26 deletions

@@ -28,29 +28,38 @@ Run: `GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh` then
 
 ## Results (dash0, 2026-05-30)
 
-### Exp (a) — GPU hit CPU hit ≫ miss (`figs/exp_a_tier_latency.png`)
+### Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss (`figs/exp_a_tier_latency.png`)
 
-TTFT (s, p50 over reps) to serve a reused prefix of length L. CPU-tier hits were
-100% verified via `vllm:external_prefix_cache_hits`.
+TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier.
+Local CPU-tier hits were 100% verified via `vllm:external_prefix_cache_hits`;
+the **remote KV-store** tier is a real cross-instance Mooncake hit — instance B
+serves the request by **pulling the cached prefix from instance A over RDMA**
+(`do_remote_prefill`) instead of recomputing (the Mooncake-Store-blog mechanism),
+measured with `microbench/fresh_setup/mb2_kv_transfer.py`.
 
-| prefix L | miss (recompute) | CPU-tier hit | GPU-tier hit | miss/CPU | **CPU/GPU** |
-|---:|---:|---:|---:|---:|---:|
-| 1k | 0.078 | 0.057 | 0.042 | 1.4× | 1.4× |
-| 4k | 0.261 | 0.064 | 0.046 | 4.1× | 1.4× |
-| 8k | 0.588 | 0.076 | 0.053 | 7.7× | 1.4× |
-| 16k | 1.547 | 0.105 | 0.063 | 14.8× | 1.7× |
-| 32k | 4.604 | 0.158 | 0.080 | 29.2× | 2.0× |
-| **64k** | **15.230** | **0.272** | **0.111** | **56.0×** | **2.4×** |
+| prefix L | miss (recompute) | **remote RDMA store** | CPU-tier (local) | GPU-tier (HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 1k | 0.078 | 0.061 | 0.057 | 0.042 | 1.3× | 1.1× | 1.4× |
+| 8k | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.5× |
+| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
+| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
+| **64k** | **15.230** | **0.966** | **0.272** | **0.111** | **15.8×** | **3.6×** | **2.4×** |
 
 - **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit returns the whole prefix from
 HBM, only the last token is recomputed.
 - **miss grows superlinearly** (→15.2 s at 64k): a miss pays the full prefill.
-- **CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit TTFT ≈
-GPU-hit + KV/PCIe + ~0.15 s connector overhead (the dashed PCIe floor sits just
-under the orange curve, confirming the decomposition).
-- **Takeaway:** among hits, **GPU beats CPU by 1.4-2.5×** and the gap widens with
-context. A CPU hit is a useful backstop (up to 56× better than recompute) but is
-strictly worse than keeping the prefix resident in HBM.
+- **local CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit
+TTFT ≈ GPU-hit + KV/PCIe + ~0.15 s overhead (dashed PCIe floor sits just under it).
+- **remote RDMA-store hit** is the L3 tier the Mooncake-Store blog advocates: it is
+a big win over recompute (**up to 16× lower TTFT**, consistent with the blog's
+46× at higher hit rates) — but it pays the **NIC tax** (~5-7 GB/s effective here,
+cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). So it
+is **3.6× slower than a local CPU hit and ~9× slower than a GPU hit** at 64k, and
+the gap **grows with context length**.
+- **Takeaway — the tier ordering is strict and widens with context:**
+**GPU < CPU-local < remote-RDMA-store ≪ miss.** A global KV store helps (vs
+recompute), which is why that approach exists; but every step *toward* the GPU is
+another 1.4-4× of TTFT. The reuse that matters most is the GPU-resident kind.
 
 ### Exp (b) — APC and latency knee at small GPU capacity (`figs/exp_b_capacity_knee.png`)
@@ -77,9 +86,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
 
 ## Conclusion (for §2.2)
 
-1. **Hits on GPU > hits on CPU** is now measured, not asserted: a GPU(HBM) hit is
-1.4-2.5× faster than a CPU(DRAM-offload) hit and 14-137× faster than recompute,
-with the GPU advantage growing in context length (Exp a).
+1. **The KV-tier hierarchy is now measured, not asserted** (Exp a):
+`GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss`. At 64k tokens a GPU
+hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA
+store hit, and 137× faster than recompute; the gaps **grow with context length**.
+A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16×
+here / 46× in the blog) — but it pays the NIC tax, so it sits a tier *below* local
+CPU and two below GPU. Each step toward the GPU is another 1.4-4× of TTFT.
 2. **You only need to hold the *active working set* on GPU.** Realized APC and
 latency saturate once HBM covers the concurrent sessions' working set (3.6 GB
 here); past that, extra capacity — and the entire CPU/storage tier built to chase
@@ -94,6 +107,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
 C1/f2c); it isolates the capacity→APC→latency mechanism. Knee *position* scales
 with concurrency × per-session working set.
 - Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
+- Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback
+through the NIC; MB2 showed intra ≈ inter, NIC-bound). `t_transfer` includes the
+request + 1-token decode + dst scheduling, so effective BW (~5-7 GB/s) is below the
+raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the
+wire transfer. The connector's retention-verify (`cached_followup`) is 0 because
+kv_both `do_remote_prefill` does not reinsert the pulled prefix into dst's
+persistent prefix cache — it does not affect the measured pull latency.
 - The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling
 (transient full residency / generated-token reuse); steady state is 72.9%.
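
The effective-bandwidth range quoted in the notes above can be sanity-checked from the remote-RDMA column of the Exp (a) table. This is a minimal sketch, not the benchmark's own calculation: it assumes a KV footprint of 96 KiB per token (48 layers, 4 KV heads, head dimension 128, bf16, K and V both stored), which is an assumption about the model configuration and is not stated in this README. `effective_gbps` is a hypothetical helper name.

```python
# Assumed KV-cache geometry (NOT taken from the README): 48 layers, 4 KV heads,
# head_dim 128, bf16 (2 bytes per element), with both K and V stored.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 4, 128, 2
KV_BYTES_PER_TOKEN = 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES


def effective_gbps(tokens: int, seconds: float) -> float:
    """End-to-end GB/s (decimal) implied by moving `tokens` of KV in `seconds`."""
    return tokens * KV_BYTES_PER_TOKEN / seconds / 1e9


# Remote-RDMA p50 latencies from the Exp (a) table: prefix length -> seconds.
RDMA_P50 = {8192: 0.151, 16384: 0.262, 32768: 0.680, 65536: 0.966}

for tokens, t in RDMA_P50.items():
    print(f"{tokens:>6} tokens: {effective_gbps(tokens, t):.1f} GB/s")
```

Under that assumed geometry the four rows come out at roughly 4.7 to 6.7 GB/s, in line with the ~5-7 GB/s quoted above. Because `t_transfer` also covers the request, a 1-token decode, and dst scheduling, these figures are a lower bound on the wire bandwidth rather than a measurement of it.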

@@ -18,6 +18,7 @@ def load(name):
 
 
 miss, gpu, cpu, pcie = load("miss.json"), load("gpu.json"), load("cpu.json"), load("pcie.json")
+rdma = load("rdma.json")
 
 
 def series(d):
@@ -27,14 +28,35 @@ def series(d):
     return [a for a, _ in items], [b for _, b in items]
+def rdma_series():
+    """Remote KV-store hit over RDMA: p50 of t_transfer_s per prefix length
+    (dst pulls the cached prefix from the remote pool instead of recomputing)."""
+    if not rdma:
+        return [], {}
+    import statistics
+    from collections import defaultdict
+    by = defaultdict(list)
+    for r in rdma["raw"]:
+        by[r["input_tokens"]].append(r["t_transfer_s"])
+    xs = sorted(by)
+    return xs, {L: statistics.median(by[L]) for L in xs}
+rdma_x, rdma_p50 = rdma_series()
 fig, ax = plt.subplots(figsize=(7.2, 5.0))
 for d, lab, mk, c in [(miss, "miss (recompute)", "o", "#d62728"),
-                      (cpu, "CPU-tier hit (DRAM offload)", "s", "#ff7f0e"),
+                      (cpu, "CPU-tier hit (local DRAM, PCIe)", "s", "#ff7f0e"),
                       (gpu, "GPU-tier hit (HBM APC)", "^", "#2ca02c")]:
     xs, ys = series(d)
     if xs:
         ax.plot(xs, ys, marker=mk, label=lab, color=c, linewidth=2, markersize=7)
+if rdma_x:
+    ax.plot(rdma_x, [rdma_p50[L] for L in rdma_x], marker="D", color="#9467bd",
+            linewidth=2, markersize=7, label="remote KV-store hit (Mooncake RDMA)")
 if pcie:
     items = sorted(((int(k), v["transfer_s"]) for k, v in pcie["by_length"].items()))
     xs = [a for a, _ in items]; ys = [b for _, b in items]
@@ -44,7 +66,8 @@ if pcie:
 ax.set_xscale("log", base=2); ax.set_yscale("log")
 ax.set_xlabel("Reused prefix length (tokens)")
 ax.set_ylabel("TTFT (s, log)")
-ax.set_title("Cost of serving a reused prefix from each KV tier\nQwen3-Coder-30B-A3B, 1xH20")
+ax.set_title("Cost of serving a reused prefix from each KV tier\n"
+             "Qwen3-Coder-30B-A3B, H20 (local tiers 1 GPU; RDMA pool 2 GPUs)")
 ax.grid(True, which="both", alpha=0.3)
 ax.legend()
 FIG.parent.mkdir(parents=True, exist_ok=True)
@@ -52,16 +75,18 @@ fig.tight_layout(); fig.savefig(FIG, dpi=140)
 print("wrote", FIG)
 
 # Table
-print(f"\n{'L':>7} {'miss(s)':>10} {'cpu(s)':>10} {'gpu(s)':>10} {'miss/cpu':>9} {'cpu/gpu':>9}")
+print(f"\n{'L':>7} {'miss':>9} {'rdma':>9} {'cpu':>9} {'gpu':>9} "
+      f"{'miss/rdma':>9} {'rdma/cpu':>9} {'cpu/gpu':>9}")
 allL = sorted({int(k) for d in (miss, gpu, cpu) if d for k in d["by_length"]})
 for L in allL:
     m = miss["by_length"].get(str(L), {}).get("ttft_p50") if miss else None
     c = cpu["by_length"].get(str(L), {}).get("ttft_p50") if cpu else None
     g = gpu["by_length"].get(str(L), {}).get("ttft_p50") if gpu else None
+    rd = rdma_p50.get(L)
     f = lambda x: f"{x:.4f}" if x is not None else " - "
-    r1 = f"{m/c:.1f}x" if (m and c) else " -"
-    r2 = f"{c/g:.1f}x" if (c and g) else " -"
-    print(f"{L:>7} {f(m):>10} {f(c):>10} {f(g):>10} {r1:>9} {r2:>9}")
+    rr = lambda a, b: f"{a/b:.1f}x" if (a and b) else " -"
+    print(f"{L:>7} {f(m):>9} {f(rd):>9} {f(c):>9} {f(g):>9} "
+          f"{rr(m,rd):>9} {rr(rd,c):>9} {rr(c,g):>9}")
 
 if cpu:
     vf = {k: v.get("verified_frac") for k, v in cpu["by_length"].items()}
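
The ratio columns printed by the table code above can be reproduced from the rounded p50 values published in the README. The sketch below hard-codes the 64k row; `ratio` is a standalone stand-in for the script's `rr` lambda, and the `row_64k` dictionary is an illustrative layout, not the script's JSON schema.

```python
from typing import Optional


def ratio(a: Optional[float], b: Optional[float]) -> str:
    """Format a/b as e.g. '3.6x'; return '-' when either side is missing or zero."""
    return f"{a / b:.1f}x" if (a and b) else "-"


# p50 TTFT in seconds at 64k tokens, copied from the Exp (a) results table.
row_64k = {"miss": 15.230, "rdma": 0.966, "cpu": 0.272, "gpu": 0.111}

print("miss/rdma", ratio(row_64k["miss"], row_64k["rdma"]))
print("rdma/cpu ", ratio(row_64k["rdma"], row_64k["cpu"]))
print("cpu/gpu  ", ratio(row_64k["cpu"], row_64k["gpu"]))
print("rdma/gpu ", ratio(row_64k["rdma"], row_64k["gpu"]))
print("missing  ", ratio(None, row_64k["gpu"]))
```

The first two ratios come out as 15.8x and 3.6x, matching the table. Ratios recomputed from the rounded three-decimal entries can differ in the last digit from the published ones, presumably because the script divides the unrounded p50 values.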

File diff suppressed because it is too large.

@@ -0,0 +1,53 @@
+#!/bin/bash
+# Exp (a) 4th tier: remote global-KV-store hit over RDMA (Mooncake).
+# Two kv_both MooncakeConnector instances (GPU0=src, GPU1=dst). For each prefix
+# length: src prefills+caches the KV, dst serves the request by PULLING that KV
+# over RDMA (do_remote_prefill) instead of recomputing -> that pull time is the
+# remote-store hit latency. Mirrors the Mooncake-Store blog mechanism.
+set -uo pipefail
+cd /home/admin/cpfs/wjh/agentic-kv
+PY=.venv/bin/python
+MODEL=/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
+OUT=v2/exp_a_tier_latency/results
+mkdir -p "$OUT"
+PIDS=()
+launch() { # $1 gpu, $2 http port, $3 bootstrap port, $4 master port
+  VLLM_MOONCAKE_BOOTSTRAP_PORT=$3 MASTER_PORT=$4 CUDA_VISIBLE_DEVICES=$1 VLLM_LOGGING_LEVEL=WARNING \
+    $PY -m vllm.entrypoints.openai.api_server --model "$MODEL" \
+    --host 0.0.0.0 --port $2 --tensor-parallel-size 1 --trust-remote-code \
+    --enable-prefix-caching --enforce-eager --dtype auto --max-model-len 70000 \
+    --gpu-memory-utilization 0.9 \
+    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
+    > "$OUT/vllm_rdma_$2.log" 2>&1 &
+  PIDS+=($!)
+}
+teardown() {
+  for p in "${PIDS[@]:-}"; do kill -TERM "$p" 2>/dev/null; done
+  sleep 6
+  for p in $(pgrep -f "VLLM::EngineCore"); do kill -9 "$p" 2>/dev/null; done
+  sleep 3
+}
+trap teardown EXIT
+echo ">>> launch 2 kv_both instances (GPU0:8000/bp8998, GPU1:8001/bp8999)"
+launch 0 8000 8998 29550
+launch 1 8001 8999 29551
+for port in 8000 8001; do
+  echo -n " wait health $port..."
+  timeout 900 bash -c "until curl -sf http://127.0.0.1:$port/health >/dev/null 2>&1; do sleep 5; done" \
+    && echo " ok" || { echo " FAIL"; tail -25 "$OUT/vllm_rdma_$port.log"; exit 1; }
+done
+for bp in 8998 8999; do
+  timeout 180 bash -c "until curl -s http://127.0.0.1:$bp/query >/dev/null 2>&1; do sleep 2; done"
+done
+echo " bootstrap ports ready."
+sleep 3
+$PY microbench/fresh_setup/mb2_kv_transfer.py \
+  --src-host 127.0.0.1 --dst-host 127.0.0.1 \
+  --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
+  --sizes 1024,2048,4096,8192,16384,32768,65536 --repeats 11 \
+  --label rdma-intra-node --out "$OUT/rdma.json"
+echo "=== exp (a) RDMA tier DONE ==="
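
The readiness waits in the script above (`until curl ...; do sleep N; done` under `timeout`) follow a poll-until-deadline pattern. A hedged Python equivalent is sketched below, using only the standard library. `http_ok` and `wait_until` are hypothetical helper names, not part of the repository; `http_ok` mirrors `curl -sf` (an HTTP error status counts as not ready), which is stricter than the plain `curl -s` used for the bootstrap ports.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout_s: float = 2.0) -> bool:
    """Return True if GET `url` answers with a 2xx status, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP 4xx/5xx, or a malformed URL.
        return False


def wait_until(probe: Callable[[], bool], deadline_s: float, interval_s: float,
               sleep: Callable[[float], None] = time.sleep,
               clock: Callable[[], float] = time.monotonic) -> bool:
    """Poll `probe` every `interval_s` seconds until it succeeds or the deadline passes."""
    start = clock()
    while True:
        if probe():
            return True
        if clock() - start >= deadline_s:
            return False
        sleep(interval_s)
```

For example, `wait_until(lambda: http_ok("http://127.0.0.1:8000/health"), deadline_s=900, interval_s=5)` corresponds to the first health loop. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.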

Binary file not shown (image: 81 KiB before, 100 KiB after).