v2 exp(a): add remote KV-store (RDMA) tier

Extends the hit-latency microbench to a 4th tier: a remote global-KV-store hit over RDMA, the Mooncake-Store mechanism. run_rdma.sh launches two kv_both MooncakeConnector instances; for each prefix length, instance B serves the request by pulling instance A's cached prefix over RDMA (do_remote_prefill, via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA 15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x.

So a global RDMA store is a real win over recompute (up to 15.8x here; the blog reports 46x) but pays the NIC tax (~5-7 GB/s effective) and sits a tier below local CPU and two below GPU -- reinforcing GPU-hit-first. README + figure updated to four tiers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 12:48:37 +08:00
parent ad754cfe0b
commit dc8e6dd5a8
5 changed files with 1137 additions and 26 deletions
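
The 64k ratios quoted in the commit message can be re-derived from the p50 TTFTs in the README table of this commit. A minimal sketch, assuming the three-decimal table values as input; the helper name `tier_ratios` is illustrative and not part of the repo:

```python
def tier_ratios(miss, rdma, cpu, gpu):
    """Return TTFT ratios between tiers for one prefix length (inputs in seconds)."""
    return {
        "miss/rdma": miss / rdma,
        "rdma/cpu": rdma / cpu,
        "cpu/gpu": cpu / gpu,
        "rdma/gpu": rdma / gpu,
        "miss/gpu": miss / gpu,
    }


# 64k row of the README table (p50 TTFT, seconds).
ratios_64k = tier_ratios(miss=15.230, rdma=0.966, cpu=0.272, gpu=0.111)
for name, value in ratios_64k.items():
    print(f"{name:>10}: {value:.2f}x")
```

Because the inputs are already rounded to milliseconds, the last digit can differ slightly from ratios computed on the raw measurements.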
--- a/v2/README.md
+++ b/v2/README.md
@@ -28,29 +28,38 @@ Run: `GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh` then

 ## Results (dash0, 2026-05-30)

-### Exp (a) — GPU hit ≫ CPU hit ≫ miss  (`figs/exp_a_tier_latency.png`)
+### Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss  (`figs/exp_a_tier_latency.png`)

-TTFT (s, p50 over reps) to serve a reused prefix of length L. CPU-tier hits were
-100% verified via `vllm:external_prefix_cache_hits`.
+TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier.
+Local CPU-tier hits were 100% verified via `vllm:external_prefix_cache_hits`;
+the **remote KV-store** tier is a real cross-instance Mooncake hit — instance B
+serves the request by **pulling the cached prefix from instance A over RDMA**
+(`do_remote_prefill`) instead of recomputing (the Mooncake-Store-blog mechanism),
+measured with `microbench/fresh_setup/mb2_kv_transfer.py`.

-| prefix L | miss (recompute) | CPU-tier hit | GPU-tier hit | miss/CPU | **CPU/GPU** |
-|---:|---:|---:|---:|---:|---:|
-| 1k  | 0.078 | 0.057 | 0.042 | 1.4× | 1.4× |
-| 4k  | 0.261 | 0.064 | 0.046 | 4.1× | 1.4× |
-| 8k  | 0.588 | 0.076 | 0.053 | 7.7× | 1.4× |
-| 16k | 1.547 | 0.105 | 0.063 | 14.8× | 1.7× |
-| 32k | 4.604 | 0.158 | 0.080 | 29.2× | 2.0× |
-| **64k** | **15.230** | **0.272** | **0.111** | **56.0×** | **2.4×** |
+| prefix L | miss (recompute) | **remote RDMA store** | CPU-tier (local) | GPU-tier (HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 1k  | 0.078 | 0.061 | 0.057 | 0.042 | 1.3× | 1.1× | 1.4× |
+| 8k  | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.4× |
+| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
+| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
+| **64k** | **15.230** | **0.966** | **0.272** | **0.111** | **15.8×** | **3.6×** | **2.4×** |

 - **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit returns the whole prefix from
  HBM, only the last token is recomputed.
 - **miss grows superlinearly** (→15.2 s at 64k): a miss pays the full prefill.
- **CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit TTFT ≈
-  GPU-hit + KV/PCIe + ~0.15 s connector overhead (the dashed PCIe floor sits just
-  under the orange curve, confirming the decomposition).
- **Takeaway:** among hits, **GPU beats CPU by 1.4–2.5×** and the gap widens with
-  context. A CPU hit is a useful backstop (up to 56× better than recompute) but is
-  strictly worse than keeping the prefix resident in HBM.
+- **local CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit
+  TTFT ≈ GPU-hit + KV/PCIe + ~0.15 s overhead (dashed PCIe floor sits just under it).
+- **remote RDMA-store hit** is the L3 tier the Mooncake-Store blog advocates: it is
+  a big win over recompute (**up to 16× lower TTFT**, consistent with the blog's
+  46× at higher hit rates) — but it pays the **NIC tax** (~5–7 GB/s effective here,
+  cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). So it
+  is **3.6× slower than a local CPU hit and ~9× slower than a GPU hit** at 64k, and
+  the gap **grows with context length**.
+- **Takeaway — the tier ordering is strict and widens with context:**
+  **GPU < CPU-local < remote-RDMA-store ≪ miss.** A global KV store helps (vs
+  recompute), which is why that approach exists; but every step *toward* the GPU cuts
+  TTFT by another 1.4–4×. The reuse that matters most is the GPU-resident kind.

 ### Exp (b) — APC and latency knee at small GPU capacity  (`figs/exp_b_capacity_knee.png`)

@@ -77,9 +86,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.

 ## Conclusion (for §2.2)

-1. **Hits on GPU > hits on CPU** is now measured, not asserted: a GPU(HBM) hit is
-   1.4–2.5× faster than a CPU(DRAM-offload) hit and 14–137× faster than recompute,
-   with the GPU advantage growing in context length (Exp a).
+1. **The KV-tier hierarchy is now measured, not asserted** (Exp a):
+   `GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss`. At 64k tokens a GPU
+   hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA
+   store hit, and 137× faster than recompute; the gaps **grow with context length**.
+   A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16×
+   here / 46× in the blog) — but it pays the NIC tax, so it sits a tier *below* local
+   CPU and two below GPU. Each step toward the GPU cuts TTFT by another 1.4–4×.
 2. **You only need to hold the *active working set* on GPU.** Realized APC and
   latency saturate once HBM covers the concurrent sessions' working set (3.6 GB
   here); past that, extra capacity — and the entire CPU/storage tier built to chase
@@ -94,6 +107,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
  C1/f2c); it isolates the capacity→APC→latency mechanism. Knee *position* scales
  with concurrency × per-session working set.
 - Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
+- Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback
+  through the NIC; MB2 showed intra ≈ inter, NIC-bound). `t_transfer_s` includes the
+  request + 1-token decode + dst scheduling, so effective BW (~5–7 GB/s) is below the
+  raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the
+  wire transfer. The connector's retention-verify (`cached_followup`) is 0 because
+  kv_both `do_remote_prefill` does not reinsert the pulled prefix into dst's
+  persistent prefix cache — it does not affect the measured pull latency.
 - The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling
  (transient full residency / generated-token reuse); steady state is 72.9%.
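
The "effective BW" figure in the notes above is the KV size of the pulled prefix divided by the end-to-end remote-hit latency. A rough sketch of that arithmetic follows. The model shape (48 layers, 4 KV heads, head dimension 128) and the 16-bit KV cache are assumptions that are not stated in this commit, so check them against the checkpoint's `config.json`; the function names are illustrative.

```python
def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """KV-cache bytes per token: one K and one V vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def effective_gb_per_s(tokens, latency_s, bytes_per_token):
    """End-to-end effective bandwidth in GB/s (1 GB = 1e9 bytes) for a hit of `tokens`."""
    return tokens * bytes_per_token / latency_s / 1e9


# ASSUMPTION: model shape for Qwen3-Coder-30B-A3B with a 16-bit KV cache.
# Verify against the checkpoint's config.json before relying on these numbers.
BYTES_PER_TOKEN = kv_bytes_per_token(
    num_layers=48, num_kv_heads=4, head_dim=128, dtype_bytes=2
)

# Remote-RDMA p50 TTFT (s) per prefix length, from the README table in this commit.
rdma_p50 = {8192: 0.151, 16384: 0.262, 32768: 0.680, 65536: 0.966}
for tokens, latency in rdma_p50.items():
    gbps = effective_gb_per_s(tokens, latency, BYTES_PER_TOKEN)
    print(f"{tokens:>6} tokens: {gbps:.1f} GB/s")
```

Under these assumptions the estimates land close to the ~5–7 GB/s range quoted in the notes. The latency used here is the whole remote-hit TTFT (request, 1-token decode, and dst scheduling included), so this is a lower bound on the wire rate rather than a measurement of it.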

--- a/v2/exp_a_tier_latency/plot.py
+++ b/v2/exp_a_tier_latency/plot.py
@@ -18,6 +18,7 @@ def load(name):


 miss, gpu, cpu, pcie = load("miss.json"), load("gpu.json"), load("cpu.json"), load("pcie.json")
+rdma = load("rdma.json")


 def series(d):
@@ -27,14 +28,35 @@ def series(d):
    return [a for a, _ in items], [b for _, b in items]


+def rdma_series():
+    """Remote KV-store hit over RDMA: p50 of t_transfer_s per prefix length
+    (dst pulls the cached prefix from the remote pool instead of recomputing)."""
+    if not rdma:
+        return [], {}
+    import statistics
+    from collections import defaultdict
+    by = defaultdict(list)
+    for r in rdma["raw"]:
+        by[r["input_tokens"]].append(r["t_transfer_s"])
+    xs = sorted(by)
+    return xs, {L: statistics.median(by[L]) for L in xs}
+
+
+rdma_x, rdma_p50 = rdma_series()
+
+
 fig, ax = plt.subplots(figsize=(7.2, 5.0))
 for d, lab, mk, c in [(miss, "miss (recompute)", "o", "#d62728"),
-                      (cpu, "CPU-tier hit (DRAM offload)", "s", "#ff7f0e"),
+                      (cpu, "CPU-tier hit (local DRAM, PCIe)", "s", "#ff7f0e"),
                      (gpu, "GPU-tier hit (HBM APC)", "^", "#2ca02c")]:
    xs, ys = series(d)
    if xs:
        ax.plot(xs, ys, marker=mk, label=lab, color=c, linewidth=2, markersize=7)

+if rdma_x:
+    ax.plot(rdma_x, [rdma_p50[L] for L in rdma_x], marker="D", color="#9467bd",
+            linewidth=2, markersize=7, label="remote KV-store hit (Mooncake RDMA)")
+
 if pcie:
    items = sorted(((int(k), v["transfer_s"]) for k, v in pcie["by_length"].items()))
    xs = [a for a, _ in items]; ys = [b for _, b in items]
@@ -44,7 +66,8 @@ if pcie:
 ax.set_xscale("log", base=2); ax.set_yscale("log")
 ax.set_xlabel("Reused prefix length (tokens)")
 ax.set_ylabel("TTFT (s, log)")
-ax.set_title("Cost of serving a reused prefix from each KV tier\nQwen3-Coder-30B-A3B, 1xH20")
+ax.set_title("Cost of serving a reused prefix from each KV tier\n"
+             "Qwen3-Coder-30B-A3B, H20 (local tiers 1 GPU; RDMA pool 2 GPUs)")
 ax.grid(True, which="both", alpha=0.3)
 ax.legend()
 FIG.parent.mkdir(parents=True, exist_ok=True)
@@ -52,16 +75,18 @@ fig.tight_layout(); fig.savefig(FIG, dpi=140)
 print("wrote", FIG)

 # Table
-print(f"\n{'L':>7} {'miss(s)':>10} {'cpu(s)':>10} {'gpu(s)':>10} {'miss/cpu':>9} {'cpu/gpu':>9}")
+print(f"\n{'L':>7} {'miss':>9} {'rdma':>9} {'cpu':>9} {'gpu':>9} "
+      f"{'miss/rdma':>9} {'rdma/cpu':>9} {'cpu/gpu':>9}")
 allL = sorted({int(k) for d in (miss, gpu, cpu) if d for k in d["by_length"]})
 for L in allL:
    m = miss["by_length"].get(str(L), {}).get("ttft_p50") if miss else None
    c = cpu["by_length"].get(str(L), {}).get("ttft_p50") if cpu else None
    g = gpu["by_length"].get(str(L), {}).get("ttft_p50") if gpu else None
+    rd = rdma_p50.get(L)
    f = lambda x: f"{x:.4f}" if x is not None else "   -  "
-    r1 = f"{m/c:.1f}x" if (m and c) else "  -"
-    r2 = f"{c/g:.1f}x" if (c and g) else "  -"
-    print(f"{L:>7} {f(m):>10} {f(c):>10} {f(g):>10} {r1:>9} {r2:>9}")
+    rr = lambda a, b: f"{a/b:.1f}x" if (a and b) else "  -"
+    print(f"{L:>7} {f(m):>9} {f(rd):>9} {f(c):>9} {f(g):>9} "
+          f"{rr(m,rd):>9} {rr(rd,c):>9} {rr(c,g):>9}")

 if cpu:
    vf = {k: v.get("verified_frac") for k, v in cpu["by_length"].items()}
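
The aggregation added in `rdma_series()` is a group-by-length median over per-request records. A standalone sketch of the same idea, runnable without `rdma.json`: the field names `input_tokens` and `t_transfer_s` come from the diff above, while the helper name and the sample values are invented for illustration.

```python
import statistics
from collections import defaultdict


def p50_by_length(records):
    """Group per-request records by prompt length and take the median transfer time."""
    by_length = defaultdict(list)
    for rec in records:
        by_length[rec["input_tokens"]].append(rec["t_transfer_s"])
    lengths = sorted(by_length)
    return lengths, {n: statistics.median(by_length[n]) for n in lengths}


# Hypothetical records; the values are invented for illustration only.
sample = [
    {"input_tokens": 1024, "t_transfer_s": 0.060},
    {"input_tokens": 1024, "t_transfer_s": 0.064},
    {"input_tokens": 1024, "t_transfer_s": 0.061},
    {"input_tokens": 4096, "t_transfer_s": 0.100},
    {"input_tokens": 4096, "t_transfer_s": 0.090},
]
lengths, p50 = p50_by_length(sample)
print(lengths, p50)
```

With an odd repeat count, such as the 11 repeats used in `run_rdma.sh`, `statistics.median` returns an actually observed sample; with an even count it averages the two middle values.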
--- a/v2/exp_a_tier_latency/results/rdma.json
+++ b/v2/exp_a_tier_latency/results/rdma.json
--- a/v2/exp_a_tier_latency/run_rdma.sh
+++ b/v2/exp_a_tier_latency/run_rdma.sh
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Exp (a) 4th tier: remote global-KV-store hit over RDMA (Mooncake).
+# Two kv_both MooncakeConnector instances (GPU0=src, GPU1=dst). For each prefix
+# length: src prefills+caches the KV, dst serves the request by PULLING that KV
+# over RDMA (do_remote_prefill) instead of recomputing -> that pull time is the
+# remote-store hit latency. Mirrors the Mooncake-Store blog mechanism.
+set -uo pipefail
+cd /home/admin/cpfs/wjh/agentic-kv
+PY=.venv/bin/python
+MODEL=/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
+OUT=v2/exp_a_tier_latency/results
+mkdir -p "$OUT"
+PIDS=()
+
+launch() {  # $1 gpu, $2 http port, $3 bootstrap port, $4 master port
+    VLLM_MOONCAKE_BOOTSTRAP_PORT=$3 MASTER_PORT=$4 CUDA_VISIBLE_DEVICES=$1 VLLM_LOGGING_LEVEL=WARNING \
+    $PY -m vllm.entrypoints.openai.api_server --model "$MODEL" \
+        --host 0.0.0.0 --port $2 --tensor-parallel-size 1 --trust-remote-code \
+        --enable-prefix-caching --enforce-eager --dtype auto --max-model-len 70000 \
+        --gpu-memory-utilization 0.9 \
+        --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
+        > "$OUT/vllm_rdma_$2.log" 2>&1 &
+    PIDS+=($!)
+}
+teardown() {
+    for p in "${PIDS[@]:-}"; do kill -TERM "$p" 2>/dev/null; done
+    sleep 6
+    for p in $(pgrep -f "VLLM::EngineCore"); do kill -9 "$p" 2>/dev/null; done
+    sleep 3
+}
+trap teardown EXIT
+
+echo ">>> launch 2 kv_both instances (GPU0:8000/bp8998, GPU1:8001/bp8999)"
+launch 0 8000 8998 29550
+launch 1 8001 8999 29551
+for port in 8000 8001; do
+    echo -n "  wait health $port..."
+    timeout 900 bash -c "until curl -sf http://127.0.0.1:$port/health >/dev/null 2>&1; do sleep 5; done" \
+        && echo " ok" || { echo " FAIL"; tail -25 "$OUT/vllm_rdma_$port.log"; exit 1; }
+done
+for bp in 8998 8999; do
+    timeout 180 bash -c "until curl -s http://127.0.0.1:$bp/query >/dev/null 2>&1; do sleep 2; done"
+done
+echo "  bootstrap ports ready."
+sleep 3
+
+$PY microbench/fresh_setup/mb2_kv_transfer.py \
+    --src-host 127.0.0.1 --dst-host 127.0.0.1 \
+    --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
+    --sizes 1024,2048,4096,8192,16384,32768,65536 --repeats 11 \
+    --label rdma-intra-node --out "$OUT/rdma.json"
+
+echo "=== exp (a) RDMA tier DONE ==="
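
The `timeout ... until curl ...; do sleep N; done` readiness loops in `run_rdma.sh` follow a generic poll-with-deadline pattern. A hedged Python sketch of that pattern, not a replacement for the script: `http_ok` and `wait_until` are hypothetical helper names, and only standard-library calls are used.

```python
import http.client
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    """Return True if a GET on `url` answers with a 2xx status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeouts, and 4xx/5xx responses all count as "not ready".
        return False


def wait_until(check, timeout_s, interval_s):
    """Poll `check()` until it returns True or `timeout_s` elapses; return the outcome."""
    deadline = time.monotonic() + timeout_s
    while True:
        if check():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A call such as `wait_until(lambda: http_ok("http://127.0.0.1:8000/health"), timeout_s=900, interval_s=5)` would mirror the script's health wait for the first instance. Using `time.monotonic()` keeps the deadline immune to wall-clock adjustments.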
--- a/v2/figs/exp_a_tier_latency.png
+++ b/v2/figs/exp_a_tier_latency.png