v2 exp(a): add remote KV-store (RDMA) tier

Extends the hit-latency microbench to a 4th tier: a remote global-KV-store
hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector
instances are launched by run_rdma.sh. For each prefix length, instance B
serves the request by pulling instance A's cached prefix over RDMA
(do_remote_prefill, via microbench/fresh_setup/mb2_kv_transfer.py) instead
of recomputing. The timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering
GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with
context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s
-> miss/RDMA 15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x.

So a global RDMA store is a real win over recompute (up to ~16x here; the
blog reports 46x) but pays the NIC tax (~5-7 GB/s effective) and sits a tier
below local CPU and two below GPU, reinforcing GPU-hit-first.

README + figure updated to four tiers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
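The ratios quoted in the commit message can be re-derived from the 64k row of the README table in this commit. The sketch below is only a sanity check on the rounded three-decimal values; the names `TTFT_64K` and `tier_ratios` are illustrative and do not appear in the repository. With rounded inputs the CPU/GPU ratio comes out near 2.45, so the table's 2.4× presumably reflects unrounded measurements.

```python
# TTFT p50 in seconds at a 64k-token prefix, copied from the README table.
TTFT_64K = {"miss": 15.230, "rdma": 0.966, "cpu": 0.272, "gpu": 0.111}


def tier_ratios(ttft):
    """Return the slowdown of each tier relative to the next faster tier."""
    return {
        "miss/rdma": ttft["miss"] / ttft["rdma"],
        "rdma/cpu": ttft["rdma"] / ttft["cpu"],
        "cpu/gpu": ttft["cpu"] / ttft["gpu"],
    }


ratios = tier_ratios(TTFT_64K)
for name, value in ratios.items():
    print(f"{name}: {value:.2f}x")
```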
60
v2/README.md
@@ -28,29 +28,38 @@ Run: `GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh` then
 
 ## Results (dash0, 2026-05-30)
 
-### Exp (a) — GPU hit ≫ CPU hit ≫ miss (`figs/exp_a_tier_latency.png`)
+### Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss (`figs/exp_a_tier_latency.png`)
 
-TTFT (s, p50 over reps) to serve a reused prefix of length L. CPU-tier hits were
-100% verified via `vllm:external_prefix_cache_hits`.
+TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier.
+Local CPU-tier hits were 100% verified via `vllm:external_prefix_cache_hits`;
+the **remote KV-store** tier is a real cross-instance Mooncake hit — instance B
+serves the request by **pulling the cached prefix from instance A over RDMA**
+(`do_remote_prefill`) instead of recomputing (the Mooncake-Store-blog mechanism),
+measured with `microbench/fresh_setup/mb2_kv_transfer.py`.
 
-| prefix L | miss (recompute) | CPU-tier hit | GPU-tier hit | miss/CPU | **CPU/GPU** |
-|---:|---:|---:|---:|---:|---:|
-| 1k | 0.078 | 0.057 | 0.042 | 1.4× | 1.4× |
-| 4k | 0.261 | 0.064 | 0.046 | 4.1× | 1.4× |
-| 8k | 0.588 | 0.076 | 0.053 | 7.7× | 1.4× |
-| 16k | 1.547 | 0.105 | 0.063 | 14.8× | 1.7× |
-| 32k | 4.604 | 0.158 | 0.080 | 29.2× | 2.0× |
-| **64k** | **15.230** | **0.272** | **0.111** | **56.0×** | **2.4×** |
+| prefix L | miss (recompute) | **remote RDMA store** | CPU-tier (local) | GPU-tier (HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 1k | 0.078 | 0.061 | 0.057 | 0.042 | 1.3× | 1.1× | 1.4× |
+| 8k | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.4× |
+| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
+| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
+| **64k** | **15.230** | **0.966** | **0.272** | **0.111** | **15.8×** | **3.6×** | **2.4×** |
 
 - **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit returns the whole prefix from
 HBM, only the last token is recomputed.
 - **miss grows superlinearly** (→15.2 s at 64k): a miss pays the full prefill.
-- **CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit TTFT ≈
-GPU-hit + KV/PCIe + ~0.15 s connector overhead (the dashed PCIe floor sits just
-under the orange curve, confirming the decomposition).
-- **Takeaway:** among hits, **GPU beats CPU by 1.4–2.5×** and the gap widens with
-context. A CPU hit is a useful backstop (up to 56× better than recompute) but is
-strictly worse than keeping the prefix resident in HBM.
+- **local CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit
+TTFT ≈ GPU-hit + KV/PCIe + ~0.15 s overhead (dashed PCIe floor sits just under it).
+- **remote RDMA-store hit** is the L3 tier the Mooncake-Store blog advocates: it is
+a big win over recompute (**up to 16× lower TTFT**, consistent with the blog's
+46× at higher hit rates) — but it pays the **NIC tax** (~5–7 GB/s effective here,
+cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). So it
+is **3.6× slower than a local CPU hit and ~9× slower than a GPU hit** at 64k, and
+the gap **grows with context length**.
+- **Takeaway — the tier ordering is strict and widens with context:**
+**GPU < CPU-local < remote-RDMA-store ≪ miss.** A global KV store helps (vs
+recompute), which is why that approach exists; but every step *toward* the GPU is
+another 1.4–4× of TTFT. The reuse that matters most is the GPU-resident kind.
 
 ### Exp (b) — APC and latency knee at small GPU capacity (`figs/exp_b_capacity_knee.png`)
 
@@ -77,9 +86,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
 
 ## Conclusion (for §2.2)
 
-1. **Hits on GPU > hits on CPU** is now measured, not asserted: a GPU(HBM) hit is
-1.4–2.5× faster than a CPU(DRAM-offload) hit and 14–137× faster than recompute,
-with the GPU advantage growing in context length (Exp a).
+1. **The KV-tier hierarchy is now measured, not asserted** (Exp a):
+`GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss`. At 64k tokens a GPU
+hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA
+store hit, and 137× faster than recompute; the gaps **grow with context length**.
+A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16×
+here / 46× in the blog) — but it pays the NIC tax, so it sits a tier *below* local
+CPU and two below GPU. Each step toward the GPU is another 1.4–4× of TTFT.
 2. **You only need to hold the *active working set* on GPU.** Realized APC and
 latency saturate once HBM covers the concurrent sessions' working set (3.6 GB
 here); past that, extra capacity — and the entire CPU/storage tier built to chase
@@ -94,6 +107,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
 C1/f2c); it isolates the capacity→APC→latency mechanism. Knee *position* scales
 with concurrency × per-session working set.
 - Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
+- Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback
+through the NIC; MB2 showed intra ≈ inter, NIC-bound). `t_transfer` includes the
+request + 1-token decode + dst scheduling, so effective BW (~5–7 GB/s) is below the
+raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the
+wire transfer. The connector's retention-verify (`cached_followup`) is 0 because
+kv_both `do_remote_prefill` does not reinsert the pulled prefix into dst's
+persistent prefix cache — it does not affect the measured pull latency.
 - The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling
 (transient full residency / generated-token reuse); steady state is 72.9%.
 
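The README's "effective BW (~5–7 GB/s)" figure is the KV payload divided by the end-to-end `t_transfer`. The sketch below reproduces that arithmetic from the RDMA column of the table. The model dimensions (48 layers, 4 KV heads, head dimension 128, 2-byte KV dtype) are an assumption about Qwen3-Coder-30B-A3B that the commit does not state, and all names here are illustrative. If the real KV footprint per token differs, the bandwidths scale proportionally.

```python
# ASSUMED model dimensions (not stated in this commit); adjust to the real config.
ASSUMED_LAYERS = 48
ASSUMED_KV_HEADS = 4
ASSUMED_HEAD_DIM = 128
ASSUMED_DTYPE_BYTES = 2  # e.g. bf16/fp16 KV cache

# Remote-RDMA TTFT p50 in seconds, copied from the README table in this commit.
RDMA_TTFT_S = {1024: 0.061, 8192: 0.151, 16384: 0.262, 32768: 0.680, 65536: 0.966}


def kv_bytes_per_token(layers=ASSUMED_LAYERS, kv_heads=ASSUMED_KV_HEADS,
                       head_dim=ASSUMED_HEAD_DIM, dtype_bytes=ASSUMED_DTYPE_BYTES):
    """Bytes of K plus V cached per token across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def effective_gbps(tokens, seconds, bytes_per_token):
    """End-to-end bandwidth in GB/s (1e9 bytes) implied by one timed pull."""
    return tokens * bytes_per_token / seconds / 1e9


bpt = kv_bytes_per_token()
bandwidth = {L: effective_gbps(L, t, bpt) for L, t in RDMA_TTFT_S.items()}
for L, bw in bandwidth.items():
    print(f"{L:>6} tokens: {bw:.2f} GB/s effective")
```

Under these assumed dimensions the 8k to 64k rows fall roughly within the quoted 5 to 7 GB/s range, while the 1k row is far lower because the fixed request, decode and scheduling cost included in `t_transfer` dominates a small payload.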
@@ -18,6 +18,7 @@ def load(name):
 
 
 miss, gpu, cpu, pcie = load("miss.json"), load("gpu.json"), load("cpu.json"), load("pcie.json")
+rdma = load("rdma.json")
 
 
 def series(d):
@@ -27,14 +28,35 @@ def series(d):
     return [a for a, _ in items], [b for _, b in items]
 
 
+def rdma_series():
+    """Remote KV-store hit over RDMA: p50 of t_transfer_s per prefix length
+    (dst pulls the cached prefix from the remote pool instead of recomputing)."""
+    if not rdma:
+        return [], {}
+    import statistics
+    from collections import defaultdict
+    by = defaultdict(list)
+    for r in rdma["raw"]:
+        by[r["input_tokens"]].append(r["t_transfer_s"])
+    xs = sorted(by)
+    return xs, {L: statistics.median(by[L]) for L in xs}
+
+
+rdma_x, rdma_p50 = rdma_series()
+
+
 fig, ax = plt.subplots(figsize=(7.2, 5.0))
 for d, lab, mk, c in [(miss, "miss (recompute)", "o", "#d62728"),
-                      (cpu, "CPU-tier hit (DRAM offload)", "s", "#ff7f0e"),
+                      (cpu, "CPU-tier hit (local DRAM, PCIe)", "s", "#ff7f0e"),
                       (gpu, "GPU-tier hit (HBM APC)", "^", "#2ca02c")]:
     xs, ys = series(d)
     if xs:
         ax.plot(xs, ys, marker=mk, label=lab, color=c, linewidth=2, markersize=7)
 
+if rdma_x:
+    ax.plot(rdma_x, [rdma_p50[L] for L in rdma_x], marker="D", color="#9467bd",
+            linewidth=2, markersize=7, label="remote KV-store hit (Mooncake RDMA)")
+
 if pcie:
     items = sorted(((int(k), v["transfer_s"]) for k, v in pcie["by_length"].items()))
     xs = [a for a, _ in items]; ys = [b for _, b in items]
@@ -44,7 +66,8 @@ if pcie:
 ax.set_xscale("log", base=2); ax.set_yscale("log")
 ax.set_xlabel("Reused prefix length (tokens)")
 ax.set_ylabel("TTFT (s, log)")
-ax.set_title("Cost of serving a reused prefix from each KV tier\nQwen3-Coder-30B-A3B, 1xH20")
+ax.set_title("Cost of serving a reused prefix from each KV tier\n"
+             "Qwen3-Coder-30B-A3B, H20 (local tiers 1 GPU; RDMA pool 2 GPUs)")
 ax.grid(True, which="both", alpha=0.3)
 ax.legend()
 FIG.parent.mkdir(parents=True, exist_ok=True)
@@ -52,16 +75,18 @@ fig.tight_layout(); fig.savefig(FIG, dpi=140)
 print("wrote", FIG)
 
 # Table
-print(f"\n{'L':>7} {'miss(s)':>10} {'cpu(s)':>10} {'gpu(s)':>10} {'miss/cpu':>9} {'cpu/gpu':>9}")
+print(f"\n{'L':>7} {'miss':>9} {'rdma':>9} {'cpu':>9} {'gpu':>9} "
+      f"{'miss/rdma':>9} {'rdma/cpu':>9} {'cpu/gpu':>9}")
 allL = sorted({int(k) for d in (miss, gpu, cpu) if d for k in d["by_length"]})
 for L in allL:
     m = miss["by_length"].get(str(L), {}).get("ttft_p50") if miss else None
     c = cpu["by_length"].get(str(L), {}).get("ttft_p50") if cpu else None
     g = gpu["by_length"].get(str(L), {}).get("ttft_p50") if gpu else None
+    rd = rdma_p50.get(L)
     f = lambda x: f"{x:.4f}" if x is not None else " - "
-    r1 = f"{m/c:.1f}x" if (m and c) else " -"
-    r2 = f"{c/g:.1f}x" if (c and g) else " -"
-    print(f"{L:>7} {f(m):>10} {f(c):>10} {f(g):>10} {r1:>9} {r2:>9}")
+    rr = lambda a, b: f"{a/b:.1f}x" if (a and b) else " -"
+    print(f"{L:>7} {f(m):>9} {f(rd):>9} {f(c):>9} {f(g):>9} "
+          f"{rr(m,rd):>9} {rr(rd,c):>9} {rr(c,g):>9}")
 
 if cpu:
     vf = {k: v.get("verified_frac") for k, v in cpu["by_length"].items()}
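The `rdma_series()` helper added in this diff groups raw repetitions by prefix length and takes the median of each group. A standalone sketch of the same technique follows, taking the records as an argument so it can be exercised without `rdma.json`. The function name `p50_by_length` and the demo records are hypothetical; only the field names `input_tokens` and `t_transfer_s` come from the diff.

```python
import statistics
from collections import defaultdict


def p50_by_length(records, key="input_tokens", value="t_transfer_s"):
    """Group records by prefix length and return (sorted lengths, {length: median})."""
    by_len = defaultdict(list)
    for rec in records:
        by_len[rec[key]].append(rec[value])
    lengths = sorted(by_len)
    return lengths, {L: statistics.median(by_len[L]) for L in lengths}


# Synthetic records for illustration only; these are not measurements.
demo = [
    {"input_tokens": 2048, "t_transfer_s": 0.10},
    {"input_tokens": 1024, "t_transfer_s": 0.07},
    {"input_tokens": 1024, "t_transfer_s": 0.05},
    {"input_tokens": 1024, "t_transfer_s": 0.06},
]
lengths, p50 = p50_by_length(demo)
print(lengths, p50)
```

An empty record list yields `([], {})`, which matches the early return the diff uses when `rdma.json` is absent.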
1013
v2/exp_a_tier_latency/results/rdma.json
Normal file
File diff suppressed because it is too large
53
v2/exp_a_tier_latency/run_rdma.sh
Normal file
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Exp (a) 4th tier: remote global-KV-store hit over RDMA (Mooncake).
+# Two kv_both MooncakeConnector instances (GPU0=src, GPU1=dst). For each prefix
+# length: src prefills+caches the KV, dst serves the request by PULLING that KV
+# over RDMA (do_remote_prefill) instead of recomputing -> that pull time is the
+# remote-store hit latency. Mirrors the Mooncake-Store blog mechanism.
+set -uo pipefail
+cd /home/admin/cpfs/wjh/agentic-kv
+PY=.venv/bin/python
+MODEL=/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
+OUT=v2/exp_a_tier_latency/results
+mkdir -p "$OUT"
+PIDS=()
+
+launch() { # $1 gpu, $2 http port, $3 bootstrap port, $4 master port
+  VLLM_MOONCAKE_BOOTSTRAP_PORT=$3 MASTER_PORT=$4 CUDA_VISIBLE_DEVICES=$1 VLLM_LOGGING_LEVEL=WARNING \
+  $PY -m vllm.entrypoints.openai.api_server --model "$MODEL" \
+    --host 0.0.0.0 --port $2 --tensor-parallel-size 1 --trust-remote-code \
+    --enable-prefix-caching --enforce-eager --dtype auto --max-model-len 70000 \
+    --gpu-memory-utilization 0.9 \
+    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
+    > "$OUT/vllm_rdma_$2.log" 2>&1 &
+  PIDS+=($!)
+}
+teardown() {
+  for p in "${PIDS[@]:-}"; do kill -TERM "$p" 2>/dev/null; done
+  sleep 6
+  for p in $(pgrep -f "VLLM::EngineCore"); do kill -9 "$p" 2>/dev/null; done
+  sleep 3
+}
+trap teardown EXIT
+
+echo ">>> launch 2 kv_both instances (GPU0:8000/bp8998, GPU1:8001/bp8999)"
+launch 0 8000 8998 29550
+launch 1 8001 8999 29551
+for port in 8000 8001; do
+  echo -n " wait health $port..."
+  timeout 900 bash -c "until curl -sf http://127.0.0.1:$port/health >/dev/null 2>&1; do sleep 5; done" \
+    && echo " ok" || { echo " FAIL"; tail -25 "$OUT/vllm_rdma_$port.log"; exit 1; }
+done
+for bp in 8998 8999; do
+  timeout 180 bash -c "until curl -s http://127.0.0.1:$bp/query >/dev/null 2>&1; do sleep 2; done"
+done
+echo " bootstrap ports ready."
+sleep 3
+
+$PY microbench/fresh_setup/mb2_kv_transfer.py \
+  --src-host 127.0.0.1 --dst-host 127.0.0.1 \
+  --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
+  --sizes 1024,2048,4096,8192,16384,32768,65536 --repeats 11 \
+  --label rdma-intra-node --out "$OUT/rdma.json"
+
+echo "=== exp (a) RDMA tier DONE ==="
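The script waits for each instance by polling a health endpoint until it responds or a timeout expires (`timeout 900 bash -c "until curl ...; do sleep 5; done"`). The same poll-with-deadline pattern can be sketched in Python as below. `wait_until` and `fake_probe` are hypothetical names that are not part of the repository, and the probe is injected so the sketch runs without a server. A real probe might issue an HTTP GET against `/health` and return whether it succeeded.

```python
import time


def wait_until(probe, timeout_s, interval_s, clock=time.monotonic, sleep=time.sleep):
    """Call probe() repeatedly until it returns True or timeout_s elapses.

    Returns True if the probe succeeded, False if the deadline passed first.
    """
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# Stand-in probe for illustration: reports "healthy" on the third attempt.
attempts = {"n": 0}


def fake_probe():
    attempts["n"] += 1
    return attempts["n"] >= 3


ready = wait_until(fake_probe, timeout_s=5, interval_s=0, sleep=lambda s: None)
print("ready:", ready, "after", attempts["n"], "attempts")
```

Checking the probe before the deadline means a service that is already up returns immediately, and a zero timeout still allows exactly one attempt.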
Binary file not shown.
Before: size 81 KiB. After: size 100 KiB.