v2 exp(a): add remote KV-store (RDMA) tier
Extends the hit-latency microbench to a 4th tier: a remote global-KV-store hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector instances (run_rdma.sh); for each prefix length, instance B serves the request by pulling instance A's cached prefix over RDMA (do_remote_prefill, via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA 15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x.

So a global RDMA store is a real win over recompute (15.8x here; the blog reports 46x) but pays the NIC tax (~5-7 GB/s effective) and sits a tier below local CPU and two below GPU -- reinforcing GPU-hit-first.

README + figure updated to four tiers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
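The 64k ratios quoted in the commit message can be re-derived from the three-decimal p50 values in the README table of this commit. The sketch below is only a sanity check: the dictionary keys and function names are hypothetical, and because the table values are rounded, the results agree with the quoted ratios only to within rounding.

```python
# p50 TTFT in seconds at a 64k-token prefix, copied from the README table.
TTFT_64K = {"gpu": 0.111, "cpu": 0.272, "rdma": 0.966, "miss": 15.230}


def tier_ratios(ttft):
    """Return how many times slower each tier is than a faster one."""
    return {
        "miss/rdma": ttft["miss"] / ttft["rdma"],
        "rdma/cpu": ttft["rdma"] / ttft["cpu"],
        "cpu/gpu": ttft["cpu"] / ttft["gpu"],
        "rdma/gpu": ttft["rdma"] / ttft["gpu"],
        "miss/gpu": ttft["miss"] / ttft["gpu"],
    }


def is_strictly_ordered(ttft):
    """Check the claimed ordering GPU < CPU < remote RDMA < miss."""
    return ttft["gpu"] < ttft["cpu"] < ttft["rdma"] < ttft["miss"]


if __name__ == "__main__":
    for name, value in tier_ratios(TTFT_64K).items():
        print(f"{name:>9}: {value:.1f}x")
    print("strict ordering:", is_strictly_ordered(TTFT_64K))
```

The same function can be pointed at any other row of the table to check how the gaps change with prefix length.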
60
v2/README.md
@@ -28,29 +28,38 @@ Run: `GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh` then
 
 ## Results (dash0, 2026-05-30)
 
-### Exp (a) — GPU hit ≫ CPU hit ≫ miss (`figs/exp_a_tier_latency.png`)
+### Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss (`figs/exp_a_tier_latency.png`)
 
-TTFT (s, p50 over reps) to serve a reused prefix of length L. CPU-tier hits were
-100% verified via `vllm:external_prefix_cache_hits`.
+TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier.
+Local CPU-tier hits were 100% verified via `vllm:external_prefix_cache_hits`;
+the **remote KV-store** tier is a real cross-instance Mooncake hit — instance B
+serves the request by **pulling the cached prefix from instance A over RDMA**
+(`do_remote_prefill`) instead of recomputing (the Mooncake-Store-blog mechanism),
+measured with `microbench/fresh_setup/mb2_kv_transfer.py`.
 
-| prefix L | miss (recompute) | CPU-tier hit | GPU-tier hit | miss/CPU | **CPU/GPU** |
-|---:|---:|---:|---:|---:|---:|
-| 1k | 0.078 | 0.057 | 0.042 | 1.4× | 1.4× |
-| 4k | 0.261 | 0.064 | 0.046 | 4.1× | 1.4× |
-| 8k | 0.588 | 0.076 | 0.053 | 7.7× | 1.4× |
-| 16k | 1.547 | 0.105 | 0.063 | 14.8× | 1.7× |
-| 32k | 4.604 | 0.158 | 0.080 | 29.2× | 2.0× |
-| **64k** | **15.230** | **0.272** | **0.111** | **56.0×** | **2.4×** |
+| prefix L | miss (recompute) | **remote RDMA store** | CPU-tier (local) | GPU-tier (HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
+|---:|---:|---:|---:|---:|---:|---:|---:|
+| 1k | 0.078 | 0.061 | 0.057 | 0.042 | 1.3× | 1.1× | 1.4× |
+| 8k | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.5× |
+| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
+| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
+| **64k** | **15.230** | **0.966** | **0.272** | **0.111** | **15.8×** | **3.6×** | **2.4×** |
 
 - **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit returns the whole prefix from
   HBM, only the last token is recomputed.
 - **miss grows superlinearly** (→15.2 s at 64k): a miss pays the full prefill.
-- **CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit TTFT ≈
-  GPU-hit + KV/PCIe + ~0.15 s connector overhead (the dashed PCIe floor sits just
-  under the orange curve, confirming the decomposition).
-- **Takeaway:** among hits, **GPU beats CPU by 1.4–2.5×** and the gap widens with
-  context. A CPU hit is a useful backstop (up to 56× better than recompute) but is
-  strictly worse than keeping the prefix resident in HBM.
+- **local CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit
+  TTFT ≈ GPU-hit + KV/PCIe + ~0.15 s overhead (dashed PCIe floor sits just under it).
+- **remote RDMA-store hit** is the L3 tier the Mooncake-Store blog advocates: it is
+  a big win over recompute (**up to 16× lower TTFT**, consistent with the blog's
+  46× at higher hit rates) — but it pays the **NIC tax** (~5–7 GB/s effective here,
+  cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). So it
+  is **3.6× slower than a local CPU hit and ~9× slower than a GPU hit** at 64k, and
+  the gap **grows with context length**.
+- **Takeaway — the tier ordering is strict and widens with context:**
+  **GPU < CPU-local < remote-RDMA-store ≪ miss.** A global KV store helps (vs
+  recompute), which is why that approach exists; but every step *toward* the GPU is
+  another 1.4–4× of TTFT. The reuse that matters most is the GPU-resident kind.
 
 ### Exp (b) — APC and latency knee at small GPU capacity (`figs/exp_b_capacity_knee.png`)
 
@@ -77,9 +86,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
 
 ## Conclusion (for §2.2)
 
-1. **Hits on GPU > hits on CPU** is now measured, not asserted: a GPU(HBM) hit is
-   1.4–2.5× faster than a CPU(DRAM-offload) hit and 14–137× faster than recompute,
-   with the GPU advantage growing in context length (Exp a).
+1. **The KV-tier hierarchy is now measured, not asserted** (Exp a):
+   `GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss`. At 64k tokens a GPU
+   hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA
+   store hit, and 137× faster than recompute; the gaps **grow with context length**.
+   A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16×
+   here / 46× in the blog) — but it pays the NIC tax, so it sits a tier *below* local
+   CPU and two below GPU. Each step toward the GPU is another 1.4–4× of TTFT.
 2. **You only need to hold the *active working set* on GPU.** Realized APC and
    latency saturate once HBM covers the concurrent sessions' working set (3.6 GB
    here); past that, extra capacity — and the entire CPU/storage tier built to chase
@@ -94,6 +107,13 @@ intra-session APC ceiling 71%), sweeping GPU KV capacity.
   C1/f2c); it isolates the capacity→APC→latency mechanism. Knee *position* scales
   with concurrency × per-session working set.
 - Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
+- Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback
+  through the NIC; MB2 showed intra ≈ inter, NIC-bound). `t_transfer` includes the
+  request + 1-token decode + dst scheduling, so effective BW (~5–7 GB/s) is below the
+  raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the
+  wire transfer. The connector's retention-verify (`cached_followup`) is 0 because
+  kv_both `do_remote_prefill` does not reinsert the pulled prefix into dst's
+  persistent prefix cache — it does not affect the measured pull latency.
 - The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling
   (transient full residency / generated-token reuse); steady state is 72.9%.
 
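The "effective BW" figure in the limitations note above is KV bytes moved divided by the end-to-end remote-hit time. A minimal sketch of that arithmetic follows. The KV-cache shape constants are an assumption and are not stated anywhere in this commit (48 layers, 4 KV heads, head dimension 128, 2-byte elements for Qwen3-Coder-30B-A3B); the function names are hypothetical. Substitute the real model config before relying on the numbers.

```python
# ASSUMPTION (not from the commit): KV-cache shape of the served model.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
BYTES_PER_ELEM = 2  # bf16/fp16

# p50 remote-RDMA hit latency in seconds, copied from the README table.
RDMA_P50_S = {1024: 0.061, 8192: 0.151, 16384: 0.262, 32768: 0.680, 65536: 0.966}


def kv_bytes(tokens):
    """Bytes of K and V cache for `tokens` tokens under the assumed shape."""
    per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * BYTES_PER_ELEM
    return tokens * per_token


def effective_gbps(tokens, seconds):
    """End-to-end effective bandwidth in GB/s (decimal gigabytes)."""
    return kv_bytes(tokens) / seconds / 1e9


if __name__ == "__main__":
    for tokens, seconds in sorted(RDMA_P50_S.items()):
        print(f"{tokens:>6} tokens: {effective_gbps(tokens, seconds):.2f} GB/s")
```

Because the timed interval also covers the request, a 1-token decode, and destination scheduling, short prefixes are dominated by fixed overhead, so this estimate is expected to sit further below the raw link rate at small sizes.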
@@ -18,6 +18,7 @@ def load(name):
 
 
 miss, gpu, cpu, pcie = load("miss.json"), load("gpu.json"), load("cpu.json"), load("pcie.json")
+rdma = load("rdma.json")
 
 
 def series(d):
@@ -27,14 +28,35 @@ def series(d):
     return [a for a, _ in items], [b for _, b in items]
 
 
+def rdma_series():
+    """Remote KV-store hit over RDMA: p50 of t_transfer_s per prefix length
+    (dst pulls the cached prefix from the remote pool instead of recomputing)."""
+    if not rdma:
+        return [], {}
+    import statistics
+    from collections import defaultdict
+    by = defaultdict(list)
+    for r in rdma["raw"]:
+        by[r["input_tokens"]].append(r["t_transfer_s"])
+    xs = sorted(by)
+    return xs, {L: statistics.median(by[L]) for L in xs}
+
+
+rdma_x, rdma_p50 = rdma_series()
+
+
 fig, ax = plt.subplots(figsize=(7.2, 5.0))
 for d, lab, mk, c in [(miss, "miss (recompute)", "o", "#d62728"),
-                      (cpu, "CPU-tier hit (DRAM offload)", "s", "#ff7f0e"),
+                      (cpu, "CPU-tier hit (local DRAM, PCIe)", "s", "#ff7f0e"),
                       (gpu, "GPU-tier hit (HBM APC)", "^", "#2ca02c")]:
     xs, ys = series(d)
     if xs:
         ax.plot(xs, ys, marker=mk, label=lab, color=c, linewidth=2, markersize=7)
 
+if rdma_x:
+    ax.plot(rdma_x, [rdma_p50[L] for L in rdma_x], marker="D", color="#9467bd",
+            linewidth=2, markersize=7, label="remote KV-store hit (Mooncake RDMA)")
+
 if pcie:
     items = sorted(((int(k), v["transfer_s"]) for k, v in pcie["by_length"].items()))
     xs = [a for a, _ in items]; ys = [b for _, b in items]
@@ -44,7 +66,8 @@ if pcie:
 ax.set_xscale("log", base=2); ax.set_yscale("log")
 ax.set_xlabel("Reused prefix length (tokens)")
 ax.set_ylabel("TTFT (s, log)")
-ax.set_title("Cost of serving a reused prefix from each KV tier\nQwen3-Coder-30B-A3B, 1xH20")
+ax.set_title("Cost of serving a reused prefix from each KV tier\n"
+             "Qwen3-Coder-30B-A3B, H20 (local tiers 1 GPU; RDMA pool 2 GPUs)")
 ax.grid(True, which="both", alpha=0.3)
 ax.legend()
 FIG.parent.mkdir(parents=True, exist_ok=True)
@@ -52,16 +75,18 @@ fig.tight_layout(); fig.savefig(FIG, dpi=140)
 print("wrote", FIG)
 
 # Table
-print(f"\n{'L':>7} {'miss(s)':>10} {'cpu(s)':>10} {'gpu(s)':>10} {'miss/cpu':>9} {'cpu/gpu':>9}")
+print(f"\n{'L':>7} {'miss':>9} {'rdma':>9} {'cpu':>9} {'gpu':>9} "
+      f"{'miss/rdma':>9} {'rdma/cpu':>9} {'cpu/gpu':>9}")
 allL = sorted({int(k) for d in (miss, gpu, cpu) if d for k in d["by_length"]})
 for L in allL:
     m = miss["by_length"].get(str(L), {}).get("ttft_p50") if miss else None
     c = cpu["by_length"].get(str(L), {}).get("ttft_p50") if cpu else None
     g = gpu["by_length"].get(str(L), {}).get("ttft_p50") if gpu else None
+    rd = rdma_p50.get(L)
     f = lambda x: f"{x:.4f}" if x is not None else " - "
-    r1 = f"{m/c:.1f}x" if (m and c) else " -"
-    r2 = f"{c/g:.1f}x" if (c and g) else " -"
-    print(f"{L:>7} {f(m):>10} {f(c):>10} {f(g):>10} {r1:>9} {r2:>9}")
+    rr = lambda a, b: f"{a/b:.1f}x" if (a and b) else " -"
+    print(f"{L:>7} {f(m):>9} {f(rd):>9} {f(c):>9} {f(g):>9} "
+          f"{rr(m,rd):>9} {rr(rd,c):>9} {rr(c,g):>9}")
 
 if cpu:
     vf = {k: v.get("verified_frac") for k, v in cpu["by_length"].items()}
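The per-length p50 that `rdma_series()` computes in the diff above can be tried out in isolation. The sketch below is a standalone variant that takes the record list as an argument instead of reading the module-level `rdma` dict; the record keys `input_tokens` and `t_transfer_s` come from the diff, while the function name `p50_by_length` and the sample records are made up for illustration.

```python
import statistics
from collections import defaultdict


def p50_by_length(records):
    """Group transfer times by prompt length and take the median of each group."""
    by_length = defaultdict(list)
    for rec in records:
        by_length[rec["input_tokens"]].append(rec["t_transfer_s"])
    lengths = sorted(by_length)
    return lengths, {n: statistics.median(by_length[n]) for n in lengths}


# Made-up records in the same shape as rdma.json["raw"] (illustrative values only).
sample = [
    {"input_tokens": 2048, "t_transfer_s": 0.09},
    {"input_tokens": 1024, "t_transfer_s": 0.07},
    {"input_tokens": 1024, "t_transfer_s": 0.05},
    {"input_tokens": 1024, "t_transfer_s": 0.06},
]

if __name__ == "__main__":
    lengths, p50 = p50_by_length(sample)
    for n in lengths:
        print(n, p50[n])
```

Taking the records as a parameter keeps the helper testable without the results directory; an empty input yields an empty list and an empty dict, matching the guard in the original function.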
1013
v2/exp_a_tier_latency/results/rdma.json
Normal file
File diff suppressed because it is too large
53
v2/exp_a_tier_latency/run_rdma.sh
Normal file
@@ -0,0 +1,53 @@
+#!/bin/bash
+# Exp (a) 4th tier: remote global-KV-store hit over RDMA (Mooncake).
+# Two kv_both MooncakeConnector instances (GPU0=src, GPU1=dst). For each prefix
+# length: src prefills+caches the KV, dst serves the request by PULLING that KV
+# over RDMA (do_remote_prefill) instead of recomputing -> that pull time is the
+# remote-store hit latency. Mirrors the Mooncake-Store blog mechanism.
+set -uo pipefail
+cd /home/admin/cpfs/wjh/agentic-kv
+PY=.venv/bin/python
+MODEL=/home/admin/cpfs/wjh/models/Qwen/Qwen3-Coder-30B-A3B-Instruct
+OUT=v2/exp_a_tier_latency/results
+mkdir -p "$OUT"
+PIDS=()
+
+launch() { # $1 gpu, $2 http port, $3 bootstrap port, $4 master port
+  VLLM_MOONCAKE_BOOTSTRAP_PORT=$3 MASTER_PORT=$4 CUDA_VISIBLE_DEVICES=$1 VLLM_LOGGING_LEVEL=WARNING \
+  $PY -m vllm.entrypoints.openai.api_server --model "$MODEL" \
+    --host 0.0.0.0 --port $2 --tensor-parallel-size 1 --trust-remote-code \
+    --enable-prefix-caching --enforce-eager --dtype auto --max-model-len 70000 \
+    --gpu-memory-utilization 0.9 \
+    --kv-transfer-config '{"kv_connector":"MooncakeConnector","kv_role":"kv_both"}' \
+    > "$OUT/vllm_rdma_$2.log" 2>&1 &
+  PIDS+=($!)
+}
+teardown() {
+  for p in "${PIDS[@]:-}"; do kill -TERM "$p" 2>/dev/null; done
+  sleep 6
+  for p in $(pgrep -f "VLLM::EngineCore"); do kill -9 "$p" 2>/dev/null; done
+  sleep 3
+}
+trap teardown EXIT
+
+echo ">>> launch 2 kv_both instances (GPU0:8000/bp8998, GPU1:8001/bp8999)"
+launch 0 8000 8998 29550
+launch 1 8001 8999 29551
+for port in 8000 8001; do
+  echo -n " wait health $port..."
+  timeout 900 bash -c "until curl -sf http://127.0.0.1:$port/health >/dev/null 2>&1; do sleep 5; done" \
+    && echo " ok" || { echo " FAIL"; tail -25 "$OUT/vllm_rdma_$port.log"; exit 1; }
+done
+for bp in 8998 8999; do
+  timeout 180 bash -c "until curl -s http://127.0.0.1:$bp/query >/dev/null 2>&1; do sleep 2; done"
+done
+echo " bootstrap ports ready."
+sleep 3
+
+$PY microbench/fresh_setup/mb2_kv_transfer.py \
+  --src-host 127.0.0.1 --dst-host 127.0.0.1 \
+  --src-port 8000 --dst-port 8001 --src-bp 8998 --dst-bp 8999 \
+  --sizes 1024,2048,4096,8192,16384,32768,65536 --repeats 11 \
+  --label rdma-intra-node --out "$OUT/rdma.json"
+
+echo "=== exp (a) RDMA tier DONE ==="
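The script's readiness gate is a poll-until-healthy loop with a hard timeout (`timeout 900 ... until curl -sf .../health`). For readers who would rather drive the same wait from Python, here is a rough equivalent using only the standard library. It is a sketch, not part of the commit: `wait_until` and `http_ok` are hypothetical helper names, while the ports, the 900 s budget, and the 5 s interval are taken from the script.

```python
import time
import urllib.error
import urllib.request


def http_ok(url, timeout_s=2.0):
    """True if the URL answers with a 2xx status (roughly what `curl -sf` accepts)."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_s) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until(probe, timeout_s, interval_s, clock=time.monotonic, sleep=time.sleep):
    """Call `probe` repeatedly until it returns True or `timeout_s` elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


if __name__ == "__main__":
    for port in (8000, 8001):
        url = f"http://127.0.0.1:{port}/health"
        ready = wait_until(lambda: http_ok(url), timeout_s=900, interval_s=5)
        print(port, "ok" if ready else "FAIL")
```

Passing the probe in as a callable keeps the timing logic separate from the HTTP check, so the loop can be exercised without a running server.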
Binary file not shown.
Before: 81 KiB. After: 100 KiB.