Files

Gahow Wang dc8e6dd5a8 v2 exp(a): add remote KV-store (RDMA) tier

Extends the hit-latency microbench to a 4th tier: a remote global-KV-store
hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector
instances (run_rdma.sh); for each prefix length, instance B serves the
request by pulling instance A's cached prefix over RDMA (do_remote_prefill,
via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the
timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering
GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with
context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA
15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win
over recompute (the blog's 46x) but pays the NIC tax (~5-7 GB/s effective)
and sits a tier below local CPU and two below GPU -- reinforcing
GPU-hit-first. README + figure updated to four tiers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 12:48:37 +08:00

common

v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss)

2026-05-30 11:23:04 +08:00

exp_a_tier_latency

v2 exp(a): add remote KV-store (RDMA) tier

2026-05-30 12:48:37 +08:00

exp_b_capacity_knee

v2 exp(b): GPU KV-capacity APC/latency knee + writeup

2026-05-30 11:23:31 +08:00

figs

v2 exp(a): add remote KV-store (RDMA) tier

2026-05-30 12:48:37 +08:00

.gitignore

v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss)

2026-05-30 11:23:04 +08:00

README.md

v2 exp(a): add remote KV-store (RDMA) tier

2026-05-30 12:48:37 +08:00

README.md

v2 — Evidence for the GPU-hit-first principle (§2.2)

Two experiments that turn "Hits on GPU > hits on CPU" + "GPU is enough to hold most of the valuable KV reuse" from assertion into measurement.

Hardware: dash0, 1× NVIDIA H20 (97 GB) per experiment, Qwen3-Coder-30B-A3B-Instruct, vLLM 0.18.1 (V1, prefix caching, enforce-eager). KV = 96 KiB/token (1 GiB = 10,923 tok).
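The 96 KiB/token figure can be sanity-checked from the model's attention shape. The sketch below assumes 48 layers, 4 KV heads, a head dimension of 128, and a 2-byte (bf16) KV cache; none of these values are stated in this README, so check them against the model's config before relying on the result.

```python
# Per-token KV-cache footprint for a grouped-query-attention transformer.
# ASSUMED (not stated in this README): 48 layers, 4 KV heads,
# head_dim 128, 2-byte KV dtype.
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
DTYPE_BYTES = 2

KIB = 1024
GIB = 1024 ** 3


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, dtype_bytes=DTYPE_BYTES):
    # The factor 2 covers the separate K and V tensors stored in every layer.
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def tokens_per_gib(bytes_per_token):
    return GIB / bytes_per_token


per_token = kv_bytes_per_token()
print(per_token / KIB, "KiB per token")
print(round(tokens_per_gib(per_token)), "tokens per GiB")
```

Under these assumptions the result matches the 96 KiB/token and 10,923 tok/GiB quoted above.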

Exp (a) — three-tier hit latency (`exp_a_tier_latency/`)

TTFT of serving a reused prefix of length L from each tier:

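miss: fresh unique prompt → full prefill (recompute)
GPU hit: re-request → HBM prefix cache
CPU hit: warm → evict to CPU offload tier (--kv-offloading-size) → re-request → DRAM fetch
remote hit: two kv_both MooncakeConnector instances (run_rdma.sh); instance B pulls instance A's cached prefix over RDMA (do_remote_prefill) instead of recomputing
PCIe floor: direct pinned-memory H2D transfer cost for the same KV size (backstop)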

Tier of each measured request is verified via vllm:prefix_cache_hits vs vllm:external_prefix_cache_hits deltas, not assumed.
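The delta check can be sketched as follows. This is a minimal illustration and not the repo's script: it takes two snapshots of Prometheus-format metrics text (before and after one request) and classifies the request by which counter moved. The helper names, the optional `_total` suffix handling, and the rule "any external hit means CPU tier" are assumptions.

```python
import re

# Counter names as written in this README. The exact exposition names
# (for example a "_total" suffix or label sets) are ASSUMED and may differ.
LOCAL = "vllm:prefix_cache_hits"
EXTERNAL = "vllm:external_prefix_cache_hits"


def read_counter(metrics_text, name):
    """Sum every sample of `name` found in Prometheus text-format output."""
    pattern = re.compile(
        r"^" + re.escape(name) + r"(?:_total)?(?:\{[^}]*\})?\s+(\S+)\s*$"
    )
    total = 0.0
    for line in metrics_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            total += float(match.group(1))
    return total


def classify_tier(before, after):
    """Return 'cpu', 'gpu', or 'miss' from the counter deltas of one request."""
    local = read_counter(after, LOCAL) - read_counter(before, LOCAL)
    external = read_counter(after, EXTERNAL) - read_counter(before, EXTERNAL)
    if external > 0:   # ASSUMPTION: any external hit counts as a CPU-tier hit
        return "cpu"
    if local > 0:
        return "gpu"
    return "miss"
```

A request is only reported under a tier when the matching counter actually advanced, which is what makes the tier label a measurement and not an assumption.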

Run: GPU=0 bash v2/exp_a_tier_latency/run.sh then .venv/bin/python v2/exp_a_tier_latency/plot.py.

Exp (b) — capacity → APC → latency knee (`exp_b_capacity_knee/`)

Replay a fixed agentic trace at several GPU KV pool sizes (--num-gpu-blocks-override); measure realized APC + TTFT p90 per capacity. The knee = the GPU capacity beyond which more HBM buys ~no extra reuse.
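Because --num-gpu-blocks-override takes a block count and not a size, each capacity point has to be converted. A rough sketch of that conversion, assuming a 16-token KV block and decimal gigabytes (neither is stated in this README; the function names are hypothetical):

```python
KV_BYTES_PER_TOKEN = 96 * 1024   # from the hardware line of this README
BLOCK_SIZE_TOKENS = 16           # ASSUMED KV block size in tokens
GB = 10 ** 9                     # ASSUMED: capacities are decimal GB


def blocks_for_capacity(capacity_gb, block_size=BLOCK_SIZE_TOKENS):
    """Largest whole number of KV blocks that fits in capacity_gb."""
    bytes_per_block = KV_BYTES_PER_TOKEN * block_size
    return int(capacity_gb * GB // bytes_per_block)


def capacity_gb_for_blocks(num_blocks, block_size=BLOCK_SIZE_TOKENS):
    """KV pool size in GB for a given block count."""
    return num_blocks * KV_BYTES_PER_TOKEN * block_size / GB


for cap in (1.2, 1.6, 2.4, 3.6, 4.8, 7.2, 9.7, 14.5):
    print(cap, "GB ->", blocks_for_capacity(cap), "blocks")
```

The capacities in the results table are rounded, so block counts derived this way are approximate.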

Run: GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh then .venv/bin/python v2/exp_b_capacity_knee/analyze_and_plot.py.

Results (dash0, 2026-05-30)

Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss (`figs/exp_a_tier_latency.png`)

TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier. Local CPU-tier hits were 100% verified via vllm:external_prefix_cache_hits; the remote KV-store tier is a real cross-instance Mooncake hit — instance B serves the request by pulling the cached prefix from instance A over RDMA (do_remote_prefill) instead of recomputing (the Mooncake-Store-blog mechanism), measured with microbench/fresh_setup/mb2_kv_transfer.py.

prefix L	miss (recompute)	remote RDMA store	CPU-tier (local)	GPU-tier (HBM)	miss/RDMA	RDMA/CPU	CPU/GPU
1k	0.078	0.061	0.057	0.042	1.3×	1.1×	1.4×
8k	0.588	0.151	0.076	0.053	3.9×	2.0×	1.5×
16k	1.547	0.262	0.105	0.063	5.9×	2.5×	1.7×
32k	4.604	0.680	0.158	0.080	6.8×	4.3×	2.0×
64k	15.230	0.966	0.272	0.111	15.8×	3.6×	2.4×

GPU hit is ~flat (42→111 ms over 1k→64k): a hit returns the whole prefix from HBM, only the last token is recomputed.
miss grows superlinearly (→15.2 s at 64k): a miss pays the full prefill.
local CPU hit grows transfer-bound (PCIe H2D measured ~54 GB/s); CPU-hit TTFT ≈ GPU-hit + KV/PCIe + a small residual overhead, roughly 0.01–0.04 s when computed from the table at 96 KiB/token (dashed PCIe floor sits just under it).
remote RDMA-store hit is the L3 tier the Mooncake-Store blog advocates. It is a big win over recompute (up to 16× lower TTFT here, consistent with the blog's 46× at higher hit rates), but it pays the NIC tax (~5–7 GB/s effective here, cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). At 64k it is 3.6× slower than a local CPU hit and ~9× slower than a GPU hit, and the gap to the CPU tier widens from 1.1× at 1k to 3.6–4.3× at 32k–64k.
Takeaway — the tier ordering is strict and widens with context: GPU < CPU-local < remote-RDMA-store ≪ miss. A global KV store helps (vs recompute), which is why that approach exists; but every step toward the GPU is another 1.4–4× of TTFT. The reuse that matters most is the GPU-resident kind.
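The transfer-bound reasoning in the bullets above can be checked directly against the table. The sketch below recomputes the KV size per prefix length, the PCIe floor, the residual CPU-hit overhead, and the end-to-end RDMA bandwidth. It assumes "1k" means 1024 tokens and that bandwidths are in decimal GB/s; the RDMA figure divides by the full TTFT, so it is a lower bound on wire bandwidth.

```python
KV_BYTES_PER_TOKEN = 96 * 1024   # from the hardware line of this README
PCIE_GBPS = 54.0                 # measured H2D bandwidth quoted above
TOKENS_PER_K = 1024              # ASSUMED: "1k" means 1024 tokens

# prefix length (k tokens) -> TTFT p50 in seconds: (miss, remote RDMA, CPU, GPU)
TTFT = {
    1: (0.078, 0.061, 0.057, 0.042),
    8: (0.588, 0.151, 0.076, 0.053),
    16: (1.547, 0.262, 0.105, 0.063),
    32: (4.604, 0.680, 0.158, 0.080),
    64: (15.230, 0.966, 0.272, 0.111),
}


def kv_gb(prefix_k):
    """KV size of the prefix in decimal GB."""
    return prefix_k * TOKENS_PER_K * KV_BYTES_PER_TOKEN / 1e9


def pcie_floor_s(prefix_k):
    """Time to move the prefix's KV over PCIe at the measured bandwidth."""
    return kv_gb(prefix_k) / PCIE_GBPS


def effective_gbps(prefix_k, seconds):
    """End-to-end bandwidth implied by serving the prefix in `seconds`."""
    return kv_gb(prefix_k) / seconds


for k, (miss, rdma, cpu, gpu) in TTFT.items():
    residual = cpu - gpu - pcie_floor_s(k)
    print(f"{k:>2}k  KV={kv_gb(k):5.2f} GB  PCIe floor={pcie_floor_s(k):.3f}s  "
          f"CPU residual={residual:.3f}s  "
          f"RDMA end-to-end={effective_gbps(k, rdma):.1f} GB/s  "
          f"miss/RDMA={miss / rdma:.1f}x")
```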

Exp (b) — APC and latency knee at small GPU capacity (`figs/exp_b_capacity_knee.png`)

Closed-loop replay (concurrency 4) of a controlled multi-turn workload (24 sessions × 6 turns, cumulative intra-session prefix, per-session working set 0.91 GB, intra-session APC ceiling 71%), sweeping GPU KV capacity.

GPU KV (GB)	realized APC	TTFT p50 (s)	TTFT p90 (s)	E2E p90 (s)	completion
1.2	7.4%	8.32	13.00	16.54	100%
1.6	12.2%	4.02	8.90	12.41	100%
2.4	36.3%	0.47	4.62	8.66	100%
3.6	80.3%	0.41	0.53	4.33	100%
4.8	72.9%	0.49	0.65	4.27	100%
7.2	72.9%	0.49	0.64	4.25	100%
9.7	72.9%	0.49	0.65	4.19	100%
14.5	72.9%	0.49	0.65	4.25	100%

Sharp knee at 3.6 GB = exactly the active working set (4 sessions × 0.91 GB). APC saturates at the ~71% ceiling; TTFT p90 collapses 13.0 s → 0.53 s at the same point. Beyond the knee, more HBM buys nothing (dead flat to 14.5 GB).
Below the knee, sessions evict each other between turns → cache misses → recompute → 13 s TTFT. The knee is where the working set becomes GPU-resident.
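One way to locate the knee programmatically is to take the TTFT p90 at the largest capacity as the plateau and find the smallest capacity from which every larger point stays near it. This is an illustrative sketch and not the logic of analyze_and_plot.py; the function name and the 25% tolerance are assumptions.

```python
# (GPU KV GB, realized APC %, TTFT p90 s) copied from the table above
SWEEP = [
    (1.2, 7.4, 13.00),
    (1.6, 12.2, 8.90),
    (2.4, 36.3, 4.62),
    (3.6, 80.3, 0.53),
    (4.8, 72.9, 0.65),
    (7.2, 72.9, 0.64),
    (9.7, 72.9, 0.65),
    (14.5, 72.9, 0.65),
]


def find_knee(sweep, rel_tol=0.25):
    """Smallest capacity from which TTFT p90 stays within rel_tol of the plateau.

    The plateau is the TTFT p90 at the largest capacity. rel_tol is a
    hypothetical tolerance chosen for illustration.
    """
    points = sorted(sweep)
    limit = points[-1][2] * (1.0 + rel_tol)
    knee = None
    # Walk down from the largest capacity until latency leaves the plateau.
    for capacity, _apc, ttft_p90 in reversed(points):
        if ttft_p90 > limit:
            break
        knee = capacity
    return knee


print(find_knee(SWEEP))
```

Scanning from the top down avoids mistaking a noisy low-capacity point for the knee, since the plateau is defined by the largest capacities first.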

Conclusion (for §2.2)

The KV-tier hierarchy is now measured, not asserted (Exp a): GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss. At 64k tokens a GPU hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA store hit, and 137× faster than recompute; the gaps grow with context length. A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16× here / 46× in the blog) — but it pays the NIC tax, so it sits a tier below local CPU and two below GPU. Each step toward the GPU is another 1.4–4× of TTFT.
You only need to hold the active working set on GPU. Realized APC and latency saturate once HBM covers the concurrent sessions' working set (3.6 GB here); past that, extra capacity — and the entire CPU/storage tier built to chase the long reuse tail — adds ~0 (Exp b). The knee scales linearly with concurrency, i.e. with cluster GPU count, which the production cluster already provides.
Together: maximize GPU residency of the active working set (colocation + affinity routing + dedup-migration); the CPU tier is a fallback, not the primary path.
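The scaling claim (knee ≈ concurrency × per-session working set) is simple enough to write down. A small sketch using the Exp (b) numbers; the function name is hypothetical and the concurrencies other than 4 are illustrative projections, not measurements.

```python
PER_SESSION_GB = 0.91   # per-session working set from Exp (b)


def predicted_knee_gb(concurrency, per_session_gb=PER_SESSION_GB):
    """GPU KV capacity needed to keep every concurrent session resident."""
    return concurrency * per_session_gb


# Concurrency 4 is the measured configuration; the others are projections.
for concurrency in (4, 8, 16):
    print(concurrency, "sessions ->", round(predicted_knee_gb(concurrency), 2), "GB")
```

At concurrency 4 this gives about 3.6 GB, which is where the measured knee sits.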

Caveats

Exp (b) uses a controlled multi-turn workload (the production trace is 90% single-turn with huge per-request contexts that thrash a single instance — see C1/f2c); it isolates the capacity→APC→latency mechanism. Knee position scales with concurrency × per-session working set.
Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback through the NIC; MB2 showed intra ≈ inter, NIC-bound). t_transfer includes the request + 1-token decode + dst scheduling, so effective BW (~5–7 GB/s) is below the raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the wire transfer. The connector's retention-verify (cached_followup) is 0 because kv_both do_remote_prefill does not reinsert the pulled prefix into dst's persistent prefix cache — it does not affect the measured pull latency.
The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling (transient full residency / generated-token reuse); steady state is 72.9%.
