Extends the hit-latency microbench to a 4th tier: a remote global-KV-store hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector instances (run_rdma.sh); for each prefix length, instance B serves the request by pulling instance A's cached prefix over RDMA (do_remote_prefill, via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the timed pull is the remote-hit latency. Result (TTFT p50, 11 reps): strict tier ordering GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA 15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win over recompute (the blog's 46x) but pays the NIC tax (~5-7 GB/s effective) and sits a tier below local CPU and two below GPU -- reinforcing GPU-hit-first. README + figure updated to four tiers. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
120 lines
7.1 KiB
Markdown
# v2 — Evidence for the GPU-hit-first principle (§2.2)

Two experiments that turn "**Hits on GPU > hits on CPU**" + "**GPU is enough to
hold most of the *valuable* KV reuse**" from assertion into measurement.

Hardware: dash0, 1× NVIDIA H20 (97 GB) per experiment, Qwen3-Coder-30B-A3B-Instruct,
vLLM 0.18.1 (V1, prefix caching, enforce-eager). KV = 96 KiB/token (1 GiB = 10,923 tok).

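The 96 KiB/token figure is consistent with the usual per-token KV formula. A minimal sketch, assuming the model shape below (48 layers, 4 KV heads, head dimension 128, 2-byte KV dtype); these values are not stated in this README and should be checked against the model's `config.json`:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # K and V each hold num_kv_heads * head_dim values per layer.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# ASSUMED model shape (not taken from this README).
per_token = kv_bytes_per_token(num_layers=48, num_kv_heads=4,
                               head_dim=128, dtype_bytes=2)
tokens_per_gib = round(2**30 / per_token)

print(per_token // 1024, "KiB/token")
print(tokens_per_gib, "tokens per GiB")
```

Under those assumptions the formula reproduces both numbers quoted in the hardware note.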
## Exp (a) — four-tier hit latency (`exp_a_tier_latency/`)

TTFT of serving a reused prefix of length L from each tier:

- **miss**: fresh unique prompt → full prefill (recompute)
- **GPU hit**: re-request → HBM prefix cache
- **CPU hit**: warm → evict to CPU offload tier (`--kv-offloading-size`) → re-request → DRAM fetch
- **remote-store hit**: a second instance pulls the prefix cached on the first over RDMA (`do_remote_prefill`; two kv_both MooncakeConnector instances launched by `run_rdma.sh`) instead of recomputing
- **PCIe floor**: direct pinned-memory H2D transfer cost for the same KV size (backstop)

The tier of each measured local request is *verified*, not assumed, via the deltas of
`vllm:prefix_cache_hits` vs `vllm:external_prefix_cache_hits`.

Run: `GPU=0 bash v2/exp_a_tier_latency/run.sh` then `.venv/bin/python v2/exp_a_tier_latency/plot.py`.

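A minimal sketch of the tier check, assuming the two counters are scraped from the server's Prometheus `/metrics` text before and after each request. The helper names, the handling of an optional `_total` suffix, and the classification rule are illustrative assumptions, not the code in `run.sh`:

```python
import re

LOCAL = "vllm:prefix_cache_hits"
EXTERNAL = "vllm:external_prefix_cache_hits"


def read_counter(metrics_text: str, name: str) -> float:
    """Sum all samples of `name` (with or without `_total`) in Prometheus text."""
    pattern = re.compile(
        rf"^{re.escape(name)}(?:_total)?(?:\{{[^}}]*\}})?\s+(\S+)",
        re.MULTILINE,
    )
    return sum(float(value) for value in pattern.findall(metrics_text))


def classify_tier(before: str, after: str) -> str:
    """Label one request by which hit counter moved (assumed rule)."""
    local = read_counter(after, LOCAL) - read_counter(before, LOCAL)
    external = read_counter(after, EXTERNAL) - read_counter(before, EXTERNAL)
    if external > 0:
        return "cpu"   # served from the offload tier
    if local > 0:
        return "gpu"   # served from the HBM prefix cache
    return "miss"      # neither counter moved: full prefill


# Hypothetical scrapes around a single request.
before = (
    '# TYPE vllm:prefix_cache_hits counter\n'
    'vllm:prefix_cache_hits_total{model_name="m"} 100.0\n'
    'vllm:external_prefix_cache_hits_total{model_name="m"} 0.0\n'
)
after_cpu = (
    'vllm:prefix_cache_hits_total{model_name="m"} 100.0\n'
    'vllm:external_prefix_cache_hits_total{model_name="m"} 8192.0\n'
)
print(classify_tier(before, after_cpu))
```

Comparing deltas rather than absolute values keeps the check valid when the server has already handled warm-up traffic.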
## Exp (b) — capacity → APC → latency knee (`exp_b_capacity_knee/`)

Replay a fixed agentic trace at several GPU KV pool sizes
(`--num-gpu-blocks-override`); measure realized APC + TTFT p90 per capacity.
The knee = the GPU capacity beyond which more HBM buys ~no extra reuse.

Run: `GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh` then
`.venv/bin/python v2/exp_b_capacity_knee/analyze_and_plot.py`.

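One plausible way to translate a target KV pool size into a value for `--num-gpu-blocks-override` is sketched below. It assumes a block size of 16 tokens and decimal GB; this README states neither, so treat both as placeholders rather than what `run_sweep.sh` does:

```python
KV_BYTES_PER_TOKEN = 96 * 1024  # 96 KiB/token, from the hardware note


def blocks_for_capacity(capacity_bytes: float, block_tokens: int = 16) -> int:
    """Number of whole KV blocks that fit in `capacity_bytes`.

    block_tokens=16 is an ASSUMED block size, not taken from this README.
    """
    bytes_per_block = block_tokens * KV_BYTES_PER_TOKEN
    return int(capacity_bytes // bytes_per_block)


for gb in (1.2, 3.6, 14.5):
    print(f"{gb:>5} GB -> {blocks_for_capacity(gb * 1e9)} blocks")
```

Rounding down keeps the pool at or below the intended capacity, which matters near the knee where a few extra blocks can change the result.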
## Results (dash0, 2026-05-30)

### Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss (`figs/exp_a_tier_latency.png`)

TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier.
Local CPU-tier hits were 100% verified via `vllm:external_prefix_cache_hits`;
the **remote KV-store** tier is a real cross-instance Mooncake hit — instance B
serves the request by **pulling the cached prefix from instance A over RDMA**
(`do_remote_prefill`) instead of recomputing (the Mooncake-Store-blog mechanism),
measured with `microbench/fresh_setup/mb2_kv_transfer.py`.

| prefix L | miss (recompute) | **remote RDMA store** | CPU-tier (local) | GPU-tier (HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 1k | 0.078 | 0.061 | 0.057 | 0.042 | 1.3× | 1.1× | 1.4× |
| 8k | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.5× |
| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
| **64k** | **15.230** | **0.966** | **0.272** | **0.111** | **15.8×** | **3.6×** | **2.4×** |

- **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit returns the whole prefix from
  HBM, so only the last token is recomputed.
- **miss grows superlinearly** (→15.2 s at 64k): a miss pays the full prefill.
- **local CPU hit is transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit
  TTFT ≈ GPU-hit + KV/PCIe + a small residual overhead (roughly 0.01–0.04 s at the
  table's values; the dashed PCIe floor sits just under the CPU-hit curve).
- **remote RDMA-store hit** is the L3 tier the Mooncake-Store blog advocates: it is
  a big win over recompute (**up to 16× lower TTFT**, consistent with the blog's
  46× at higher hit rates), but it pays the **NIC tax** (~5–7 GB/s effective here,
  cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). So it
  is **3.6× slower than a local CPU hit and ~9× slower than a GPU hit** at 64k, and
  the gap to the GPU tier **grows with context length**.
- **Takeaway: the tier ordering is strict and widens with context:**
  **GPU < CPU-local < remote-RDMA-store ≪ miss.** A global KV store helps (vs
  recompute), which is why that approach exists; but each step *toward* the GPU
  cuts TTFT further (1.1–4.3× between adjacent cache tiers across the lengths
  tested). The reuse that matters most is the GPU-resident kind.

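The transfer-bound reading of the CPU and remote tiers can be checked with back-of-envelope arithmetic on the table. This is a sketch under stated assumptions: "1k" is taken as 1024 tokens, GB is decimal, and the effective bandwidth divides the KV size by the whole TTFT rather than by an isolated transfer time. The function names are illustrative:

```python
KV_BYTES_PER_TOKEN = 96 * 1024   # 96 KiB/token, from the hardware note
PCIE_H2D_GBPS = 54.0             # measured PCIe H2D bandwidth (GB/s)

# TTFT p50 (s) per prefix length in k-tokens: (miss, remote RDMA, CPU, GPU)
TTFT = {
    1: (0.078, 0.061, 0.057, 0.042),
    8: (0.588, 0.151, 0.076, 0.053),
    16: (1.547, 0.262, 0.105, 0.063),
    32: (4.604, 0.680, 0.158, 0.080),
    64: (15.230, 0.966, 0.272, 0.111),
}


def kv_gb(prefix_k: int) -> float:
    """KV size in decimal GB, ASSUMING 1k = 1024 tokens."""
    return prefix_k * 1024 * KV_BYTES_PER_TOKEN / 1e9


def pcie_floor_s(prefix_k: int) -> float:
    """Pure H2D copy time for the prefix's KV at the measured PCIe rate."""
    return kv_gb(prefix_k) / PCIE_H2D_GBPS


def effective_gbps(prefix_k: int, ttft_s: float) -> float:
    """End-to-end bandwidth implied by serving the prefix in `ttft_s`."""
    return kv_gb(prefix_k) / ttft_s


for k, (miss, rdma, cpu, gpu) in TTFT.items():
    print(f"{k:>3}k  KV={kv_gb(k):5.2f} GB  "
          f"PCIe floor={pcie_floor_s(k) * 1e3:6.1f} ms  "
          f"RDMA effective={effective_gbps(k, rdma):4.1f} GB/s  "
          f"miss/RDMA={miss / rdma:4.1f}x  CPU/GPU={cpu / gpu:3.1f}x")
```

Because the whole TTFT is used as the denominator, request handling and scheduling are folded into the bandwidth figure, so it understates the wire rate, most visibly at short prefixes.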
### Exp (b) — APC and latency knee at small GPU capacity (`figs/exp_b_capacity_knee.png`)

Closed-loop replay (concurrency 4) of a controlled multi-turn workload (24 sessions
× 6 turns, cumulative intra-session prefix, per-session working set **0.91 GB**,
intra-session APC ceiling 71%), sweeping GPU KV capacity.

| GPU KV (GB) | realized APC | TTFT p50 (s) | TTFT p90 (s) | E2E p90 (s) | completion |
|---:|---:|---:|---:|---:|---:|
| 1.2 | 7.4% | 8.32 | 13.00 | 16.54 | 100% |
| 1.6 | 12.2% | 4.02 | 8.90 | 12.41 | 100% |
| 2.4 | 36.3% | 0.47 | 4.62 | 8.66 | 100% |
| **3.6** | **80.3%** | **0.41** | **0.53** | **4.33** | 100% |
| 4.8 | 72.9% | 0.49 | 0.65 | 4.27 | 100% |
| 7.2 | 72.9% | 0.49 | 0.64 | 4.25 | 100% |
| 9.7 | 72.9% | 0.49 | 0.65 | 4.19 | 100% |
| 14.5 | 72.9% | 0.49 | 0.65 | 4.25 | 100% |

- **Sharp knee at 3.6 GB**, matching the active working set (4 concurrent sessions ×
  0.91 GB ≈ 3.64 GB). APC saturates near the ~71% ceiling (72.9% in steady state), and
  **TTFT p90 collapses from 13.0 s at 1.2 GB to 0.53 s** at the knee. Beyond the
  knee, **more HBM buys nothing** (dead flat to 14.5 GB).
- Below the knee, sessions evict each other between turns → cache misses →
  recompute → TTFT p90 of up to 13 s. The knee is where the working set becomes
  GPU-resident.

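A minimal sketch of locating the knee from the sweep, using the table's values. The rule used here (smallest capacity whose TTFT p90 is within 25% of the best p90 in the sweep) and the function names are assumptions for illustration, not necessarily what `analyze_and_plot.py` does:

```python
# (GPU KV GB, realized APC %, TTFT p90 s), copied from the table
SWEEP = [
    (1.2, 7.4, 13.00), (1.6, 12.2, 8.90), (2.4, 36.3, 4.62),
    (3.6, 80.3, 0.53), (4.8, 72.9, 0.65), (7.2, 72.9, 0.64),
    (9.7, 72.9, 0.65), (14.5, 72.9, 0.65),
]


def find_knee(sweep, tolerance: float = 1.25) -> float:
    """Smallest capacity whose p90 TTFT is within `tolerance` x the best p90."""
    best = min(p90 for _, _, p90 in sweep)
    for capacity, _, p90 in sorted(sweep):
        if p90 <= tolerance * best:
            return capacity
    raise ValueError("empty sweep")


def predicted_knee_gb(concurrency: int, working_set_gb: float) -> float:
    """Active working set: concurrent sessions times per-session KV footprint."""
    return concurrency * working_set_gb


print("measured knee :", find_knee(SWEEP), "GB")
print("predicted knee:", round(predicted_knee_gb(4, 0.91), 2), "GB")
```

A latency-based rule is used instead of an APC threshold because the APC value at the knee (80.3%) overshoots the steady-state plateau, which would confuse a simple "first point at the plateau" test.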
## Conclusion (for §2.2)

1. **The KV-tier hierarchy is now measured, not asserted** (Exp a):
   `GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss`. At 64k tokens a GPU
   hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA
   store hit, and 137× faster than recompute; the gaps **grow with context length**.
   A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16×
   here / 46× in the blog), but it pays the NIC tax, so it sits a tier *below* local
   CPU and two below GPU. Each step toward the GPU cuts TTFT further (1.1–4.3×
   between adjacent cache tiers across the lengths tested).
2. **You only need to hold the *active working set* on GPU.** Realized APC and
   latency saturate once HBM covers the concurrent sessions' working set (3.6 GB
   here); past that, extra capacity, and the entire CPU/storage tier built to chase
   the long reuse tail, adds ~0 (Exp b). The knee scales linearly with concurrency,
   i.e. with **cluster GPU count**, which the production cluster already provides.
3. Together: maximize GPU residency of the active working set (colocation + affinity
   routing + dedup-migration); the CPU tier is a fallback, not the primary path.

## Caveats

- Exp (b) uses a controlled multi-turn workload (the production trace is 90%
  single-turn with huge per-request contexts that thrash a single instance — see
  C1/f2c); it isolates the capacity→APC→latency mechanism. Knee *position* scales
  with concurrency × per-session working set.
- Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
- Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback
  through the NIC; MB2 showed intra ≈ inter, NIC-bound). `t_transfer` includes the
  request + 1-token decode + dst scheduling, so effective BW (~5–7 GB/s) is below the
  raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the
  wire transfer. The connector's retention-verify (`cached_followup`) is 0 because
  kv_both `do_remote_prefill` does not reinsert the pulled prefix into dst's
  persistent prefix cache — it does not affect the measured pull latency.
- The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling
  (transient full residency / generated-token reuse); steady state is 72.9%.