agentic-kvc/v2/README.md
Gahow Wang dc8e6dd5a8 v2 exp(a): add remote KV-store (RDMA) tier
Extends the hit-latency microbench to a 4th tier: a remote global-KV-store
hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector
instances (run_rdma.sh); for each prefix length, instance B serves the
request by pulling instance A's cached prefix over RDMA (do_remote_prefill,
via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the
timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering
GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with
context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA
15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win
over recompute (the blog's 46x) but pays the NIC tax (~5-7 GB/s effective)
and sits a tier below local CPU and two below GPU -- reinforcing
GPU-hit-first. README + figure updated to four tiers.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 12:48:37 +08:00

# v2 — Evidence for the GPU-hit-first principle (§2.2)
Two experiments that turn "**Hits on GPU > hits on CPU**" + "**GPU is enough to
hold most of the *valuable* KV reuse**" from assertion into measurement.
Hardware: dash0, 1× NVIDIA H20 (97 GB) per experiment, Qwen3-Coder-30B-A3B-Instruct,
vLLM 0.18.1 (V1, prefix caching, enforce-eager). KV = 96 KiB/token (1 GiB = 10,923 tok).
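
The 96 KiB/token figure can be reproduced from the model's attention shape. The sketch below is a minimal check, assuming 48 layers, 4 KV heads, head dimension 128 and a 2-byte (bf16) KV cache. These values are assumptions taken from the public model config and are not stated in this README.

```python
# Assumed attention shape for Qwen3-Coder-30B-A3B-Instruct (not stated in this README).
NUM_LAYERS = 48
NUM_KV_HEADS = 4
HEAD_DIM = 128
DTYPE_BYTES = 2  # bf16 KV cache (assumption)


def kv_bytes_per_token() -> int:
    # Factor of 2: one key vector and one value vector per layer and KV head.
    return 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM * DTYPE_BYTES


def tokens_per_gib() -> int:
    # How many tokens of KV cache fit in 1 GiB, rounded to the nearest token.
    return round(2**30 / kv_bytes_per_token())


print(kv_bytes_per_token() / 1024, "KiB per token")
print(tokens_per_gib(), "tokens per GiB")
```

Under these assumptions the result matches the 96 KiB/token and 10,923 tok/GiB quoted above.
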
## Exp (a) — four-tier hit latency (`exp_a_tier_latency/`)
TTFT of serving a reused prefix of length L from each tier:
- **miss**: fresh unique prompt → full prefill (recompute)
- **GPU hit**: re-request → HBM prefix cache
- **CPU hit**: warm → evict to CPU offload tier (`--kv-offloading-size`) → re-request → DRAM fetch
- **remote-store hit**: a second instance pulls the first instance's cached prefix over RDMA (`do_remote_prefill`, two kv_both MooncakeConnector instances via `run_rdma.sh`) instead of recomputing
- **PCIe floor**: direct pinned-memory H2D transfer cost for the same KV size (backstop)
Tier of each measured request is *verified* via `vllm:prefix_cache_hits` vs
`vllm:external_prefix_cache_hits` deltas, not assumed.
Run: `GPU=0 bash v2/exp_a_tier_latency/run.sh` then `.venv/bin/python v2/exp_a_tier_latency/plot.py`.
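
The tier verification described above amounts to diffing two counters around each request. The sketch below shows one way to do that on Prometheus text scraped from the server's metrics endpoint. The helper names `read_counter` and `classify_tier` are hypothetical, the optional `_total` suffix handling is an assumption about the exposition format, and the real `run.sh` may do this differently.

```python
import re

LOCAL = "vllm:prefix_cache_hits"
EXTERNAL = "vllm:external_prefix_cache_hits"


def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of one counter in Prometheus text format."""
    # name, optional `_total` suffix, optional {labels}, whitespace, value.
    pattern = re.compile(re.escape(name) + r"(?:_total)?(?:\{[^}]*\})?\s+(\S+)$")
    total = 0.0
    for line in metrics_text.splitlines():
        match = pattern.match(line.strip())
        if match:
            total += float(match.group(1))
    return total


def classify_tier(before: str, after: str) -> str:
    """Label one request from the counter deltas between two scrapes."""
    d_local = read_counter(after, LOCAL) - read_counter(before, LOCAL)
    d_external = read_counter(after, EXTERNAL) - read_counter(before, EXTERNAL)
    if d_external > 0:
        return "cpu"   # served from the external (offload) tier
    if d_local > 0:
        return "gpu"   # served from the HBM prefix cache
    return "miss"      # neither counter moved: full prefill
```

This simple rule assumes one request in flight between scrapes. A request that hits both tiers would be labelled `cpu` here, so a real harness may want to compare the deltas against the prompt length instead.
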
## Exp (b) — capacity → APC → latency knee (`exp_b_capacity_knee/`)
Replay a fixed agentic trace at several GPU KV pool sizes
(`--num-gpu-blocks-override`); measure the realized APC (automatic prefix caching)
hit rate and TTFT p90 at each capacity.
The knee is the GPU capacity beyond which more HBM buys almost no extra reuse.
Run: `GPU=1 bash v2/exp_b_capacity_knee/run_sweep.sh` then
`.venv/bin/python v2/exp_b_capacity_knee/analyze_and_plot.py`.
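
To relate a target KV pool size to a value for `--num-gpu-blocks-override`, the arithmetic is capacity divided by bytes per block. The sketch below assumes a block size of 16 tokens and capacities expressed in GiB; both are assumptions, and the actual values used by `run_sweep.sh` are not shown in this README.

```python
KV_BYTES_PER_TOKEN = 96 * 1024  # 96 KiB/token, from the hardware note above
BLOCK_SIZE_TOKENS = 16          # assumed block size; check the launch config


def blocks_for_capacity(capacity_gib: float, block_size: int = BLOCK_SIZE_TOKENS) -> int:
    """Number of whole KV blocks that fit in `capacity_gib` GiB."""
    bytes_per_block = block_size * KV_BYTES_PER_TOKEN
    return int(capacity_gib * 2**30 // bytes_per_block)


for capacity in (1.5, 3.0, 6.0):
    print(capacity, "GiB ->", blocks_for_capacity(capacity), "blocks")
```

With these assumptions one block holds 1.5 MiB of KV, so 1.5 GiB corresponds to 1,024 blocks.
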
## Results (dash0, 2026-05-30)
### Exp (a) — GPU hit > CPU hit > remote-store(RDMA) hit ≫ miss (`figs/exp_a_tier_latency.png`)
TTFT (s, p50 over reps) to serve a reused prefix of length L from each KV tier.
Local CPU-tier hits were 100% verified via `vllm:external_prefix_cache_hits`;
the **remote KV-store** tier is a real cross-instance Mooncake hit — instance B
serves the request by **pulling the cached prefix from instance A over RDMA**
(`do_remote_prefill`) instead of recomputing (the Mooncake-Store-blog mechanism),
measured with `microbench/fresh_setup/mb2_kv_transfer.py`.
| prefix L | miss (recompute) | **remote RDMA store** | CPU-tier (local) | GPU-tier (HBM) | miss/RDMA | RDMA/CPU | CPU/GPU |
|---:|---:|---:|---:|---:|---:|---:|---:|
| 1k | 0.078 | 0.061 | 0.057 | 0.042 | 1.3× | 1.1× | 1.4× |
| 8k | 0.588 | 0.151 | 0.076 | 0.053 | 3.9× | 2.0× | 1.5× |
| 16k | 1.547 | 0.262 | 0.105 | 0.063 | 5.9× | 2.5× | 1.7× |
| 32k | 4.604 | 0.680 | 0.158 | 0.080 | 6.8× | 4.3× | 2.0× |
| **64k** | **15.230** | **0.966** | **0.272** | **0.111** | **15.8×** | **3.6×** | **2.4×** |
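
The ratio columns and the strict ordering can be rechecked directly from the TTFT columns. The sketch below copies the table values by hand and uses hypothetical helper names.

```python
# TTFT p50 in seconds, copied from the table above: (miss, rdma, cpu, gpu).
TTFT = {
    "1k": (0.078, 0.061, 0.057, 0.042),
    "8k": (0.588, 0.151, 0.076, 0.053),
    "16k": (1.547, 0.262, 0.105, 0.063),
    "32k": (4.604, 0.680, 0.158, 0.080),
    "64k": (15.230, 0.966, 0.272, 0.111),
}


def ratios(miss, rdma, cpu, gpu):
    """Return (miss/RDMA, RDMA/CPU, CPU/GPU) for one row."""
    return miss / rdma, rdma / cpu, cpu / gpu


def strictly_ordered(miss, rdma, cpu, gpu):
    """True when GPU < CPU < RDMA < miss holds for one row."""
    return gpu < cpu < rdma < miss


for length, row in TTFT.items():
    formatted = [f"{r:.1f}x" for r in ratios(*row)]
    print(length, formatted, "ordered" if strictly_ordered(*row) else "NOT ordered")
```

Because the inputs are already rounded to milliseconds, a recomputed ratio can differ from the table in the last digit.
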
- **GPU hit is ~flat** (42→111 ms over 1k→64k): a hit returns the whole prefix from
HBM, only the last token is recomputed.
- **miss grows superlinearly** (→15.2 s at 64k): a miss pays the full prefill.
- **local CPU hit grows transfer-bound** (PCIe H2D measured **~54 GB/s**); CPU-hit
TTFT ≈ GPU-hit + KV/PCIe + ~0.15 s overhead (dashed PCIe floor sits just under it).
- **remote RDMA-store hit** is the L3 tier the Mooncake-Store blog advocates: it is
a big win over recompute (**up to 16× lower TTFT**, consistent with the blog's
46× at higher hit rates), but it pays the **NIC tax** (~5–7 GB/s effective here,
cf. ~9.7 GB/s raw Mooncake RDMA in MB2; multi-NIC pooling would raise it). So it
is **3.6× slower than a local CPU hit and ~9× slower than a GPU hit** at 64k, and
the gap **grows with context length**.
- **Takeaway:** the tier ordering is strict and widens with context:
**GPU < CPU-local < remote-RDMA-store ≪ miss.** A global KV store helps (vs
recompute), which is why that approach exists; but every step *toward* the GPU
cuts TTFT by roughly another 1.4–4×. The reuse that matters most is the GPU-resident kind.
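
The bandwidth figures in the bullets above follow from the KV size per prefix. The sketch below is a back-of-envelope check, assuming "64k" means 65,536 tokens and dividing the KV size by the whole TTFT, which overstates transfer time because TTFT also includes request handling and scheduling.

```python
KV_BYTES_PER_TOKEN = 96 * 1024  # 96 KiB/token, from the hardware note
PCIE_H2D_BYTES_PER_S = 54e9     # measured PCIe H2D bandwidth quoted above


def kv_bytes(tokens: int) -> int:
    """KV-cache size in bytes for a prefix of `tokens` tokens."""
    return tokens * KV_BYTES_PER_TOKEN


def pcie_floor_s(tokens: int) -> float:
    """Lower bound on the time to copy the prefix's KV over PCIe."""
    return kv_bytes(tokens) / PCIE_H2D_BYTES_PER_S


def effective_gb_per_s(tokens: int, ttft_s: float) -> float:
    """End-to-end bandwidth implied by serving the prefix in `ttft_s` seconds."""
    return kv_bytes(tokens) / ttft_s / 1e9


tokens_64k = 64 * 1024  # assumption: "64k" = 65,536 tokens
print(f"KV size:        {kv_bytes(tokens_64k) / 2**30:.1f} GiB")
print(f"PCIe floor:     {pcie_floor_s(tokens_64k) * 1e3:.0f} ms")
print(f"RDMA effective: {effective_gb_per_s(tokens_64k, 0.966):.1f} GB/s")
```

At 64k this gives about 6 GiB of KV, a PCIe floor of roughly 0.12 s, and an end-to-end remote-store bandwidth of about 6.7 GB/s.
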
### Exp (b) — APC and latency knee at small GPU capacity (`figs/exp_b_capacity_knee.png`)
Closed-loop replay (concurrency 4) of a controlled multi-turn workload (24 sessions
× 6 turns, cumulative intra-session prefix, per-session working set **0.91 GB**,
intra-session APC ceiling 71%), sweeping GPU KV capacity.
| GPU KV (GB) | realized APC | TTFT p50 (s) | TTFT p90 (s) | E2E p90 (s) | completion |
|---:|---:|---:|---:|---:|---:|
| 1.2 | 7.4% | 8.32 | 13.00 | 16.54 | 100% |
| 1.6 | 12.2% | 4.02 | 8.90 | 12.41 | 100% |
| 2.4 | 36.3% | 0.47 | 4.62 | 8.66 | 100% |
| **3.6** | **80.3%** | **0.41** | **0.53** | **4.33** | 100% |
| 4.8 | 72.9% | 0.49 | 0.65 | 4.27 | 100% |
| 7.2 | 72.9% | 0.49 | 0.64 | 4.25 | 100% |
| 9.7 | 72.9% | 0.49 | 0.65 | 4.19 | 100% |
| 14.5| 72.9% | 0.49 | 0.65 | 4.25 | 100% |
- **Sharp knee at 3.6 GB** = exactly the active working set (4 sessions × 0.91 GB).
APC saturates at the ~71% ceiling; **TTFT p90 collapses 13.0 s → 0.53 s** at the
same point. Beyond the knee, **more HBM buys nothing** (dead flat to 14.5 GB).
- Below the knee, sessions evict each other between turns → cache misses →
recompute → 13 s TTFT. The knee is where the working set becomes GPU-resident.
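
The knee can also be located programmatically from the sweep. The sketch below uses one simple rule: the smallest capacity whose TTFT p90 is within a tolerance factor of the latency at the largest capacity. The 1.5× tolerance and the helper names are illustrative assumptions, and `analyze_and_plot.py` may use a different criterion.

```python
# (GPU KV capacity in GB, TTFT p90 in s), copied from the table above.
SWEEP = [
    (1.2, 13.00), (1.6, 8.90), (2.4, 4.62), (3.6, 0.53),
    (4.8, 0.65), (7.2, 0.64), (9.7, 0.65), (14.5, 0.65),
]


def find_knee(sweep, tolerance=1.5):
    """Smallest capacity whose TTFT p90 is within `tolerance`x of the plateau."""
    points = sorted(sweep)
    plateau = points[-1][1]  # latency at the largest capacity
    for capacity, p90 in points:
        if p90 <= tolerance * plateau:
            return capacity
    return points[-1][0]


def predicted_knee(concurrency: int, working_set_gb: float) -> float:
    """Active working set: concurrent sessions times per-session working set."""
    return concurrency * working_set_gb


print("measured knee: ", find_knee(SWEEP), "GB")
print("predicted knee:", round(predicted_knee(4, 0.91), 2), "GB")
```

On this sweep the rule returns 3.6 GB, next to the 3.64 GB predicted from 4 concurrent sessions × 0.91 GB. The rule assumes latency stays flat past the first point that meets the tolerance, which holds for the data above.
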
## Conclusion (for §2.2)
1. **The KV-tier hierarchy is now measured, not asserted** (Exp a):
`GPU(HBM) < CPU(local DRAM) < remote KV-store(RDMA) ≪ miss`. At 64k tokens a GPU
hit (0.11 s) is 2.4× faster than a local CPU hit, ~9× faster than a remote RDMA
store hit, and 137× faster than recompute; the gaps **grow with context length**.
A global RDMA store (Mooncake-Store blog) is a real win over recompute (up to 16×
here / 46× in the blog), but it pays the NIC tax, so it sits a tier *below* local
CPU and two below GPU. Each step toward the GPU cuts TTFT by roughly another 1.4–4×.
2. **You only need to hold the *active working set* on GPU.** Realized APC and
latency saturate once HBM covers the concurrent sessions' working set (3.6 GB
here); past that, extra capacity — and the entire CPU/storage tier built to chase
the long reuse tail — adds ~0 (Exp b). The knee scales linearly with concurrency,
i.e. with **cluster GPU count**, which the production cluster already provides.
3. Together: maximize GPU residency of the active working set (colocation + affinity
routing + dedup-migration); the CPU tier is a fallback, not the primary path.
## Caveats
- Exp (b) uses a controlled multi-turn workload (the production trace is 90%
single-turn with huge per-request contexts that thrash a single instance — see
C1/f2c); it isolates the capacity→APC→latency mechanism. Knee *position* scales
with concurrency × per-session working set.
- Single H20; PCIe H2D ~54 GB/s is intra-node (cf. 9.7 GB/s Mooncake inter-node RDMA).
- Remote-RDMA tier is a single-node 2-instance Mooncake measurement (RDMA loopback
through the NIC; MB2 showed intra ≈ inter, NIC-bound). `t_transfer` includes the
request + 1-token decode + dst scheduling, so effective BW (~5–7 GB/s) is below the
raw ~9.7 GB/s; this is the realistic end-to-end remote-hit latency, not just the
wire transfer. The connector's retention-verify (`cached_followup`) is 0 because
kv_both `do_remote_prefill` does not reinsert the pulled prefix into dst's
persistent prefix cache; this does not affect the measured pull latency.
- The 80.3% point at the knee slightly exceeds the 71% intra-session ceiling
(transient full residency / generated-token reuse); steady state is 72.9%.