agentic-kvc

gahow/agentic-kvc

Commit Graph

Author	SHA1	Message	Date

Gahow Wang

dc8e6dd5a8

v2 exp(a): add remote KV-store (RDMA) tier

Extends the hit-latency microbench to a 4th tier: a remote global-KV-store
hit over RDMA, the Mooncake-Store mechanism. Two kv_both MooncakeConnector
instances (run_rdma.sh); for each prefix length, instance B serves the
request by pulling instance A's cached prefix over RDMA (do_remote_prefill,
via microbench/fresh_setup/mb2_kv_transfer.py) instead of recomputing -- the
timed pull is the remote-hit latency.

Result (TTFT p50, 11 reps): strict tier ordering
GPU(HBM) < CPU(local DRAM) < remote-RDMA-store << miss, gaps growing with
context. At 64k: GPU 0.11s, CPU 0.27s, RDMA 0.97s, miss 15.2s -> miss/RDMA
15.8x, RDMA/CPU 3.6x, CPU/GPU 2.4x. So a global RDMA store is a real win
over recompute (the blog's 46x) but pays the NIC tax (~5-7 GB/s effective)
and sits a tier below local CPU and two below GPU -- reinforcing
GPU-hit-first. README + figure updated to four tiers.
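The tier ordering and the adjacent-tier ratios at 64k can be re-derived from the four quoted latencies. A minimal sketch, assuming only the rounded p50 values in this message; because the inputs are rounded, the last digit may differ from the ratios quoted above, which were presumably computed from the raw measurements. The names `ttft_64k`, `tiers`, and `ratio` are illustrative and not taken from the repository.

```python
# TTFT p50 at a 64k-token prefix, in seconds, as quoted in the commit message.
ttft_64k = {"gpu": 0.11, "cpu": 0.27, "rdma": 0.97, "miss": 15.2}

# Fastest to slowest, as claimed: GPU < CPU < remote RDMA store << miss.
tiers = ["gpu", "cpu", "rdma", "miss"]


def ratio(slow: str, fast: str) -> float:
    """Latency of the slower tier divided by latency of the faster tier."""
    return ttft_64k[slow] / ttft_64k[fast]


# Strict ordering: every tier is faster than the next one down.
strictly_ordered = all(
    ttft_64k[a] < ttft_64k[b] for a, b in zip(tiers, tiers[1:])
)

if __name__ == "__main__":
    print("strict ordering holds:", strictly_ordered)
    for fast, slow in zip(tiers, tiers[1:]):
        print(f"{slow}/{fast}: {ratio(slow, fast):.1f}x")
```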

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 12:48:37 +08:00

Gahow Wang

ad754cfe0b

v2 exp(b): GPU KV-capacity APC/latency knee + writeup

Sweeps GPU KV-cache capacity (--num-gpu-blocks-override) under a closed-loop
replay (concurrency 4) of a controlled multi-turn workload (cumulative
intra-session prefix, gen_synth_trace.py), measuring realized APC
(prefix_cache hits/queries delta) and latency per capacity.
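The "hits/queries delta" measurement can be sketched as follows: scrape /metrics before and after a run, difference the two counters, and divide. This is an assumption-laden illustration, not the repository's script. The default counter names `vllm:prefix_cache_hits_total` and `vllm:prefix_cache_queries_total` are assumed and should be checked against the /metrics output of the vLLM version in use; the sample scrape text is made up for the example.

```python
import re

# Matches "name{labels} value" or "name value" in Prometheus text format.
# Simplification: assumes label values contain no "}" character.
_SAMPLE = re.compile(r"^([^\s{]+)(\{[^}]*\})?\s+(\S+)")


def read_counter(metrics_text: str, name: str) -> float:
    """Sum every sample of counter `name` across all label sets."""
    total = 0.0
    for line in metrics_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comments
        m = _SAMPLE.match(line)
        if m and m.group(1) == name:
            total += float(m.group(3))
    return total


def realized_apc(
    before: str,
    after: str,
    hits: str = "vllm:prefix_cache_hits_total",        # assumed name
    queries: str = "vllm:prefix_cache_queries_total",  # assumed name
) -> float:
    """Fraction of queried tokens served from the prefix cache in the window."""
    d_hits = read_counter(after, hits) - read_counter(before, hits)
    d_queries = read_counter(after, queries) - read_counter(before, queries)
    return d_hits / d_queries if d_queries > 0 else 0.0


# Made-up scrapes, for illustration only.
before = (
    '# TYPE vllm:prefix_cache_queries_total counter\n'
    'vllm:prefix_cache_queries_total{model_name="m"} 1000.0\n'
    'vllm:prefix_cache_hits_total{model_name="m"} 200.0\n'
)
after = (
    'vllm:prefix_cache_queries_total{model_name="m"} 5000.0\n'
    'vllm:prefix_cache_hits_total{model_name="m"} 3000.0\n'
)
apc = realized_apc(before, after)
```

Differencing the counters rather than reading their absolute values keeps warm-up traffic and earlier capacity points out of each measurement.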

Result: a sharp knee at 3.6 GB = exactly the active working set
(4 sessions x 0.91 GB). APC rises 7->12->36->80% then saturates at the
~71% intra-session ceiling; TTFT p90 collapses 13.0 s -> 0.53 s at the same
point; dead flat to 14.5 GB, 100% completion throughout. So only the active
working set needs HBM; capacity beyond it -- and the CPU/storage tier built
to chase the reuse tail -- buys ~0. The knee scales linearly with
concurrency, i.e. with cluster GPU count.
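The knee location is simple arithmetic: concurrent sessions times per-session KV footprint. A small sketch under the linear-scaling claim above, using the two figures from this message (4 sessions, 0.91 GB each). The function name and the extrapolated concurrency levels are illustrative only; they were not measured.

```python
def working_set_gb(concurrent_sessions: int, per_session_kv_gb: float) -> float:
    """Active KV working set: one live prefix per concurrent session."""
    return concurrent_sessions * per_session_kv_gb


# Figures from the commit message: concurrency 4, 0.91 GB per session.
knee_gb = working_set_gb(4, 0.91)

if __name__ == "__main__":
    print(f"predicted knee at concurrency 4: {knee_gb:.2f} GB")
    # Extrapolation under the linear-scaling claim; not measured here.
    for c in (8, 16):
        print(f"predicted knee at concurrency {c}: {working_set_gb(c, 0.91):.2f} GB")
```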

README.md ties exp(a)+exp(b) into the section-2.2 GPU-hit-first argument
with tables, conclusions, and caveats. Raw per-request dumps gitignored;
summary/m0/m1 deltas kept.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 11:23:31 +08:00

Gahow Wang

837df6bc9e

v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss)

Measures TTFT to serve a reused prefix of length L from each KV tier on a
single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier
hit (native DRAM offload), GPU-tier hit (HBM prefix cache). Each measured
request is bracketed by /metrics scrapes so the tier is verified
(vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed.
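The bracketing check can be sketched as a comparison of two counter snapshots taken around a single request. This is a hypothetical illustration of the idea, not the repository's code: the dictionary keys `local_hits` and `external_hits` stand in for the already-parsed values of the two counters named above, and the `mixed` outcome is an added assumption for requests that hit both tiers.

```python
def classify_tier(before: dict, after: dict) -> str:
    """Infer which KV tier served a request from counter deltas.

    `before`/`after` hold counter values scraped just before and just after
    the measured request. Keys are hypothetical stand-ins:
      local_hits    -> GPU (HBM) prefix-cache hit tokens
      external_hits -> hit tokens loaded from an external (CPU/offload) tier
    """
    d_local = after["local_hits"] - before["local_hits"]
    d_external = after["external_hits"] - before["external_hits"]
    if d_local > 0 and d_external > 0:
        return "mixed"  # part of the prefix came from each tier
    if d_external > 0:
        return "cpu"
    if d_local > 0:
        return "gpu"
    return "miss"       # neither counter moved: prefix was recomputed


# Made-up snapshots, for illustration only.
snap0 = {"local_hits": 100, "external_hits": 0}
snap_gpu = {"local_hits": 1124, "external_hits": 0}
snap_cpu = {"local_hits": 100, "external_hits": 1024}
```

A measurement whose classified tier does not match the intended tier can then be discarded rather than silently averaged in.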

Result: GPU hit is ~flat (42->111 ms over 1k->64k tokens); CPU hit is
transfer-bound (PCIe H2D ~54 GB/s, 57->272 ms); miss grows superlinearly
(78 ms -> 15.2 s). GPU beats CPU 1.4-2.5x (gap grows with context);
miss/CPU up to 56x, miss/GPU up to 137x. pcie_transfer.py gives an
independent floor for CPU-hit latency, as a backstop for those numbers.
Evidence for the GPU-hit-first principle (paper section 2.2).
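The "~flat" versus "superlinear" wording can be made concrete with a crude log-log slope between the two quoted endpoints of each tier (1k and 64k tokens). A rough sketch only: two points cannot capture the shape of the curve in between, and the names below are illustrative. An exponent above 1 means latency grows faster than context length; an exponent near 0 means it is nearly flat.

```python
import math

# (TTFT at 1k tokens, TTFT at 64k tokens) in milliseconds, from the message.
endpoints_ms = {
    "gpu": (42, 111),
    "cpu": (57, 272),
    "miss": (78, 15200),
}


def scaling_exponent(t_short: float, t_long: float, length_ratio: float = 64.0) -> float:
    """Slope k in t ~ L**k, estimated from two endpoints on a log-log scale."""
    return math.log(t_long / t_short) / math.log(length_ratio)


exponents = {tier: scaling_exponent(*ts) for tier, ts in endpoints_ms.items()}

if __name__ == "__main__":
    for tier, k in exponents.items():
        print(f"{tier}: t ~ L^{k:.2f}")
```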

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 11:23:04 +08:00

3 Commits