agentic-kvc/v2/exp_a_tier_latency/results/pcie.json
Gahow Wang 837df6bc9e v2 exp(a): three-tier KV-hit latency microbench (GPU >> CPU >> miss)
Measures TTFT to serve a reused prefix of length L from each KV tier on a
single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier
hit (native DRAM offload), GPU-tier hit (HBM prefix cache). Each measured
request is bracketed by /metrics scrapes so the tier is verified
(vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed.
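The scrape-bracketing check described above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: the helper names, the sample scrape text, and the classification rule (any external-cache hit counts as a CPU-tier hit) are all assumptions, and the exact exported metric names may differ (for example a _total suffix), so they are passed in as parameters.

```python
def parse_counters(metrics_text):
    """Parse Prometheus text exposition into {metric_name: value summed over labels}."""
    totals = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)
    return totals


def classify_tier(before_text, after_text,
                  gpu_key="vllm:prefix_cache_hits",
                  cpu_key="vllm:external_prefix_cache_hits"):
    """Label one request as 'gpu', 'cpu', or 'miss' from counter deltas.

    Assumption: the two scrapes bracket exactly one request, so any
    counter increase is attributable to it.
    """
    before, after = parse_counters(before_text), parse_counters(after_text)
    gpu_delta = after.get(gpu_key, 0.0) - before.get(gpu_key, 0.0)
    cpu_delta = after.get(cpu_key, 0.0) - before.get(cpu_key, 0.0)
    if cpu_delta > 0:
        return "cpu"
    if gpu_delta > 0:
        return "gpu"
    return "miss"


# Made-up scrape text for illustration only (not real benchmark output).
SCRAPE_BEFORE = (
    '# TYPE vllm:prefix_cache_hits counter\n'
    'vllm:prefix_cache_hits{model_name="m"} 100\n'
    'vllm:external_prefix_cache_hits{model_name="m"} 50\n'
)
SCRAPE_AFTER_GPU = (
    'vllm:prefix_cache_hits{model_name="m"} 1124\n'
    'vllm:external_prefix_cache_hits{model_name="m"} 50\n'
)
SCRAPE_AFTER_CPU = (
    'vllm:prefix_cache_hits{model_name="m"} 100\n'
    'vllm:external_prefix_cache_hits{model_name="m"} 1074\n'
)

if __name__ == "__main__":
    print(classify_tier(SCRAPE_BEFORE, SCRAPE_AFTER_GPU))
    print(classify_tier(SCRAPE_BEFORE, SCRAPE_AFTER_CPU))
    print(classify_tier(SCRAPE_BEFORE, SCRAPE_BEFORE))
```

Classifying from counter deltas rather than from the request setup is what lets a measurement be discarded when the cache did not behave as intended.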

Result: GPU hit is nearly flat (42->111 ms over 1k->64k tokens, 2.6x for
64x the prefix); CPU hit is transfer-bound (PCIe H2D ~54 GB/s, 57->272 ms);
miss grows superlinearly (78 ms -> 15.2 s). GPU beats CPU by 1.4-2.5x (gap
grows with context); miss/CPU reaches 56x and miss/GPU 137x at 64k.
pcie_transfer.py independently measures the raw H2D transfer time, which
serves as a lower bound (floor) on CPU-hit latency. Evidence for the
GPU-hit-first principle (paper section 2.2).
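The ratios quoted above can be reproduced from the endpoint latencies in the message. The sketch below assumes that "1k" and "64k" mean 1024 and 65536 tokens (the lengths used in pcie.json) and that all three tiers were measured at those same endpoints; the dictionary and function names are illustrative.

```python
# Endpoint TTFTs in milliseconds, copied from the commit message.
ttft_ms = {
    "gpu": {1024: 42, 65536: 111},
    "cpu": {1024: 57, 65536: 272},
    "miss": {1024: 78, 65536: 15200},
}


def speedup(slow, fast, length):
    """How many times slower tier `slow` is than tier `fast` at a prefix length."""
    return ttft_ms[slow][length] / ttft_ms[fast][length]


def growth(tier):
    """Latency growth factor of a tier from the shortest to the longest prefix."""
    return ttft_ms[tier][65536] / ttft_ms[tier][1024]


if __name__ == "__main__":
    print(f"CPU/GPU  at 1k : {speedup('cpu', 'gpu', 1024):.2f}x")
    print(f"CPU/GPU  at 64k: {speedup('cpu', 'gpu', 65536):.2f}x")
    print(f"miss/CPU at 64k: {speedup('miss', 'cpu', 65536):.1f}x")
    print(f"miss/GPU at 64k: {speedup('miss', 'gpu', 65536):.1f}x")
    # The prefix grows 64x; a growth factor above 64 means superlinear scaling.
    print(f"miss growth 1k->64k: {growth('miss'):.0f}x for a 64x longer prefix")
```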

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 11:23:04 +08:00


{
  "device": "NVIDIA H20",
  "by_length": {
    "1024": {
      "kv_bytes": 100663296,
      "transfer_s": 0.001876260997960344,
      "bw_GBps": 53.65100916633112
    },
    "2048": {
      "kv_bytes": 201326592,
      "transfer_s": 0.003709116979734972,
      "bw_GBps": 54.27884671741612
    },
    "4096": {
      "kv_bytes": 402653184,
      "transfer_s": 0.007338636991335079,
      "bw_GBps": 54.86757070494469
    },
    "8192": {
      "kv_bytes": 805306368,
      "transfer_s": 0.01476299500791356,
      "bw_GBps": 54.548983290201164
    },
    "16384": {
      "kv_bytes": 1610612736,
      "transfer_s": 0.02972855800180696,
      "bw_GBps": 54.17729093695375
    },
    "32768": {
      "kv_bytes": 3221225472,
      "transfer_s": 0.059267577016726136,
      "bw_GBps": 54.35055107940257
    },
    "65536": {
      "kv_bytes": 6442450944,
      "transfer_s": 0.11847134301206097,
      "bw_GBps": 54.37982536708583
    }
  }
}
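
The fields in this file are related by simple arithmetic, which makes the results easy to sanity-check. The sketch below embeds three of the entries above and assumes that bw_GBps means decimal gigabytes per second (kv_bytes / transfer_s / 1e9); the helper names are illustrative and are not taken from pcie_transfer.py.

```python
# Three entries copied from pcie.json (shortest, middle, longest prefix).
by_length = {
    1024: {"kv_bytes": 100663296,
           "transfer_s": 0.001876260997960344,
           "bw_GBps": 53.65100916633112},
    8192: {"kv_bytes": 805306368,
           "transfer_s": 0.01476299500791356,
           "bw_GBps": 54.548983290201164},
    65536: {"kv_bytes": 6442450944,
            "transfer_s": 0.11847134301206097,
            "bw_GBps": 54.37982536708583},
}


def bandwidth_gbps(entry):
    """Recompute bandwidth in decimal GB/s from bytes moved and elapsed seconds."""
    return entry["kv_bytes"] / entry["transfer_s"] / 1e9


def bytes_per_token(length, entry):
    """KV-cache footprint per token implied by an entry."""
    return entry["kv_bytes"] / length


if __name__ == "__main__":
    for length, entry in by_length.items():
        print(length,
              f"{bandwidth_gbps(entry):.2f} GB/s (recorded {entry['bw_GBps']:.2f})",
              f"{bytes_per_token(length, entry):.0f} B/token",
              f"{entry['transfer_s'] * 1e3:.1f} ms")
```

Every entry implies the same 98304 bytes of KV cache per token, so transfer time scales linearly with prefix length at a roughly constant ~54 GB/s.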