Measures TTFT to serve a reused prefix of length L from each KV tier on a single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier hit (native DRAM offload), and GPU-tier hit (HBM prefix cache).

Each measured request is bracketed by /metrics scrapes so the tier is verified (vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed.

Result: GPU hit is roughly flat (42 -> 111 ms over 1k -> 64k tokens); CPU hit is transfer-bound (PCIe H2D ~54 GB/s, 57 -> 272 ms); miss grows superlinearly (78 ms -> 15.2 s). GPU beats CPU by 1.4-2.5x, and the gap grows with context. Miss is up to 56x slower than a CPU hit and up to 137x slower than a GPU hit.

pcie_transfer.py is an independent backstop: it measures the raw H2D transfer time, which is the floor for a CPU hit.

This is evidence for the GPU-hit-first principle (paper section 2.2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
40 lines
942 B
JSON
{
  "device": "NVIDIA H20",
  "by_length": {
    "1024": {
      "kv_bytes": 100663296,
      "transfer_s": 0.001876260997960344,
      "bw_GBps": 53.65100916633112
    },
    "2048": {
      "kv_bytes": 201326592,
      "transfer_s": 0.003709116979734972,
      "bw_GBps": 54.27884671741612
    },
    "4096": {
      "kv_bytes": 402653184,
      "transfer_s": 0.007338636991335079,
      "bw_GBps": 54.86757070494469
    },
    "8192": {
      "kv_bytes": 805306368,
      "transfer_s": 0.01476299500791356,
      "bw_GBps": 54.548983290201164
    },
    "16384": {
      "kv_bytes": 1610612736,
      "transfer_s": 0.02972855800180696,
      "bw_GBps": 54.17729093695375
    },
    "32768": {
      "kv_bytes": 3221225472,
      "transfer_s": 0.059267577016726136,
      "bw_GBps": 54.35055107940257
    },
    "65536": {
      "kv_bytes": 6442450944,
      "transfer_s": 0.11847134301206097,
      "bw_GBps": 54.37982536708583
    }
  }
}
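
The fields in this file can be cross-checked against each other. The sketch below is not part of pcie_transfer.py. It recomputes bw_GBps as kv_bytes / transfer_s / 1e9 (decimal gigabytes, which is what the recorded values imply) and derives the KV footprint per token. The model shape constants are an assumption, not taken from this repository: 48 layers, 4 KV heads, head dimension 128, 2-byte KV dtype. The helper names are hypothetical.

```python
import json

# Two rows copied verbatim from the results file above (shortest and longest prefix).
RESULTS = json.loads("""
{
  "by_length": {
    "1024":  {"kv_bytes": 100663296,  "transfer_s": 0.001876260997960344, "bw_GBps": 53.65100916633112},
    "65536": {"kv_bytes": 6442450944, "transfer_s": 0.11847134301206097,  "bw_GBps": 54.37982536708583}
  }
}
""")

# ASSUMED model shape (not stated in the source): layers, KV heads, head dim, bytes per element.
LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 48, 4, 128, 2


def expected_bytes_per_token() -> int:
    """KV bytes per token: one K and one V vector per layer and KV head."""
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES


def bandwidth_gbps(kv_bytes: int, transfer_s: float) -> float:
    """Effective transfer bandwidth in decimal GB/s."""
    return kv_bytes / transfer_s / 1e9


def bytes_per_token(length: int, kv_bytes: int) -> float:
    """KV footprint per token implied by a result row."""
    return kv_bytes / length


if __name__ == "__main__":
    for length, row in RESULTS["by_length"].items():
        bw = bandwidth_gbps(row["kv_bytes"], row["transfer_s"])
        per_tok = bytes_per_token(int(length), row["kv_bytes"])
        floor_ms = row["transfer_s"] * 1e3
        print(f"L={length}: {bw:.2f} GB/s, {per_tok:.0f} B/token, transfer floor {floor_ms:.1f} ms")
```

At 64k tokens the transfer floor is about 118 ms, while the commit message reports 272 ms for a CPU hit at the upper end of its range, so the raw copy would account for a bit under half of that TTFT, assuming the 272 ms figure is the 64k measurement.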