Measures TTFT to serve a reused prefix of length L from each KV tier on a single H20 (Qwen3-Coder-30B-A3B, vLLM 0.18.1): miss (recompute), CPU-tier hit (native DRAM offload), GPU-tier hit (HBM prefix cache). Each measured request is bracketed by /metrics scrapes so the tier is verified (vllm:prefix_cache_hits vs external_prefix_cache_hits), not assumed.

Result: GPU hit is ~flat (42->111 ms over 1k->64k tokens); CPU hit is transfer-bound (PCIe H2D ~54 GB/s, 57->272 ms); miss grows superlinearly (78 ms -> 15.2 s). GPU beats CPU 1.4-2.5x (gap grows with context); miss/CPU up to 56x, miss/GPU up to 137x.

pcie_transfer.py is the independent CPU-hit floor backstop. Evidence for the GPU-hit-first principle (paper section 2.2).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
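The backstop bound CPU_hit(L) >= GPU_hit(L) + KV_bytes(L) / BW_h2d can be checked by hand against the endpoint numbers quoted above. The following is a minimal sketch, not part of the committed scripts: it assumes "1k" and "64k" mean 1024 and 65536 tokens, takes the ~54 GB/s figure as a constant bandwidth, and reuses the 98304 bytes/token constant from pcie_transfer.py. The function name `cpu_hit_floor_ms` is hypothetical.

```python
KV_BYTES_PER_TOKEN = 98304  # constant from pcie_transfer.py
BW_H2D_BYTES_PER_S = 54e9   # ~54 GB/s, treated here as constant (assumption)


def cpu_hit_floor_ms(length: int, gpu_hit_ms: float) -> float:
    """Lower bound on CPU-tier TTFT: GPU-hit TTFT plus the H2D transfer time."""
    transfer_ms = length * KV_BYTES_PER_TOKEN / BW_H2D_BYTES_PER_S * 1000.0
    return gpu_hit_ms + transfer_ms


# (tokens, GPU-hit TTFT in ms, measured CPU-hit TTFT in ms), endpoints only.
ENDPOINTS = [(1024, 42.0, 57.0), (65536, 111.0, 272.0)]

for length, gpu_ms, cpu_ms in ENDPOINTS:
    floor_ms = cpu_hit_floor_ms(length, gpu_ms)
    print(f"L={length:>6} floor={floor_ms:7.2f} ms measured={cpu_ms:7.2f} ms")
```

Under these assumptions the floor comes out at roughly 44 ms and 230 ms, both below the measured 57 ms and 272 ms, which is consistent with the CPU hit being dominated by the transfer at long context.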
60 lines
1.8 KiB
Python
"""Exp (a) backstop: direct CPU(DRAM)->GPU(HBM) KV-transfer cost.

Independent lower bound on a CPU-tier hit: fetching L tokens' KV over the
host<->device link. CPU_hit(L) >= GPU_hit(L) + KV_bytes(L) / BW_h2d.
Uses pinned host memory (best case for the offload tier, which pins buffers).
"""
from __future__ import annotations

import argparse
import json
import time

import torch

KV_BYTES_PER_TOKEN = 98304  # Qwen3-Coder, bf16
LENGTHS = [1024, 2048, 4096, 8192, 16384, 32768, 65536]


def time_h2d(nbytes: int, reps: int) -> float:
    """Return the median wall-clock time (s) of one pinned host->device copy.

    For an even number of reps this is the upper of the two middle samples.
    """
    n = nbytes // 2  # bf16 elements (2 bytes each)
    host = torch.empty(n, dtype=torch.bfloat16, pin_memory=True)
    dev = torch.empty(n, dtype=torch.bfloat16, device="cuda")
    # warmup
    for _ in range(3):
        dev.copy_(host, non_blocking=True)
    torch.cuda.synchronize()
    ts = []
    for _ in range(reps):
        t0 = time.perf_counter()
        dev.copy_(host, non_blocking=True)
        # The copy is asynchronous; wait for it before stopping the clock.
        torch.cuda.synchronize()
        ts.append(time.perf_counter() - t0)
    ts.sort()
    return ts[len(ts) // 2]


def main():
    ap = argparse.ArgumentParser()
    ap.add_argument("--reps", type=int, default=20)
    ap.add_argument("--out", required=True)
    args = ap.parse_args()
    assert torch.cuda.is_available(), "need a GPU"
    print("device:", torch.cuda.get_device_name(0))

    out = {"device": torch.cuda.get_device_name(0), "by_length": {}}
    for L in LENGTHS:
        nbytes = L * KV_BYTES_PER_TOKEN
        sec = time_h2d(nbytes, args.reps)
        bw = nbytes / sec / 1e9  # GB/s (decimal)
        out["by_length"][str(L)] = {
            "kv_bytes": nbytes, "transfer_s": sec, "bw_GBps": bw,
        }
        print(f"L={L:>6} KV={nbytes/1e9:6.3f}GB t={sec*1000:7.2f}ms bw={bw:6.1f} GB/s", flush=True)
    # Close the output file deterministically instead of leaking the handle.
    with open(args.out, "w") as f:
        json.dump(out, f, indent=2)
    print("wrote", args.out)


if __name__ == "__main__":
    main()
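The script hard-codes KV_BYTES_PER_TOKEN = 98304 without showing where the number comes from. One plausible derivation is the usual per-token KV-cache size: two tensors (K and V) per layer, each with num_kv_heads * head_dim elements. The sketch below is illustrative only; the layer count, KV-head count, and head dimension are assumptions not stated in this file and should be verified against the model's config.json. The helper name `kv_bytes_per_token` is hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, dtype_bytes: int) -> int:
    """Bytes of KV cache one token occupies across all layers."""
    # Factor 2: one K vector and one V vector per layer per token.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Assumed values (NOT taken from this file); dtype_bytes=2 corresponds to bf16.
ASSUMED_CONFIG = {
    "num_layers": 48,
    "num_kv_heads": 4,
    "head_dim": 128,
    "dtype_bytes": 2,
}

per_token = kv_bytes_per_token(**ASSUMED_CONFIG)
print(per_token, "bytes per token")
```

With these assumed values the product matches the constant used in the script. If the serving engine stores the KV cache in a different dtype (for example an 8-bit format), the constant and the resulting floor would shrink proportionally.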