Files
agentic-kvc/v2/exp_b_capacity_knee/results/m1_blk6144.json
Gahow Wang ad754cfe0b v2 exp(b): GPU KV-capacity APC/latency knee + writeup
Sweeps GPU KV-cache capacity (--num-gpu-blocks-override) under a closed-loop
replay (concurrency 4) of a controlled multi-turn workload (cumulative
intra-session prefix, gen_synth_trace.py), measuring realized APC
(prefix_cache hits/queries delta) and latency per capacity.

Result: a sharp knee at 3.6 GB = exactly the active working set
(4 sessions x 0.91 GB). APC rises 7->12->36->80% then saturates at the
~71% intra-session ceiling; TTFT p90 collapses 13.0 s -> 0.53 s at the same
point; dead flat to 14.5 GB, 100% completion throughout. So only the active
working set needs HBM; capacity beyond it -- and the CPU/storage tier built
to chase the reuse tail -- buys ~0. Knee scales linearly with concurrency
= cluster GPU count.

README.md ties exp(a)+exp(b) into the section-2.2 GPU-hit-first argument
with tables, conclusions, and caveats. Raw per-request dumps gitignored;
summary/m0/m1 deltas kept.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 11:23:31 +08:00

2 lines
133 B
JSON

{"gpu_hits": 1780650671.0650043, "gpu_queries": 1780860335.06498, "ext_hits": 1780086191.0650318, "ext_queries": 1780086191.0650187}