agentic-kvc

Files

Gahow Wang ad754cfe0b v2 exp(b): GPU KV-capacity APC/latency knee + writeup

Sweeps GPU KV-cache capacity (--num-gpu-blocks-override) under a closed-loop
replay (concurrency 4) of a controlled multi-turn workload (cumulative
intra-session prefix, gen_synth_trace.py), measuring realized APC
(prefix_cache hits/queries delta) and latency per capacity.

Result: a sharp knee at 3.6 GB = exactly the active working set
(4 sessions x 0.91 GB). APC rises 7->12->36->80% then saturates at the
~71% intra-session ceiling; TTFT p90 collapses 13.0 s -> 0.53 s at the same
point; dead flat to 14.5 GB, 100% completion throughout. So only the active
working set needs HBM; capacity beyond it -- and the CPU/storage tier built
to chase the reuse tail -- buys ~0. Knee scales linearly with concurrency
= cluster GPU count.

README.md ties exp(a)+exp(b) into the section-2.2 GPU-hit-first argument
with tables, conclusions, and caveats. Raw per-request dumps gitignored;
summary/m0/m1 deltas kept.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 11:23:31 +08:00

results

v2 exp(b): GPU KV-capacity APC/latency knee + writeup

2026-05-30 11:23:31 +08:00

analyze_and_plot.py

v2 exp(b): GPU KV-capacity APC/latency knee + writeup

2026-05-30 11:23:31 +08:00

gen_synth_trace.py

v2 exp(b): GPU KV-capacity APC/latency knee + writeup

2026-05-30 11:23:31 +08:00

run_sweep.sh

v2 exp(b): GPU KV-capacity APC/latency knee + writeup

2026-05-30 11:23:31 +08:00