Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.
(1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time"
Two-panel side-by-side:
Left (request count view, the naive reading): KVC P = 328 reqs (7.4%),
KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
Right (compute work view, the honest reading): KVC P does 1.07M tokens
of prefill, comparable to each KVC D worker's ~0.80M. P is a
low-frequency high-cost safety net, not idle capacity.
Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
win.
(2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win"
Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool
(276K vs 351K tokens) yet caches MORE per request.
Left (cache hit rate vs turn number): KVC's session-affinity lets
hit rate accumulate with turns; DP's hash + radix-LRU causes
a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
= 95.8% (1.24pp gap). Shows mechanism, not just outcome.
Right (ECDF of per-request uncached tokens, log x): KVC's distribution
concentrates near zero (50% < 187 tokens), DP's is spread
(50% < 781 tokens). At uncached = 500 tokens threshold, KVC
has 74% of requests below, DP has 31%.
→ smaller pool, better retention, less per-request work. Direct empirical
rebuttal to "fragmentation is architectural, not policy."
Bundled scripts (rerunable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>