Reviewer feedback: the original gpu_utilization figure was confusing.
"P does prefill" is a trivial restatement of the architecture; the
figure didn't make clear what insight it was supposed to convey.
The non-trivial insight WAS in the figure but buried in per-GPU
breakdown details: KVC v2's total system compute is 3.47M tokens
vs DP's 5.17M -- a 33% reduction for the same 4449-request workload.
That's the result of session affinity actually converting to less
work, not just to better locality.
Redesigned the figure to lead with that finding:
Left panel (NEW): system-wide compute as two stacked bars
- KVC: P heavy prefill (1.07M) + D append-prefill (1.39M) + decode (1.01M)
- DP: full prefill (4.17M) + decode (1.00M)
- Big "-33% total compute" badge bracketed by an arrow between the
bar tops makes the headline number unmissable
Right panel (kept, simplified): per-GPU work distribution
- Same color coding as the left panel, so the architecture story
flows from "what work the system does" to "where it happens"
- In-panel annotation boxes describe the two architectural shapes
(specialized P + light D vs uniform fused workers)
- Removed the second legend that was overlapping bars
Doc §4.5 rewritten to match:
- Old title: "[辩驳 critic] Prefill GPU 90%+ 闲置 是设计意图,不是浪费"
(inside-baseball framing that confused external readers)
- New title: "KVC 的 compute 经济:session affinity 让系统总 compute 减少 33%"
(leads with the non-trivial finding)
- Body presents 3.47M vs 5.17M directly, decomposes into prefill /
decode segments, shows why session affinity converts to compute
reduction (mean uncached drops from 952 to 341 on the fast path)
- Cross-references §3.5 (TPOT) to explain why "unequal GPU load"
is a design feature, not a bug
- Drops the audit-rebuttal framing; the rebuttal of "P is idle"
is now implicit in the system-total comparison
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the
"P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations
clean white-bbox space above the bars instead of crashing into the KVC D
bars at x=1. Move both annotation xytext positions to x=2.4 (left panel)
and x=5.5 (right panel) so the arrows pull away from the orange P bar
toward the center of the panel.
Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at
y=1.02; subplot titles raised to pad=24 to leave room.
Note: a small visual collision between the bboxed group labels and the
subplot-title second line remains in the rendered output (acknowledged
in the prior conversation). Acceptable for now; full layout rework is
deferred. The annotation-vs-bar overlap (the original blocker) is fixed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.
(1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time"
Two-panel side-by-side:
Left (request count view, the naive reading): KVC P = 328 reqs (7.4%),
KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
Right (compute work view, the honest reading): KVC P does 1.07M tokens
of prefill, comparable to each KVC D worker's ~0.80M. P is a
low-frequency high-cost safety net, not idle capacity.
Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
win.
(2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win"
Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool
(276K vs 351K tokens) yet caches MORE per request.
Left (cache hit rate vs turn number): KVC's session-affinity lets
hit rate accumulate with turns; DP's hash + radix-LRU causes
a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
= 95.8% (1.24pp gap). Shows mechanism, not just outcome.
Right (ECDF of per-request uncached tokens, log x): KVC's distribution
concentrates near zero (50% < 187 tokens), DP's is spread
(50% < 781 tokens). At uncached = 500 tokens threshold, KVC
has 74% of requests below, DP has 31%.
→ smaller pool, better retention, less per-request work. Direct empirical
rebuttal to "fragmentation is architectural, not policy."
Bundled scripts (rerunable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>