agentic-pd-hybrid

gahow/agentic-pd-hybrid

Fork 0

Commit Graph

Author	SHA1	Message	Date
kzlin	506d360160	fix(figures): GPU utilization figure annotation/headroom polish Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the "P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations clean white-bbox space above the bars instead of crashing into the KVC D bars at x=1. Move both annotation xytext positions to x=2.4 (left panel) and x=5.5 (right panel) so the arrows pull away from the orange P bar toward the center of the panel. Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at y=1.02; subplot titles raised to pad=24 to leave room. Note: a small visual collision between the bboxed group labels and the subplot-title second line remains in the rendered output (acknowledged in the prior conversation). Acceptable for now; full layout rework is deferred. The annotation-vs-bar overlap (the original blocker) is fixed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 22:28:39 +08:00
kzlin	517677d7f2	docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic) Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to visually rebut the two critic-agent claims that we argued in prose were design intent, not deficiencies. (1) gpu_utilization.png -- §4.5 "P GPU is wasted 90% of the time" Two-panel side-by-side: Left (request count view, the naive reading): KVC P = 328 reqs (7.4%), KVC D = ~1450 each, DP = ~1100 each. P "looks idle." Right (compute work view, the honest reading): KVC P does 1.07M tokens of prefill, comparable to each KVC D worker's ~0.80M. P is a low-frequency high-cost safety net, not idle capacity. Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33% LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity win. (2) cache_efficiency.png -- §4.4 "Cache concentration is not policy win" Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool (276K vs 351K tokens) yet caches MORE per request. Left (cache hit rate vs turn number): KVC's session-affinity lets hit rate accumulate with turns; DP's hash + radix-LRU causes a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP = 95.8% (1.24pp gap). Shows mechanism, not just outcome. Right (ECDF of per-request uncached tokens, log x): KVC's distribution concentrates near zero (50% < 187 tokens), DP's is spread (50% < 781 tokens). At uncached = 500 tokens threshold, KVC has 74% of requests below, DP has 31%. → smaller pool, better retention, less per-request work. Direct empirical rebuttal to "fragmentation is architectural, not policy." Bundled scripts (rerunable): - scripts/analysis/plot_gpu_utilization.py - scripts/analysis/plot_cache_efficiency.py Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-11 18:04:49 +08:00

Author

SHA1

Message

Date

kzlin

506d360160

fix(figures): GPU utilization figure annotation/headroom polish

Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the
"P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations
clean white-bbox space above the bars instead of crashing into the KVC D
bars at x=1. Move both annotation xytext positions to x=2.4 (left panel)
and x=5.5 (right panel) so the arrows pull away from the orange P bar
toward the center of the panel.

Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at
y=1.02; subplot titles raised to pad=24 to leave room.

Note: a small visual collision between the bboxed group labels and the
subplot-title second line remains in the rendered output (acknowledged
in the prior conversation). Acceptable for now; full layout rework is
deferred. The annotation-vs-bar overlap (the original blocker) is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 22:28:39 +08:00

kzlin

517677d7f2

docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic)

Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.

(1) gpu_utilization.png  -- §4.5  "P GPU is wasted 90% of the time"
  Two-panel side-by-side:
    Left  (request count view, the naive reading): KVC P = 328 reqs (7.4%),
          KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
    Right (compute work view, the honest reading): KVC P does 1.07M tokens
          of prefill, comparable to each KVC D worker's ~0.80M. P is a
          low-frequency high-cost safety net, not idle capacity.
  Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
  LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
  win.

(2) cache_efficiency.png  -- §4.4  "Cache concentration is not policy win"
  Two-panel side-by-side. The setup: KVC has 27% LESS total KV pool
  (276K vs 351K tokens) yet caches MORE per request.
    Left  (cache hit rate vs turn number): KVC's session-affinity lets
          hit rate accumulate with turns; DP's hash + radix-LRU causes
          a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
          = 95.8% (1.24pp gap). Shows mechanism, not just outcome.
    Right (ECDF of per-request uncached tokens, log x): KVC's distribution
          concentrates near zero (50% < 187 tokens), DP's is spread
          (50% < 781 tokens). At uncached = 500 tokens threshold, KVC
          has 74% of requests below, DP has 31%.
  → smaller pool, better retention, less per-request work. Direct empirical
  rebuttal to "fragmentation is architectural, not policy."

Bundled scripts (rerunable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 18:04:49 +08:00

2 Commits