agentic-pd-hybrid

gahow/agentic-pd-hybrid

Commit Graph

Author	SHA1	Message	Date

kzlin

314c4cda0e

docs(kvc): redesign gpu_utilization figure to lead with system-total compute

Reviewer feedback: the original gpu_utilization figure was confusing.
"P does prefill" is a trivial restatement of the architecture; the
figure didn't make clear what insight it was supposed to convey.

The non-trivial insight WAS in the figure but buried in per-GPU
breakdown details: KVC v2's total system compute is 3.47M tokens
vs DP's 5.17M -- a 33% reduction for the same 4449-request workload.
That's the result of session affinity actually converting to less
work, not just to better locality.

Redesigned the figure to lead with that finding:

Left panel (NEW): system-wide compute as two stacked bars
  - KVC: P heavy prefill (1.07M) + D append-prefill (1.39M) + decode (1.01M)
  - DP:  full prefill (4.17M) + decode (1.00M)
  - Big "-33% total compute" badge bracketed by an arrow between the
    bar tops makes the headline number unmissable
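
The segment totals and the headline badge can be re-derived from the numbers quoted above. This is a minimal sketch, not the code in plot_gpu_utilization.py; the dictionary and function names are assumptions made for illustration.

```python
# Token counts in millions, copied from the commit message above.
KVC = {"P heavy prefill": 1.07, "D append-prefill": 1.39, "decode": 1.01}
DP = {"full prefill": 4.17, "decode": 1.00}


def total(segments):
    """Sum the stacked-bar segments for one system (millions of tokens)."""
    return round(sum(segments.values()), 2)


def reduction_pct(baseline, candidate):
    """Percent by which `candidate` is smaller than `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


kvc_total = total(KVC)
dp_total = total(DP)
# Hypothetical badge text; the real figure's wording may differ.
badge = f"{-reduction_pct(dp_total, kvc_total):.0f}% total compute"
print(kvc_total, dp_total, badge)
```

Keeping the badge text derived from the segment values, rather than hard-coded, would let the figure stay consistent if the underlying counts are regenerated.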

Right panel (kept, simplified): per-GPU work distribution
  - Same color coding as the left panel, so the architecture story
    flows from "what work the system does" to "where it happens"
  - In-panel annotation boxes describe the two architectural shapes
    (specialized P + light D vs uniform fused workers)
  - Removed the second legend that was overlapping bars

Doc §4.5 rewritten to match:
  - Old title: "[Rebutting critic] Prefill GPU 90%+ idle is design intent, not waste"
    (inside-baseball framing that confused external readers)
  - New title: "KVC's compute economics: session affinity cuts total system compute by 33%"
    (leads with the non-trivial finding)
  - Body presents 3.47M vs 5.17M directly, decomposes into prefill /
    decode segments, shows why session affinity converts to compute
    reduction (mean uncached drops from 952 to 341 on the fast path)
  - Cross-references §3.5 (TPOT) to explain why "unequal GPU load"
    is a design feature, not a bug
  - Drops the audit-rebuttal framing; the rebuttal of "P is idle"
    is now implicit in the system-total comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 10:39:15 +08:00

kzlin

506d360160

fix(figures): GPU utilization figure annotation/headroom polish

Bar-overlap fix: extend ylim by 35-45% above the tallest bar to give the
"P GPU only sees 328 requests" and "P GPU does 1.07M tokens" annotations
clean white-bbox space above the bars instead of crashing into the KVC D
bars at x=1. Move both annotation xytext positions to x=2.4 (left panel)
and x=5.5 (right panel) so the arrows pull away from the orange P bar
toward the center of the panel.
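
The headroom rule described above amounts to scaling the upper y-limit by the tallest bar. A rough sketch follows; the helper name and the 0.40 default are assumptions, and the bar heights reuse the approximate request counts from the earlier commit (328 for P, ~1450 per D worker).

```python
def ylim_with_headroom(bar_heights, headroom=0.40):
    """Return (bottom, top) with `headroom` (a fraction) of empty space
    above the tallest bar, leaving room for annotation boxes."""
    tallest = max(bar_heights)
    return 0.0, tallest * (1.0 + headroom)


# Approximate per-GPU request counts: one P worker, three D workers.
heights = [328, 1450, 1450, 1450]
bottom, top = ylim_with_headroom(heights, headroom=0.40)
print(bottom, top)
```

The resulting pair would typically be passed to matplotlib's `Axes.set_ylim(bottom, top)` before the annotations are placed.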

Group labels (KVC 1P3D / DP 4-way CA) kept in axes-fraction bboxes at
y=1.02; subplot titles raised to pad=24 to leave room.

Note: a small visual collision between the bboxed group labels and the
subplot-title second line remains in the rendered output (acknowledged
in the prior conversation). Acceptable for now; full layout rework is
deferred. The annotation-vs-bar overlap (the original blocker) is fixed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 22:28:39 +08:00

kzlin

517677d7f2

docs(kvc): add GPU-utilization and cache-efficiency figures (rebut critic)

Two figures inserted into V2_DEEP_ANALYSIS §4.5 and §4.4 respectively, to
visually rebut the two critic-agent claims that we argued in prose were
design intent, not deficiencies.

(1) gpu_utilization.png  -- §4.5  "P GPU is wasted 90% of the time"
  Two-panel side-by-side:
    Left  (request count view, the naive reading): KVC P = 328 reqs (7.4%),
          KVC D = ~1450 each, DP = ~1100 each. P "looks idle."
    Right (compute work view, the honest reading): KVC P does 1.07M tokens
          of prefill, comparable to each KVC D worker's ~0.80M. P is a
          low-frequency high-cost safety net, not idle capacity.
  Bonus finding: KVC's total compute (3.47M tokens across 4 GPUs) is 33%
  LESS than DP's (5.17M). Same GPUs, less work done. That's the affinity
  win.

(2) cache_efficiency.png  -- §4.4  "Cache concentration is not policy win"
  Two-panel side-by-side. The setup: KVC has a 21% SMALLER total KV pool
  (276K vs 351K tokens; DP's pool is 27% larger) yet caches MORE per request.
    Left  (cache hit rate vs turn number): KVC's session-affinity lets
          hit rate accumulate with turns; DP's hash + radix-LRU causes
          a mid-turn drift around turns 8-25 where KVC = 97.0% vs DP
          = 95.8% (1.24pp gap). Shows mechanism, not just outcome.
    Right (ECDF of per-request uncached tokens, log x): KVC's distribution
          concentrates near zero (50% < 187 tokens), DP's is spread
          (50% < 781 tokens). At uncached = 500 tokens threshold, KVC
          has 74% of requests below, DP has 31%.
  → smaller pool, better retention, less per-request work. Direct empirical
  rebuttal to "fragmentation is architectural, not policy."
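
The ECDF statistics quoted for the right panel (median uncached tokens, share of requests below a threshold) could be computed along these lines. This is a sketch under stated assumptions: the function name is hypothetical and the sample values are made up for illustration, not taken from the experiment.

```python
from bisect import bisect_left
from statistics import median


def ecdf_below(samples, threshold):
    """Fraction of samples strictly below `threshold` (empirical CDF)."""
    ordered = sorted(samples)
    return bisect_left(ordered, threshold) / len(ordered)


# Made-up per-request uncached token counts, for illustration only.
uncached = [12, 40, 187, 300, 450, 520, 900, 2400]

share_below_500 = ecdf_below(uncached, 500)
median_uncached = median(uncached)
print(share_below_500, median_uncached)
```

Run once per system on the real per-request logs, the same two calls would yield the kind of "50% < N tokens" and "X% below 500 tokens" figures cited above.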

Bundled scripts (re-runnable):
- scripts/analysis/plot_gpu_utilization.py
- scripts/analysis/plot_cache_efficiency.py

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-11 18:04:49 +08:00

3 Commits