agentic-pd-hybrid

Files

kzlin 314c4cda0e docs(kvc): redesign gpu_utilization figure to lead with system-total compute

Reviewer feedback: the original gpu_utilization figure was confusing.
"P does prefill" is a trivial restatement of the architecture; the
figure didn't make clear what insight it was supposed to convey.

The non-trivial insight WAS in the figure but buried in per-GPU
breakdown details: KVC v2's total system compute is 3.47M tokens
vs DP's 5.17M -- a 33% reduction for the same 4449-request workload.
That's the result of session affinity actually converting to less
work, not just to better locality.

Redesigned the figure to lead with that finding:

Left panel (NEW): system-wide compute as two stacked bars
  - KVC: P heavy prefill (1.07M) + D append-prefill (1.39M) + decode (1.01M)
  - DP:  full prefill (4.17M) + decode (1.00M)
  - Big "-33% total compute" badge bracketed by an arrow between the
    bar tops makes the headline number unmissable

Right panel (kept, simplified): per-GPU work distribution
  - Same color coding as the left panel, so the architecture story
    flows from "what work the system does" to "where it happens"
  - In-panel annotation boxes describe the two architectural shapes
    (specialized P + light D vs uniform fused workers)
  - Removed the second legend that was overlapping bars

Doc §4.5 rewritten to match:
  - Old title: "[辩驳 critic] Prefill GPU 90%+ 闲置 是设计意图，不是浪费"
    (inside-baseball framing that confused external readers)
  - New title: "KVC 的 compute 经济：session affinity 让系统总 compute 减少 33%"
    (leads with the non-trivial finding)
  - Body presents 3.47M vs 5.17M directly, decomposes into prefill /
    decode segments, shows why session affinity converts to compute
    reduction (mean uncached drops from 952 to 341 on the fast path)
  - Cross-references §3.5 (TPOT) to explain why "unequal GPU load"
    is a design feature, not a bug
  - Drops the audit-rebuttal framing; the rebuttal of "P is idle"
    is now implicit in the system-total comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-13 10:39:15 +08:00

analysis

docs(kvc): redesign gpu_utilization figure to lead with system-total compute

2026-05-13 10:39:15 +08:00

convert_audit_to_trace.py

docs: KVC v1-v4 debug journey + raise session soft_cap to 16

2026-04-28 21:10:41 +08:00

run_all_experiments.sh

docs: KVC v1-v4 debug journey + raise session soft_cap to 16

2026-04-28 21:10:41 +08:00

run_exp_a_pd_disagg.sh

docs: KVC v1-v4 debug journey + raise session soft_cap to 16