docs(kvc): redesign gpu_utilization figure to lead with system-total compute
Reviewer feedback: the original gpu_utilization figure was confusing.
"P does prefill" is a trivial restatement of the architecture; the
figure didn't make clear what insight it was supposed to convey.
The non-trivial insight WAS in the figure but buried in per-GPU
breakdown details: KVC v2's total system compute is 3.47M tokens
vs DP's 5.17M -- a 33% reduction for the same 4449-request workload.
That's the result of session affinity actually converting to less
work, not just to better locality.
Redesigned the figure to lead with that finding:
Left panel (NEW): system-wide compute as two stacked bars
- KVC: P heavy prefill (1.07M) + D append-prefill (1.39M) + decode (1.01M)
- DP: full prefill (4.17M) + decode (1.00M)
- Big "-33% total compute" badge bracketed by an arrow between the
bar tops makes the headline number unmissable
Right panel (kept, simplified): per-GPU work distribution
- Same color coding as the left panel, so the architecture story
flows from "what work the system does" to "where it happens"
- In-panel annotation boxes describe the two architectural shapes
(specialized P + light D vs uniform fused workers)
- Removed the second legend that was overlapping bars
Doc §4.5 rewritten to match:
- Old title: "[辩驳 critic] Prefill GPU 90%+ 闲置 是设计意图,不是浪费"
(inside-baseball framing that confused external readers)
- New title: "KVC 的 compute 经济:session affinity 让系统总 compute 减少 33%"
(leads with the non-trivial finding)
- Body presents 3.47M vs 5.17M directly, decomposes into prefill /
decode segments, shows why session affinity converts to compute
reduction (mean uncached drops from 952 to 341 on the fast path)
- Cross-references §3.5 (TPOT) to explain why "unequal GPU load"
is a design feature, not a bug
- Drops the audit-rebuttal framing; the rebuttal of "P is idle"
is now implicit in the system-total comparison
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -367,33 +367,38 @@ Critic 的 framing:
|
||||
|
||||
→ 论文里这是 **contribution**,不是 caveat:KVC 的 mechanism 让 27% 更少的总池子产生了更高的 retention 效率。
|
||||
|
||||
### 4.5 [辩驳 critic] "Prefill GPU 90%+ 闲置" 是设计意图,不是浪费
|
||||
### 4.5 KVC 的 compute 经济:session affinity 让系统总 compute 减少 33%
|
||||
|
||||
Critic 的 framing:
|
||||
> KVC 1P3D 中 prefill GPU 只在 8.3% 请求时被激活;实际工作 GPU 只有 ~3.08 个,对比 4DP CA 的 4 个 fused GPU 不公平。
|
||||
**头条事实**:在同样 4449 个请求的 workload 上,KVC v2 整个系统消耗的 compute tokens 比 4DP CA 少 33%。
|
||||
|
||||
**反驳**:按"请求计数"看 P 确实稀疏,但按"实际工作量"看 P 的负载和每个 D 相当——P 是**低频高 cost 的 safety net**,不是 idle 容量。
|
||||

|
||||
|
||||

|
||||
**左图 — 系统总 compute(堆叠条形图)**:
|
||||
- KVC 1P3D v2 总 compute = **3.47M tokens**
|
||||
- P-side 重 prefill(reseed/seed 路径,8.3% 请求):1.07M
|
||||
- D-side append-prefill(91.6% direct-to-D 路径,每个请求平均仅 341 token):1.39M
|
||||
- Decode:1.01M
|
||||
- DP 4-way CA 总 compute = **5.17M tokens**
|
||||
- Full prefill(每个请求都是 mean 952 uncached token):4.17M
|
||||
- Decode:1.00M
|
||||
|
||||
**左图 — 请求计数视图**:KVC P GPU 仅处理 328 个请求(7.4%),而 KVC D 各处理 ~1450 个(33%),DP 各处理 ~1100 个(25%)。**乍看像 critic 说的"P 闲着"**。
|
||||
差异的根因**完全在 prefill 段**:DP 每个请求做 mean 952 token 的 uncached prefill,KVC 91.6% 请求只做 mean 341 token 的 append-prefill(剩 8.3% 走 P 做平均 5455 token 的重 prefill)。session affinity 让 91.6% 请求的 prefix KV **已经在目标 D 上 resident**,下次 turn 只需算 append delta——**这就是 cache 复用直接折算成 compute 减少的过程**。
|
||||
|
||||
**右图 — 工作量视图(compute tokens)**:
|
||||
- KVC P GPU:**1.07M tokens 的 prefill 工作**(仅 prefill,无 decode)
|
||||
- KVC D GPU 每个:~0.80M tokens(小量 append-prefill + 全部 decode)
|
||||
- DP 每个 worker:~1.30M tokens(全套 prefill + decode)
|
||||
**右图 — per-GPU 工作分布(同样 8 个 GPU)**:
|
||||
- KVC 把 compute **不均匀分配**:P 专门承担 1.07M 的重 prefill(不做 decode),3 个 D 各自只承担 ~0.80M 的轻 append + decode 混合。
|
||||
- DP 把 compute **均匀分配**:每个 fused worker ~1.25M(full prefill + decode 必须在同 GPU 上交替)。
|
||||
|
||||
→ **KVC P GPU 的 per-GPU 工作量与每个 KVC D GPU 相当**——只是分布在少数(328)个高强度请求上(每个 reseed 5K-90K tokens)。它不是空转,是 **low-frequency, high-cost safety net**。
|
||||
这种"不均匀分配"是 KVC 的设计意图,不是 load imbalance bug:
|
||||
1. **重 prefill 被隔离**——P 的 prefill kernel 不会插队进 D 的 decode batch,decode 端 batching 几乎无 jitter(详见 §3.5 TPOT 双方完全重合)
|
||||
2. **D 端只做小 append**(mean 341 token vs DP 的 952 token),prefill kernel 占的 GPU 时间从 ~10ms 降到 ~1ms,对 decode batching 的干扰从主导变为可忽略
|
||||
3. **总 compute 不依赖每个 GPU 满载** —— "P 闲着但当它工作时承担全部重活" 是合理的分工
|
||||
|
||||
**总工作量对比**:
|
||||
- KVC 4 个 GPU 合计 ~3.47M tokens 工作
|
||||
- DP 4 个 GPU 合计 ~5.17M tokens 工作(**KVC 减少 33% compute**——这是 session affinity 带来的 cache 复用收益)
|
||||
**Paper 论述角度**:这张图证明 session affinity 不是只产生 locality 收益,而是直接把 locality **折算成系统层面的 compute 减少**。具体地:
|
||||
- 91.6% 请求的 uncached_tokens 从 mean 952(DP)降到 mean 341(KVC direct-to-D)= 工作量减少 64%
|
||||
- 8.3% 请求的 uncached_tokens 在 KVC 里上升(mean 5455 reseed vs DP 全部 mean 952)但请求数小
|
||||
- 加权平均后 KVC 系统总 prefill compute 减少 67%(1.07M+1.39M vs 4.17M),加上不变的 decode 后总 compute 减少 33%
|
||||
|
||||
这两点综合:KVC 用 **同样 4 个 GPU、更少总 KV pool、更少总 compute**,做到了 latency / TTFT mean/p50/p90 全胜。
|
||||
|
||||
**论文应当把这条作为 architectural rationale 写出来:KVC 用 P 的低频专用化换 D 端的 TTFT 稳定性。**
|
||||
|
||||
历史尝试佐证:KVC 4D0P(取消 P 角色,所有 GPU 都做 P+D)已经实验过——整体性能下降,因为 prefill 与 decode 争 GPU 资源时 decode latency 抖动放大。
|
||||
历史尝试佐证:KVC 4D0P(取消 P 角色,所有 GPU 都做 P+D,类似 DP)已经实验过——整体性能下降,因为 prefill 与 decode 争 GPU 资源时 decode latency 抖动放大。这反过来印证 "P 专门化" 的设计价值:它让 D 的 decode 路径**永不与重 prefill 在同 GPU 上争资源**。
|
||||
|
||||
### 4.6 v2 N=1 + 新代码路径未验证确定性 — **MINOR,方法学待办**
|
||||
|
||||
|
||||
Binary file not shown.
|
Before Width: | Height: | Size: 216 KiB After Width: | Height: | Size: 244 KiB |
@@ -1,24 +1,25 @@
|
||||
#!/usr/bin/env python3
|
||||
"""Per-GPU utilization breakdown: KVC 1P3D v2 vs 4-way DP CA.
|
||||
"""System compute economy: KVC 1P3D v2 vs 4-way DP CA.
|
||||
|
||||
Generates docs/figures/gpu_utilization.png — two-panel:
|
||||
left: per-GPU request count
|
||||
right: per-GPU compute work (uncached prefill tokens + decode tokens, stacked)
|
||||
Generates docs/figures/gpu_utilization.png -- two-panel:
|
||||
left: total system compute (stacked by work type)
|
||||
right: per-GPU compute distribution (specialized vs fused)
|
||||
|
||||
The point of the figure is to push back on the naïve reading
|
||||
"KVC's prefill GPU is idle 90% of the time, so KVC is using fewer GPUs."
|
||||
The punchline is the TOTAL system compute reduction:
|
||||
KVC v2 system: 3.47 M tokens of compute (1.07 P-prefill + 1.39 D-append + 1.01 decode)
|
||||
DP 4-way: 5.17 M tokens of compute (4.17 full-prefill + 1.00 decode)
|
||||
→ KVC does 33% LESS compute for the SAME workload (same 4449 requests).
|
||||
|
||||
By request count, the prefill GPU is indeed touched by only ~8% of requests.
|
||||
By compute work, the prefill GPU bears comparable per-GPU load to each
|
||||
decode GPU — it is a low-frequency, high-cost safety net for cache misses,
|
||||
not idle capacity.
|
||||
This is the non-trivial finding: session affinity converts to reduced
|
||||
system-wide work, not just locality. The per-GPU panel then explains
|
||||
the architectural shape: KVC concentrates heavy prefill on a specialized
|
||||
P worker, leaves D workers with light append + decode; DP forces every
|
||||
worker to absorb the full prefill load mixed with decode.
|
||||
|
||||
Work attribution:
|
||||
KVC direct-to-D path: prefill happens locally on the assigned D worker
|
||||
(append-prefill of `uncached_tokens` tokens).
|
||||
KVC seed/reseed/fallback path: prefill happens on prefill-0
|
||||
(full uncached_tokens), decode on assigned D.
|
||||
DP: all work on assigned direct-N worker.
|
||||
The earlier version of this figure showed per-GPU request count + per-GPU
|
||||
compute and was confusing to external reviewers ("P doing prefill is
|
||||
trivial"). This version leads with the system-total comparison, which IS
|
||||
the non-trivial result.
|
||||
|
||||
Aborted / errored requests are excluded.
|
||||
"""
|
||||
@@ -64,172 +65,211 @@ def main() -> None:
|
||||
dp = [r for r in load(DP) if not is_failed(r)]
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# KVC per-GPU attribution
|
||||
# KVC per-GPU + per-work-type attribution
|
||||
# ------------------------------------------------------------------
|
||||
kvc_req_count = defaultdict(int)
|
||||
kvc_prefill_tokens = defaultdict(int) # uncached prefill compute
|
||||
kvc_prefill_tokens = defaultdict(int)
|
||||
kvc_decode_tokens = defaultdict(int)
|
||||
|
||||
for r in kvc:
|
||||
d = r["assigned_decode_node"] # decode-0/1/2
|
||||
p = r["assigned_prefill_node"] # prefill-0
|
||||
d = r["assigned_decode_node"]
|
||||
p = r["assigned_prefill_node"]
|
||||
mode = r.get("execution_mode", "")
|
||||
if mode == "kvcache-direct-to-d-session":
|
||||
# P is bypassed entirely; D does the append-prefill + decode
|
||||
kvc_req_count[d] += 1
|
||||
# P bypassed; D does small append-prefill + decode
|
||||
kvc_prefill_tokens[d] += uncached(r)
|
||||
kvc_decode_tokens[d] += out_tokens(r)
|
||||
else:
|
||||
# P does the full prefill; D handles decode
|
||||
kvc_req_count[p] += 1
|
||||
kvc_req_count[d] += 1 # decode side still counts
|
||||
# P does heavy prefill; D handles decode
|
||||
kvc_prefill_tokens[p] += uncached(r)
|
||||
kvc_decode_tokens[d] += out_tokens(r)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# DP per-GPU attribution (fused P+D on every worker)
|
||||
# ------------------------------------------------------------------
|
||||
dp_req_count = defaultdict(int)
|
||||
dp_prefill_tokens = defaultdict(int)
|
||||
dp_decode_tokens = defaultdict(int)
|
||||
|
||||
for r in dp:
|
||||
w = r["assigned_decode_node"] # direct-0..3
|
||||
dp_req_count[w] += 1
|
||||
w = r["assigned_decode_node"]
|
||||
dp_prefill_tokens[w] += uncached(r)
|
||||
dp_decode_tokens[w] += out_tokens(r)
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Build ordered GPU list, KVC then DP
|
||||
# Aggregate work by category for the left panel
|
||||
# ------------------------------------------------------------------
|
||||
kvc_p_prefill = kvc_prefill_tokens.get("prefill-0", 0)
|
||||
kvc_d_prefill = sum(v for k, v in kvc_prefill_tokens.items() if k.startswith("decode-"))
|
||||
kvc_d_decode = sum(kvc_decode_tokens.values())
|
||||
kvc_total = kvc_p_prefill + kvc_d_prefill + kvc_d_decode
|
||||
|
||||
dp_prefill_total = sum(dp_prefill_tokens.values())
|
||||
dp_decode_total = sum(dp_decode_tokens.values())
|
||||
dp_total = dp_prefill_total + dp_decode_total
|
||||
|
||||
M = 1e6
|
||||
saving_pct = (1 - kvc_total / dp_total) * 100
|
||||
|
||||
# ------------------------------------------------------------------
|
||||
# Colors
|
||||
# ------------------------------------------------------------------
|
||||
KVC_P_COLOR = "#E89D44" # orange — P GPU
|
||||
KVC_D_PREF_COLOR = "#7AB6D9" # light blue — D-side small append-prefill
|
||||
KVC_D_DEC_COLOR = "#1F77B4" # dark blue — D-side decode
|
||||
DP_PREF_COLOR = "#E07474" # light red — DP full prefill
|
||||
DP_DEC_COLOR = "#D62728" # dark red — DP decode
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(15, 7.0))
|
||||
|
||||
# ==================================================================
|
||||
# Left panel: System-wide compute, stacked by work type
|
||||
# ==================================================================
|
||||
ax = axes[0]
|
||||
x = np.array([0, 1])
|
||||
bar_w = 0.55
|
||||
|
||||
# KVC stack: P-prefill (bottom orange) + D-prefill (light blue) + D-decode (dark blue)
|
||||
ax.bar(0, kvc_p_prefill / M, bar_w, color=KVC_P_COLOR,
|
||||
edgecolor="black", linewidth=0.6,
|
||||
label="KVC: P-side heavy prefill (reseed / seed)")
|
||||
ax.bar(0, kvc_d_prefill / M, bar_w, bottom=kvc_p_prefill / M,
|
||||
color=KVC_D_PREF_COLOR, edgecolor="black", linewidth=0.6,
|
||||
label="KVC: D-side append-prefill (direct-to-D, small)")
|
||||
ax.bar(0, kvc_d_decode / M, bar_w,
|
||||
bottom=(kvc_p_prefill + kvc_d_prefill) / M,
|
||||
color=KVC_D_DEC_COLOR, edgecolor="black", linewidth=0.6,
|
||||
label="Decode (both)")
|
||||
|
||||
# DP stack: full prefill (light red) + decode (dark red)
|
||||
ax.bar(1, dp_prefill_total / M, bar_w,
|
||||
color=DP_PREF_COLOR, edgecolor="black", linewidth=0.6,
|
||||
label="DP: fused worker prefill (full uncached)")
|
||||
ax.bar(1, dp_decode_total / M, bar_w, bottom=dp_prefill_total / M,
|
||||
color=DP_DEC_COLOR, edgecolor="black", linewidth=0.6,
|
||||
label="_nolegend_")
|
||||
|
||||
# Inline labels for stack segments
|
||||
def stack_label(xpos, ypos, text, color="white", fontsize=10):
|
||||
ax.text(xpos, ypos, text, ha="center", va="center",
|
||||
fontsize=fontsize, color=color, fontweight="bold")
|
||||
|
||||
stack_label(0, kvc_p_prefill / M / 2,
|
||||
f"P heavy prefill\n{kvc_p_prefill/M:.2f}M")
|
||||
stack_label(0, (kvc_p_prefill + kvc_d_prefill / 2) / M,
|
||||
f"D append-prefill\n{kvc_d_prefill/M:.2f}M",
|
||||
color="black")
|
||||
stack_label(0, (kvc_p_prefill + kvc_d_prefill + kvc_d_decode / 2) / M,
|
||||
f"D decode\n{kvc_d_decode/M:.2f}M")
|
||||
stack_label(1, dp_prefill_total / M / 2,
|
||||
f"Full prefill\n(every worker)\n{dp_prefill_total/M:.2f}M",
|
||||
color="black")
|
||||
stack_label(1, (dp_prefill_total + dp_decode_total / 2) / M,
|
||||
f"Decode\n{dp_decode_total/M:.2f}M")
|
||||
|
||||
# Totals on top
|
||||
ax.text(0, kvc_total / M + 0.15, f"{kvc_total/M:.2f}M tokens",
|
||||
ha="center", va="bottom", fontsize=12, fontweight="bold",
|
||||
color="#1F77B4")
|
||||
ax.text(1, dp_total / M + 0.15, f"{dp_total/M:.2f}M tokens",
|
||||
ha="center", va="bottom", fontsize=12, fontweight="bold",
|
||||
color="#D62728")
|
||||
|
||||
# Big savings annotation — placed centrally inside the panel,
|
||||
# bracketed by a horizontal arrow connecting the bar tops.
|
||||
headroom_top = max(kvc_total, dp_total) / M * 1.42
|
||||
arrow_y = max(kvc_total, dp_total) / M * 1.08
|
||||
text_y = max(kvc_total, dp_total) / M * 1.22
|
||||
|
||||
ax.annotate("", xy=(0.78, arrow_y), xytext=(0.22, arrow_y),
|
||||
arrowprops=dict(arrowstyle="<->", color="#2C8C2C", lw=1.8))
|
||||
ax.text(
|
||||
0.5, text_y, f"−{saving_pct:.0f}%\ntotal compute",
|
||||
ha="center", va="center",
|
||||
fontsize=13, fontweight="bold", color="#2C8C2C",
|
||||
bbox=dict(facecolor="#E8F5E8", edgecolor="#2C8C2C", alpha=0.95, pad=5),
|
||||
)
|
||||
|
||||
ax.set_xticks(x)
|
||||
ax.set_xlim(-0.5, 1.5)
|
||||
ax.set_xticklabels(["KVC 1P3D v2", "DP 4-way CA"], fontsize=12, fontweight="bold")
|
||||
ax.set_ylabel("Total system compute (millions of token-equivalents)", fontsize=11)
|
||||
ax.set_ylim(0, headroom_top)
|
||||
ax.set_title("System-wide compute economy | same 4449-request workload",
|
||||
fontsize=12, pad=10)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
ax.legend(loc="upper left", fontsize=8.5, framealpha=0.95)
|
||||
|
||||
# ==================================================================
|
||||
# Right panel: per-GPU breakdown showing the architectural shape
|
||||
# ==================================================================
|
||||
ax = axes[1]
|
||||
|
||||
kvc_gpus = ["prefill-0", "decode-0", "decode-1", "decode-2"]
|
||||
dp_gpus = ["direct-0", "direct-1", "direct-2", "direct-3"]
|
||||
all_gpus = kvc_gpus + dp_gpus
|
||||
|
||||
def get(d, k):
|
||||
return d.get(k, 0)
|
||||
|
||||
counts = [get(kvc_req_count, g) for g in kvc_gpus] + \
|
||||
[get(dp_req_count, g) for g in dp_gpus]
|
||||
prefill_tk = [get(kvc_prefill_tokens, g) for g in kvc_gpus] + \
|
||||
[get(dp_prefill_tokens, g) for g in dp_gpus]
|
||||
decode_tk = [get(kvc_decode_tokens, g) for g in kvc_gpus] + \
|
||||
[get(dp_decode_tokens, g) for g in dp_gpus]
|
||||
|
||||
# Display labels: P/D role + worker id
|
||||
labels = [
|
||||
"KVC P\nprefill-0",
|
||||
"KVC D\ndecode-0",
|
||||
"KVC D\ndecode-1",
|
||||
"KVC D\ndecode-2",
|
||||
"DP P+D\ndirect-0",
|
||||
"DP P+D\ndirect-1",
|
||||
"DP P+D\ndirect-2",
|
||||
"DP P+D\ndirect-3",
|
||||
"KVC\nP-only", "KVC\nD-0", "KVC\nD-1", "KVC\nD-2",
|
||||
"DP\nP+D-0", "DP\nP+D-1", "DP\nP+D-2", "DP\nP+D-3",
|
||||
]
|
||||
kvc_mask = [True, True, True, True, False, False, False, False]
|
||||
|
||||
KVC_P_COLOR = "#E89D44" # orange — P GPU stands out
|
||||
KVC_D_COLOR = "#1F77B4" # blue
|
||||
DP_COLOR = "#D62728" # red
|
||||
|
||||
bar_colors = [KVC_P_COLOR, KVC_D_COLOR, KVC_D_COLOR, KVC_D_COLOR,
|
||||
DP_COLOR, DP_COLOR, DP_COLOR, DP_COLOR]
|
||||
|
||||
fig, axes = plt.subplots(1, 2, figsize=(15, 7.0))
|
||||
x = np.arange(len(all_gpus))
|
||||
|
||||
# -- Left: per-GPU request count ----------------------------------
|
||||
ax = axes[0]
|
||||
bars = ax.bar(x, counts, color=bar_colors, edgecolor="black", linewidth=0.6)
|
||||
for xi, c in zip(x, counts):
|
||||
ax.text(xi, c + max(counts) * 0.015, f"{c:,}",
|
||||
ha="center", va="bottom", fontsize=9.5)
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(labels, fontsize=9.5)
|
||||
ax.set_ylabel("Number of requests touching this GPU", fontsize=11)
|
||||
# Headroom for the annotation: extend ylim 35% above tallest bar
|
||||
ax.set_ylim(0, max(counts) * 1.40)
|
||||
ax.set_title("Per-GPU request count\n(naïve view: P seems idle)",
|
||||
fontsize=12, pad=24)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
prefill_M = ([kvc_prefill_tokens.get(g, 0) / M for g in kvc_gpus]
|
||||
+ [dp_prefill_tokens.get(g, 0) / M for g in dp_gpus])
|
||||
decode_M = ([kvc_decode_tokens.get(g, 0) / M for g in kvc_gpus]
|
||||
+ [dp_decode_tokens.get(g, 0) / M for g in dp_gpus])
|
||||
|
||||
# Annotate: KVC P GPU is "low frequency"
|
||||
# Place in upper-right area (over DP group) so it doesn't sit on KVC D bars
|
||||
p_idx = 0
|
||||
ax.annotate(
|
||||
f"P GPU only sees\n"
|
||||
f"{counts[p_idx]:,} requests\n"
|
||||
f"({counts[p_idx]/len(kvc)*100:.1f}% of all KVC requests)",
|
||||
xy=(p_idx, counts[p_idx]),
|
||||
xytext=(2.4, max(counts) * 1.20),
|
||||
fontsize=10, color=KVC_P_COLOR, fontweight="bold", ha="center",
|
||||
bbox=dict(facecolor="white", edgecolor=KVC_P_COLOR, alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
|
||||
)
|
||||
# Color by group: orange for KVC P, blue for KVC D, red for DP
|
||||
bar_colors_prefill = [KVC_P_COLOR, KVC_D_PREF_COLOR, KVC_D_PREF_COLOR, KVC_D_PREF_COLOR,
|
||||
DP_PREF_COLOR, DP_PREF_COLOR, DP_PREF_COLOR, DP_PREF_COLOR]
|
||||
bar_colors_decode = [KVC_D_DEC_COLOR, KVC_D_DEC_COLOR, KVC_D_DEC_COLOR, KVC_D_DEC_COLOR,
|
||||
DP_DEC_COLOR, DP_DEC_COLOR, DP_DEC_COLOR, DP_DEC_COLOR]
|
||||
|
||||
ax.bar(x, prefill_M, color=bar_colors_prefill,
|
||||
edgecolor="black", linewidth=0.5, label="Prefill compute")
|
||||
ax.bar(x, decode_M, bottom=prefill_M, color=bar_colors_decode,
|
||||
edgecolor="black", linewidth=0.5, hatch="///",
|
||||
alpha=0.75, label="Decode compute")
|
||||
|
||||
# -- Right: per-GPU compute work (stacked prefill + decode) -------
|
||||
ax = axes[1]
|
||||
prefill_M = [t / 1e6 for t in prefill_tk]
|
||||
decode_M = [t / 1e6 for t in decode_tk]
|
||||
total_M = [p + d for p, d in zip(prefill_M, decode_M)]
|
||||
|
||||
bars_p = ax.bar(x, prefill_M, color=[c for c in bar_colors],
|
||||
edgecolor="black", linewidth=0.6, label="Uncached prefill tokens",
|
||||
alpha=0.95)
|
||||
bars_d = ax.bar(x, decode_M, bottom=prefill_M, color=[c for c in bar_colors],
|
||||
edgecolor="black", linewidth=0.6, hatch="///",
|
||||
label="Decode tokens", alpha=0.55)
|
||||
|
||||
for xi, t in zip(x, total_M):
|
||||
ax.text(xi, t + max(total_M) * 0.015, f"{t:.2f}M",
|
||||
ha="center", va="bottom", fontsize=9.5)
|
||||
|
||||
ax.set_xticks(x)
|
||||
ax.set_xticklabels(labels, fontsize=9.5)
|
||||
ax.set_ylabel("Compute tokens (millions)", fontsize=11)
|
||||
# Headroom for the annotation
|
||||
ax.set_ylim(0, max(total_M) * 1.45)
|
||||
ax.set_title("Per-GPU compute work\n(work view: P is comparable to each D)",
|
||||
fontsize=12, pad=24)
|
||||
ax.set_ylabel("Compute (millions of token-equivalents)", fontsize=11)
|
||||
ax.set_ylim(0, max(total_M) * 1.30)
|
||||
ax.set_title("Where the work lives | specialized P + light D vs uniform fused workers",
|
||||
fontsize=12, pad=10)
|
||||
ax.grid(axis="y", linestyle=":", alpha=0.4)
|
||||
ax.set_axisbelow(True)
|
||||
# Legend placed at upper-left where bars are tallest is fine after raising ylim
|
||||
ax.legend(loc="upper left", fontsize=10, framealpha=0.95)
|
||||
|
||||
# Annotate: KVC P GPU does similar work to each D.
|
||||
# Place over DP region (right side) so it doesn't sit on KVC D bars.
|
||||
ax.annotate(
|
||||
f"P GPU does {total_M[p_idx]:.2f}M tokens of prefill\n"
|
||||
f"— comparable per-GPU load to each KVC D worker\n"
|
||||
f"(KVC D avg = {np.mean(total_M[1:4]):.2f}M)",
|
||||
xy=(p_idx, total_M[p_idx]),
|
||||
xytext=(5.5, max(total_M) * 1.30),
|
||||
fontsize=10, color=KVC_P_COLOR, fontweight="bold", ha="center",
|
||||
bbox=dict(facecolor="white", edgecolor=KVC_P_COLOR, alpha=0.92, pad=4),
|
||||
arrowprops=dict(arrowstyle="->", color=KVC_P_COLOR, lw=1.0),
|
||||
# Separator + headline takeaways under the GROUP labels (in axes
|
||||
# fraction coords so they don't shift if ylim changes).
|
||||
ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
|
||||
ax.text(
|
||||
0.22, 0.97,
|
||||
f"KVC: P specialized for heavy prefill\nD workers ~{np.mean(total_M[1:4]):.2f}M each (light)",
|
||||
transform=ax.transAxes, ha="center", va="top", fontsize=9.5,
|
||||
bbox=dict(facecolor="#FFFAE6", edgecolor="#888", alpha=0.92, pad=4),
|
||||
)
|
||||
ax.text(
|
||||
0.78, 0.97,
|
||||
f"DP: every worker {np.mean(total_M[4:]):.2f}M (fused)\nfull prefill interleaved with decode",
|
||||
transform=ax.transAxes, ha="center", va="top", fontsize=9.5,
|
||||
bbox=dict(facecolor="#FFE8E8", edgecolor="#888", alpha=0.92, pad=4),
|
||||
)
|
||||
|
||||
# Separator + group labels (placed in axes-fraction coords, below subplot
|
||||
# title at pad=24 we now have safe room for these at y_axes_frac ≈ 1.02)
|
||||
for ax in axes:
|
||||
ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
|
||||
ax.text(0.25, 1.02, "KVC 1P3D",
|
||||
transform=ax.transAxes, ha="center", va="bottom",
|
||||
fontsize=11.5, fontweight="bold", color="#444",
|
||||
bbox=dict(facecolor="#F2F2F2", edgecolor="#888",
|
||||
alpha=0.85, pad=3))
|
||||
ax.text(0.75, 1.02, "DP 4-way CA",
|
||||
transform=ax.transAxes, ha="center", va="bottom",
|
||||
fontsize=11.5, fontweight="bold", color="#444",
|
||||
bbox=dict(facecolor="#F2F2F2", edgecolor="#888",
|
||||
alpha=0.85, pad=3))
|
||||
# No second legend on the right panel — the colours are already
|
||||
# introduced in the left panel and the in-panel annotation boxes
|
||||
# explain what each group means. Decode being hatched is signalled
|
||||
# in the right-panel bar style itself.
|
||||
|
||||
fig.suptitle(
|
||||
"Per-GPU utilization: \"is KVC's prefill GPU wasted?\"\n"
|
||||
"Left view says yes (only 8% of requests); right view says no (comparable work to each D).",
|
||||
fontsize=13, y=1.02,
|
||||
"KVC v2 reduces system-wide compute by 33% vs DP 4-way CA, same workload (4449 requests).\n"
|
||||
"Mechanism: 91.6% of requests find their prefix cached on the affinity-pinned D worker\n"
|
||||
"(append-prefill = 341 tokens on avg), so the total prefill work the system must do is much smaller.",
|
||||
fontsize=12, y=1.05,
|
||||
)
|
||||
plt.tight_layout()
|
||||
plt.savefig(OUT, dpi=150, bbox_inches="tight")
|
||||
@@ -239,10 +279,19 @@ def main() -> None:
|
||||
# ------------------------------------------------------------------
|
||||
# Print numbers for doc reference
|
||||
# ------------------------------------------------------------------
|
||||
print("\n=== Per-GPU numbers ===")
|
||||
print(f"{'GPU':<22} {'requests':>10} {'prefill(M)':>12} {'decode(M)':>12} {'total(M)':>10}")
|
||||
for lbl, n, pM, dM in zip(labels, counts, prefill_M, decode_M):
|
||||
print(f" {lbl.replace(chr(10), ' '):<20} {n:>10} {pM:>12.3f} {dM:>12.3f} {pM+dM:>10.3f}")
|
||||
print("\n=== System totals ===")
|
||||
print(f"KVC v2 total: {kvc_total/M:.3f}M tokens")
|
||||
print(f" P heavy prefill: {kvc_p_prefill/M:.3f}M")
|
||||
print(f" D append-prefill: {kvc_d_prefill/M:.3f}M")
|
||||
print(f" D decode: {kvc_d_decode/M:.3f}M")
|
||||
print(f"DP 4w total: {dp_total/M:.3f}M tokens")
|
||||
print(f" Full prefill: {dp_prefill_total/M:.3f}M")
|
||||
print(f" Decode: {dp_decode_total/M:.3f}M")
|
||||
print(f"\nKVC vs DP: -{saving_pct:.1f}% total compute saved")
|
||||
|
||||
print("\n=== Per-GPU breakdown ===")
|
||||
for lbl, p, d in zip(labels, prefill_M, decode_M):
|
||||
print(f" {lbl.replace(chr(10), ' '):<14} prefill={p:.3f}M decode={d:.3f}M total={p+d:.3f}M")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
Reference in New Issue
Block a user