Files
agentic-pd-hybrid/scripts/analysis/plot_gpu_utilization.py
kzlin 314c4cda0e docs(kvc): redesign gpu_utilization figure to lead with system-total compute
Reviewer feedback: the original gpu_utilization figure was confusing.
"P does prefill" is a trivial restatement of the architecture; the
figure didn't make clear what insight it was supposed to convey.

The non-trivial insight WAS in the figure but buried in per-GPU
breakdown details: KVC v2's total system compute is 3.47M tokens
vs DP's 5.17M -- a 33% reduction for the same 4449-request workload.
That's the result of session affinity actually converting to less
work, not just to better locality.

Redesigned the figure to lead with that finding:

Left panel (NEW): system-wide compute as two stacked bars
  - KVC: P heavy prefill (1.07M) + D append-prefill (1.39M) + decode (1.01M)
  - DP:  full prefill (4.17M) + decode (1.00M)
  - Big "-33% total compute" badge bracketed by an arrow between the
    bar tops makes the headline number unmissable

Right panel (kept, simplified): per-GPU work distribution
  - Same color coding as the left panel, so the architecture story
    flows from "what work the system does" to "where it happens"
  - In-panel annotation boxes describe the two architectural shapes
    (specialized P + light D vs uniform fused workers)
  - Removed the second legend that was overlapping bars

Doc §4.5 rewritten to match:
  - Old title: "[辩驳 critic] Prefill GPU 90%+ 闲置 是设计意图,不是浪费"
    (inside-baseball framing that confused external readers)
  - New title: "KVC 的 compute 经济:session affinity 让系统总 compute 减少 33%"
    (leads with the non-trivial finding)
  - Body presents 3.47M vs 5.17M directly, decomposes into prefill /
    decode segments, shows why session affinity converts to compute
    reduction (mean uncached drops from 952 to 341 on the fast path)
  - Cross-references §3.5 (TPOT) to explain why "unequal GPU load"
    is a design feature, not a bug
  - Drops the audit-rebuttal framing; the rebuttal of "P is idle"
    is now implicit in the system-total comparison

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-13 10:39:15 +08:00

299 lines
13 KiB
Python
Raw Blame History

This file contains ambiguous Unicode characters

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

#!/usr/bin/env python3
"""System compute economy: KVC 1P3D v2 vs 4-way DP CA.
Generates docs/figures/gpu_utilization.png -- two-panel:
left: total system compute (stacked by work type)
right: per-GPU compute distribution (specialized vs fused)
The punchline is the TOTAL system compute reduction:
KVC v2 system: 3.47 M tokens of compute (1.07 P-prefill + 1.39 D-append + 1.01 decode)
DP 4-way: 5.17 M tokens of compute (4.17 full-prefill + 1.00 decode)
→ KVC does 33% LESS compute for the SAME workload (same 4449 requests).
This is the non-trivial finding: session affinity converts to reduced
system-wide work, not just locality. The per-GPU panel then explains
the architectural shape: KVC concentrates heavy prefill on a specialized
P worker, leaves D workers with light append + decode; DP forces every
worker to absorb the full prefill load mixed with decode.
The earlier version of this figure showed per-GPU request count + per-GPU
compute and was confusing to external reviewers ("P doing prefill is
trivial"). This version leads with the system-total comparison, which IS
the non-trivial result.
Aborted / errored requests are excluded.
"""
from __future__ import annotations
import json
from collections import defaultdict
from pathlib import Path
import matplotlib.pyplot as plt
import numpy as np
ROOT = Path(__file__).resolve().parents[2]
KVC = ROOT / "outputs/qwen3-30b-tp1-ts1-migration-v2/kvc_1p3d_migration_v2_run1_metrics.jsonl"
DP = ROOT / "outputs/qwen3-30b-tp1-ts1-validation/dp4_metrics.jsonl"
OUT = ROOT / "docs/figures/gpu_utilization.png"
def load(p: Path) -> list[dict]:
return [json.loads(line) for line in p.open()]
def is_failed(r: dict) -> bool:
if r.get("error"):
return True
fr = r.get("finish_reason")
if fr and ("abort" in str(fr).lower() or "badrequest" in str(fr).lower()):
return True
return False
def uncached(r: dict) -> int:
return max(0, r["input_length"] - r.get("cached_tokens", 0))
def out_tokens(r: dict) -> int:
return r.get("actual_output_tokens") or r.get("output_length") or 0
def main() -> None:
kvc = [r for r in load(KVC) if not is_failed(r)]
dp = [r for r in load(DP) if not is_failed(r)]
# ------------------------------------------------------------------
# KVC per-GPU + per-work-type attribution
# ------------------------------------------------------------------
kvc_prefill_tokens = defaultdict(int)
kvc_decode_tokens = defaultdict(int)
for r in kvc:
d = r["assigned_decode_node"]
p = r["assigned_prefill_node"]
mode = r.get("execution_mode", "")
if mode == "kvcache-direct-to-d-session":
# P bypassed; D does small append-prefill + decode
kvc_prefill_tokens[d] += uncached(r)
kvc_decode_tokens[d] += out_tokens(r)
else:
# P does heavy prefill; D handles decode
kvc_prefill_tokens[p] += uncached(r)
kvc_decode_tokens[d] += out_tokens(r)
# ------------------------------------------------------------------
# DP per-GPU attribution (fused P+D on every worker)
# ------------------------------------------------------------------
dp_prefill_tokens = defaultdict(int)
dp_decode_tokens = defaultdict(int)
for r in dp:
w = r["assigned_decode_node"]
dp_prefill_tokens[w] += uncached(r)
dp_decode_tokens[w] += out_tokens(r)
# ------------------------------------------------------------------
# Aggregate work by category for the left panel
# ------------------------------------------------------------------
kvc_p_prefill = kvc_prefill_tokens.get("prefill-0", 0)
kvc_d_prefill = sum(v for k, v in kvc_prefill_tokens.items() if k.startswith("decode-"))
kvc_d_decode = sum(kvc_decode_tokens.values())
kvc_total = kvc_p_prefill + kvc_d_prefill + kvc_d_decode
dp_prefill_total = sum(dp_prefill_tokens.values())
dp_decode_total = sum(dp_decode_tokens.values())
dp_total = dp_prefill_total + dp_decode_total
M = 1e6
saving_pct = (1 - kvc_total / dp_total) * 100
# ------------------------------------------------------------------
# Colors
# ------------------------------------------------------------------
KVC_P_COLOR = "#E89D44" # orange — P GPU
KVC_D_PREF_COLOR = "#7AB6D9" # light blue — D-side small append-prefill
KVC_D_DEC_COLOR = "#1F77B4" # dark blue — D-side decode
DP_PREF_COLOR = "#E07474" # light red — DP full prefill
DP_DEC_COLOR = "#D62728" # dark red — DP decode
fig, axes = plt.subplots(1, 2, figsize=(15, 7.0))
# ==================================================================
# Left panel: System-wide compute, stacked by work type
# ==================================================================
ax = axes[0]
x = np.array([0, 1])
bar_w = 0.55
# KVC stack: P-prefill (bottom orange) + D-prefill (light blue) + D-decode (dark blue)
ax.bar(0, kvc_p_prefill / M, bar_w, color=KVC_P_COLOR,
edgecolor="black", linewidth=0.6,
label="KVC: P-side heavy prefill (reseed / seed)")
ax.bar(0, kvc_d_prefill / M, bar_w, bottom=kvc_p_prefill / M,
color=KVC_D_PREF_COLOR, edgecolor="black", linewidth=0.6,
label="KVC: D-side append-prefill (direct-to-D, small)")
ax.bar(0, kvc_d_decode / M, bar_w,
bottom=(kvc_p_prefill + kvc_d_prefill) / M,
color=KVC_D_DEC_COLOR, edgecolor="black", linewidth=0.6,
label="Decode (both)")
# DP stack: full prefill (light red) + decode (dark red)
ax.bar(1, dp_prefill_total / M, bar_w,
color=DP_PREF_COLOR, edgecolor="black", linewidth=0.6,
label="DP: fused worker prefill (full uncached)")
ax.bar(1, dp_decode_total / M, bar_w, bottom=dp_prefill_total / M,
color=DP_DEC_COLOR, edgecolor="black", linewidth=0.6,
label="_nolegend_")
# Inline labels for stack segments
def stack_label(xpos, ypos, text, color="white", fontsize=10):
ax.text(xpos, ypos, text, ha="center", va="center",
fontsize=fontsize, color=color, fontweight="bold")
stack_label(0, kvc_p_prefill / M / 2,
f"P heavy prefill\n{kvc_p_prefill/M:.2f}M")
stack_label(0, (kvc_p_prefill + kvc_d_prefill / 2) / M,
f"D append-prefill\n{kvc_d_prefill/M:.2f}M",
color="black")
stack_label(0, (kvc_p_prefill + kvc_d_prefill + kvc_d_decode / 2) / M,
f"D decode\n{kvc_d_decode/M:.2f}M")
stack_label(1, dp_prefill_total / M / 2,
f"Full prefill\n(every worker)\n{dp_prefill_total/M:.2f}M",
color="black")
stack_label(1, (dp_prefill_total + dp_decode_total / 2) / M,
f"Decode\n{dp_decode_total/M:.2f}M")
# Totals on top
ax.text(0, kvc_total / M + 0.15, f"{kvc_total/M:.2f}M tokens",
ha="center", va="bottom", fontsize=12, fontweight="bold",
color="#1F77B4")
ax.text(1, dp_total / M + 0.15, f"{dp_total/M:.2f}M tokens",
ha="center", va="bottom", fontsize=12, fontweight="bold",
color="#D62728")
# Big savings annotation — placed centrally inside the panel,
# bracketed by a horizontal arrow connecting the bar tops.
headroom_top = max(kvc_total, dp_total) / M * 1.42
arrow_y = max(kvc_total, dp_total) / M * 1.08
text_y = max(kvc_total, dp_total) / M * 1.22
ax.annotate("", xy=(0.78, arrow_y), xytext=(0.22, arrow_y),
arrowprops=dict(arrowstyle="<->", color="#2C8C2C", lw=1.8))
ax.text(
0.5, text_y, f"{saving_pct:.0f}%\ntotal compute",
ha="center", va="center",
fontsize=13, fontweight="bold", color="#2C8C2C",
bbox=dict(facecolor="#E8F5E8", edgecolor="#2C8C2C", alpha=0.95, pad=5),
)
ax.set_xticks(x)
ax.set_xlim(-0.5, 1.5)
ax.set_xticklabels(["KVC 1P3D v2", "DP 4-way CA"], fontsize=12, fontweight="bold")
ax.set_ylabel("Total system compute (millions of token-equivalents)", fontsize=11)
ax.set_ylim(0, headroom_top)
ax.set_title("System-wide compute economy | same 4449-request workload",
fontsize=12, pad=10)
ax.grid(axis="y", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
ax.legend(loc="upper left", fontsize=8.5, framealpha=0.95)
# ==================================================================
# Right panel: per-GPU breakdown showing the architectural shape
# ==================================================================
ax = axes[1]
kvc_gpus = ["prefill-0", "decode-0", "decode-1", "decode-2"]
dp_gpus = ["direct-0", "direct-1", "direct-2", "direct-3"]
all_gpus = kvc_gpus + dp_gpus
labels = [
"KVC\nP-only", "KVC\nD-0", "KVC\nD-1", "KVC\nD-2",
"DP\nP+D-0", "DP\nP+D-1", "DP\nP+D-2", "DP\nP+D-3",
]
x = np.arange(len(all_gpus))
prefill_M = ([kvc_prefill_tokens.get(g, 0) / M for g in kvc_gpus]
+ [dp_prefill_tokens.get(g, 0) / M for g in dp_gpus])
decode_M = ([kvc_decode_tokens.get(g, 0) / M for g in kvc_gpus]
+ [dp_decode_tokens.get(g, 0) / M for g in dp_gpus])
# Color by group: orange for KVC P, blue for KVC D, red for DP
bar_colors_prefill = [KVC_P_COLOR, KVC_D_PREF_COLOR, KVC_D_PREF_COLOR, KVC_D_PREF_COLOR,
DP_PREF_COLOR, DP_PREF_COLOR, DP_PREF_COLOR, DP_PREF_COLOR]
bar_colors_decode = [KVC_D_DEC_COLOR, KVC_D_DEC_COLOR, KVC_D_DEC_COLOR, KVC_D_DEC_COLOR,
DP_DEC_COLOR, DP_DEC_COLOR, DP_DEC_COLOR, DP_DEC_COLOR]
ax.bar(x, prefill_M, color=bar_colors_prefill,
edgecolor="black", linewidth=0.5, label="Prefill compute")
ax.bar(x, decode_M, bottom=prefill_M, color=bar_colors_decode,
edgecolor="black", linewidth=0.5, hatch="///",
alpha=0.75, label="Decode compute")
total_M = [p + d for p, d in zip(prefill_M, decode_M)]
for xi, t in zip(x, total_M):
ax.text(xi, t + max(total_M) * 0.015, f"{t:.2f}M",
ha="center", va="bottom", fontsize=9.5)
ax.set_xticks(x)
ax.set_xticklabels(labels, fontsize=9.5)
ax.set_ylabel("Compute (millions of token-equivalents)", fontsize=11)
ax.set_ylim(0, max(total_M) * 1.30)
ax.set_title("Where the work lives | specialized P + light D vs uniform fused workers",
fontsize=12, pad=10)
ax.grid(axis="y", linestyle=":", alpha=0.4)
ax.set_axisbelow(True)
# Separator + headline takeaways under the GROUP labels (in axes
# fraction coords so they don't shift if ylim changes).
ax.axvline(3.5, color="gray", linestyle="--", linewidth=1.0, alpha=0.5)
ax.text(
0.22, 0.97,
f"KVC: P specialized for heavy prefill\nD workers ~{np.mean(total_M[1:4]):.2f}M each (light)",
transform=ax.transAxes, ha="center", va="top", fontsize=9.5,
bbox=dict(facecolor="#FFFAE6", edgecolor="#888", alpha=0.92, pad=4),
)
ax.text(
0.78, 0.97,
f"DP: every worker {np.mean(total_M[4:]):.2f}M (fused)\nfull prefill interleaved with decode",
transform=ax.transAxes, ha="center", va="top", fontsize=9.5,
bbox=dict(facecolor="#FFE8E8", edgecolor="#888", alpha=0.92, pad=4),
)
# No second legend on the right panel — the colours are already
# introduced in the left panel and the in-panel annotation boxes
# explain what each group means. Decode being hatched is signalled
# in the right-panel bar style itself.
fig.suptitle(
"KVC v2 reduces system-wide compute by 33% vs DP 4-way CA, same workload (4449 requests).\n"
"Mechanism: 91.6% of requests find their prefix cached on the affinity-pinned D worker\n"
"(append-prefill = 341 tokens on avg), so the total prefill work the system must do is much smaller.",
fontsize=12, y=1.05,
)
plt.tight_layout()
plt.savefig(OUT, dpi=150, bbox_inches="tight")
print(f"wrote {OUT}")
plt.close(fig)
# ------------------------------------------------------------------
# Print numbers for doc reference
# ------------------------------------------------------------------
print("\n=== System totals ===")
print(f"KVC v2 total: {kvc_total/M:.3f}M tokens")
print(f" P heavy prefill: {kvc_p_prefill/M:.3f}M")
print(f" D append-prefill: {kvc_d_prefill/M:.3f}M")
print(f" D decode: {kvc_d_decode/M:.3f}M")
print(f"DP 4w total: {dp_total/M:.3f}M tokens")
print(f" Full prefill: {dp_prefill_total/M:.3f}M")
print(f" Decode: {dp_decode_total/M:.3f}M")
print(f"\nKVC vs DP: -{saving_pct:.1f}% total compute saved")
print("\n=== Per-GPU breakdown ===")
for lbl, p, d in zip(labels, prefill_M, decode_M):
print(f" {lbl.replace(chr(10), ' '):<14} prefill={p:.3f}M decode={d:.3f}M total={p+d:.3f}M")
if __name__ == "__main__":
main()