MB5 analysis: per-role KV split proves static-partition mismatch
aggregate_mb5.py: - Split the cluster KV timeline by role (P-pool vs D-pool) using a PID->role map parsed from vllm_logs filenames. The cluster average hid the result — 6P+2D/4P+4D look ~45% utilized but the decode pool is actually pegged at ~100% while prefill idles at ~30%. - Two-stage reduce/plot: --reduce-to (numpy-only, runs on the serving host over multi-GB snapshot dirs) dumps a compact JSON; --from-reduced (matplotlib) renders locally. matplotlib import is now lazy. - New plot_role_split figure + p/d peak/steady columns in the CSV. PD_DISAGG_RESULTS.md: consolidated writeup with figures inline. Verdict: no static P:D ratio beats 8C colocation. The binding constraint moves with the ratio (D-pool saturates at 6P+2D/4P+4D, P-pool jams at 2P+6D -> 91% request loss); 8C's shared pool stays elastic at 34% steady, 100% completion. PD wins TPOT (10-35x cleaner, the MB1 phase-isolation benefit is real) but loses TTFT and sheds load. Round-robin P routing also zeroes prefix-cache reuse; a session-affinity re-run of 6P+2D is in flight to test the fix. Figures (rep1): mb5_kv_timeline, mb5_role_split, mb5_peak_utilization, mb5_latency_compare + mb5_summary.csv. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
1
analysis/mb5/reduced_20260527_164040_rep1.json
Normal file
1
analysis/mb5/reduced_20260527_164040_rep1.json
Normal file
File diff suppressed because one or more lines are too long
BIN
figs/mb5/mb5_kv_timeline.png
Normal file
BIN
figs/mb5/mb5_kv_timeline.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 176 KiB |
BIN
figs/mb5/mb5_latency_compare.png
Normal file
BIN
figs/mb5/mb5_latency_compare.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 29 KiB |
BIN
figs/mb5/mb5_peak_utilization.png
Normal file
BIN
figs/mb5/mb5_peak_utilization.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 34 KiB |
BIN
figs/mb5/mb5_role_split.png
Normal file
BIN
figs/mb5/mb5_role_split.png
Normal file
Binary file not shown.
|
After Width: | Height: | Size: 375 KiB |
5
figs/mb5/mb5_summary.csv
Normal file
5
figs/mb5/mb5_summary.csv
Normal file
@@ -0,0 +1,5 @@
|
||||
config,rep,n_requests,n_success,wall_clock_s,peak_pool_frac,steady_pool_frac,p_pool_peak_frac,p_pool_steady_frac,d_pool_peak_frac,d_pool_steady_frac,peak_waiting,latency_p50_s,latency_p90_s,latency_p99_s,ttft_p50_s,ttft_p90_s,ttft_p99_s,prefix_cache_hit_ratio
|
||||
8C,1,1214,1214,2994.218414353032,0.7174957362137578,0.3439702956225128,,,,,29,10.82550932947197,83.34998885790122,194.10265863158946,6.967104309005663,53.12018221841427,114.12611859919207,0.1937163528742694
|
||||
6P+2D,1,1214,1214,3419.065942236979,0.7726478112563957,0.42145750426378625,0.743272692817889,0.3082291074474133,0.9959636156907333,0.7434906196702672,128,44.48975181748392,91.82252187062406,147.70196208347772,40.95952733900049,86.68752026481089,142.84028979733685,0.0
|
||||
4P+4D,1,1214,1214,4170.666486939997,0.6997939169982945,0.45876918703808983,0.6438459351904491,0.28540363843092664,0.9753411028993746,0.5977686185332576,152,59.52004547297838,157.08703426021387,224.03997302683115,56.419772224500775,153.07864206891392,219.73412787001706,0.0
|
||||
2P+6D,1,1214,109,5761.816568834998,0.9698692438885731,0.9435119386014781,0.9969869243888573,0.9198408186469585,0.9620238772029562,0.9494504453287853,872,26.293884326005355,499.3484142678091,577.7122636228032,23.580788671970367,498.0334587502061,576.5306194114453,0.0
|
||||
|
227
microbench/fresh_setup/PD_DISAGG_RESULTS.md
Normal file
227
microbench/fresh_setup/PD_DISAGG_RESULTS.md
Normal file
@@ -0,0 +1,227 @@
|
||||
# PD-disaggregation under an agentic workload — does it work?
|
||||
|
||||
**Consolidated results doc.** Self-contained writeup of every PD-disagg
|
||||
argument and experiment, with figures inline. For the live experiment TODO
|
||||
list see [PD_DISAGG_INVESTIGATION.md](PD_DISAGG_INVESTIGATION.md).
|
||||
|
||||
Date: 2026-05-28 · Hardware: dash1, 8×GPU · Model: Qwen3-Coder-30B-A3B-Instruct
|
||||
· vLLM 0.18.1 (V1, chunked-prefill on) · Mooncake 0.3.11 · Trace:
|
||||
`w600_r0.0015_st30.jsonl` (1214 requests, agentic multi-turn).
|
||||
|
||||
---
|
||||
|
||||
## TL;DR (verdict)
|
||||
|
||||
**No static prefill/decode split beats 8-way colocation (8C) on this agentic
|
||||
workload.** Every disaggregated ratio we tried is dominated by 8C on the
|
||||
metric the user actually feels (TTFT, end-to-end latency, request
|
||||
completion), and the failure *moves* with the ratio:
|
||||
|
||||
- **D-heavy bottleneck** (6P+2D, 4P+4D): the decode pool saturates (peak
|
||||
**99.6% / 97.5%**) while the prefill pool sits at **~30%** — half the
|
||||
cluster's KV is stranded on the wrong side.
|
||||
- **P-heavy bottleneck** (2P+6D): the 2 prefill instances can't keep up,
|
||||
the prefill pool jams at **99.7%**, **872 requests** pile up in the queue
|
||||
and **91% of requests never complete**.
|
||||
- **8C** keeps a single elastic pool that absorbs whichever phase is hot at
|
||||
the moment → steady utilization **34%**, **100% completion**, fastest
|
||||
wall-clock, best p50/p90 latency.
|
||||
|
||||
PD-disagg *does* deliver the phase-isolation win we predicted in MB1 — its
|
||||
**TPOT is 10–35× cleaner** — but that win is swamped by TTFT inflation,
|
||||
request loss, and a total collapse of prefix-cache reuse under the stock
|
||||
round-robin router.
|
||||
|
||||
This is the empirical backing for the paper's claim: **agentic workloads
|
||||
have time-varying P:D demand that no static partition can track; colocation
|
||||
wins because its pool is elastic.** (H1 *and* H2 from the investigation doc,
|
||||
unified by one mechanism.)
|
||||
|
||||
---
|
||||
|
||||
## 1. Why this experiment exists
|
||||
|
||||
Earlier cost accounting (MB1 phase-interference, MB2 KV-transfer cost) showed
|
||||
that on the **phase-isolation axis alone**, PD-disagg actually *wins*: it
|
||||
removes prefill→decode interference, and the transfer cost is small relative
|
||||
to the interference it avoids. So "PD-disagg is bad for agentic" could not be
|
||||
argued from phase isolation — we needed a system-level experiment that
|
||||
measures the whole picture (queueing, pool capacity, cache reuse), not just
|
||||
the isolated phase cost.
|
||||
|
||||
See [analysis/mb1](../../analysis/mb1) and [analysis/mb2](../../analysis/mb2)
|
||||
for that accounting. This doc is the system-level answer.
|
||||
|
||||
---
|
||||
|
||||
## 2. Setup
|
||||
|
||||
| | |
|
||||
|---|---|
|
||||
| Configs | `8C` (8× kv_both colo), `6P+2D`, `4P+4D`, `2P+6D` (prefill+decode split) |
|
||||
| PD routing | stock **round-robin** on both P and D (vLLM official `mooncake_connector_proxy`) |
|
||||
| Trace | `w600_r0.0015_st30.jsonl`, 1214 requests, agentic multi-turn |
|
||||
| Reps | 1 (rep1) for this analysis; the 3-rep sweep confirmed run-to-run consistency before we converged on rep1 for iteration speed |
|
||||
| KV instrumentation | V1 scheduler patched to dump per-request KV block allocation every 100 ms per EngineCore (see `instrument_kv_snapshot.py`) |
|
||||
|
||||
8C is the fair baseline: 8 colocated instances, replayer round-robins across
|
||||
them directly (no proxy). PD configs route through the proxy.
|
||||
|
||||
---
|
||||
|
||||
## 3. Headline result — no PD ratio beats 8C
|
||||
|
||||
All numbers are rep1.
|
||||
|
||||
| Metric | **8C** | 6P+2D | 4P+4D | 2P+6D |
|
||||
|---|---|---|---|---|
|
||||
| **completion** | **100%** | 100% | 100% | **9%** 💀 |
|
||||
| wall-clock (drain trace) | **2994 s** | 3419 s | 4171 s | 5762 s |
|
||||
| prefix-cache hit | **19.4%** | 0% | 0% | 0% |
|
||||
| TTFT mean | **18.0 s** | 44.8 s | 70.0 s | 106.8 s |
|
||||
| TTFT p50 | **7.0 s** | 41.0 s | 56.4 s | 23.6 s |
|
||||
| TTFT p90 | **53.1 s** | 86.7 s | 153.1 s | 498 s |
|
||||
| E2E p50 | **10.8 s** | 44.5 s | 59.5 s | 26.3 s |
|
||||
| E2E p90 | **83.3 s** | 91.8 s | 157.1 s | 499 s |
|
||||
|
||||

|
||||
|
||||
> ⚠️ **Read the percentiles with the completion rate.** Latency percentiles
|
||||
> are computed over *successful* requests only. 2P+6D's "p99 = 577 s" covers
|
||||
> just the 9% that finished — the other 91% never returned, so its real
|
||||
> experience is far worse than any latency bar suggests.
|
||||
|
||||
8C wins p50 by **4×** and p90 decisively. The only metric where a PD config
|
||||
edges 8C is E2E **p99** (6P+2D 148 s vs 8C 194 s) — and that is the flip side
|
||||
of the next result.
|
||||
|
||||
---
|
||||
|
||||
## 4. The duality — PD wins TPOT, loses TTFT
|
||||
|
||||
PD-disagg delivers exactly the phase-isolation benefit MB1 predicted: with no
|
||||
prefill stealing decode steps, **inter-token latency is dramatically cleaner.**
|
||||
|
||||
| TPOT | **8C** | 6P+2D | 4P+4D | 2P+6D |
|
||||
|---|---|---|---|---|
|
||||
| mean | 87 ms | 11 ms | 9 ms | 6 ms |
|
||||
| p90 | 230 ms | 18 ms | 14 ms | 8 ms |
|
||||
| p99 | **1129 ms** | **26 ms** | **20 ms** | **12 ms** |
|
||||
|
||||
PD's TPOT p99 is **10–35× lower** — once a request reaches a dedicated decode
|
||||
instance it streams without interruption. 8C's 1.1 s TPOT p99 *is* the
|
||||
chunked-prefill interference tax (decode steps occasionally stalled behind an
|
||||
8k-token prefill chunk), consistent with MB1.
|
||||
|
||||
**But the win is local.** TTFT inflates 2.5–6× because every request now pays
|
||||
P→D handoff + admission into a smaller, saturated decode pool. For this
|
||||
workload's modest output lengths, TTFT dominates total time, so the TPOT win
|
||||
never pays for itself. This is the cost/benefit imbalance made concrete:
|
||||
phase isolation is real, but it is the wrong thing to optimize when the pool
|
||||
is the binding constraint.
|
||||
|
||||
---
|
||||
|
||||
## 5. Root cause — per-role KV pool occupancy (the kill shot)
|
||||
|
||||
The cluster-average KV utilization is *misleading* and nearly hid the result:
|
||||
|
||||

|
||||
|
||||
6P+2D and 4P+4D look only ~42–46% utilized on cluster average — yet they have
|
||||
128–152 requests queued. The average hides that **one pool is pegged while
|
||||
the other idles.** Splitting the KV pool by role exposes it:
|
||||
|
||||

|
||||
|
||||
| Config | P-pool steady | D-pool steady | D-pool **peak** | binding side |
|
||||
|---|---|---|---|---|
|
||||
| 8C | — single shared pool — | 34% | 72% | none (elastic) |
|
||||
| 6P+2D | 31% | **74%** | **99.6%** | **decode** |
|
||||
| 4P+4D | 29% | **60%** | **97.5%** | **decode** |
|
||||
| 2P+6D | **92%** | 95% | 96% | **prefill** (P jams first) |
|
||||
|
||||

|
||||
|
||||
**The mechanism, unified:**
|
||||
|
||||
- A static P:D split fixes the KV capacity on each side at deploy time.
|
||||
- The agentic workload's instantaneous P:D demand *drifts* (bursts of new
|
||||
sessions = prefill-heavy; long tool-call-driven turns = decode-heavy).
|
||||
- Whichever side is undersized *for the current phase* saturates and
|
||||
back-pressures the whole pipeline, while the other side's KV sits stranded.
|
||||
- 6P+2D / 4P+4D → decode side too small → D-pool hits ~100%, prefilled
|
||||
requests queue for a decode slot → TTFT explodes (this is **H1**).
|
||||
- 2P+6D → prefill side too small → P-pool hits ~100%, requests can't even
|
||||
start → 872 queued, 91% dropped.
|
||||
- **8C colocation has no partition**: prefill and decode share one pool, so
|
||||
the pool elastically reallocates to whichever phase is hot. Steady
|
||||
utilization stays at 34% with 100% completion.
|
||||
|
||||
This is **H1 (D-pool capacity ceiling)** and **H2 (static-partition
|
||||
mismatch)** turning out to be the *same* phenomenon seen from two ratios.
|
||||
|
||||
---
|
||||
|
||||
## 6. The routing handicap — and whether smarter routing rescues PD
|
||||
|
||||
Every PD config above shows **prefix-cache hit = 0%**, versus 8C's 19%. That
|
||||
is not fundamental to disaggregation — it is the stock proxy round-robining
|
||||
the **prefill** side: consecutive turns of one agentic session land on
|
||||
*different* producers, so each turn re-prefills the whole conversation from
|
||||
scratch. That both inflates TTFT and piles extra load on the prefill pool
|
||||
(directly worsening the 2P+6D collapse).
|
||||
|
||||
The correct PD scheduling policy (as the design argues): **P should be chosen
|
||||
by session affinity** (reuse the producer's prefix cache) while **D is chosen
|
||||
by load balance** (decode KV is freshly transferred per turn, so D gains
|
||||
nothing from affinity). We added this as an env-gated mode in the proxy
|
||||
(`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays
|
||||
round-robin) and re-ran the best-performing disaggregated config, **6P+2D**.
|
||||
|
||||
> **Status: session-affinity 6P+2D run in progress.** Results below will be
|
||||
> filled in when it completes; the question it answers is *how much of the
|
||||
> gap to 8C does restoring prefix-cache reuse close.*
|
||||
|
||||
<!-- SESSION_AFFINITY_RESULTS -->
|
||||
*(pending)*
|
||||
|
||||
---
|
||||
|
||||
## 7. Caveats / honesty
|
||||
|
||||
- **Single rep** for this analysis. The earlier 3-rep sweep showed 8C and
|
||||
4P+4D are tight run-to-run, but 6P+2D completion varied (rep1 100% vs rep2
|
||||
56% vs rep3 80%) — i.e. the D-pool sits right at the cliff edge, so 6P+2D's
|
||||
"100% rep1" is optimistic. The qualitative ranking is robust; exact numbers
|
||||
on the marginal configs are not.
|
||||
- **Latency percentiles count successes only** (see §3 warning). For failing
|
||||
configs the latency bars *understate* the damage.
|
||||
- **Round-robin baseline.** §6 addresses the routing fairness concern head-on
|
||||
with a session-affinity re-run.
|
||||
- Trace is a single agentic workload; conclusions are about *this* class of
|
||||
workload (sub-second tool-call cadence, multi-turn sessions), not all LLM
|
||||
serving.
|
||||
|
||||
---
|
||||
|
||||
## 8. Reproduce
|
||||
|
||||
```bash
|
||||
# from repo root, after microbench/fresh_setup/deploy.sh dash1
|
||||
# 1. round-robin baseline sweep (1 rep)
|
||||
ssh dash1 'CONFIGS="8C 6P+2D 4P+4D 2P+6D" REPS=1 RUN_TAG=<tag> \
|
||||
bash /home/admin/cpfs/wjh/agentic-kv-fresh/scripts/mb5_run.sh'
|
||||
|
||||
# 2. reduce on dash1 (numpy-only; handles the multi-GB snapshot dirs)
|
||||
ssh dash1 '.venv/bin/python scripts/aggregate_mb5.py --sweep-root mb5_runs \
|
||||
--tag <tag> --configs "8C 6P+2D 4P+4D 2P+6D" --reps 1 \
|
||||
--reduce-to mb5_runs/reduced_<tag>.json'
|
||||
|
||||
# 3. pull the compact JSON, render figures locally
|
||||
scp dash1:.../mb5_runs/reduced_<tag>.json analysis/mb5/
|
||||
.venv/bin/python microbench/fresh_setup/aggregate_mb5.py \
|
||||
--from-reduced analysis/mb5/reduced_<tag>.json --out-dir figs/mb5
|
||||
|
||||
# session-affinity arm: prefix the run with MB5_P_ROUTING=session
|
||||
```
|
||||
@@ -31,11 +31,11 @@ import json
|
||||
from collections import defaultdict
|
||||
from pathlib import Path
|
||||
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
import numpy as np
|
||||
|
||||
# matplotlib is imported lazily inside the plot functions so the --reduce
|
||||
# path (numpy-only) can run on a serving host without matplotlib installed.
|
||||
|
||||
|
||||
def load_snapshots_for_run(snap_dir: Path) -> list[dict]:
|
||||
"""Merge all per-PID snapshot files in snap_dir, tag with pid, sort by t_unix."""
|
||||
@@ -57,17 +57,58 @@ def load_snapshots_for_run(snap_dir: Path) -> list[dict]:
|
||||
return out
|
||||
|
||||
|
||||
def cluster_timeline(snaps: list[dict], bin_size_s: float = 1.0) -> tuple[np.ndarray, ...]:
|
||||
def load_pid_roles(logs_dir: Path) -> dict[int, str]:
|
||||
"""Map EngineCore PID -> 'P' | 'D' | 'C' by parsing vllm_logs filenames.
|
||||
|
||||
File names look like vllm_idx{i}_gpu{g}_kv_{producer|consumer|both}.log and
|
||||
each contains '(EngineCore pid=NNNN)'. Returns {} if no logs found.
|
||||
"""
|
||||
role_map = {"producer": "P", "consumer": "D", "both": "C"}
|
||||
out: dict[int, str] = {}
|
||||
if not logs_dir.is_dir():
|
||||
return out
|
||||
for f in logs_dir.glob("vllm_idx*_kv_*.log"):
|
||||
role = None
|
||||
for key, short in role_map.items():
|
||||
if f.name.endswith(f"kv_{key}.log"):
|
||||
role = short
|
||||
break
|
||||
if role is None:
|
||||
continue
|
||||
with f.open(errors="ignore") as fh:
|
||||
for line in fh:
|
||||
if "EngineCore pid=" in line:
|
||||
try:
|
||||
pid = int(line.split("EngineCore pid=")[1].split(")")[0].split()[0])
|
||||
out[pid] = role
|
||||
break
|
||||
except (ValueError, IndexError):
|
||||
continue
|
||||
return out
|
||||
|
||||
|
||||
def cluster_timeline(snaps: list[dict], bin_size_s: float = 1.0,
|
||||
keep_pids: set | None = None,
|
||||
t0: float | None = None,
|
||||
n_bins: int | None = None) -> tuple[np.ndarray, ...]:
|
||||
"""Bin per-PID snapshots into a cluster-wide timeline.
|
||||
|
||||
For each time bin, sum used_blocks across PIDs that emitted a snapshot
|
||||
in that bin. PIDs without a sample in a bin carry their previous value
|
||||
forward (so a quiet PID doesn't artificially drop the total).
|
||||
|
||||
If keep_pids is given, only those PIDs are counted (used for per-role
|
||||
P-pool / D-pool splits); the pool ceiling is summed over the same subset.
|
||||
Pass a shared t0/n_bins so role-splits land on the same time grid.
|
||||
"""
|
||||
if keep_pids is not None:
|
||||
snaps = [s for s in snaps if s["pid"] in keep_pids]
|
||||
if not snaps:
|
||||
empty = np.array([], dtype=float)
|
||||
return empty, empty, empty, empty, empty
|
||||
if t0 is None:
|
||||
t0 = snaps[0]["t_unix"]
|
||||
if n_bins is None:
|
||||
t_end = snaps[-1]["t_unix"]
|
||||
n_bins = max(1, int(np.ceil((t_end - t0) / bin_size_s)) + 1)
|
||||
times = np.arange(n_bins) * bin_size_s
|
||||
@@ -118,34 +159,60 @@ def load_summary(rundir: Path) -> dict | None:
|
||||
return json.loads(p.read_text())
|
||||
|
||||
|
||||
def _steady_median(arr: np.ndarray) -> float:
|
||||
n = len(arr)
|
||||
if n == 0:
|
||||
return 0.0
|
||||
if n >= 10:
|
||||
return float(np.median(arr[int(n * 0.1):int(n * 0.9)]))
|
||||
return float(np.median(arr))
|
||||
|
||||
|
||||
def per_run_metrics(snaps_dir: Path, rundir: Path) -> dict:
|
||||
snaps = load_snapshots_for_run(snaps_dir)
|
||||
times, total_used, pool_frac, total_waiting, total_running = cluster_timeline(snaps)
|
||||
summary = load_summary(rundir) or {}
|
||||
|
||||
# Trim the warmup/cooldown 10% to compute "steady-state" stats
|
||||
n = len(times)
|
||||
if n >= 10:
|
||||
lo, hi = int(n * 0.1), int(n * 0.9)
|
||||
frac_steady = pool_frac[lo:hi]
|
||||
wait_steady = total_waiting[lo:hi]
|
||||
# Establish a shared time grid (global t0 / n_bins) so the overall and
|
||||
# per-role timelines all line up on the same x axis.
|
||||
if snaps:
|
||||
t0 = snaps[0]["t_unix"]
|
||||
t_end = snaps[-1]["t_unix"]
|
||||
n_bins = max(1, int(np.ceil(t_end - t0)) + 1)
|
||||
else:
|
||||
frac_steady = pool_frac
|
||||
wait_steady = total_waiting
|
||||
t0, n_bins = None, None
|
||||
|
||||
return {
|
||||
"snaps": snaps,
|
||||
"times": times,
|
||||
"total_used": total_used,
|
||||
"pool_frac": pool_frac,
|
||||
"total_waiting": total_waiting,
|
||||
"total_running": total_running,
|
||||
times, total_used, pool_frac, total_waiting, total_running = cluster_timeline(
|
||||
snaps, t0=t0, n_bins=n_bins
|
||||
)
|
||||
n = len(times)
|
||||
|
||||
out = {
|
||||
"times": times.tolist(),
|
||||
"total_used": total_used.tolist(),
|
||||
"pool_frac": pool_frac.tolist(),
|
||||
"total_waiting": total_waiting.tolist(),
|
||||
"total_running": total_running.tolist(),
|
||||
"peak_pool_frac": float(pool_frac.max()) if n else 0.0,
|
||||
"steady_pool_frac": float(np.median(frac_steady)) if n else 0.0,
|
||||
"steady_pool_frac": _steady_median(pool_frac),
|
||||
"peak_waiting": int(total_waiting.max()) if n else 0,
|
||||
"summary": summary,
|
||||
}
|
||||
|
||||
# Per-role (P-pool vs D-pool) split for PD configs.
|
||||
roles = load_pid_roles(snaps_dir.parent / "vllm_logs")
|
||||
p_pids = {pid for pid, r in roles.items() if r == "P"}
|
||||
d_pids = {pid for pid, r in roles.items() if r == "D"}
|
||||
if p_pids and d_pids:
|
||||
for tag, subset in (("p", p_pids), ("d", d_pids)):
|
||||
_, _, frac, _, run = cluster_timeline(
|
||||
snaps, keep_pids=subset, t0=t0, n_bins=n_bins
|
||||
)
|
||||
out[f"{tag}_pool_frac"] = frac.tolist()
|
||||
out[f"{tag}_running"] = run.tolist()
|
||||
out[f"{tag}_peak_frac"] = float(frac.max()) if len(frac) else 0.0
|
||||
out[f"{tag}_steady_frac"] = _steady_median(frac)
|
||||
return out
|
||||
|
||||
|
||||
def collect_sweep(sweep_root: Path, tag: str, configs: list[str], reps: int) -> dict:
|
||||
"""Returns {config: [run_record_per_rep]}."""
|
||||
@@ -169,6 +236,10 @@ def collect_sweep(sweep_root: Path, tag: str, configs: list[str], reps: int) ->
|
||||
|
||||
|
||||
def plot_kv_timeline(sweep: dict, out: Path) -> None:
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
n_configs = len(sweep)
|
||||
if n_configs == 0:
|
||||
return
|
||||
@@ -177,8 +248,8 @@ def plot_kv_timeline(sweep: dict, out: Path) -> None:
|
||||
axes = [axes]
|
||||
for ax, (config, reps) in zip(axes, sweep.items()):
|
||||
for rep_data in reps:
|
||||
t = rep_data["times"]
|
||||
ax.plot(t, rep_data["pool_frac"] * 100, alpha=0.4, lw=1.0,
|
||||
t = np.asarray(rep_data["times"])
|
||||
ax.plot(t, np.asarray(rep_data["pool_frac"]) * 100, alpha=0.4, lw=1.0,
|
||||
label=f"rep{rep_data['rep']}")
|
||||
# bold median across reps (need to align times — use longest series)
|
||||
if reps:
|
||||
@@ -203,6 +274,10 @@ def plot_kv_timeline(sweep: dict, out: Path) -> None:
|
||||
|
||||
|
||||
def plot_peak_utilization(sweep: dict, out: Path) -> None:
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
configs = list(sweep.keys())
|
||||
peaks = [[r["peak_pool_frac"] * 100 for r in sweep[c]] for c in configs]
|
||||
steady = [[r["steady_pool_frac"] * 100 for r in sweep[c]] for c in configs]
|
||||
@@ -234,6 +309,10 @@ def plot_peak_utilization(sweep: dict, out: Path) -> None:
|
||||
|
||||
|
||||
def plot_latency_compare(sweep: dict, out: Path) -> None:
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
configs = list(sweep.keys())
|
||||
metrics = ["p50", "p90", "p99"]
|
||||
data = {m: [] for m in metrics}
|
||||
@@ -264,6 +343,50 @@ def plot_latency_compare(sweep: dict, out: Path) -> None:
|
||||
print(f"wrote {out}")
|
||||
|
||||
|
||||
def plot_role_split(sweep: dict, out: Path) -> None:
|
||||
"""For PD configs, show P-pool vs D-pool KV % over time (rep1) — exposes
|
||||
the imbalance that the cluster average hides. 8C (no role split) shows
|
||||
the overall cluster line for reference."""
|
||||
import matplotlib
|
||||
matplotlib.use("Agg")
|
||||
import matplotlib.pyplot as plt
|
||||
|
||||
n_configs = len(sweep)
|
||||
if n_configs == 0:
|
||||
return
|
||||
fig, axes = plt.subplots(n_configs, 1, figsize=(14, 2.6 * n_configs), sharex=True)
|
||||
if n_configs == 1:
|
||||
axes = [axes]
|
||||
for ax, (config, reps) in zip(axes, sweep.items()):
|
||||
if not reps:
|
||||
continue
|
||||
r = reps[0] # rep1
|
||||
t = np.asarray(r["times"])
|
||||
if "p_pool_frac" in r and "d_pool_frac" in r:
|
||||
ax.plot(t, np.asarray(r["p_pool_frac"]) * 100, color="#4c72b0",
|
||||
lw=1.5, label="P-pool (prefill)")
|
||||
ax.plot(t, np.asarray(r["d_pool_frac"]) * 100, color="#c44e52",
|
||||
lw=1.5, label="D-pool (decode)")
|
||||
ax.plot(t, np.asarray(r["pool_frac"]) * 100, color="#999",
|
||||
lw=1.0, ls=":", label="cluster avg")
|
||||
else:
|
||||
ax.plot(t, np.asarray(r["pool_frac"]) * 100, color="#222",
|
||||
lw=1.5, label="cluster (kv_both)")
|
||||
ax.axhline(90, color="#444", ls="--", alpha=0.5, lw=1)
|
||||
ax.set_ylim(0, 105)
|
||||
ax.set_ylabel(f"{config}\nKV pool (%)")
|
||||
ax.grid(True, alpha=0.3)
|
||||
ax.legend(loc="upper right", fontsize=8, ncol=3)
|
||||
axes[-1].set_xlabel("wall-clock since first snapshot (s)")
|
||||
fig.suptitle("MB5: per-role KV pool utilization (P-pool vs D-pool), rep1",
|
||||
fontsize=12)
|
||||
fig.tight_layout()
|
||||
out.parent.mkdir(parents=True, exist_ok=True)
|
||||
fig.savefig(out, dpi=120)
|
||||
plt.close(fig)
|
||||
print(f"wrote {out}")
|
||||
|
||||
|
||||
def write_summary_csv(sweep: dict, out: Path) -> None:
|
||||
rows = []
|
||||
for config, reps in sweep.items():
|
||||
@@ -279,6 +402,10 @@ def write_summary_csv(sweep: dict, out: Path) -> None:
|
||||
"wall_clock_s": s.get("wall_clock_s"),
|
||||
"peak_pool_frac": r["peak_pool_frac"],
|
||||
"steady_pool_frac": r["steady_pool_frac"],
|
||||
"p_pool_peak_frac": r.get("p_peak_frac"),
|
||||
"p_pool_steady_frac": r.get("p_steady_frac"),
|
||||
"d_pool_peak_frac": r.get("d_peak_frac"),
|
||||
"d_pool_steady_frac": r.get("d_steady_frac"),
|
||||
"peak_waiting": r["peak_waiting"],
|
||||
"latency_p50_s": lat.get("p50"),
|
||||
"latency_p90_s": lat.get("p90"),
|
||||
@@ -299,24 +426,55 @@ def write_summary_csv(sweep: dict, out: Path) -> None:
|
||||
print(f"wrote {out} ({len(rows)} rows)")
|
||||
|
||||
|
||||
def render_all(sweep: dict, out_dir: Path) -> None:
|
||||
plot_kv_timeline(sweep, out_dir / "mb5_kv_timeline.png")
|
||||
plot_role_split(sweep, out_dir / "mb5_role_split.png")
|
||||
plot_peak_utilization(sweep, out_dir / "mb5_peak_utilization.png")
|
||||
plot_latency_compare(sweep, out_dir / "mb5_latency_compare.png")
|
||||
write_summary_csv(sweep, out_dir / "mb5_summary.csv")
|
||||
|
||||
|
||||
def main() -> None:
|
||||
p = argparse.ArgumentParser()
|
||||
p.add_argument("--sweep-root", type=Path, required=True,
|
||||
p = argparse.ArgumentParser(
|
||||
description="MB5 aggregate. Two-stage: --reduce (numpy-only, runs on "
|
||||
"a serving host) dumps a compact JSON; --from-reduced "
|
||||
"(needs matplotlib) renders figures locally. Or run "
|
||||
"directly (raw snapshots -> figures) when both the data "
|
||||
"and matplotlib are local."
|
||||
)
|
||||
p.add_argument("--sweep-root", type=Path,
|
||||
help="dir containing ${tag}_${config}_rep${N}/ subdirs")
|
||||
p.add_argument("--tag", required=True)
|
||||
p.add_argument("--tag")
|
||||
p.add_argument("--configs", default="8C 6P+2D 4P+4D 2P+6D",
|
||||
help="space-separated config names")
|
||||
p.add_argument("--reps", type=int, default=3)
|
||||
p.add_argument("--out-dir", type=Path, default=Path("figs/mb5"))
|
||||
p.add_argument("--reduce-to", type=Path,
|
||||
help="numpy-only: write reduced sweep JSON here and exit "
|
||||
"(no plotting, no matplotlib needed)")
|
||||
p.add_argument("--from-reduced", type=Path,
|
||||
help="load a reduced sweep JSON (from --reduce-to) and "
|
||||
"render figures into --out-dir")
|
||||
args = p.parse_args()
|
||||
|
||||
if args.from_reduced:
|
||||
sweep = json.loads(args.from_reduced.read_text())
|
||||
render_all(sweep, args.out_dir)
|
||||
return
|
||||
|
||||
if not (args.sweep_root and args.tag):
|
||||
p.error("--sweep-root and --tag are required unless --from-reduced is given")
|
||||
|
||||
configs = args.configs.split()
|
||||
sweep = collect_sweep(args.sweep_root, args.tag, configs, args.reps)
|
||||
|
||||
plot_kv_timeline(sweep, args.out_dir / "mb5_kv_timeline.png")
|
||||
plot_peak_utilization(sweep, args.out_dir / "mb5_peak_utilization.png")
|
||||
plot_latency_compare(sweep, args.out_dir / "mb5_latency_compare.png")
|
||||
write_summary_csv(sweep, args.out_dir / "mb5_summary.csv")
|
||||
if args.reduce_to:
|
||||
args.reduce_to.parent.mkdir(parents=True, exist_ok=True)
|
||||
args.reduce_to.write_text(json.dumps(sweep))
|
||||
print(f"wrote reduced sweep -> {args.reduce_to}")
|
||||
return
|
||||
|
||||
render_all(sweep, args.out_dir)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
|
||||
Reference in New Issue
Block a user