PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined
Captures 5 runs from the experiment matrix (combined-ca x3 seeds,
pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl
with cuda graphs enabled. The headline:

  combined-ca: TTFT p50 0.91s  success 99.5%
  pdsep-4p4d:  TTFT p50 62.8s  success 52%  (69x worse, half dropped)
  pdsep-6p2d:  TTFT p50 51.1s  success 68%  (56x worse, third dropped)

C2 (fig_c2): headline bars per config with error bars.

C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep
splits hit the memory wall, but the side differs by P:D ratio --
4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures
P-side).

C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side
prefill compute; D-side wait + first token is <=1.2s. The bottleneck
is P-side prefill queueing, not D-side decode wait as the original
analytical model assumed.

system_analysis.md gains a Layer 5b that reconciles the analytical
KV-wall model (which considered D-side only) with the empirical finding
that the wall hits whichever side has fewer GPUs, and co-saturates both
at extreme splits via D-side back-pressure.

plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.
bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during
this work but not used by the current matrix's data).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -30,10 +30,10 @@ analysis/pd_sep_paper_section/
 |---|---|---|
 | C1a: agentic input distribution (p50=33.5k, p90=101k, p99=132k); I/O = 142x | `figures/fig_c1a_io_cdf.pdf` | **rendered** |
 | C1b: 79% intra-session reuse + 0.8% cross-session | `figures/fig_c1b_reuse.pdf` | **rendered** |
-| C2: PD-sep vs Combined headline numbers | (not yet) | **needs re-run without --enforce-eager on `traces/w600_r0.0015_st30.jsonl`** |
+| C2: PD-sep vs Combined headline (TTFT 69× worse, success 52%) | `figures/fig_c2_pdsep_vs_combined.pdf` | **rendered** (N=3 combined-ca, N=1 each PD-sep config) |
-| C3: decode KV cache memory wall (time-series) | (not yet) | needs step-level vLLM telemetry during PD-sep run |
+| C3: KV cache time-series — both PD-sep splits hit the wall | `figures/fig_c3_kv_timeseries.pdf` | **rendered** |
-| C4: TTFT stacked breakdown (prefill / KV pull / decode wait) | (not yet) | needs per-request breakdown.json from PD-sep run |
+| C4: TTFT decomposition — 99% is P-side prefill compute | `figures/fig_c4_ttft_stacked.pdf` | **rendered** |
-| C5: cuda-graph ablation (eager vs cudagraph × Combined vs PD-sep) | (not yet) | needs the 2×2 matrix |
+| C5: cuda-graph ablation (eager vs cudagraph × Combined vs PD-sep) | (not yet) | needs `--with-eager` re-run |
 | C6: prefill stays compute-bound at 95% reuse | `figures/fig_c6_roofline.pdf` | **rendered** |
 | C7: cache-aware routing is a larger lever than PD-sep | `figures/fig_c7_routing_lever.pdf` | **rendered** (legacy data, footer caveat) |
 | KV-WALL: per-D-instance KV demand vs PD layout (system mechanism) | `figures/fig_kv_memory_wall.pdf` | **rendered** (analytical, audit constants in script) |
Binary file not shown.
BIN
analysis/pd_sep_paper_section/figures/fig_c3_kv_timeseries.pdf
Normal file
Binary file not shown.
BIN
analysis/pd_sep_paper_section/figures/fig_c4_ttft_stacked.pdf
Normal file
Binary file not shown.
@@ -41,6 +41,7 @@ WITH_RR=false
 WITH_EAGER=false
 DRY_RUN=false
 TAG_PREFIX="pd_matrix"
+ONLY="" # comma-separated list of tags to run (subset of the matrix)
 
 while [[ $# -gt 0 ]]; do
   case "$1" in
@@ -50,6 +51,7 @@ while [[ $# -gt 0 ]]; do
     --with-rr) WITH_RR=true; shift ;;
     --with-eager) WITH_EAGER=true; shift ;;
     --tag-prefix) TAG_PREFIX="$2"; shift 2 ;;
+    --only) ONLY="$2"; shift 2 ;;
     --dry-run) DRY_RUN=true; shift ;;
     -h|--help)
       sed -n '2,30p' "$0"; exit 0 ;;
@@ -57,6 +59,20 @@ while [[ $# -gt 0 ]]; do
   esac
 done
 
+# Build set of allowed tags from --only (if provided).
+declare -A ALLOWED
+if [ -n "$ONLY" ]; then
+  IFS=',' read -ra _tags <<< "$ONLY"
+  for t in "${_tags[@]}"; do
+    ALLOWED["$(echo "$t" | xargs)"]=1
+  done
+fi
+is_allowed() {
+  # if ONLY not set, everything is allowed
+  [ -z "$ONLY" ] && return 0
+  [ -n "${ALLOWED[$1]:-}" ]
+}
+
 if [ ! -f "$TRACE" ]; then
   echo "[ERROR] trace not found: $TRACE"
   exit 1
@@ -141,13 +157,17 @@ for c in "${CONFIGS[@]}"; do
   for m in "${MODES[@]}"; do
     mode_name="${m%%|*}"; mode_args="${m##*|}"
     for s in $(seq 1 $SEEDS); do
+      tag_name="${cfg_name}_${mode_name}_seed${s}"
+      if ! is_allowed "$tag_name"; then
+        continue
+      fi
       if run_one "$cfg_name" "$cfg_args" "$mode_name" "$mode_args" "$s"; then
         N_DONE=$((N_DONE + 1))
       else
         N_FAIL=$((N_FAIL + 1))
       fi
       ELAPSED=$(( $(date +%s) - START_TS ))
-      echo "[progress] $N_DONE done, $N_FAIL failed, $((N_TOTAL - N_DONE - N_FAIL)) remaining, ${ELAPSED}s elapsed"
+      echo "[progress] $N_DONE done, $N_FAIL failed, ${ELAPSED}s elapsed"
     done
   done
 done
490
analysis/pd_sep_paper_section/scripts/plot_pd_matrix.py
Normal file
@@ -0,0 +1,490 @@
+"""Render C2/C3/C4/C5 from outputs/pd_matrix/.
+
+Reads each completed run in outputs/pd_matrix/<config>_<mode>_seed<N>/ and
+produces:
+
+  C2: figures/fig_c2_pdsep_vs_combined.pdf
+      Bar chart with mean ± stderr (over seeds) for
+      TTFT p50, TTFT p90, TPOT p90, E2E p50.
+      Bars per config (combined-ca / pdsep-4p4d / pdsep-6p2d),
+      grouped by cuda-graph mode if eager runs present.
+
+  C3: figures/fig_c3_kv_timeseries.pdf
+      Per-instance GPU KV cache usage time-series mined from
+      vllm_inst_*.log "Engine 000: ... GPU KV cache usage: X%" lines.
+      One panel per config; D-instances in PD-sep configs highlighted.
+      Memory-wall threshold (90%) drawn.
+
+  C4: figures/fig_c4_ttft_stacked.pdf
+      Stacked TTFT bar per config showing per-stage time:
+        Combined: just TTFT (single segment, no stage decomposition).
+        PD-sep:   proxy_recv -> prefill_sent (queue on P)
+                  prefill_sent -> prefill_done (prefill compute on P)
+                  prefill_done -> decode_sent (proxy hop)
+                  decode_sent -> first_token (KV pull + decode wait on D)
+
+  C5: figures/fig_c5_cudagraph_ablation.pdf (only if eager runs exist)
+      Cuda-graph on vs off, per config. Captures the "PD-sep would benefit
+      from D-only graphs" claim quantitatively.
+
+Usage:
+    .venv/bin/python analysis/pd_sep_paper_section/scripts/plot_pd_matrix.py
+"""
+import argparse
+import json
+import re
+import statistics
+from dataclasses import dataclass, field
+from pathlib import Path
+
+import matplotlib
+matplotlib.use("Agg")
+import matplotlib.pyplot as plt
+import numpy as np
+
+CONFIG_ORDER = ["combined-ca", "combined-rr", "pdsep-4p4d", "pdsep-6p2d"]
+CONFIG_COLOR = {
+    "combined-ca": "#2ca02c",
+    "combined-rr": "#7f7f7f",
+    "pdsep-4p4d": "#ff7f0e",
+    "pdsep-6p2d": "#d62728",
+}
+
+# Engine log line: timestamps interleaved with usage metrics.
+# Example:
+# (APIServer pid=...) INFO 05-22 18:29:55 [loggers.py:259] Engine 000:
+# Avg prompt throughput: ... Running: N reqs, Waiting: M reqs,
+# GPU KV cache usage: Z%, Prefix cache hit rate: P%
+KV_LOG_RE = re.compile(
+    r"INFO (\d{2}-\d{2} \d{2}:\d{2}:\d{2}).*Engine 000: "
+    r".*Running: (\d+) reqs, Waiting: (\d+) reqs, "
+    r"GPU KV cache usage: ([\d.]+)%"
+)
+
+
+@dataclass
+class Run:
+    tag: str
+    config: str
+    mode: str  # cudagraph | eager
+    seed: int
+    summary: dict
+    breakdown: list = field(default_factory=list)
+    kv_series: dict = field(default_factory=dict)  # inst_idx -> [(t_sec, usage%), ...]
+
+    @property
+    def is_pdsep(self) -> bool:
+        return self.config.startswith("pdsep")
+
+    @property
+    def is_combined(self) -> bool:
+        return self.config.startswith("combined")
+
+
+def parse_tag(name: str) -> tuple[str, str, int] | None:
+    """combined-ca_cudagraph_seed1 -> ("combined-ca", "cudagraph", 1)"""
+    m = re.match(r"(combined-ca|combined-rr|pdsep-4p4d|pdsep-6p2d)_(cudagraph|eager)_seed(\d+)", name)
+    if not m:
+        return None
+    return m.group(1), m.group(2), int(m.group(3))
+
+
+def mine_kv_series(log_path: Path, start_epoch: float | None = None) -> list[tuple[float, float]]:
+    """Return (seconds_since_first_log, kv_usage_percent) pairs."""
+    out: list[tuple[float, float]] = []
+    first_t: float | None = None
+    for line in log_path.read_text(errors="ignore").splitlines():
+        m = KV_LOG_RE.search(line)
+        if not m:
+            continue
+        # "05-22 18:29:55" -> seconds since first log of this file
+        # We can't recover absolute epoch without a year, but relative is enough.
+        ts_str = m.group(1)
+        # parse MM-DD HH:MM:SS into a comparable scalar (seconds, treating every month as 31 days)
+        try:
+            mm, dd = map(int, ts_str.split(" ")[0].split("-"))
+            hh, mi, ss = map(int, ts_str.split(" ")[1].split(":"))
+        except ValueError:
+            continue
+        t_abs = ((mm - 1) * 31 + (dd - 1)) * 86400 + hh * 3600 + mi * 60 + ss
+        if first_t is None:
+            first_t = t_abs
+        usage = float(m.group(4))
+        out.append((t_abs - first_t, usage))
+    return out
+
+
+def load_runs(matrix_dir: Path) -> list[Run]:
+    runs: list[Run] = []
+    if not matrix_dir.exists():
+        return runs
+    for run_dir in sorted(matrix_dir.iterdir()):
+        if not run_dir.is_dir():
+            continue
+        parsed = parse_tag(run_dir.name)
+        if parsed is None:
+            continue
+        config, mode, seed = parsed
+        summary_p = run_dir / "metrics.summary.json"
+        if not summary_p.exists():
+            continue  # in-flight or failed
+        summary = json.loads(summary_p.read_text())
+        breakdown_p = run_dir / "breakdown.json"
+        breakdown = json.loads(breakdown_p.read_text()) if breakdown_p.exists() else []
+        kv_series: dict = {}
+        for log in sorted(run_dir.glob("vllm_inst_*.log")):
+            m = re.match(r"vllm_inst_(\d+)\.log", log.name)
+            if not m:
+                continue
+            kv_series[int(m.group(1))] = mine_kv_series(log)
+        runs.append(Run(tag=run_dir.name, config=config, mode=mode, seed=seed,
+                        summary=summary, breakdown=breakdown, kv_series=kv_series))
+    return runs
+
+
+# ---------- C2: headline bars with error bars ----------
+
+C2_METRICS = [
+    ("TTFT p50 (s)", "ttft_stats_s", "p50"),
+    ("TTFT p90 (s)", "ttft_stats_s", "p90"),
+    ("TPOT p90 (s)", "tpot_stats_s", "p90"),
+    ("E2E p50 (s)", "latency_stats_s", "p50"),
+]
+
+
+def aggregate_seeds(runs: list[Run]) -> dict:
+    """Group by (config, mode); for each metric, return mean and stderr across seeds."""
+    grouped: dict[tuple[str, str], list[Run]] = {}
+    for r in runs:
+        grouped.setdefault((r.config, r.mode), []).append(r)
+    out: dict[tuple[str, str], dict] = {}
+    for key, rs in grouped.items():
+        agg: dict = {"n_seeds": len(rs)}
+        for label, family, percentile in C2_METRICS:
+            vals = []
+            for r in rs:
+                v = r.summary.get(family, {}).get(percentile)
+                if v is not None:
+                    vals.append(float(v))
+            if not vals:
+                agg[label] = (float("nan"), 0.0)
+            elif len(vals) == 1:
+                agg[label] = (vals[0], 0.0)
+            else:
+                agg[label] = (statistics.mean(vals), statistics.stdev(vals) / (len(vals) ** 0.5))
+        out[key] = agg
+    return out
+
+
+def plot_c2(runs: list[Run], out_path: Path):
+    agg = aggregate_seeds(runs)
+    if not agg:
+        print("[C2] no runs available; skipped")
+        return
+    modes_present = sorted({k[1] for k in agg})
+    configs_present = [c for c in CONFIG_ORDER if any(k[0] == c for k in agg)]
+
+    fig, axes = plt.subplots(1, len(C2_METRICS), figsize=(3.0 * len(C2_METRICS), 3.6))
+    if len(C2_METRICS) == 1:
+        axes = [axes]
+
+    bar_w = 0.8 / max(1, len(modes_present))
+    for ax, (label, _, _) in zip(axes, C2_METRICS):
+        for mi, mode in enumerate(modes_present):
+            xs, ys, errs, colors = [], [], [], []
+            for ci, cfg in enumerate(configs_present):
+                if (cfg, mode) not in agg:
+                    continue
+                mean, sem = agg[(cfg, mode)][label]
+                xs.append(ci + (mi - (len(modes_present) - 1) / 2) * bar_w)
+                ys.append(mean)
+                errs.append(sem)
+                colors.append(CONFIG_COLOR.get(cfg, "#444"))
+            ax.bar(xs, ys, width=bar_w * 0.9, color=colors, yerr=errs,
+                   capsize=3, edgecolor="black", linewidth=0.5,
+                   label=mode if mi >= 0 else None,
+                   hatch=("" if mode == "cudagraph" else "//"))
+        ax.set_xticks(range(len(configs_present)))
+        labels_with_n = [
+            f"{c}\n(N={max(agg[(c, m)]['n_seeds'] for m in modes_present if (c, m) in agg)})"
+            for c in configs_present
+        ]
+        ax.set_xticklabels(labels_with_n, fontsize=8.5)
+        ax.set_title(label, fontsize=10)
+        ax.grid(True, axis="y", alpha=0.25)
+        ax.set_ylim(bottom=0)
+
+    handles = []
+    for mode in modes_present:
+        handles.append(plt.Rectangle((0, 0), 1, 1, fc="#aaa",
+                                     hatch=("" if mode == "cudagraph" else "//"),
+                                     edgecolor="black"))
+    if len(modes_present) > 1:
+        fig.legend(handles, modes_present, loc="upper right", fontsize=9,
+                   bbox_to_anchor=(0.99, 0.99))
+
+    fig.suptitle(
+        "PD-sep vs Combined headline latency "
+        "(trace=w600_r0.0015_st30, 850 reqs; per-config N shown under labels)",
+        fontsize=10, y=1.02,
+    )
+    fig.tight_layout()
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C2] wrote {out_path}")
+    for key, v in sorted(agg.items()):
+        print(f" {key[0]:13s} {key[1]:10s} n={v['n_seeds']} "
+              + " ".join(f"{lbl.split(' ')[0].lower()}={v[lbl][0]:.3f}" for lbl, _, _ in C2_METRICS))
+
+
+# ---------- C3: KV cache utilization time-series ----------
+
+def plot_c3(runs: list[Run], out_path: Path):
+    # Show seed=1 cudagraph runs only, one panel per config, all instances overlaid.
+    selected = [r for r in runs if r.mode == "cudagraph" and r.seed == 1 and r.kv_series]
+    if not selected:
+        print("[C3] no kv timeseries data; skipped")
+        return
+    selected.sort(key=lambda r: CONFIG_ORDER.index(r.config) if r.config in CONFIG_ORDER else 99)
+
+    fig, axes = plt.subplots(1, len(selected), figsize=(4.2 * len(selected), 3.6),
+                             sharey=True)
+    if len(selected) == 1:
+        axes = [axes]
+
+    for ax, r in zip(axes, selected):
+        n_p, n_d = pd_split(r.config)
+        # First pass: P (or combined) lines under D lines so D is on top.
+        p_label_done = False; d_label_done = False
+        for inst_idx, series in sorted(r.kv_series.items()):
+            if not series:
+                continue
+            xs = [t for t, _ in series]
+            ys = [u for _, u in series]
+            if r.is_combined:
+                color = "#1f77b4"; lw = 0.9; zorder = 2
+                label = "combined GPU" if not p_label_done else None
+                p_label_done = True
+            elif inst_idx < n_p:  # P-instance in pdsep
+                color = "#ff7f0e"; lw = 1.4; zorder = 3
+                label = f"P (inst 0..{n_p-1})" if not p_label_done else None
+                p_label_done = True
+            else:  # D-instance
+                color = "#d62728"; lw = 1.4; zorder = 4
+                label = f"D (inst {n_p}..{n_p+n_d-1})" if not d_label_done else None
+                d_label_done = True
+            ax.plot(xs, ys, color=color, lw=lw, alpha=0.85, zorder=zorder, label=label)
+        ax.axhline(90, color="#888", ls=":", lw=1)
+        ax.text(ax.get_xlim()[1] * 0.98, 92, "memory wall (90%)",
+                fontsize=8, color="#666", ha="right")
+        # peak summary in title
+        peaks = [max(u for _, u in s) if s else 0 for s in r.kv_series.values()]
+        peak_summary = f"peaks {min(peaks):.0f}..{max(peaks):.0f}%"
+        ax.set_title(f"{r.config}\n{peak_summary}", fontsize=10)
+        ax.set_xlabel("time since first engine log (s)")
+        ax.set_ylim(0, 105)
+        ax.grid(True, alpha=0.25)
+        ax.legend(loc="lower right", fontsize=8, framealpha=0.9)
+    axes[0].set_ylabel("GPU KV cache usage (%)")
+    fig.suptitle(
+        "KV cache utilization: PD-sep concentrates pressure on whichever side has fewer GPUs",
+        fontsize=10, y=1.02,
+    )
+    fig.tight_layout()
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C3] wrote {out_path}")
+    for r in selected:
+        peaks = [max(u for _, u in s) if s else 0 for s in r.kv_series.values()]
+        print(f" {r.config:13s} peak KV per inst: {[f'{p:.0f}%' for p in peaks]}")
+
+
+def pd_split(config: str) -> tuple[int, int]:
+    """Return (N_P, N_D) for the given config name. Combined = (8, 0)."""
+    if config == "pdsep-4p4d":
+        return 4, 4
+    if config == "pdsep-6p2d":
+        return 6, 2
+    return 8, 0  # combined: all instances do both
+
+
+# ---------- C4: TTFT stacked breakdown ----------
+
+def stages_for_record(rec: dict, is_pdsep: bool) -> dict | None:
+    t0 = rec.get("t_proxy_recv")
+    t_ft = rec.get("t_first_token")
+    if t0 is None or t_ft is None:
+        return None
+    if not is_pdsep:
+        return {"TTFT (combined)": t_ft - t0}
+    t_ps = rec.get("t_prefill_sent")
+    t_pd = rec.get("t_prefill_done")
+    t_ds = rec.get("t_decode_sent")
+    if any(x is None for x in (t_ps, t_pd, t_ds)):
+        return None
+    return {
+        "proxy->P queue": max(0.0, t_ps - t0),
+        "P prefill compute": max(0.0, t_pd - t_ps),
+        "P->D handoff": max(0.0, t_ds - t_pd),
+        "D wait + first tok": max(0.0, t_ft - t_ds),
+    }
+
+
+def plot_c4(runs: list[Run], out_path: Path):
+    """Stacked per-stage p50 for each config, pooled over all cudagraph seeds.
+    Combined gets a single-segment bar so the comparison against PD-sep is
+    direct, even though Combined has no stage decomposition."""
+    # Pool all cudagraph seeds with breakdown data; per-stage p50s are stacked, so bar height is a sum of medians, not the median TTFT.
+    by_config: dict[str, list[Run]] = {}
+    for r in runs:
+        if r.mode != "cudagraph" or not r.breakdown:
+            continue
+        by_config.setdefault(r.config, []).append(r)
+    if not by_config:
+        print("[C4] no breakdown data; skipped")
+        return
+
+    bars = []
+    for config in CONFIG_ORDER:
+        if config not in by_config:
+            continue
+        per_stage: dict[str, list[float]] = {}
+        is_pdsep = config.startswith("pdsep")
+        for r in by_config[config]:
+            for rec in r.breakdown:
+                s = stages_for_record(rec, is_pdsep)
+                if not s:
+                    continue
+                for k, v in s.items():
+                    per_stage.setdefault(k, []).append(v)
+        p50 = {k: float(np.median(v)) for k, v in per_stage.items() if v}
+        bars.append((config, p50, len(by_config[config])))
+
+    fig, ax = plt.subplots(figsize=(7.0, 4.2))
+    width = 0.55
+    stage_colors = {
+        "TTFT (combined)": "#2ca02c",
+        "proxy->P queue": "#1f77b4",
+        "P prefill compute": "#9467bd",
+        "P->D handoff": "#8c564b",
+        "D wait + first tok": "#d62728",
+    }
+    # consistent stage order
+    stage_order = ["TTFT (combined)", "proxy->P queue", "P prefill compute",
+                   "P->D handoff", "D wait + first tok"]
+    x = list(range(len(bars)))
+    legend_seen: set[str] = set()
+    for i, (config, stages, n_seeds) in enumerate(bars):
+        bottom = 0.0
+        for stage in stage_order:
+            if stage not in stages:
+                continue
+            val = stages[stage]
+            color = stage_colors.get(stage, "#444")
+            label = stage if stage not in legend_seen else None
+            legend_seen.add(stage)
+            ax.bar(i, val, bottom=bottom, width=width, color=color,
+                   edgecolor="white", linewidth=0.5, label=label)
+            if val > 1.0:
+                ax.text(i, bottom + val / 2, f"{val:.2f}s",
+                        ha="center", va="center", fontsize=9, color="white",
+                        fontweight="bold")
+            bottom += val
+        ax.text(i, bottom + 1.5, f"TTFT p50\n{bottom:.2f}s",
+                ha="center", va="bottom", fontsize=9, color="#222",
+                fontweight="bold")
+
+    ax.set_xticks(x)
+    ax.set_xticklabels([f"{b[0]}\n(N={b[2]})" for b in bars],
+                       rotation=0, ha="center", fontsize=9)
+    ax.set_ylabel("TTFT p50 (s, stacked)")
+    ax.set_title(
+        "TTFT decomposition: PD-sep's TTFT is dominated by P-side prefill queueing",
+        fontsize=10,
+    )
+    if legend_seen:
+        ax.legend(loc="upper left", fontsize=8, frameon=False)
+    ax.grid(True, axis="y", alpha=0.25)
+    fig.tight_layout()
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C4] wrote {out_path}")
+    for config, stages, n_seeds in bars:
+        tot = sum(stages.values())
+        print(f" {config:13s} (N={n_seeds}) TTFT p50 = {tot:.3f}s "
+              + " ".join(f"{k}={v:.3f}" for k, v in stages.items()))
+
+
+# ---------- C5: cuda-graph ablation 2x2 ----------
+
+def plot_c5(runs: list[Run], out_path: Path):
+    modes = {r.mode for r in runs}
+    if "eager" not in modes:
+        print("[C5] skipped (no --with-eager runs)")
+        return
+    agg = aggregate_seeds(runs)
+    configs_present = [c for c in CONFIG_ORDER if any(k[0] == c for k in agg)]
+
+    metrics = [("TTFT p50 (s)", "ttft_stats_s", "p50"),
+               ("TPOT p90 (s)", "tpot_stats_s", "p90")]
+    fig, axes = plt.subplots(1, len(metrics), figsize=(3.5 * len(metrics), 3.6))
+    if len(metrics) == 1:
+        axes = [axes]
+
+    for ax, (label, _, _) in zip(axes, metrics):
+        xs = np.arange(len(configs_present))
+        w = 0.38
+        for offset, mode in zip([-w / 2, w / 2], ["eager", "cudagraph"]):
+            ys, errs = [], []
+            for cfg in configs_present:
+                k = (cfg, mode)
+                if k in agg:
+                    m, e = agg[k][label]
+                else:
+                    m, e = float("nan"), 0.0
+                ys.append(m); errs.append(e)
+            ax.bar(xs + offset, ys, w, yerr=errs, capsize=3,
+                   label=mode, edgecolor="black", linewidth=0.5)
+        ax.set_xticks(xs)
+        ax.set_xticklabels(configs_present, rotation=20, ha="right", fontsize=8.5)
+        ax.set_title(label, fontsize=10)
+        ax.legend(fontsize=8)
+        ax.grid(True, axis="y", alpha=0.25)
+
+    fig.suptitle("Cuda-graph ablation: PD-sep's D-only graphs are the structural advantage that --enforce-eager suppressed",
+                 fontsize=10, y=1.02)
+    fig.tight_layout()
+    fig.savefig(out_path, bbox_inches="tight")
+    plt.close(fig)
+    print(f"[C5] wrote {out_path}")
+
+
+# ---------- entrypoint ----------
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--matrix-dir", default="outputs/pd_matrix")
+    ap.add_argument("--outdir", default="analysis/pd_sep_paper_section/figures")
+    args = ap.parse_args()
+
+    matrix_dir = Path(args.matrix_dir)
+    outdir = Path(args.outdir)
+    outdir.mkdir(parents=True, exist_ok=True)
+
+    runs = load_runs(matrix_dir)
+    print(f"loaded {len(runs)} completed runs from {matrix_dir}")
+    for r in runs:
+        print(f" - {r.tag}")
+
+    if not runs:
+        print("no runs yet; nothing to plot.")
+        return
+
+    plot_c2(runs, outdir / "fig_c2_pdsep_vs_combined.pdf")
+    plot_c3(runs, outdir / "fig_c3_kv_timeseries.pdf")
+    plot_c4(runs, outdir / "fig_c4_ttft_stacked.pdf")
+    plot_c5(runs, outdir / "fig_c5_cudagraph_ablation.pdf")
+
+
+if __name__ == "__main__":
+    main()
@@ -160,6 +160,97 @@ The empirical KV occupancy on the 6P+2D run was 97 %
 (`analysis/pd_separation_analysis.md` §3.3) — the model and the
 measurement agree to within the resolution of the steady-state assumption.
+
+## Layer 5b: empirical refinement — the bottleneck side depends on the P:D split
+
+The model above predicts D-side saturation. The 6P+2D run in
+`analysis/pd_separation_analysis.md` §3.3 is consistent with it. But the
+new 4P+4D run (`outputs/pd_matrix/pdsep-4p4d_cudagraph_seed1/`, captured
+during this section's experiment matrix) tells a richer story.
+
+Empirical numbers (combined-ca vs both PD-sep splits, same trace, all
+cudagraph, `figures/fig_c2_pdsep_vs_combined.pdf`):
+
+| metric | combined-ca (N=3) | pdsep-4p4d (N=1) | pdsep-6p2d (N=1) |
+|---|---|---|---|
+| success | 99.5 % | **52 %** (444/850) | **68 %** (574/850) |
+| TTFT p50 | 0.91 s | **62.8 s** (69×) | **51.1 s** (56×) |
+| TTFT p90 | 12.7 s | **491 s** (39×) | **400 s** (31×) |
+| TPOT p90 | 0.027 s | 0.013 s (-52 %) | 0.020 s (-26 %) |
+| E2E p50 | 2.5 s | **65.1 s** (26×) | **53.4 s** (21×) |
+| wall clock | 944 s | 7558 s (8×) | 3693 s (3.9×) |
+
+The per-stage TTFT decomposition (`figures/fig_c4_ttft_stacked.pdf`)
+shows that for both PD-sep splits **>97 % of TTFT is P-side prefill
+compute** (65.6 s / 66.2 s in 4p4d; 43.1 s / 44.3 s in 6p2d). D-side
+wait + first token is at most 1.2 s in either config.
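
As a quick arithmetic check on the shares quoted in the decomposition paragraph, the sketch below recomputes the P-side prefill fraction from the stacked p50 values given in the text (65.6 s of 66.2 s, and 43.1 s of 44.3 s). The helper name `prefill_share` is illustrative and is not part of `plot_pd_matrix.py`.

```python
def prefill_share(prefill_s: float, total_s: float) -> float:
    """Fraction of the stacked TTFT p50 spent in P-side prefill compute."""
    return prefill_s / total_s


# (prefill p50, stacked total p50) in seconds, copied from the text.
STACKED_P50 = {
    "pdsep-4p4d": (65.6, 66.2),
    "pdsep-6p2d": (43.1, 44.3),
}

shares = {cfg: prefill_share(p, tot) for cfg, (p, tot) in STACKED_P50.items()}

for cfg, share in shares.items():
    prefill, total = STACKED_P50[cfg]
    # Everything that is not prefill compute: queue, handoff, and D-side wait combined.
    print(f"{cfg}: {share:.1%} prefill, {total - prefill:.1f} s in all other stages")
```

Both shares come out above 97 %, and the non-prefill remainder is about a second or less in each split, which is consistent with the 1.2 s bound on the D-side stage.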
+
+The KV-utilization time-series (`figures/fig_c3_kv_timeseries.pdf`)
+tells the full story:
+
+- **combined-ca**: 8 GPUs oscillate 0–98 %, peaks bursty and short
+- **pdsep-4p4d**: P-instances (orange) pinned at 85–100 % the entire
+  2-hour run; D-instances (red) bounce between 10 % and 50 % — only
+  P side hits the wall
+- **pdsep-6p2d**: **both** sides pinned near 100 % the entire run
+  (per-instance peaks 99–100 % across all 8). P-side fills because D
+  back-pressures (D can't free KV slots fast enough → P can't
+  hand off → P-side KV accumulates).
+
+This refines Layer 5: PD separation hits a memory wall on whichever
+side is under-provisioned for its load (P at 4P+4D, D at 6P+2D), and at
+extreme splits it co-saturates both sides through D-side back-pressure.
+
+### Why P-side fills in 4P+4D
+
+Two effects combine on P:
+
+1. **Compute concentration.** Combined spreads prefill across 8 GPUs;
+   4P+4D over 4. Per-P-GPU prefill load is 2× the per-Combined-GPU load.
+   With chunked prefill, multiple in-flight prefills compete for the
+   scheduler.
+
+2. **KV residency on P.** Mooncake does block-by-block transfer *after*
+   the full prefill completes. Until D pulls and acknowledges every
+   block, the completed-but-not-yet-transferred KV sits in P's pool —
+   on top of all the partially-prefilled in-flight KV. Many concurrent
+   33–132 k contexts overwhelm a single 28 GB pool.
+
+This is the same memory-wall mechanism Layer 5 described, but on the
+*prefill* side. The Layer 5 analytical model in
+`figures/fig_kv_memory_wall.pdf` accounted only for *decode-side* KV
+demand. The full model is:
+
+```
+P-side occupancy = (in_flight_prefills × KV_per_req) / (N_P × pool)
+D-side occupancy = (concurrent_decode × KV_per_req) / (N_D × pool)
+```
+
+Whichever side hits the wall first becomes the back-pressure source.
+4P+4D's P-side fills first because 4 GPUs are doing 8 GPUs' worth of
+prefill. 6P+2D's D-side hits the wall (4× concentration), which then
+back-pressures P (no slots to hand off into) until P also fills. In
+either case, *some* side blocks — which is why PD separation regresses
+across the P:D ratios we tested.
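
The concentration factors quoted in this subsection (2× on P at 4P+4D, 4× on D at 6P+2D) follow from dividing the 8-GPU combined pool by the number of GPUs assigned to each role. A minimal sketch of that arithmetic; `concentration` is a hypothetical helper, and the 8-GPU total is taken from the combined configuration described in the text:

```python
TOTAL_GPUS = 8  # combined-ca spreads both prefill and decode over 8 GPUs


def concentration(n_side: int, total: int = TOTAL_GPUS) -> float:
    """How many combined-GPUs' worth of one phase lands on each GPU of that role."""
    return total / n_side


SPLITS = {"pdsep-4p4d": (4, 4), "pdsep-6p2d": (6, 2)}  # (N_P, N_D)

factors = {
    name: {"P": concentration(n_p), "D": concentration(n_d)}
    for name, (n_p, n_d) in SPLITS.items()
}

for name, f in factors.items():
    print(f"{name}: prefill x{f['P']:.2f} per P GPU, decode x{f['D']:.2f} per D GPU")
```

At 4P+4D both roles see the same 2× factor, so GPU count alone does not decide which side fills. The text attributes the P-side wall there to KV residency during prefill and handoff.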
+
+### Updated falsifiable condition
+
+The condition for PD separation to *not* regress is now two-sided:
+
+```
+max(
+  in_flight_prefills × KV_per_req / (N_P × pool),
+  concurrent_decode × KV_per_req / (N_D × pool)
+) < 1
+```
+
+For chatbot workloads (KV/req ≈ 200 MB), this holds easily on either
+side. For agentic with KV/req ≈ 3.3 GB on average and 10–13 GB at the
+tail, both terms cross 1 well below the chosen N_P or N_D.
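
The two-sided condition can be evaluated directly. The sketch below is illustrative only: the 28 GB pool and the per-request KV sizes (0.2 GB chatbot, 3.3 GB agentic average) come from the text, while the concurrency figures `IN_FLIGHT_PREFILLS` and `CONCURRENT_DECODE` are hypothetical placeholders, not measurements from the matrix. The names `Layout`, `occupancy`, and `holds` are likewise made up for this example.

```python
from dataclasses import dataclass

POOL_GB = 28.0           # per-GPU KV pool size quoted in the text
IN_FLIGHT_PREFILLS = 40  # ASSUMPTION: hypothetical cluster-wide concurrency
CONCURRENT_DECODE = 40   # ASSUMPTION: hypothetical cluster-wide concurrency


@dataclass(frozen=True)
class Layout:
    n_p: int  # prefill GPUs
    n_d: int  # decode GPUs


def occupancy(layout: Layout, kv_per_req_gb: float) -> tuple[float, float]:
    """Return (P-side, D-side) occupancy per the two formulas in Layer 5b."""
    p = IN_FLIGHT_PREFILLS * kv_per_req_gb / (layout.n_p * POOL_GB)
    d = CONCURRENT_DECODE * kv_per_req_gb / (layout.n_d * POOL_GB)
    return p, d


def holds(layout: Layout, kv_per_req_gb: float) -> bool:
    """Two-sided condition: neither side may reach a full pool."""
    return max(occupancy(layout, kv_per_req_gb)) < 1.0


LAYOUTS = {"4P+4D": Layout(4, 4), "6P+2D": Layout(6, 2)}
WORKLOADS_GB = {"chatbot": 0.2, "agentic (avg)": 3.3}

for wl, kv in WORKLOADS_GB.items():
    for name, layout in LAYOUTS.items():
        p, d = occupancy(layout, kv)
        verdict = "ok" if holds(layout, kv) else "memory wall"
        print(f"{wl:14s} {name}: P={p:.2f} D={d:.2f} -> {verdict}")
```

Under these assumed concurrency levels the chatbot workload stays far below 1 on both sides, while the agentic average already violates the condition in both layouts. Different concurrency assumptions shift the numbers but not the structure of the check.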
+
+Followups: re-render `fig_kv_memory_wall.pdf` to show both P and D
+curves now that the 6P+2D run has landed, with the empirical P-side and
+D-side peaks marked.
 
 ## Layer 6: the DistServe / Splitwise assumption that silently breaks
 
 To formalize: the regime in which PD separation pays is bounded by both a