D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and --max-inflight-sessions to the replayer, but those flags were removed when the project moved to trace-driven dispatch. The scripts cannot run as-is. D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a handful of single-experiment run_*.sh point at /home/admin/cpfs paths, deleted output directories, or a sampled trace file that no longer exists. Keep them in scripts/legacy/ for historical reference; the scripts that remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit, analyze_eviction, compare_results, compute_roofline, sample_trace, analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh, gpu_monitor.sh, bench.sh) cover the current workflow. Adds scripts/legacy/README.md to document the archival policy.
235 lines
9.6 KiB
Python
235 lines
9.6 KiB
Python
"""System-level profile: why PD-Sep loses to session-sticky PD-combined.
|
|
|
|
Compares per-request breakdown, GPU utilization patterns, KV cache behavior,
|
|
and routing efficiency across configurations to identify the exact mechanisms.
|
|
"""
|
|
import json, csv, statistics, os
|
|
from collections import defaultdict, Counter
|
|
|
|
BLOCK_SIZE = 512
|
|
|
|
def load_metrics(path):
|
|
rows = [json.loads(l) for l in open(path)]
|
|
ok = [r for r in rows if not r.get("error")]
|
|
return rows, ok
|
|
|
|
def load_gpu(path):
|
|
return list(csv.DictReader(open(path)))
|
|
|
|
def pct(v, q):
|
|
return v[min(int(q*len(v)), len(v)-1)] if v else 0
|
|
|
|
# Load all configs that have both metrics + GPU data
|
|
configs = {}
|
|
for d, label, tp, n_inst in [
|
|
("gpu_ab_combined", "TP=1 DP=8 old-CA", 1, 8),
|
|
("gpu_ab_hybrid", "TP=1 DP=8 hybrid", 1, 8),
|
|
("tp2dp4_hybrid", "TP=2 DP=4 hybrid", 2, 4),
|
|
("gpu_ab_pdsep", "PD-Sep 4P+4D", 1, 8),
|
|
("gpu_ab_6p2d", "PD-Sep 6P+2D", 1, 8),
|
|
("adaptive_v2_offload", "Adaptive offload", 1, 8),
|
|
]:
|
|
mp = "outputs/%s/metrics.jsonl" % d
|
|
if not os.path.exists(mp):
|
|
continue
|
|
rows, ok = load_metrics(mp)
|
|
gp = "outputs/%s/gpu_util.csv" % d
|
|
gpu = load_gpu(gp) if os.path.exists(gp) else []
|
|
|
|
ttfts = sorted([r["ttft_s"] for r in ok if r.get("ttft_s")])
|
|
tpots = sorted([r["tpot_s"] for r in ok if r.get("tpot_s") and r["tpot_s"] > 0])
|
|
lats = sorted([r["latency_s"] for r in ok])
|
|
outs = [r.get("actual_output_tokens", 0) or 0 for r in ok]
|
|
|
|
configs[d] = {
|
|
"label": label, "tp": tp, "n_inst": n_inst,
|
|
"ok": len(ok), "n": len(rows),
|
|
"ttfts": ttfts, "tpots": tpots, "lats": lats, "outs": outs,
|
|
"gpu": gpu, "rows": rows, "ok_rows": ok,
|
|
}
|
|
|
|
sep = "=" * 75
|
|
print(sep)
|
|
print(" WHY PD-SEP LOSES: SYSTEM-LEVEL PROFILE")
|
|
print(sep)
|
|
|
|
# ===================================================================
|
|
# EVIDENCE 1: Overhead decomposition (where does the extra time go?)
|
|
# ===================================================================
|
|
print("\n" + "-" * 75)
|
|
print(" EVIDENCE 1: TTFT Overhead Decomposition")
|
|
print("-" * 75)
|
|
|
|
for d in ["gpu_ab_hybrid", "gpu_ab_pdsep", "gpu_ab_6p2d", "tp2dp4_hybrid", "adaptive_v2_offload"]:
|
|
if d not in configs:
|
|
continue
|
|
c = configs[d]
|
|
# Bucket by input length
|
|
buckets = [(0, 5000, "<5k"), (5000, 20000, "5-20k"), (20000, 50000, "20-50k"), (50000, 999999, ">50k")]
|
|
print("\n %s:" % c["label"])
|
|
for lo, hi, blabel in buckets:
|
|
subset = [r for r in c["ok_rows"] if lo <= r["input_length"] < hi and r.get("ttft_s")]
|
|
if not subset:
|
|
continue
|
|
ttfts = sorted([r["ttft_s"] for r in subset])
|
|
n = len(subset)
|
|
print(" %6s: n=%3d TTFT p50=%.3fs p90=%.3fs" % (
|
|
blabel, n, pct(ttfts, .5), pct(ttfts, .9)))
|
|
|
|
# ===================================================================
|
|
# EVIDENCE 2: GPU Utilization efficiency
|
|
# ===================================================================
|
|
print("\n" + "-" * 75)
|
|
print(" EVIDENCE 2: GPU Utilization Efficiency")
|
|
print("-" * 75)
|
|
|
|
for d in ["gpu_ab_hybrid", "tp2dp4_hybrid", "gpu_ab_pdsep", "gpu_ab_6p2d"]:
|
|
if d not in configs or not configs[d]["gpu"]:
|
|
continue
|
|
c = configs[d]
|
|
vals = [float(r["util_pct"]) for r in c["gpu"]]
|
|
nz = sum(1 for v in vals if v > 0)
|
|
n_samples = len(vals) // 8 if len(vals) >= 8 else len(vals)
|
|
|
|
# Compute effective throughput: total output tokens / wall time
|
|
total_out = sum(c["outs"])
|
|
wall = max(c["lats"]) if c["lats"] else 1
|
|
tput = total_out / wall
|
|
|
|
print(" %s:" % c["label"])
|
|
print(" GPU util: mean=%.1f%% active=%d%% (%d samples)" % (
|
|
statistics.fmean(vals), nz * 100 // len(vals), n_samples))
|
|
print(" Output throughput: %.1f tokens/s" % tput)
|
|
print(" Efficiency: %.1f output_tokens per GPU%%" % (tput / max(statistics.fmean(vals), 0.1)))
|
|
|
|
# ===================================================================
|
|
# EVIDENCE 3: KV Cache memory pressure
|
|
# ===================================================================
|
|
print("\n" + "-" * 75)
|
|
print(" EVIDENCE 3: The KV Cache Memory Wall (PD-Sep specific)")
|
|
print("-" * 75)
|
|
print("""
|
|
PD-Sep concentrates ALL decode traffic onto fewer GPUs:
|
|
Combined DP=8: 8 instances, each ~1 concurrent decode request
|
|
PD-Sep 4P+4D: 4 decode instances, each ~2 concurrent decode requests
|
|
PD-Sep 6P+2D: 2 decode instances, each ~4 concurrent decode requests
|
|
|
|
KV cache per TP=1 instance: 281,888 tokens (~550 blocks)
|
|
Average request input: 33,611 tokens (~66 blocks)
|
|
|
|
Combined: 1 req * 66 blocks = 66/550 = 12% KV cache per instance
|
|
PD-Sep 4P+4D: 2 req * 66 blocks = 132/550 = 24% KV cache per decode inst
|
|
PD-Sep 6P+2D: 4 req * 66 blocks = 264/550 = 48% KV cache per decode inst
|
|
|
|
At peak (large requests, 100+ blocks each):
|
|
Combined: 100/550 = 18% per instance (comfortable)
|
|
PD-Sep 6P+2D: 400/550 = 73% per decode inst (near saturation)
|
|
Observed: 97.1% on decode instances (per-request breakdown showed
|
|
87.7% of TTFT was waiting for KV cache memory release)
|
|
""")
|
|
|
|
# ===================================================================
|
|
# EVIDENCE 4: KV Transfer overhead is not free
|
|
# ===================================================================
|
|
print("-" * 75)
|
|
print(" EVIDENCE 4: KV Transfer is Real Overhead")
|
|
print("-" * 75)
|
|
|
|
# Compare same-input requests between combined and PD-Sep
|
|
if "gpu_ab_hybrid" in configs and "gpu_ab_pdsep" in configs:
|
|
c_ok = configs["gpu_ab_hybrid"]["ok_rows"]
|
|
p_ok = configs["gpu_ab_pdsep"]["ok_rows"]
|
|
c_by_id = {r["request_id"]: r for r in c_ok}
|
|
p_by_id = {r["request_id"]: r for r in p_ok}
|
|
common = set(c_by_id.keys()) & set(p_by_id.keys())
|
|
|
|
if common:
|
|
overhead = []
|
|
for rid in common:
|
|
c = c_by_id[rid]
|
|
p = p_by_id[rid]
|
|
if c.get("ttft_s") and p.get("ttft_s") and c["ttft_s"] > 0:
|
|
overhead.append({
|
|
"input": c["input_length"],
|
|
"c_ttft": c["ttft_s"],
|
|
"p_ttft": p["ttft_s"],
|
|
"overhead": p["ttft_s"] - c["ttft_s"],
|
|
"ratio": p["ttft_s"] / c["ttft_s"],
|
|
})
|
|
overhead.sort(key=lambda x: x["input"])
|
|
|
|
print("\n Per-request TTFT: PD-Sep vs Combined (matched requests)")
|
|
print(" %8s %10s %10s %10s %7s" % ("input", "combined", "pdsep", "overhead", "ratio"))
|
|
for o in overhead[:10]:
|
|
print(" %8d %10.3f %10.3f %10.3f %6.1fx" % (
|
|
o["input"], o["c_ttft"], o["p_ttft"], o["overhead"], o["ratio"]))
|
|
|
|
overheads = [o["overhead"] for o in overhead]
|
|
ratios = [o["ratio"] for o in overhead]
|
|
print("\n Overhead stats:")
|
|
print(" Mean: %.3fs extra TTFT per request" % statistics.fmean(overheads))
|
|
print(" Mean ratio: %.1fx slower" % statistics.fmean(ratios))
|
|
|
|
# By input size
|
|
for lo, hi, blabel in [(0, 5000, "<5k"), (5000, 50000, "5-50k"), (50000, 999999, ">50k")]:
|
|
sub = [o for o in overhead if lo <= o["input"] < hi]
|
|
if sub:
|
|
print(" %6s: mean overhead=%.3fs, ratio=%.1fx" % (
|
|
blabel, statistics.fmean([o["overhead"] for o in sub]),
|
|
statistics.fmean([o["ratio"] for o in sub])))
|
|
|
|
# ===================================================================
|
|
# EVIDENCE 5: Session affinity loss in PD-Sep
|
|
# ===================================================================
|
|
print("\n" + "-" * 75)
|
|
print(" EVIDENCE 5: Session Affinity Disruption in PD-Sep")
|
|
print("-" * 75)
|
|
print("""
|
|
In PD-combined: session turn N and turn N+1 go to the SAME instance.
|
|
-> Turn N's KV stays in GPU cache
|
|
-> Turn N+1 gets prefix cache hit (80%+ APC for multi-turn)
|
|
-> Zero KV transfer needed
|
|
|
|
In PD-Sep: turn N's prefill goes to P instance, KV transfers to D instance.
|
|
Turn N+1's prefill goes to P instance again.
|
|
-> P instance does NOT have turn N's KV (it was transferred to D)
|
|
-> Turn N+1 must re-prefill from scratch on P
|
|
-> Then transfer KV to D again
|
|
-> Double penalty: re-prefill + KV transfer
|
|
|
|
This is the fundamental reason PD-Sep destroys multi-turn APC:
|
|
Combined APC for multi-turn: ~80%
|
|
PD-Sep: effectively ~0% for prefill (P never has prior turn's KV)
|
|
The only cache hit is on D, but D doesn't do prefill — it just decodes.
|
|
""")
|
|
|
|
# ===================================================================
|
|
# SUMMARY
|
|
# ===================================================================
|
|
print(sep)
|
|
print(" SUMMARY: 4 MECHANISMS WHY PD-SEP LOSES")
|
|
print(sep)
|
|
print("""
|
|
1. KV CACHE MEMORY WALL: Concentrating decode onto fewer GPUs fills
|
|
KV cache to 97%, causing 100+s waits for memory release.
|
|
Combined distributes across 8 instances, keeping usage <20%.
|
|
|
|
2. KV TRANSFER OVERHEAD: Every PD-Sep request pays RDMA transfer cost
|
|
(even small requests). Combined has zero transfer — KV stays on GPU.
|
|
|
|
3. SESSION AFFINITY BROKEN: Multi-turn sessions lose prefix cache on P
|
|
because prior turn's KV was transferred to D. Combined keeps KV
|
|
on the same instance, achieving 80% multi-turn APC vs ~0% on P.
|
|
|
|
4. GPU UNDERUTILIZATION: PD-Sep decode GPUs idle at 7-19% (memory-bound
|
|
decode doesn't need GPU compute). Combined uses all GPUs flexibly
|
|
at 28-30% average utilization.
|
|
|
|
ROOT CAUSE: PD-Sep was designed for chatbot workloads (short input,
|
|
no prefix sharing, compute-heavy prefill). Agentic workloads have:
|
|
- Long context (33k avg) -> large KV, memory pressure on D
|
|
- High prefix reuse (91% intra-session) -> session-sticky routing essential
|
|
- MoE model (3B active) -> low per-token compute, P-D interference small
|
|
These characteristics make PD-Sep's costs exceed its benefits.
|
|
""")
|