Files
agentic-kvc/scripts/legacy/profile_why_pdsep_loses.py
Gahow Wang 547611e022 scripts: archive obsolete one-off shell/python scripts to legacy/ (D2, D3)
D2: run_benchmark.sh and run_experiments.sh still pass --time-scale and
--max-inflight-sessions to the replayer, but those flags were removed when
the project moved to trace-driven dispatch. The scripts cannot run as-is.

D3: ~25 ad-hoc analyze_* / compare_* / profile_* / final_* scripts and a
handful of single-experiment run_*.sh point at /home/admin/cpfs paths,
deleted output directories, or a sampled trace file that no longer exists.
Keep them in scripts/legacy/ for historical reference; the scripts that
remain in scripts/ (analyze_trace, analyze_breakdown, analyze_cache_hit,
analyze_eviction, compare_results, compute_roofline, sample_trace,
analyze_agentic_patterns, simulate_cache_policies, plus launch_*.sh,
gpu_monitor.sh, bench.sh) cover the current workflow.

Adds scripts/legacy/README.md to document the archival policy.
2026-05-23 20:57:32 +08:00

235 lines
9.6 KiB
Python

"""System-level profile: why PD-Sep loses to session-sticky PD-combined.
Compares per-request breakdown, GPU utilization patterns, KV cache behavior,
and routing efficiency across configurations to identify the exact mechanisms.
"""
import json, csv, statistics, os
from collections import defaultdict, Counter
BLOCK_SIZE = 512
def load_metrics(path):
rows = [json.loads(l) for l in open(path)]
ok = [r for r in rows if not r.get("error")]
return rows, ok
def load_gpu(path):
return list(csv.DictReader(open(path)))
def pct(v, q):
return v[min(int(q*len(v)), len(v)-1)] if v else 0
# Load all configs that have both metrics + GPU data
configs = {}
for d, label, tp, n_inst in [
("gpu_ab_combined", "TP=1 DP=8 old-CA", 1, 8),
("gpu_ab_hybrid", "TP=1 DP=8 hybrid", 1, 8),
("tp2dp4_hybrid", "TP=2 DP=4 hybrid", 2, 4),
("gpu_ab_pdsep", "PD-Sep 4P+4D", 1, 8),
("gpu_ab_6p2d", "PD-Sep 6P+2D", 1, 8),
("adaptive_v2_offload", "Adaptive offload", 1, 8),
]:
mp = "outputs/%s/metrics.jsonl" % d
if not os.path.exists(mp):
continue
rows, ok = load_metrics(mp)
gp = "outputs/%s/gpu_util.csv" % d
gpu = load_gpu(gp) if os.path.exists(gp) else []
ttfts = sorted([r["ttft_s"] for r in ok if r.get("ttft_s")])
tpots = sorted([r["tpot_s"] for r in ok if r.get("tpot_s") and r["tpot_s"] > 0])
lats = sorted([r["latency_s"] for r in ok])
outs = [r.get("actual_output_tokens", 0) or 0 for r in ok]
configs[d] = {
"label": label, "tp": tp, "n_inst": n_inst,
"ok": len(ok), "n": len(rows),
"ttfts": ttfts, "tpots": tpots, "lats": lats, "outs": outs,
"gpu": gpu, "rows": rows, "ok_rows": ok,
}
sep = "=" * 75
print(sep)
print(" WHY PD-SEP LOSES: SYSTEM-LEVEL PROFILE")
print(sep)
# ===================================================================
# EVIDENCE 1: Overhead decomposition (where does the extra time go?)
# ===================================================================
print("\n" + "-" * 75)
print(" EVIDENCE 1: TTFT Overhead Decomposition")
print("-" * 75)
for d in ["gpu_ab_hybrid", "gpu_ab_pdsep", "gpu_ab_6p2d", "tp2dp4_hybrid", "adaptive_v2_offload"]:
if d not in configs:
continue
c = configs[d]
# Bucket by input length
buckets = [(0, 5000, "<5k"), (5000, 20000, "5-20k"), (20000, 50000, "20-50k"), (50000, 999999, ">50k")]
print("\n %s:" % c["label"])
for lo, hi, blabel in buckets:
subset = [r for r in c["ok_rows"] if lo <= r["input_length"] < hi and r.get("ttft_s")]
if not subset:
continue
ttfts = sorted([r["ttft_s"] for r in subset])
n = len(subset)
print(" %6s: n=%3d TTFT p50=%.3fs p90=%.3fs" % (
blabel, n, pct(ttfts, .5), pct(ttfts, .9)))
# ===================================================================
# EVIDENCE 2: GPU Utilization efficiency
# ===================================================================
print("\n" + "-" * 75)
print(" EVIDENCE 2: GPU Utilization Efficiency")
print("-" * 75)
for d in ["gpu_ab_hybrid", "tp2dp4_hybrid", "gpu_ab_pdsep", "gpu_ab_6p2d"]:
if d not in configs or not configs[d]["gpu"]:
continue
c = configs[d]
vals = [float(r["util_pct"]) for r in c["gpu"]]
nz = sum(1 for v in vals if v > 0)
n_samples = len(vals) // 8 if len(vals) >= 8 else len(vals)
# Compute effective throughput: total output tokens / wall time
total_out = sum(c["outs"])
wall = max(c["lats"]) if c["lats"] else 1
tput = total_out / wall
print(" %s:" % c["label"])
print(" GPU util: mean=%.1f%% active=%d%% (%d samples)" % (
statistics.fmean(vals), nz * 100 // len(vals), n_samples))
print(" Output throughput: %.1f tokens/s" % tput)
print(" Efficiency: %.1f output_tokens per GPU%%" % (tput / max(statistics.fmean(vals), 0.1)))
# ===================================================================
# EVIDENCE 3: KV Cache memory pressure
# ===================================================================
print("\n" + "-" * 75)
print(" EVIDENCE 3: The KV Cache Memory Wall (PD-Sep specific)")
print("-" * 75)
print("""
PD-Sep concentrates ALL decode traffic onto fewer GPUs:
Combined DP=8: 8 instances, each ~1 concurrent decode request
PD-Sep 4P+4D: 4 decode instances, each ~2 concurrent decode requests
PD-Sep 6P+2D: 2 decode instances, each ~4 concurrent decode requests
KV cache per TP=1 instance: 281,888 tokens (~550 blocks)
Average request input: 33,611 tokens (~66 blocks)
Combined: 1 req * 66 blocks = 66/550 = 12% KV cache per instance
PD-Sep 4P+4D: 2 req * 66 blocks = 132/550 = 24% KV cache per decode inst
PD-Sep 6P+2D: 4 req * 66 blocks = 264/550 = 48% KV cache per decode inst
At peak (large requests, 100+ blocks each):
Combined: 100/550 = 18% per instance (comfortable)
PD-Sep 6P+2D: 400/550 = 73% per decode inst (near saturation)
Observed: 97.1% on decode instances (per-request breakdown showed
87.7% of TTFT was waiting for KV cache memory release)
""")
# ===================================================================
# EVIDENCE 4: KV Transfer overhead is not free
# ===================================================================
print("-" * 75)
print(" EVIDENCE 4: KV Transfer is Real Overhead")
print("-" * 75)
# Compare same-input requests between combined and PD-Sep
if "gpu_ab_hybrid" in configs and "gpu_ab_pdsep" in configs:
c_ok = configs["gpu_ab_hybrid"]["ok_rows"]
p_ok = configs["gpu_ab_pdsep"]["ok_rows"]
c_by_id = {r["request_id"]: r for r in c_ok}
p_by_id = {r["request_id"]: r for r in p_ok}
common = set(c_by_id.keys()) & set(p_by_id.keys())
if common:
overhead = []
for rid in common:
c = c_by_id[rid]
p = p_by_id[rid]
if c.get("ttft_s") and p.get("ttft_s") and c["ttft_s"] > 0:
overhead.append({
"input": c["input_length"],
"c_ttft": c["ttft_s"],
"p_ttft": p["ttft_s"],
"overhead": p["ttft_s"] - c["ttft_s"],
"ratio": p["ttft_s"] / c["ttft_s"],
})
overhead.sort(key=lambda x: x["input"])
print("\n Per-request TTFT: PD-Sep vs Combined (matched requests)")
print(" %8s %10s %10s %10s %7s" % ("input", "combined", "pdsep", "overhead", "ratio"))
for o in overhead[:10]:
print(" %8d %10.3f %10.3f %10.3f %6.1fx" % (
o["input"], o["c_ttft"], o["p_ttft"], o["overhead"], o["ratio"]))
overheads = [o["overhead"] for o in overhead]
ratios = [o["ratio"] for o in overhead]
print("\n Overhead stats:")
print(" Mean: %.3fs extra TTFT per request" % statistics.fmean(overheads))
print(" Mean ratio: %.1fx slower" % statistics.fmean(ratios))
# By input size
for lo, hi, blabel in [(0, 5000, "<5k"), (5000, 50000, "5-50k"), (50000, 999999, ">50k")]:
sub = [o for o in overhead if lo <= o["input"] < hi]
if sub:
print(" %6s: mean overhead=%.3fs, ratio=%.1fx" % (
blabel, statistics.fmean([o["overhead"] for o in sub]),
statistics.fmean([o["ratio"] for o in sub])))
# ===================================================================
# EVIDENCE 5: Session affinity loss in PD-Sep
# ===================================================================
print("\n" + "-" * 75)
print(" EVIDENCE 5: Session Affinity Disruption in PD-Sep")
print("-" * 75)
print("""
In PD-combined: session turn N and turn N+1 go to the SAME instance.
-> Turn N's KV stays in GPU cache
-> Turn N+1 gets prefix cache hit (80%+ APC for multi-turn)
-> Zero KV transfer needed
In PD-Sep: turn N's prefill goes to P instance, KV transfers to D instance.
Turn N+1's prefill goes to P instance again.
-> P instance does NOT have turn N's KV (it was transferred to D)
-> Turn N+1 must re-prefill from scratch on P
-> Then transfer KV to D again
-> Double penalty: re-prefill + KV transfer
This is the fundamental reason PD-Sep destroys multi-turn APC:
Combined APC for multi-turn: ~80%
PD-Sep: effectively ~0% for prefill (P never has prior turn's KV)
The only cache hit is on D, but D doesn't do prefill — it just decodes.
""")
# ===================================================================
# SUMMARY
# ===================================================================
print(sep)
print(" SUMMARY: 4 MECHANISMS WHY PD-SEP LOSES")
print(sep)
print("""
1. KV CACHE MEMORY WALL: Concentrating decode onto fewer GPUs fills
KV cache to 97%, causing 100+s waits for memory release.
Combined distributes across 8 instances, keeping usage <20%.
2. KV TRANSFER OVERHEAD: Every PD-Sep request pays RDMA transfer cost
(even small requests). Combined has zero transfer — KV stays on GPU.
3. SESSION AFFINITY BROKEN: Multi-turn sessions lose prefix cache on P
because prior turn's KV was transferred to D. Combined keeps KV
on the same instance, achieving 80% multi-turn APC vs ~0% on P.
4. GPU UNDERUTILIZATION: PD-Sep decode GPUs idle at 7-19% (memory-bound
decode doesn't need GPU compute). Combined uses all GPUs flexibly
at 28-30% average utilization.
ROOT CAUSE: PD-Sep was designed for chatbot workloads (short input,
no prefix sharing, compute-heavy prefill). Agentic workloads have:
- Long context (33k avg) -> large KV, memory pressure on D
- High prefix reuse (91% intra-session) -> session-sticky routing essential
- MoE model (3B active) -> low per-token compute, P-D interference small
These characteristics make PD-Sep's costs exceed its benefits.
""")