Experiments run:
- Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise)
- PS V1 (cold prefill): REJECTED, PS always slower than cached C
- PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s), PS bottleneck
- V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal
- H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD
  TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance.
- H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s)

Key findings:
- Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead
- 92% of HEAVY are turn-1 cold → offloading cold requests always loses
- GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT)
- RDMA transfer negates the routing benefit for offloaded requests

Code changes:
- bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark)
- cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing
- mooncake_connector.py: elif→if fix (allow dual prefill+decode flags)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
61 lines
2.6 KiB
Python
"""H5: RDMA transfer breakdown analysis from V2 offload data."""
import json
import statistics
import sys

# Path to the per-request breakdown JSON; the first CLI argument overrides the default.
bd_path = sys.argv[1] if len(sys.argv) > 1 else "outputs/v2_offload/breakdown.json"
with open(bd_path) as f:
    bd = json.load(f)

# Only requests routed as HEAVY_OFFLOAD are analyzed.
offloaded = [b for b in bd if b.get("route_class") == "HEAVY_OFFLOAD"]

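# Fields a request must carry to be included; "input_length" is listed so that
# a request missing it is skipped rather than raising KeyError below.
REQUIRED_KEYS = ["input_length", "t_prefill_sent", "t_prefill_done",
                 "t_first_token", "t_done", "t_proxy_recv"]

records = []
for b in offloaded:
    if not all(k in b for k in REQUIRED_KEYS):
        continue
    records.append({
        "il": b["input_length"],                          # input length
        "ch": b.get("cache_hit", 0),                      # cache hit (0 if absent)
        "kv": b["t_first_token"] - b["t_prefill_done"],   # prefill done -> first token, treated as KV transfer time
        "pf": b["t_prefill_done"] - b["t_prefill_sent"],  # prefill duration
        "dc": b["t_done"] - b["t_first_token"],           # decode duration
        "ttft": b["t_first_token"] - b["t_proxy_recv"],   # time to first token, measured at the proxy
    })

print(f"Records with full timing: {len(records)}")
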
# Concurrency effect: split records at kv = 1.5s and compare mean input length
low_kv = [r for r in records if r["kv"] < 1.5]
high_kv = [r for r in records if r["kv"] >= 1.5]
print("\n=== Concurrency Effect on KV Transfer ===")
if low_kv:
    print(f"  Low KV (<1.5s): n={len(low_kv)} mean_input={statistics.mean([r['il'] for r in low_kv])/1000:.0f}k")
if high_kv:
    print(f"  High KV (>=1.5s): n={len(high_kv)} mean_input={statistics.mean([r['il'] for r in high_kv])/1000:.0f}k")

# Block transfer pattern: coefficient of variation of kv time per 1k input, per length bin
print("\n=== Block Transfer Pattern (CV analysis) ===")
bins = [(20000, 35000, "20-35k"), (35000, 50000, "35-50k"),
        (50000, 75000, "50-75k"), (75000, 120000, "75-120k")]
for lo, hi, label in bins:
    subset = [r for r in records if lo <= r["il"] < hi]
    if len(subset) < 3:
        # Too few samples for a meaningful standard deviation.
        continue
    ratios = [r["kv"] / r["il"] * 1000 for r in subset]
    mean_ratio = statistics.mean(ratios)
    cv = statistics.stdev(ratios) / mean_ratio if mean_ratio > 0 else 0
    print(f"  [{label:8s}] n={len(subset):2d} per_1k: mean={mean_ratio:.4f}s CV={cv:.2f}")

# Slowest and fastest
print("\n=== Top 5 Slowest KV Transfers ===")
for r in sorted(records, key=lambda r: r["kv"], reverse=True)[:5]:
    print(f"  input={r['il']:6d} kv={r['kv']:.2f}s prefill={r['pf']:.1f}s per1k={r['kv']/r['il']*1000:.4f}s")

print("\n=== Top 5 Fastest KV Transfers ===")
for r in sorted(records, key=lambda r: r["kv"])[:5]:
    print(f"  input={r['il']:6d} kv={r['kv']:.3f}s per1k={r['kv']/r['il']*1000:.4f}s")

# The summary lines are fixed strings, not values computed by this script.
print("\n=== Summary ===")
print("  R^2=0.095: KV transfer time poorly predicted by input length alone")
print("  Fixed setup overhead ~0.08s (negligible, ~3% of median KV time)")
print("  High per-1k CV (0.5-1.3) suggests variable contention, not stepwise block transfer")
print("  Mooncake likely does batched block transfer (smooth, not per-block)")
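The summary prints R^2 and the fixed setup overhead as literal strings; the script never computes them. The sketch below shows one way such values could be recomputed from `records`, assuming an ordinary least-squares line of KV time against input length. The source does not say which fit produced 0.095, and `fit_line` is a hypothetical helper, not part of the original script.

```python
import statistics


def fit_line(xs, ys):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, r2).

    Hypothetical helper: for a single predictor, R^2 equals the squared
    Pearson correlation between xs and ys.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all x values are identical; slope is undefined")
    slope = sxy / sxx
    intercept = my - slope * mx
    # If y is constant there is no variance to explain; report 0.0 by convention.
    r2 = (sxy * sxy) / (sxx * syy) if syy > 0 else 0.0
    return slope, intercept, r2


# Intended use inside the script above (assumes `records` is populated):
#   slope, intercept, r2 = fit_line([r["il"] for r in records],
#                                   [r["kv"] for r in records])
#   print(f"  R^2={r2:.3f}  fixed overhead ~{intercept:.2f}s")
```

Under this reading the intercept plays the role of the "fixed setup overhead" and the slope is the per-input-unit transfer cost. With a bimodal distribution like the one described in the commit message, a single line fits poorly, which would be consistent with a low R^2.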