Gahow Wang cd82b8c2a2 PD-sep matrix results: C2/C3/C4 figures + empirical mechanism refined
Captures 5 runs from the experiment matrix (combined-ca x3 seeds,
pdsep-4p4d seed1, pdsep-6p2d seed1) on traces/w600_r0.0015_st30.jsonl
with cuda graphs enabled. The headline:

  combined-ca:  TTFT p50 0.91s   success 99.5%
  pdsep-4p4d:   TTFT p50 62.8s   success 52%   (69x worse, half dropped)
  pdsep-6p2d:   TTFT p50 51.1s   success 68%   (56x worse, third dropped)

C2 (fig_c2): headline bars per config with error bars.
C3 (fig_c3): per-instance KV utilization time-series. Both PD-sep
  splits hit the memory wall, but the side differs by P:D ratio --
  4P+4D pins the P-side, 6P+2D pins both sides (D-side back-pressures
  P-side).
C4 (fig_c4): TTFT stacked breakdown. 99% of PD-sep TTFT is P-side
  prefill compute; D-side wait + first token is <=1.2s. The bottleneck
  is P-side prefill queueing, not D-side decode wait as the original
  analytical model assumed.

system_analysis.md gains a Layer 5b that reconciles the analytical
KV-wall model (which considered D-side only) with the empirical
finding that the wall hits whichever side has fewer GPUs, and
co-saturates both at extreme splits via D-side back-pressure.

plot_pd_matrix.py ingests outputs/pd_matrix/* into all four figures.
bench.sh gained AGENTIC_STEP_LOG_DIR hooks for future runs (set during
this work but not used by the current matrix's data).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 16:23:52 +08:00
Description
No description provided
48 MiB
Languages
Python 82.9%
Shell 17.1%