6243b78bbaabcc648ab0599e3852e578935ff879
Swept session-affinity P routing (MB5_P_ROUTING=session) across all four ratios on the metrics-fixed stack.

Findings:
- Strictly worse than round-robin at every ratio. 4P+4D: round-robin 100% vs session-affinity 36% completion.
- Success DECREASES monotonically as decode capacity grows (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%), which refutes the "session prefill is faster so it needs more D" hypothesis.
- GPUs sit at ~0% utilization (2P+6D entirely idle): the cluster stalls on KV-transfer/admission coordination, not compute. This is the deepest anti-PD argument: paid-for hardware does nothing while requests pile up; colocation keeps every GPU busy.
- Mechanism: session-affinity pins heavy multi-turn sessions onto single producers (producer hot-pinning, same pathology as sticky routing in the colocated §3.3 study); fewer producers -> worse concentration -> the monotonic decline. Failed transfers also pin producer KV (kv_load_failure_policy=fail), compounding to deadlock.

Verdict: neither ratio tuning nor routing policy rescues static PD-disagg for this agentic workload; the failure is structural.

mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
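The producer hot-pinning mechanism named in the commit message can be illustrated with a toy model. The sketch below is not the MB5 router: the function names, the SHA-256 session hash, and the synthetic workload (one heavy multi-turn session plus ten light single-turn sessions) are all assumptions made for illustration, and the token counts are invented rather than measured. It only compares how much of the total prefill load lands on the busiest producer under each policy.

```python
import hashlib
from collections import Counter


def route_round_robin(requests, num_producers):
    """Spread requests over producers in arrival order, ignoring sessions."""
    load = Counter()
    for i, (_session_id, tokens) in enumerate(requests):
        load[i % num_producers] += tokens
    return load


def route_session_affinity(requests, num_producers):
    """Send every request of a session to the same producer (hypothetical hash)."""
    load = Counter()
    for session_id, tokens in requests:
        digest = hashlib.sha256(session_id.encode("utf-8")).digest()
        load[int.from_bytes(digest[:8], "big") % num_producers] += tokens
    return load


def peak_share(load):
    """Fraction of all prefill tokens handled by the busiest producer."""
    return max(load.values()) / sum(load.values())


def synthetic_requests():
    """Synthetic workload: one heavy 10-turn session and ten light sessions."""
    heavy = [("heavy-session", 8000)] * 10
    light = [(f"light-{i}", 1000) for i in range(10)]
    return heavy + light


if __name__ == "__main__":
    reqs = synthetic_requests()
    for n in (6, 4, 3, 2):  # producer counts from the swept ratios
        rr = peak_share(route_round_robin(reqs, n))
        sa = peak_share(route_session_affinity(reqs, n))
        print(f"{n}P: round-robin peak {rr:.2f}, session-affinity peak {sa:.2f}")
```

In this toy workload the heavy session carries 80,000 of the 90,000 tokens, so session affinity puts at least 89% of the load on one producer regardless of producer count, while round-robin splits it roughly evenly. The model covers only load concentration; it does not reproduce KV-transfer failures, admission stalls, or the measured completion rates.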
Languages: Python 82.9%, Shell 17.1%