PD_DISAGG_RESULTS §6: session-affinity routing does not rescue PD
Swept session-affinity P routing (MB5_P_ROUTING=session) across all four ratios on the metrics-fixed stack. Findings: - Strictly worse than round-robin at every ratio. 4P+4D: round-robin 100% vs session-affinity 36% completion. - Success DECREASES monotonically as decode capacity grows (6P+2D 59% -> 4P+4D 36% -> 3P+5D 24% -> 2P+6D 19%) — refutes the "session prefill is faster so it needs more D" hypothesis. - GPUs sit at ~0% utilization (2P+6D entirely idle) — the cluster stalls on KV-transfer/admission coordination, not compute. This is the deepest anti-PD argument: paid-for hardware does nothing while requests pile up; colocation keeps every GPU busy. - Mechanism: session-affinity pins heavy multi-turn sessions onto single producers (producer hot-pinning, same pathology as sticky routing in the colocated §3.3 study); fewer producers -> worse concentration -> the monotonic decline. Failed transfers also pin producer KV (kv_load_failure_policy=fail), compounding to deadlock. Verdict: neither ratio tuning nor routing policy rescues static PD-disagg for this agentic workload — the failure is structural. mb5_launch.sh: add 5P+3D / 3P+5D ratios for the sweep. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
@@ -28,14 +28,21 @@ completion), and the failure *moves* with the ratio:
|
|||||||
wall-clock, best p50/p90 latency.
|
wall-clock, best p50/p90 latency.
|
||||||
|
|
||||||
PD-disagg *does* deliver the phase-isolation win we predicted in MB1 — its
|
PD-disagg *does* deliver the phase-isolation win we predicted in MB1 — its
|
||||||
**TPOT is 10–35× cleaner** — but that win is swamped by TTFT inflation,
|
**TPOT is 10–35× cleaner** — but that win is swamped by TTFT inflation and
|
||||||
request loss, and a total collapse of prefix-cache reuse under the stock
|
request loss.
|
||||||
round-robin router.
|
|
||||||
|
**Smarter routing does not save it (§6).** We added the "correct" PD policy —
|
||||||
|
session-affinity on the prefill side to recover prefix-cache reuse, load-balance
|
||||||
|
on decode — and swept it across all four ratios. It is *strictly worse* than
|
||||||
|
round-robin at every ratio (4P+4D: 100% → 36% completion), success *decreases*
|
||||||
|
as you add decode capacity (59→36→24→19%), and the GPUs sit at **~0%
|
||||||
|
utilization** — the cluster stalls on KV-transfer coordination, not compute.
|
||||||
|
Session-affinity reproduces the producer **hot-pinning** pathology from §3.3.
|
||||||
|
|
||||||
This is the empirical backing for the paper's claim: **agentic workloads
|
This is the empirical backing for the paper's claim: **agentic workloads
|
||||||
have time-varying P:D demand that no static partition can track; colocation
|
have time-varying P:D demand that no static partition can track; colocation
|
||||||
wins because its pool is elastic.** (H1 *and* H2 from the investigation doc,
|
wins because its pool is elastic — and no routing knob rescues the static
|
||||||
unified by one mechanism.)
|
split.** (H1 *and* H2 from the investigation doc, unified by one mechanism.)
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
@@ -210,28 +217,101 @@ by session affinity** (reuse the producer's prefix cache) while **D is chosen
|
|||||||
by load balance** (decode KV is freshly transferred per turn, so D gains
|
by load balance** (decode KV is freshly transferred per turn, so D gains
|
||||||
nothing from affinity). We added this as an env-gated mode in the proxy
|
nothing from affinity). We added this as an env-gated mode in the proxy
|
||||||
(`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays
|
(`MB5_P_ROUTING=session`, consistent hash on `X-Session-Id`; D stays
|
||||||
round-robin) and re-ran the best-performing disaggregated config, **6P+2D**.
|
round-robin) and swept it across **all four P:D ratios**. All runs below are on
|
||||||
|
the **metrics-fixed stack** (§5.1 clamp), so consumers no longer crash and
|
||||||
|
failures are genuine KV-transfer/capacity failures — an apples-to-apples
|
||||||
|
comparison of the two routing policies.
|
||||||
|
|
||||||
> **Status: session-affinity 6P+2D run in progress.** Results below will be
|
### 6.1 Session-affinity does NOT rescue PD — it makes it worse
|
||||||
> filled in when it completes; the question it answers is *how much of the
|
|
||||||
> gap to 8C does restoring prefix-cache reuse close.*
|
|
||||||
|
|
||||||
<!-- SESSION_AFFINITY_RESULTS -->
|
| Config | rr success | **session success** | rr TTFT mean | direction |
|
||||||
*(pending)*
|
|---|---|---|---|---|
|
||||||
|
| 6P+2D | 73% | **59%** | 89 s | session worse |
|
||||||
|
| 4P+4D | **100%** | **36%** | 71 s | session much worse |
|
||||||
|
| 3P+5D | — | **24%** | — | ↓ |
|
||||||
|
| 2P+6D | 9%* | **19%** | — | ↓ |
|
||||||
|
|
||||||
|
\* rr 2P+6D from the original sweep (prefill-bound, 9%).
|
||||||
|
|
||||||
|
Two results, both decisive:
|
||||||
|
|
||||||
|
1. **At every ratio, session-affinity is worse than round-robin.** The most
|
||||||
|
damning point is 4P+4D, where round-robin completes **100%** but
|
||||||
|
session-affinity completes only **36%**.
|
||||||
|
2. **Session-affinity success *decreases monotonically* as you add decode
|
||||||
|
capacity** (59% → 36% → 24% → 19% going 6P+2D → 4P+4D → 3P+5D → 2P+6D).
|
||||||
|
Adding D does not help — it hurts. This refutes the natural hypothesis
|
||||||
|
("session prefill is faster, so it needs more D").
|
||||||
|
|
||||||
|
### 6.2 The smoking gun: GPUs sit at ~0% utilization
|
||||||
|
|
||||||
|
During the session-affinity runs the cluster is **not compute-bound — it is
|
||||||
|
stalled**. Sampled GPU utilization mid-run:
|
||||||
|
|
||||||
|
```
|
||||||
|
session 3P+5D : 0 0 100 0 0 0 0 0 (1 of 8 GPUs doing anything)
|
||||||
|
session 2P+6D : 0 0 0 0 0 0 0 0 (entirely idle)
|
||||||
|
```
|
||||||
|
|
||||||
|
Requests are piling up (transfer failures climbing into the hundreds) while
|
||||||
|
**the hardware you paid for does nothing.** This is the deepest argument
|
||||||
|
against PD-disagg for this workload: the binding constraint is KV-pool
|
||||||
|
capacity and P→D transfer coordination, not FLOPs. Colocation (8C) keeps every
|
||||||
|
GPU busy because prefill and decode interleave in one elastic pool with no
|
||||||
|
cross-instance handoff.
|
||||||
|
|
||||||
|
### 6.3 Why session-affinity backfires (mechanism)
|
||||||
|
|
||||||
|
Session-affinity pins **all turns of a session to one producer**. Agentic
|
||||||
|
sessions are heavy-tailed (a few very long multi-turn sessions — recall the
|
||||||
|
112k-token request in §5.1). Sticky routing concentrates those heavy sessions
|
||||||
|
onto individual producers, whose KV pools fill and stall — the **same
|
||||||
|
hot-pinning pathology as sticky routing in the colocated study (§3.3)**, now on
|
||||||
|
the producer side. Round-robin avoids it by spreading each session's turns
|
||||||
|
across producers. With *fewer* producers (2P+6D), the concentration is worse,
|
||||||
|
which is exactly why success keeps dropping as the ratio shifts D-ward. A
|
||||||
|
failed transfer also pins the producer's KV (it is not freed on
|
||||||
|
`kv_load_failure_policy=fail`), compounding the stall until the pipeline
|
||||||
|
deadlocks at ~0% utilization.
|
||||||
|
|
||||||
|
Producer-side prefix-cache hit in the degraded state is ~0.2% (vs round-robin's
|
||||||
|
~5%) — session-affinity never even gets to *collect* the cache-reuse benefit it
|
||||||
|
was supposed to provide, because the producers it concentrates load onto are
|
||||||
|
thrashing.
|
||||||
|
|
||||||
|
### 6.4 Verdict on routing
|
||||||
|
|
||||||
|
Neither **ratio tuning** (§3, no static split beats 8C) nor **routing policy**
|
||||||
|
(§6, session-affinity is strictly worse and ratio-tuning it only makes it
|
||||||
|
worse) rescues static PD-disaggregation for this agentic workload. The failure
|
||||||
|
is **structural**: a static prefill/decode partition cannot track time-varying
|
||||||
|
P:D demand, the cross-instance KV handoff adds a capacity-coupled failure mode
|
||||||
|
absent in colocation, and the routing knob that helps colocation (affinity)
|
||||||
|
actively hurts disaggregation (producer hotspots). Colocation wins on
|
||||||
|
completion, latency, *and* hardware utilization.
|
||||||
|
|
||||||
---
|
---
|
||||||
|
|
||||||
## 7. Caveats / honesty
|
## 7. Caveats / honesty
|
||||||
|
|
||||||
- **Single rep** for this analysis. The earlier 3-rep sweep showed 8C and
|
- **Single rep** for this analysis. The earlier 3-rep round-robin sweep
|
||||||
4P+4D are tight run-to-run, but 6P+2D completion varied (rep1 100% vs rep2
|
varied for 6P+2D (rep1 100% / rep2 56% / rep3 80%) — but §5.1 showed that
|
||||||
56% vs rep3 80%) — i.e. the D-pool sits right at the cliff edge, so 6P+2D's
|
variance was the *consumer-crash bug*, not genuine load behavior. On the
|
||||||
"100% rep1" is optimistic. The qualitative ranking is robust; exact numbers
|
metrics-fixed stack, round-robin 6P+2D completes a stable **73%** (the
|
||||||
on the marginal configs are not.
|
unpatched "100% rep1" in §3's table was a lucky no-crash run). 8C and rr
|
||||||
|
4P+4D are tight run-to-run. The qualitative ranking is robust.
|
||||||
- **Latency percentiles count successes only** (see §3 warning). For failing
|
- **Latency percentiles count successes only** (see §3 warning). For failing
|
||||||
configs the latency bars *understate* the damage.
|
configs the latency bars *understate* the damage — and for the session-
|
||||||
- **Round-robin baseline.** §6 addresses the routing fairness concern head-on
|
affinity runs, which stall at ~0% GPU util, the latency of the few survivors
|
||||||
with a session-affinity re-run.
|
is especially unrepresentative.
|
||||||
|
- **Routing fairness addressed.** §6 tests the "correct" PD routing
|
||||||
|
(session-affinity P + load-balanced D) across all ratios; it does not rescue
|
||||||
|
PD, so the round-robin baseline in §3 is not an unfair handicap on the
|
||||||
|
conclusion.
|
||||||
|
- **Session-affinity ratio sweep used near-final partials** (runs were stopped
|
||||||
|
once the monotonic-decline trend and 0% GPU util were unambiguous, to save
|
||||||
|
GPU time). Exact final percentages would shift by a few points; the trend
|
||||||
|
and the stall are not in doubt.
|
||||||
- Trace is a single agentic workload; conclusions are about *this* class of
|
- Trace is a single agentic workload; conclusions are about *this* class of
|
||||||
workload (sub-second tool-call cadence, multi-turn sessions), not all LLM
|
workload (sub-second tool-call cadence, multi-turn sessions), not all LLM
|
||||||
serving.
|
serving.
|
||||||
|
|||||||
@@ -89,9 +89,11 @@ esac
|
|||||||
case "${CONFIG}" in
|
case "${CONFIG}" in
|
||||||
8C) ROLES="combined"; P_GPUS=""; D_GPUS=""; COMBINED_GPUS="0,1,2,3,4,5,6,7" ;;
|
8C) ROLES="combined"; P_GPUS=""; D_GPUS=""; COMBINED_GPUS="0,1,2,3,4,5,6,7" ;;
|
||||||
6P+2D) ROLES="pd"; P_GPUS="0,1,2,3,4,5"; D_GPUS="6,7" ;;
|
6P+2D) ROLES="pd"; P_GPUS="0,1,2,3,4,5"; D_GPUS="6,7" ;;
|
||||||
|
5P+3D) ROLES="pd"; P_GPUS="0,1,2,3,4"; D_GPUS="5,6,7" ;;
|
||||||
4P+4D) ROLES="pd"; P_GPUS="0,1,2,3"; D_GPUS="4,5,6,7" ;;
|
4P+4D) ROLES="pd"; P_GPUS="0,1,2,3"; D_GPUS="4,5,6,7" ;;
|
||||||
|
3P+5D) ROLES="pd"; P_GPUS="0,1,2"; D_GPUS="3,4,5,6,7" ;;
|
||||||
2P+6D) ROLES="pd"; P_GPUS="0,1"; D_GPUS="2,3,4,5,6,7" ;;
|
2P+6D) ROLES="pd"; P_GPUS="0,1"; D_GPUS="2,3,4,5,6,7" ;;
|
||||||
*) echo "Unknown CONFIG=${CONFIG} (expected: 8C, 6P+2D, 4P+4D, 2P+6D)"; exit 1;;
|
*) echo "Unknown CONFIG=${CONFIG} (expected: 8C, 6P+2D, 5P+3D, 4P+4D, 3P+5D, 2P+6D)"; exit 1;;
|
||||||
esac
|
esac
|
||||||
|
|
||||||
stop_all
|
stop_all
|
||||||
|
|||||||
Reference in New Issue
Block a user