Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the
cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce
prefill-decode interference. Offloaded requests are 50% SLOWER due to
P-side queuing (14.7s) + RDMA overhead (5.7s).
Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill.
But elastic PS in current form can't address it because cold HEAVY prefills
(the majority) can't benefit from offload.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact)
but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU,
migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) on all
metrics (<2%), proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>