Gahow Wang 9835d6af5d Elastic PS eval: near-neutral, offload gate triggers only 14% of HEAVY
Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the
cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce
prefill-decode interference. Offloaded requests are 50% SLOWER due to
P-side queuing (14.7s) + RDMA overhead (5.7s).

Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill.
But elastic PS in current form can't address it because cold HEAVY prefills
(the majority) can't benefit from offload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 16:49:25 +08:00
Description
No description provided
48 MiB
Languages
Python 82.9%
Shell 17.1%