All 8 GPUs stay PD-combined. Global scheduler classifies requests as
WARM/MEDIUM/HEAVY based on estimated new tokens after prefix cache.
Only HEAVY requests (20%, cold start >20k new tokens) get offloaded;
80% of requests are co-located with zero KV transfer.
This avoids the KV cache memory wall (no decode concentration) while
isolating heavy prefills from decode when needed.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>