Added --heavy-threshold to cache_aware_proxy.py. HEAVY requests (new
tokens >= threshold) route to the instance with the least decode load;
WARM/MEDIUM requests route by cache hit + token-level LB as before.
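
Rough sketch of the routing rule, for illustration only. This is not the
code in cache_aware_proxy.py: Instance, pick_instance, cache_hits and the
load fields are hypothetical names, and "new tokens" is assumed to mean
prompt tokens not covered by the best cache hit.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    decode_load: int     # hypothetical: tokens currently being decoded
    pending_tokens: int  # hypothetical: queued tokens, used for token-level LB


def pick_instance(instances, cache_hits, prompt_tokens, heavy_threshold):
    """Pick a target instance for one request.

    cache_hits maps instance name -> cached prompt tokens for this request.
    """
    # Assumption: new tokens = prompt tokens minus the best available cache hit.
    best_hit = max(cache_hits.get(i.name, 0) for i in instances)
    new_tokens = max(prompt_tokens - best_hit, 0)

    if new_tokens >= heavy_threshold:
        # HEAVY: ignore cache affinity, go to the least decode-loaded instance.
        return min(instances, key=lambda i: i.decode_load)

    # WARM/MEDIUM: largest cache hit first, ties broken by token-level load.
    return min(
        instances,
        key=lambda i: (-cache_hits.get(i.name, 0), i.pending_tokens),
    )


if __name__ == "__main__":
    pool = [Instance("a", decode_load=10, pending_tokens=0),
            Instance("b", decode_load=2, pending_tokens=5)]
    hits = {"a": 900}
    print(pick_instance(pool, hits, 1000, 512).name)  # cache-hit path
    print(pick_instance(pool, hits, 4000, 512).name)  # heavy path
```

With one instance holding a 900-token hit and a threshold of 512, the
1000-token prompt follows the cache hit and the 4000-token prompt goes to
the instance with the lower decode load.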
Result: no significant difference vs baseline on single-machine combined mode.
TTFT: +1.2%, TPOT: -1.5%, E2E: -0.3% (all within noise)
Per-class TTFT breakdown shows the optimization target:
WARM (75 req): p50=0.198s (cache hit, nearly free)
MEDIUM (72 req): p50=1.356s
HEAVY (54 req): p50=7.124s (36x slower than WARM)
Conclusion: single-machine combined mode already distributes load well
enough that adaptive routing adds no benefit. True isolation of HEAVY
prefills requires cross-machine offload (v2 with Mooncake or multi-node).
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>