Design: offload a HEAVY request's prefill to P only when the P instance
is less loaded than D AND P is not overloaded (load < 1.5x avg). Session
stickiness on D is preserved for future KV reuse, and external KV is
registered correctly in the prefix cache.
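
A minimal sketch of the gate described above, not the actual code in
this change. The function name, the scalar load values, and the meaning
of "avg" (assumed here to be the mean load across instances) are
hypothetical; only the two conditions and the 1.5x factor come from the
design note.

```python
def should_offload_prefill(
    is_heavy: bool,
    p_load: float,
    d_load: float,
    avg_load: float,
    overload_factor: float = 1.5,  # the "1.5x avg" threshold from the design note
) -> bool:
    """Return True if a request's prefill should be offloaded to P.

    Hypothetical helper: how load is measured (queued tokens, in-flight
    requests, ...) is an assumption and is left to the caller.
    """
    # Only HEAVY requests are candidates for offload.
    if not is_heavy:
        return False
    # Condition 1: P must be less loaded than the session's D instance.
    p_less_loaded = p_load < d_load
    # Condition 2: P must stay under the overload threshold.
    p_not_overloaded = p_load < overload_factor * avg_load
    return p_less_loaded and p_not_overloaded
```

When the gate returns False the request keeps its prefill on D, so the
session stays sticky either way. With strict comparisons, a cluster
whose average load is 0 never offloads in this sketch; how the real gate
treats that case is not stated here.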
Result (67/200 processed, 75% success):
TTFT p50: 0.551s (-49% vs baseline 1.080s)
TTFT p90: 4.135s (-56% vs baseline 9.410s)
TPOT p90: 0.074s (same as baseline)
E2E p50: 2.938s (-45% vs baseline 5.306s)
The 25% error rate comes from ReadTimeout on very large HEAVY requests
queuing on P; this needs a stricter elastic gate or a higher timeout.
Requests that succeeded still show significant improvement over both
baseline and previous P2P.
Also: added external_prefix_cache metrics tracking to replayer summary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>