8d422c4301bb2f90c24da8c94fe4d864bf362ee6
Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate, w600_r0.0015_st30_first600s trace). Key findings: - At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for 95% of routing decisions because instances complete prefill before the next request arrives. The relative arm (src_pp > fleet_median*1.5) never fires. - At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of eligible decisions. Trigger correctly identifies genuinely overloaded instances (src_pp 13k–73k vs fleet median 3.8k–33k). Conclusion: mechanism is correct but migration benefit requires higher concurrency (scale-out or >3x QPS) where queue pressure makes the signal non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing) is sufficient and Pillar 2 gracefully degrades to no-op. Next: scale-out validation (16+ GPU) where session skew naturally concentrates load and triggers migration. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Python
82.9%
Shell
17.1%