Ran unified vs unified_v4 A/B on dash2 (8×H20, kv_both+DR-fix substrate,
w600_r0.0015_st30_first600s trace). Key findings:
- At 1x QPS (~1.3 req/s): zero migrations. pending_prefill_tokens is 0 for
95% of routing decisions because instances complete prefill before the next
request arrives. The relative arm (src_pp > fleet_median*1.5) never fires.
- At 2x QPS (~2.7 req/s): 4 migrations (0.5%). src_pp>0 rises to 24% of
eligible decisions. Trigger correctly identifies genuinely overloaded
instances (src_pp 13k–73k vs fleet median 3.8k–33k).
Conclusion: mechanism is correct but migration benefit requires higher
concurrency (scale-out or >3x QPS) where queue pressure makes the signal
non-trivial. At single-node 8-instance scale, Pillar 1 (affinity routing)
is sufficient and Pillar 2 gracefully degrades to no-op.
Next: scale-out validation (16+ GPU) where session skew naturally
concentrates load and triggers migration.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>