Migration Policy Design: Improving Load Balance in Elastic KV
Problem Statement
With the unified cost model (v3), elastic routing achieves TTFT p90 -37% vs baseline on WARM/MEDIUM requests. However, HEAVY turn>=2 requests with 99% cache hit still suffer TTFT 5-150s due to queuing contention on overloaded instances.
Root cause: the cost model combines cache benefit and queuing into a single
scalar. When cache hit is 99%, the cost is dominated by the queue estimate, but
the queue is estimated inaccurately as (pending_prefill + decode_tokens) / throughput, a token-based proxy that ignores batch size, the real driver of contention.
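To make the failure mode concrete, here is a minimal sketch of the token-based proxy. The throughput value and the two instance profiles are hypothetical, chosen only to show that instances with very different batch sizes can receive the same estimate.

```python
def token_queue_estimate_s(pending_prefill: int, decode_tokens: int, throughput: float) -> float:
    """v3 proxy: outstanding tokens divided by token throughput, in seconds."""
    return (pending_prefill + decode_tokens) / throughput


# Hypothetical numbers for illustration, not measurements.
THROUGHPUT = 4000.0  # tokens/s (assumed)

# Instance X: 2 running requests with 3000 decode tokens each.
few_long = token_queue_estimate_s(2000, 2 * 3000, THROUGHPUT)
# Instance Y: 30 running requests with 200 decode tokens each.
many_short = token_queue_estimate_s(2000, 30 * 200, THROUGHPUT)

# Both estimates are identical: the proxy cannot see the batch size.
print(few_long, many_short)
```

Under this proxy the router has no reason to prefer instance X over instance Y, even though Y is running fifteen times as many requests.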
Key data (v3, 850 requests, 8 instances):
- 391 turn>=2 HEAVY LOCAL requests were migration candidates
- 298 (76%) had cache>80% — affinity held correctly
- 38 of those 298 (13%) had TTFT>5s despite 94-99% cache hit (queuing victims)
- Only 8 offloads triggered total (2 real migrations, 6 useless turn-1 offloads)
- Theoretical TTFT for turn2+ HEAVY: mean=0.81s (actual: 4.73s, 5.8x gap)
Approach A: Contention-Aware Cost Model [ADOPTED]
Replace the queue estimate (pending_prefill + decode_tokens) / throughput with
num_requests * decode_iteration_s + pending_prefill / throughput.
num_requests (batch size) is the primary driver of decode iteration time
and thus of real contention.
Add a migration discount for sessions with accumulated cache (turn >= 2), reflecting the long-term value of migrating a session off a loaded instance.
Parameters
- decode_iteration_s = 0.05 (per-request decode iteration cost on H20)
- migration_discount_cap = 5 (max turns to discount)
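The contention-aware estimate can be sketched as follows. The value decode_iteration_s = 0.05 comes from the parameters above; the throughput and instance loads are hypothetical. The migration discount is left out of the sketch because its exact functional form is not given here.

```python
DECODE_ITERATION_S = 0.05  # per-request decode iteration cost on H20


def contention_queue_estimate_s(
    num_requests: int,
    pending_prefill: int,
    throughput: float,
    decode_iteration_s: float = DECODE_ITERATION_S,
) -> float:
    """Approach A: batch-size term plus pending prefill time, in seconds."""
    return num_requests * decode_iteration_s + pending_prefill / throughput


# Hypothetical numbers for illustration, not measurements.
THROUGHPUT = 4000.0  # tokens/s (assumed)

light = contention_queue_estimate_s(2, 2000, THROUGHPUT)   # 2 * 0.05 + 0.5
busy = contention_queue_estimate_s(30, 2000, THROUGHPUT)   # 30 * 0.05 + 0.5

print(round(light, 3), round(busy, 3))
```

With the same pending prefill, the instance running 30 requests is now scored as more than three times as costly as the one running 2, which is the signal the token-based proxy could not provide.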
Results (vs baseline, 850 requests, 8×H20)
| Metric | Baseline | Approach A | Change |
|---|---|---|---|
| ALL TTFT mean | 5.639 | 3.675 | -35% |
| ALL TTFT p90 | 16.058 | 7.638 | -52% |
| MEDIUM TTFT p90 | 4.412 | 1.681 | -62% |
| HEAVY TTFT p90 | 23.780 | 15.929 | -33% |
| ALL TPOT p90 | 0.105 | 0.075 | -28% |
| ALL E2E p50 | 7.446 | 6.429 | -14% |
| Errors | 0 | 0 | — |
Approach B: Session-Level Lazy Migration [UNDER TUNING]
Add a migration trigger before the cost model. When a request arrives for a session on an overloaded instance, force migration if:
- Instance busy: num_requests > avg * migration_request_factor
- Session has cache: cache_ratio > 0.5
- Request is HEAVY: input_length >= heavy_threshold
- Target meaningfully less loaded: target.num_requests < source - 2
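The four conditions above can be combined into a single predicate. This is a sketch under stated assumptions: the InstanceLoad type, the function name, and the heavy_threshold value of 8000 tokens are hypothetical, and "source - 2" is read as source.num_requests - 2.

```python
from dataclasses import dataclass


@dataclass
class InstanceLoad:
    """Hypothetical snapshot of one instance's load."""
    num_requests: int


def should_force_migration(
    source: InstanceLoad,
    target: InstanceLoad,
    avg_requests: float,
    cache_ratio: float,
    input_length: int,
    heavy_threshold: int,
    migration_request_factor: float = 1.5,
) -> bool:
    """Return True only when all four trigger conditions hold."""
    instance_busy = source.num_requests > avg_requests * migration_request_factor
    session_has_cache = cache_ratio > 0.5
    request_is_heavy = input_length >= heavy_threshold
    target_less_loaded = target.num_requests < source.num_requests - 2
    return instance_busy and session_has_cache and request_is_heavy and target_less_loaded


HEAVY_THRESHOLD = 8000  # tokens (assumed value, for illustration only)

# Source well above 1.5x the average: the trigger fires.
overloaded = should_force_migration(
    InstanceLoad(12), InstanceLoad(4), avg_requests=6.0,
    cache_ratio=0.95, input_length=10000, heavy_threshold=HEAVY_THRESHOLD,
)
# Source only moderately above average (8 vs. threshold 9): no trigger.
moderate = should_force_migration(
    InstanceLoad(8), InstanceLoad(4), avg_requests=6.0,
    cache_ratio=0.95, input_length=10000, heavy_threshold=HEAVY_THRESHOLD,
)
print(overloaded, moderate)
```

Because the conditions are conjunctive, a single strict threshold (here the 1.5x busy check) is enough to suppress every migration.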
Results (A+B combined, migration_request_factor=1.5)
0 migrations triggered. Approach A's contention-aware routing already distributes load well enough that no instance reaches 1.5x the average num_requests, so the busy condition never holds. The threshold needs to be lowered or the trigger redesigned.
Next steps
- Lower migration_request_factor (e.g. 1.2 or 1.3)
- Consider absolute threshold instead of relative (e.g. > avg + 3)
- Or trigger based on recent TTFT rather than instantaneous num_requests
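The three candidate "busy" checks can be compared side by side. This is an illustrative sketch only: the function names, the example loads, and the 5 s TTFT limit are assumptions, and the TTFT variant uses a plain mean over a recent window as one possible reading of "recent TTFT".

```python
from typing import Sequence


def busy_relative(num_requests: int, avg_requests: float, factor: float) -> bool:
    """Relative threshold: busy if above factor times the fleet average."""
    return num_requests > avg_requests * factor


def busy_absolute(num_requests: int, avg_requests: float, margin: int = 3) -> bool:
    """Absolute threshold: busy if above the fleet average plus a fixed margin."""
    return num_requests > avg_requests + margin


def busy_recent_ttft(recent_ttfts_s: Sequence[float], ttft_limit_s: float) -> bool:
    """TTFT-based: busy if the mean of recent TTFT samples exceeds a limit."""
    if not recent_ttfts_s:
        return False
    return sum(recent_ttfts_s) / len(recent_ttfts_s) > ttft_limit_s


# Hypothetical instance: 8 running requests against a fleet average of 6.
checks = {
    "relative 1.5": busy_relative(8, 6.0, 1.5),   # 8 > 9.0
    "relative 1.3": busy_relative(8, 6.0, 1.3),   # 8 > 7.8
    "relative 1.2": busy_relative(8, 6.0, 1.2),   # 8 > 7.2
    "absolute +3": busy_absolute(8, 6.0, 3),      # 8 > 9.0
    "recent TTFT": busy_recent_ttft([4.0, 6.5, 7.5], ttft_limit_s=5.0),  # mean 6.0 > 5.0
}
for name, fired in checks.items():
    print(f"{name}: {fired}")
```

In this example only the lowered relative factors and the TTFT check fire. The relative and absolute thresholds also behave differently as the average changes: at a low average the relative check is the easier one to trip, at a high average the absolute one is.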
Evolution of Results
| Version | Description | ALL TTFT p90 (s) | HEAVY TTFT p90 (s) | tok max/min |
|---|---|---|---|---|
| Baseline | linear routing | 16.058 | 23.780 | 2.7x |
| v2 (bug) | unified, queue=prefill only | 23.339 | 38.070 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 18.471 | 2.6x |
| A | +num_requests contention | 7.638 | 15.929 | 3.5x |
| A+B | +session migration (1.5x) | 8.291 | 16.384 | 3.0x |