Add migration policy design doc with A/B experiment results
Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold; needs tuning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
78
docs/migration-policy-design.md
Normal file
@@ -0,0 +1,78 @@
# Migration Policy Design: Improving Load Balance in Elastic KV

## Problem Statement

With the unified cost model (v3), elastic routing achieves TTFT p90 -37% vs
baseline on WARM/MEDIUM requests. However, **HEAVY turn>=2 requests with 99%
cache hit still suffer TTFT of 5-150s due to queuing contention** on overloaded
instances.

Root cause: the cost model combines cache benefit and queuing into a single
scalar. When cache hit is 99%, the cost is dominated by the queue estimate, and
that estimate is inaccurate: `(pending_prefill + decode_tokens) / throughput`
is a token-based proxy that ignores batch size, the real source of contention.

**Key data (v3, 850 requests, 8 instances):**

- 391 turn>=2 HEAVY LOCAL requests were migration candidates
- 298 (76%) had cache>80%, so affinity held correctly
- **38 of those 298 (13%) had TTFT>5s** despite 94-99% cache hit (queuing victims)
- Only 8 offloads triggered in total (2 real migrations, 6 useless turn-1 offloads)
- Theoretical TTFT for turn2+ HEAVY: mean=0.81s (actual: 4.73s, **5.8x gap**)

## Approach A: Contention-Aware Cost Model [ADOPTED]

Replace `(pending_prefill + decode_tokens) / throughput` with
`num_requests * decode_iteration_s + pending_prefill / throughput` as the
queue estimation. `num_requests` (batch size) is the primary driver of
decode iteration time and thus real contention.

Add a migration discount for sessions with accumulated cache (turn >= 2),
reflecting the long-term value of migrating a session off a loaded instance.

### Parameters

- `decode_iteration_s = 0.05` (per-request decode iteration cost on H20)
- `migration_discount_cap = 5` (max turns to discount)
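
To make the difference between the two queue estimates concrete, here is a minimal Python sketch. The `InstanceLoad` class, its field names, and the sample numbers are illustrative assumptions, not the router's actual types; only the two formulas and `decode_iteration_s = 0.05` come from this document. The migration discount is left out because its exact form is not specified here.

```python
from dataclasses import dataclass

DECODE_ITERATION_S = 0.05  # value from the Parameters list


@dataclass
class InstanceLoad:
    """Hypothetical load snapshot of one instance (field names are illustrative)."""
    num_requests: int     # requests currently in the batch
    pending_prefill: int  # prefill tokens still queued
    decode_tokens: int    # decode tokens outstanding
    throughput: float     # tokens per second (assumed unit)


def queue_estimate_v3(load: InstanceLoad) -> float:
    """Token-based proxy used by v3."""
    return (load.pending_prefill + load.decode_tokens) / load.throughput


def queue_estimate_a(load: InstanceLoad,
                     decode_iteration_s: float = DECODE_ITERATION_S) -> float:
    """Contention-aware estimate from Approach A."""
    return (load.num_requests * decode_iteration_s
            + load.pending_prefill / load.throughput)


# Made-up example: same token totals, very different batch sizes.
few_long = InstanceLoad(num_requests=4, pending_prefill=2000,
                        decode_tokens=8000, throughput=10000.0)
many_short = InstanceLoad(num_requests=40, pending_prefill=2000,
                          decode_tokens=8000, throughput=10000.0)

for name, load in (("few_long", few_long), ("many_short", many_short)):
    print(name, queue_estimate_v3(load), queue_estimate_a(load))
```

With identical token totals, the v3 proxy scores both instances the same, while the Approach A estimate grows with batch size.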

### Results (vs baseline, 850 requests, 8×H20; latencies in seconds)

| Metric          | Baseline | Approach A | Change   |
|-----------------|----------|------------|----------|
| ALL TTFT mean   | 5.639    | 3.675      | -35%     |
| ALL TTFT p90    | 16.058   | 7.638      | **-52%** |
| MEDIUM TTFT p90 | 4.412    | 1.681      | **-62%** |
| HEAVY TTFT p90  | 23.780   | 15.929     | -33%     |
| ALL TPOT p90    | 0.105    | 0.075      | -28%     |
| ALL E2E p50     | 7.446    | 6.429      | -14%     |
| Errors          | 0        | 0          | n/a      |

## Approach B: Session-Level Lazy Migration [UNDER TUNING]

Add a migration trigger **before** the cost model. When a request arrives for
a session on an overloaded instance, force migration if all of the following hold:

1. Instance busy: `num_requests > avg * migration_request_factor`
2. Session has cache: `cache_ratio > 0.5`
3. Request is HEAVY: `input_length >= heavy_threshold`
4. Target meaningfully less loaded: `target.num_requests < source - 2`
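
The four conditions can be sketched as a single predicate. This is an illustrative sketch only: the function signature, the `TriggerConfig` class, and the `heavy_threshold` value of 8192 tokens are assumptions (this document does not state the threshold); the comparisons themselves follow the list above.

```python
from dataclasses import dataclass


@dataclass
class TriggerConfig:
    """Hypothetical container for the trigger's tunables."""
    migration_request_factor: float = 1.5  # value used in the A+B run
    min_cache_ratio: float = 0.5           # condition 2
    heavy_threshold: int = 8192            # ASSUMED value, not from the doc
    min_request_gap: int = 2               # condition 4


def should_migrate(source_requests: int, target_requests: int,
                   avg_requests: float, cache_ratio: float,
                   input_length: int, cfg: TriggerConfig) -> bool:
    """Return True only when all four trigger conditions hold."""
    busy = source_requests > avg_requests * cfg.migration_request_factor
    has_cache = cache_ratio > cfg.min_cache_ratio
    heavy = input_length >= cfg.heavy_threshold
    target_lighter = target_requests < source_requests - cfg.min_request_gap
    return busy and has_cache and heavy and target_lighter


cfg = TriggerConfig()
# Overloaded source (12 requests vs. an average of 6): all conditions hold.
overloaded = should_migrate(12, 5, 6.0, 0.95, 10000, cfg)
# Mildly loaded source (8 vs. an average of 6): fails the 1.5x busy check.
mild = should_migrate(8, 5, 6.0, 0.95, 10000, cfg)
print(overloaded, mild)
```

Because the busy check compares against a multiple of the fleet average, the trigger depends on how uneven the fleet is, not on how loaded the source is in absolute terms.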

### Results (A+B combined, migration_request_factor=1.5)

**0 migrations triggered.** Approach A's contention-aware routing already
distributes load well enough that no instance reaches 1.5x the average. The
threshold needs to be lowered or the trigger redesigned.

### Next steps

- Lower `migration_request_factor` (e.g. 1.2 or 1.3)
- Consider an absolute threshold instead of a relative one (e.g. `num_requests > avg + 3`)
- Or trigger on recent TTFT rather than instantaneous `num_requests`
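
A quick way to compare candidate thresholds is to run them against a snapshot of per-instance `num_requests`. The snippet below is a sketch under assumptions: the helper names and the sample snapshot are made up for illustration and are not measured data.

```python
from typing import List


def busy_relative(loads: List[int], factor: float) -> List[int]:
    """Indices of instances with num_requests > avg * factor."""
    avg = sum(loads) / len(loads)
    return [i for i, n in enumerate(loads) if n > avg * factor]


def busy_absolute(loads: List[int], margin: float) -> List[int]:
    """Indices of instances with num_requests > avg + margin."""
    avg = sum(loads) / len(loads)
    return [i for i, n in enumerate(loads) if n > avg + margin]


# Hypothetical snapshot of 8 fairly balanced instances (average 10.5).
loads = [10, 11, 9, 12, 10, 13, 8, 11]

for factor in (1.5, 1.3, 1.2):
    print("relative", factor, busy_relative(loads, factor))
for margin in (3, 2):
    print("absolute", margin, busy_absolute(loads, margin))
```

On a snapshot this evenly balanced, a 1.5x factor flags nothing, which is consistent with the zero-trigger result above, while 1.2x flags only the busiest instance.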

## Evolution of Results

| Version  | Description                  | ALL TTFT p90 | HEAVY TTFT p90 | tok max/min |
|----------|------------------------------|--------------|----------------|-------------|
| Baseline | linear routing               | 16.058       | 23.780         | 2.7x        |
| v2 (bug) | unified, queue=prefill only  | 23.339       | 38.070         | 10.3x       |
| v3       | +decode in queue, +hard gate | 10.121       | 18.471         | 2.6x        |
| **A**    | **+num_requests contention** | **7.638**    | **15.929**     | **3.5x**    |
| A+B      | +session migration (1.5x)    | 8.291        | 16.384         | 3.0x        |