Add migration policy design doc with A/B experiment results

Approach A (contention-aware cost model): TTFT p90 -52% vs baseline.
Approach B (session migration): 0 triggers at 1.5x threshold; needs tuning.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-24 18:24:49 +08:00
parent e9919605af
commit 45b82272c3
1 changed files with 78 additions and 0 deletions
--- a/docs/migration-policy-design.md
+++ b/docs/migration-policy-design.md
@@ -0,0 +1,78 @@
+# Migration Policy Design: Improving Load Balance in Elastic KV
+
+## Problem Statement
+
+With the unified cost model (v3), elastic routing achieves TTFT p90 -37% vs
+baseline on WARM/MEDIUM requests. However, **HEAVY turn>=2 requests with 99%
+cache hit still suffer TTFT 5-150s due to queuing contention** on overloaded
+instances.
+
+Root cause: the cost model combines cache benefit and queuing into a single
+scalar. When cache hit is 99%, the cost is dominated by the queue estimate, but
+that estimate is `(pending_prefill + decode_tokens) / throughput`, a token-based
+proxy that ignores batch size, the main driver of real contention.
+
+**Key data (v3, 850 requests, 8 instances):**
+- 391 turn>=2 HEAVY LOCAL requests were migration candidates
+- 298 (76%) had cache>80% — affinity held correctly
+- **38 of those 298 (13%) had TTFT>5s** despite 94-99% cache hit (queuing victims)
+- Only 8 offloads triggered total (2 real migrations, 6 useless turn-1 offloads)
+- Theoretical TTFT for turn2+ HEAVY: mean=0.81s (actual: 4.73s, **5.8x gap**)
+
+## Approach A: Contention-Aware Cost Model [ADOPTED]
+
+Replace `(pending_prefill + decode_tokens) / throughput` with
+`num_requests * decode_iteration_s + pending_prefill / throughput` as the
+queue estimate. `num_requests` (batch size) is the primary driver of
+decode iteration time and thus of real contention.
+
+Add a migration discount for sessions with accumulated cache (turn >= 2),
+reflecting the long-term value of migrating a session off a loaded instance.
+
+### Parameters
+
+- `decode_iteration_s = 0.05` (per-request decode iteration cost on H20)
+- `migration_discount_cap = 5` (max turns to discount)
+
+### Results (vs baseline, 850 requests, 8×H20)
+
+| Metric              | Baseline | Approach A | Change  |
+|---------------------|----------|------------|---------|
+| ALL TTFT mean (s)   |    5.639 |      3.675 |   -35%  |
+| ALL TTFT p90 (s)    |   16.058 |      7.638 | **-52%**|
+| MEDIUM TTFT p90 (s) |    4.412 |      1.681 | **-62%**|
+| HEAVY TTFT p90 (s)  |   23.780 |     15.929 |   -33%  |
+| ALL TPOT p90 (s)    |    0.105 |      0.075 |   -28%  |
+| ALL E2E p50 (s)     |    7.446 |      6.429 |   -14%  |
+| Errors (count)      |        0 |          0 |   n/a   |
+
+## Approach B: Session-Level Lazy Migration [UNDER TUNING]
+
+Add a migration trigger **before** the cost model. When a request arrives for
+a session on an overloaded instance, force migration if all of these hold:
+1. Instance busy: `num_requests > avg * migration_request_factor`
+2. Session has cache: `cache_ratio > 0.5`
+3. Request is HEAVY: `input_length >= heavy_threshold`
+4. Target meaningfully less loaded: `target.num_requests < source.num_requests - 2`
+
+### Results (A+B combined, migration_request_factor=1.5)
+
+**0 migrations triggered** — Approach A's contention-aware routing already
+distributes load well enough that no instance reaches 1.5x average. The
+threshold needs to be lowered or the trigger redesigned.
+
+### Next steps
+
+- Lower `migration_request_factor` (e.g. 1.2 or 1.3)
+- Consider an additive margin over the average instead of a multiplicative factor (e.g. `num_requests > avg + 3`)
+- Or trigger on recent TTFT rather than instantaneous `num_requests`
+
+## Evolution of Results
+
+| Version | Description | ALL TTFT p90 (s) | HEAVY TTFT p90 (s) | tok max/min |
+|---------|-------------|-------------|----------------|-------------|
+| Baseline | linear routing | 16.058 | 23.780 | 2.7x |
+| v2 (bug) | unified, queue=prefill only | 23.339 | 38.070 | 10.3x |
+| v3 | +decode in queue, +hard gate | 10.121 | 18.471 | 2.6x |
+| **A** | **+num_requests contention** | **7.638** | **15.929** | **3.5x** |
+| A+B | +session migration (1.5x) | 8.291 | 16.384 | 3.0x |
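
Reviewer note: the two queue estimates compared in the design doc, and the Approach B trigger, can be sketched in a few lines of Python. This is a minimal illustration, not the code from this repository. The `Instance` dataclass, the function names, and the example values for `throughput`, `heavy_threshold`, and the per-instance loads are hypothetical. Only the two formulas, `decode_iteration_s = 0.05`, `migration_request_factor = 1.5`, the `cache_ratio > 0.5` bound, and the `- 2` margin come from the doc. The sketch reads the four trigger conditions as a conjunction and leaves out the migration discount, since the doc does not specify its form.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical snapshot of one serving instance's load."""
    num_requests: int      # requests currently in the batch
    pending_prefill: int   # prefill tokens waiting to be processed
    decode_tokens: int     # decode tokens in flight
    throughput: float      # tokens per second (assumed unit)


def queue_estimate_v3(inst: Instance) -> float:
    """v3 token-based proxy: (pending_prefill + decode_tokens) / throughput."""
    return (inst.pending_prefill + inst.decode_tokens) / inst.throughput


def queue_estimate_a(inst: Instance, decode_iteration_s: float = 0.05) -> float:
    """Approach A: num_requests * decode_iteration_s + pending_prefill / throughput."""
    return (inst.num_requests * decode_iteration_s
            + inst.pending_prefill / inst.throughput)


def should_migrate(source: Instance, target: Instance, avg_num_requests: float,
                   cache_ratio: float, input_length: int, heavy_threshold: int,
                   migration_request_factor: float = 1.5) -> bool:
    """Approach B trigger, assuming all four conditions must hold together."""
    busy = source.num_requests > avg_num_requests * migration_request_factor
    has_cache = cache_ratio > 0.5
    heavy = input_length >= heavy_threshold
    target_lighter = target.num_requests < source.num_requests - 2
    return busy and has_cache and heavy and target_lighter


# Two made-up instances with identical token counts but different batch sizes.
crowded = Instance(num_requests=40, pending_prefill=2000,
                   decode_tokens=8000, throughput=10000.0)
sparse = Instance(num_requests=4, pending_prefill=2000,
                  decode_tokens=8000, throughput=10000.0)

# The v3 proxy cannot tell them apart; the Approach A estimate can.
print(queue_estimate_v3(crowded), queue_estimate_v3(sparse))
print(queue_estimate_a(crowded), queue_estimate_a(sparse))

# With a fleet average of 22 requests, 40 exceeds 1.5 * 22 = 33 and the trigger fires.
fires = should_migrate(crowded, sparse, avg_num_requests=22, cache_ratio=0.95,
                       input_length=9000, heavy_threshold=8000)
# With a fleet average of 30 requests, 40 does not exceed 1.5 * 30 = 45.
quiet = should_migrate(crowded, sparse, avg_num_requests=30, cache_ratio=0.95,
                       input_length=9000, heavy_threshold=8000)
print(fires, quiet)
```

The second call mirrors the situation reported for A+B: once routing keeps every instance below 1.5x the average request count, the first condition is never met, so no migration can fire regardless of the other three.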