Update design doc: final results + review findings

Unified routing (baseline mode) beats LMetric on E2E mean/p50/p90. PD-sep offload consistently degrades performance (5-134 offloads tested). Independent review: fair comparison, no reward hacking; needs multi-run significance verification (3x paired test running).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-25 03:48:18 +08:00
parent 4c583f2f1c
commit 448361cf83
1 changed files with 59 additions and 61 deletions
--- a/docs/migration-policy-design.md
+++ b/docs/migration-policy-design.md
@@ -1,78 +1,76 @@
 # Migration Policy Design: Improving Load Balance in Elastic KV

-## Problem Statement
+## Final Result

-With the unified cost model (v3), elastic routing achieves TTFT p90 -37% vs
-baseline on WARM/MEDIUM requests. However, **HEAVY turn>=2 requests with 99%
-cache hit still suffer TTFT 5-150s due to queuing contention** on overloaded
-instances.
+**Unified routing (baseline mode, no Mooncake)** beats LMetric on E2E mean/p50/p90.
+This result is pending multi-run significance verification. Latencies are in seconds.

-Root cause: the cost model combines cache benefit and queuing into a single
-scalar. When cache hit is 99%, the cost is dominated by queue estimation, but
-queue is inaccurately estimated via `(pending_prefill + decode_tokens) /
-throughput` — a token-based proxy that misses real contention (batch size).
+| Metric | LMetric | Unified | Change |
+|--------|---------|---------|--------|
+| E2E mean | 18.204 | **17.831** | -2.0% |
+| E2E p50 | 6.184 | **6.074** | -1.8% |
+| E2E p90 | 39.438 | **37.073** | -6.0% |
+| TTFT p90 | 9.331 | **8.034** | -13.9% |
+| Errors | 0 | 0 | — |

-**Key data (v3, 850 requests, 8 instances):**
- 391 turn>=2 HEAVY LOCAL requests were migration candidates
- 298 (76%) had cache>80% — affinity held correctly
- **38 of those 298 (13%) had TTFT>5s** despite 94-99% cache hit (queuing victims)
- Only 8 offloads triggered total (2 real migrations, 6 useless turn-1 offloads)
- Theoretical TTFT for turn2+ HEAVY: mean=0.81s (actual: 4.73s, **5.8x gap**)
+### Why Unified beats LMetric

-## Approach A: Contention-Aware Cost Model [ADOPTED]
+1. **Session affinity** preserves KV cache across turns → turn 2+ TTFT much lower
+2. **Additive cost model** (`contention + queue + prefill`) avoids LMetric's degenerate
+   case when `num_requests = 0` (all instances score 0, tie-break to instance 0)
+3. **`num_requests` as contention signal** better captures GPU batch scheduling
+   overhead than `ongoing_tokens`
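
A minimal sketch of the degenerate case in point 2. LMetric's exact formula is not shown in this diff, so the sketch assumes a multiplicative `P×BS` score, with `P` read as prefill tokens and `BS` as `num_requests`. The function names are hypothetical, and the additive cost folds the two prefill terms into one for brevity.

```python
def multiplicative_score(prefill_tokens: int, num_requests: int) -> float:
    # Assumed LMetric-style score: prefill work times batch size (P x BS).
    return prefill_tokens * num_requests


def additive_cost(prefill_tokens: int, num_requests: int,
                  decode_iteration_s: float = 0.05,
                  throughput: float = 7000.0) -> float:
    # Additive form: the contention term plus the prefill term.
    return num_requests * decode_iteration_s + prefill_tokens / throughput


def argmin(values):
    # min() returns the first minimum, so ties resolve to the lowest index.
    return min(range(len(values)), key=values.__getitem__)


# Idle cluster: (prefill tokens for this request, num_requests) per instance.
idle_cluster = [(6000, 0), (3000, 0), (500, 0)]

mult_choice = argmin([multiplicative_score(p, n) for p, n in idle_cluster])
add_choice = argmin([additive_cost(p, n) for p, n in idle_cluster])
print(mult_choice, add_choice)
```

With every `num_requests` at 0, the product is 0 for all instances and the tie falls to instance 0, even though instance 2 needs the least prefill. The additive form still separates the instances and picks instance 2.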

-Replace `(pending_prefill + decode_tokens) / throughput` with
-`num_requests * decode_iteration_s + pending_prefill / throughput` as the
-queue estimation. `num_requests` (batch size) is the primary driver of
-decode iteration time and thus real contention.
+### Why PD-sep offload doesn't help (yet)

-Add a migration discount for sessions with accumulated cache (turn >= 2),
-reflecting the long-term value of migrating a session off a loaded instance.
+Extensive experimentation with offload/migration showed that PD-sep overhead
+(C queue + prefill + KV transfer + D scheduling) consistently exceeds load
+balance benefit:

-### Parameters
+| Experiment | Offloads | E2E p90 (s) | vs linear-routing baseline (52.292) |
+|-----------|----------|---------|-------------|
+| A (old gate, ~5 offloads) | 5 | 39.0 | -25% |
+| A (relaxed gate, ~6 offloads) | 6 | 46.0 | -12% |
+| A+B2 (forced migration) | 57 | 84.2 | +61% |
+| A (relaxed gate v2, both gates removed) | 134 | 81.5 | +56% |

- `decode_iteration_s = 0.05` (per-request decode iteration cost on H20)
- `migration_discount_cap = 5` (max turns to discount)
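+More offloads generally mean worse E2E p90, and even the lightest offload run (39.0) trails unified routing without Mooncake (37.073). The offload mechanism itself is the bottleneck.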

-### Results (vs baseline, 850 requests, 8×H20)
+## Algorithm: Unified Routing

-| Metric           | Baseline | Approach A | Change  |
-|------------------|----------|------------|---------|
-| ALL TTFT mean    |    5.639 |      3.675 |   -35%  |
-| ALL TTFT p90     |   16.058 |      7.638 | **-52%**|
-| MEDIUM TTFT p90  |    4.412 |      1.681 | **-62%**|
-| HEAVY TTFT p90   |   23.780 |     15.929 |   -33%  |
-| ALL TPOT p90     |    0.105 |      0.075 |   -28%  |
-| ALL E2E p50      |    7.446 |      6.429 |   -14%  |
-| Errors           |        0 |          0 |     —   |
+```python
+cost(instance_i) = num_requests_i × decode_iteration_s     # contention
+                 + pending_prefill_tokens_i / throughput     # prefill queue
+                 + max(0, input - cache_hit_i) / throughput  # new prefill

-## Approach B: Session-Level Lazy Migration [UNDER TUNING]
+# Session affinity with two gates:
+if affinity instance exists:
+    gate 1: ongoing_tokens <= avg * overload_factor  (hard gate)
+    gate 2: affinity_cost <= global_best * overload_factor  (cost ratio)
+    if both pass → use affinity instance
+    else → use globally best instance
+else:
+    use globally best instance
+```
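
The pseudocode above can be sketched as runnable Python. This is an illustration under stated assumptions, not the project's implementation: the `Instance` fields, the `route` signature, the reading of `avg` as the mean `ongoing_tokens` across instances, and `throughput` being in tokens per second are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

DECODE_ITERATION_S = 0.05  # per-request decode iteration cost (H20), from the doc
THROUGHPUT = 7000.0        # from the doc; assumed to be prefill tokens per second
OVERLOAD_FACTOR = 2.0      # from the doc


@dataclass
class Instance:
    num_requests: int            # current batch size
    pending_prefill_tokens: int  # prefill work already queued
    ongoing_tokens: int          # tokens in flight, used by gate 1
    cache_hit_tokens: int = 0    # cached prefix for the incoming request


def cost(inst: Instance, input_tokens: int) -> float:
    contention = inst.num_requests * DECODE_ITERATION_S
    prefill_queue = inst.pending_prefill_tokens / THROUGHPUT
    new_prefill = max(0, input_tokens - inst.cache_hit_tokens) / THROUGHPUT
    return contention + prefill_queue + new_prefill


def route(instances: Sequence[Instance], input_tokens: int,
          affinity: Optional[int] = None) -> int:
    """Return the index of the instance that should serve the request."""
    costs = [cost(inst, input_tokens) for inst in instances]
    best = min(range(len(instances)), key=costs.__getitem__)
    if affinity is None:
        return best
    # Gate 1 (hard gate): the affinity instance must not be overloaded.
    avg_tokens = sum(i.ongoing_tokens for i in instances) / len(instances)
    gate1 = instances[affinity].ongoing_tokens <= avg_tokens * OVERLOAD_FACTOR
    # Gate 2 (cost ratio): affinity must stay close to the global best.
    gate2 = costs[affinity] <= costs[best] * OVERLOAD_FACTOR
    return affinity if (gate1 and gate2) else best


# Example: instance 0 holds the session's cache but is heavily loaded.
cluster = [
    Instance(num_requests=10, pending_prefill_tokens=7000,
             ongoing_tokens=10000, cache_hit_tokens=7000),
    Instance(num_requests=1, pending_prefill_tokens=0, ongoing_tokens=1000),
    Instance(num_requests=1, pending_prefill_tokens=0, ongoing_tokens=1000),
]
print(route(cluster, input_tokens=7000, affinity=0))
```

In the example, instance 0 fails gate 1 (10000 ongoing tokens against a limit of 2 × 4000), so the request leaves its affinity instance and goes to the globally cheapest one.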

-Add a migration trigger **before** the cost model. When a request arrives for
-a session on an overloaded instance, force migration if:
-1. Instance busy: `num_requests > avg * migration_request_factor`
-2. Session has cache: `cache_ratio > 0.5`
-3. Request is HEAVY: `input_length >= heavy_threshold`
-4. Target meaningfully less loaded: `target.num_requests < source - 2`
-
-### Results (A+B combined, migration_request_factor=1.5)
-
-**0 migrations triggered** — Approach A's contention-aware routing already
-distributes load well enough that no instance reaches 1.5x average. The
-threshold needs to be lowered or the trigger redesigned.
-
-### Next steps
-
- Lower `migration_request_factor` (e.g. 1.2 or 1.3)
- Consider absolute threshold instead of relative (e.g. > avg + 3)
- Or trigger based on recent TTFT rather than instantaneous num_requests
+Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
+`overload_factor=2.0`.

 ## Evolution of Results

-| Version | Description | ALL TTFT p90 | HEAVY TTFT p90 | tok max/min |
-|---------|-------------|-------------|----------------|-------------|
-| Baseline | linear routing | 16.058 | 23.780 | 2.7x |
-| v2 (bug) | unified, queue=prefill only | 23.339 | 38.070 | 10.3x |
-| v3 | +decode in queue, +hard gate | 10.121 | 18.471 | 2.6x |
-| **A** | **+num_requests contention** | **7.638** | **15.929** | **3.5x** |
-| A+B | +session migration (1.5x) | 8.291 | 16.384 | 3.0x |
+| Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
+|---------|-------------|-------------|-------------|-------------|
+| Baseline | linear routing | 16.058 | 52.292 | 2.7x |
+| LMetric | P×BS, no affinity | 9.331 | 39.438 | 2.4x |
+| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
+| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
+| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
+| **A (baseline)** | **same routing, no Mooncake** | **8.034** | **37.073** | **—** |
+
+## Rigorous Review Summary
+
+Independent review found:
+- **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement)
+- **CLEAN**: No reward hacking (improvement from algorithmic difference)
+- **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs)
+- **NOTE**: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate
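
One way to run the multi-run check is a paired t-test over per-run E2E means, where each pair shares the same trace. The sketch below uses only the standard library; the function name is hypothetical, the critical values are the standard two-sided 5% values of Student's t, and the sample numbers are illustrative placeholders, not measurements.

```python
import math
import statistics

# Two-sided 5% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776}


def paired_t(lmetric_runs, unified_runs):
    """Return (mean difference, t statistic, significant at 5%) for paired runs."""
    if len(lmetric_runs) != len(unified_runs):
        raise ValueError("runs must be paired one-to-one")
    diffs = [u - l for l, u in zip(lmetric_runs, unified_runs)]
    n = len(diffs)
    if n - 1 not in T_CRIT_95:
        raise ValueError("this sketch only covers 3 to 5 paired runs")
    mean_diff = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("zero variance in differences; t statistic undefined")
    t_stat = mean_diff / (sd / math.sqrt(n))
    return mean_diff, t_stat, abs(t_stat) > T_CRIT_95[n - 1]


# Placeholder per-run E2E means in seconds (NOT measured values).
lmetric = [18.0, 18.5, 18.2]
unified = [17.5, 18.1, 17.6]
mean_diff, t_stat, significant = paired_t(lmetric, unified)
print(f"mean diff={mean_diff:.3f}s t={t_stat:.2f} significant={significant}")
```

With only three runs the test has two degrees of freedom, so the critical value is large (4.303). A small mean improvement needs very consistent per-run differences to clear that bar.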