Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still presented
the retired single-argmin + PUSH-migration design as the final algorithm.
Mark them superseded and document the current hybrid direction
(commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting the
  "Final Design" framing was retired after cc6e562 / 4c583f2; point readers
  to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current hybrid
  algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker),
  then a "What Was Retired" commit table, then the old Approach A numbers
  preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric
  has no affinity at all); it converges to similar placements because
  P_tokens includes new_uncached_tokens, giving it implicit soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.
- analysis/unified_routing_fix_review.md: track this file (was untracked);
  it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -1,62 +1,109 @@
|
||||
# Migration Policy Design: Improving Load Balance in Elastic KV
|
||||
# Routing & Migration Policy: Design Log
|
||||
|
||||
## Final Result
|
||||
This file is the active reference for the routing policy. It supersedes the
"single argmin + PUSH migration" framing once described as the final design
(see commit notes below and `REPORT.md` §3.9 errata).
|
||||
|
||||
**Unified routing (baseline mode, no Mooncake)** beats LMetric on E2E mean/p50/p90.
Pending multi-run significance verification.
|
||||
## Current Algorithm: Hybrid LMetric + High-Cache Affinity
|
||||
|
||||
| Metric | LMetric | Unified | Change |
|--------|---------|---------|--------|
| E2E mean | 18.204 | **17.831** | -2.0% |
| E2E p50 | 6.184 | **6.074** | -1.8% |
| E2E p90 | 39.438 | **37.073** | -6.0% |
| TTFT p90 | 9.331 | **8.034** | -13.9% |
|
||||
Implemented in `255c8e6`. Active under `--policy unified` in
`scripts/cache_aware_proxy.py`.
|
||||
|
||||
```python
# Step 1: affinity gate (only for sessions that have a recorded owner)
if session has affinity instance:
    cache_ratio = cache_hit_on_affinity / input_length
    gate_1: cache_ratio > 0.5
    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
    if gate_1 AND gate_2:
        decision = "affinity"
        return affinity_instance

# Step 2: LMetric fallback with deterministic tie-breaker
for each instance i:
    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
            = P_tokens * BS                  # primary
    secondary key: new_uncached_tokens       # prefer cache
    tertiary key:  num_requests              # prefer idle
    quaternary:    round-robin counter % n   # break ties
return argmin
```
|
||||
|
||||
The pure `--policy lmetric` baseline stays affinity-free; the hybrid lives
entirely under `--policy unified`. The round-robin counter is required because
`P_tokens * BS = 0` whenever `BS = 0` for all instances (new sessions, cold
start), which would otherwise pin every fresh session to instance 0.
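
The two steps above can be sketched as runnable Python. This is a minimal illustration under stated assumptions, not the code in `scripts/cache_aware_proxy.py`: the `Instance` dataclass, the `route` signature, the caller-supplied `rr_counter`, and the rotation `(i - start) % n` used as the last sort key are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Instance:
    pending_prefill_tokens: int  # prefill tokens already queued on this instance
    num_requests: int            # current batch size (BS)


def route(
    instances: Sequence[Instance],
    input_length: int,
    cache_hits: Sequence[int],       # cached tokens of this request, per instance
    rr_counter: int,                 # hypothetical counter, bumped by the caller per request
    affinity: Optional[int] = None,  # index of the session's recorded owner, if any
    overload_factor: float = 2.0,
) -> Tuple[int, str]:
    n = len(instances)

    # Step 1: affinity gate (cache_ratio > 0.5 and owner not overloaded).
    if affinity is not None and input_length > 0:
        cache_ratio = cache_hits[affinity] / input_length
        avg_requests = sum(inst.num_requests for inst in instances) / n
        not_overloaded = instances[affinity].num_requests <= avg_requests * overload_factor
        if cache_ratio > 0.5 and not_overloaded:
            return affinity, "affinity"

    # Step 2: LMetric fallback; tuple comparison applies the keys in order.
    start = rr_counter % n

    def key(i: int) -> Tuple[int, int, int, int]:
        uncached = max(0, input_length - cache_hits[i])
        score = (instances[i].pending_prefill_tokens + uncached) * instances[i].num_requests
        return (score, uncached, instances[i].num_requests, (i - start) % n)

    return min(range(n), key=key), "lmetric"


# Cold start: every score is 0, so only the rotating key separates instances.
cold = [Instance(0, 0) for _ in range(3)]
picks = [route(cold, 100, [0, 0, 0], rr_counter=k)[0] for k in range(3)]
```

With three idle instances and no cache, `picks` rotates across the instances instead of repeating index 0, which is the behavior the round-robin key is meant to provide.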
|
||||
|
||||
Parameters: `overload_factor=2.0` (default). The previously introduced
`decode_iteration_s` / `prefill_throughput` / `rdma_overhead_s` are kept in
`Settings` but no longer drive routing; they were Approach-A inputs.
|
||||
|
||||
### Why this shape
|
||||
|
||||
- **LMetric for load balance**: `P_tokens × BS` is hyperparameter-free and
  captures both pending prefill work and current batch contention.
- **Implicit soft affinity from LMetric itself**: `P_tokens` includes
  `new_uncached_tokens = input - cache_hit`. Later turns naturally prefer
  the instance that already cached the prefix, because their `P_tokens` are
  smaller there. This is the dominant reason explicit migration buys little.
- **Explicit affinity only for the long-cache case**: when `cache_ratio > 0.5`,
  the placement cost of breaking stickiness is large enough to justify a hard
  gate. Below that ratio, defer to LMetric.
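
A small worked example of the implicit soft affinity described above. The helper name `lmetric_score` and all token counts are made up for illustration; only the formula `(pending_prefill_tokens + new_uncached_tokens) * num_requests` comes from this document.

```python
def lmetric_score(pending_prefill_tokens, input_length, cache_hit, num_requests):
    """P_tokens * BS, with P_tokens including the request's uncached tokens."""
    new_uncached_tokens = max(0, input_length - cache_hit)
    p_tokens = pending_prefill_tokens + new_uncached_tokens
    return p_tokens * num_requests


# A 4000-token turn with 3000 tokens already cached on the owner instance.
# Equal load on both instances: the owner scores lower, so it wins.
owner = lmetric_score(2000, 4000, cache_hit=3000, num_requests=4)   # (2000 + 1000) * 4
other = lmetric_score(2000, 4000, cache_hit=0, num_requests=4)      # (2000 + 4000) * 4

# Same request, but the owner now carries a much larger batch:
# the cache advantage is outweighed and the other instance wins.
owner_busy = lmetric_score(2000, 4000, cache_hit=3000, num_requests=10)
```

The preference is "soft" in exactly this sense: the cached prefix lowers the owner's score, but a large enough batch on the owner still sends the request elsewhere without any explicit affinity rule.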
|
||||
|
||||
## What Was Retired and Why
|
||||
|
||||
| Commit | Approach | Outcome |
|---|---|---|
| `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
| `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
| `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, **E2E p90 +12%, p99 +24%** |
| `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, **HEAVY TTFT p90 15.9s → 59.1s** |
| `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" |
| `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
| `4c583f2` | Revert relaxed gate | 134 offloads, **E2E p90 37s → 82s** |
| `255c8e6` | **Current** hybrid LMetric + high-cache affinity | Stable baseline |
|
||||
|
||||
The shared lesson across the retired variants: PD-sep offload pays
`C_queue + C_prefill + RDMA + D_schedule + D_decode_start` in overhead, and
the prefill time saved on D rarely amortizes it, especially because 92% of
HEAVY requests are turn-1 cold (no source-side cache to migrate). See
`analysis/elastic_hypotheses.md` H3-H9 for the per-variant evidence.
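
The break-even condition behind that lesson can be written down directly. The sketch below is illustrative only: the function name and every timing value are hypothetical placeholders, not measurements from these experiments.

```python
def offload_net_gain_s(saved_prefill_s, c_queue_s, c_prefill_s, rdma_s,
                       d_schedule_s, d_decode_start_s):
    """Seconds gained by offloading; negative means the offload costs more than it saves."""
    overhead_s = c_queue_s + c_prefill_s + rdma_s + d_schedule_s + d_decode_start_s
    return saved_prefill_s - overhead_s


# Hypothetical turn-1 cold request: nothing is cached at the source, so the
# prefill still has to run in full on C and little is saved on D.
gain = offload_net_gain_s(
    saved_prefill_s=1.0,
    c_queue_s=0.2,
    c_prefill_s=0.9,
    rdma_s=0.1,
    d_schedule_s=0.05,
    d_decode_start_s=0.05,
)
```

Offload only makes sense when this value is positive; with placeholder numbers like the ones above it is negative.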
|
||||
|
||||
## Historical Baseline-Mode Comparison (Approach A)
|
||||
|
||||
These numbers are from the additive-cost-model variant of Unified routing
(before `255c8e6`). Kept for reference; the hybrid currently lives on top of
LMetric, not on this additive cost. The "Unified" column should not be cited
as the current implementation.
|
||||
|
||||
| Metric | LMetric | Unified (Approach A, historical) | Change |
|--------|---------|----------------------------------|--------|
| E2E mean | 18.204 | 17.831 | -2.0% |
| E2E p50 | 6.184 | 6.074 | -1.8% |
| E2E p90 | 39.438 | 37.073 | -6.0% |
| TTFT p90 | 9.331 | 8.034 | -13.9% |
| Errors | 0 | 0 | — |
|
||||
|
||||
### Why Unified beats LMetric
|
||||
|
||||
1. **Session affinity** preserves KV cache across turns → turn 2+ TTFT much lower
2. **Additive cost model** (`contention + queue + prefill`) avoids LMetric's degenerate
   case when `num_requests = 0` (all instances score 0, tie-break to instance 0)
3. **`num_requests` as contention signal** better captures GPU batch scheduling
   overhead than `ongoing_tokens`
|
||||
|
||||
### Why PD-sep offload doesn't help (yet)
|
||||
|
||||
Extensive experimentation with offload/migration showed that PD-sep overhead
(C queue + prefill + KV transfer + D scheduling) consistently exceeds the
load-balance benefit:
|
||||
|
||||
| Experiment | Offloads | E2E p90 | vs Baseline |
|-----------|----------|---------|-------------|
| A (old gate, ~5 offloads) | 5 | 39.0 | -25% |
| A (relaxed gate, ~6 offloads) | 6 | 46.0 | -12% |
| A+B2 (forced migration) | 57 | 84.2 | +61% |
| A (relaxed gate v2, both gates removed) | 134 | 81.5 | +56% |
|
||||
|
||||
More offloads → worse performance. The offload mechanism itself is the bottleneck.
|
||||
|
||||
## Algorithm: Unified Routing
|
||||
Approach A (additive cost model, historical):
|
||||
|
||||
```python
|
||||
cost(instance_i) = num_requests_i × decode_iteration_s        # contention
                 + pending_prefill_tokens_i / throughput      # prefill queue
                 + max(0, input - cache_hit_i) / throughput   # new prefill
|
||||
|
||||
# Session affinity with two gates:
|
||||
if affinity instance exists:
|
||||
gate 1: ongoing_tokens <= avg * overload_factor (hard gate)
|
||||
gate 2: affinity_cost <= global_best * overload_factor (cost ratio)
|
||||
if both pass → use affinity instance
|
||||
else → use globally best instance
|
||||
else:
|
||||
use globally best instance
|
||||
gate 1: ongoing_tokens <= avg * overload_factor
|
||||
gate 2: affinity_cost <= global_best * overload_factor
|
||||
if both pass → affinity
|
||||
else → global best
|
||||
```
|
||||
|
||||
Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
`overload_factor=2.0`.

Reason for retirement: the additive cost depended heavily on the
`decode_iteration_s` and `prefill_throughput` constants. LMetric reproduces
the same ordering without those constants because `P_tokens` and `BS` already
capture both prefill queue and batch contention. The hybrid keeps the cheap
LMetric core and adds an explicit affinity gate only for high-cache cases.
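
For reference, the historical Approach A cost can be sketched as a plain function. This is a sketch under assumptions: the function name is hypothetical, and the units (seconds per decode iteration, tokens per second for `throughput`) are inferred from the formula rather than stated in the original.

```python
def approach_a_cost(num_requests, pending_prefill_tokens, input_length, cache_hit,
                    decode_iteration_s=0.05, throughput=7000.0):
    """Additive cost (assumed seconds) of placing a request on one instance."""
    contention = num_requests * decode_iteration_s             # batch contention
    prefill_queue = pending_prefill_tokens / throughput        # already-queued prefill
    new_prefill = max(0, input_length - cache_hit) / throughput  # this request's prefill
    return contention + prefill_queue + new_prefill


# Two idle instances (num_requests = 0): P_tokens * BS would be 0 on both,
# while the additive cost still separates them by cached prefix.
cold_cost = approach_a_cost(0, 0, input_length=7000, cache_hit=0)
warm_cost = approach_a_cost(0, 0, input_length=7000, cache_hit=3500)
```

Both defaults are hardware-specific (H20 values from the parameters above), so every placement decision shifts when they are mis-calibrated. That dependence is what the retirement note refers to.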
|
||||
|
||||
## Evolution of Results
|
||||
### Evolution Table (historical)
|
||||
|
||||
| Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
|---------|-------------|-------------|-------------|-------------|
|
||||
@@ -65,12 +112,34 @@ Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
|
||||
| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
| **A (baseline)** | **same routing, no Mooncake** | **8.034** | **37.073** | **—** |
| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — |
| **Hybrid (current)** | **LMetric + high-cache affinity** | **see §8 re-run** | **see §8 re-run** | **—** |
|
||||
|
||||
## Rigorous Review Summary
|
||||
The current Hybrid row deliberately has no number: per
`analysis/unified_routing_fix_review.md` #8, the small (≤2%) improvements
need 3-5 paired multi-run trials before being quoted.
|
||||
|
||||
Independent review found:
|
||||
## Open Questions / Next Steps
|
||||
|
||||
- **#8 paired multi-run**: 3-5 fresh trials of LMetric vs current hybrid on
  identical trace + restart procedure, with full artifact bundle
  (`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
  `gpu_util.csv`, per-instance APC snapshot, git commit hash).
- **#10 production-concurrency track**: re-introduce a controlled-concurrency
  flag (`--max-inflight-sessions` or equivalent) and rerun the comparison at
  64 / 128 active sessions before drawing conclusions for production.
- **Hard affinity ceiling for the edge case where `cache_ratio = 1.0` and the
  affinity instance is overloaded**: in that scenario the LMetric fallback's
  tie-breaker re-elects the same instance because its `new_uncached_tokens`
  is 0. Either add a hard `num_requests` ceiling or accept the behavior. This
  is still open in the test `test_hybrid_high_cache_breaks_on_overload`
  (works only when the request has at least one uncached block).
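
One way the paired comparison in #8 could be evaluated is an exact sign-flip (paired permutation) test on the per-trial differences. This is a sketch of a standard technique, not the procedure prescribed by the review; the function name is hypothetical and the E2E p90 values below are made-up placeholders, not results.

```python
from itertools import product
from statistics import mean


def paired_sign_flip_test(baseline, candidate):
    """Return (mean paired difference, exact two-sided sign-flip p-value)."""
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(mean(diffs))
    hits = 0
    total = 0
    # Under the null hypothesis each paired difference is equally likely to
    # have either sign, so enumerate all 2^n sign assignments.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = [s * d for s, d in zip(signs, diffs)]
        if abs(mean(flipped)) >= observed - 1e-12:
            hits += 1
    return mean(diffs), hits / total


# Placeholder E2E p90 values for 5 paired trials (same trace, same restart procedure).
lmetric_p90 = [39.4, 38.9, 40.1, 39.0, 39.7]
hybrid_p90 = [37.1, 37.8, 38.0, 37.5, 38.2]
mean_diff, p_value = paired_sign_flip_test(lmetric_p90, hybrid_p90)
```

With five pairs, the smallest two-sided p-value this test can produce is 2/32, reached only when every trial moves in the same direction. It may therefore be worth reporting the per-trial differences alongside any single significance figure.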
|
||||
|
||||
## Original "Rigorous Review Summary" (historical)
|
||||
|
||||
Independent review of the Approach A result above:
|
||||
- **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement)
- **CLEAN**: No reward hacking (improvement from algorithmic difference)
- **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs)
|
||||
- **NOTE**: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate
|
||||
- **NOTE**: Hardcoded constants (`0.05`, `7000`) are hardware-specific but
  legitimate
|
||||
|
||||