Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after cc6e562 / 4c583f2; point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction.
- analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Routing & Migration Policy: Design Log
This file is the active reference for the routing policy. It supersedes the
"single argmin + PUSH migration" framing once described as the final design
(see commit notes below and REPORT.md §3.9 errata).
Current Algorithm: Hybrid LMetric + High-Cache Affinity
Implemented in 255c8e6. Active under --policy unified in
scripts/cache_aware_proxy.py.
# Step 1: affinity gate (only for sessions that have a recorded owner)
if session has affinity instance:
    cache_ratio = cache_hit_on_affinity / input_length
    gate_1: cache_ratio > 0.5
    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
    if gate_1 AND gate_2:
        decision = "affinity"
        return affinity_instance
# Step 2: LMetric fallback with deterministic tie-breaker
for each instance i:
    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
            = P_tokens * BS                  # primary
    secondary key: new_uncached_tokens       # prefer cache
    tertiary key:  num_requests              # prefer idle
    quaternary:    round-robin counter % n   # break ties
return argmin
The pure --policy lmetric baseline stays affinity-free; the hybrid lives
entirely under --policy unified. The round-robin counter is required because
P_tokens * BS = 0 whenever BS = 0 for all instances (new sessions, cold
start), which would otherwise pin every fresh session to instance 0.
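The two steps above can be sketched in Python as follows. This is a minimal illustration, not the code in scripts/cache_aware_proxy.py: the Instance dataclass, the route function, and all field names are hypothetical, and the way the round-robin counter rotates tied candidates is one possible reading of "round-robin counter % n". Only the gate thresholds, the score formula, and the key order are taken from the pseudocode.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional, Tuple


@dataclass
class Instance:
    """Hypothetical per-instance load snapshot (field names are assumptions)."""
    name: str
    pending_prefill_tokens: int
    num_requests: int


_rr = count()  # round-robin counter shared across routing calls


def route(instances: List[Instance], input_length: int,
          cache_hit: Dict[str, int], affinity: Optional[str] = None,
          overload_factor: float = 2.0) -> Tuple[str, str]:
    """Return (decision, instance_name) following the two-step hybrid."""
    by_name = {inst.name: inst for inst in instances}
    n = len(instances)

    # Step 1: affinity gate, only for sessions with a recorded owner.
    if affinity in by_name and input_length > 0:
        owner = by_name[affinity]
        cache_ratio = cache_hit.get(affinity, 0) / input_length
        avg_requests = sum(i.num_requests for i in instances) / n
        if cache_ratio > 0.5 and owner.num_requests <= avg_requests * overload_factor:
            return "affinity", owner.name

    # Step 2: LMetric fallback with a deterministic tie-breaker.
    start = next(_rr) % n  # assumed rotation scheme for the final key

    def key(indexed):
        idx, inst = indexed
        uncached = max(0, input_length - cache_hit.get(inst.name, 0))
        p_tokens = inst.pending_prefill_tokens + uncached
        return (p_tokens * inst.num_requests,  # primary: P_tokens * BS
                uncached,                      # secondary: prefer cache
                inst.num_requests,             # tertiary: prefer idle
                (idx - start) % n)             # quaternary: rotate ties

    _, best = min(enumerate(instances), key=key)
    return "lmetric", best.name
```

With three idle instances and no cache hits, every primary, secondary, and tertiary key ties, so three consecutive calls land on three different instances instead of all picking the first one.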
Parameters: overload_factor=2.0 (default). The previously-introduced
decode_iteration_s / prefill_throughput / rdma_overhead_s are kept in
Settings but no longer drive routing — they were Approach-A inputs.
Why this shape
- LMetric for load balance: P_tokens × BS is hyperparameter-free and captures both pending prefill work and current batch contention.
- Implicit soft affinity from LMetric itself: P_tokens includes new_uncached_tokens = input - cache_hit. Later turns naturally prefer the instance that already cached the prefix, because their P_tokens are smaller there. This is the dominant reason explicit migration buys little.
- Explicit affinity only for the long-cache case: when cache_ratio > 0.5, the placement cost of breaking sticky is large enough to justify a hard gate. Below that ratio, defer to LMetric.
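A small worked example may make the implicit soft affinity concrete. The function name and all numbers below are illustrative assumptions, not measurements: two instances carry identical load, and only the cached prefix differs.

```python
def lmetric_score(pending_prefill_tokens: int, input_length: int,
                  cache_hit: int, num_requests: int) -> int:
    """P_tokens * BS, with P_tokens = pending prefill + uncached new tokens."""
    new_uncached_tokens = max(0, input_length - cache_hit)
    return (pending_prefill_tokens + new_uncached_tokens) * num_requests


# Hypothetical turn-2 request: 4000 input tokens, 3000 already cached on "A".
# Both instances carry the same load, so only the cache hit differs.
score_a = lmetric_score(pending_prefill_tokens=500, input_length=4000,
                        cache_hit=3000, num_requests=4)
score_b = lmetric_score(pending_prefill_tokens=500, input_length=4000,
                        cache_hit=0, num_requests=4)
print(score_a, score_b)  # 6000 18000
```

The instance holding the prefix scores three times lower here without any explicit affinity rule, which is the effect the second bullet describes.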
What Was Retired and Why
| Commit | Approach | Outcome |
|---|---|---|
| 6b255fa | Single argmin(queue+prefill+transfer) over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
| 5892739 | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
| 2b9eae0 | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, E2E p90 +12%, p99 +24% |
| e991960/5772149 | Forced session migration triggers (Approach B) | 57 migrations, HEAVY TTFT p90 15.9s → 59.1s |
| cc6e562 | Revert Approach B | "overhead exceeds LB benefit" |
| bf4469a | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
| 4c583f2 | Revert relaxed gate | 134 offloads, E2E p90 37s → 82s |
| 255c8e6 | Current hybrid LMetric + high-cache affinity | Stable baseline |
The shared lesson across the retired variants: PD-sep offload pays
C_queue + C_prefill + RDMA + D_schedule + D_decode_start and the saved
prefill time on D rarely amortizes this — especially because 92% of HEAVY
requests are turn-1 cold (no source-side cache to migrate). See
analysis/elastic_hypotheses.md H3-H9 for the per-variant evidence.
Historical Baseline-Mode Comparison (Approach A)
These numbers are from the additive-cost-model variant of Unified routing
(before 255c8e6). Kept for reference; the hybrid currently lives on top of
LMetric, not on this additive cost. The "Unified" column should not be cited
as the current implementation.
| Metric | LMetric | Unified (Approach A, historical) | Change |
|---|---|---|---|
| E2E mean | 18.204 | 17.831 | -2.0% |
| E2E p50 | 6.184 | 6.074 | -1.8% |
| E2E p90 | 39.438 | 37.073 | -6.0% |
| TTFT p90 | 9.331 | 8.034 | -13.9% |
| Errors | 0 | 0 | — |
Approach A (additive cost model, historical):
cost(instance_i) = num_requests_i × decode_iteration_s        # contention
                 + pending_prefill_tokens_i / throughput      # prefill queue
                 + max(0, input - cache_hit_i) / throughput   # new prefill
if affinity instance exists:
    gate 1: ongoing_tokens <= avg * overload_factor
    gate 2: affinity_cost <= global_best * overload_factor
    if both pass → affinity
    else → global best
Reason for retirement: the additive cost depended on the hardware-specific
decode_iteration_s and prefill_throughput constants. LMetric reproduces
the same ordering without those constants because P_tokens and BS already
capture both prefill queue and batch contention. The hybrid keeps the cheap
LMetric core and adds an explicit affinity gate only for high-cache cases.
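To illustrate the contrast, the sketch below evaluates both scores on one hypothetical load snapshot. The function names and load numbers are made up, and mapping the defaults 0.05 and 7000 to decode_iteration_s and throughput is an assumption based on the hardcoded constants mentioned in the review summary. One agreeing example does not prove the two scores always rank instances the same way.

```python
def additive_cost(num_requests, pending_prefill_tokens, input_length, cache_hit,
                  decode_iteration_s=0.05, throughput=7000.0):
    """Approach A cost in seconds; both defaults are assumed values."""
    contention = num_requests * decode_iteration_s
    prefill_queue = pending_prefill_tokens / throughput
    new_prefill = max(0, input_length - cache_hit) / throughput
    return contention + prefill_queue + new_prefill


def lmetric_score(num_requests, pending_prefill_tokens, input_length, cache_hit):
    """Constant-free P_tokens * BS score."""
    uncached = max(0, input_length - cache_hit)
    return (pending_prefill_tokens + uncached) * num_requests


# Hypothetical snapshot of two instances and one cold request.
loads = {
    "busy": dict(num_requests=8, pending_prefill_tokens=6000),
    "idle": dict(num_requests=2, pending_prefill_tokens=1000),
}
request = dict(input_length=2000, cache_hit=0)

best_additive = min(loads, key=lambda k: additive_cost(**loads[k], **request))
best_lmetric = min(loads, key=lambda k: lmetric_score(**loads[k], **request))
print(best_additive, best_lmetric)  # idle idle
```

The additive cost needs two calibrated constants to produce its ranking, while the LMetric score reaches the same choice here from token and request counts alone.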
Evolution Table (historical)
| Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
|---|---|---|---|---|
| Baseline | linear routing | 16.058 | 52.292 | 2.7x |
| LMetric | P×BS, no affinity | 9.331 | 39.438 | 2.4x |
| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — |
| Hybrid (current) | LMetric + high-cache affinity | see §8 re-run | see §8 re-run | — |
The current Hybrid row deliberately has no number: per
analysis/unified_routing_fix_review.md #8, the small (≤2%) improvements
need 3-5 paired multi-run trials before being quoted.
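One way such paired trials could be summarized is sketched below. The helper name, the returned fields, and the two-standard-error heuristic are assumptions for illustration, not the review's prescribed method, and the lists in the usage lines are placeholders, not measured results.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(baseline, candidate):
    """Summarize the per-trial relative change of candidate vs baseline."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need at least two paired trials")
    # Each trial pair shares trace and restart procedure, so compare per pair.
    rel = [(c - b) / b for b, c in zip(baseline, candidate)]
    m = mean(rel)
    se = stdev(rel) / sqrt(len(rel))  # standard error of the mean change
    return {"n": len(rel), "mean_rel_change": m, "std_err": se,
            "beyond_2se": abs(m) > 2 * se}  # crude screen, not a formal test


# Placeholder E2E p90 values for three paired trials (not real data).
summary = paired_summary([40.0, 39.0, 41.0], [39.6, 38.8, 40.1])
print(summary)
```

Pairing by trial keeps run-to-run noise from the trace and restart procedure out of the comparison, which matters when the expected effect is on the order of 2%.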
Open Questions / Next Steps
- #8 paired multi-run: 3-5 fresh trials of LMetric vs current hybrid on identical trace + restart procedure, with full artifact bundle (config.json, metrics.jsonl, metrics.summary.json, breakdown.json, gpu_util.csv, per-instance APC snapshot, git commit hash).
- #10 production-concurrency track: re-introduce a controlled-concurrency flag (--max-inflight-sessions or equivalent) and rerun the comparison at 64 / 128 active sessions before drawing conclusions for production.
- Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the affinity instance is overloaded: in that scenario the LMetric fallback's tie-breaker re-elects the same instance because its new_uncached_tokens is 0. Either add a hard num_requests ceiling or accept the behavior. Open in the test test_hybrid_high_cache_breaks_on_overload (works only when the request has at least one uncached block).
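The edge case in the last item can be reproduced with a few lines. This is a sketch under assumed numbers: the lmetric_key helper, the instance names, and the 16-token uncached remainder are hypothetical, and the owner is assumed to have no pending prefill.

```python
def lmetric_key(pending_prefill_tokens, num_requests, input_length, cache_hit):
    """Primary, secondary, and tertiary keys of the LMetric fallback."""
    uncached = max(0, input_length - cache_hit)
    return ((pending_prefill_tokens + uncached) * num_requests,
            uncached, num_requests)


input_length = 2000
idle_key = lmetric_key(0, 0, input_length, cache_hit=0)

# Case 1: cache_ratio = 1.0 on an overloaded owner with no pending prefill.
# Both primary scores are 0, so the secondary key (uncached tokens) decides.
full = {"owner": lmetric_key(0, 12, input_length, cache_hit=2000),
        "idle": idle_key}
winner_full = min(full, key=full.get)

# Case 2: the same owner with 16 uncached tokens left (assumed remainder).
partial = {"owner": lmetric_key(0, 12, input_length, cache_hit=1984),
           "idle": idle_key}
winner_partial = min(partial, key=partial.get)

print(winner_full, winner_partial)  # owner idle
```

In the first case the overloaded owner is re-elected even though the affinity gate rejected it. In the second, a single uncached remainder makes the owner's primary score nonzero and the request moves away, matching the note about when the test works.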
Original "Rigorous Review Summary" (historical)
Independent review of the Approach A result above:
- CLEAN: Fair comparison (identical vLLM/proxy/trace/measurement)
- CLEAN: No reward hacking (improvement from algorithmic difference)
- WARNING: 2% mean improvement needs multi-run verification (3-5 runs)
- NOTE: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate