
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction

Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 10:47:14 +08:00

7.2 KiB

Routing & Migration Policy: Design Log

This file is the active reference for the routing policy. It supersedes the "single argmin + PUSH migration" framing once described as the final design (see commit notes below and REPORT.md §3.9 errata).

Current Algorithm: Hybrid LMetric + High-Cache Affinity

Implemented in 255c8e6. Active under --policy unified in scripts/cache_aware_proxy.py.

# Step 1: affinity gate (only for sessions that have a recorded owner)
if session has affinity instance:
    cache_ratio = cache_hit_on_affinity / input_length
    gate_1: cache_ratio > 0.5
    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
    if gate_1 AND gate_2:
        decision = "affinity"
        return affinity_instance

# Step 2: LMetric fallback with deterministic tie-breaker
for each instance i:
    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
            = P_tokens * BS                                     # primary
secondary key: new_uncached_tokens                              # prefer cache
tertiary key:  num_requests                                     # prefer idle
quaternary:    round-robin counter % n                          # break ties
return argmin

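The pure --policy lmetric baseline stays affinity-free; the hybrid lives entirely under --policy unified. The round-robin counter is required because P_tokens * BS = 0 whenever BS = 0 on every instance (new sessions at cold start). In that state the secondary and tertiary keys tie as well, since no instance holds a cached prefix and none has running requests, so without the counter argmin would pin every fresh session to instance 0.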
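The two steps above can be sketched in Python as follows. This is a minimal illustration, not the code in scripts/cache_aware_proxy.py: the `Instance` and `HybridRouter` names, the `cache_hits` mapping, and the choice to implement the round-robin key as a rotation distance from a per-call start index are all assumptions made for the example.

```python
from dataclasses import dataclass
from itertools import count
from typing import Dict, List, Optional, Tuple


@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int = 0  # prefill tokens already queued
    num_requests: int = 0            # current batch size (BS)


class HybridRouter:
    """Hypothetical sketch of the two-step decision; not the proxy's code."""

    def __init__(self, instances: List[Instance], overload_factor: float = 2.0,
                 cache_ratio_threshold: float = 0.5) -> None:
        self.instances = instances
        self.overload_factor = overload_factor
        self.cache_ratio_threshold = cache_ratio_threshold
        self._rr = count()  # round-robin counter, advanced once per fallback

    def route(self, input_length: int, cache_hits: Dict[str, int],
              affinity: Optional[str] = None) -> Tuple[str, str]:
        """Return (instance name, decision label) for one request."""
        n = len(self.instances)
        by_name = {inst.name: inst for inst in self.instances}

        # Step 1: affinity gate, only when the session has a recorded owner.
        if affinity is not None and input_length > 0:
            owner = by_name[affinity]
            cache_ratio = cache_hits.get(affinity, 0) / input_length
            avg_requests = sum(i.num_requests for i in self.instances) / n
            if (cache_ratio > self.cache_ratio_threshold
                    and owner.num_requests <= avg_requests * self.overload_factor):
                return owner.name, "affinity"

        # Step 2: LMetric fallback with a lexicographic tie-breaker.
        start = next(self._rr) % n

        def key(idx: int) -> Tuple[int, int, int, int]:
            inst = self.instances[idx]
            uncached = max(0, input_length - cache_hits.get(inst.name, 0))
            score = (inst.pending_prefill_tokens + uncached) * inst.num_requests
            # primary, prefer cache, prefer idle, round-robin rotation
            return (score, uncached, inst.num_requests, (idx - start) % n)

        best = min(range(n), key=key)
        return self.instances[best].name, "lmetric"


# Cold start: every score is 0, so only the round-robin key separates instances.
router = HybridRouter([Instance("a"), Instance("b"), Instance("c")])
cold_start = [router.route(1000, {})[0] for _ in range(3)]
print(cold_start)  # ['a', 'b', 'c']
```

Under these assumptions, three fresh sessions at cold start land on three different instances instead of all on the first one, which is the behavior the round-robin key is meant to guarantee.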

Parameters: overload_factor=2.0 (default). The previously introduced decode_iteration_s / prefill_throughput / rdma_overhead_s are kept in Settings but no longer drive routing; they were inputs to Approach A.

Why this shape

- LMetric for load balance: P_tokens × BS is hyperparameter-free and captures both pending prefill work and current batch contention.
- Implicit soft affinity from LMetric itself: P_tokens includes new_uncached_tokens = input - cache_hit. Later turns naturally prefer the instance that already cached the prefix, because their P_tokens are smaller there. This is the dominant reason explicit migration buys little.
- Explicit affinity only for the long-cache case: when cache_ratio > 0.5, the placement cost of breaking sticky is large enough to justify a hard gate. Below that ratio, defer to LMetric.
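A small worked example may make the implicit soft affinity concrete. The token counts and batch sizes below are invented for illustration, and `lmetric_score` is a hypothetical helper name; only the formula comes from the algorithm above.

```python
def lmetric_score(pending_prefill_tokens: int, input_length: int,
                  cache_hit: int, num_requests: int) -> int:
    """P_tokens * BS, with P_tokens = pending prefill + new uncached tokens."""
    new_uncached_tokens = max(0, input_length - cache_hit)
    return (pending_prefill_tokens + new_uncached_tokens) * num_requests


INPUT_LENGTH = 4000  # later-turn request, illustrative size

# Instances A and B carry the same load: 1000 pending prefill tokens, BS = 4.
# Case 1: A already caches 3000 tokens of the prefix, B caches none.
score_a = lmetric_score(1000, INPUT_LENGTH, 3000, 4)   # (1000 + 1000) * 4
score_b = lmetric_score(1000, INPUT_LENGTH, 0, 4)      # (1000 + 4000) * 4
cache_ratio_a = 3000 / INPUT_LENGTH                    # 0.75, above the 0.5 gate

# Case 2: A caches only 1200 tokens, so cache_ratio is 0.3 and the explicit
# gate does not fire. LMetric alone still prefers A.
score_a_low = lmetric_score(1000, INPUT_LENGTH, 1200, 4)  # (1000 + 2800) * 4

print(score_a, score_b, score_a_low)  # 8000 20000 15200
```

In both cases the cache owner has the lower score at equal load, so the request stays put without any explicit affinity rule; the gate only matters when the owner's load would otherwise outweigh the cache saving.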

What Was Retired and Why

| Commit | Approach | Outcome |
| --- | --- | --- |
| `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
| `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
| `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, E2E p90 +12%, p99 +24% |
| `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, HEAVY TTFT p90 15.9s → 59.1s |
| `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" |
| `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
| `4c583f2` | Revert relaxed gate | 134 offloads, E2E p90 37s → 82s |
| `255c8e6` | Current hybrid LMetric + high-cache affinity | Stable baseline |

The shared lesson across the retired variants: a PD-sep offload pays C_queue + C_prefill + RDMA + D_schedule + D_decode_start, and the prefill time saved on D rarely amortizes that overhead. This holds especially because 92% of HEAVY requests are turn-1 cold, so there is no source-side cache to migrate. See analysis/elastic_hypotheses.md H3-H9 for the per-variant evidence.
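The amortization argument can be written as a simple inequality: an offload only helps when the prefill time it saves exceeds the sum of the five overhead terms. The sketch below assumes all terms are measured in seconds; the `OffloadCost` class, the `net_benefit_s` helper, and every number are hypothetical and chosen only to show the arithmetic, not taken from any measurement.

```python
from dataclasses import dataclass


@dataclass
class OffloadCost:
    """Overhead terms of one PD-sep offload, in seconds (illustrative)."""
    c_queue_s: float
    c_prefill_s: float
    rdma_s: float
    d_schedule_s: float
    d_decode_start_s: float

    def total(self) -> float:
        return (self.c_queue_s + self.c_prefill_s + self.rdma_s
                + self.d_schedule_s + self.d_decode_start_s)


def net_benefit_s(cost: OffloadCost, saved_prefill_s: float) -> float:
    """Positive means the offload amortizes; negative means it loses time."""
    return saved_prefill_s - cost.total()


# Made-up values: 2.5 s of total overhead against 1.5 s of saved prefill.
example = OffloadCost(c_queue_s=0.5, c_prefill_s=1.0, rdma_s=0.25,
                      d_schedule_s=0.25, d_decode_start_s=0.5)
print(net_benefit_s(example, saved_prefill_s=1.5))  # -1.0
```

For a turn-1 cold request there is no cached prefix to reuse, so the saved prefill term tends toward zero and the net benefit is negative by roughly the full overhead.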

Historical Baseline-Mode Comparison (Approach A)

These numbers are from the additive-cost-model variant of Unified routing (before 255c8e6). Kept for reference; the hybrid currently lives on top of LMetric, not on this additive cost. The "Unified" column should not be cited as the current implementation.

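| Metric | LMetric | Unified (Approach A, historical) | Change |
| --- | --- | --- | --- |
| E2E mean (s) | 18.204 | 17.831 | -2.0% |
| E2E p50 (s) | 6.184 | 6.074 | -1.8% |
| E2E p90 (s) | 39.438 | 37.073 | -6.0% |
| TTFT p90 (s) | 9.331 | 8.034 | -13.9% |
| Errors | 0 | 0 | n/a |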
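The Change column is the relative difference of the historical Unified value against the LMetric value. A short check that recomputes it from the table values (the `pct_change` helper and the `rows` dictionary are names introduced for this example):

```python
def pct_change(baseline: float, candidate: float) -> float:
    """Relative change of candidate versus baseline, in percent."""
    return (candidate - baseline) / baseline * 100.0


# (LMetric, Unified Approach A) pairs copied from the table above.
rows = {
    "E2E mean": (18.204, 17.831),
    "E2E p50": (6.184, 6.074),
    "E2E p90": (39.438, 37.073),
    "TTFT p90": (9.331, 8.034),
}

changes = {metric: f"{pct_change(lm, unified):+.1f}%"
           for metric, (lm, unified) in rows.items()}
for metric, change in changes.items():
    print(f"{metric}: {change}")
```

The recomputed values match the table to one decimal place: -2.0%, -1.8%, -6.0%, and -13.9%.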

Approach A (additive cost model, historical):

cost(instance_i) = num_requests_i × decode_iteration_s     # contention
                 + pending_prefill_tokens_i / throughput     # prefill queue
                 + max(0, input - cache_hit_i) / throughput  # new prefill
if affinity instance exists:
    gate 1: ongoing_tokens <= avg * overload_factor
    gate 2: affinity_cost <= global_best * overload_factor
    if both pass → affinity
    else        → global best

Reason for retirement: the additive cost depended heavily on the hand-set decode_iteration_s and prefill_throughput constants. LMetric reproduces the same ordering without those constants because P_tokens and BS already capture both prefill queue and batch contention. The hybrid keeps the cheap LMetric core and adds an explicit affinity gate only for high-cache cases.
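For comparison with the current hybrid, the historical Approach A cost and its two gates can be sketched as below. This is a reconstruction from the pseudocode above, not the retired code: the `InstanceState` fields, the reading of `avg` as the mean of `ongoing_tokens` across instances, and the default constants (0.05 and 7000, used here purely as placeholder values) are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstanceState:
    name: str
    num_requests: int
    pending_prefill_tokens: int
    ongoing_tokens: int
    cache_hit: int  # cached prefix tokens for the request being routed


def additive_cost(inst: InstanceState, input_length: int,
                  decode_iteration_s: float, prefill_throughput: float) -> float:
    contention = inst.num_requests * decode_iteration_s
    prefill_queue = inst.pending_prefill_tokens / prefill_throughput
    new_prefill = max(0, input_length - inst.cache_hit) / prefill_throughput
    return contention + prefill_queue + new_prefill


def route_approach_a(instances: List[InstanceState], input_length: int,
                     affinity: Optional[str] = None, *,
                     decode_iteration_s: float = 0.05,      # placeholder value
                     prefill_throughput: float = 7000.0,    # placeholder value
                     overload_factor: float = 2.0) -> str:
    costs = {i.name: additive_cost(i, input_length, decode_iteration_s,
                                   prefill_throughput) for i in instances}
    best = min(instances, key=lambda i: costs[i.name])
    if affinity is not None:
        owner = next(i for i in instances if i.name == affinity)
        avg_ongoing = sum(i.ongoing_tokens for i in instances) / len(instances)
        gate_1 = owner.ongoing_tokens <= avg_ongoing * overload_factor
        gate_2 = costs[owner.name] <= costs[best.name] * overload_factor
        if gate_1 and gate_2:
            return owner.name   # both gates pass: stay on the affinity instance
    return best.name            # otherwise: global best


fleet = [
    InstanceState("a", num_requests=4, pending_prefill_tokens=7000,
                  ongoing_tokens=8000, cache_hit=3500),
    InstanceState("b", num_requests=1, pending_prefill_tokens=0,
                  ongoing_tokens=2000, cache_hit=0),
]
print(route_approach_a(fleet, 7000))                # global best
print(route_approach_a(fleet, 7000, affinity="a"))  # affinity wins within 2x
```

The example makes the dependence on the two constants visible: changing either one rescales the contention term against the prefill terms and can flip the ordering, which is the sensitivity the retirement note refers to.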

Evolution Table (historical)

| Version | Description | ALL TTFT p90 (s) | ALL E2E p90 (s) | tok max/min |
| --- | --- | --- | --- | --- |
| Baseline | linear routing | 16.058 | 52.292 | 2.7x |
| LMetric | P×BS, no affinity | 9.331 | 39.438 | 2.4x |
| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | n/a |
| Hybrid (current) | LMetric + high-cache affinity | pending re-run (#8) | pending re-run (#8) | n/a |

The current Hybrid row deliberately has no numbers: per analysis/unified_routing_fix_review.md #8, the small (≤2%) improvements need 3-5 paired multi-run trials before they can be quoted.
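One way to summarize such paired trials is to look at the per-trial difference rather than at the two means separately, since each pair shares the same trace and restart procedure. The sketch below uses only the standard library; `paired_summary` is a hypothetical helper and the sample values are synthetic placeholders, not measurements.

```python
from math import sqrt
from statistics import mean, stdev
from typing import Dict, Sequence


def paired_summary(baseline: Sequence[float],
                   candidate: Sequence[float]) -> Dict[str, float]:
    """Summarize candidate - baseline over paired trials (same trace per pair)."""
    if len(baseline) != len(candidate):
        raise ValueError("trials must be paired one-to-one")
    if len(baseline) < 2:
        raise ValueError("need at least two paired trials")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    return {
        "n": float(len(diffs)),
        "mean_diff": mean(diffs),
        "stdev_diff": sd,
        "std_err": sd / sqrt(len(diffs)),
        "rel_change_pct": 100.0 * mean(diffs) / mean(baseline),
    }


# Synthetic E2E p90 values in seconds for four paired trials.
lmetric_runs = [10.0, 12.0, 11.0, 13.0]
hybrid_runs = [9.5, 12.5, 10.0, 12.0]
summary = paired_summary(lmetric_runs, hybrid_runs)
print(summary)
```

With only 3-5 pairs the standard error is itself noisy, so a mean difference that is small relative to `std_err` should be treated as inconclusive rather than quoted as an improvement.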

Open Questions / Next Steps

- #8 paired multi-run: 3-5 fresh trials of LMetric vs current hybrid on identical trace + restart procedure, with full artifact bundle (config.json, metrics.jsonl, metrics.summary.json, breakdown.json, gpu_util.csv, per-instance APC snapshot, git commit hash).
- #10 production-concurrency track: re-introduce a controlled-concurrency flag (--max-inflight-sessions or equivalent) and rerun the comparison at 64 / 128 active sessions before drawing conclusions for production.
- Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the affinity instance is overloaded: in that scenario the overload gate rejects the affinity instance, but the LMetric fallback's tie-breaker re-elects it anyway because its new_uncached_tokens is 0. Either add a hard num_requests ceiling or accept the behavior. The case is still open in the test test_hybrid_high_cache_breaks_on_overload, which works only when the request has at least one uncached block.
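The last item can be reproduced with a few lines. The sketch below models only the LMetric fallback (the round-robin key is omitted because it never decides these cases), and the optional `hard_ceiling` filter is one possible shape for the proposed fix, not a decision the project has made. The instance values and the 16-token block size are illustrative assumptions.

```python
from typing import Dict, List, Optional, Tuple


def lmetric_key(inst: Dict[str, int], input_length: int) -> Tuple[int, int, int]:
    uncached = max(0, input_length - inst["cache_hit"])
    score = (inst["pending_prefill_tokens"] + uncached) * inst["num_requests"]
    return (score, uncached, inst["num_requests"])


def fallback(instances: List[Dict], input_length: int,
             hard_ceiling: Optional[float] = None) -> str:
    """LMetric fallback; optionally drop instances above a num_requests ceiling."""
    candidates = instances
    if hard_ceiling is not None:
        under = [i for i in instances if i["num_requests"] <= hard_ceiling]
        candidates = under or instances  # never end up with an empty set
    return min(candidates, key=lambda i: lmetric_key(i, input_length))["name"]


def fleet(owner_cache_hit: int) -> List[Dict]:
    return [
        # Overloaded cache owner: 10 requests against a fleet average of 4.
        {"name": "a", "pending_prefill_tokens": 0, "num_requests": 10,
         "cache_hit": owner_cache_hit},
        {"name": "b", "pending_prefill_tokens": 0, "num_requests": 0,
         "cache_hit": 0},
        {"name": "c", "pending_prefill_tokens": 0, "num_requests": 2,
         "cache_hit": 0},
    ]


INPUT_LENGTH = 1000
full_hit = fallback(fleet(1000), INPUT_LENGTH)               # cache_ratio = 1.0
with_ceiling = fallback(fleet(1000), INPUT_LENGTH, hard_ceiling=8.0)
one_block_miss = fallback(fleet(984), INPUT_LENGTH)          # 16 tokens uncached
print(full_hit, with_ceiling, one_block_miss)  # a b b
```

With a full cache hit the owner's score is 0 regardless of its batch size, so it ties with the idle instance and wins on the new_uncached_tokens key. A single uncached block is enough to make its score nonzero and let the idle instance win, which matches the condition under which the test currently works.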

Original "Rigorous Review Summary" (historical)

Independent review of the Approach A result above:

- CLEAN: Fair comparison (identical vLLM/proxy/trace/measurement)
- CLEAN: No reward hacking (improvement from algorithmic difference)
- WARNING: 2% mean improvement needs multi-run verification (3-5 runs)
- NOTE: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate
