agentic-kvc/docs/migration-policy-design.md
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00

Routing & Migration Policy: Design Log

This file is the active reference for the routing policy. It supersedes the "single argmin + PUSH migration" framing once described as the final design (see commit notes below and REPORT.md §3.9 errata).

Current Algorithm: Hybrid LMetric + High-Cache Affinity

Implemented in 255c8e6. Active under --policy unified in scripts/cache_aware_proxy.py.

# Step 1: affinity gate (only for sessions that have a recorded owner)
if session has affinity instance:
    cache_ratio = cache_hit_on_affinity / input_length
    gate_1: cache_ratio > 0.5
    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
    if gate_1 AND gate_2:
        decision = "affinity"
        return affinity_instance

# Step 2: LMetric fallback with deterministic tie-breaker
for each instance i:
    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
            = P_tokens * BS                                     # primary
secondary key: new_uncached_tokens                              # prefer cache
tertiary key:  num_requests                                     # prefer idle
quaternary:    round-robin counter % n                          # break ties
return argmin

The pure --policy lmetric baseline stays affinity-free; the hybrid lives entirely under --policy unified. The round-robin counter is required because P_tokens * BS = 0 whenever BS = 0 for all instances (new sessions, cold start), which would otherwise pin every fresh session to instance 0.
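The two steps above can be sketched in Python as follows. This is a minimal illustration, not the code in scripts/cache_aware_proxy.py: the `Instance` dataclass, the `HybridRouter` class, and the way the round-robin counter is turned into a rotating index preference are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Instance:
    # Hypothetical per-instance snapshot; field names mirror the pseudocode.
    pending_prefill_tokens: int
    num_requests: int          # BS: requests currently on the instance
    cache_hit: int             # tokens of this request already cached here


class HybridRouter:
    def __init__(self, overload_factor: float = 2.0) -> None:
        self.overload_factor = overload_factor
        self._rr = 0  # round-robin counter, used only to break exact ties

    def route(self, instances: Sequence[Instance], input_length: int,
              affinity: Optional[int] = None) -> Tuple[int, str]:
        n = len(instances)

        # Step 1: affinity gate, only for sessions with a recorded owner.
        if affinity is not None and input_length > 0:
            owner = instances[affinity]
            cache_ratio = owner.cache_hit / input_length
            avg_requests = sum(i.num_requests for i in instances) / n
            if (cache_ratio > 0.5
                    and owner.num_requests <= avg_requests * self.overload_factor):
                return affinity, "affinity"

        # Step 2: LMetric fallback with a deterministic tie-breaker.
        start = self._rr % n
        self._rr += 1

        def key(idx: int):
            inst = instances[idx]
            uncached = max(0, input_length - inst.cache_hit)
            p_tokens = inst.pending_prefill_tokens + uncached
            return (
                p_tokens * inst.num_requests,  # primary: P_tokens * BS
                uncached,                      # secondary: prefer cache
                inst.num_requests,             # tertiary: prefer idle
                (idx - start) % n,             # quaternary: rotate on ties
            )

        return min(range(n), key=key), "lmetric"
```

With three cold instances (all counters zero), three successive `route` calls on one router return indices 0, 1, 2 in turn, so fresh sessions are not all pinned to instance 0.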

Parameters: overload_factor=2.0 (default). The previously introduced decode_iteration_s / prefill_throughput / rdma_overhead_s fields are kept in Settings but no longer drive routing; they were Approach-A inputs.

Why this shape

  • LMetric for load balance: P_tokens × BS is hyperparameter-free and captures both pending prefill work and current batch contention.
  • Implicit soft affinity from LMetric itself: P_tokens includes new_uncached_tokens = input - cache_hit. Later turns naturally prefer the instance that already cached the prefix, because their P_tokens are smaller there. This is the dominant reason explicit migration buys little.
  • Explicit affinity only for the long-cache case: when cache_ratio > 0.5, more than half of the input is already cached on the affinity instance, so the cost of breaking session stickiness is large enough to justify a hard gate. Below that ratio, defer to LMetric.
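The implicit soft affinity can be seen with a small worked example. The request sizes and load figures below are made up for illustration, and `lmetric_score` is a hypothetical helper that only restates the P_tokens × BS formula.

```python
def lmetric_score(pending_prefill_tokens: int, num_requests: int,
                  input_length: int, cache_hit: int) -> int:
    """P_tokens * BS, where P_tokens includes the request's uncached tokens."""
    new_uncached_tokens = max(0, input_length - cache_hit)
    return (pending_prefill_tokens + new_uncached_tokens) * num_requests


# Hypothetical later-turn request: 8000 input tokens, 6000 cached on instance A.
# Both instances carry the same load (1000 pending prefill tokens, 4 requests).
score_a = lmetric_score(1000, 4, input_length=8000, cache_hit=6000)  # 3000 * 4
score_b = lmetric_score(1000, 4, input_length=8000, cache_hit=0)     # 9000 * 4

# The affinity is soft: A still wins at 6 requests, but loses at 13.
score_a_busy = lmetric_score(1000, 6, input_length=8000, cache_hit=6000)
score_a_overloaded = lmetric_score(1000, 13, input_length=8000, cache_hit=6000)
```

At equal load the caching instance scores 12000 against 36000, so it is preferred without any explicit affinity rule. Once its batch grows enough (13 requests in this example, score 39000), the preference flips to the other instance.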

What Was Retired and Why

| Commit | Approach | Outcome |
|---|---|---|
| 6b255fa | Single argmin(queue+prefill+transfer) over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
| 5892739 | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
| 2b9eae0 | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, E2E p90 +12%, p99 +24% |
| e991960/5772149 | Forced session migration triggers (Approach B) | 57 migrations, HEAVY TTFT p90 15.9s → 59.1s |
| cc6e562 | Revert Approach B | "overhead exceeds LB benefit" |
| bf4469a | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
| 4c583f2 | Revert relaxed gate | 134 offloads, E2E p90 37s → 82s |
| 255c8e6 | Current hybrid LMetric + high-cache affinity | Stable baseline |

The shared lesson across the retired variants: a PD-sep offload pays C_queue + C_prefill + RDMA + D_schedule + D_decode_start, and the prefill time saved on D rarely amortizes that overhead. This holds especially because 92% of HEAVY requests are turn-1 cold (no source-side cache to migrate). See analysis/elastic_hypotheses.md H3-H9 for the per-variant evidence.
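The amortization argument can be written as a simple inequality: an offload only helps when the saved prefill time exceeds the sum of the five overhead terms. The sketch below uses placeholder timings chosen for illustration, not measurements from any run, and `offload_net_gain_s` is a hypothetical helper. A turn-1 cold request is modelled here as saving no prefill time, which is an assumption based on there being no source-side cache to migrate.

```python
def offload_net_gain_s(saved_prefill_s: float, c_queue_s: float,
                       c_prefill_s: float, rdma_s: float,
                       d_schedule_s: float, d_decode_start_s: float) -> float:
    """Net benefit of one offload in seconds; negative means it costs more than it saves."""
    overhead_s = c_queue_s + c_prefill_s + rdma_s + d_schedule_s + d_decode_start_s
    return saved_prefill_s - overhead_s


# Placeholder overhead terms (illustrative only, not measured values).
overhead = dict(c_queue_s=0.2, c_prefill_s=0.5, rdma_s=0.1,
                d_schedule_s=0.05, d_decode_start_s=0.05)

cold_turn1 = offload_net_gain_s(saved_prefill_s=0.0, **overhead)  # nothing cached to reuse
warm_turn = offload_net_gain_s(saved_prefill_s=1.5, **overhead)   # large cached prefix
```

Under these placeholder numbers the cold request comes out negative and only the warm one comes out positive, which matches the shape of the argument above: if most heavy requests are cold, most offloads cannot pay for themselves.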

Historical Baseline-Mode Comparison (Approach A)

These numbers are from the additive-cost-model variant of Unified routing (before 255c8e6). Kept for reference; the hybrid currently lives on top of LMetric, not on this additive cost. The "Unified" column should not be cited as the current implementation.

| Metric | LMetric | Unified (Approach A, historical) | Change |
|---|---|---|---|
| E2E mean | 18.204 | 17.831 | -2.0% |
| E2E p50 | 6.184 | 6.074 | -1.8% |
| E2E p90 | 39.438 | 37.073 | -6.0% |
| TTFT p90 | 9.331 | 8.034 | -13.9% |
| Errors | 0 | 0 | |

Approach A (additive cost model, historical):

cost(instance_i) = num_requests_i × decode_iteration_s     # contention
                 + pending_prefill_tokens_i / throughput     # prefill queue
                 + max(0, input - cache_hit_i) / throughput  # new prefill
if affinity instance exists:
    gate 1: ongoing_tokens <= avg * overload_factor
    gate 2: affinity_cost <= global_best * overload_factor
    if both pass → affinity
    else         → global best

Reason for retirement: the additive cost depended heavily on the hardcoded decode_iteration_s and prefill_throughput constants. LMetric reproduces the same ordering without those constants because P_tokens and BS already capture both prefill queue and batch contention. The hybrid keeps the cheap LMetric core and adds an explicit affinity gate only for high-cache cases.

Evolution Table (historical)

| Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
|---|---|---|---|---|
| Baseline | linear routing | 16.058 | 52.292 | 2.7x |
| LMetric | P×BS, no affinity | 9.331 | 39.438 | 2.4x |
| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | |
| Hybrid (current) | LMetric + high-cache affinity | see §8 re-run | see §8 re-run | |

The current Hybrid row deliberately has no number: per analysis/unified_routing_fix_review.md #8, the small (≤2%) improvements need 3-5 paired multi-run trials before being quoted.
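One way the paired trials could be summarized is with per-trial differences rather than a comparison of two pooled means. The sketch below is only a suggestion: `paired_summary` is a hypothetical helper, and which metric to feed it (for example E2E p90 per trial) is left open.

```python
import math
import statistics
from typing import Dict, Optional, Sequence


def paired_summary(baseline: Sequence[float],
                   candidate: Sequence[float]) -> Dict[str, Optional[float]]:
    """Summarize paired trials: trial k of baseline is matched with trial k of candidate."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need at least two paired trials of equal count")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    mean_diff = statistics.mean(diffs)
    stderr = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return {
        "mean_diff": mean_diff,
        "rel_change": mean_diff / statistics.mean(baseline),
        "stderr": stderr,
        # Paired t statistic with len(diffs) - 1 degrees of freedom;
        # None when every trial shows the identical difference.
        "t_stat": mean_diff / stderr if stderr > 0 else None,
    }
```

With only 3-5 trials the t statistic has very few degrees of freedom, so a change of a couple of percent needs consistently signed differences across trials before it is worth quoting.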

Open Questions / Next Steps

  • #8 paired multi-run: 3-5 fresh trials of LMetric vs current hybrid on identical trace + restart procedure, with full artifact bundle (config.json, metrics.jsonl, metrics.summary.json, breakdown.json, gpu_util.csv, per-instance APC snapshot, git commit hash).
  • #10 production-concurrency track: re-introduce a controlled-concurrency flag (--max-inflight-sessions or equivalent) and rerun the comparison at 64 / 128 active sessions before drawing conclusions for production.
  • Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the affinity instance is overloaded: the affinity gate correctly rejects the instance, but the LMetric fallback's tie-breaker re-elects it because its new_uncached_tokens is 0. Either add a hard num_requests ceiling or accept the behavior. This is still open in the test test_hybrid_high_cache_breaks_on_overload, which works only when the request has at least one uncached block.
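The cache_ratio = 1.0 edge case in the last item can be reproduced with a stripped-down version of the LMetric fallback. Everything here is a sketch under assumptions: instances are plain `(pending_prefill_tokens, num_requests, cache_hit)` tuples, the round-robin key is replaced by the instance index for brevity, and `max_requests` is a hypothetical parameter standing in for the proposed hard ceiling.

```python
from typing import Optional, Sequence, Tuple

State = Tuple[int, int, int]  # (pending_prefill_tokens, num_requests, cache_hit)


def lmetric_pick(instances: Sequence[State], input_length: int,
                 max_requests: Optional[int] = None) -> int:
    def key(idx: int):
        pending, bs, hit = instances[idx]
        uncached = max(0, input_length - hit)
        return ((pending + uncached) * bs, uncached, bs, idx)

    # Optional hard ceiling: drop instances above max_requests, unless that
    # would leave no candidate at all.
    candidates = [i for i, (_, bs, _) in enumerate(instances)
                  if max_requests is None or bs <= max_requests]
    return min(candidates or range(len(instances)), key=key)


# Owner (index 0) holds the whole 4096-token prefix but runs 10 requests;
# the other two instances are idle.
fully_cached = [(0, 10, 4096), (0, 0, 0), (0, 0, 0)]
# Same situation, but the last 16 tokens of the request are not cached.
one_block_uncached = [(0, 10, 4080), (0, 0, 0), (0, 0, 0)]
```

In the fully cached case every instance scores 0 on the primary key, and the secondary key (new_uncached_tokens) hands the request back to the overloaded owner; passing `max_requests=8` moves it to an idle instance instead. With a single uncached block the owner's score becomes 16 × 10 = 160 and the fallback already picks an idle instance without any ceiling.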

Original "Rigorous Review Summary" (historical)

Independent review of the Approach A result above:

  • CLEAN: Fair comparison (identical vLLM/proxy/trace/measurement)
  • CLEAN: No reward hacking (improvement from algorithmic difference)
  • WARNING: 2% mean improvement needs multi-run verification (3-5 runs)
  • NOTE: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate