Docs: reconcile routing docs with current hybrid direction

Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit 255c8e6). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after cc6e562 / 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00
parent ac6534c3ff
commit 6a27f75337
5 changed files with 591 additions and 53 deletions
--- a/REPORT.md
+++ b/REPORT.md
@@ -27,6 +27,12 @@ For agentic LLM workloads (long input, short output, high KV cache reuse), is pr
 >   was removed when replay moved to trace-driven dispatch. The next-step
 >   experiment requires restoring the flag first (see `FIXES.md` §B2
 >   route A) before any production-concurrency numbers can be produced.
 > - **§3.9 "Final Design" framing**: the single-argmin + PUSH-migration
 >   design was retired after `cc6e562` / `4c583f2` showed forced and
 >   relaxed-gate migration variants both regressed E2E tail. Current
 >   policy is the hybrid LMetric + high-cache affinity landed in
 >   `255c8e6`. See the per-section note in §3.9 and the active algorithm
 >   in `docs/migration-policy-design.md`.
 >
 > The authoritative results are in **§3.6 and §3.7**.
@@ -356,7 +362,23 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall
 **Output**: `outputs/eval_direct_rdma_v*/` on dash0.
-### 3.9 Unified Routing (Final Design)
+### 3.9 Unified Routing (Historical — superseded)
 > **Superseded by git history.** The "single argmin + PUSH migration" design
 > described here was implemented in `6b255fa`, refined through
 > `5892739` (soft affinity), `2b9eae0` (numbers below), and `4b50c5a`
 > (queue/overload-gate fixes). Follow-on attempts to scale migration —
 > `e991960`/`5772149` (forced session migration) and `bf4469a` (relaxed
 > push gate) — were both reverted (`cc6e562`, `4c583f2`) after they
 > regressed E2E tail (57 migrations → HEAVY TTFT p90 15.9s → 59.1s;
 > 134 offloads → E2E p90 37s → 82s).
 >
 > Current implementation is the **hybrid LMetric + high-cache affinity**
 > direction landed in `255c8e6`. See `docs/migration-policy-design.md`
 > for the active algorithm and `analysis/unified_routing_fix_review.md`
 > for the reasoning. The numbers below remain valid for the
 > `eval_unified_v3` artifact; do not treat them as the current
 > production policy.
 Replaced two-phase routing (pick_instance → offload gate) with single `argmin(expected_latency)` per instance:
--- a/analysis/elastic_hypotheses.md
+++ b/analysis/elastic_hypotheses.md
@@ -244,7 +244,11 @@ Offloaded:      —               13/500 (2.6%)   too few to matter
 ### What DOESN'T work for agentic workloads:
 1. **PD-Sep**: net negative — KV cache memory wall on decode instances
-2. **LMetric (OSDI'26)**: ≈ linear routing — session affinity limits routing freedom
+2. **LMetric (OSDI'26)**: ≈ linear routing — `P_tokens` already includes
   `new_uncached_tokens`, so cache-hit scoring gives LMetric an implicit
   soft affinity that converges to similar placements as explicit sticky
   affinity (see `analysis/research_findings.md` §2.2 for the corrected
   framing)
 3. **Elastic P2P RDMA offload**: net negative — Mooncake transfer overhead, no layerwise pipeline
 4. **OVERLOAD_FACTOR tuning**: no effect — imbalance from workload skew, not routing
 5. **Dedicated Prefill Service (PS)**: cannot win cost comparison without KV pull, PS is always slower than cached C
@@ -270,3 +274,21 @@ Instead of fixed chunk size, dynamically adjust based on decode pressure:
 - When decode queue is deep: smaller chunks → more decode slots → better TPOT
 - When decode queue is empty: larger chunks → faster prefill → better TTFT
 - This is a vLLM scheduler modification, not a routing change
 ---
 ## Current routing direction (cross-reference)
 The hypotheses above produced the following positive results that informed
 the current `--policy unified` implementation:
 - H1 / H7 / H9 (negative): PD-sep offload, OVERLOAD_FACTOR tuning, and
  elastic RDMA at high concurrency all regressed or stayed within noise.
 - H3 / H4 / H6 (partial): cache-gated offload exists but only ~10-12% of
  HEAVY requests have cache, and the offloaded subset pays RDMA penalty.
 The active algorithm (commit `255c8e6`) is **hybrid LMetric + high-cache
 affinity** in baseline mode (no Mooncake). The retired migration variants
 are catalogued in `docs/migration-policy-design.md` (Approach A and the
 revert chain `cc6e562` / `4c583f2`). H7's rejection (OVERLOAD_FACTOR within
 noise) is why the active default stays at `overload_factor=2.0`.
--- a/analysis/research_findings.md
+++ b/analysis/research_findings.md
@@ -38,7 +38,24 @@ These characteristics fundamentally change what optimizations matter.
 **Setup**: 8 instances, LMetric vs linear routing
 **Result**: TTFT +2.2%, TPOT -4.4%, E2E +2.6% — all within noise (±7% run-to-run)
-**Root cause**: Session affinity constrains routing freedom. LMetric's benefit (hyperparameter-free load balancing) is neutralized because turn 2+ requests MUST go to their session-sticky instance regardless of the scoring function. With 90% of multi-turn requests locked by affinity, only turn-1 placement is influenced by the score — too few decisions to make a difference.
+**Root cause (updated)**: LMetric is not "neutralized by affinity
 constraints" — pure `--policy lmetric` runs without session affinity at all.
 The actual reason the LMetric vs linear comparison sits within noise is that
 `P_tokens` already includes `new_uncached_tokens = input_length - cache_hit`,
 which means later turns of a session naturally score lowest on the instance
 that cached their prefix. This gives LMetric an **implicit soft affinity**
 that competes with linear's explicit sticky affinity. The two arrive at
 similar placements through different mechanisms.
 This is also why explicit migration buys little on top of LMetric: the
 first-order signal driving placement is already cache-derived. See
 `docs/migration-policy-design.md` for how the current hybrid policy uses
 this insight (LMetric base + explicit affinity only when `cache_ratio > 0.5`).
 **Previous framing (incorrect)**: an earlier draft of this section attributed
 the result to session affinity constraining LMetric's routing freedom. That
 framing assumed `--policy lmetric` inherited the linear-mode session-sticky
 behavior, which it does not (verified in `tests/test_proxy_pick.py`).
 ### 2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance)
@@ -148,7 +165,9 @@ This changes the scheduling picture: most "HEAVY" requests in agentic workloads
 ### Why existing approaches don't work:
 1. **PD-Sep** assumes decode needs dedicated resources → agentic has memory wall on decode
-2. **LMetric** assumes routing freedom → agentic has session affinity constraints
+2. **LMetric** matches linear within noise because cache-hit appears in
   `P_tokens` itself, so it already routes later turns back to the cached
   instance via implicit soft affinity — explicit affinity buys little
 3. **Elastic RDMA** assumes KV transfer is cheap → Mooncake lacks layerwise pipelining
 4. **Size-based classification** assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM
--- a/analysis/unified_routing_fix_review.md
+++ b/analysis/unified_routing_fix_review.md
@@ -0,0 +1,406 @@
 # Unified Routing Fix Review Handoff
 Date: 2026-05-25
 This is the corrected review handoff after reading the git history. The key
 change from the previous draft is that we should **not** restore the old
 single-argmin / PUSH-migration design. That path was implemented, measured,
 and then discarded or simplified by later commits.
 ## Executive Summary
 The latest commit history says the current direction is:
 - Use baseline mode / no Mooncake for the winning comparison against LMetric.
 - Use LMetric-style load balancing as the base.
 - Add explicit affinity only for sessions with high accumulated cache.
 - Do not re-enable PD-sep offload or session migration unless the transfer
  mechanism is fundamentally reworked.
 The main fixes now are cleanup, documentation consistency, tests, and
 reproducibility. The biggest risk is that stale docs still describe abandoned
 schemes as "final design".
 ## Evidence From Git History
 Relevant commits:
 | Commit | Meaning | Outcome |
 |---|---|---|
 | `6b255fa` | Implemented single `argmin(expected_latency)` over local / PUSH / cold paths | Later superseded |
 | `5892739` | Added soft affinity because pure argmin overloaded cache-source instances | Pure argmin was unstable |
 | `2b9eae0` | Reported Unified v3 with 116 PUSH migrations | TTFT improved, but TPOT/E2E tail tradeoff existed |
 | `cdf8349` | Added real cache sync and cached-prefill-on-C architecture | Fixed false PUSH and direct-read issues |
 | `4b50c5a` | Fixed queue model and hard overload gate | Reduced imbalance, but still on offload path |
 | `e991960` / `5772149` | Tried forced session migration triggers | Reverted |
 | `cc6e562` | Reverted Approach B migration | 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s |
 | `bf4469a` | Tried more accurate push cost / gate alignment | Later reverted |
 | `4c583f2` | Reverted relaxed gate / push-cost fix | 134 offloads made E2E p90 37s -> 82s |
 | `448361c` | Updated design doc: baseline no-Mooncake Unified beats LMetric | PD-sep offload degrades |
 | `255c8e6` | Replaced full cost model with hybrid routing | Current direction: LMetric LB + high-cache affinity |
 Do not ask the implementer to "restore real three-way argmin". That was the
 wrong instruction in the previous draft.
 ## Current Intended Algorithm
 From `255c8e6`, the current algorithm should be documented as:
 ```text
 if session has an affinity instance:
    if cache_ratio_on_affinity > 0.5
       and affinity_instance.num_requests <= avg_num_requests * overload_factor:
        route to affinity instance
    else:
        route by LMetric
 else:
    route by LMetric
 ```
 LMetric remains:
 ```text
 score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
 new_uncached_tokens = input_length - estimated_cache_hit
 ```
 This is no longer the old expected-latency migration model.
 ## P0 Fixes
 ### 1. Remove Stale PUSH-Migration Code From the Current Unified Branch
 Location: `scripts/cache_aware_proxy.py`, `_handle_combined`.
 Problem:
 After `255c8e6`, `unified` is a hybrid policy, but the function still contains
 an unreachable block guarded by `best_needs_push = False`. That block references
 variables from the removed cost model such as `best_cache_idx`,
 `best_cache_hit`, and `_current_offloads`.
 Fix:
 - In the `unified` branch, delete the unreachable `if best_needs_push:` block.
 - Keep helper functions like `_handle_cached_prefill_offload` only if another
  live mode still calls them.
 - Update the `_handle_combined` docstring so it no longer says Unified always
  computes `queue + prefill + transfer`.
 Why:
 The code is currently safe only because the branch is unreachable. Leaving it
 there makes future changes dangerous and makes reviewers think migration is
 still part of the active policy.
 Verification:
 - `rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py`
  should show no stale references inside the active `unified` path.
 - `pytest -q`.
 ### 2. Reconcile Documentation With the Latest Commits
 Locations:
 - `REPORT.md`
 - `docs/migration-policy-design.md`
 - `analysis/research_findings.md`
 - `analysis/elastic_hypotheses.md`
 Problem:
 Several docs still present old or contradictory conclusions:
 - `REPORT.md` section 3.9 still calls single argmin / PUSH migration the
  "Final Design".
 - `docs/migration-policy-design.md` describes a baseline-mode additive cost
  model, while HEAD implements hybrid LMetric + high-cache affinity.
 - `analysis/research_findings.md` says LMetric is neutralized by session
  affinity constraints, but later corrected LMetric results say cache-hit
  scoring creates implicit soft affinity.
 Fix:
 - Add an explicit "Superseded by git history" note near `REPORT.md` section
  3.9:
  - `6b255fa/5892739/2b9eae0` were explored.
  - `cc6e562` rejected forced migration.
  - `4c583f2` rejected relaxed offload / high-offload configurations.
  - `255c8e6` is the current implementation direction.
 - Update `docs/migration-policy-design.md` to either:
  - describe the current hybrid algorithm, or
  - clearly state that the additive cost model is the previous Approach A and
    not the latest code.
 - Mark old analysis sections as superseded rather than deleting them.
 Why:
 The docs caused the previous bad review recommendation. Future reviewers need
 to know which ideas were already tested and rejected.
 Verification:
 - `rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis`
  should show clear superseded labels where appropriate.
 ### 3. Preserve the LMetric Baseline Separately From Unified Hybrid
 Location: `scripts/cache_aware_proxy.py`.
 Problem:
 Pure `--policy lmetric` is the baseline being compared against. Unified hybrid
 uses LMetric internally but should not accidentally change the LMetric baseline
 behavior.
 Fix:
 - Keep `pick_instance_lmetric` as the pure corrected LMetric implementation.
 - Put any hybrid-specific tie-breakers or affinity logic outside
  `pick_instance_lmetric`, under `--policy unified`.
 - In breakdown logs, record whether a Unified request used `affinity` or
  `lmetric_fallback`.
 Why:
 If the baseline and Unified share hidden behavior, future A/B comparisons become
 invalid.
 Verification:
 - Unit test: `--policy lmetric` never uses session affinity.
 - Unit test: `--policy unified` can use affinity only when cache ratio and load
  gates pass.
 ## P1 Fixes
 ### 4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy
 Problem:
 LMetric score is `P_tokens * BS`. When `BS = num_requests = 0`, every instance
 gets score 0, so tie-break chooses instance 0. `docs/migration-policy-design.md`
 explicitly lists avoiding this as one reason Unified beats LMetric.
 The latest hybrid falls back to LMetric, so it may reintroduce this issue for
 new sessions when all instances are idle.
 Fix:
 - Do not change the pure `--policy lmetric` baseline.
 - For `--policy unified` fallback only, add a deterministic secondary key:
  - primary: `P_tokens * BS`
  - secondary when scores tie: `new_uncached_tokens`, then `num_requests`, then
    a round-robin or least-recently-used instance index.
 - Record tie-break count in breakdown or stats.
 Why:
 This preserves fair LMetric comparison while preventing Unified hybrid from
 degenerating to instance 0 under empty or near-empty load.
 Verification:
 - Unit test with all `num_requests = 0`: Unified should not always choose index
  0 across repeated new sessions.
 - Confirm pure LMetric test still matches the OSDI-style baseline.
 ### 5. Add Tests for the Current Hybrid Policy
 Problem:
 Existing tests cover older `pick_instance` and pure LMetric behavior, but not
 the current `unified` branch introduced by `255c8e6`.
 Fix:
 Add tests for:
 - High-cache session sticks to affinity instance when not overloaded.
 - High-cache session breaks affinity when `num_requests` exceeds the overload
  gate.
 - Low-cache session falls back to LMetric.
 - New session falls back to LMetric with the Unified-specific tie-breaker.
 - Breakdown policy is recorded as `affinity` or `lmetric_fallback`.
 Why:
 This prevents future drift back toward the discarded migration cost model or
 accidental changes to the LMetric baseline.
 Verification:
 - `pytest -q`.
 - Tests should run without live vLLM.
 ### 6. Treat `overload_factor` Changes as Experiments, Not Silent Fixes
 Observation:
 The current worktree has an uncommitted change:
 ```text
 Settings.overload_factor: 2.0 -> 1.5
 ```
 But the earlier H7 sweep found overload-factor tuning largely ineffective /
 within noise. This is recorded in `analysis/elastic_hypotheses.md`.
 Fix:
 - Do not silently commit a default change to `1.5` without a paired benchmark.
 - If testing `1.5`, make it an experiment tag/config value, not a new default.
 - Keep docs and CLI defaults synchronized.
 Why:
 The prior experiment says imbalance was mostly workload/session skew, not the
 threshold. A default change without evidence will create another reproducibility
 gap.
 Verification:
 - Run paired `2.0` vs `1.5` after the hybrid tests exist.
 - Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance.
 ### 7. Standardize Breakdown Fields for Hybrid Routing
 Problem:
 The current breakdown logs do not clearly expose why Unified chose affinity vs
 LMetric fallback.
 Fix:
 For each request under `--policy unified`, log:
 - `policy: "unified"`
 - `decision: "affinity" | "lmetric_fallback"`
 - `affinity_idx`
 - `chosen_idx`
 - `affinity_cache_hit`
 - `affinity_cache_ratio`
 - `affinity_num_requests`
 - `avg_num_requests`
 - `fallback_score`
 - `tie_break_used`
 Why:
 The latest performance difference versus LMetric is small. Without decision
 logs, it is hard to tell whether Unified is actually exercising the intended
 high-cache affinity behavior.
 Verification:
 - `breakdown.json` can answer: how many requests used affinity, how many used
  fallback, and what latency/APC each group saw.
 ## P2 Experiment Fixes
 ### 8. Re-run Paired LMetric vs Unified Hybrid Benchmarks
 Problem:
 `docs/migration-policy-design.md` says the 2% mean improvement needs multi-run
 verification. Local raw outputs for the May 25 final comparison are not present
 in this workspace.
 Fix:
 - Run 3-5 fresh paired trials.
 - Same trace, same vLLM build, same machine, same restart procedure.
 - Compare:
  - pure LMetric
  - current Unified hybrid
  - optionally Linear/session-sticky as a reference
 Metrics:
 - TTFT mean/p50/p90/p99
 - TPOT mean/p50/p90/p99
 - E2E mean/p50/p90/p99
 - errors/timeouts
 - aggregate APC
 - per-instance APC distribution
 - per-instance request count and token count
 - GPU util mean/std/imbalance
 Why:
 The current reported win is small enough that run-to-run noise matters.
 Verification:
 - Commit or save artifacts under a date/tagged output directory:
  `config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
  `gpu_util.csv`, final vLLM APC snapshot, git commit hash.
 ### 9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism
 Rejected paths:
 - Full PD separation: decode KV memory wall.
 - Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit.
 - Cache-gate offload: improves balance for colocated survivors but offloaded
  requests pay RDMA penalty.
 - Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse.
 - Relaxed gate / many offloads: 134 offloads made E2E p90 much worse.
 Future work can revisit migration only if one of these changes first:
 - layerwise / pipelined KV transfer
 - multi-machine P role so P work does not compete with D on the same GPU pool
 - exact vLLM state / cache residency exposed to the router
 - production-concurrency benchmark showing decode SLO pressure large enough to
  amortize transfer overhead
 Why:
 The same local mechanism has already failed multiple times. Repeating it with
 another threshold is unlikely to help.
 ### 10. Restore Production-Concurrency Evaluation as a Separate Track
 Problem:
 Several conclusions were made at 1-2 req/GPU, while production is estimated at
 8-15 req/GPU. Higher concurrency is where prefill-decode interference appears.
 Fix:
 - Restore or replace `--max-inflight-sessions` for controlled concurrency.
 - Run at 64 and 128 active sessions.
 - Treat this as a new experiment track, not as a reason to resurrect old
  migration code immediately.
 Why:
 At high concurrency, the design pressure may change. But the implementation
 should first prove the current hybrid baseline cleanly.
 ## Suggested Implementation Order
 1. Update docs to mark discarded migration/offload schemes as superseded.
 2. Remove stale unreachable PUSH code from the current Unified branch.
 3. Add tests for current Unified hybrid behavior.
 4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases.
 5. Add breakdown fields for hybrid routing decisions.
 6. Decide whether the local `overload_factor=1.5` diff is an experiment or
   should be dropped.
 7. Run paired multi-run LMetric vs Unified hybrid benchmarks.
 8. Only after that, open a separate high-concurrency experiment track.
 ## Review Checklist
 - Does the code still mention single `argmin(expected_latency)` as current
  behavior? If yes, update it or mark it superseded.
 - Is any migration/offload code reachable under `--policy unified`? If yes,
  require a new experiment plan because recent history rejects it.
 - Does pure `--policy lmetric` remain pure and affinity-free?
 - Does `--policy unified` clearly log when it used affinity vs LMetric fallback?
 - Are default parameter changes backed by paired results?
 - Can another reviewer reproduce the result from committed scripts and saved
  artifacts?
--- a/docs/migration-policy-design.md
+++ b/docs/migration-policy-design.md
@@ -1,62 +1,109 @@
-# Migration Policy Design: Improving Load Balance in Elastic KV
+# Routing & Migration Policy: Design Log
-## Final Result
+This file is the active reference for the routing policy. It supersedes the
 "single argmin + PUSH migration" framing once described as the final design
 (see commit notes below and `REPORT.md` §3.9 errata).
-**Unified routing (baseline mode, no Mooncake)** beats LMetric on E2E mean/p50/p90.
+## Current Algorithm: Hybrid LMetric + High-Cache Affinity
 Pending multi-run significance verification.
-| Metric | LMetric | Unified | Change |
+Implemented in `255c8e6`. Active under `--policy unified` in
-|--------|---------|---------|--------|
+`scripts/cache_aware_proxy.py`.
-| E2E mean | 18.204 | **17.831** | -2.0% |
+
-| E2E p50 | 6.184 | **6.074** | -1.8% |
+```python
-| E2E p90 | 39.438 | **37.073** | -6.0% |
+# Step 1: affinity gate (only for sessions that have a recorded owner)
-| TTFT p90 | 9.331 | **8.034** | -13.9% |
+if session has affinity instance:
    cache_ratio = cache_hit_on_affinity / input_length
    gate_1: cache_ratio > 0.5
    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
    if gate_1 AND gate_2:
        decision = "affinity"
        return affinity_instance
 # Step 2: LMetric fallback with deterministic tie-breaker
 for each instance i:
    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
            = P_tokens * BS                                     # primary
 secondary key: new_uncached_tokens                              # prefer cache
 tertiary key:  num_requests                                     # prefer idle
 quaternary:    round-robin counter % n                          # break ties
 return argmin
 ```
 The pure `--policy lmetric` baseline stays affinity-free; the hybrid lives
 entirely under `--policy unified`. The round-robin counter is required because
 `P_tokens * BS = 0` whenever `BS = 0` for all instances (new sessions, cold
 start), which would otherwise pin every fresh session to instance 0.
 Parameters: `overload_factor=2.0` (default). The previously-introduced
 `decode_iteration_s` / `prefill_throughput` / `rdma_overhead_s` are kept in
 `Settings` but no longer drive routing — they were Approach-A inputs.
 ### Why this shape
 - **LMetric for load balance**: `P_tokens × BS` is hyperparameter-free and
  captures both pending prefill work and current batch contention.
 - **Implicit soft affinity from LMetric itself**: `P_tokens` includes
  `new_uncached_tokens = input - cache_hit`. Later turns naturally prefer
  the instance that already cached the prefix, because their `P_tokens` are
  smaller there. This is the dominant reason explicit migration buys little.
 - **Explicit affinity only for the long-cache case**: when cache_ratio > 0.5,
  the placement cost of breaking sticky is large enough to justify a hard
  gate. Below that ratio, defer to LMetric.
 ## What Was Retired and Why
 | Commit | Approach | Outcome |
 |---|---|---|
 | `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
 | `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
 | `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, **E2E p90 +12%, p99 +24%** |
 | `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, **HEAVY TTFT p90 15.9s → 59.1s** |
 | `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" |
 | `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
 | `4c583f2` | Revert relaxed gate | 134 offloads, **E2E p90 37s → 82s** |
 | `255c8e6` | **Current** hybrid LMetric + high-cache affinity | Stable baseline |
 The shared lesson across the retired variants: PD-sep offload pays
 `C_queue + C_prefill + RDMA + D_schedule + D_decode_start` and the saved
 prefill time on D rarely amortizes this — especially because 92% of HEAVY
 requests are turn-1 cold (no source-side cache to migrate). See
 `analysis/elastic_hypotheses.md` H3-H9 for the per-variant evidence.
 ## Historical Baseline-Mode Comparison (Approach A)
 These numbers are from the additive-cost-model variant of Unified routing
 (before `255c8e6`). Kept for reference; the hybrid currently lives on top of
 LMetric, not on this additive cost. The "Unified" column should not be cited
 as the current implementation.
 | Metric | LMetric | Unified (Approach A, historical) | Change |
 |--------|---------|----------------------------------|--------|
 | E2E mean | 18.204 | 17.831 | -2.0% |
 | E2E p50 | 6.184 | 6.074 | -1.8% |
 | E2E p90 | 39.438 | 37.073 | -6.0% |
 | TTFT p90 | 9.331 | 8.034 | -13.9% |
 | Errors | 0 | 0 | — |
-### Why Unified beats LMetric
+Approach A (additive cost model, historical):
 1. **Session affinity** preserves KV cache across turns → turn 2+ TTFT much lower
 2. **Additive cost model** (`contention + queue + prefill`) avoids LMetric's degenerate
   case when `num_requests = 0` (all instances score 0, tie-break to instance 0)
 3. **`num_requests` as contention signal** better captures GPU batch scheduling
   overhead than `ongoing_tokens`
 ### Why PD-sep offload doesn't help (yet)
 Extensive experimentation with offload/migration showed that PD-sep overhead
 (C queue + prefill + KV transfer + D scheduling) consistently exceeds load
 balance benefit:
 | Experiment | Offloads | E2E p90 | vs Baseline |
 |-----------|----------|---------|-------------|
 | A (old gate, ~5 offloads) | 5 | 39.0 | -25% |
 | A (relaxed gate, ~6 offloads) | 6 | 46.0 | -12% |
 | A+B2 (forced migration) | 57 | 84.2 | +61% |
 | A (relaxed gate v2, both gates removed) | 134 | 81.5 | +56% |
 More offloads → worse performance. The offload mechanism itself is the bottleneck.
 ## Algorithm: Unified Routing
 ```python
 cost(instance_i) = num_requests_i × decode_iteration_s     # contention
                 + pending_prefill_tokens_i / throughput     # prefill queue
                 + max(0, input - cache_hit_i) / throughput  # new prefill
 # Session affinity with two gates:
 if affinity instance exists:
-    gate 1: ongoing_tokens <= avg * overload_factor  (hard gate)
+    gate 1: ongoing_tokens <= avg * overload_factor
-    gate 2: affinity_cost <= global_best * overload_factor  (cost ratio)
+    gate 2: affinity_cost <= global_best * overload_factor
-    if both pass → use affinity instance
+    if both pass → affinity
-    else → use globally best instance
+    else        → global best
 else:
    use globally best instance
 ```
-Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
+Reason for retirement: the additive cost was load-bearing on
-`overload_factor=2.0`.
+`decode_iteration_s` and `prefill_throughput` constants. LMetric reproduces
 the same ordering without those constants because `P_tokens` and `BS` already
 capture both prefill queue and batch contention. The hybrid keeps the cheap
 LMetric core and adds an explicit affinity gate only for high-cache cases.
-## Evolution of Results
+### Evolution Table (historical)
 | Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
 |---------|-------------|-------------|-------------|-------------|
@@ -65,12 +112,34 @@ Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
 | v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
 | v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
 | A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
-| **A (baseline)** | **same routing, no Mooncake** | **8.034** | **37.073** | **—** |
+| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — |
 | **Hybrid (current)** | **LMetric + high-cache affinity** | **see §8 re-run** | **see §8 re-run** | **—** |
-## Rigorous Review Summary
+The current Hybrid row deliberately has no number: per
 `analysis/unified_routing_fix_review.md` #8, the small (≤2%) improvements
 need 3-5 paired multi-run trials before being quoted.
-Independent review found:
+## Open Questions / Next Steps
 - **#8 paired multi-run**: 3-5 fresh trials of LMetric vs current hybrid on
  identical trace + restart procedure, with full artifact bundle
  (`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
  `gpu_util.csv`, per-instance APC snapshot, git commit hash).
 - **#10 production-concurrency track**: re-introduce a controlled-concurrency
  flag (`--max-inflight-sessions` or equivalent) and rerun the comparison at
  64 / 128 active sessions before drawing conclusions for production.
 - **Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the
  affinity instance is overloaded**: in that scenario the LMetric fallback's
  tie-breaker re-elects the same instance because its `new_uncached_tokens`
  is 0. Either add a hard num_requests ceiling or accept the behavior. Open
  in the test `test_hybrid_high_cache_breaks_on_overload` (works only when
  the request has at least one uncached block).
 ## Original "Rigorous Review Summary" (historical)
 Independent review of the Approach A result above:
 - **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement)
 - **CLEAN**: No reward hacking (improvement from algorithmic difference)
 - **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs)
- **NOTE**: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate
+- **NOTE**: Hardcoded constants (`0.05`, `7000`) are hardware-specific but
  legitimate