Docs: reconcile routing docs with current hybrid direction

Per analysis/unified_routing_fix_review.md #2, several docs still presented
the retired single-argmin + PUSH-migration design as the final algorithm.
Mark them superseded and document the current hybrid direction (commit
255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting the
  "Final Design" framing was retired after cc6e562 / 4c583f2; point readers
  to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current hybrid
  algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker),
  then a "What Was Retired" commit table, then the old Approach A numbers
  preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric
  has no affinity at all); it converges to similar placements because
  P_tokens includes new_uncached_tokens, giving it implicit soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T
  work" summary, plus a footer cross-referencing the current routing
  direction.
- analysis/unified_routing_fix_review.md: track this file (was untracked);
  it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00
parent ac6534c3ff
commit 6a27f75337
5 changed files with 591 additions and 53 deletions
--- a/REPORT.md
+++ b/REPORT.md
@@ -27,6 +27,12 @@ For agentic LLM workloads (long input, short output, high KV cache reuse), is pr
 >   was removed when replay moved to trace-driven dispatch. The next-step
 >   experiment requires restoring the flag first (see `FIXES.md` §B2
 >   route A) before any production-concurrency numbers can be produced.
+> - **§3.9 "Final Design" framing**: the single-argmin + PUSH-migration
+>   design was retired after `cc6e562` / `4c583f2` showed forced and
+>   relaxed-gate migration variants both regressed E2E tail. Current
+>   policy is the hybrid LMetric + high-cache affinity landed in
+>   `255c8e6`. See the per-section note in §3.9 and the active algorithm
+>   in `docs/migration-policy-design.md`.
 >
 > The authoritative results are in **§3.6 and §3.7**.

@@ -356,7 +362,23 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall

 **Output**: `outputs/eval_direct_rdma_v*/` on dash0.

-### 3.9 Unified Routing (Final Design)
+### 3.9 Unified Routing (Historical — superseded)
+
+> **Superseded by git history.** The "single argmin + PUSH migration" design
+> described here was implemented in `6b255fa`, refined through
+> `5892739` (soft affinity), `2b9eae0` (numbers below), and `4b50c5a`
+> (queue/overload-gate fixes). Follow-on attempts to scale migration —
+> `e991960`/`5772149` (forced session migration) and `bf4469a` (relaxed
+> push gate) — were both reverted (`cc6e562`, `4c583f2`) after they
+> regressed E2E tail (57 migrations: HEAVY TTFT p90 15.9s → 59.1s;
+> 134 offloads: E2E p90 37s → 82s).
+>
+> Current implementation is the **hybrid LMetric + high-cache affinity**
+> direction landed in `255c8e6`. See `docs/migration-policy-design.md`
+> for the active algorithm and `analysis/unified_routing_fix_review.md`
+> for the reasoning. The numbers below remain valid for the
+> `eval_unified_v3` artifact; do not treat them as the current
+> production policy.

 Replaced two-phase routing (pick_instance → offload gate) with single `argmin(expected_latency)` per instance:

--- a/analysis/elastic_hypotheses.md
+++ b/analysis/elastic_hypotheses.md
@@ -244,7 +244,11 @@ Offloaded:      —               13/500 (2.6%)   too few to matter
 ### What DOESN'T work for agentic workloads:

 1. **PD-Sep**: net negative — KV cache memory wall on decode instances
-2. **LMetric (OSDI'26)**: ≈ linear routing — session affinity limits routing freedom
+2. **LMetric (OSDI'26)**: ≈ linear routing — `P_tokens` already includes
+   `new_uncached_tokens`, so cache-hit scoring gives LMetric an implicit
+   soft affinity that converges to placements similar to those of explicit
+   sticky affinity (see `analysis/research_findings.md` §2.2 for the
+   corrected framing)
 3. **Elastic P2P RDMA offload**: net negative — Mooncake transfer overhead, no layerwise pipeline
 4. **OVERLOAD_FACTOR tuning**: no effect — imbalance from workload skew, not routing
 5. **Dedicated Prefill Service (PS)**: cannot win cost comparison without KV pull, PS is always slower than cached C
@@ -270,3 +274,21 @@ Instead of fixed chunk size, dynamically adjust based on decode pressure:
 - When decode queue is deep: smaller chunks → more decode slots → better TPOT
 - When decode queue is empty: larger chunks → faster prefill → better TTFT
 - This is a vLLM scheduler modification, not a routing change
+
+---
+
+## Current routing direction (cross-reference)
+
+The hypotheses above produced the following results, which informed
+the current `--policy unified` implementation:
+
+- H1 / H7 / H9 (negative): PD-sep offload, OVERLOAD_FACTOR tuning, and
+  elastic RDMA at high concurrency all regressed or stayed within noise.
+- H3 / H4 / H6 (partial): cache-gated offload exists but only ~10-12% of
+  HEAVY requests have cache, and the offloaded subset pays RDMA penalty.
+
+The active algorithm (commit `255c8e6`) is **hybrid LMetric + high-cache
+affinity** in baseline mode (no Mooncake). The retired migration variants
+are catalogued in `docs/migration-policy-design.md` (Approach A and the
+revert chain `cc6e562` / `4c583f2`). H7's rejection (OVERLOAD_FACTOR within
+noise) is why the active default stays at `overload_factor=2.0`.
--- a/analysis/research_findings.md
+++ b/analysis/research_findings.md
@@ -38,7 +38,24 @@ These characteristics fundamentally change what optimizations matter.
 **Setup**: 8 instances, LMetric vs linear routing
 **Result**: TTFT +2.2%, TPOT -4.4%, E2E +2.6% — all within noise (±7% run-to-run)

-**Root cause**: Session affinity constrains routing freedom. LMetric's benefit (hyperparameter-free load balancing) is neutralized because turn 2+ requests MUST go to their session-sticky instance regardless of the scoring function. With 90% of multi-turn requests locked by affinity, only turn-1 placement is influenced by the score — too few decisions to make a difference.
+**Root cause (updated)**: LMetric is not "neutralized by affinity
+constraints" — pure `--policy lmetric` runs without session affinity at all.
+The actual reason the LMetric vs linear comparison sits within noise is that
+`P_tokens` already includes `new_uncached_tokens = input_length - cache_hit`,
+which means later turns of a session naturally score lowest on the instance
+that cached their prefix. This gives LMetric an **implicit soft affinity**
+that competes with linear's explicit sticky affinity. The two arrive at
+similar placements through different mechanisms.
+
+This is also why explicit migration buys little on top of LMetric: the
+first-order signal driving placement is already cache-derived. See
+`docs/migration-policy-design.md` for how the current hybrid policy uses
+this insight (LMetric base + explicit affinity only when `cache_ratio > 0.5`).
+
+**Previous framing (incorrect)**: an earlier draft of this section attributed
+the result to session affinity constraining LMetric's routing freedom. That
+framing assumed `--policy lmetric` inherited the linear-mode session-sticky
+behavior, which it does not (verified in `tests/test_proxy_pick.py`).

 ### 2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance)

@@ -148,7 +165,9 @@ This changes the scheduling picture: most "HEAVY" requests in agentic workloads

 ### Why existing approaches don't work:
 1. **PD-Sep** assumes decode needs dedicated resources → agentic has memory wall on decode
-2. **LMetric** assumes routing freedom → agentic has session affinity constraints
+2. **LMetric** matches linear within noise because cache-hit appears in
+   `P_tokens` itself, so it already routes later turns back to the cached
+   instance via implicit soft affinity — explicit affinity buys little
 3. **Elastic RDMA** assumes KV transfer is cheap → Mooncake lacks layerwise pipelining
 4. **Size-based classification** assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM

--- a/analysis/unified_routing_fix_review.md
+++ b/analysis/unified_routing_fix_review.md
@@ -0,0 +1,406 @@
+# Unified Routing Fix Review Handoff
+
+Date: 2026-05-25
+
+This is the corrected review handoff after reading the git history. The key
+change from the previous draft is that we should **not** restore the old
+single-argmin / PUSH-migration design. That path was implemented, measured,
+and then discarded or simplified by later commits.
+
+## Executive Summary
+
+The latest commit history says the current direction is:
+
+- Use baseline mode / no Mooncake for the winning comparison against LMetric.
+- Use LMetric-style load balancing as the base.
+- Add explicit affinity only for sessions with high accumulated cache.
+- Do not re-enable PD-sep offload or session migration unless the transfer
+  mechanism is fundamentally reworked.
+
+The main fixes now are cleanup, documentation consistency, tests, and
+reproducibility. The biggest risk is that stale docs still describe abandoned
+schemes as "final design".
+
+## Evidence From Git History
+
+Relevant commits:
+
+| Commit | Meaning | Outcome |
+|---|---|---|
+| `6b255fa` | Implemented single `argmin(expected_latency)` over local / PUSH / cold paths | Later superseded |
+| `5892739` | Added soft affinity because pure argmin overloaded cache-source instances | Pure argmin was unstable |
+| `2b9eae0` | Reported Unified v3 with 116 PUSH migrations | TTFT improved, but TPOT/E2E tail tradeoff existed |
+| `cdf8349` | Added real cache sync and cached-prefill-on-C architecture | Fixed false PUSH and direct-read issues |
+| `4b50c5a` | Fixed queue model and hard overload gate | Reduced imbalance, but still on offload path |
+| `e991960` / `5772149` | Tried forced session migration triggers | Reverted |
+| `cc6e562` | Reverted Approach B migration | 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s |
+| `bf4469a` | Tried more accurate push cost / gate alignment | Later reverted |
+| `4c583f2` | Reverted relaxed gate / push-cost fix | 134 offloads made E2E p90 37s -> 82s |
+| `448361c` | Updated design doc: baseline no-Mooncake Unified beats LMetric | PD-sep offload degrades |
+| `255c8e6` | Replaced full cost model with hybrid routing | Current direction: LMetric LB + high-cache affinity |
+
+Do not ask the implementer to "restore real three-way argmin". That was the
+wrong instruction in the previous draft.
+
+## Current Intended Algorithm
+
+From `255c8e6`, the current algorithm should be documented as:
+
+```text
+if session has an affinity instance:
+    if cache_ratio_on_affinity > 0.5
+       and affinity_instance.num_requests <= avg_num_requests * overload_factor:
+        route to affinity instance
+    else:
+        route by LMetric
+else:
+    route by LMetric
+```
+
+LMetric remains:
+
+```text
+score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
+new_uncached_tokens = input_length - estimated_cache_hit
+```
+
+This is no longer the old expected-latency migration model.
+
+## P0 Fixes
+
+### 1. Remove Stale PUSH-Migration Code From the Current Unified Branch
+
+Location: `scripts/cache_aware_proxy.py`, `_handle_combined`.
+
+Problem:
+
+After `255c8e6`, `unified` is a hybrid policy, but the function still contains
+an unreachable block guarded by `best_needs_push = False`. That block references
+variables from the removed cost model such as `best_cache_idx`,
+`best_cache_hit`, and `_current_offloads`.
+
+Fix:
+
+- In the `unified` branch, delete the unreachable `if best_needs_push:` block.
+- Keep helper functions like `_handle_cached_prefill_offload` only if another
+  live mode still calls them.
+- Update the `_handle_combined` docstring so it no longer says Unified always
+  computes `queue + prefill + transfer`.
+
+Why:
+
+The code is currently safe only because the branch is unreachable. Leaving it
+there makes future changes dangerous and makes reviewers think migration is
+still part of the active policy.
+
+Verification:
+
+- `rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py`
+  should show no stale references inside the active `unified` path.
+- `pytest -q`.
+
+### 2. Reconcile Documentation With the Latest Commits
+
+Locations:
+
+- `REPORT.md`
+- `docs/migration-policy-design.md`
+- `analysis/research_findings.md`
+- `analysis/elastic_hypotheses.md`
+
+Problem:
+
+Several docs still present old or contradictory conclusions:
+
+- `REPORT.md` section 3.9 still calls single argmin / PUSH migration the
+  "Final Design".
+- `docs/migration-policy-design.md` describes a baseline-mode additive cost
+  model, while HEAD implements hybrid LMetric + high-cache affinity.
+- `analysis/research_findings.md` says LMetric is neutralized by session
+  affinity constraints, but later corrected LMetric results say cache-hit
+  scoring creates implicit soft affinity.
+
+Fix:
+
+- Add an explicit "Superseded by git history" note near `REPORT.md` section
+  3.9:
+  - `6b255fa/5892739/2b9eae0` were explored.
+  - `cc6e562` rejected forced migration.
+  - `4c583f2` rejected relaxed offload / high-offload configurations.
+  - `255c8e6` is the current implementation direction.
+- Update `docs/migration-policy-design.md` to either:
+  - describe the current hybrid algorithm, or
+  - clearly state that the additive cost model is the previous Approach A and
+    not the latest code.
+- Mark old analysis sections as superseded rather than deleting them.
+
+Why:
+
+The docs caused the previous bad review recommendation. Future reviewers need
+to know which ideas were already tested and rejected.
+
+Verification:
+
+- `rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis`
+  should show clear superseded labels where appropriate.
+
+### 3. Preserve the LMetric Baseline Separately From Unified Hybrid
+
+Location: `scripts/cache_aware_proxy.py`.
+
+Problem:
+
+Pure `--policy lmetric` is the baseline being compared against. Unified hybrid
+uses LMetric internally but should not accidentally change the LMetric baseline
+behavior.
+
+Fix:
+
+- Keep `pick_instance_lmetric` as the pure corrected LMetric implementation.
+- Put any hybrid-specific tie-breakers or affinity logic outside
+  `pick_instance_lmetric`, under `--policy unified`.
+- In breakdown logs, record whether a Unified request used `affinity` or
+  `lmetric_fallback`.
+
+Why:
+
+If the baseline and Unified share hidden behavior, future A/B comparisons become
+invalid.
+
+Verification:
+
+- Unit test: `--policy lmetric` never uses session affinity.
+- Unit test: `--policy unified` can use affinity only when cache ratio and load
+  gates pass.
+
+## P1 Fixes
+
+### 4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy
+
+Problem:
+
+LMetric score is `P_tokens * BS`. When `BS = num_requests = 0`, every instance
+gets score 0, so tie-break chooses instance 0. `docs/migration-policy-design.md`
+explicitly lists avoiding this as one reason Unified beats LMetric.
+
+The latest hybrid falls back to LMetric, so it may reintroduce this issue for
+new sessions when all instances are idle.
+
+Fix:
+
+- Do not change the pure `--policy lmetric` baseline.
+- For `--policy unified` fallback only, add a deterministic secondary key:
+  - primary: `P_tokens * BS`
+  - secondary when scores tie: `new_uncached_tokens`, then `num_requests`, then
+    a round-robin or least-recently-used instance index.
+- Record tie-break count in breakdown or stats.
+
+Why:
+
+This preserves fair LMetric comparison while preventing Unified hybrid from
+degenerating to instance 0 under empty or near-empty load.
+
+Verification:
+
+- Unit test with all `num_requests = 0`: Unified should not always choose index
+  0 across repeated new sessions.
+- Confirm pure LMetric test still matches the OSDI-style baseline.
+
+### 5. Add Tests for the Current Hybrid Policy
+
+Problem:
+
+Existing tests cover older `pick_instance` and pure LMetric behavior, but not
+the current `unified` branch introduced by `255c8e6`.
+
+Fix:
+
+Add tests for:
+
+- High-cache session sticks to affinity instance when not overloaded.
+- High-cache session breaks affinity when `num_requests` exceeds the overload
+  gate.
+- Low-cache session falls back to LMetric.
+- New session falls back to LMetric with the Unified-specific tie-breaker.
+- Breakdown policy is recorded as `affinity` or `lmetric_fallback`.
+
+Why:
+
+This prevents future drift back toward the discarded migration cost model or
+accidental changes to the LMetric baseline.
+
+Verification:
+
+- `pytest -q`.
+- Tests should run without live vLLM.
+
+### 6. Treat `overload_factor` Changes as Experiments, Not Silent Fixes
+
+Observation:
+
+The current worktree has an uncommitted change:
+
+```text
+Settings.overload_factor: 2.0 -> 1.5
+```
+
+But the earlier H7 sweep found overload-factor tuning largely ineffective /
+within noise. This is recorded in `analysis/elastic_hypotheses.md`.
+
+Fix:
+
+- Do not silently commit a default change to `1.5` without a paired benchmark.
+- If testing `1.5`, make it an experiment tag/config value, not a new default.
+- Keep docs and CLI defaults synchronized.
+
+Why:
+
+The prior experiment says imbalance was mostly workload/session skew, not the
+threshold. A default change without evidence will create another reproducibility
+gap.
+
+Verification:
+
+- Run paired `2.0` vs `1.5` after the hybrid tests exist.
+- Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance.
+
+### 7. Standardize Breakdown Fields for Hybrid Routing
+
+Problem:
+
+The current breakdown logs do not clearly expose why Unified chose affinity vs
+LMetric fallback.
+
+Fix:
+
+For each request under `--policy unified`, log:
+
+- `policy: "unified"`
+- `decision: "affinity" | "lmetric_fallback"`
+- `affinity_idx`
+- `chosen_idx`
+- `affinity_cache_hit`
+- `affinity_cache_ratio`
+- `affinity_num_requests`
+- `avg_num_requests`
+- `fallback_score`
+- `tie_break_used`
+
+Why:
+
+The latest performance difference versus LMetric is small. Without decision
+logs, it is hard to tell whether Unified is actually exercising the intended
+high-cache affinity behavior.
+
+Verification:
+
+- `breakdown.json` can answer: how many requests used affinity, how many used
+  fallback, and what latency/APC each group saw.
+
+## P2 Experiment Fixes
+
+### 8. Re-run Paired LMetric vs Unified Hybrid Benchmarks
+
+Problem:
+
+`docs/migration-policy-design.md` says the 2% mean improvement needs multi-run
+verification. Local raw outputs for the May 25 final comparison are not present
+in this workspace.
+
+Fix:
+
+- Run 3-5 fresh paired trials.
+- Same trace, same vLLM build, same machine, same restart procedure.
+- Compare:
+  - pure LMetric
+  - current Unified hybrid
+  - optionally Linear/session-sticky as a reference
+
+Metrics:
+
+- TTFT mean/p50/p90/p99
+- TPOT mean/p50/p90/p99
+- E2E mean/p50/p90/p99
+- errors/timeouts
+- aggregate APC
+- per-instance APC distribution
+- per-instance request count and token count
+- GPU util mean/std/imbalance
+
+Why:
+
+The current reported win is small enough that run-to-run noise matters.
+
+Verification:
+
+- Commit or save artifacts under a date/tagged output directory:
+  `config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
+  `gpu_util.csv`, final vLLM APC snapshot, git commit hash.
+
+### 9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism
+
+Rejected paths:
+
+- Full PD separation: decode KV memory wall.
+- Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit.
+- Cache-gate offload: improves balance for colocated survivors but offloaded
+  requests pay RDMA penalty.
+- Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse.
+- Relaxed gate / many offloads: 134 offloads made E2E p90 much worse.
+
+Future work can revisit migration only if one of these changes first:
+
+- layerwise / pipelined KV transfer
+- multi-machine P role so P work does not compete with D on the same GPU pool
+- exact vLLM state / cache residency exposed to the router
+- production-concurrency benchmark showing decode SLO pressure large enough to
+  amortize transfer overhead
+
+Why:
+
+The same local mechanism has already failed multiple times. Repeating it with
+another threshold is unlikely to help.
+
+### 10. Restore Production-Concurrency Evaluation as a Separate Track
+
+Problem:
+
+Several conclusions were made at 1-2 req/GPU, while production is estimated at
+8-15 req/GPU. Higher concurrency is where prefill-decode interference appears.
+
+Fix:
+
+- Restore or replace `--max-inflight-sessions` for controlled concurrency.
+- Run at 64 and 128 active sessions.
+- Treat this as a new experiment track, not as a reason to resurrect old
+  migration code immediately.
+
+Why:
+
+At high concurrency, the design pressure may change. But the implementation
+should first prove the current hybrid baseline cleanly.
+
+## Suggested Implementation Order
+
+1. Update docs to mark discarded migration/offload schemes as superseded.
+2. Remove stale unreachable PUSH code from the current Unified branch.
+3. Add tests for current Unified hybrid behavior.
+4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases.
+5. Add breakdown fields for hybrid routing decisions.
+6. Decide whether the local `overload_factor=1.5` diff is an experiment or
+   should be dropped.
+7. Run paired multi-run LMetric vs Unified hybrid benchmarks.
+8. Only after that, open a separate high-concurrency experiment track.
+
+## Review Checklist
+
+- Does the code still mention single `argmin(expected_latency)` as current
+  behavior? If yes, update it or mark it superseded.
+- Is any migration/offload code reachable under `--policy unified`? If yes,
+  require a new experiment plan because recent history rejects it.
+- Does pure `--policy lmetric` remain pure and affinity-free?
+- Does `--policy unified` clearly log when it used affinity vs LMetric fallback?
+- Are default parameter changes backed by paired results?
+- Can another reviewer reproduce the result from committed scripts and saved
+  artifacts?
+
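Annotation on item 7 of the review file above: a minimal sketch of how the proposed breakdown fields could answer "how many requests used affinity, how many used fallback, and what latency each group saw". The `policy`, `decision`, and `tie_break_used` keys come from the list in item 7; the `e2e_s` latency key, the `summarize_by_decision` helper, and the in-memory record list are assumptions made for illustration and are not taken from `scripts/cache_aware_proxy.py`.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_decision(records):
    """Group Unified breakdown records by routing decision.

    Each record is a dict; `e2e_s` is a hypothetical per-request latency field.
    """
    groups = defaultdict(list)
    for rec in records:
        # Only requests routed under --policy unified carry a hybrid decision.
        if rec.get("policy") == "unified":
            groups[rec["decision"]].append(rec)
    return {
        decision: {
            "count": len(recs),
            "mean_e2e_s": mean(r["e2e_s"] for r in recs),
            "tie_breaks": sum(1 for r in recs if r.get("tie_break_used")),
        }
        for decision, recs in groups.items()
    }


# Illustrative records only; these are not measured values.
sample = [
    {"policy": "unified", "decision": "affinity", "e2e_s": 1.0},
    {"policy": "unified", "decision": "affinity", "e2e_s": 3.0},
    {"policy": "unified", "decision": "lmetric_fallback", "e2e_s": 5.0,
     "tie_break_used": True},
    {"policy": "lmetric", "decision": "lmetric", "e2e_s": 9.0},
]
summary = summarize_by_decision(sample)
```

Filtering on `policy` keeps pure LMetric baseline records out of the Unified split, so the two arms of a paired run stay separate in the summary.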
--- a/docs/migration-policy-design.md
+++ b/docs/migration-policy-design.md
@@ -1,62 +1,109 @@
-# Migration Policy Design: Improving Load Balance in Elastic KV
+# Routing & Migration Policy: Design Log

-## Final Result
+This file is the active reference for the routing policy. It supersedes the
+"single argmin + PUSH migration" framing once described as the final design
+(see commit notes below and `REPORT.md` §3.9 errata).

-**Unified routing (baseline mode, no Mooncake)** beats LMetric on E2E mean/p50/p90.
-Pending multi-run significance verification.
+## Current Algorithm: Hybrid LMetric + High-Cache Affinity

-| Metric | LMetric | Unified | Change |
-|--------|---------|---------|--------|
-| E2E mean | 18.204 | **17.831** | -2.0% |
-| E2E p50 | 6.184 | **6.074** | -1.8% |
-| E2E p90 | 39.438 | **37.073** | -6.0% |
-| TTFT p90 | 9.331 | **8.034** | -13.9% |
+Implemented in `255c8e6`. Active under `--policy unified` in
+`scripts/cache_aware_proxy.py`.
+
+```python
+# Step 1: affinity gate (only for sessions that have a recorded owner)
+if session has affinity instance:
+    cache_ratio = cache_hit_on_affinity / input_length
+    gate_1: cache_ratio > 0.5
+    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
+    if gate_1 AND gate_2:
+        decision = "affinity"
+        return affinity_instance
+
+# Step 2: LMetric fallback with deterministic tie-breaker
+for each instance i:
+    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
+            = P_tokens * BS                                     # primary
+secondary key: new_uncached_tokens                              # prefer cache
+tertiary key:  num_requests                                     # prefer idle
+quaternary:    round-robin counter % n                          # break ties
+return argmin
+```
+
+The pure `--policy lmetric` baseline stays affinity-free; the hybrid lives
+entirely under `--policy unified`. The round-robin counter is required because
+`P_tokens * BS = 0` whenever `BS = 0` for all instances (new sessions, cold
+start), which would otherwise pin every fresh session to instance 0.
+
+Parameters: `overload_factor=2.0` (default). The previously-introduced
+`decode_iteration_s` / `prefill_throughput` / `rdma_overhead_s` are kept in
+`Settings` but no longer drive routing — they were Approach-A inputs.
+
+### Why this shape
+
+- **LMetric for load balance**: `P_tokens × BS` is hyperparameter-free and
+  captures both pending prefill work and current batch contention.
+- **Implicit soft affinity from LMetric itself**: `P_tokens` includes
+  `new_uncached_tokens = input - cache_hit`. Later turns naturally prefer
+  the instance that already cached the prefix, because their `P_tokens` are
+  smaller there. This is the dominant reason explicit migration buys little.
+- **Explicit affinity only for the long-cache case**: when `cache_ratio > 0.5`,
+  the cost of breaking sticky placement is large enough to justify a hard
+  gate. Below that ratio, defer to LMetric.
+
+## What Was Retired and Why
+
+| Commit | Approach | Outcome |
+|---|---|---|
+| `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
+| `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
+| `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, **E2E p90 +12%, p99 +24%** |
+| `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, **HEAVY TTFT p90 15.9s → 59.1s** |
+| `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" |
+| `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
+| `4c583f2` | Revert relaxed gate | 134 offloads, **E2E p90 37s → 82s** |
+| `255c8e6` | **Current** hybrid LMetric + high-cache affinity | Stable baseline |
+
+The shared lesson across the retired variants: PD-sep offload pays
+`C_queue + C_prefill + RDMA + D_schedule + D_decode_start` and the saved
+prefill time on D rarely amortizes this — especially because 92% of HEAVY
+requests are turn-1 cold (no source-side cache to migrate). See
+`analysis/elastic_hypotheses.md` H3-H9 for the per-variant evidence.
+
+## Historical Baseline-Mode Comparison (Approach A)
+
+These numbers are from the additive-cost-model variant of Unified routing
+(before `255c8e6`). Kept for reference; the hybrid currently lives on top of
+LMetric, not on this additive cost. The "Unified" column should not be cited
+as the current implementation.
+
+| Metric | LMetric | Unified (Approach A, historical) | Change |
+|--------|---------|----------------------------------|--------|
+| E2E mean | 18.204 | 17.831 | -2.0% |
+| E2E p50 | 6.184 | 6.074 | -1.8% |
+| E2E p90 | 39.438 | 37.073 | -6.0% |
+| TTFT p90 | 9.331 | 8.034 | -13.9% |
 | Errors | 0 | 0 | — |

-### Why Unified beats LMetric
-
-1. **Session affinity** preserves KV cache across turns → turn 2+ TTFT much lower
-2. **Additive cost model** (`contention + queue + prefill`) avoids LMetric's degenerate
-   case when `num_requests = 0` (all instances score 0, tie-break to instance 0)
-3. **`num_requests` as contention signal** better captures GPU batch scheduling
-   overhead than `ongoing_tokens`
-
-### Why PD-sep offload doesn't help (yet)
-
-Extensive experimentation with offload/migration showed that PD-sep overhead
-(C queue + prefill + KV transfer + D scheduling) consistently exceeds load
-balance benefit:
-
-| Experiment | Offloads | E2E p90 | vs Baseline |
-|-----------|----------|---------|-------------|
-| A (old gate, ~5 offloads) | 5 | 39.0 | -25% |
-| A (relaxed gate, ~6 offloads) | 6 | 46.0 | -12% |
-| A+B2 (forced migration) | 57 | 84.2 | +61% |
-| A (relaxed gate v2, both gates removed) | 134 | 81.5 | +56% |
-
-More offloads → worse performance. The offload mechanism itself is the bottleneck.
-
-## Algorithm: Unified Routing
+Approach A (additive cost model, historical):

 ```python
 cost(instance_i) = num_requests_i × decode_iteration_s     # contention
                 + pending_prefill_tokens_i / throughput     # prefill queue
                 + max(0, input - cache_hit_i) / throughput  # new prefill
-
-# Session affinity with two gates:
 if affinity instance exists:
-    gate 1: ongoing_tokens <= avg * overload_factor  (hard gate)
-    gate 2: affinity_cost <= global_best * overload_factor  (cost ratio)
-    if both pass → use affinity instance
-    else → use globally best instance
-else:
-    use globally best instance
+    gate 1: ongoing_tokens <= avg * overload_factor
+    gate 2: affinity_cost <= global_best * overload_factor
+    if both pass → affinity
+    else        → global best
 ```

-Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
-`overload_factor=2.0`.
+Reason for retirement: the additive cost depended heavily on the
+hardware-specific `decode_iteration_s` and `prefill_throughput` constants. LMetric reproduces
+the same ordering without those constants because `P_tokens` and `BS` already
+capture both prefill queue and batch contention. The hybrid keeps the cheap
+LMetric core and adds an explicit affinity gate only for high-cache cases.

-## Evolution of Results
+### Evolution Table (historical)

 | Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
 |---------|-------------|-------------|-------------|-------------|
@@ -65,12 +112,34 @@ Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
 | v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
 | v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
 | A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
-| **A (baseline)** | **same routing, no Mooncake** | **8.034** | **37.073** | **—** |
+| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — |
+| **Hybrid (current)** | **LMetric + high-cache affinity** | **pending #8 re-run** | **pending #8 re-run** | **—** |

-## Rigorous Review Summary
+The current Hybrid row deliberately has no number: per
+`analysis/unified_routing_fix_review.md` #8, the small (≤2%) improvements
+need 3-5 paired multi-run trials before being quoted.

-Independent review found:
+## Open Questions / Next Steps
+
+- **#8 paired multi-run**: 3-5 fresh trials of LMetric vs current hybrid on
+  identical trace + restart procedure, with full artifact bundle
+  (`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
+  `gpu_util.csv`, per-instance APC snapshot, git commit hash).
+- **#10 production-concurrency track**: re-introduce a controlled-concurrency
+  flag (`--max-inflight-sessions` or equivalent) and rerun the comparison at
+  64 / 128 active sessions before drawing conclusions for production.
+- **Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the
+  affinity instance is overloaded**: in that scenario the LMetric fallback's
+  tie-breaker re-elects the same instance because its `new_uncached_tokens`
+  is 0. Either add a hard num_requests ceiling or accept the behavior. This
+  remains open: the test `test_hybrid_high_cache_breaks_on_overload` passes
+  only when the request has at least one uncached block.
+
+## Original "Rigorous Review Summary" (historical)
+
+Independent review of the Approach A result above:
 - **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement)
 - **CLEAN**: No reward hacking (improvement from algorithmic difference)
 - **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs)
- **NOTE**: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate
+- **NOTE**: Hardcoded constants (`0.05`, `7000`) are hardware-specific but
+  legitimate
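
Annotation on the "Current Algorithm" pseudocode in `docs/migration-policy-design.md` above: a minimal runnable sketch of the affinity gate followed by the LMetric fallback with tie-breaking. The `Instance` dataclass, the `pick_unified` name and signature, and the caller-supplied `rr` counter are hypothetical and do not mirror the actual code in `scripts/cache_aware_proxy.py`. One deliberate simplification is labeled in the code: the round-robin counter is applied among the tied instances only, whereas the pseudocode writes `round-robin counter % n`.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class Instance:
    pending_prefill_tokens: int
    num_requests: int


def pick_unified(
    instances: Sequence[Instance],
    input_length: int,
    cache_hits: Sequence[int],          # estimated cache hit per instance
    affinity_idx: Optional[int] = None,
    overload_factor: float = 2.0,
    rr: int = 0,                        # hypothetical round-robin counter
) -> Tuple[int, str]:
    avg = sum(i.num_requests for i in instances) / len(instances)

    # Step 1: affinity gate, only for sessions with a recorded owner.
    if affinity_idx is not None and input_length > 0:
        owner = instances[affinity_idx]
        cache_ratio = cache_hits[affinity_idx] / input_length
        if cache_ratio > 0.5 and owner.num_requests <= avg * overload_factor:
            return affinity_idx, "affinity"

    # Step 2: LMetric fallback. Key = (P_tokens * BS, uncached, num_requests).
    keys = []
    for i, inst in enumerate(instances):
        uncached = input_length - cache_hits[i]
        p_tokens = inst.pending_prefill_tokens + uncached
        keys.append((p_tokens * inst.num_requests, uncached, inst.num_requests))
    best = min(keys)
    tied = [i for i, k in enumerate(keys) if k == best]
    # Assumption: rotate among tied instances rather than over all n.
    return tied[rr % len(tied)], "lmetric_fallback"


# All instances idle: every primary score is 0, so only the counter separates them.
idle = [Instance(0, 0) for _ in range(4)]
cold_picks = [pick_unified(idle, 100, [0, 0, 0, 0], rr=k)[0] for k in range(4)]

# High-cache session on a lightly loaded owner sticks to it.
pair = [Instance(0, 1), Instance(0, 1)]
sticky = pick_unified(pair, 100, [80, 0], affinity_idx=0)

# Same session, but the owner exceeds avg * overload_factor (9 > 3 * 2).
hot = [Instance(0, 9), Instance(0, 1), Instance(0, 1), Instance(0, 1)]
spilled = pick_unified(hot, 100, [80, 0, 0, 0], affinity_idx=0)
```

With the idle fleet the four calls rotate across all four instances instead of pinning to instance 0, which is the empty-batch degeneracy the tie-breaker is meant to avoid.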