Docs: reconcile routing docs with current hybrid direction

Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit 255c8e6). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after cc6e562 / 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00
parent ac6534c3ff
commit 6a27f75337
5 changed files with 591 additions and 53 deletions
--- a/analysis/unified_routing_fix_review.md
+++ b/analysis/unified_routing_fix_review.md
@@ -0,0 +1,406 @@
+# Unified Routing Fix Review Handoff
+
+Date: 2026-05-25
+
+This is the corrected review handoff after reading the git history. The key
+change from the previous draft is that we should **not** restore the old
+single-argmin / PUSH-migration design. That path was implemented, measured,
+and then discarded or simplified by later commits.
+
+## Executive Summary
+
+The latest commit history says the current direction is:
+
+- Use baseline mode / no Mooncake for the winning comparison against LMetric.
+- Use LMetric-style load balancing as the base.
+- Add explicit affinity only for sessions with high accumulated cache.
+- Do not re-enable PD-sep offload or session migration unless the transfer
+  mechanism is fundamentally reworked.
+
+The main fixes now are cleanup, documentation consistency, tests, and
+reproducibility. The biggest risk is that stale docs still describe abandoned
+schemes as "final design".
+
+## Evidence From Git History
+
+Relevant commits:
+
+| Commit | Meaning | Outcome |
+|---|---|---|
+| `6b255fa` | Implemented single `argmin(expected_latency)` over local / PUSH / cold paths | Later superseded |
+| `5892739` | Added soft affinity because pure argmin overloaded cache-source instances | Pure argmin was unstable |
+| `2b9eae0` | Reported Unified v3 with 116 PUSH migrations | TTFT improved, but TPOT/E2E tail tradeoff existed |
+| `cdf8349` | Added real cache sync and cached-prefill-on-C architecture | Fixed false PUSH and direct-read issues |
+| `4b50c5a` | Fixed queue model and hard overload gate | Reduced imbalance, but still on offload path |
+| `e991960` / `5772149` | Tried forced session migration triggers | Reverted |
+| `cc6e562` | Reverted Approach B migration | 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s |
+| `bf4469a` | Tried more accurate push cost / gate alignment | Later reverted |
+| `4c583f2` | Reverted relaxed gate / push-cost fix | 134 offloads made E2E p90 37s -> 82s |
+| `448361c` | Updated design doc: baseline no-Mooncake Unified beats LMetric | PD-sep offload degrades |
+| `255c8e6` | Replaced full cost model with hybrid routing | Current direction: LMetric LB + high-cache affinity |
+
+Do not ask the implementer to "restore real three-way argmin". That was the
+wrong instruction in the previous draft.
+
+## Current Intended Algorithm
+
+From `255c8e6`, the current algorithm should be documented as:
+
+```text
+if session has an affinity instance:
+    if cache_ratio_on_affinity > 0.5
+       and affinity_instance.num_requests <= avg_num_requests * overload_factor:
+        route to affinity instance
+    else:
+        route by LMetric
+else:
+    route by LMetric
+```
+
+LMetric remains:
+
+```text
+score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
+new_uncached_tokens = input_length - estimated_cache_hit
+```
+
+This is no longer the old expected-latency migration model.
+
+## P0 Fixes
+
+### 1. Remove Stale PUSH-Migration Code From the Current Unified Branch
+
+Location: `scripts/cache_aware_proxy.py`, `_handle_combined`.
+
+Problem:
+
+After `255c8e6`, `unified` is a hybrid policy, but the function still contains
+an unreachable block guarded by `best_needs_push = False`. That block references
+variables from the removed cost model such as `best_cache_idx`,
+`best_cache_hit`, and `_current_offloads`.
+
+Fix:
+
+- In the `unified` branch, delete the unreachable `if best_needs_push:` block.
+- Keep helper functions like `_handle_cached_prefill_offload` only if another
+  live mode still calls them.
+- Update the `_handle_combined` docstring so it no longer says Unified always
+  computes `queue + prefill + transfer`.
+
+Why:
+
+The code is currently safe only because the branch is unreachable. Leaving it
+there makes future changes dangerous and makes reviewers think migration is
+still part of the active policy.
+
+Verification:
+
+- `rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py`
+  should show no stale references inside the active `unified` path.
+- `pytest -q`.
+
+### 2. Reconcile Documentation With the Latest Commits
+
+Locations:
+
+- `REPORT.md`
+- `docs/migration-policy-design.md`
+- `analysis/research_findings.md`
+- `analysis/elastic_hypotheses.md`
+
+Problem:
+
+Several docs still present old or contradictory conclusions:
+
+- `REPORT.md` section 3.9 still calls single argmin / PUSH migration the
+  "Final Design".
+- `docs/migration-policy-design.md` describes a baseline-mode additive cost
+  model, while HEAD implements hybrid LMetric + high-cache affinity.
+- `analysis/research_findings.md` says LMetric is neutralized by session
+  affinity constraints, but later corrected LMetric results say cache-hit
+  scoring creates implicit soft affinity.
+
+Fix:
+
+- Add an explicit "Superseded by git history" note near `REPORT.md` section
+  3.9:
+  - `6b255fa/5892739/2b9eae0` were explored.
+  - `cc6e562` rejected forced migration.
+  - `4c583f2` rejected relaxed offload / high-offload configurations.
+  - `255c8e6` is the current implementation direction.
+- Update `docs/migration-policy-design.md` to either:
+  - describe the current hybrid algorithm, or
+  - clearly state that the additive cost model is the previous Approach A and
+    not the latest code.
+- Mark old analysis sections as superseded rather than deleting them.
+
+Why:
+
+The docs caused the previous bad review recommendation. Future reviewers need
+to know which ideas were already tested and rejected.
+
+Verification:
+
+- `rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis`
+  should show clear superseded labels where appropriate.
+
+### 3. Preserve the LMetric Baseline Separately From Unified Hybrid
+
+Location: `scripts/cache_aware_proxy.py`.
+
+Problem:
+
+Pure `--policy lmetric` is the baseline being compared against. Unified hybrid
+uses LMetric internally but should not accidentally change the LMetric baseline
+behavior.
+
+Fix:
+
+- Keep `pick_instance_lmetric` as the pure corrected LMetric implementation.
+- Put any hybrid-specific tie-breakers or affinity logic outside
+  `pick_instance_lmetric`, under `--policy unified`.
+- In breakdown logs, record whether a Unified request used `affinity` or
+  `lmetric_fallback`.
+
+Why:
+
+If the baseline and Unified share hidden behavior, future A/B comparisons become
+invalid.
+
+Verification:
+
+- Unit test: `--policy lmetric` never uses session affinity.
+- Unit test: `--policy unified` can use affinity only when cache ratio and load
+  gates pass.
+
+## P1 Fixes
+
+### 4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy
+
+Problem:
+
+LMetric score is `P_tokens * BS`. When `BS = num_requests = 0`, every instance
+gets score 0, so tie-break chooses instance 0. `docs/migration-policy-design.md`
+explicitly lists avoiding this as one reason Unified beats LMetric.
+
+The latest hybrid falls back to LMetric, so it may reintroduce this issue for
+new sessions when all instances are idle.
+
+Fix:
+
+- Do not change the pure `--policy lmetric` baseline.
+- For `--policy unified` fallback only, add a deterministic secondary key:
+  - primary: `P_tokens * BS`
+  - secondary when scores tie: `new_uncached_tokens`, then `num_requests`, then
+    a round-robin or least-recently-used instance index.
+- Record tie-break count in breakdown or stats.
+
+Why:
+
+This preserves fair LMetric comparison while preventing Unified hybrid from
+degenerating to instance 0 under empty or near-empty load.
+
+Verification:
+
+- Unit test with all `num_requests = 0`: Unified should not always choose index
+  0 across repeated new sessions.
+- Confirm pure LMetric test still matches the OSDI-style baseline.
+
+### 5. Add Tests for the Current Hybrid Policy
+
+Problem:
+
+Existing tests cover older `pick_instance` and pure LMetric behavior, but not
+the current `unified` branch introduced by `255c8e6`.
+
+Fix:
+
+Add tests for:
+
+- High-cache session sticks to affinity instance when not overloaded.
+- High-cache session breaks affinity when `num_requests` exceeds the overload
+  gate.
+- Low-cache session falls back to LMetric.
+- New session falls back to LMetric with the Unified-specific tie-breaker.
+- Breakdown policy is recorded as `affinity` or `lmetric_fallback`.
+
+Why:
+
+This prevents future drift back toward the discarded migration cost model or
+accidental changes to the LMetric baseline.
+
+Verification:
+
+- `pytest -q`.
+- Tests should run without live vLLM.
+
+### 6. Treat `overload_factor` Changes as Experiments, Not Silent Fixes
+
+Observation:
+
+The current worktree has an uncommitted change:
+
+```text
+Settings.overload_factor: 2.0 -> 1.5
+```
+
+But the earlier H7 sweep found overload-factor tuning largely ineffective /
+within noise. This is recorded in `analysis/elastic_hypotheses.md`.
+
+Fix:
+
+- Do not silently commit a default change to `1.5` without a paired benchmark.
+- If testing `1.5`, make it an experiment tag/config value, not a new default.
+- Keep docs and CLI defaults synchronized.
+
+Why:
+
+The prior experiment says imbalance was mostly workload/session skew, not the
+threshold. A default change without evidence will create another reproducibility
+gap.
+
+Verification:
+
+- Run paired `2.0` vs `1.5` after the hybrid tests exist.
+- Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance.
+
+### 7. Standardize Breakdown Fields for Hybrid Routing
+
+Problem:
+
+The current breakdown logs do not clearly expose why Unified chose affinity vs
+LMetric fallback.
+
+Fix:
+
+For each request under `--policy unified`, log:
+
+- `policy: "unified"`
+- `decision: "affinity" | "lmetric_fallback"`
+- `affinity_idx`
+- `chosen_idx`
+- `affinity_cache_hit`
+- `affinity_cache_ratio`
+- `affinity_num_requests`
+- `avg_num_requests`
+- `fallback_score`
+- `tie_break_used`
+
+Why:
+
+The latest performance difference versus LMetric is small. Without decision
+logs, it is hard to tell whether Unified is actually exercising the intended
+high-cache affinity behavior.
+
+Verification:
+
+- `breakdown.json` can answer: how many requests used affinity, how many used
+  fallback, and what latency/APC each group saw.
+
+## P2 Experiment Fixes
+
+### 8. Re-run Paired LMetric vs Unified Hybrid Benchmarks
+
+Problem:
+
+`docs/migration-policy-design.md` says the 2% mean improvement needs multi-run
+verification. Local raw outputs for the May 25 final comparison are not present
+in this workspace.
+
+Fix:
+
+- Run 3-5 fresh paired trials.
+- Same trace, same vLLM build, same machine, same restart procedure.
+- Compare:
+  - pure LMetric
+  - current Unified hybrid
+  - optionally Linear/session-sticky as a reference
+
+Metrics:
+
+- TTFT mean/p50/p90/p99
+- TPOT mean/p50/p90/p99
+- E2E mean/p50/p90/p99
+- errors/timeouts
+- aggregate APC
+- per-instance APC distribution
+- per-instance request count and token count
+- GPU util mean/std/imbalance
+
+Why:
+
+The current reported win is small enough that run-to-run noise matters.
+
+Verification:
+
+- Commit or save artifacts under a date/tagged output directory:
+  `config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
+  `gpu_util.csv`, final vLLM APC snapshot, git commit hash.
+
+### 9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism
+
+Rejected paths:
+
+- Full PD separation: decode KV memory wall.
+- Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit.
+- Cache-gate offload: improves balance for colocated survivors but offloaded
+  requests pay RDMA penalty.
+- Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse.
+- Relaxed gate / many offloads: 134 offloads made E2E p90 much worse.
+
+Future work can revisit migration only if one of these changes first:
+
+- layerwise / pipelined KV transfer
+- multi-machine P role so P work does not compete with D on the same GPU pool
+- exact vLLM state / cache residency exposed to the router
+- production-concurrency benchmark showing decode SLO pressure large enough to
+  amortize transfer overhead
+
+Why:
+
+The same local mechanism has already failed multiple times. Repeating it with
+another threshold is unlikely to help.
+
+### 10. Restore Production-Concurrency Evaluation as a Separate Track
+
+Problem:
+
+Several conclusions were made at 1-2 req/GPU, while production is estimated at
+8-15 req/GPU. Higher concurrency is where prefill-decode interference appears.
+
+Fix:
+
+- Restore or replace `--max-inflight-sessions` for controlled concurrency.
+- Run at 64 and 128 active sessions.
+- Treat this as a new experiment track, not as a reason to resurrect old
+  migration code immediately.
+
+Why:
+
+At high concurrency, the design pressure may change. But the implementation
+should first prove the current hybrid baseline cleanly.
+
+## Suggested Implementation Order
+
+1. Update docs to mark discarded migration/offload schemes as superseded.
+2. Remove stale unreachable PUSH code from the current Unified branch.
+3. Add tests for current Unified hybrid behavior.
+4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases.
+5. Add breakdown fields for hybrid routing decisions.
+6. Decide whether the local `overload_factor=1.5` diff is an experiment or
+   should be dropped.
+7. Run paired multi-run LMetric vs Unified hybrid benchmarks.
+8. Only after that, open a separate high-concurrency experiment track.
+
+## Review Checklist
+
+- Does the code still mention single `argmin(expected_latency)` as current
+  behavior? If yes, update it or mark it superseded.
+- Is any migration/offload code reachable under `--policy unified`? If yes,
+  require a new experiment plan because recent history rejects it.
+- Does pure `--policy lmetric` remain pure and affinity-free?
+- Does `--policy unified` clearly log when it used affinity vs LMetric fallback?
+- Are default parameter changes backed by paired results?
+- Can another reviewer reproduce the result from committed scripts and saved
+  artifacts?
+