Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit255c8e6). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired aftercc6e562/ 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
407 lines
13 KiB
Markdown
407 lines
13 KiB
Markdown
# Unified Routing Fix Review Handoff
|
|
|
|
Date: 2026-05-25
|
|
|
|
This is the corrected review handoff after reading the git history. The key
|
|
change from the previous draft is that we should **not** restore the old
|
|
single-argmin / PUSH-migration design. That path was implemented, measured,
|
|
and then discarded or simplified by later commits.
|
|
|
|
## Executive Summary
|
|
|
|
The latest commit history says the current direction is:
|
|
|
|
- Use baseline mode / no Mooncake for the winning comparison against LMetric.
|
|
- Use LMetric-style load balancing as the base.
|
|
- Add explicit affinity only for sessions with high accumulated cache.
|
|
- Do not re-enable PD-sep offload or session migration unless the transfer
|
|
mechanism is fundamentally reworked.
|
|
|
|
The main fixes now are cleanup, documentation consistency, tests, and
|
|
reproducibility. The biggest risk is that stale docs still describe abandoned
|
|
schemes as "final design".
|
|
|
|
## Evidence From Git History
|
|
|
|
Relevant commits:
|
|
|
|
| Commit | Meaning | Outcome |
|
|
|---|---|---|
|
|
| `6b255fa` | Implemented single `argmin(expected_latency)` over local / PUSH / cold paths | Later superseded |
|
|
| `5892739` | Added soft affinity because pure argmin overloaded cache-source instances | Pure argmin was unstable |
|
|
| `2b9eae0` | Reported Unified v3 with 116 PUSH migrations | TTFT improved, but TPOT/E2E tail tradeoff existed |
|
|
| `cdf8349` | Added real cache sync and cached-prefill-on-C architecture | Fixed false PUSH and direct-read issues |
|
|
| `4b50c5a` | Fixed queue model and hard overload gate | Reduced imbalance, but still on offload path |
|
|
| `e991960` / `5772149` | Tried forced session migration triggers | Reverted |
|
|
| `cc6e562` | Reverted Approach B migration | 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s |
|
|
| `bf4469a` | Tried more accurate push cost / gate alignment | Later reverted |
|
|
| `4c583f2` | Reverted relaxed gate / push-cost fix | 134 offloads made E2E p90 37s -> 82s |
|
|
| `448361c` | Updated design doc: baseline no-Mooncake Unified beats LMetric | PD-sep offload degrades |
|
|
| `255c8e6` | Replaced full cost model with hybrid routing | Current direction: LMetric LB + high-cache affinity |
|
|
|
|
Do not ask the implementer to "restore real three-way argmin". That was the
|
|
wrong instruction in the previous draft.
|
|
|
|
## Current Intended Algorithm
|
|
|
|
From `255c8e6`, the current algorithm should be documented as:
|
|
|
|
```text
|
|
if session has an affinity instance:
|
|
if cache_ratio_on_affinity > 0.5
|
|
and affinity_instance.num_requests <= avg_num_requests * overload_factor:
|
|
route to affinity instance
|
|
else:
|
|
route by LMetric
|
|
else:
|
|
route by LMetric
|
|
```
|
|
|
|
LMetric remains:
|
|
|
|
```text
|
|
score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
|
|
new_uncached_tokens = input_length - estimated_cache_hit
|
|
```
|
|
|
|
This is no longer the old expected-latency migration model.
|
|
|
|
## P0 Fixes
|
|
|
|
### 1. Remove Stale PUSH-Migration Code From the Current Unified Branch
|
|
|
|
Location: `scripts/cache_aware_proxy.py`, `_handle_combined`.
|
|
|
|
Problem:
|
|
|
|
After `255c8e6`, `unified` is a hybrid policy, but the function still contains
|
|
an unreachable block guarded by `best_needs_push = False`. That block references
|
|
variables from the removed cost model such as `best_cache_idx`,
|
|
`best_cache_hit`, and `_current_offloads`.
|
|
|
|
Fix:
|
|
|
|
- In the `unified` branch, delete the unreachable `if best_needs_push:` block.
|
|
- Keep helper functions like `_handle_cached_prefill_offload` only if another
|
|
live mode still calls them.
|
|
- Update the `_handle_combined` docstring so it no longer says Unified always
|
|
computes `queue + prefill + transfer`.
|
|
|
|
Why:
|
|
|
|
The code is currently safe only because the branch is unreachable. Leaving it
|
|
there makes future changes dangerous and makes reviewers think migration is
|
|
still part of the active policy.
|
|
|
|
Verification:
|
|
|
|
- `rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py`
|
|
should show no stale references inside the active `unified` path.
|
|
- `pytest -q`.
|
|
|
|
### 2. Reconcile Documentation With the Latest Commits
|
|
|
|
Locations:
|
|
|
|
- `REPORT.md`
|
|
- `docs/migration-policy-design.md`
|
|
- `analysis/research_findings.md`
|
|
- `analysis/elastic_hypotheses.md`
|
|
|
|
Problem:
|
|
|
|
Several docs still present old or contradictory conclusions:
|
|
|
|
- `REPORT.md` section 3.9 still calls single argmin / PUSH migration the
|
|
"Final Design".
|
|
- `docs/migration-policy-design.md` describes a baseline-mode additive cost
|
|
model, while HEAD implements hybrid LMetric + high-cache affinity.
|
|
- `analysis/research_findings.md` says LMetric is neutralized by session
|
|
affinity constraints, but later corrected LMetric results say cache-hit
|
|
scoring creates implicit soft affinity.
|
|
|
|
Fix:
|
|
|
|
- Add an explicit "Superseded by git history" note near `REPORT.md` section
|
|
3.9:
|
|
- `6b255fa/5892739/2b9eae0` were explored.
|
|
- `cc6e562` rejected forced migration.
|
|
- `4c583f2` rejected relaxed offload / high-offload configurations.
|
|
- `255c8e6` is the current implementation direction.
|
|
- Update `docs/migration-policy-design.md` to either:
|
|
- describe the current hybrid algorithm, or
|
|
- clearly state that the additive cost model is the previous Approach A and
|
|
not the latest code.
|
|
- Mark old analysis sections as superseded rather than deleting them.
|
|
|
|
Why:
|
|
|
|
The docs caused the previous bad review recommendation. Future reviewers need
|
|
to know which ideas were already tested and rejected.
|
|
|
|
Verification:
|
|
|
|
- `rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis`
|
|
should show clear superseded labels where appropriate.
|
|
|
|
### 3. Preserve the LMetric Baseline Separately From Unified Hybrid
|
|
|
|
Location: `scripts/cache_aware_proxy.py`.
|
|
|
|
Problem:
|
|
|
|
Pure `--policy lmetric` is the baseline being compared against. Unified hybrid
|
|
uses LMetric internally but should not accidentally change the LMetric baseline
|
|
behavior.
|
|
|
|
Fix:
|
|
|
|
- Keep `pick_instance_lmetric` as the pure corrected LMetric implementation.
|
|
- Put any hybrid-specific tie-breakers or affinity logic outside
|
|
`pick_instance_lmetric`, under `--policy unified`.
|
|
- In breakdown logs, record whether a Unified request used `affinity` or
|
|
`lmetric_fallback`.
|
|
|
|
Why:
|
|
|
|
If the baseline and Unified share hidden behavior, future A/B comparisons become
|
|
invalid.
|
|
|
|
Verification:
|
|
|
|
- Unit test: `--policy lmetric` never uses session affinity.
|
|
- Unit test: `--policy unified` can use affinity only when cache ratio and load
|
|
gates pass.
|
|
|
|
## P1 Fixes
|
|
|
|
### 4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy
|
|
|
|
Problem:
|
|
|
|
LMetric score is `P_tokens * BS`. When `BS = num_requests = 0`, every instance
|
|
gets score 0, so tie-break chooses instance 0. `docs/migration-policy-design.md`
|
|
explicitly lists avoiding this as one reason Unified beats LMetric.
|
|
|
|
The latest hybrid falls back to LMetric, so it may reintroduce this issue for
|
|
new sessions when all instances are idle.
|
|
|
|
Fix:
|
|
|
|
- Do not change the pure `--policy lmetric` baseline.
|
|
- For `--policy unified` fallback only, add a deterministic secondary key:
|
|
- primary: `P_tokens * BS`
|
|
- secondary when scores tie: `new_uncached_tokens`, then `num_requests`, then
|
|
a round-robin or least-recently-used instance index.
|
|
- Record tie-break count in breakdown or stats.
|
|
|
|
Why:
|
|
|
|
This preserves fair LMetric comparison while preventing Unified hybrid from
|
|
degenerating to instance 0 under empty or near-empty load.
|
|
|
|
Verification:
|
|
|
|
- Unit test with all `num_requests = 0`: Unified should not always choose index
|
|
0 across repeated new sessions.
|
|
- Confirm pure LMetric test still matches the OSDI-style baseline.
|
|
|
|
### 5. Add Tests for the Current Hybrid Policy
|
|
|
|
Problem:
|
|
|
|
Existing tests cover older `pick_instance` and pure LMetric behavior, but not
|
|
the current `unified` branch introduced by `255c8e6`.
|
|
|
|
Fix:
|
|
|
|
Add tests for:
|
|
|
|
- High-cache session sticks to affinity instance when not overloaded.
|
|
- High-cache session breaks affinity when `num_requests` exceeds the overload
|
|
gate.
|
|
- Low-cache session falls back to LMetric.
|
|
- New session falls back to LMetric with the Unified-specific tie-breaker.
|
|
- Breakdown policy is recorded as `affinity` or `lmetric_fallback`.
|
|
|
|
Why:
|
|
|
|
This prevents future drift back toward the discarded migration cost model or
|
|
accidental changes to the LMetric baseline.
|
|
|
|
Verification:
|
|
|
|
- `pytest -q`.
|
|
- Tests should run without live vLLM.
|
|
|
|
### 6. Treat `overload_factor` Changes as Experiments, Not Silent Fixes
|
|
|
|
Observation:
|
|
|
|
The current worktree has an uncommitted change:
|
|
|
|
```text
|
|
Settings.overload_factor: 2.0 -> 1.5
|
|
```
|
|
|
|
But the earlier H7 sweep found overload-factor tuning largely ineffective /
|
|
within noise. This is recorded in `analysis/elastic_hypotheses.md`.
|
|
|
|
Fix:
|
|
|
|
- Do not silently commit a default change to `1.5` without a paired benchmark.
|
|
- If testing `1.5`, make it an experiment tag/config value, not a new default.
|
|
- Keep docs and CLI defaults synchronized.
|
|
|
|
Why:
|
|
|
|
The prior experiment says imbalance was mostly workload/session skew, not the
|
|
threshold. A default change without evidence will create another reproducibility
|
|
gap.
|
|
|
|
Verification:
|
|
|
|
- Run paired `2.0` vs `1.5` after the hybrid tests exist.
|
|
- Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance.
|
|
|
|
### 7. Standardize Breakdown Fields for Hybrid Routing
|
|
|
|
Problem:
|
|
|
|
The current breakdown logs do not clearly expose why Unified chose affinity vs
|
|
LMetric fallback.
|
|
|
|
Fix:
|
|
|
|
For each request under `--policy unified`, log:
|
|
|
|
- `policy: "unified"`
|
|
- `decision: "affinity" | "lmetric_fallback"`
|
|
- `affinity_idx`
|
|
- `chosen_idx`
|
|
- `affinity_cache_hit`
|
|
- `affinity_cache_ratio`
|
|
- `affinity_num_requests`
|
|
- `avg_num_requests`
|
|
- `fallback_score`
|
|
- `tie_break_used`
|
|
|
|
Why:
|
|
|
|
The latest performance difference versus LMetric is small. Without decision
|
|
logs, it is hard to tell whether Unified is actually exercising the intended
|
|
high-cache affinity behavior.
|
|
|
|
Verification:
|
|
|
|
- `breakdown.json` can answer: how many requests used affinity, how many used
|
|
fallback, and what latency/APC each group saw.
|
|
|
|
## P2 Experiment Fixes
|
|
|
|
### 8. Re-run Paired LMetric vs Unified Hybrid Benchmarks
|
|
|
|
Problem:
|
|
|
|
`docs/migration-policy-design.md` says the 2% mean improvement needs multi-run
|
|
verification. Local raw outputs for the May 25 final comparison are not present
|
|
in this workspace.
|
|
|
|
Fix:
|
|
|
|
- Run 3-5 fresh paired trials.
|
|
- Same trace, same vLLM build, same machine, same restart procedure.
|
|
- Compare:
|
|
- pure LMetric
|
|
- current Unified hybrid
|
|
- optionally Linear/session-sticky as a reference
|
|
|
|
Metrics:
|
|
|
|
- TTFT mean/p50/p90/p99
|
|
- TPOT mean/p50/p90/p99
|
|
- E2E mean/p50/p90/p99
|
|
- errors/timeouts
|
|
- aggregate APC
|
|
- per-instance APC distribution
|
|
- per-instance request count and token count
|
|
- GPU util mean/std/imbalance
|
|
|
|
Why:
|
|
|
|
The current reported win is small enough that run-to-run noise matters.
|
|
|
|
Verification:
|
|
|
|
- Commit or save artifacts under a date/tagged output directory:
|
|
`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
|
|
`gpu_util.csv`, final vLLM APC snapshot, git commit hash.
|
|
|
|
### 9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism
|
|
|
|
Rejected paths:
|
|
|
|
- Full PD separation: decode KV memory wall.
|
|
- Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit.
|
|
- Cache-gate offload: improves balance for colocated survivors but offloaded
|
|
requests pay RDMA penalty.
|
|
- Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse.
|
|
- Relaxed gate / many offloads: 134 offloads made E2E p90 much worse.
|
|
|
|
Future work can revisit migration only if one of these changes first:
|
|
|
|
- layerwise / pipelined KV transfer
|
|
- multi-machine P role so P work does not compete with D on the same GPU pool
|
|
- exact vLLM state / cache residency exposed to the router
|
|
- production-concurrency benchmark showing decode SLO pressure large enough to
|
|
amortize transfer overhead
|
|
|
|
Why:
|
|
|
|
The same local mechanism has already failed multiple times. Repeating it with
|
|
another threshold is unlikely to help.
|
|
|
|
### 10. Restore Production-Concurrency Evaluation as a Separate Track
|
|
|
|
Problem:
|
|
|
|
Several conclusions were made at 1-2 req/GPU, while production is estimated at
|
|
8-15 req/GPU. Higher concurrency is where prefill-decode interference appears.
|
|
|
|
Fix:
|
|
|
|
- Restore or replace `--max-inflight-sessions` for controlled concurrency.
|
|
- Run at 64 and 128 active sessions.
|
|
- Treat this as a new experiment track, not as a reason to resurrect old
|
|
migration code immediately.
|
|
|
|
Why:
|
|
|
|
At high concurrency, the design pressure may change. But the implementation
|
|
should first prove the current hybrid baseline cleanly.
|
|
|
|
## Suggested Implementation Order
|
|
|
|
1. Update docs to mark discarded migration/offload schemes as superseded.
|
|
2. Remove stale unreachable PUSH code from the current Unified branch.
|
|
3. Add tests for current Unified hybrid behavior.
|
|
4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases.
|
|
5. Add breakdown fields for hybrid routing decisions.
|
|
6. Decide whether the local `overload_factor=1.5` diff is an experiment or
|
|
should be dropped.
|
|
7. Run paired multi-run LMetric vs Unified hybrid benchmarks.
|
|
8. Only after that, open a separate high-concurrency experiment track.
|
|
|
|
## Review Checklist
|
|
|
|
- Does the code still mention single `argmin(expected_latency)` as current
|
|
behavior? If yes, update it or mark it superseded.
|
|
- Is any migration/offload code reachable under `--policy unified`? If yes,
|
|
require a new experiment plan because recent history rejects it.
|
|
- Does pure `--policy lmetric` remain pure and affinity-free?
|
|
- Does `--policy unified` clearly log when it used affinity vs LMetric fallback?
|
|
- Are default parameter changes backed by paired results?
|
|
- Can another reviewer reproduce the result from committed scripts and saved
|
|
artifacts?
|
|
|