Files
agentic-kvc/analysis/unified_routing_fix_review.md
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00

13 KiB

Unified Routing Fix Review Handoff

Date: 2026-05-25

This is the corrected review handoff after reading the git history. The key change from the previous draft is that we should not restore the old single-argmin / PUSH-migration design. That path was implemented, measured, and then discarded or simplified by later commits.

Executive Summary

The latest commit history says the current direction is:

  • Use baseline mode / no Mooncake for the winning comparison against LMetric.
  • Use LMetric-style load balancing as the base.
  • Add explicit affinity only for sessions with high accumulated cache.
  • Do not re-enable PD-sep offload or session migration unless the transfer mechanism is fundamentally reworked.

The main fixes now are cleanup, documentation consistency, tests, and reproducibility. The biggest risk is that stale docs still describe abandoned schemes as "final design".

Evidence From Git History

Relevant commits:

Commit Meaning Outcome
6b255fa Implemented single argmin(expected_latency) over local / PUSH / cold paths Later superseded
5892739 Added soft affinity because pure argmin overloaded cache-source instances Pure argmin was unstable
2b9eae0 Reported Unified v3 with 116 PUSH migrations TTFT improved, but TPOT/E2E tail tradeoff existed
cdf8349 Added real cache sync and cached-prefill-on-C architecture Fixed false PUSH and direct-read issues
4b50c5a Fixed queue model and hard overload gate Reduced imbalance, but still on offload path
e991960 / 5772149 Tried forced session migration triggers Reverted
cc6e562 Reverted Approach B migration 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s
bf4469a Tried more accurate push cost / gate alignment Later reverted
4c583f2 Reverted relaxed gate / push-cost fix 134 offloads made E2E p90 37s -> 82s
448361c Updated design doc: baseline no-Mooncake Unified beats LMetric PD-sep offload degrades
255c8e6 Replaced full cost model with hybrid routing Current direction: LMetric LB + high-cache affinity

Do not ask the implementer to "restore real three-way argmin". That was the wrong instruction in the previous draft.

Current Intended Algorithm

From 255c8e6, the current algorithm should be documented as:

if session has an affinity instance:
    if cache_ratio_on_affinity > 0.5
       and affinity_instance.num_requests <= avg_num_requests * overload_factor:
        route to affinity instance
    else:
        route by LMetric
else:
    route by LMetric

LMetric remains:

score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
new_uncached_tokens = input_length - estimated_cache_hit

This is no longer the old expected-latency migration model.

P0 Fixes

1. Remove Stale PUSH-Migration Code From the Current Unified Branch

Location: scripts/cache_aware_proxy.py, _handle_combined.

Problem:

After 255c8e6, unified is a hybrid policy, but the function still contains an unreachable block guarded by best_needs_push = False. That block references variables from the removed cost model such as best_cache_idx, best_cache_hit, and _current_offloads.

Fix:

  • In the unified branch, delete the unreachable if best_needs_push: block.
  • Keep helper functions like _handle_cached_prefill_offload only if another live mode still calls them.
  • Update the _handle_combined docstring so it no longer says Unified always computes queue + prefill + transfer.

Why:

The code is currently safe only because the branch is unreachable. Leaving it there makes future changes dangerous and makes reviewers think migration is still part of the active policy.

Verification:

  • rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py should show no stale references inside the active unified path.
  • pytest -q.

2. Reconcile Documentation With the Latest Commits

Locations:

  • REPORT.md
  • docs/migration-policy-design.md
  • analysis/research_findings.md
  • analysis/elastic_hypotheses.md

Problem:

Several docs still present old or contradictory conclusions:

  • REPORT.md section 3.9 still calls single argmin / PUSH migration the "Final Design".
  • docs/migration-policy-design.md describes a baseline-mode additive cost model, while HEAD implements hybrid LMetric + high-cache affinity.
  • analysis/research_findings.md says LMetric is neutralized by session affinity constraints, but later corrected LMetric results say cache-hit scoring creates implicit soft affinity.

Fix:

  • Add an explicit "Superseded by git history" note near REPORT.md section 3.9:
    • 6b255fa/5892739/2b9eae0 were explored.
    • cc6e562 rejected forced migration.
    • 4c583f2 rejected relaxed offload / high-offload configurations.
    • 255c8e6 is the current implementation direction.
  • Update docs/migration-policy-design.md to either:
    • describe the current hybrid algorithm, or
    • clearly state that the additive cost model is the previous Approach A and not the latest code.
  • Mark old analysis sections as superseded rather than deleting them.

Why:

The docs caused the previous bad review recommendation. Future reviewers need to know which ideas were already tested and rejected.

Verification:

  • rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis should show clear superseded labels where appropriate.

3. Preserve the LMetric Baseline Separately From Unified Hybrid

Location: scripts/cache_aware_proxy.py.

Problem:

Pure --policy lmetric is the baseline being compared against. Unified hybrid uses LMetric internally but should not accidentally change the LMetric baseline behavior.

Fix:

  • Keep pick_instance_lmetric as the pure corrected LMetric implementation.
  • Put any hybrid-specific tie-breakers or affinity logic outside pick_instance_lmetric, under --policy unified.
  • In breakdown logs, record whether a Unified request used affinity or lmetric_fallback.

Why:

If the baseline and Unified share hidden behavior, future A/B comparisons become invalid.

Verification:

  • Unit test: --policy lmetric never uses session affinity.
  • Unit test: --policy unified can use affinity only when cache ratio and load gates pass.

P1 Fixes

4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy

Problem:

LMetric score is P_tokens * BS. When BS = num_requests = 0, every instance gets score 0, so tie-break chooses instance 0. docs/migration-policy-design.md explicitly lists avoiding this as one reason Unified beats LMetric.

The latest hybrid falls back to LMetric, so it may reintroduce this issue for new sessions when all instances are idle.

Fix:

  • Do not change the pure --policy lmetric baseline.
  • For --policy unified fallback only, add a deterministic secondary key:
    • primary: P_tokens * BS
    • secondary when scores tie: new_uncached_tokens, then num_requests, then a round-robin or least-recently-used instance index.
  • Record tie-break count in breakdown or stats.

Why:

This preserves fair LMetric comparison while preventing Unified hybrid from degenerating to instance 0 under empty or near-empty load.

Verification:

  • Unit test with all num_requests = 0: Unified should not always choose index 0 across repeated new sessions.
  • Confirm pure LMetric test still matches the OSDI-style baseline.

5. Add Tests for the Current Hybrid Policy

Problem:

Existing tests cover older pick_instance and pure LMetric behavior, but not the current unified branch introduced by 255c8e6.

Fix:

Add tests for:

  • High-cache session sticks to affinity instance when not overloaded.
  • High-cache session breaks affinity when num_requests exceeds the overload gate.
  • Low-cache session falls back to LMetric.
  • New session falls back to LMetric with the Unified-specific tie-breaker.
  • Breakdown policy is recorded as affinity or lmetric_fallback.

Why:

This prevents future drift back toward the discarded migration cost model or accidental changes to the LMetric baseline.

Verification:

  • pytest -q.
  • Tests should run without live vLLM.

6. Treat overload_factor Changes as Experiments, Not Silent Fixes

Observation:

The current worktree has an uncommitted change:

Settings.overload_factor: 2.0 -> 1.5

But the earlier H7 sweep found overload-factor tuning largely ineffective / within noise. This is recorded in analysis/elastic_hypotheses.md.

Fix:

  • Do not silently commit a default change to 1.5 without a paired benchmark.
  • If testing 1.5, make it an experiment tag/config value, not a new default.
  • Keep docs and CLI defaults synchronized.

Why:

The prior experiment says imbalance was mostly workload/session skew, not the threshold. A default change without evidence will create another reproducibility gap.

Verification:

  • Run paired 2.0 vs 1.5 after the hybrid tests exist.
  • Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance.

7. Standardize Breakdown Fields for Hybrid Routing

Problem:

The current breakdown logs do not clearly expose why Unified chose affinity vs LMetric fallback.

Fix:

For each request under --policy unified, log:

  • policy: "unified"
  • decision: "affinity" | "lmetric_fallback"
  • affinity_idx
  • chosen_idx
  • affinity_cache_hit
  • affinity_cache_ratio
  • affinity_num_requests
  • avg_num_requests
  • fallback_score
  • tie_break_used

Why:

The latest performance difference versus LMetric is small. Without decision logs, it is hard to tell whether Unified is actually exercising the intended high-cache affinity behavior.

Verification:

  • breakdown.json can answer: how many requests used affinity, how many used fallback, and what latency/APC each group saw.

P2 Experiment Fixes

8. Re-run Paired LMetric vs Unified Hybrid Benchmarks

Problem:

docs/migration-policy-design.md says the 2% mean improvement needs multi-run verification. Local raw outputs for the May 25 final comparison are not present in this workspace.

Fix:

  • Run 3-5 fresh paired trials.
  • Same trace, same vLLM build, same machine, same restart procedure.
  • Compare:
    • pure LMetric
    • current Unified hybrid
    • optionally Linear/session-sticky as a reference

Metrics:

  • TTFT mean/p50/p90/p99
  • TPOT mean/p50/p90/p99
  • E2E mean/p50/p90/p99
  • errors/timeouts
  • aggregate APC
  • per-instance APC distribution
  • per-instance request count and token count
  • GPU util mean/std/imbalance

Why:

The current reported win is small enough that run-to-run noise matters.

Verification:

  • Commit or save artifacts under a date/tagged output directory: config.json, metrics.jsonl, metrics.summary.json, breakdown.json, gpu_util.csv, final vLLM APC snapshot, git commit hash.

9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism

Rejected paths:

  • Full PD separation: decode KV memory wall.
  • Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit.
  • Cache-gate offload: improves balance for colocated survivors but offloaded requests pay RDMA penalty.
  • Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse.
  • Relaxed gate / many offloads: 134 offloads made E2E p90 much worse.

Future work can revisit migration only if one of these changes first:

  • layerwise / pipelined KV transfer
  • multi-machine P role so P work does not compete with D on the same GPU pool
  • exact vLLM state / cache residency exposed to the router
  • production-concurrency benchmark showing decode SLO pressure large enough to amortize transfer overhead

Why:

The same local mechanism has already failed multiple times. Repeating it with another threshold is unlikely to help.

10. Restore Production-Concurrency Evaluation as a Separate Track

Problem:

Several conclusions were made at 1-2 req/GPU, while production is estimated at 8-15 req/GPU. Higher concurrency is where prefill-decode interference appears.

Fix:

  • Restore or replace --max-inflight-sessions for controlled concurrency.
  • Run at 64 and 128 active sessions.
  • Treat this as a new experiment track, not as a reason to resurrect old migration code immediately.

Why:

At high concurrency, the design pressure may change. But the implementation should first prove the current hybrid baseline cleanly.

Suggested Implementation Order

  1. Update docs to mark discarded migration/offload schemes as superseded.
  2. Remove stale unreachable PUSH code from the current Unified branch.
  3. Add tests for current Unified hybrid behavior.
  4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases.
  5. Add breakdown fields for hybrid routing decisions.
  6. Decide whether the local overload_factor=1.5 diff is an experiment or should be dropped.
  7. Run paired multi-run LMetric vs Unified hybrid benchmarks.
  8. Only after that, open a separate high-concurrency experiment track.

Review Checklist

  • Does the code still mention single argmin(expected_latency) as current behavior? If yes, update it or mark it superseded.
  • Is any migration/offload code reachable under --policy unified? If yes, require a new experiment plan because recent history rejects it.
  • Does pure --policy lmetric remain pure and affinity-free?
  • Does --policy unified clearly log when it used affinity vs LMetric fallback?
  • Are default parameter changes backed by paired results?
  • Can another reviewer reproduce the result from committed scripts and saved artifacts?