Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit255c8e6). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired aftercc6e562/ 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
24
REPORT.md
24
REPORT.md
@@ -27,6 +27,12 @@ For agentic LLM workloads (long input, short output, high KV cache reuse), is pr
|
||||
> was removed when replay moved to trace-driven dispatch. The next-step
|
||||
> experiment requires restoring the flag first (see `FIXES.md` §B2
|
||||
> route A) before any production-concurrency numbers can be produced.
|
||||
> - **§3.9 "Final Design" framing**: the single-argmin + PUSH-migration
|
||||
> design was retired after `cc6e562` / `4c583f2` showed forced and
|
||||
> relaxed-gate migration variants both regressed E2E tail. Current
|
||||
> policy is the hybrid LMetric + high-cache affinity landed in
|
||||
> `255c8e6`. See the per-section note in §3.9 and the active algorithm
|
||||
> in `docs/migration-policy-design.md`.
|
||||
>
|
||||
> The authoritative results are in **§3.6 and §3.7**.
|
||||
|
||||
@@ -356,7 +362,23 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall
|
||||
|
||||
**Output**: `outputs/eval_direct_rdma_v*/` on dash0.
|
||||
|
||||
### 3.9 Unified Routing (Final Design)
|
||||
### 3.9 Unified Routing (Historical — superseded)
|
||||
|
||||
> **Superseded by git history.** The "single argmin + PUSH migration" design
|
||||
> described here was implemented in `6b255fa`, refined through
|
||||
> `5892739` (soft affinity), `2b9eae0` (numbers below), and `4b50c5a`
|
||||
> (queue/overload-gate fixes). Follow-on attempts to scale migration —
|
||||
> `e991960`/`5772149` (forced session migration) and `bf4469a` (relaxed
|
||||
> push gate) — were both reverted (`cc6e562`, `4c583f2`) after they
|
||||
> regressed E2E tail (57 migrations → HEAVY TTFT p90 15.9s → 59.1s;
|
||||
> 134 offloads → E2E p90 37s → 82s).
|
||||
>
|
||||
> Current implementation is the **hybrid LMetric + high-cache affinity**
|
||||
> direction landed in `255c8e6`. See `docs/migration-policy-design.md`
|
||||
> for the active algorithm and `analysis/unified_routing_fix_review.md`
|
||||
> for the reasoning. The numbers below remain valid for the
|
||||
> `eval_unified_v3` artifact; do not treat them as the current
|
||||
> production policy.
|
||||
|
||||
Replaced two-phase routing (pick_instance → offload gate) with single `argmin(expected_latency)` per instance:
|
||||
|
||||
|
||||
@@ -244,7 +244,11 @@ Offloaded: — 13/500 (2.6%) too few to matter
|
||||
### What DOESN'T work for agentic workloads:
|
||||
|
||||
1. **PD-Sep**: net negative — KV cache memory wall on decode instances
|
||||
2. **LMetric (OSDI'26)**: ≈ linear routing — session affinity limits routing freedom
|
||||
2. **LMetric (OSDI'26)**: ≈ linear routing — `P_tokens` already includes
|
||||
`new_uncached_tokens`, so cache-hit scoring gives LMetric an implicit
|
||||
soft affinity that converges to similar placements as explicit sticky
|
||||
affinity (see `analysis/research_findings.md` §2.2 for the corrected
|
||||
framing)
|
||||
3. **Elastic P2P RDMA offload**: net negative — Mooncake transfer overhead, no layerwise pipeline
|
||||
4. **OVERLOAD_FACTOR tuning**: no effect — imbalance from workload skew, not routing
|
||||
5. **Dedicated Prefill Service (PS)**: cannot win cost comparison without KV pull, PS is always slower than cached C
|
||||
@@ -270,3 +274,21 @@ Instead of fixed chunk size, dynamically adjust based on decode pressure:
|
||||
- When decode queue is deep: smaller chunks → more decode slots → better TPOT
|
||||
- When decode queue is empty: larger chunks → faster prefill → better TTFT
|
||||
- This is a vLLM scheduler modification, not a routing change
|
||||
|
||||
---
|
||||
|
||||
## Current routing direction (cross-reference)
|
||||
|
||||
The hypotheses above produced the following positive results that informed
|
||||
the current `--policy unified` implementation:
|
||||
|
||||
- H1 / H7 / H9 (negative): PD-sep offload, OVERLOAD_FACTOR tuning, and
|
||||
elastic RDMA at high concurrency all regressed or stayed within noise.
|
||||
- H3 / H4 / H6 (partial): cache-gated offload exists but only ~10-12% of
|
||||
HEAVY requests have cache, and the offloaded subset pays RDMA penalty.
|
||||
|
||||
The active algorithm (commit `255c8e6`) is **hybrid LMetric + high-cache
|
||||
affinity** in baseline mode (no Mooncake). The retired migration variants
|
||||
are catalogued in `docs/migration-policy-design.md` (Approach A and the
|
||||
revert chain `cc6e562` / `4c583f2`). H7's rejection (OVERLOAD_FACTOR within
|
||||
noise) is why the active default stays at `overload_factor=2.0`.
|
||||
|
||||
@@ -38,7 +38,24 @@ These characteristics fundamentally change what optimizations matter.
|
||||
**Setup**: 8 instances, LMetric vs linear routing
|
||||
**Result**: TTFT +2.2%, TPOT -4.4%, E2E +2.6% — all within noise (±7% run-to-run)
|
||||
|
||||
**Root cause**: Session affinity constrains routing freedom. LMetric's benefit (hyperparameter-free load balancing) is neutralized because turn 2+ requests MUST go to their session-sticky instance regardless of the scoring function. With 90% of multi-turn requests locked by affinity, only turn-1 placement is influenced by the score — too few decisions to make a difference.
|
||||
**Root cause (updated)**: LMetric is not "neutralized by affinity
|
||||
constraints" — pure `--policy lmetric` runs without session affinity at all.
|
||||
The actual reason the LMetric vs linear comparison sits within noise is that
|
||||
`P_tokens` already includes `new_uncached_tokens = input_length - cache_hit`,
|
||||
which means later turns of a session naturally score lowest on the instance
|
||||
that cached their prefix. This gives LMetric an **implicit soft affinity**
|
||||
that competes with linear's explicit sticky affinity. The two arrive at
|
||||
similar placements through different mechanisms.
|
||||
|
||||
This is also why explicit migration buys little on top of LMetric: the
|
||||
first-order signal driving placement is already cache-derived. See
|
||||
`docs/migration-policy-design.md` for how the current hybrid policy uses
|
||||
this insight (LMetric base + explicit affinity only when `cache_ratio > 0.5`).
|
||||
|
||||
**Previous framing (incorrect)**: an earlier draft of this section attributed
|
||||
the result to session affinity constraining LMetric's routing freedom. That
|
||||
framing assumed `--policy lmetric` inherited the linear-mode session-sticky
|
||||
behavior, which it does not (verified in `tests/test_proxy_pick.py`).
|
||||
|
||||
### 2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance)
|
||||
|
||||
@@ -148,7 +165,9 @@ This changes the scheduling picture: most "HEAVY" requests in agentic workloads
|
||||
|
||||
### Why existing approaches don't work:
|
||||
1. **PD-Sep** assumes decode needs dedicated resources → agentic has memory wall on decode
|
||||
2. **LMetric** assumes routing freedom → agentic has session affinity constraints
|
||||
2. **LMetric** matches linear within noise because cache-hit appears in
|
||||
`P_tokens` itself, so it already routes later turns back to the cached
|
||||
instance via implicit soft affinity — explicit affinity buys little
|
||||
3. **Elastic RDMA** assumes KV transfer is cheap → Mooncake lacks layerwise pipelining
|
||||
4. **Size-based classification** assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM
|
||||
|
||||
|
||||
406
analysis/unified_routing_fix_review.md
Normal file
406
analysis/unified_routing_fix_review.md
Normal file
@@ -0,0 +1,406 @@
|
||||
# Unified Routing Fix Review Handoff
|
||||
|
||||
Date: 2026-05-25
|
||||
|
||||
This is the corrected review handoff after reading the git history. The key
|
||||
change from the previous draft is that we should **not** restore the old
|
||||
single-argmin / PUSH-migration design. That path was implemented, measured,
|
||||
and then discarded or simplified by later commits.
|
||||
|
||||
## Executive Summary
|
||||
|
||||
The latest commit history says the current direction is:
|
||||
|
||||
- Use baseline mode / no Mooncake for the winning comparison against LMetric.
|
||||
- Use LMetric-style load balancing as the base.
|
||||
- Add explicit affinity only for sessions with high accumulated cache.
|
||||
- Do not re-enable PD-sep offload or session migration unless the transfer
|
||||
mechanism is fundamentally reworked.
|
||||
|
||||
The main fixes now are cleanup, documentation consistency, tests, and
|
||||
reproducibility. The biggest risk is that stale docs still describe abandoned
|
||||
schemes as "final design".
|
||||
|
||||
## Evidence From Git History
|
||||
|
||||
Relevant commits:
|
||||
|
||||
| Commit | Meaning | Outcome |
|
||||
|---|---|---|
|
||||
| `6b255fa` | Implemented single `argmin(expected_latency)` over local / PUSH / cold paths | Later superseded |
|
||||
| `5892739` | Added soft affinity because pure argmin overloaded cache-source instances | Pure argmin was unstable |
|
||||
| `2b9eae0` | Reported Unified v3 with 116 PUSH migrations | TTFT improved, but TPOT/E2E tail tradeoff existed |
|
||||
| `cdf8349` | Added real cache sync and cached-prefill-on-C architecture | Fixed false PUSH and direct-read issues |
|
||||
| `4b50c5a` | Fixed queue model and hard overload gate | Reduced imbalance, but still on offload path |
|
||||
| `e991960` / `5772149` | Tried forced session migration triggers | Reverted |
|
||||
| `cc6e562` | Reverted Approach B migration | 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s |
|
||||
| `bf4469a` | Tried more accurate push cost / gate alignment | Later reverted |
|
||||
| `4c583f2` | Reverted relaxed gate / push-cost fix | 134 offloads made E2E p90 37s -> 82s |
|
||||
| `448361c` | Updated design doc: baseline no-Mooncake Unified beats LMetric | PD-sep offload degrades |
|
||||
| `255c8e6` | Replaced full cost model with hybrid routing | Current direction: LMetric LB + high-cache affinity |
|
||||
|
||||
Do not ask the implementer to "restore real three-way argmin". That was the
|
||||
wrong instruction in the previous draft.
|
||||
|
||||
## Current Intended Algorithm
|
||||
|
||||
From `255c8e6`, the current algorithm should be documented as:
|
||||
|
||||
```text
|
||||
if session has an affinity instance:
|
||||
if cache_ratio_on_affinity > 0.5
|
||||
and affinity_instance.num_requests <= avg_num_requests * overload_factor:
|
||||
route to affinity instance
|
||||
else:
|
||||
route by LMetric
|
||||
else:
|
||||
route by LMetric
|
||||
```
|
||||
|
||||
LMetric remains:
|
||||
|
||||
```text
|
||||
score = (pending_prefill_tokens + new_uncached_tokens) * num_requests
|
||||
new_uncached_tokens = input_length - estimated_cache_hit
|
||||
```
|
||||
|
||||
This is no longer the old expected-latency migration model.
|
||||
|
||||
## P0 Fixes
|
||||
|
||||
### 1. Remove Stale PUSH-Migration Code From the Current Unified Branch
|
||||
|
||||
Location: `scripts/cache_aware_proxy.py`, `_handle_combined`.
|
||||
|
||||
Problem:
|
||||
|
||||
After `255c8e6`, `unified` is a hybrid policy, but the function still contains
|
||||
an unreachable block guarded by `best_needs_push = False`. That block references
|
||||
variables from the removed cost model such as `best_cache_idx`,
|
||||
`best_cache_hit`, and `_current_offloads`.
|
||||
|
||||
Fix:
|
||||
|
||||
- In the `unified` branch, delete the unreachable `if best_needs_push:` block.
|
||||
- Keep helper functions like `_handle_cached_prefill_offload` only if another
|
||||
live mode still calls them.
|
||||
- Update the `_handle_combined` docstring so it no longer says Unified always
|
||||
computes `queue + prefill + transfer`.
|
||||
|
||||
Why:
|
||||
|
||||
The code is currently safe only because the branch is unreachable. Leaving it
|
||||
there makes future changes dangerous and makes reviewers think migration is
|
||||
still part of the active policy.
|
||||
|
||||
Verification:
|
||||
|
||||
- `rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py`
|
||||
should show no stale references inside the active `unified` path.
|
||||
- `pytest -q`.
|
||||
|
||||
### 2. Reconcile Documentation With the Latest Commits
|
||||
|
||||
Locations:
|
||||
|
||||
- `REPORT.md`
|
||||
- `docs/migration-policy-design.md`
|
||||
- `analysis/research_findings.md`
|
||||
- `analysis/elastic_hypotheses.md`
|
||||
|
||||
Problem:
|
||||
|
||||
Several docs still present old or contradictory conclusions:
|
||||
|
||||
- `REPORT.md` section 3.9 still calls single argmin / PUSH migration the
|
||||
"Final Design".
|
||||
- `docs/migration-policy-design.md` describes a baseline-mode additive cost
|
||||
model, while HEAD implements hybrid LMetric + high-cache affinity.
|
||||
- `analysis/research_findings.md` says LMetric is neutralized by session
|
||||
affinity constraints, but later corrected LMetric results say cache-hit
|
||||
scoring creates implicit soft affinity.
|
||||
|
||||
Fix:
|
||||
|
||||
- Add an explicit "Superseded by git history" note near `REPORT.md` section
|
||||
3.9:
|
||||
- `6b255fa/5892739/2b9eae0` were explored.
|
||||
- `cc6e562` rejected forced migration.
|
||||
- `4c583f2` rejected relaxed offload / high-offload configurations.
|
||||
- `255c8e6` is the current implementation direction.
|
||||
- Update `docs/migration-policy-design.md` to either:
|
||||
- describe the current hybrid algorithm, or
|
||||
- clearly state that the additive cost model is the previous Approach A and
|
||||
not the latest code.
|
||||
- Mark old analysis sections as superseded rather than deleting them.
|
||||
|
||||
Why:
|
||||
|
||||
The docs caused the previous bad review recommendation. Future reviewers need
|
||||
to know which ideas were already tested and rejected.
|
||||
|
||||
Verification:
|
||||
|
||||
- `rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis`
|
||||
should show clear superseded labels where appropriate.
|
||||
|
||||
### 3. Preserve the LMetric Baseline Separately From Unified Hybrid
|
||||
|
||||
Location: `scripts/cache_aware_proxy.py`.
|
||||
|
||||
Problem:
|
||||
|
||||
Pure `--policy lmetric` is the baseline being compared against. Unified hybrid
|
||||
uses LMetric internally but should not accidentally change the LMetric baseline
|
||||
behavior.
|
||||
|
||||
Fix:
|
||||
|
||||
- Keep `pick_instance_lmetric` as the pure corrected LMetric implementation.
|
||||
- Put any hybrid-specific tie-breakers or affinity logic outside
|
||||
`pick_instance_lmetric`, under `--policy unified`.
|
||||
- In breakdown logs, record whether a Unified request used `affinity` or
|
||||
`lmetric_fallback`.
|
||||
|
||||
Why:
|
||||
|
||||
If the baseline and Unified share hidden behavior, future A/B comparisons become
|
||||
invalid.
|
||||
|
||||
Verification:
|
||||
|
||||
- Unit test: `--policy lmetric` never uses session affinity.
|
||||
- Unit test: `--policy unified` can use affinity only when cache ratio and load
|
||||
gates pass.
|
||||
|
||||
## P1 Fixes
|
||||
|
||||
### 4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy
|
||||
|
||||
Problem:
|
||||
|
||||
LMetric score is `P_tokens * BS`. When `BS = num_requests = 0`, every instance
|
||||
gets score 0, so tie-break chooses instance 0. `docs/migration-policy-design.md`
|
||||
explicitly lists avoiding this as one reason Unified beats LMetric.
|
||||
|
||||
The latest hybrid falls back to LMetric, so it may reintroduce this issue for
|
||||
new sessions when all instances are idle.
|
||||
|
||||
Fix:
|
||||
|
||||
- Do not change the pure `--policy lmetric` baseline.
|
||||
- For `--policy unified` fallback only, add a deterministic secondary key:
|
||||
- primary: `P_tokens * BS`
|
||||
- secondary when scores tie: `new_uncached_tokens`, then `num_requests`, then
|
||||
a round-robin or least-recently-used instance index.
|
||||
- Record tie-break count in breakdown or stats.
|
||||
|
||||
Why:
|
||||
|
||||
This preserves fair LMetric comparison while preventing Unified hybrid from
|
||||
degenerating to instance 0 under empty or near-empty load.
|
||||
|
||||
Verification:
|
||||
|
||||
- Unit test with all `num_requests = 0`: Unified should not always choose index
|
||||
0 across repeated new sessions.
|
||||
- Confirm pure LMetric test still matches the OSDI-style baseline.
|
||||
|
||||
### 5. Add Tests for the Current Hybrid Policy
|
||||
|
||||
Problem:
|
||||
|
||||
Existing tests cover older `pick_instance` and pure LMetric behavior, but not
|
||||
the current `unified` branch introduced by `255c8e6`.
|
||||
|
||||
Fix:
|
||||
|
||||
Add tests for:
|
||||
|
||||
- High-cache session sticks to affinity instance when not overloaded.
|
||||
- High-cache session breaks affinity when `num_requests` exceeds the overload
|
||||
gate.
|
||||
- Low-cache session falls back to LMetric.
|
||||
- New session falls back to LMetric with the Unified-specific tie-breaker.
|
||||
- Breakdown policy is recorded as `affinity` or `lmetric_fallback`.
|
||||
|
||||
Why:
|
||||
|
||||
This prevents future drift back toward the discarded migration cost model or
|
||||
accidental changes to the LMetric baseline.
|
||||
|
||||
Verification:
|
||||
|
||||
- `pytest -q`.
|
||||
- Tests should run without live vLLM.
|
||||
|
||||
### 6. Treat `overload_factor` Changes as Experiments, Not Silent Fixes
|
||||
|
||||
Observation:
|
||||
|
||||
The current worktree has an uncommitted change:
|
||||
|
||||
```text
|
||||
Settings.overload_factor: 2.0 -> 1.5
|
||||
```
|
||||
|
||||
But the earlier H7 sweep found overload-factor tuning largely ineffective /
|
||||
within noise. This is recorded in `analysis/elastic_hypotheses.md`.
|
||||
|
||||
Fix:
|
||||
|
||||
- Do not silently commit a default change to `1.5` without a paired benchmark.
|
||||
- If testing `1.5`, make it an experiment tag/config value, not a new default.
|
||||
- Keep docs and CLI defaults synchronized.
|
||||
|
||||
Why:
|
||||
|
||||
The prior experiment says imbalance was mostly workload/session skew, not the
|
||||
threshold. A default change without evidence will create another reproducibility
|
||||
gap.
|
||||
|
||||
Verification:
|
||||
|
||||
- Run paired `2.0` vs `1.5` after the hybrid tests exist.
|
||||
- Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance.
|
||||
|
||||
### 7. Standardize Breakdown Fields for Hybrid Routing
|
||||
|
||||
Problem:
|
||||
|
||||
The current breakdown logs do not clearly expose why Unified chose affinity vs
|
||||
LMetric fallback.
|
||||
|
||||
Fix:
|
||||
|
||||
For each request under `--policy unified`, log:
|
||||
|
||||
- `policy: "unified"`
|
||||
- `decision: "affinity" | "lmetric_fallback"`
|
||||
- `affinity_idx`
|
||||
- `chosen_idx`
|
||||
- `affinity_cache_hit`
|
||||
- `affinity_cache_ratio`
|
||||
- `affinity_num_requests`
|
||||
- `avg_num_requests`
|
||||
- `fallback_score`
|
||||
- `tie_break_used`
|
||||
|
||||
Why:
|
||||
|
||||
The latest performance difference versus LMetric is small. Without decision
|
||||
logs, it is hard to tell whether Unified is actually exercising the intended
|
||||
high-cache affinity behavior.
|
||||
|
||||
Verification:
|
||||
|
||||
- `breakdown.json` can answer: how many requests used affinity, how many used
|
||||
fallback, and what latency/APC each group saw.
|
||||
|
||||
## P2 Experiment Fixes
|
||||
|
||||
### 8. Re-run Paired LMetric vs Unified Hybrid Benchmarks
|
||||
|
||||
Problem:
|
||||
|
||||
`docs/migration-policy-design.md` says the 2% mean improvement needs multi-run
|
||||
verification. Local raw outputs for the May 25 final comparison are not present
|
||||
in this workspace.
|
||||
|
||||
Fix:
|
||||
|
||||
- Run 3-5 fresh paired trials.
|
||||
- Same trace, same vLLM build, same machine, same restart procedure.
|
||||
- Compare:
|
||||
- pure LMetric
|
||||
- current Unified hybrid
|
||||
- optionally Linear/session-sticky as a reference
|
||||
|
||||
Metrics:
|
||||
|
||||
- TTFT mean/p50/p90/p99
|
||||
- TPOT mean/p50/p90/p99
|
||||
- E2E mean/p50/p90/p99
|
||||
- errors/timeouts
|
||||
- aggregate APC
|
||||
- per-instance APC distribution
|
||||
- per-instance request count and token count
|
||||
- GPU util mean/std/imbalance
|
||||
|
||||
Why:
|
||||
|
||||
The current reported win is small enough that run-to-run noise matters.
|
||||
|
||||
Verification:
|
||||
|
||||
- Commit or save artifacts under a date/tagged output directory:
|
||||
`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
|
||||
`gpu_util.csv`, final vLLM APC snapshot, git commit hash.
|
||||
|
||||
### 9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism
|
||||
|
||||
Rejected paths:
|
||||
|
||||
- Full PD separation: decode KV memory wall.
|
||||
- Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit.
|
||||
- Cache-gate offload: improves balance for colocated survivors but offloaded
|
||||
requests pay RDMA penalty.
|
||||
- Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse.
|
||||
- Relaxed gate / many offloads: 134 offloads made E2E p90 much worse.
|
||||
|
||||
Future work can revisit migration only if one of these changes first:
|
||||
|
||||
- layerwise / pipelined KV transfer
|
||||
- multi-machine P role so P work does not compete with D on the same GPU pool
|
||||
- exact vLLM state / cache residency exposed to the router
|
||||
- production-concurrency benchmark showing decode SLO pressure large enough to
|
||||
amortize transfer overhead
|
||||
|
||||
Why:
|
||||
|
||||
The same local mechanism has already failed multiple times. Repeating it with
|
||||
another threshold is unlikely to help.
|
||||
|
||||
### 10. Restore Production-Concurrency Evaluation as a Separate Track
|
||||
|
||||
Problem:
|
||||
|
||||
Several conclusions were made at 1-2 req/GPU, while production is estimated at
|
||||
8-15 req/GPU. Higher concurrency is where prefill-decode interference appears.
|
||||
|
||||
Fix:
|
||||
|
||||
- Restore or replace `--max-inflight-sessions` for controlled concurrency.
|
||||
- Run at 64 and 128 active sessions.
|
||||
- Treat this as a new experiment track, not as a reason to resurrect old
|
||||
migration code immediately.
|
||||
|
||||
Why:
|
||||
|
||||
At high concurrency, the design pressure may change. But the implementation
|
||||
should first prove the current hybrid baseline cleanly.
|
||||
|
||||
## Suggested Implementation Order
|
||||
|
||||
1. Update docs to mark discarded migration/offload schemes as superseded.
|
||||
2. Remove stale unreachable PUSH code from the current Unified branch.
|
||||
3. Add tests for current Unified hybrid behavior.
|
||||
4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases.
|
||||
5. Add breakdown fields for hybrid routing decisions.
|
||||
6. Decide whether the local `overload_factor=1.5` diff is an experiment or
|
||||
should be dropped.
|
||||
7. Run paired multi-run LMetric vs Unified hybrid benchmarks.
|
||||
8. Only after that, open a separate high-concurrency experiment track.
|
||||
|
||||
## Review Checklist
|
||||
|
||||
- Does the code still mention single `argmin(expected_latency)` as current
|
||||
behavior? If yes, update it or mark it superseded.
|
||||
- Is any migration/offload code reachable under `--policy unified`? If yes,
|
||||
require a new experiment plan because recent history rejects it.
|
||||
- Does pure `--policy lmetric` remain pure and affinity-free?
|
||||
- Does `--policy unified` clearly log when it used affinity vs LMetric fallback?
|
||||
- Are default parameter changes backed by paired results?
|
||||
- Can another reviewer reproduce the result from committed scripts and saved
|
||||
artifacts?
|
||||
|
||||
@@ -1,62 +1,109 @@
|
||||
# Migration Policy Design: Improving Load Balance in Elastic KV
|
||||
# Routing & Migration Policy: Design Log
|
||||
|
||||
## Final Result
|
||||
This file is the active reference for the routing policy. It supersedes the
|
||||
"single argmin + PUSH migration" framing once described as the final design
|
||||
(see commit notes below and `REPORT.md` §3.9 errata).
|
||||
|
||||
**Unified routing (baseline mode, no Mooncake)** beats LMetric on E2E mean/p50/p90.
|
||||
Pending multi-run significance verification.
|
||||
## Current Algorithm: Hybrid LMetric + High-Cache Affinity
|
||||
|
||||
| Metric | LMetric | Unified | Change |
|
||||
|--------|---------|---------|--------|
|
||||
| E2E mean | 18.204 | **17.831** | -2.0% |
|
||||
| E2E p50 | 6.184 | **6.074** | -1.8% |
|
||||
| E2E p90 | 39.438 | **37.073** | -6.0% |
|
||||
| TTFT p90 | 9.331 | **8.034** | -13.9% |
|
||||
Implemented in `255c8e6`. Active under `--policy unified` in
|
||||
`scripts/cache_aware_proxy.py`.
|
||||
|
||||
```python
|
||||
# Step 1: affinity gate (only for sessions that have a recorded owner)
|
||||
if session has affinity instance:
|
||||
cache_ratio = cache_hit_on_affinity / input_length
|
||||
gate_1: cache_ratio > 0.5
|
||||
gate_2: affinity.num_requests <= avg_num_requests * overload_factor
|
||||
if gate_1 AND gate_2:
|
||||
decision = "affinity"
|
||||
return affinity_instance
|
||||
|
||||
# Step 2: LMetric fallback with deterministic tie-breaker
|
||||
for each instance i:
|
||||
score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
|
||||
= P_tokens * BS # primary
|
||||
secondary key: new_uncached_tokens # prefer cache
|
||||
tertiary key: num_requests # prefer idle
|
||||
quaternary: round-robin counter % n # break ties
|
||||
return argmin
|
||||
```
|
||||
|
||||
The pure `--policy lmetric` baseline stays affinity-free; the hybrid lives
|
||||
entirely under `--policy unified`. The round-robin counter is required because
|
||||
`P_tokens * BS = 0` whenever `BS = 0` for all instances (new sessions, cold
|
||||
start), which would otherwise pin every fresh session to instance 0.
|
||||
|
||||
Parameters: `overload_factor=2.0` (default). The previously-introduced
|
||||
`decode_iteration_s` / `prefill_throughput` / `rdma_overhead_s` are kept in
|
||||
`Settings` but no longer drive routing — they were Approach-A inputs.
|
||||
|
||||
### Why this shape
|
||||
|
||||
- **LMetric for load balance**: `P_tokens × BS` is hyperparameter-free and
|
||||
captures both pending prefill work and current batch contention.
|
||||
- **Implicit soft affinity from LMetric itself**: `P_tokens` includes
|
||||
`new_uncached_tokens = input - cache_hit`. Later turns naturally prefer
|
||||
the instance that already cached the prefix, because their `P_tokens` are
|
||||
smaller there. This is the dominant reason explicit migration buys little.
|
||||
- **Explicit affinity only for the long-cache case**: when cache_ratio > 0.5,
|
||||
the placement cost of breaking sticky is large enough to justify a hard
|
||||
gate. Below that ratio, defer to LMetric.
|
||||
|
||||
## What Was Retired and Why
|
||||
|
||||
| Commit | Approach | Outcome |
|
||||
|---|---|---|
|
||||
| `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
|
||||
| `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
|
||||
| `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, **E2E p90 +12%, p99 +24%** |
|
||||
| `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, **HEAVY TTFT p90 15.9s → 59.1s** |
|
||||
| `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" |
|
||||
| `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
|
||||
| `4c583f2` | Revert relaxed gate | 134 offloads, **E2E p90 37s → 82s** |
|
||||
| `255c8e6` | **Current** hybrid LMetric + high-cache affinity | Stable baseline |
|
||||
|
||||
The shared lesson across the retired variants: PD-sep offload pays
|
||||
`C_queue + C_prefill + RDMA + D_schedule + D_decode_start` and the saved
|
||||
prefill time on D rarely amortizes this — especially because 92% of HEAVY
|
||||
requests are turn-1 cold (no source-side cache to migrate). See
|
||||
`analysis/elastic_hypotheses.md` H3-H9 for the per-variant evidence.
|
||||
|
||||
## Historical Baseline-Mode Comparison (Approach A)
|
||||
|
||||
These numbers are from the additive-cost-model variant of Unified routing
|
||||
(before `255c8e6`). Kept for reference; the hybrid currently lives on top of
|
||||
LMetric, not on this additive cost. The "Unified" column should not be cited
|
||||
as the current implementation.
|
||||
|
||||
| Metric | LMetric | Unified (Approach A, historical) | Change |
|
||||
|--------|---------|----------------------------------|--------|
|
||||
| E2E mean | 18.204 | 17.831 | -2.0% |
|
||||
| E2E p50 | 6.184 | 6.074 | -1.8% |
|
||||
| E2E p90 | 39.438 | 37.073 | -6.0% |
|
||||
| TTFT p90 | 9.331 | 8.034 | -13.9% |
|
||||
| Errors | 0 | 0 | — |
|
||||
|
||||
### Why Unified beats LMetric
|
||||
|
||||
1. **Session affinity** preserves KV cache across turns → turn 2+ TTFT much lower
|
||||
2. **Additive cost model** (`contention + queue + prefill`) avoids LMetric's degenerate
|
||||
case when `num_requests = 0` (all instances score 0, tie-break to instance 0)
|
||||
3. **`num_requests` as contention signal** better captures GPU batch scheduling
|
||||
overhead than `ongoing_tokens`
|
||||
|
||||
### Why PD-sep offload doesn't help (yet)
|
||||
|
||||
Extensive experimentation with offload/migration showed that PD-sep overhead
|
||||
(C queue + prefill + KV transfer + D scheduling) consistently exceeds load
|
||||
balance benefit:
|
||||
|
||||
| Experiment | Offloads | E2E p90 | vs Baseline |
|
||||
|-----------|----------|---------|-------------|
|
||||
| A (old gate, ~5 offloads) | 5 | 39.0 | -25% |
|
||||
| A (relaxed gate, ~6 offloads) | 6 | 46.0 | -12% |
|
||||
| A+B2 (forced migration) | 57 | 84.2 | +61% |
|
||||
| A (relaxed gate v2, both gates removed) | 134 | 81.5 | +56% |
|
||||
|
||||
More offloads → worse performance. The offload mechanism itself is the bottleneck.
|
||||
|
||||
## Algorithm: Unified Routing
|
||||
Approach A (additive cost model, historical):
|
||||
|
||||
```python
|
||||
cost(instance_i) = num_requests_i × decode_iteration_s # contention
|
||||
+ pending_prefill_tokens_i / throughput # prefill queue
|
||||
+ max(0, input - cache_hit_i) / throughput # new prefill
|
||||
|
||||
# Session affinity with two gates:
|
||||
if affinity instance exists:
|
||||
gate 1: ongoing_tokens <= avg * overload_factor (hard gate)
|
||||
gate 2: affinity_cost <= global_best * overload_factor (cost ratio)
|
||||
if both pass → use affinity instance
|
||||
else → use globally best instance
|
||||
else:
|
||||
use globally best instance
|
||||
gate 1: ongoing_tokens <= avg * overload_factor
|
||||
gate 2: affinity_cost <= global_best * overload_factor
|
||||
if both pass → affinity
|
||||
else → global best
|
||||
```
|
||||
|
||||
Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
|
||||
`overload_factor=2.0`.
|
||||
Reason for retirement: the additive cost was load-bearing on
|
||||
`decode_iteration_s` and `prefill_throughput` constants. LMetric reproduces
|
||||
the same ordering without those constants because `P_tokens` and `BS` already
|
||||
capture both prefill queue and batch contention. The hybrid keeps the cheap
|
||||
LMetric core and adds an explicit affinity gate only for high-cache cases.
|
||||
|
||||
## Evolution of Results
|
||||
### Evolution Table (historical)
|
||||
|
||||
| Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
|
||||
|---------|-------------|-------------|-------------|-------------|
|
||||
@@ -65,12 +112,34 @@ Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20),
|
||||
| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
|
||||
| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
|
||||
| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
|
||||
| **A (baseline)** | **same routing, no Mooncake** | **8.034** | **37.073** | **—** |
|
||||
| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — |
|
||||
| **Hybrid (current)** | **LMetric + high-cache affinity** | **see §8 re-run** | **see §8 re-run** | **—** |
|
||||
|
||||
## Rigorous Review Summary
|
||||
The current Hybrid row deliberately has no number: per
|
||||
`analysis/unified_routing_fix_review.md` #8, the small (≤2%) improvements
|
||||
need 3-5 paired multi-run trials before being quoted.
|
||||
|
||||
Independent review found:
|
||||
## Open Questions / Next Steps
|
||||
|
||||
- **#8 paired multi-run**: 3-5 fresh trials of LMetric vs current hybrid on
|
||||
identical trace + restart procedure, with full artifact bundle
|
||||
(`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
|
||||
`gpu_util.csv`, per-instance APC snapshot, git commit hash).
|
||||
- **#10 production-concurrency track**: re-introduce a controlled-concurrency
|
||||
flag (`--max-inflight-sessions` or equivalent) and rerun the comparison at
|
||||
64 / 128 active sessions before drawing conclusions for production.
|
||||
- **Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the
|
||||
affinity instance is overloaded**: in that scenario the LMetric fallback's
|
||||
tie-breaker re-elects the same instance because its `new_uncached_tokens`
|
||||
is 0. Either add a hard num_requests ceiling or accept the behavior. Open
|
||||
in the test `test_hybrid_high_cache_breaks_on_overload` (works only when
|
||||
the request has at least one uncached block).
|
||||
|
||||
## Original "Rigorous Review Summary" (historical)
|
||||
|
||||
Independent review of the Approach A result above:
|
||||
- **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement)
|
||||
- **CLEAN**: No reward hacking (improvement from algorithmic difference)
|
||||
- **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs)
|
||||
- **NOTE**: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate
|
||||
- **NOTE**: Hardcoded constants (`0.05`, `7000`) are hardware-specific but
|
||||
legitimate
|
||||
|
||||
Reference in New Issue
Block a user