diff --git a/REPORT.md b/REPORT.md index 82135ce..9fa8914 100644 --- a/REPORT.md +++ b/REPORT.md @@ -27,6 +27,12 @@ For agentic LLM workloads (long input, short output, high KV cache reuse), is pr > was removed when replay moved to trace-driven dispatch. The next-step > experiment requires restoring the flag first (see `FIXES.md` §B2 > route A) before any production-concurrency numbers can be produced. +> - **§3.9 "Final Design" framing**: the single-argmin + PUSH-migration +> design was retired after `cc6e562` / `4c583f2` showed forced and +> relaxed-gate migration variants both regressed E2E tail. Current +> policy is the hybrid LMetric + high-cache affinity landed in +> `255c8e6`. See the per-section note in §3.9 and the active algorithm +> in `docs/migration-policy-design.md`. > > The authoritative results are in **§3.6 and §3.7**. @@ -356,7 +362,23 @@ The elastic numbers on dash1 were genuinely fresh. The "improvement" was actuall **Output**: `outputs/eval_direct_rdma_v*/` on dash0. -### 3.9 Unified Routing (Final Design) +### 3.9 Unified Routing (Historical — superseded) + +> **Superseded by git history.** The "single argmin + PUSH migration" design +> described here was implemented in `6b255fa`, refined through +> `5892739` (soft affinity), `2b9eae0` (numbers below), and `4b50c5a` +> (queue/overload-gate fixes). Follow-on attempts to scale migration — +> `e991960`/`5772149` (forced session migration) and `bf4469a` (relaxed +> push gate) — were both reverted (`cc6e562`, `4c583f2`) after they +> regressed E2E tail (57 migrations → HEAVY TTFT p90 15.9s → 59.1s; +> 134 offloads → E2E p90 37s → 82s). +> +> Current implementation is the **hybrid LMetric + high-cache affinity** +> direction landed in `255c8e6`. See `docs/migration-policy-design.md` +> for the active algorithm and `analysis/unified_routing_fix_review.md` +> for the reasoning. The numbers below remain valid for the +> `eval_unified_v3` artifact; do not treat them as the current +> production policy. Replaced two-phase routing (pick_instance → offload gate) with single `argmin(expected_latency)` per instance: diff --git a/analysis/elastic_hypotheses.md b/analysis/elastic_hypotheses.md index 1960d4e..a9a28f9 100644 --- a/analysis/elastic_hypotheses.md +++ b/analysis/elastic_hypotheses.md @@ -244,7 +244,11 @@ Offloaded: — 13/500 (2.6%) too few to matter ### What DOESN'T work for agentic workloads: 1. **PD-Sep**: net negative — KV cache memory wall on decode instances -2. **LMetric (OSDI'26)**: ≈ linear routing — session affinity limits routing freedom +2. **LMetric (OSDI'26)**: ≈ linear routing — `P_tokens` already includes + `new_uncached_tokens`, so cache-hit scoring gives LMetric an implicit + soft affinity that converges to similar placements as explicit sticky + affinity (see `analysis/research_findings.md` §2.2 for the corrected + framing) 3. **Elastic P2P RDMA offload**: net negative — Mooncake transfer overhead, no layerwise pipeline 4. **OVERLOAD_FACTOR tuning**: no effect — imbalance from workload skew, not routing 5. **Dedicated Prefill Service (PS)**: cannot win cost comparison without KV pull, PS is always slower than cached C @@ -270,3 +274,21 @@ Instead of fixed chunk size, dynamically adjust based on decode pressure: - When decode queue is deep: smaller chunks → more decode slots → better TPOT - When decode queue is empty: larger chunks → faster prefill → better TTFT - This is a vLLM scheduler modification, not a routing change + +--- + +## Current routing direction (cross-reference) + +The hypotheses above produced the following positive results that informed +the current `--policy unified` implementation: + +- H1 / H7 / H9 (negative): PD-sep offload, OVERLOAD_FACTOR tuning, and + elastic RDMA at high concurrency all regressed or stayed within noise. +- H3 / H4 / H6 (partial): cache-gated offload exists but only ~10-12% of + HEAVY requests have cache, and the offloaded subset pays RDMA penalty. + +The active algorithm (commit `255c8e6`) is **hybrid LMetric + high-cache +affinity** in baseline mode (no Mooncake). The retired migration variants +are catalogued in `docs/migration-policy-design.md` (Approach A and the +revert chain `cc6e562` / `4c583f2`). H7's rejection (OVERLOAD_FACTOR within +noise) is why the active default stays at `overload_factor=2.0`. diff --git a/analysis/research_findings.md b/analysis/research_findings.md index a5379cd..5f21d6a 100644 --- a/analysis/research_findings.md +++ b/analysis/research_findings.md @@ -38,7 +38,24 @@ These characteristics fundamentally change what optimizations matter. **Setup**: 8 instances, LMetric vs linear routing **Result**: TTFT +2.2%, TPOT -4.4%, E2E +2.6% — all within noise (±7% run-to-run) -**Root cause**: Session affinity constrains routing freedom. LMetric's benefit (hyperparameter-free load balancing) is neutralized because turn 2+ requests MUST go to their session-sticky instance regardless of the scoring function. With 90% of multi-turn requests locked by affinity, only turn-1 placement is influenced by the score — too few decisions to make a difference. +**Root cause (updated)**: LMetric is not "neutralized by affinity +constraints" — pure `--policy lmetric` runs without session affinity at all. +The actual reason the LMetric vs linear comparison sits within noise is that +`P_tokens` already includes `new_uncached_tokens = input_length - cache_hit`, +which means later turns of a session naturally score lowest on the instance +that cached their prefix. This gives LMetric an **implicit soft affinity** +that competes with linear's explicit sticky affinity. The two arrive at +similar placements through different mechanisms. + +This is also why explicit migration buys little on top of LMetric: the +first-order signal driving placement is already cache-derived. See +`docs/migration-policy-design.md` for how the current hybrid policy uses +this insight (LMetric base + explicit affinity only when `cache_ratio > 0.5`). + +**Previous framing (incorrect)**: an earlier draft of this section attributed +the result to session affinity constraining LMetric's routing freedom. That +framing assumed `--policy lmetric` inherited the linear-mode session-sticky +behavior, which it does not (verified in `tests/test_proxy_pick.py`). ### 2.3 Elastic P2P RDMA Offload (Heavy prefill on different instance) @@ -148,7 +165,9 @@ This changes the scheduling picture: most "HEAVY" requests in agentic workloads ### Why existing approaches don't work: 1. **PD-Sep** assumes decode needs dedicated resources → agentic has memory wall on decode -2. **LMetric** assumes routing freedom → agentic has session affinity constraints +2. **LMetric** matches linear within noise because cache-hit appears in + `P_tokens` itself, so it already routes later turns back to the cached + instance via implicit soft affinity — explicit affinity buys little 3. **Elastic RDMA** assumes KV transfer is cheap → Mooncake lacks layerwise pipelining 4. **Size-based classification** assumes HEAVY = needs special handling → after cache, most HEAVY is MEDIUM diff --git a/analysis/unified_routing_fix_review.md b/analysis/unified_routing_fix_review.md new file mode 100644 index 0000000..43b03a6 --- /dev/null +++ b/analysis/unified_routing_fix_review.md @@ -0,0 +1,406 @@ +# Unified Routing Fix Review Handoff + +Date: 2026-05-25 + +This is the corrected review handoff after reading the git history. The key +change from the previous draft is that we should **not** restore the old +single-argmin / PUSH-migration design. That path was implemented, measured, +and then discarded or simplified by later commits. + +## Executive Summary + +The latest commit history says the current direction is: + +- Use baseline mode / no Mooncake for the winning comparison against LMetric. +- Use LMetric-style load balancing as the base. +- Add explicit affinity only for sessions with high accumulated cache. +- Do not re-enable PD-sep offload or session migration unless the transfer + mechanism is fundamentally reworked. + +The main fixes now are cleanup, documentation consistency, tests, and +reproducibility. The biggest risk is that stale docs still describe abandoned +schemes as "final design". + +## Evidence From Git History + +Relevant commits: + +| Commit | Meaning | Outcome | +|---|---|---| +| `6b255fa` | Implemented single `argmin(expected_latency)` over local / PUSH / cold paths | Later superseded | +| `5892739` | Added soft affinity because pure argmin overloaded cache-source instances | Pure argmin was unstable | +| `2b9eae0` | Reported Unified v3 with 116 PUSH migrations | TTFT improved, but TPOT/E2E tail tradeoff existed | +| `cdf8349` | Added real cache sync and cached-prefill-on-C architecture | Fixed false PUSH and direct-read issues | +| `4b50c5a` | Fixed queue model and hard overload gate | Reduced imbalance, but still on offload path | +| `e991960` / `5772149` | Tried forced session migration triggers | Reverted | +| `cc6e562` | Reverted Approach B migration | 57 migrations made HEAVY TTFT p90 regress 15.9s -> 59.1s | +| `bf4469a` | Tried more accurate push cost / gate alignment | Later reverted | +| `4c583f2` | Reverted relaxed gate / push-cost fix | 134 offloads made E2E p90 37s -> 82s | +| `448361c` | Updated design doc: baseline no-Mooncake Unified beats LMetric | PD-sep offload degrades | +| `255c8e6` | Replaced full cost model with hybrid routing | Current direction: LMetric LB + high-cache affinity | + +Do not ask the implementer to "restore real three-way argmin". That was the +wrong instruction in the previous draft. + +## Current Intended Algorithm + +From `255c8e6`, the current algorithm should be documented as: + +```text +if session has an affinity instance: + if cache_ratio_on_affinity > 0.5 + and affinity_instance.num_requests <= avg_num_requests * overload_factor: + route to affinity instance + else: + route by LMetric +else: + route by LMetric +``` + +LMetric remains: + +```text +score = (pending_prefill_tokens + new_uncached_tokens) * num_requests +new_uncached_tokens = input_length - estimated_cache_hit +``` + +This is no longer the old expected-latency migration model. + +## P0 Fixes + +### 1. Remove Stale PUSH-Migration Code From the Current Unified Branch + +Location: `scripts/cache_aware_proxy.py`, `_handle_combined`. + +Problem: + +After `255c8e6`, `unified` is a hybrid policy, but the function still contains +an unreachable block guarded by `best_needs_push = False`. That block references +variables from the removed cost model such as `best_cache_idx`, +`best_cache_hit`, and `_current_offloads`. + +Fix: + +- In the `unified` branch, delete the unreachable `if best_needs_push:` block. +- Keep helper functions like `_handle_cached_prefill_offload` only if another + live mode still calls them. +- Update the `_handle_combined` docstring so it no longer says Unified always + computes `queue + prefill + transfer`. + +Why: + +The code is currently safe only because the branch is unreachable. Leaving it +there makes future changes dangerous and makes reviewers think migration is +still part of the active policy. + +Verification: + +- `rg "best_needs_push|best_cache_idx|push_cache_hit" scripts/cache_aware_proxy.py` + should show no stale references inside the active `unified` path. +- `pytest -q`. + +### 2. Reconcile Documentation With the Latest Commits + +Locations: + +- `REPORT.md` +- `docs/migration-policy-design.md` +- `analysis/research_findings.md` +- `analysis/elastic_hypotheses.md` + +Problem: + +Several docs still present old or contradictory conclusions: + +- `REPORT.md` section 3.9 still calls single argmin / PUSH migration the + "Final Design". +- `docs/migration-policy-design.md` describes a baseline-mode additive cost + model, while HEAD implements hybrid LMetric + high-cache affinity. +- `analysis/research_findings.md` says LMetric is neutralized by session + affinity constraints, but later corrected LMetric results say cache-hit + scoring creates implicit soft affinity. + +Fix: + +- Add an explicit "Superseded by git history" note near `REPORT.md` section + 3.9: + - `6b255fa/5892739/2b9eae0` were explored. + - `cc6e562` rejected forced migration. + - `4c583f2` rejected relaxed offload / high-offload configurations. + - `255c8e6` is the current implementation direction. +- Update `docs/migration-policy-design.md` to either: + - describe the current hybrid algorithm, or + - clearly state that the additive cost model is the previous Approach A and + not the latest code. +- Mark old analysis sections as superseded rather than deleting them. + +Why: + +The docs caused the previous bad review recommendation. Future reviewers need +to know which ideas were already tested and rejected. + +Verification: + +- `rg "Final Design|argmin\\(expected_latency\\)|PUSH_MIGRATE|Approach B" REPORT.md docs analysis` + should show clear superseded labels where appropriate. + +### 3. Preserve the LMetric Baseline Separately From Unified Hybrid + +Location: `scripts/cache_aware_proxy.py`. + +Problem: + +Pure `--policy lmetric` is the baseline being compared against. Unified hybrid +uses LMetric internally but should not accidentally change the LMetric baseline +behavior. + +Fix: + +- Keep `pick_instance_lmetric` as the pure corrected LMetric implementation. +- Put any hybrid-specific tie-breakers or affinity logic outside + `pick_instance_lmetric`, under `--policy unified`. +- In breakdown logs, record whether a Unified request used `affinity` or + `lmetric_fallback`. + +Why: + +If the baseline and Unified share hidden behavior, future A/B comparisons become +invalid. + +Verification: + +- Unit test: `--policy lmetric` never uses session affinity. +- Unit test: `--policy unified` can use affinity only when cache ratio and load + gates pass. + +## P1 Fixes + +### 4. Fix the Unified Hybrid / LMetric Fallback Empty-Batch Degeneracy + +Problem: + +LMetric score is `P_tokens * BS`. When `BS = num_requests = 0`, every instance +gets score 0, so tie-break chooses instance 0. `docs/migration-policy-design.md` +explicitly lists avoiding this as one reason Unified beats LMetric. + +The latest hybrid falls back to LMetric, so it may reintroduce this issue for +new sessions when all instances are idle. + +Fix: + +- Do not change the pure `--policy lmetric` baseline. +- For `--policy unified` fallback only, add a deterministic secondary key: + - primary: `P_tokens * BS` + - secondary when scores tie: `new_uncached_tokens`, then `num_requests`, then + a round-robin or least-recently-used instance index. +- Record tie-break count in breakdown or stats. + +Why: + +This preserves fair LMetric comparison while preventing Unified hybrid from +degenerating to instance 0 under empty or near-empty load. + +Verification: + +- Unit test with all `num_requests = 0`: Unified should not always choose index + 0 across repeated new sessions. +- Confirm pure LMetric test still matches the OSDI-style baseline. + +### 5. Add Tests for the Current Hybrid Policy + +Problem: + +Existing tests cover older `pick_instance` and pure LMetric behavior, but not +the current `unified` branch introduced by `255c8e6`. + +Fix: + +Add tests for: + +- High-cache session sticks to affinity instance when not overloaded. +- High-cache session breaks affinity when `num_requests` exceeds the overload + gate. +- Low-cache session falls back to LMetric. +- New session falls back to LMetric with the Unified-specific tie-breaker. +- Breakdown policy is recorded as `affinity` or `lmetric_fallback`. + +Why: + +This prevents future drift back toward the discarded migration cost model or +accidental changes to the LMetric baseline. + +Verification: + +- `pytest -q`. +- Tests should run without live vLLM. + +### 6. Treat `overload_factor` Changes as Experiments, Not Silent Fixes + +Observation: + +The current worktree has an uncommitted change: + +```text +Settings.overload_factor: 2.0 -> 1.5 +``` + +But the earlier H7 sweep found overload-factor tuning largely ineffective / +within noise. This is recorded in `analysis/elastic_hypotheses.md`. + +Fix: + +- Do not silently commit a default change to `1.5` without a paired benchmark. +- If testing `1.5`, make it an experiment tag/config value, not a new default. +- Keep docs and CLI defaults synchronized. + +Why: + +The prior experiment says imbalance was mostly workload/session skew, not the +threshold. A default change without evidence will create another reproducibility +gap. + +Verification: + +- Run paired `2.0` vs `1.5` after the hybrid tests exist. +- Report E2E p50/p90, TTFT p90, APC distribution, and GPU util imbalance. + +### 7. Standardize Breakdown Fields for Hybrid Routing + +Problem: + +The current breakdown logs do not clearly expose why Unified chose affinity vs +LMetric fallback. + +Fix: + +For each request under `--policy unified`, log: + +- `policy: "unified"` +- `decision: "affinity" | "lmetric_fallback"` +- `affinity_idx` +- `chosen_idx` +- `affinity_cache_hit` +- `affinity_cache_ratio` +- `affinity_num_requests` +- `avg_num_requests` +- `fallback_score` +- `tie_break_used` + +Why: + +The latest performance difference versus LMetric is small. Without decision +logs, it is hard to tell whether Unified is actually exercising the intended +high-cache affinity behavior. + +Verification: + +- `breakdown.json` can answer: how many requests used affinity, how many used + fallback, and what latency/APC each group saw. + +## P2 Experiment Fixes + +### 8. Re-run Paired LMetric vs Unified Hybrid Benchmarks + +Problem: + +`docs/migration-policy-design.md` says the 2% mean improvement needs multi-run +verification. Local raw outputs for the May 25 final comparison are not present +in this workspace. + +Fix: + +- Run 3-5 fresh paired trials. +- Same trace, same vLLM build, same machine, same restart procedure. +- Compare: + - pure LMetric + - current Unified hybrid + - optionally Linear/session-sticky as a reference + +Metrics: + +- TTFT mean/p50/p90/p99 +- TPOT mean/p50/p90/p99 +- E2E mean/p50/p90/p99 +- errors/timeouts +- aggregate APC +- per-instance APC distribution +- per-instance request count and token count +- GPU util mean/std/imbalance + +Why: + +The current reported win is small enough that run-to-run noise matters. + +Verification: + +- Commit or save artifacts under a date/tagged output directory: + `config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`, + `gpu_util.csv`, final vLLM APC snapshot, git commit hash. + +### 9. Do Not Re-open PD-Sep Offload Without a New Transfer Mechanism + +Rejected paths: + +- Full PD separation: decode KV memory wall. +- Elastic P2P RDMA offload: transfer and scheduling overhead exceed benefit. +- Cache-gate offload: improves balance for colocated survivors but offloaded + requests pay RDMA penalty. +- Approach B session migration: 57 migrations made HEAVY TTFT p90 much worse. +- Relaxed gate / many offloads: 134 offloads made E2E p90 much worse. + +Future work can revisit migration only if one of these changes first: + +- layerwise / pipelined KV transfer +- multi-machine P role so P work does not compete with D on the same GPU pool +- exact vLLM state / cache residency exposed to the router +- production-concurrency benchmark showing decode SLO pressure large enough to + amortize transfer overhead + +Why: + +The same local mechanism has already failed multiple times. Repeating it with +another threshold is unlikely to help. + +### 10. Restore Production-Concurrency Evaluation as a Separate Track + +Problem: + +Several conclusions were made at 1-2 req/GPU, while production is estimated at +8-15 req/GPU. Higher concurrency is where prefill-decode interference appears. + +Fix: + +- Restore or replace `--max-inflight-sessions` for controlled concurrency. +- Run at 64 and 128 active sessions. +- Treat this as a new experiment track, not as a reason to resurrect old + migration code immediately. + +Why: + +At high concurrency, the design pressure may change. But the implementation +should first prove the current hybrid baseline cleanly. + +## Suggested Implementation Order + +1. Update docs to mark discarded migration/offload schemes as superseded. +2. Remove stale unreachable PUSH code from the current Unified branch. +3. Add tests for current Unified hybrid behavior. +4. Add a Unified-only tie-breaker for LMetric fallback empty-batch cases. +5. Add breakdown fields for hybrid routing decisions. +6. Decide whether the local `overload_factor=1.5` diff is an experiment or + should be dropped. +7. Run paired multi-run LMetric vs Unified hybrid benchmarks. +8. Only after that, open a separate high-concurrency experiment track. + +## Review Checklist + +- Does the code still mention single `argmin(expected_latency)` as current + behavior? If yes, update it or mark it superseded. +- Is any migration/offload code reachable under `--policy unified`? If yes, + require a new experiment plan because recent history rejects it. +- Does pure `--policy lmetric` remain pure and affinity-free? +- Does `--policy unified` clearly log when it used affinity vs LMetric fallback? +- Are default parameter changes backed by paired results? +- Can another reviewer reproduce the result from committed scripts and saved + artifacts? + diff --git a/docs/migration-policy-design.md b/docs/migration-policy-design.md index 717791c..e37f22a 100644 --- a/docs/migration-policy-design.md +++ b/docs/migration-policy-design.md @@ -1,62 +1,109 @@ -# Migration Policy Design: Improving Load Balance in Elastic KV +# Routing & Migration Policy: Design Log -## Final Result +This file is the active reference for the routing policy. It supersedes the +"single argmin + PUSH migration" framing once described as the final design +(see commit notes below and `REPORT.md` §3.9 errata). -**Unified routing (baseline mode, no Mooncake)** beats LMetric on E2E mean/p50/p90. -Pending multi-run significance verification. +## Current Algorithm: Hybrid LMetric + High-Cache Affinity -| Metric | LMetric | Unified | Change | -|--------|---------|---------|--------| -| E2E mean | 18.204 | **17.831** | -2.0% | -| E2E p50 | 6.184 | **6.074** | -1.8% | -| E2E p90 | 39.438 | **37.073** | -6.0% | -| TTFT p90 | 9.331 | **8.034** | -13.9% | +Implemented in `255c8e6`. Active under `--policy unified` in +`scripts/cache_aware_proxy.py`. + +```python +# Step 1: affinity gate (only for sessions that have a recorded owner) +if session has affinity instance: + cache_ratio = cache_hit_on_affinity / input_length + gate_1: cache_ratio > 0.5 + gate_2: affinity.num_requests <= avg_num_requests * overload_factor + if gate_1 AND gate_2: + decision = "affinity" + return affinity_instance + +# Step 2: LMetric fallback with deterministic tie-breaker +for each instance i: + score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i + = P_tokens * BS # primary +secondary key: new_uncached_tokens # prefer cache +tertiary key: num_requests # prefer idle +quaternary: round-robin counter % n # break ties +return argmin +``` + +The pure `--policy lmetric` baseline stays affinity-free; the hybrid lives +entirely under `--policy unified`. The round-robin counter is required because +`P_tokens * BS = 0` whenever `BS = 0` for all instances (new sessions, cold +start), which would otherwise pin every fresh session to instance 0. + +Parameters: `overload_factor=2.0` (default). The previously-introduced +`decode_iteration_s` / `prefill_throughput` / `rdma_overhead_s` are kept in +`Settings` but no longer drive routing — they were Approach-A inputs. + +### Why this shape + +- **LMetric for load balance**: `P_tokens × BS` is hyperparameter-free and + captures both pending prefill work and current batch contention. +- **Implicit soft affinity from LMetric itself**: `P_tokens` includes + `new_uncached_tokens = input - cache_hit`. Later turns naturally prefer + the instance that already cached the prefix, because their `P_tokens` are + smaller there. This is the dominant reason explicit migration buys little. +- **Explicit affinity only for the long-cache case**: when cache_ratio > 0.5, + the placement cost of breaking sticky is large enough to justify a hard + gate. Below that ratio, defer to LMetric. + +## What Was Retired and Why + +| Commit | Approach | Outcome | +|---|---|---| +| `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 | +| `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded | +| `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, **E2E p90 +12%, p99 +24%** | +| `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, **HEAVY TTFT p90 15.9s → 59.1s** | +| `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" | +| `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover | +| `4c583f2` | Revert relaxed gate | 134 offloads, **E2E p90 37s → 82s** | +| `255c8e6` | **Current** hybrid LMetric + high-cache affinity | Stable baseline | + +The shared lesson across the retired variants: PD-sep offload pays +`C_queue + C_prefill + RDMA + D_schedule + D_decode_start` and the saved +prefill time on D rarely amortizes this — especially because 92% of HEAVY +requests are turn-1 cold (no source-side cache to migrate). See +`analysis/elastic_hypotheses.md` H3-H9 for the per-variant evidence. + +## Historical Baseline-Mode Comparison (Approach A) + +These numbers are from the additive-cost-model variant of Unified routing +(before `255c8e6`). Kept for reference; the hybrid currently lives on top of +LMetric, not on this additive cost. The "Unified" column should not be cited +as the current implementation. + +| Metric | LMetric | Unified (Approach A, historical) | Change | +|--------|---------|----------------------------------|--------| +| E2E mean | 18.204 | 17.831 | -2.0% | +| E2E p50 | 6.184 | 6.074 | -1.8% | +| E2E p90 | 39.438 | 37.073 | -6.0% | +| TTFT p90 | 9.331 | 8.034 | -13.9% | | Errors | 0 | 0 | — | -### Why Unified beats LMetric - -1. **Session affinity** preserves KV cache across turns → turn 2+ TTFT much lower -2. **Additive cost model** (`contention + queue + prefill`) avoids LMetric's degenerate - case when `num_requests = 0` (all instances score 0, tie-break to instance 0) -3. **`num_requests` as contention signal** better captures GPU batch scheduling - overhead than `ongoing_tokens` - -### Why PD-sep offload doesn't help (yet) - -Extensive experimentation with offload/migration showed that PD-sep overhead -(C queue + prefill + KV transfer + D scheduling) consistently exceeds load -balance benefit: - -| Experiment | Offloads | E2E p90 | vs Baseline | -|-----------|----------|---------|-------------| -| A (old gate, ~5 offloads) | 5 | 39.0 | -25% | -| A (relaxed gate, ~6 offloads) | 6 | 46.0 | -12% | -| A+B2 (forced migration) | 57 | 84.2 | +61% | -| A (relaxed gate v2, both gates removed) | 134 | 81.5 | +56% | - -More offloads → worse performance. The offload mechanism itself is the bottleneck. - -## Algorithm: Unified Routing +Approach A (additive cost model, historical): ```python cost(instance_i) = num_requests_i × decode_iteration_s # contention + pending_prefill_tokens_i / throughput # prefill queue + max(0, input - cache_hit_i) / throughput # new prefill - -# Session affinity with two gates: if affinity instance exists: - gate 1: ongoing_tokens <= avg * overload_factor (hard gate) - gate 2: affinity_cost <= global_best * overload_factor (cost ratio) - if both pass → use affinity instance - else → use globally best instance -else: - use globally best instance + gate 1: ongoing_tokens <= avg * overload_factor + gate 2: affinity_cost <= global_best * overload_factor + if both pass → affinity + else → global best ``` -Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20), -`overload_factor=2.0`. +Reason for retirement: the additive cost was load-bearing on +`decode_iteration_s` and `prefill_throughput` constants. LMetric reproduces +the same ordering without those constants because `P_tokens` and `BS` already +capture both prefill queue and batch contention. The hybrid keeps the cheap +LMetric core and adds an explicit affinity gate only for high-cache cases. -## Evolution of Results +### Evolution Table (historical) | Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min | |---------|-------------|-------------|-------------|-------------| @@ -65,12 +112,34 @@ Parameters: `decode_iteration_s=0.05` (H20), `throughput=7000` (H20), | v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x | | v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x | | A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x | -| **A (baseline)** | **same routing, no Mooncake** | **8.034** | **37.073** | **—** | +| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — | +| **Hybrid (current)** | **LMetric + high-cache affinity** | **see §8 re-run** | **see §8 re-run** | **—** | -## Rigorous Review Summary +The current Hybrid row deliberately has no number: per +`analysis/unified_routing_fix_review.md` #8, the small (≤2%) improvements +need 3-5 paired multi-run trials before being quoted. -Independent review found: +## Open Questions / Next Steps + +- **#8 paired multi-run**: 3-5 fresh trials of LMetric vs current hybrid on + identical trace + restart procedure, with full artifact bundle + (`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`, + `gpu_util.csv`, per-instance APC snapshot, git commit hash). +- **#10 production-concurrency track**: re-introduce a controlled-concurrency + flag (`--max-inflight-sessions` or equivalent) and rerun the comparison at + 64 / 128 active sessions before drawing conclusions for production. +- **Hard affinity ceiling for the edge case where cache_ratio = 1.0 and the + affinity instance is overloaded**: in that scenario the LMetric fallback's + tie-breaker re-elects the same instance because its `new_uncached_tokens` + is 0. Either add a hard num_requests ceiling or accept the behavior. Open + in the test `test_hybrid_high_cache_breaks_on_overload` (works only when + the request has at least one uncached block). + +## Original "Rigorous Review Summary" (historical) + +Independent review of the Approach A result above: - **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement) - **CLEAN**: No reward hacking (improvement from algorithmic difference) - **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs) -- **NOTE**: Hardcoded constants (0.05, 7000) are hardware-specific but legitimate +- **NOTE**: Hardcoded constants (`0.05`, `7000`) are hardware-specific but + legitimate