agentic-kvc/docs/migration-policy-design.md
Gahow Wang 6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00


# Routing & Migration Policy: Design Log
This file is the active reference for the routing policy. It supersedes the
"single argmin + PUSH migration" framing once described as the final design
(see commit notes below and `REPORT.md` §3.9 errata).
## Current Algorithm: Hybrid LMetric + High-Cache Affinity
Implemented in `255c8e6`. Active under `--policy unified` in
`scripts/cache_aware_proxy.py`.
```python
# Step 1: affinity gate (only for sessions that have a recorded owner)
if session has affinity instance:
    cache_ratio = cache_hit_on_affinity / input_length
    gate_1: cache_ratio > 0.5
    gate_2: affinity.num_requests <= avg_num_requests * overload_factor
    if gate_1 AND gate_2:
        decision = "affinity"
        return affinity_instance

# Step 2: LMetric fallback with deterministic tie-breaker
for each instance i:
    score_i = (pending_prefill_tokens_i + new_uncached_tokens_i) * num_requests_i
            = P_tokens * BS                    # primary
    secondary key:  new_uncached_tokens        # prefer cache
    tertiary key:   num_requests               # prefer idle
    quaternary key: round-robin counter % n    # break ties
return argmin
```
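The pseudocode above can be turned into a small runnable sketch. This is not the code in `scripts/cache_aware_proxy.py`: the `Instance` dataclass, the `route` signature, and the `rr_counter` argument are hypothetical names introduced for illustration, and implementing the round-robin key as `(idx - rr_counter) % n` is an assumption about how the counter rotates ties.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical per-instance snapshot as seen by one incoming request."""
    name: str
    num_requests: int            # BS: requests currently on the instance
    pending_prefill_tokens: int  # prefill tokens already queued
    cache_hit: int = 0           # prefix tokens of this request cached here


def route(instances, input_length, affinity=None, rr_counter=0,
          overload_factor=2.0):
    """Return (chosen_instance, decision) for one request."""
    n = len(instances)

    # Step 1: affinity gate, only when the session has a recorded owner.
    if affinity is not None and input_length > 0:
        cache_ratio = affinity.cache_hit / input_length
        avg_requests = sum(i.num_requests for i in instances) / n
        not_overloaded = affinity.num_requests <= avg_requests * overload_factor
        if cache_ratio > 0.5 and not_overloaded:
            return affinity, "affinity"

    # Step 2: LMetric fallback; tuple comparison yields the tie-break order.
    def key(indexed):
        idx, inst = indexed
        uncached = max(0, input_length - inst.cache_hit)
        p_tokens = inst.pending_prefill_tokens + uncached
        return (p_tokens * inst.num_requests,  # primary: P_tokens * BS
                uncached,                      # secondary: prefer cache
                inst.num_requests,             # tertiary: prefer idle
                (idx - rr_counter) % n)        # quaternary: rotate ties (assumed)

    _, best = min(enumerate(instances), key=key)
    return best, "lmetric"


# Illustrative values only: the owner "a" holds 75% of the prompt but is overloaded.
pool = [Instance("a", 4, 2000, cache_hit=6000),
        Instance("b", 1, 500),
        Instance("c", 0, 0)]
chosen, decision = route(pool, input_length=8000, affinity=pool[0])
print(chosen.name, decision)
```

Using a tuple as the `min` key keeps the four-level ordering in one place, so the primary score and the tie-breakers cannot drift apart.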
The pure `--policy lmetric` baseline stays affinity-free; the hybrid lives
entirely under `--policy unified`. The round-robin counter is required because
`P_tokens * BS = 0` on every instance whenever `BS = 0` everywhere (new
sessions, cold start); without it, the argmin would pin every fresh session to
instance 0.

Parameters: `overload_factor=2.0` (default). The previously introduced
`decode_iteration_s`, `prefill_throughput`, and `rdma_overhead_s` are kept in
`Settings` but no longer drive routing; they were Approach A inputs.
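The cold-start tie can be reproduced in a few lines. The sketch below assumes the counter is applied as a rotating last sort key, `(i - rr_counter) % n`; `pick` is a hypothetical helper, not a function from the proxy.

```python
def pick(num_instances, rr_counter):
    """Choose among instances whose LMetric scores are all zero."""
    scores = [0] * num_instances  # P_tokens * BS with BS = 0 everywhere
    return min(range(num_instances),
               key=lambda i: (scores[i], (i - rr_counter) % num_instances))


# Counter stuck at 0: every fresh session lands on instance 0.
print([pick(4, 0) for _ in range(4)])   # [0, 0, 0, 0]
# Counter incremented per decision: fresh sessions rotate across instances.
print([pick(4, c) for c in range(6)])   # [0, 1, 2, 3, 0, 1]
```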
### Why this shape
- **LMetric for load balance**: `P_tokens × BS` is hyperparameter-free and
captures both pending prefill work and current batch contention.
- **Implicit soft affinity from LMetric itself**: `P_tokens` includes
`new_uncached_tokens = input - cache_hit`. Later turns naturally prefer
the instance that already cached the prefix, because their `P_tokens` are
smaller there. This is the dominant reason explicit migration buys little.
- **Explicit affinity only for the long-cache case**: when `cache_ratio > 0.5`,
the cost of breaking session stickiness (more than half of the input would
have to be prefilled again elsewhere) is large enough to justify a hard
gate. At or below that ratio, defer to LMetric.
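The implicit soft affinity in the second bullet is easy to see with numbers. The sketch below uses the score formula from the algorithm above; the token counts and batch sizes are illustrative, and `lmetric_score` is a hypothetical helper name.

```python
def lmetric_score(pending_prefill_tokens, num_requests, input_length, cache_hit):
    """P_tokens * BS, where P_tokens includes only the uncached part of the input."""
    new_uncached_tokens = max(0, input_length - cache_hit)
    p_tokens = pending_prefill_tokens + new_uncached_tokens
    return p_tokens * num_requests


# A later turn: 10,000 input tokens, 9,000 of them cached on the owner.
owner = lmetric_score(pending_prefill_tokens=1_000, num_requests=3,
                      input_length=10_000, cache_hit=9_000)
other = lmetric_score(pending_prefill_tokens=1_000, num_requests=3,
                      input_length=10_000, cache_hit=0)
print(owner, other)  # 6000 33000
```

With identical load, the owner's score is 5.5 times lower, so the argmin returns to it without any explicit affinity rule.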
## What Was Retired and Why
| Commit | Approach | Outcome |
|---|---|---|
| `6b255fa` | Single `argmin(queue+prefill+transfer)` over local/PUSH/cold | Initial design; numbers in REPORT §3.9 |
| `5892739` | Soft affinity added (pure argmin overloaded cache owners) | Stabilized but tail still degraded |
| `2b9eae0` | Reported Unified v3 (116 PUSH migrations) | TTFT -25%/-32%, **E2E p90 +12%, p99 +24%** |
| `e991960`/`5772149` | Forced session migration triggers (Approach B) | 57 migrations, **HEAVY TTFT p90 15.9s → 59.1s** |
| `cc6e562` | Revert Approach B | "overhead exceeds LB benefit" |
| `bf4469a` | Tighter push_cost + aligned hard gate | Triggered too few migrations to recover |
| `4c583f2` | Revert relaxed gate | 134 offloads, **E2E p90 37s → 82s** |
| `255c8e6` | **Current** hybrid LMetric + high-cache affinity | Stable baseline |

The shared lesson across the retired variants: a PD-separation offload pays
`C_queue + C_prefill + RDMA + D_schedule + D_decode_start`, and the prefill
time saved on D rarely amortizes that cost, especially because 92% of HEAVY
requests are turn-1 cold (no source-side cache to migrate). See
`analysis/elastic_hypotheses.md` H3-H9 for the per-variant evidence.
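The amortization argument reduces to a single comparison. The sketch below is only a restatement of that inequality; the function names are hypothetical and the numbers are placeholders in seconds, not measurements from any run.

```python
def offload_overhead(c_queue, c_prefill, rdma, d_schedule, d_decode_start):
    """Total latency an offloaded request pays, in seconds."""
    return c_queue + c_prefill + rdma + d_schedule + d_decode_start


def offload_amortizes(saved_prefill_on_d, **terms):
    """True only if the prefill time saved on D exceeds the offload overhead."""
    return saved_prefill_on_d > offload_overhead(**terms)


# Placeholder values, chosen only to exercise the comparison.
terms = dict(c_queue=0.4, c_prefill=1.2, rdma=0.3, d_schedule=0.1,
             d_decode_start=0.2)
print(offload_amortizes(1.0, **terms))  # False: saving is below the overhead
print(offload_amortizes(3.0, **terms))  # True: saving exceeds the overhead
```

For a turn-1 cold request there is no cached prefix to reuse, so the saved term is small and the comparison tends to fail.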
## Historical Baseline-Mode Comparison (Approach A)
These numbers are from the additive-cost-model variant of Unified routing
(before `255c8e6`). They are kept for reference only: the current hybrid is
built on LMetric, not on this additive cost model, so the "Unified" column
should not be cited as the current implementation. Latencies are in seconds.

| Metric | LMetric | Unified (Approach A, historical) | Change |
|--------|---------|----------------------------------|--------|
| E2E mean | 18.204 | 17.831 | -2.0% |
| E2E p50 | 6.184 | 6.074 | -1.8% |
| E2E p90 | 39.438 | 37.073 | -6.0% |
| TTFT p90 | 9.331 | 8.034 | -13.9% |
| Errors | 0 | 0 | — |

Approach A (additive cost model, historical):
```python
cost(instance_i) = num_requests_i × decode_iteration_s         # contention
                 + pending_prefill_tokens_i / throughput       # prefill queue
                 + max(0, input - cache_hit_i) / throughput    # new prefill

if affinity instance exists:
    gate 1: ongoing_tokens <= avg * overload_factor
    gate 2: affinity_cost <= global_best * overload_factor
    if both pass: choose affinity
    else:         choose global best
```
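For comparison with the hybrid, the historical cost model can be sketched as runnable code. Several things here are assumptions: the defaults `0.05` and `7000` are taken from the hardcoded constants mentioned in the review summary and are assumed to map to `decode_iteration_s` and `prefill_throughput`; `overload_factor=2.0` is assumed to match the current default; `additive_cost` and `choose` are hypothetical names.

```python
def additive_cost(num_requests, pending_prefill_tokens, input_length, cache_hit,
                  decode_iteration_s=0.05, prefill_throughput=7000.0):
    """Estimated seconds before the request's prefill completes (Approach A)."""
    contention = num_requests * decode_iteration_s
    prefill_queue = pending_prefill_tokens / prefill_throughput
    new_prefill = max(0, input_length - cache_hit) / prefill_throughput
    return contention + prefill_queue + new_prefill


def choose(costs, ongoing_tokens, affinity_idx=None, overload_factor=2.0):
    """Return the affinity index if both gates pass, else the global best."""
    best_idx = min(range(len(costs)), key=costs.__getitem__)
    if affinity_idx is not None:
        avg_tokens = sum(ongoing_tokens) / len(ongoing_tokens)
        gate_1 = ongoing_tokens[affinity_idx] <= avg_tokens * overload_factor
        gate_2 = costs[affinity_idx] <= costs[best_idx] * overload_factor
        if gate_1 and gate_2:
            return affinity_idx
    return best_idx


# Illustrative request: 14,000 input tokens, half cached on instance 0.
costs = [additive_cost(4, 7000, 14_000, 7_000),   # owner: busy, warm cache
         additive_cost(1, 0, 14_000, 0)]          # peer: idle, cold cache
print(costs)
print(choose(costs, ongoing_tokens=[20_000, 14_000], affinity_idx=0))  # 0
```

The result depends directly on the two constants, which is the sensitivity that the retirement note below refers to.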
Reason for retirement: the additive cost depended heavily on the hardcoded
`decode_iteration_s` and `prefill_throughput` constants. LMetric reproduces
the same ordering without those constants because `P_tokens` and `BS` already
capture both the prefill queue and batch contention. The hybrid keeps the cheap
LMetric core and adds an explicit affinity gate only for high-cache cases.
### Evolution Table (historical)
| Version | Description | ALL TTFT p90 | ALL E2E p90 | tok max/min |
|---------|-------------|-------------|-------------|-------------|
| Baseline | linear routing | 16.058 | 52.292 | 2.7x |
| LMetric | P×BS, no affinity | 9.331 | 39.438 | 2.4x |
| v2 (bug) | unified, queue=prefill only | 23.339 | 66.307 | 10.3x |
| v3 | +decode in queue, +hard gate | 10.121 | 42.393 | 2.6x |
| A (elastic) | +num_requests contention | 7.638 | 39.044 | 3.5x |
| A (baseline, Approach A) | same routing, no Mooncake | 8.034 | 37.073 | — |
| **Hybrid (current)** | **LMetric + high-cache affinity** | **see #8 re-run** | **see #8 re-run** | **—** |

The current Hybrid row deliberately has no numbers: per
`analysis/unified_routing_fix_review.md` #8, the small (≤2%) improvements
need 3-5 paired multi-run trials before being quoted.
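One way to summarize such paired trials is the per-trial relative change and its spread. The sketch below is an assumption about how the comparison might be reported, not the procedure defined in the review; `paired_summary` is a hypothetical helper and the input lists are synthetic placeholders, not results.

```python
from statistics import mean, stdev


def paired_summary(baseline, candidate):
    """Mean and sample std of per-trial relative change (candidate vs baseline)."""
    if len(baseline) != len(candidate) or len(baseline) < 2:
        raise ValueError("need at least 2 paired trials of equal length")
    rel = [(c - b) / b for b, c in zip(baseline, candidate)]
    return mean(rel), stdev(rel)


# Synthetic placeholder values (seconds); trial k of each list shares a trace.
lmetric_p90 = [40.0, 41.0, 39.0]
hybrid_p90 = [39.2, 40.6, 38.4]
avg_change, spread = paired_summary(lmetric_p90, hybrid_p90)
print(f"mean change {avg_change:+.2%}, std {spread:.2%}")
```

If the spread is comparable to the mean change, a 2% difference cannot be distinguished from run-to-run noise, which is why a single run is not enough.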
## Open Questions / Next Steps
- **#8 paired multi-run**: 3-5 fresh trials of LMetric vs current hybrid on
identical trace + restart procedure, with full artifact bundle
(`config.json`, `metrics.jsonl`, `metrics.summary.json`, `breakdown.json`,
`gpu_util.csv`, per-instance APC snapshot, git commit hash).
- **#10 production-concurrency track**: re-introduce a controlled-concurrency
flag (`--max-inflight-sessions` or equivalent) and rerun the comparison at
64 / 128 active sessions before drawing conclusions for production.
- **Hard affinity ceiling for the edge case where `cache_ratio = 1.0` and the
affinity instance is overloaded**: in that scenario gate 2 rejects the
affinity instance, but the LMetric fallback's tie-breaker re-elects it
because its `new_uncached_tokens` is 0. Either add a hard `num_requests`
ceiling or accept the behavior. Tracked as open in the test
`test_hybrid_high_cache_breaks_on_overload` (works only when the request has
at least one uncached block).
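The last item can be reproduced with the fallback's sort key alone. The sketch assumes the overloaded owner has no pending prefill tokens and that one block is 16 tokens; both are illustrative assumptions, and `fallback_key` is a hypothetical helper that omits the round-robin key.

```python
def fallback_key(pending_prefill_tokens, num_requests, input_length, cache_hit):
    """(primary score, new_uncached_tokens, num_requests) for the LMetric fallback."""
    uncached = max(0, input_length - cache_hit)
    score = (pending_prefill_tokens + uncached) * num_requests
    return (score, uncached, num_requests)


idle_peer = fallback_key(0, 0, 8_000, 0)              # (0, 8000, 0)

# cache_ratio = 1.0: the owner's score is 0 * 12 = 0, so it ties on the
# primary key and wins on the secondary key despite 12 running requests.
full_cache_owner = fallback_key(0, 12, 8_000, 8_000)  # (0, 0, 12)
print(full_cache_owner < idle_peer)   # True: overloaded owner is re-elected

# One uncached block (assumed 16 tokens): the owner's score becomes 16 * 12,
# so the idle peer wins on the primary key.
near_full_owner = fallback_key(0, 12, 8_000, 7_984)   # (192, 16, 12)
print(near_full_owner < idle_peer)    # False: request moves to the idle peer
```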
## Original "Rigorous Review Summary" (historical)
Independent review of the Approach A result above:
- **CLEAN**: Fair comparison (identical vLLM/proxy/trace/measurement)
- **CLEAN**: No reward hacking (improvement from algorithmic difference)
- **WARNING**: 2% mean improvement needs multi-run verification (3-5 runs)
- **NOTE**: Hardcoded constants (`0.05`, `7000`) are hardware-specific but
legitimate