Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).
- REPORT.md §1.1 / §3.9: add errata callout and section header noting
the "Final Design" framing was retired after cc6e562 / 4c583f2;
point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current
hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
tie-breaker), then a "What Was Retired" commit table, then the old
Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
LMetric isn't "neutralized by affinity constraints" (pure --policy
lmetric has no affinity at all); it converges to similar placements
because P_tokens includes new_uncached_tokens, giving it implicit
soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the
"DOESN'T work" summary, plus a footer cross-referencing the current
routing direction.
- analysis/unified_routing_fix_review.md: track this file (was
untracked); it is the review handoff cited from the updated docs.
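
Illustration of the LMetric correction above: a minimal sketch, not the
repository's implementation. It assumes P_tokens is a worker's queued
tokens plus the request's new_uncached_tokens; Worker, queued_tokens,
cached_prefix and route are hypothetical names. Because a warm worker
scores lower for its own sessions, a plain argmin with no affinity rule
still tends to keep a session on the worker that holds its cache.

```python
from dataclasses import dataclass, field


@dataclass
class Worker:
    name: str
    queued_tokens: int = 0  # hypothetical: prefill tokens already queued
    cached_prefix: dict = field(default_factory=dict)  # session id -> cached tokens


def p_tokens(worker: Worker, session: str, prompt_tokens: int) -> int:
    """Assumed cost: queued work plus tokens this worker would have to prefill."""
    cached = min(worker.cached_prefix.get(session, 0), prompt_tokens)
    new_uncached_tokens = prompt_tokens - cached
    return worker.queued_tokens + new_uncached_tokens


def route(workers: list, session: str, prompt_tokens: int) -> Worker:
    """Pure argmin over the cost; there is no explicit affinity rule."""
    return min(workers, key=lambda w: p_tokens(w, session, prompt_tokens))


workers = [
    Worker("gpu0", queued_tokens=1000, cached_prefix={"s1": 8000}),
    Worker("gpu1", queued_tokens=0),
]
warm_choice = route(workers, "s1", 9000).name  # gpu0 holds most of s1's prefix
cold_choice = route(workers, "s2", 9000).name  # no cache anywhere, load decides
```

Here the warm session s1 stays on gpu0 even though gpu0 is busier, while
the cold session s2 goes to the idle gpu1.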
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
A: Add an /estimate_hit endpoint to the bootstrap server for real-time
cache probing. The proxy queries it before committing to PUSH, eliminating
the 24% of PUSH requests that matched nothing in the real cache (shadow
cache divergence).
C: Add _handle_cached_prefill_offload: C (the cache source) runs a fast
cached prefill and writes the KV to Mooncake; D then pulls it and decodes.
This replaces the broken direct_read PUSH, where D occupied KV blocks
while waiting on the RDMA transfer without doing any compute.
Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99.
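
Sketch of the probe-before-PUSH check in A, under stated assumptions:
the probe is modelled as a callable returning an estimated number of
cached tokens (the actual /estimate_hit request and response format are
not described here), and choose_path, min_hit_tokens and the "push" /
"local_prefill" labels are hypothetical.

```python
from typing import Callable


def choose_path(
    estimate_hit: Callable[[str], int],
    request_id: str,
    min_hit_tokens: int = 1,  # hypothetical threshold for a useful hit
) -> str:
    """Commit to PUSH only when the live probe reports a real cache hit."""
    hit_tokens = estimate_hit(request_id)
    if hit_tokens >= min_hit_tokens:
        return "push"
    # The shadow cache predicted a hit the real cache no longer holds,
    # so skip the transfer instead of pushing a zero-match request.
    return "local_prefill"


# Stand-in for the real probe: maps request id to live cached-token count.
live_cache = {"req-warm": 4096, "req-stale": 0}
warm_path = choose_path(lambda rid: live_cache.get(rid, 0), "req-warm")
stale_path = choose_path(lambda rid: live_cache.get(rid, 0), "req-stale")
```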
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU
cap) and the early elastic v3 warm-vs-fresh runs are no longer current,
and that the "--max-inflight-sessions 64+" next-step text refers to a
flag that was removed and must be restored per FIXES.md §B2 before those
numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.
Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), so they fail
the cache_ratio>=0.3 gate. Only 17/118 HEAVY requests were offloaded, too
few to reduce prefill-decode interference. Offloaded requests are 50%
SLOWER due to P-side queuing (14.7s) + RDMA overhead (5.7s).
Interference IS real: 89% of WARM/MEDIUM requests have 1+ concurrent HEAVY
prefill. But elastic PS in its current form can't address it because cold
HEAVY prefills (the majority) can't benefit from offload.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact)
but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU,
migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) to within 2% on
all metrics, proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>