Per analysis/unified_routing_fix_review.md #2, several docs still
present the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).
- REPORT.md §1.1 / §3.9: add errata callout and section header noting
the "Final Design" framing was retired after cc6e562 / 4c583f2;
point readers to docs/migration-policy-design.md.
- docs/migration-policy-design.md: rewrite. Opens with the current
hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
tie-breaker), then a "What Was Retired" commit table, then the old
Approach A numbers preserved as "Historical Baseline-Mode Comparison".
- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
LMetric isn't "neutralized by affinity constraints" (pure --policy
lmetric has no affinity at all); it converges to similar placements
because P_tokens includes new_uncached_tokens, giving it implicit
soft affinity.
- analysis/elastic_hypotheses.md: same LMetric correction in the
"DOESN'T work" summary, plus a footer cross-referencing the current
routing direction.
- analysis/unified_routing_fix_review.md: track this file (was
untracked); it is the review handoff cited from the updated docs.
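
A minimal sketch of the hybrid direction described above, for
orientation only. The Worker fields, the exact form of P_tokens, and
the order in which the gate and tie-breaker apply are assumptions made
for illustration; only the cache_ratio > 0.5 threshold and the fact
that P_tokens includes new_uncached_tokens come from the notes above.

```python
from dataclasses import dataclass

AFFINITY_CACHE_RATIO = 0.5  # gate threshold named in the notes above


@dataclass
class Worker:
    name: str
    pending_tokens: int  # hypothetical: tokens already queued on this worker
    batch_size: int      # hypothetical: running requests, used as tie-breaker
    cached_tokens: int   # hypothetical: prompt tokens already cached here


def p_tokens(worker: Worker, prompt_tokens: int) -> int:
    """Assumed LMetric-style cost: queued work plus the uncached prompt part."""
    new_uncached_tokens = max(prompt_tokens - worker.cached_tokens, 0)
    return worker.pending_tokens + new_uncached_tokens


def route_lmetric(workers: list[Worker], prompt_tokens: int) -> Worker:
    """Pure LMetric: no explicit affinity, ties broken by batch size then name."""
    return min(
        workers,
        key=lambda w: (p_tokens(w, prompt_tokens), w.batch_size, w.name),
    )


def route_hybrid(workers: list[Worker], prompt_tokens: int) -> Worker:
    """LMetric base plus an affinity gate on the best cache holder."""
    best = max(workers, key=lambda w: w.cached_tokens)
    if prompt_tokens > 0 and best.cached_tokens / prompt_tokens > AFFINITY_CACHE_RATIO:
        return best  # gate fires: stay with the worker holding the cache
    return route_lmetric(workers, prompt_tokens)


if __name__ == "__main__":
    workers = [
        Worker("gpu0", pending_tokens=1000, batch_size=2, cached_tokens=0),
        Worker("gpu1", pending_tokens=5000, batch_size=4, cached_tokens=9000),
    ]
    # Both policies pick gpu1: the uncached-token term already acts as
    # soft affinity in pure LMetric.
    print(route_lmetric(workers, 10_000).name, route_hybrid(workers, 10_000).name)
```

In this toy setup pure LMetric and the hybrid agree whenever the cache
hit is large, which is the "implicit soft affinity" effect noted for
research_findings.md.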
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades
from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%).
This is the first time we've reproduced real prefill-decode interference
in controlled experiments.
Elastic RDMA at 16 sessions doesn't help: only 13/500 requests were
offloaded (the cache gate is correct for cold turn-1 requests), and
kv_both adds ~16% TPOT overhead at high concurrency.
Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show
~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not
arrival rate.
Updated elastic_hypotheses.md with H8, H9, and a comprehensive final analysis.
The real bottleneck is vLLM's chunked prefill scheduling, not routing or
PD disaggregation.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: the imbalance comes
from workload skew at session placement (turn 1), not from routing at
turn 2+.
H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.
Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tracks all hypotheses tested during elastic PD disaggregation research:
- H1 (kv_both overhead): REJECTED — zero overhead at idle
- H2 (PS cold prefill): REJECTED — PS slower than cached C
- H3 (C_s+flexD): PARTIALLY VALIDATED — E2E -9% but HEAVY p90 +117%
- H4 (cache-aware offload): TODO — only offload high-cache-hit HEAVY
- H5 (RDMA overhead): TODO — Mooncake lacks layerwise transfer
- H6 (session migration): TODO — verify D's APC after migration
Key insight: the offload decision should be cache-aware (new_tokens),
not size-based (total_input). An 80k-token request with a 90% cache hit
needs only 8k tokens of prefill.
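
The arithmetic behind that insight can be sketched as follows. The
HEAVY_THRESHOLD value and both classifier functions are hypothetical
and only illustrate how a size-based and a cache-aware rule can
disagree on the same request; they are not the project's actual gate.

```python
HEAVY_THRESHOLD = 32_000  # hypothetical cutoff, in tokens


def new_tokens(total_input: int, cache_hit_ratio: float) -> int:
    """Tokens that still need prefill after the cached prefix is reused."""
    return round(total_input * (1.0 - cache_hit_ratio))


def is_heavy_size_based(total_input: int) -> bool:
    """Size-based rule: looks only at the full prompt length."""
    return total_input > HEAVY_THRESHOLD


def is_heavy_cache_aware(total_input: int, cache_hit_ratio: float) -> bool:
    """Cache-aware rule: looks at the prefill work actually remaining."""
    return new_tokens(total_input, cache_hit_ratio) > HEAVY_THRESHOLD


if __name__ == "__main__":
    # The 80k request with a 90% cache hit from the note above.
    print(new_tokens(80_000, 0.9))            # 8000
    print(is_heavy_size_based(80_000))        # True
    print(is_heavy_cache_aware(80_000, 0.9))  # False
```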
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>