Commit Graph

4 Commits

Author SHA1 Message Date
6a27f75337 Docs: reconcile routing docs with current hybrid direction
Per analysis/unified_routing_fix_review.md #2, several docs still
presented the retired single-argmin + PUSH-migration design as the
final algorithm. Mark them superseded and document the current hybrid
direction (commit 255c8e6).

- REPORT.md §1.1 / §3.9: add errata callout and section header noting
  the "Final Design" framing was retired after cc6e562 / 4c583f2;
  point readers to docs/migration-policy-design.md.

- docs/migration-policy-design.md: rewrite. Opens with the current
  hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate +
  tie-breaker), then a "What Was Retired" commit table, then the old
  Approach A numbers preserved as "Historical Baseline-Mode Comparison".

- analysis/research_findings.md §2.2 / §5: correct the LMetric framing.
  LMetric isn't "neutralized by affinity constraints" (pure --policy
  lmetric has no affinity at all); it converges to similar placements
  because P_tokens includes new_uncached_tokens, giving it implicit
  soft affinity.

- analysis/elastic_hypotheses.md: same LMetric correction in the
  "DOESN'T work" summary, plus a footer cross-referencing the current
  routing direction.

- analysis/unified_routing_fix_review.md: track this file (was
  untracked); it is the review handoff cited from the updated docs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 10:47:14 +08:00
baf7ffb08c 16-session contention: TPOT +45% from prefill-decode interference
Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades
from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%).
This is the first time we've reproduced real prefill-decode interference
in controlled experiments.

Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate
correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency.

Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show
~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not
arrival rate.

Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis.
The real bottleneck is vLLM's chunked prefill scheduling, not routing or
PD disaggregation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 05:51:47 +08:00
85b230455e H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling
H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU
imbalance (~3.5-4x across all settings). Root cause: imbalance is from
workload skew at session placement (turn 1), not from routing at turn 2+.

H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x),
and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded
requests have bimodal transfer times (0.6s or 18-31s) that negate the
routing benefit.

Updated elastic_hypotheses.md with H7 results and next directions:
higher load experiments where contention amplifies routing differences.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 03:04:02 +08:00
098d86385a Add elastic hypotheses tracking doc with H1-H6 analysis
Tracks all hypotheses tested during elastic PD disaggregation research:
- H1 (kv_both overhead): REJECTED — zero overhead at idle
- H2 (PS cold prefill): REJECTED — PS slower than cached C
- H3 (C_s+flexD): PARTIALLY VALIDATED — E2E -9% but HEAVY p90 +117%
- H4 (cache-aware offload): TODO — only offload high-cache-hit HEAVY
- H5 (RDMA overhead): TODO — Mooncake lacks layerwise transfer
- H6 (session migration): TODO — verify D's APC after migration

Key insight: offload decision should be cache-aware (new_tokens),
not size-based (total_input). 80k request with 90% cache = 8k prefill.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 01:17:12 +08:00