Commit Graph

7 Commits

Author SHA1 Message Date
9835d6af5d Elastic PS eval: near-neutral, offload gate fires for only 14% of HEAVY
Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the
cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce
prefill-decode interference. Offloaded requests are 50% SLOWER due to
P-side queuing (14.7s) + RDMA overhead (5.7s).

Interference IS real: 89% of WARM/MEDIUM requests overlap with at least
one concurrent HEAVY prefill. But elastic PS in its current form can't
address it because cold HEAVY prefills (the majority) fail the gate and
therefore can't benefit from offload.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 16:49:25 +08:00
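
The offload gate described in commit 9835d6af5d can be sketched roughly as follows. This is a minimal illustration, not the code in cache_aware_proxy.py: the `Request` fields and the function names are hypothetical, and only the `cache_ratio>=0.3` threshold and the HEAVY class come from the commit message. It shows why a cold request (cache_ratio=0%) can never be offloaded under this rule.

```python
from dataclasses import dataclass

CACHE_RATIO_GATE = 0.3  # threshold quoted in the commit message


@dataclass
class Request:
    # All field names are hypothetical; the real proxy may track these differently.
    req_class: str       # e.g. "HEAVY", "MEDIUM", "WARM"
    prompt_tokens: int   # total prompt length in tokens
    cached_tokens: int   # prompt tokens already present in the prefix cache


def cache_ratio(req: Request) -> float:
    """Fraction of the prompt that is already cached (0.0 for a cold request)."""
    if req.prompt_tokens == 0:
        return 0.0
    return req.cached_tokens / req.prompt_tokens


def should_offload(req: Request, gate: float = CACHE_RATIO_GATE) -> bool:
    """Offload only HEAVY requests whose cache ratio clears the gate."""
    return req.req_class == "HEAVY" and cache_ratio(req) >= gate


def offload_fraction(requests: list[Request]) -> float:
    """Share of HEAVY requests that pass the gate (0.0 if there are none)."""
    heavy = [r for r in requests if r.req_class == "HEAVY"]
    if not heavy:
        return 0.0
    return sum(should_offload(r) for r in heavy) / len(heavy)
```

With a gate of this shape, the offloaded share is bounded by the share of HEAVY requests that arrive warm, which is consistent with the 17/118 figure reported above when 75% of HEAVY requests are cold.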
bf037594c4 Production-realistic baseline: APC 67.5%, TPOT +139% from interference
Updated methodology:
- Window+thin sampling preserves cross-session sharing (48% vs 16%)
- --max-single-turn-ratio 0.3 boosts multi-turn to 70%
- --window-seconds 600 for 10-min contiguous window
- Trace-driven replay (no session limit, no time compression)
- Daily config: --requests 850 (~13 min, APC~76%)

Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup),
confirming prefill-decode interference is real at production concurrency.
APC 67.5% (vs 44%) from better KV reuse preservation.

Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session
(was incorrectly reported as 91% / 9%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:44:34 +08:00
c8ba666517 Benchmark concurrency gap: 1 req/GPU is 10-15x below production
Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:16:20 +08:00
fefbd71ca9 GPU imbalance analysis + elastic PS verdict + corrected LMetric results
Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact)
  but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU,
  migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) on all
  metrics (<2%), proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:11:23 +08:00
fc92410ec9 Invalidate prior A/B results + add proper experiment harness
Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% is impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00
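
The prefill timeout with co-located fallback mentioned in commit fc92410ec9 could look roughly like the sketch below. It assumes an asyncio-based proxy, which the commit does not state. The function name and the two callables are hypothetical, and `HEAVY_COLO_FALLBACK` is used here only as a label string because the commit does not say how that name appears in the code. Only the 120s value comes from the message.

```python
import asyncio

PREFILL_TIMEOUT_S = 120.0  # timeout quoted in the commit message


async def prefill_with_fallback(remote_prefill, colocated_prefill,
                                timeout_s: float = PREFILL_TIMEOUT_S):
    """Try the offloaded prefill; on timeout, run prefill co-located instead.

    Both arguments are hypothetical zero-argument callables returning awaitables.
    Returns (result, path), where path records which route served the request.
    """
    try:
        # wait_for cancels the remote call if it exceeds the timeout.
        result = await asyncio.wait_for(remote_prefill(), timeout=timeout_s)
        return result, "OFFLOAD"
    except asyncio.TimeoutError:
        # Fall back to prefill on the decode instance itself.
        result = await colocated_prefill()
        return result, "HEAVY_COLO_FALLBACK"
```

Returning the path alongside the result makes it straightforward to count fallbacks in a benchmark run. Any per-instance counters incremented before the remote call would need to be decremented exactly once on either path, which is the kind of bookkeeping the double-decrement fix refers to.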
e4fa56cb1e LMetric routing policy (OSDI'26) + A/B results vs linear baseline
Implement LMetric (P_tokens × BS multiplication score) from "Simple is
Better" (Zhang et al., OSDI'26) as alternative routing policy for
combined mode. Key changes:

- cache_aware_proxy.py: add --policy {linear,lmetric} flag, track
  pending_prefill_tokens and num_requests per instance, /stats endpoint
- run_lmetric_ab.sh: automated A/B script for fair comparison

Results (200 req, fresh restart, same trace):
  Linear:  TTFT50=1.086  TPOT90=0.077  E2E50=5.423
  LMetric: TTFT50=1.099  TPOT90=0.073  E2E50=5.205
  Delta:   TTFT +1.2%    TPOT -5.9%    E2E -4.0%

LMetric improves TPOT/E2E modestly through better load balancing, but
routing policy headroom is limited vs elastic P2P offload (-44% E2E).

TODO: vLLM → Redis → router pipeline for exact state ablation.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:57:32 +08:00
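
The LMetric score from commit e4fa56cb1e (P_tokens × BS) can be illustrated with a small sketch. The field names `pending_prefill_tokens` and `num_requests` come from the commit message; everything else is an assumption, including the class and function names, the choice to score each instance as if the new request had already landed on it, and the per-instance `uncached_tokens` input standing in for cache-hit scoring. The paper's exact formulation may differ.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    name: str
    pending_prefill_tokens: int  # tokens still waiting to be prefilled
    num_requests: int            # requests currently on the instance (BS)


def lmetric_score(inst: InstanceState, new_prefill_tokens: int) -> int:
    """Product of prefill tokens and batch size if the request were placed here."""
    p_tokens = inst.pending_prefill_tokens + new_prefill_tokens
    batch_size = inst.num_requests + 1
    return p_tokens * batch_size


def pick_instance(instances: list[InstanceState],
                  uncached_tokens: dict[str, int]) -> InstanceState:
    """Choose the instance with the lowest score.

    uncached_tokens maps instance name to the prompt tokens that would still
    need prefill there, so a cache hit lowers that instance's score.
    """
    return min(instances, key=lambda i: lmetric_score(i, uncached_tokens[i.name]))
```

For example, with instance `a` at 1000 pending tokens and 2 requests and instance `b` at 200 pending tokens and 4 requests, a 500-token request with no cache hits scores 4500 on `a` and 3500 on `b`, so `b` is chosen. If `a` already holds 400 of those tokens in cache, its score drops to 3300 and it wins instead, which is one way a score like this can produce soft affinity without explicit session stickiness.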
2b0ac70ee7 Phase 1 milestone: system-level analysis + reproducible report
- REPORT.md: self-contained milestone report covering baseline vs elastic
  setup, exact launch commands, benchmark params, results, log locations,
  and repo structure — sufficient for anyone to reproduce
- analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown
  (KV cache hit ratio, per-class TTFT, GPU util paradox explanation)
- scripts/cache_aware_proxy.py: round-robin P-instance selection replacing
  argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x)
- scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:17:41 +08:00
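
The selection change in commit 2b0ac70ee7, round-robin in place of argmin(ongoing_tokens), can be sketched as below. The class and function names are hypothetical and the instance identifiers are placeholders; the commit only states which strategy replaced which.

```python
from itertools import cycle


def argmin_ongoing_tokens(ongoing_tokens: dict[str, int]) -> str:
    """Previous strategy: pick the P-instance with the fewest ongoing tokens."""
    return min(ongoing_tokens, key=ongoing_tokens.get)


class RoundRobinSelector:
    """Replacement strategy: rotate through P-instances in a fixed order."""

    def __init__(self, instances: list[str]):
        if not instances:
            raise ValueError("at least one P-instance is required")
        self._cycle = cycle(list(instances))

    def pick(self) -> str:
        return next(self._cycle)
```

Round-robin ignores load entirely, so it spreads request counts evenly regardless of how accurate the token counters are, at the cost of not reacting to requests of very different sizes.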