bf037594c4
Production-realistic baseline: APC 67.5%, TPOT +139% from interference
...
Updated methodology:
- Window+thin sampling preserves cross-session sharing (48% vs 16%)
- --max-single-turn-ratio 0.3 boosts multi-turn to 70%
- --window-seconds 600 for 10-min contiguous window
- Trace-driven replay (no session limit, no time compression)
- Daily config: --requests 850 (~13 min, APC~76%)
Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup),
confirming prefill-decode interference is real at production concurrency.
APC 67.5% (vs 44%) from better KV reuse preservation.
Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session
(was incorrectly reported as 91% / 9%).
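The window and single-turn cap from the methodology list can be illustrated with a minimal sketch. Everything in it is an assumption for illustration: the record schema (`ts`, `session_id`), the function name `sample_window`, and the reading of --max-single-turn-ratio as a cap on the share of single-turn sessions. The real sampler may differ, and it also thins the trace to --requests, which is not modeled here.

```python
import random
from collections import defaultdict


def sample_window(records, window_seconds=600, max_single_turn_ratio=0.3, seed=0):
    """Keep one contiguous window, then cap the share of single-turn sessions.

    `records` is a list of dicts with hypothetical keys `ts` (seconds) and
    `session_id`; this is not the schema of the real trace.
    """
    if not records:
        return []
    ordered = sorted(records, key=lambda r: r["ts"])
    start = ordered[0]["ts"]
    # Contiguous window: arrival order and spacing are kept as recorded.
    window = [r for r in ordered if r["ts"] - start < window_seconds]

    by_session = defaultdict(list)
    for r in window:
        by_session[r["session_id"]].append(r)
    single = sorted(s for s, rs in by_session.items() if len(rs) == 1)
    multi = [s for s, rs in by_session.items() if len(rs) > 1]

    # Largest n such that n / (n + len(multi)) stays at or below the ratio.
    if max_single_turn_ratio >= 1:
        max_single = len(single)
    else:
        max_single = int(
            max_single_turn_ratio * len(multi) / (1 - max_single_turn_ratio)
        )
    rng = random.Random(seed)
    kept_single = set(rng.sample(single, min(len(single), max_single)))
    keep = set(multi) | kept_single
    # Multi-turn sessions are kept whole, so their intra-session reuse survives.
    return [r for r in window if r["session_id"] in keep]
```

Dropping whole single-turn sessions, rather than individual requests, leaves every multi-turn session intact inside the window.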
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 15:44:34 +08:00
c8ba666517
Benchmark concurrency gap: 1 req/GPU is 10-15x below production
...
Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode
interference that appears at 2/GPU (+38% TPOT) and would dominate at
production load (~15/GPU). Updated §8 to re-evaluate elastic PS at
production concurrency. Next step: --max-inflight-sessions 64 benchmark.
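The concurrency gap is simple arithmetic, sketched below. The GPU count of 8 is an assumption inferred from "8 sessions yields 1 req/GPU", and the sketch also assumes each in-flight session has one active request at a time; the helper names are hypothetical.

```python
NUM_GPUS = 8  # assumption: inferred from 8 in-flight sessions giving 1 req/GPU


def per_gpu_concurrency(max_inflight_sessions, num_gpus=NUM_GPUS):
    """Average concurrent requests per GPU, one active request per session."""
    return max_inflight_sessions / num_gpus


def sessions_for_target(target_per_gpu, num_gpus=NUM_GPUS):
    """In-flight sessions needed to reach a target per-GPU concurrency."""
    return target_per_gpu * num_gpus


print(per_gpu_concurrency(8))    # current setting
print(per_gpu_concurrency(64))   # planned next step
print(sessions_for_target(15))   # sessions needed for ~15 req/GPU
```

Under these assumptions, --max-inflight-sessions 64 gives about 8 req/GPU, still short of the ~15/GPU production figure, which would need roughly 120 in-flight sessions.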
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:16:20 +08:00
fefbd71ca9
GPU imbalance analysis + elastic PS verdict + corrected LMetric results
...
Key findings:
- Session-sticky imbalance is 8.6x at 200 req (small-sample artifact)
  but only 1.24x at 1000 req (moderate, TPOT unaffected)
- Elastic PS not justified: interference reduction 0% at 1/GPU,
  migration reduces imbalance 1.24x→1.18x at 1.5s/event cost
- Corrected LMetric (no affinity) matches Linear (sticky) on all
  metrics (<2%), proving soft affinity from cache-hit scoring works
- Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 12:11:23 +08:00
fc92410ec9
Invalidate prior A/B results + add proper experiment harness
...
Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.
Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.
Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test
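The two proxy fixes listed above (no double decrement, prefill timeout with co-located fallback) can be sketched as follows. This is not the code in cache_aware_proxy.py: `InflightSlot`, `prefill_with_fallback`, and the callable arguments are hypothetical names, and only the 120s default and the HEAVY_COLO_FALLBACK label come from the commit message.

```python
import asyncio


class InflightSlot:
    """Tracks one in-flight request on an instance; release() is idempotent."""

    def __init__(self, counters, instance):
        self._counters = counters
        self._instance = instance
        self._released = False
        counters[instance] = counters.get(instance, 0) + 1

    def release(self):
        # Guard against double decrement: a second call is a no-op.
        if not self._released:
            self._released = True
            self._counters[self._instance] -= 1


async def prefill_with_fallback(offload, colocated, counters, p_instance,
                                timeout_s=120.0):
    """Try offloaded prefill; on timeout, run prefill co-located instead."""
    slot = InflightSlot(counters, p_instance)
    try:
        result = await asyncio.wait_for(offload(), timeout=timeout_s)
        return "offload", result
    except asyncio.TimeoutError:
        slot.release()  # free the P-instance slot before falling back
        return "HEAVY_COLO_FALLBACK", await colocated()
    finally:
        slot.release()  # safe even if already released in the except branch
```

Routing every decrement through one idempotent `release()` means the error path and the cleanup path can both call it without the counter drifting negative.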
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00
e4fa56cb1e
LMetric routing policy (OSDI'26) + A/B results vs linear baseline
...
Implement LMetric (P_tokens × BS multiplication score) from "Simple is
Better" (Zhang et al., OSDI'26) as alternative routing policy for
combined mode. Key changes:
- cache_aware_proxy.py: add --policy {linear,lmetric} flag, track
  pending_prefill_tokens and num_requests per instance, /stats endpoint
- run_lmetric_ab.sh: automated A/B script for fair comparison
Results (200 req, fresh restart, same trace):
  Linear:  TTFT50=1.086  TPOT90=0.077  E2E50=5.423
  LMetric: TTFT50=1.099  TPOT90=0.073  E2E50=5.205
  Delta:   TTFT +1.2%    TPOT -5.9%    E2E -4.0%
LMetric improves TPOT/E2E modestly through better load balancing, but
routing policy headroom is limited vs elastic P2P offload (-44% E2E).
TODO: vLLM → Redis → router pipeline for exact state ablation.
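A minimal sketch of a multiplicative P_tokens × BS routing score, using the two per-instance fields named above. The exact formula in the paper and in cache_aware_proxy.py is not given here, so the details are assumptions: counting the new request in both factors (the `+ new_prefill_tokens` and `+ 1` terms), discounting tokens already cached on an instance, and the names `InstanceState`, `lmetric_score`, and `pick_instance`.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    name: str
    pending_prefill_tokens: int  # prefill tokens queued or running
    num_requests: int            # current batch size (BS)


def lmetric_score(state, new_prefill_tokens):
    # Prefill backlog including this request, times batch size including
    # this request. Lower is better. The "+" terms are an assumption.
    return (state.pending_prefill_tokens + new_prefill_tokens) * (state.num_requests + 1)


def pick_instance(states, prompt_tokens, cached_tokens):
    """Pick the lowest-score instance.

    `cached_tokens` maps instance name -> prompt prefix tokens already cached
    there (hypothetical input; the real proxy may estimate this differently).
    """
    def score(s):
        uncached = max(prompt_tokens - cached_tokens.get(s.name, 0), 0)
        return lmetric_score(s, uncached)

    return min(states, key=score)


states = [
    InstanceState("a", pending_prefill_tokens=1000, num_requests=2),
    InstanceState("b", pending_prefill_tokens=0, num_requests=3),
]
print(pick_instance(states, 2000, {}).name)           # no cache hit anywhere
print(pick_instance(states, 2000, {"a": 1800}).name)  # prefix cached on "a"
```

Because cached tokens shrink the first factor, a request drifts toward the instance that already holds its prefix without any explicit session pinning.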
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:57:32 +08:00
2b0ac70ee7
Phase 1 milestone: system-level analysis + reproducible report
...
- REPORT.md: self-contained milestone report covering baseline vs elastic
  setup, exact launch commands, benchmark params, results, log locations,
  and repo structure; sufficient for anyone to reproduce
- analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown
  (KV cache hit ratio, per-class TTFT, GPU util paradox explanation)
- scripts/cache_aware_proxy.py: round-robin P-instance selection replacing
  argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x)
- scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config
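The switch from argmin(ongoing_tokens) to round-robin can be sketched as two selectors. The function names and the stale-counter scenario are assumptions for illustration; the commit does not say why argmin produced the imbalance, and lagging counters are only one possible cause.

```python
import itertools


def argmin_selector(ongoing_tokens):
    """Pick the instance with the fewest ongoing tokens at call time."""
    def pick():
        return min(ongoing_tokens, key=ongoing_tokens.get)
    return pick


def round_robin_selector(instances):
    """Cycle through the instances in order, ignoring load counters."""
    cycle = itertools.cycle(list(instances))
    return lambda: next(cycle)


# If the counters are not updated between picks, argmin keeps choosing the
# same instance, while round-robin spreads the requests evenly.
ongoing = {"p0": 0, "p1": 10, "p2": 10}
pick_argmin = argmin_selector(ongoing)
pick_rr = round_robin_selector(ongoing)
print([pick_argmin() for _ in range(3)])
print([pick_rr() for _ in range(4)])
```

Round-robin gives up load awareness in exchange for a selection that cannot pile onto one instance when the load signal is delayed.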
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 16:17:41 +08:00