Gahow Wang
fc92410ec9
Invalidate prior A/B results + add proper experiment harness
Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.
Fair same-machine comparison (both fresh restart on dash0):
Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200
Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200
Elastic is WORSE due to Mooncake kv_both memory overhead.
Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:54:21 +08:00
..
2026-05-22 00:44:22 +08:00
2026-05-22 13:50:25 +08:00
2026-05-22 01:27:22 +08:00
2026-05-22 10:15:08 +08:00
2026-05-22 17:54:21 +08:00