agentic-kvc

Gahow Wang fc92410ec9 Invalidate prior A/B results + add proper experiment harness

The prior cross-machine comparison (commit 1e86285) was invalid: the
dash0 baseline ran on warm instances with residual KV cache, inflating
TTFT by 2x. Evidence: inst_7 reported APC=68.3%, which is impossible
from 25 cold-start requests; warm TTFT p90=3.3s vs fresh=0.26s.

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE on TPOT90, E2E50, and success rate (TTFT50 is slightly
better) due to Mooncake kv_both memory overhead.
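
The gap between the two runs is easier to read as relative deltas. A
minimal sketch that recomputes them from the figures above; the
dictionary layout and the helper name `rel_delta` are illustrative and
not part of the harness:

```python
# Figures copied from the same-machine comparison above.
baseline = {"TTFT50": 1.075, "TPOT90": 0.076, "E2E50": 5.075, "ok": 198, "total": 200}
elastic = {"TTFT50": 1.018, "TPOT90": 0.085, "E2E50": 6.977, "ok": 195, "total": 200}


def rel_delta(new: float, old: float) -> float:
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


for metric in ("TTFT50", "TPOT90", "E2E50"):
    print(f"{metric}: {rel_delta(elastic[metric], baseline[metric]):+.1f}%")
# TTFT50: -5.3%
# TPOT90: +11.8%
# E2E50: +37.5%

for name, run in (("baseline", baseline), ("elastic", elastic)):
    print(f"{name} success rate: {run['ok'] / run['total']:.1%}")
# baseline success rate: 99.0%
# elastic success rate: 97.5%
```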

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test
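
The proxy-side fixes listed above can be sketched roughly as follows.
This is an illustration under stated assumptions, not the code in
cache_aware_proxy.py: `InflightSlot`, `remote_prefill`,
`colocated_prefill`, and `prefill_with_fallback` are hypothetical names.
Only the 120s timeout and the co-located fallback come from the change
list. The slot makes `release()` idempotent so that overlapping cleanup
paths decrement the counter only once.

```python
import asyncio
from typing import Awaitable, Callable

PREFILL_TIMEOUT_S = 120.0  # timeout value named in the change list


class InflightSlot:
    """Tracks one request against a shared in-flight counter (hypothetical)."""

    def __init__(self, counters: dict, instance: str) -> None:
        self._counters = counters
        self._instance = instance
        self._released = False
        counters[instance] = counters.get(instance, 0) + 1

    def release(self) -> None:
        # Idempotent: a second call is a no-op, so overlapping cleanup
        # paths cannot decrement the counter twice.
        if self._released:
            return
        self._released = True
        self._counters[self._instance] -= 1


async def prefill_with_fallback(
    remote_prefill: Callable[[], Awaitable[str]],
    colocated_prefill: Callable[[], Awaitable[str]],
    slot: InflightSlot,
    timeout_s: float = PREFILL_TIMEOUT_S,
) -> str:
    """Try the offloaded prefill; on timeout, run prefill co-located."""
    try:
        return await asyncio.wait_for(remote_prefill(), timeout=timeout_s)
    except asyncio.TimeoutError:
        slot.release()  # give the remote slot back before falling back
        return await colocated_prefill()
    finally:
        slot.release()  # safe even if the timeout path already released
```

Calling `release()` from both the timeout handler and the `finally`
block is deliberate here: it mirrors the kind of overlapping cleanup
that produces a double decrement when the release is not guarded.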

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 17:54:21 +08:00

analysis

Invalidate prior A/B results + add proper experiment harness

2026-05-22 17:54:21 +08:00

patches

Add vLLM patches directory for version-controlled patch management

2026-05-22 00:26:14 +08:00

replayer

Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

2026-05-22 13:50:25 +08:00

scripts

Invalidate prior A/B results + add proper experiment harness