agentic-kvc

Files

Gahow Wang fc92410ec9 Invalidate prior A/B results + add proper experiment harness

Prior cross-machine comparison (commit 1e86285) was invalid: dash0
baseline used warm instances with residual KV cache, inflating TTFT
by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start
requests; WARM TTFT p90=3.3s vs fresh=0.26s.
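
A guard against the warm-instance problem described above is to confirm that GPUs hold no leftover memory before a run starts. The sketch below is only an illustration of such a check, not the contents of bench.sh: the helper names and the 64 MiB threshold are assumptions, and it relies on the `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits` query returning one integer per GPU.

```python
import subprocess


def parse_memory_used_mib(nvidia_smi_output):
    """Parse one integer (MiB) per non-empty line of nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def gpus_look_fresh(used_mib, max_used_mib=64):
    """True if every GPU is at or below the threshold (threshold is an assumed value)."""
    return all(used <= max_used_mib for used in used_mib)


def query_memory_used_mib():
    """Ask nvidia-smi for per-GPU memory use; raises if the command fails."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    ).stdout
    return parse_memory_used_mib(out)
```

Low GPU memory only shows that earlier server processes are gone; it says nothing about an instance that is still running with a populated cache, so restarting instances before each run remains necessary.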

Fair same-machine comparison (both fresh restart on dash0):
  Baseline:    TTFT50=1.075  TPOT90=0.076  E2E50=5.075  OK=198/200
  Elastic P2P: TTFT50=1.018  TPOT90=0.085  E2E50=6.977  OK=195/200
Elastic is WORSE on TPOT90, E2E50, and success count (TTFT50 is only
marginally better), due to Mooncake kv_both memory overhead.
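
The relative differences implied by the two result rows above can be worked out directly. This short sketch only restates the reported numbers; the dictionary layout and the `rel_change` helper are illustrative and not part of the repository.

```python
# Numbers copied from the same-machine comparison above.
baseline = {"TTFT50": 1.075, "TPOT90": 0.076, "E2E50": 5.075, "ok": 198, "total": 200}
elastic = {"TTFT50": 1.018, "TPOT90": 0.085, "E2E50": 6.977, "ok": 195, "total": 200}


def rel_change(new, old):
    """Percentage change of `new` relative to `old` (negative means lower)."""
    return (new - old) / old * 100.0


deltas = {
    metric: round(rel_change(elastic[metric], baseline[metric]), 1)
    for metric in ("TTFT50", "TPOT90", "E2E50")
}
success = {
    "baseline": baseline["ok"] / baseline["total"] * 100.0,
    "elastic": elastic["ok"] / elastic["total"] * 100.0,
}

for metric, pct in deltas.items():
    print(f"{metric}: {pct:+.1f}% vs baseline")
print(f"success: baseline {success['baseline']:.1f}%, elastic {success['elastic']:.1f}%")
```

With these inputs, TTFT50 is about 5% lower for Elastic P2P, while TPOT90 is about 12% higher and E2E50 about 37% higher.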

Changes:
- REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata
- pd_separation_analysis.md: update elastic TL;DR with correct numbers
- cache_aware_proxy.py: fix double-decrement bugs in offload path,
  add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK)
- bench.sh: standardized experiment harness with guaranteed GPU cleanup
  and fresh-state verification (nvidia-smi check before start)
- run_elastic_stability_test.sh: two-phase elastic vs baseline test
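
The cache_aware_proxy.py items above (a prefill timeout with a co-located fallback, and avoiding double decrements of a load counter) can be illustrated with a minimal sketch. This is not the proxy's code: `InflightCounter`, `prefill_with_fallback`, and the `"colo_fallback"` label are hypothetical names, and only the 120 s value comes from the commit message.

```python
import asyncio

PREFILL_TIMEOUT_S = 120.0  # timeout value taken from the commit message


class InflightCounter:
    """Hypothetical per-instance load counter handing out release-once tickets."""

    def __init__(self):
        self.count = 0

    def acquire(self):
        self.count += 1
        return _Ticket(self)


class _Ticket:
    """Releasing twice is a no-op, so error paths cannot double-decrement."""

    def __init__(self, counter):
        self._counter = counter
        self._released = False

    def release(self):
        if self._released:
            return
        self._released = True
        self._counter.count -= 1


async def prefill_with_fallback(remote_prefill, colocated_prefill, counter,
                                timeout_s=PREFILL_TIMEOUT_S):
    """Try the offloaded prefill; on timeout, fall back to co-located prefill."""
    ticket = counter.acquire()
    try:
        result = await asyncio.wait_for(remote_prefill(), timeout=timeout_s)
        return "offloaded", result
    except asyncio.TimeoutError:
        ticket.release()  # free the remote slot before running locally
        result = await colocated_prefill()
        return "colo_fallback", result
    finally:
        ticket.release()  # safe even if the timeout path already released
```

Tying the decrement to a ticket rather than to each code path means the counter changes exactly once per request, whichever branch finishes it.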

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 17:54:21 +08:00

adaptive_prefill_offload_design.md

Design doc: Adaptive Prefill Offload

2026-05-22 00:44:22 +08:00

elastic_offload_design.md

Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080)

2026-05-22 13:50:25 +08:00

kv_lifecycle_design.md

KV cache lifecycle design + eviction loss analysis

2026-05-22 01:27:22 +08:00

overnight_work_report.md

Update report: adaptive v2 confirms no KV transfer helps single-machine

2026-05-22 10:15:08 +08:00

pd_separation_analysis.md

Invalidate prior A/B results + add proper experiment harness

2026-05-22 17:54:21 +08:00