agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	6a27f75337	Docs: reconcile routing docs with current hybrid direction Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit `255c8e6`). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after `cc6e562` / `4c583f2`; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:47:14 +08:00
Gahow Wang	cdf83493ab	Fix A+C: real cache sync + cached-prefill-on-C architecture A: Add /estimate_hit endpoint to bootstrap server for real-time cache probing. Proxy queries this before committing to PUSH, eliminating 24% zero-match PUSH requests (shadow cache divergence). C: Add _handle_cached_prefill_offload: C (cache source) does fast cached prefill → KV to Mooncake → D pulls and decodes. Replaces broken direct_read PUSH where D waited for RDMA transfer while occupying KV blocks without doing compute. Also: update §3.9 baseline to plain vLLM with full mean/p50/p90/p99. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:22:38 +08:00
Gahow Wang	2b9eae0d54	Report §3.9: Unified routing final results — TTFT -25%, E2E -7% 850/850, 0 errors. Single argmin(latency) with soft affinity. 116 PUSH_MIGRATE (all with cache, avg 25k tokens), 723 LOCAL. TPOT p90 +15% tradeoff from kv_both overhead. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 03:15:32 +08:00
Gahow Wang	1cd0a18e2c	Report §3.8: Document direct KV cache migration architecture + bugs fixed Complete documentation of bootstrap-triggered PUSH implementation: hash table sync, token-based lookup, RDMA WRITE path, cost model, PYTHONHASHSEED requirement, and all 6 bugs fixed during development. Verified: 640/640 blocks pushed, External APC 80%, TTFT 0.367s (vs local cache 0.338s, +0.03s overhead). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:52:38 +08:00
Gahow Wang	4f93bb5b8a	Report §3.8: Direct RDMA read results — HEAVY TTFT -70%, TPOT p90 -38% D reads C's cached KV blocks via batch_transfer_sync_read, bypassing C's scheduler entirely. 65/318 HEAVY requests offloaded. HEAVY_OFFLOAD TTFT: 3.40s vs HEAVY_COLO 11.21s (-70%) Overall TPOT p90: 0.100 vs baseline 0.162 (-38%) kv_both mode has 67.5% error rate (Mooncake instability), but 276 successful requests show strong performance improvement. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 22:56:16 +08:00
Gahow Wang	0958823cdb	REPORT: add §1.1 errata flagging superseded sections (S3) Calls out that §3.1 (old random sampler, time-scale compression, 1 req/GPU cap) and the early elastic v3 warm-vs-fresh runs are no longer current, and that the "--max-inflight-sessions 64+" next-step text refers to a flag that was removed and must be restored per FIXES.md §B2 before those numbers can be reproduced. Points readers at §3.6/§3.7 as authoritative.	2026-05-23 20:58:38 +08:00
Gahow Wang	9835d6af5d	Elastic PS eval: near-neutral, offload gate triggers only 14% of HEAVY Root cause: 75% of HEAVY requests are cold (cache_ratio=0%), failing the cache_ratio>=0.3 gate. Only 17/118 HEAVY offloaded, insufficient to reduce prefill-decode interference. Offloaded requests are 50% SLOWER due to P-side queuing (14.7s) + RDMA overhead (5.7s). Interference IS real: 89% of WARM/MEDIUM have 1+ concurrent HEAVY prefill. But elastic PS in current form can't address it because cold HEAVY prefills (the majority) can't benefit from offload. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 16:49:25 +08:00
Gahow Wang	bf037594c4	Production-realistic baseline: APC 67.5%, TPOT +139% from interference Updated methodology: - Window+thin sampling preserves cross-session sharing (48% vs 16%) - --max-single-turn-ratio 0.3 boosts multi-turn to 70% - --window-seconds 600 for 10-min contiguous window - Trace-driven replay (no session limit, no time compression) - Daily config: --requests 850 (~13 min, APC~76%) Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup), confirming prefill-decode interference is real at production concurrency. APC 67.5% (vs 44%) from better KV reuse preservation. Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session (was incorrectly reported as 91% / 9%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 15:44:34 +08:00
Gahow Wang	c8ba666517	Benchmark concurrency gap: 1 req/GPU is 10-15x below production Our --max-inflight-sessions 8 yields 1 req/GPU, masking prefill-decode interference that appears at 2/GPU (+38% TPOT) and would dominate at production load (~15/GPU). Updated §8 to re-evaluate elastic PS at production concurrency. Next step: --max-inflight-sessions 64 benchmark. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 12:16:20 +08:00
Gahow Wang	fefbd71ca9	GPU imbalance analysis + elastic PS verdict + corrected LMetric results Key findings: - Session-sticky imbalance is 8.6x at 200 req (small-sample artifact) but only 1.24x at 1000 req (moderate, TPOT unaffected) - Elastic PS not justified: interference reduction 0% at 1/GPU, migration reduces imbalance 1.24x→1.18x at 1.5s/event cost - Corrected LMetric (no affinity) matches Linear (sticky) on all metrics (<2%), proving soft affinity from cache-hit scoring works - Updated §3.4 errata, added §8 GPU imbalance + elastic PS analysis Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 12:11:23 +08:00
Gahow Wang	fc92410ec9	Invalidate prior A/B results + add proper experiment harness Prior cross-machine comparison (commit `1e86285`) was invalid: dash0 baseline used warm instances with residual KV cache, inflating TTFT by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start requests; WARM TTFT p90=3.3s vs fresh=0.26s. Fair same-machine comparison (both fresh restart on dash0): Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200 Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200 Elastic is WORSE due to Mooncake kv_both memory overhead. Changes: - REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata - pd_separation_analysis.md: update elastic TL;DR with correct numbers - cache_aware_proxy.py: fix double-decrement bugs in offload path, add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK) - bench.sh: standardized experiment harness with guaranteed GPU cleanup and fresh-state verification (nvidia-smi check before start) - run_elastic_stability_test.sh: two-phase elastic vs baseline test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:54:21 +08:00
Gahow Wang	e4fa56cb1e	LMetric routing policy (OSDI'26) + A/B results vs linear baseline Implement LMetric (P_tokens × BS multiplication score) from "Simple is Better" (Zhang et al., OSDI'26) as alternative routing policy for combined mode. Key changes: - cache_aware_proxy.py: add --policy {linear,lmetric} flag, track pending_prefill_tokens and num_requests per instance, /stats endpoint - run_lmetric_ab.sh: automated A/B script for fair comparison Results (200 req, fresh restart, same trace): Linear: TTFT50=1.086 TPOT90=0.077 E2E50=5.423 LMetric: TTFT50=1.099 TPOT90=0.073 E2E50=5.205 Delta: TTFT +1.2% TPOT -5.9% E2E -4.0% LMetric improves TPOT/E2E modestly through better load balancing, but routing policy headroom is limited vs elastic P2P offload (-44% E2E). TODO: vLLM → Redis → router pipeline for exact state ablation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 16:57:32 +08:00
Gahow Wang	2b0ac70ee7	Phase 1 milestone: system-level analysis + reproducible report - REPORT.md: self-contained milestone report covering baseline vs elastic setup, exact launch commands, benchmark params, results, log locations, and repo structure — sufficient for anyone to reproduce - analysis/pd_separation_analysis.md §5: elastic P2P system-level breakdown (KV cache hit ratio, per-class TTFT, GPU util paradox explanation) - scripts/cache_aware_proxy.py: round-robin P-instance selection replacing argmin(ongoing_tokens) to fix GPU load imbalance (3.0x → expected ~2x) - scripts/launch_elastic_p2p.sh: one-command launch for elastic P2P config Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 16:17:41 +08:00

13 Commits
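
The log describes the routing score only in words: LMetric as a "P_tokens × BS multiplication score" (e4fa56cb1e), with P_tokens including new_uncached_tokens (6a27f75337), and the current hybrid as "LMetric base + cache_ratio>0.5 affinity gate + tie-breaker". The sketch below is one possible reading of those phrases, not the code in cache_aware_proxy.py. The `Instance` class, the `cached_tokens` field, counting the incoming request in the batch term, the way the gate narrows the candidate set, and the tie-break order are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Instance:
    """Hypothetical per-instance state; field names follow the commit log where possible."""
    name: str
    pending_prefill_tokens: int  # prefill tokens already queued on this instance
    num_requests: int            # requests currently in flight (the "BS" term)
    cached_tokens: int           # assumed: prompt tokens of this request already cached here


def lmetric_score(inst: Instance, prompt_tokens: int) -> int:
    """P_tokens x BS, where P_tokens includes the request's uncached tokens."""
    new_uncached_tokens = max(prompt_tokens - inst.cached_tokens, 0)
    p_tokens = inst.pending_prefill_tokens + new_uncached_tokens
    batch_size = inst.num_requests + 1  # assumption: count the incoming request
    return p_tokens * batch_size


def pick_instance(instances: List[Instance], prompt_tokens: int,
                  affinity_gate: Optional[float] = 0.5) -> Instance:
    """Pick the lowest-scoring instance.

    With affinity_gate=None this is pure LMetric (no explicit affinity).
    With a gate, instances whose cache_ratio exceeds it are preferred
    (assumed interpretation of the "cache_ratio>0.5 affinity gate").
    """
    candidates = instances
    if affinity_gate is not None and prompt_tokens > 0:
        warm = [i for i in instances
                if i.cached_tokens / prompt_tokens > affinity_gate]
        if warm:
            candidates = warm
    # Assumed tie-breaker: fewer in-flight requests, then name for determinism.
    return min(candidates,
               key=lambda i: (lmetric_score(i, prompt_tokens), i.num_requests, i.name))


if __name__ == "__main__":
    fleet = [
        Instance("gpu0", pending_prefill_tokens=1000, num_requests=2, cached_tokens=0),
        Instance("gpu1", pending_prefill_tokens=1000, num_requests=2, cached_tokens=6000),
        Instance("gpu2", pending_prefill_tokens=0, num_requests=0, cached_tokens=0),
    ]
    print(pick_instance(fleet, 8000, affinity_gate=None).name)  # pure LMetric
    print(pick_instance(fleet, 8000).name)                      # hybrid with gate
```

Under these assumptions, two equally loaded instances already differ in score when one holds part of the prompt, which is the "implicit soft affinity" the log attributes to pure LMetric. The gate matters only when an idle cold instance would otherwise outscore a busy warm one.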