agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	6a27f75337	Docs: reconcile routing docs with current hybrid direction Per analysis/unified_routing_fix_review.md #2, several docs still presented the retired single-argmin + PUSH-migration design as the final algorithm. Mark them superseded and document the current hybrid direction (commit `255c8e6`). - REPORT.md §1.1 / §3.9: add errata callout and section header noting the "Final Design" framing was retired after `cc6e562` / 4c583f2; point readers to docs/migration-policy-design.md. - docs/migration-policy-design.md: rewrite. Opens with the current hybrid algorithm (LMetric base + cache_ratio>0.5 affinity gate + tie-breaker), then a "What Was Retired" commit table, then the old Approach A numbers preserved as "Historical Baseline-Mode Comparison". - analysis/research_findings.md §2.2 / §5: correct the LMetric framing. LMetric isn't "neutralized by affinity constraints" (pure --policy lmetric has no affinity at all); it converges to similar placements because P_tokens includes new_uncached_tokens, giving it implicit soft affinity. - analysis/elastic_hypotheses.md: same LMetric correction in the "DOESN'T work" summary, plus a footer cross-referencing the current routing direction. - analysis/unified_routing_fix_review.md: track this file (was untracked); it is the review handoff cited from the updated docs. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 10:47:14 +08:00
Gahow Wang	baf7ffb08c	16-session contention: TPOT +45% from prefill-decode interference Key finding: at 16 concurrent sessions (2 per GPU), TPOT p90 degrades from 0.073 to 0.106 (+45%), with MEDIUM TPOT at 0.197 (+149%). This is the first time we've reproduced real prefill-decode interference in controlled experiments. Elastic RDMA at 16 sessions doesn't help: only 13/500 offloaded (cache-gate correct for cold turn-1), kv_both adds ~16% TPOT overhead at high concurrency. Load scaling: 1000req_ts20, 200req_ts10, 200req_ts5, 500req_ts10 all show ~30% GPU util at 8 sessions. The bottleneck is max_inflight_sessions, not arrival rate. Updated elastic_hypotheses.md with H8, H9, and comprehensive final analysis. The real bottleneck is vLLM's chunked prefill scheduling, not routing or PD disaggregation. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 05:51:47 +08:00
Gahow Wang	85b230455e	H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU imbalance (~3.5-4x across all settings). Root cause: imbalance is from workload skew at session placement (turn 1), not from routing at turn 2+. H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x), and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded requests have bimodal transfer times (0.6s or 18-31s) that negate the routing benefit. Updated elastic_hypotheses.md with H7 results and next directions: higher load experiments where contention amplifies routing differences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 03:04:02 +08:00
Gahow Wang	098d86385a	Add elastic hypotheses tracking doc with H1-H6 analysis Tracks all hypotheses tested during elastic PD disaggregation research: - H1 (kv_both overhead): REJECTED — zero overhead at idle - H2 (PS cold prefill): REJECTED — PS slower than cached C - H3 (C_s+flexD): PARTIALLY VALIDATED — E2E -9% but HEAVY p90 +117% - H4 (cache-aware offload): TODO — only offload high-cache-hit HEAVY - H5 (RDMA overhead): TODO — Mooncake lacks layerwise transfer - H6 (session migration): TODO — verify D's APC after migration Key insight: offload decision should be cache-aware (new_tokens), not size-based (total_input). 80k request with 90% cache = 8k prefill. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 01:17:12 +08:00
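
Note on 098d86385a: the "key insight" in that message (decide offload on new_tokens, not total_input) can be illustrated with the message's own arithmetic. The following is a minimal sketch, not the repository's code. The `Request` type, the function names, and the 16k-token threshold are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request descriptor; field names are assumptions."""
    total_input_tokens: int  # full prompt length
    cached_tokens: int       # tokens already covered by the prefix cache


def new_tokens(req: Request) -> int:
    """Tokens that still need prefill after the cache hit."""
    return req.total_input_tokens - req.cached_tokens


def offload_size_based(req: Request, threshold: int) -> bool:
    """Size-based rule: looks only at total input length."""
    return req.total_input_tokens >= threshold


def offload_cache_aware(req: Request, threshold: int) -> bool:
    """Cache-aware rule: looks only at the uncached prefill work."""
    return new_tokens(req) >= threshold


# Example from the commit message: 80k input with a 90% cache hit.
THRESHOLD = 16_000  # hypothetical cutoff, not taken from the repository
warm = Request(total_input_tokens=80_000, cached_tokens=72_000)
cold = Request(total_input_tokens=80_000, cached_tokens=0)

print(new_tokens(warm))                      # 8000 tokens of real prefill
print(offload_size_based(warm, THRESHOLD))   # True: flagged on size alone
print(offload_cache_aware(warm, THRESHOLD))  # False: prefill work is small
print(offload_cache_aware(cold, THRESHOLD))  # True: a cold 80k prompt is heavy
```

Under these assumptions the two rules agree on the cold request and disagree on the warm one, which is the case the commit message singles out.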

4 Commits