agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	b2ede1da77	bench.sh: add trap for graceful cleanup on kill/interrupt. Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor processes are cleaned up even when bench.sh is killed externally. Also includes gpu_monitor in cleanup_gpu pattern matching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 20:24:13 +08:00
Gahow Wang	bf037594c4	Production-realistic baseline: APC 67.5%, TPOT +139% from interference. Updated methodology: window+thin sampling preserves cross-session sharing (48% vs 16%); --max-single-turn-ratio 0.3 boosts multi-turn to 70%; --window-seconds 600 for a 10-min contiguous window; trace-driven replay (no session limit, no time compression); daily config: --requests 850 (~13 min, APC~76%). Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup), confirming prefill-decode interference is real at production concurrency. APC 67.5% (vs 44%) from better KV reuse preservation. Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session (was incorrectly reported as 91% / 9%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 15:44:34 +08:00
Gahow Wang	4089ffd63f	Fix replay methodology: trace-driven dispatch, no artificial limits. The replayer was artificially limiting concurrency with --max-inflight-sessions (semaphore) and --time-scale (time compression), producing an unrealistically low 1 req/GPU load that masked prefill-decode interference. Replayer changes: remove session_sem and time_scale entirely; each request is dispatched at its trace timestamp exactly; sessions are still sequential (turn N+1 waits for turn N completion); if a turn completes late, the next turn fires immediately. Sampler changes: add --sample-ratio for GPU-proportional session sampling; keep --target-requests for backwards compat; no time compression (preserve original arrival pattern). bench.sh: remove --time-scale and --max-inflight-sessions args. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 12:43:41 +08:00
Gahow Wang	3594f7dce0	Fix LMetric routing: remove session affinity, align with OSDI'26 spec. LMetric was incorrectly sharing session-sticky logic with the Linear policy. Fixed to pure per-request routing: score = P_tokens × BS, where P_tokens = pending_prefill + (input - cache_hit) and BS = num_requests. Experiment result (200 req, fresh restart): Linear vs corrected LMetric show <2% difference on all metrics; LMetric's cache-hit estimation provides implicit soft affinity that preserves locality without explicit session stickiness. Also fix bench.sh missing cd (replayer module not found from non-project cwd) and rewrite run_lmetric_ab.sh as a thin wrapper around bench.sh to eliminate duplicated launch/cleanup logic that broke under set -euo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 11:56:58 +08:00
Gahow Wang	080a8fa138	Chunk-size ablation + comprehensive synthesis. max_num_batched_tokens sweep at 16 sessions (2048/4096/8192/16384): default 8192 has best overall TPOT p90 (0.106) and E2E p50 (5.83); 16384 gives HEAVY TTFT -16%, HEAVY TPOT -17%, but is worse overall (+18%); smaller chunks (2048/4096) are always worse (scheduler overhead). bench.sh now supports the --max-batched-tokens flag. Updated elastic_hypotheses.md with H8 (high concurrency validated), H9 (elastic RDMA at 16s rejected), and final synthesis. Key conclusion: for agentic workloads, the dominant optimization is cache-aware session-sticky routing (-60% TTFT, +24pp APC vs RR). None of PD-Sep, LMetric, elastic RDMA, or chunk-size tuning provides additional benefit beyond well-tuned routing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 07:15:02 +08:00
Gahow Wang	85b230455e	H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling. H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU imbalance (~3.5-4x across all settings). Root cause: imbalance is from workload skew at session placement (turn 1), not from routing at turn 2+. H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x), and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded requests have bimodal transfer times (0.6s or 18-31s) that negate the routing benefit. Updated elastic_hypotheses.md with H7 results and next directions: higher load experiments where contention amplifies routing differences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 03:04:02 +08:00
Gahow Wang	3bc37cc6d5	PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix. Experiments run: Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise); PS V1 (cold prefill): REJECTED, PS always slower than cached C; PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s), PS bottleneck; V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal; H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD TTFT=11.5s due to RDMA, while HEAVY_COLO improved 10.5% from better balance; H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s). Key findings: Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead; 92% of HEAVY are turn-1 cold → offloading cold requests always loses; GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT); RDMA transfer negates the routing benefit for offloaded requests. Code changes: bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark); cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing; mooncake_connector.py: elif→if fix (allow dual prefill+decode flags). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 02:14:37 +08:00
Gahow Wang	fc92410ec9	Invalidate prior A/B results + add proper experiment harness. Prior cross-machine comparison (commit `1e86285`) was invalid: dash0 baseline used warm instances with residual KV cache, inflating TTFT by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start requests; WARM TTFT p90=3.3s vs fresh=0.26s. Fair same-machine comparison (both fresh restart on dash0): Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200; Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200. Elastic is WORSE due to Mooncake kv_both memory overhead. Changes: REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata; pd_separation_analysis.md: update elastic TL;DR with correct numbers; cache_aware_proxy.py: fix double-decrement bugs in offload path, add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK); bench.sh: standardized experiment harness with guaranteed GPU cleanup and fresh-state verification (nvidia-smi check before start); run_elastic_stability_test.sh: two-phase elastic vs baseline test. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:54:21 +08:00
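
The per-request LMetric score described in commit 3594f7dce0 (score = P_tokens × BS) can be illustrated with a minimal sketch. This is not the code in cache_aware_proxy.py: the `InstanceState` class, the function names, and the rule that the lowest score wins are all assumptions made for illustration; the commit message gives only the formula.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical per-instance view held by the router (names are assumptions)."""
    name: str
    pending_prefill: int  # prefill tokens already queued on this instance
    num_requests: int     # BS: requests currently running on this instance


def lmetric_score(inst: InstanceState, input_tokens: int, cache_hit_tokens: int) -> int:
    """score = P_tokens * BS, with P_tokens = pending_prefill + (input - cache_hit)."""
    p_tokens = inst.pending_prefill + (input_tokens - cache_hit_tokens)
    return p_tokens * inst.num_requests


def pick_instance(instances, input_tokens, cache_hits):
    """Pick the instance with the lowest score (assumed: lower means less loaded).

    cache_hits maps instance name -> estimated cached prefix tokens for this request.
    """
    return min(
        instances,
        key=lambda inst: lmetric_score(inst, input_tokens, cache_hits.get(inst.name, 0)),
    )
```

Because the cache-hit estimate lowers P_tokens on the instance that already holds a session's prefix, a rule of this shape tends to send later turns back to the same instance without any explicit stickiness, which matches the "implicit soft affinity" the commit describes. Note that in this sketch an idle instance (BS = 0) always scores 0; whether the real policy counts the incoming request in BS is not stated in the log.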

8 Commits
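
The trace-driven dispatch rule from commit 4089ffd63f (each request fires at its trace timestamp, turns within a session stay sequential, and a late turn is followed immediately by the next one) can be sketched as follows. This is an illustrative sketch only, not the repository's replayer: `replay_session`, `replay`, the `send` callback, and the `(offset_seconds, payload)` trace layout are hypothetical.

```python
import asyncio
import time


async def replay_session(turns, send, t0):
    """Replay one session's turns in order.

    turns: list of (offset_seconds, payload), sorted by offset (assumed layout).
    send:  async callable that issues one request and returns on completion.
    t0:    monotonic start time shared by all sessions.
    """
    for offset, payload in turns:
        delay = (t0 + offset) - time.monotonic()
        if delay > 0:
            # On schedule: wait until this turn's trace timestamp.
            await asyncio.sleep(delay)
        # If the previous turn finished late, delay <= 0 and we fire immediately.
        await send(payload)  # the next turn waits for this one to complete


async def replay(sessions, send):
    """Run all sessions concurrently with no semaphore and no time scaling."""
    t0 = time.monotonic()
    await asyncio.gather(
        *(replay_session(turns, send, t0) for turns in sessions.values())
    )
```

With no cap on in-flight sessions, concurrency is determined entirely by the trace's arrival pattern and by how long each request takes, which is the property the commit relies on to expose prefill-decode interference.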