agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	5816aad731	A3: vLLM scheduler patch for step-level JSONL log When AGENTIC_STEP_LOG_PATH is set, the scheduler emits one JSONL line per scheduler step with t_unix, worker_id, prefill/decode token counts, n_running/n_waiting, preempted ids, and per-request phase labels. No-op when the env var is unset, so production engines are not impacted. bench.sh now threads AGENTIC_STEP_LOG_DIR through to each per-engine launch so step logs end up at engine_${i}.jsonl. Required by Batch 2 (PD-colo interference index) and Batch 5 (same-worker overlap attribution); engine /metrics polling cannot provide per-step granularity. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:11 +08:00
Gahow Wang	21ffb3d4f7	PD-sep matrix infrastructure: bench.sh pdsep mode + matrix driver Adds the experiment harness that gates the empirical claims (C2/C3/C4/C5) in the PD-sep paper section. Three pieces: 1. scripts/bench.sh: new --mode pdsep with --pd-ratio P:D, and an --eager flag to re-enable --enforce-eager for the cuda-graph ablation. pdsep reuses the elastic-mode Mooncake kv_both launch and swaps the proxy command from --combined to --prefill/--decode. baseline and elastic flows are unchanged. 2. analysis/pd_sep_paper_section/scripts/bench_pd_matrix.sh: matrix driver that runs {combined-ca, pdsep-4p4d, pdsep-6p2d} x cudagraph x 3 seeds by default (~2 h on dash0). --with-rr adds combined-rr; --with-eager doubles to ~5 h with the cuda-graph ablation. Skips completed runs, captures per-instance vLLM logs (needed for C3 step-level KV-utilization mining). 3. fig_kv_memory_wall.pdf: empirical anchor (star) at REPORT.md §3.3's observed 6P+2D 97% KV utilization. The marker lands on the model's predicted curve at p90 input, confirming the steady-state analysis. README updated with the run command, output layout, and the followup plotters that consume outputs/pd_matrix/. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:47:33 +08:00
Gahow Wang	d71a111099	Paper section: PD-sep scaffold + drop --enforce-eager from launch scripts Adds analysis/pd_sep_paper_section/ as the home for the "PD separation is net negative under agentic workloads" paper section: plot scripts for C1 (workload chars), C6 (roofline), C7 (routing-vs-PD-sep lever), the C6/C7 PDFs already rendered, and a README mapping candidate claims to required figures plus open re-run items. Removes --enforce-eager from bench.sh and all active launch scripts so cuda graphs are captured -- the prior methodology suppressed one of PD-sep's structural advantages (D-node fixed-shape decode). Legacy scripts under scripts/legacy/ are intentionally untouched as historical records. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 11:24:16 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	bf76273778	Add --offload-mode switch for ablation (direct_read vs cached_prefill) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 11:24:15 +08:00
Gahow Wang	7e91b83d88	Set PYTHONHASHSEED=42 for elastic mode to ensure consistent block hashes Root cause confirmed: NONE_HASH = os.urandom(32) differs between scheduler and bootstrap server even in the same process (init_none_hash called separately by each import path). PYTHONHASHSEED makes it deterministic: NONE_HASH = hash_fn(seed), same across all code paths. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 01:27:52 +08:00
Gahow Wang	c843f2e3db	proxy: Settings dataclass + cache-ratio gate + P-pick offload penalty (B4, M2, M3, D5) - Replace mutable module constants (HEAVY_THRESHOLD/OVERLOAD_FACTOR/ MAX_OFFLOAD_INFLIGHT/PREFILL_THROUGHPUT/RDMA_OVERHEAD_S/ CACHE_CAPACITY_BLOCKS) with a Settings dataclass + SETTINGS singleton. __main__ now mutates SETTINGS so CLI overrides survive even when the module is imported as a library (e.g. by tests/) (D5). - Add --max-offload-inflight CLI flag (M3) and read it from SETTINGS. - Add --cache-gate-ratio CLI flag and a real gate before the cost-model branch: if cache_hit/input_length < ratio, mark cache_gate_REASON and fall back to colocated. cache_ratio is no longer a write-only field (B4). - P candidate selection penalises instances already running offloaded HEAVY prefills, so back-to-back HEAVY requests don't pile onto the same P (M2). - bench.sh forwards --max-offload-inflight / --cache-gate-ratio to the proxy. - Tests cover SETTINGS knobs + the heavy_threshold-driven P-offload penalty.	2026-05-23 21:11:17 +08:00
Gahow Wang	c64b0b39c7	bench.sh: fix stale MODEL and TRACE defaults (B6) The default MODEL pointed at /home/admin/cpfs/... which never existed on the public dev machines (other launch_*.sh and TODO.md use $HOME/models), and the default TRACE pointed at traces/sampled_1000req_seed42.jsonl which was deleted when the sampler moved to window+thin output. Update both to the values the rest of the repo already standardized on.	2026-05-23 20:56:40 +08:00
Gahow Wang	b2ede1da77	bench.sh: add trap for graceful cleanup on kill/interrupt Added EXIT/INT/TERM traps to ensure vLLM, proxy, and gpu_monitor processes are cleaned up even when bench.sh is killed externally. Also includes gpu_monitor in cleanup_gpu pattern matching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 20:24:13 +08:00
Gahow Wang	bf037594c4	Production-realistic baseline: APC 67.5%, TPOT +139% from interference Updated methodology: - Window+thin sampling preserves cross-session sharing (48% vs 16%) - --max-single-turn-ratio 0.3 boosts multi-turn to 70% - --window-seconds 600 for 10-min contiguous window - Trace-driven replay (no session limit, no time compression) - Daily config: --requests 850 (~13 min, APC~76%) Key result: TPOT p90=0.175s (vs 0.073s in legacy 1-req/GPU setup), confirming prefill-decode interference is real at production concurrency. APC 67.5% (vs 44%) from better KV reuse preservation. Also fixed KV reuse breakdown: 62% intra-session / 38% cross-session (was incorrectly reported as 91% / 9%). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 15:44:34 +08:00
Gahow Wang	4089ffd63f	Fix replay methodology: trace-driven dispatch, no artificial limits The replayer was artificially limiting concurrency with --max-inflight-sessions (semaphore) and --time-scale (time compression), producing unrealistically low 1 req/GPU load that masked prefill-decode interference. Replayer changes: - Remove session_sem and time_scale entirely - Each request dispatched at its trace timestamp exactly - Sessions still sequential (turn N+1 waits for turn N completion) - If turn completes late, next turn fires immediately Sampler changes: - Add --sample-ratio for GPU-proportional session sampling - Keep --target-requests for backwards compat - No time compression (preserve original arrival pattern) bench.sh: remove --time-scale and --max-inflight-sessions args Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 12:43:41 +08:00
Gahow Wang	3594f7dce0	Fix LMetric routing: remove session affinity, align with OSDI'26 spec. LMetric was incorrectly sharing session-sticky logic with the Linear policy. Fixed to pure per-request routing: score = P_tokens × BS, where P_tokens = pending_prefill + (input - cache_hit) and BS = num_requests. Experiment result (200 req, fresh restart): Linear and corrected LMetric differ by <2% on all metrics; LMetric's cache-hit estimation provides implicit soft affinity that preserves locality without explicit session stickiness. Also fix the missing cd in bench.sh (replayer module not found from a non-project cwd) and rewrite run_lmetric_ab.sh as a thin wrapper around bench.sh to eliminate duplicated launch/cleanup logic that broke under set -euo. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 11:56:58 +08:00
Gahow Wang	080a8fa138	Chunk-size ablation + comprehensive synthesis max_num_batched_tokens sweep at 16 sessions (2048/4096/8192/16384): - Default 8192 has best overall TPOT p90 (0.106) and E2E p50 (5.83) - 16384: HEAVY TTFT -16%, HEAVY TPOT -17%, but overall worse (+18%) - Smaller chunks (2048/4096) always worse (scheduler overhead) bench.sh now supports --max-batched-tokens flag. Updated elastic_hypotheses.md with H8 (high concurrency validated), H9 (elastic RDMA at 16s rejected), and final synthesis. Key conclusion: for agentic workloads, the dominant optimization is cache-aware session-sticky routing (-60% TTFT, +24pp APC vs RR). Neither PD-Sep, LMetric, elastic RDMA, nor chunk-size tuning provides additional benefit beyond well-tuned routing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 07:15:02 +08:00
Gahow Wang	85b230455e	H7 OVERLOAD_FACTOR sweep: negative result + H4 GPU profiling H7: Sweeping OVERLOAD_FACTOR (2.0/1.5/1.3/1.0) has no effect on GPU imbalance (~3.5-4x across all settings). Root cause: imbalance is from workload skew at session placement (turn 1), not from routing at turn 2+. H4 GPU profiling confirms: GPU balance improvement IS real (4.0x→2.0x), and it directly improves HEAVY_COLO TTFT by 10.5%. But RDMA-offloaded requests have bimodal transfer times (0.6s or 18-31s) that negate the routing benefit. Updated elastic_hypotheses.md with H7 results and next directions: higher load experiments where contention amplifies routing differences. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 03:04:02 +08:00
Gahow Wang	3bc37cc6d5	PS experiments + H4 cache-gate + GPU profiling + Mooncake elif→if fix Experiments run: - Phase 0: kv_both has zero idle overhead (TPOT +1.3%, noise) - PS V1 (cold prefill): REJECTED — PS always slower than cached C - PS V1+flexD: 92.5% OK, HEAVY TTFT 7.8s (baseline 5.0s) — PS bottleneck - V2 (C_s prefill + flexible D): E2E -9% but 6 errors, RDMA bimodal - H4 (cache-gate): 198/200 OK, GPU imbalance 4.0x→2.0x, but HEAVY_OFFLOAD TTFT=11.5s due to RDMA. HEAVY_COLO improved 10.5% from better balance. - H5: Mooncake RDMA transfer R²=0.095, bimodal (0.6s or 18-30s) Key findings: - Mooncake lacks layerwise KV transfer → RDMA is pure sequential overhead - 92% of HEAVY are turn-1 cold → offloading cold requests always loses - GPU balance improvement from routing IS real (-10.5% HEAVY_COLO TTFT) - RDMA transfer negates the routing benefit for offloaded requests Code changes: - bench.sh: add GPU timeline monitoring (gpu_monitor.sh during benchmark) - cache_aware_proxy.py: H4 cache-gate, flexible D, PS routing - mooncake_connector.py: elif→if fix (allow dual prefill+decode flags) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 02:14:37 +08:00
Gahow Wang	fc92410ec9	Invalidate prior A/B results + add proper experiment harness Prior cross-machine comparison (commit `1e86285`) was invalid: dash0 baseline used warm instances with residual KV cache, inflating TTFT by 2x. Evidence: inst_7 APC=68.3% impossible from 25 cold-start requests; WARM TTFT p90=3.3s vs fresh=0.26s. Fair same-machine comparison (both fresh restart on dash0): Baseline: TTFT50=1.075 TPOT90=0.076 E2E50=5.075 OK=198/200 Elastic P2P: TTFT50=1.018 TPOT90=0.085 E2E50=6.977 OK=195/200 Elastic is WORSE due to Mooncake kv_both memory overhead. Changes: - REPORT.md: rewrite §3-4 with corrected results, add §3.5 errata - pd_separation_analysis.md: update elastic TL;DR with correct numbers - cache_aware_proxy.py: fix double-decrement bugs in offload path, add 120s prefill timeout with co-located fallback (HEAVY_COLO_FALLBACK) - bench.sh: standardized experiment harness with guaranteed GPU cleanup and fresh-state verification (nvidia-smi check before start) - run_elastic_stability_test.sh: two-phase elastic vs baseline test Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:54:21 +08:00
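
The per-request LMetric score from commit 3594f7dce0 (pending prefill tokens plus uncached input tokens, multiplied by the number of requests on the instance) can be sketched as follows. This is a minimal illustration, not the proxy's actual code: the `InstanceState` class, its field names, and the `pick_instance` helper are hypothetical, and choosing the instance with the lowest score is an assumption, since the commit message gives only the formula.

```python
from dataclasses import dataclass


@dataclass
class InstanceState:
    """Hypothetical per-instance view as seen by the router."""
    name: str
    pending_prefill: int  # prefill tokens already queued on this instance
    num_requests: int     # BS in the commit message
    cache_hit: int        # estimated cached prefix tokens for this request


def lmetric_score(inst: InstanceState, input_tokens: int) -> int:
    # P_tokens = pending_prefill + (input - cache_hit)
    p_tokens = inst.pending_prefill + (input_tokens - inst.cache_hit)
    # score = P_tokens x BS
    return p_tokens * inst.num_requests


def pick_instance(instances: list[InstanceState], input_tokens: int) -> InstanceState:
    # Assumption: a lower score means less expected prefill contention.
    return min(instances, key=lambda inst: lmetric_score(inst, input_tokens))


instances = [
    InstanceState("inst_0", pending_prefill=1000, num_requests=2, cache_hit=0),
    InstanceState("inst_1", pending_prefill=200, num_requests=4, cache_hit=400),
]
chosen = pick_instance(instances, input_tokens=500)
```

Because `cache_hit` enters the score directly, an instance that already holds a session's prefix tends to win without any explicit stickiness, which matches the "implicit soft affinity" noted in that commit. Under this formula an instance with zero running requests scores 0 regardless of its cache state.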

16 Commits
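
The cache-ratio gate described in commit c843f2e3db (fall back to colocated when cache_hit/input_length is below the configured ratio, otherwise continue to the cost-model branch) can be sketched as below. The function name, the return labels, and the handling of empty inputs are assumptions made for illustration; only the comparison itself comes from the commit message.

```python
def cache_gate(cache_hit: int, input_length: int, gate_ratio: float) -> str:
    """Return "colocated" when the gate rejects offload, else "cost_model".

    Hypothetical helper: the labels are illustrative, not the proxy's own.
    """
    if input_length <= 0:
        # Assumption: treat an empty input as not eligible for offload.
        return "colocated"
    cache_ratio = cache_hit / input_length
    if cache_ratio < gate_ratio:
        # Too little of the prompt is cached: fall back to colocated.
        return "colocated"
    # Enough of the prompt is cached: let the cost model decide.
    return "cost_model"


cold = cache_gate(cache_hit=0, input_length=8000, gate_ratio=0.5)
warm = cache_gate(cache_hit=6000, input_length=8000, gate_ratio=0.5)
```

With a gate like this, a turn-1 request with no cached prefix never reaches the offload path, which is consistent with the finding in commit 3bc37cc6d5 that offloading cold requests always loses.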