agentic-kvc

Author	SHA1	Message	Date
Gahow Wang	8a6b22c11c	Replayer think-time dispatch mode + benchmarking guidance Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that agentic serving should be benchmarked with `thinktime` (the faithful load). - `tracets` (old default): turn-k at the absolute trace timestamp, i.e. max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the system is behind, manufacturing request bursts. - `thinktime`: turn-1 at trace arrival; turn-k at prev_finished + time_to_parent_chat (real production gap). scripts/add_time_to_parent.py annotates a trace with that gap from the raw trace's request_ready/end_ms. exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6 (saturated) they converge. So tracets makes the system look ~30% worse on tail latency than realistic agent pacing. Root README.md carries the headline guidance; raw per-request metrics gitignored (perf_summary.json kept). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-30 16:28:36 +08:00
Gahow Wang	48ae72467a	Replayer: closed-loop inter-turn think-time mode Add --inter-turn-think (env REPLAY_INTER_TURN_THINK_S): turn 1 fires on session admission, each later turn a FIXED think-time after the previous turn COMPLETES, ignoring absolute trace timestamps. Combined with --max-inflight-sessions (env REPLAY_MAX_INFLIGHT) this is a stable N-user closed loop, removing the open-loop "fire immediately because timestamp is in the past" retrigger artifact. Needed for the dispatch-coupling (wall-clock amplification) sweep. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>	2026-05-29 18:19:12 +08:00
Gahow Wang	52cdb80367	EAR outline: copy reusable figures, mark migration sections deferred - replayer/replay.py: emit trace_span_s and amplification in summary (Phase 1 of the wall-clock amplification measurement plan; needed for §2.3 dispatch coupling empirical closure) - figs/: 8 reusable figures copied from analysis/ with paper-spec names (f2a/b/c workload, f4a/b/c/d failure modes, f6 e2e partial) - PAPER_OUTLINE.md: real figure paths, explicit TBD markers for custom drawings and pending data; new "Validation Status" table at top and reorganized "Work Plan" splitting can-do-now vs migration-deferred Migration validation deferred per user: 4 prior attempts (`6b255fa`, e991960/5772149, `cc6e562`, `4c583f2`) were reverted due to transfer overhead; pending re-test on top of connector_tax DR-fix. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-27 01:44:13 +08:00
Gahow Wang	f42c715ec1	A4: open-loop session-causal SRR loadgen New replayer/srr.py drives a Poisson session-arrival load against the existing proxy, with strict per-session turn sequentiality, explicit warmup/steady/drain windows, and per-arrival fresh session_id + request_id so APC/session-affinity counters are not contaminated by repeated draws from the trace pool. Writes window_summary.json with attempted/completed/errored split by window so latency tails can be read on the steady-state window only. Required by Batch 4 SRR sweep; trace-timestamp dispatch in replay.py cannot drive arrival rate independently. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:19:20 +08:00
Gahow Wang	d57e338366	A1: replayer instrumentation for cross-process join RequestMetrics gains absolute unix timestamps (t_dispatch_unix, t_first_token_unix, t_finish_unix), the proxy_request_id, the chosen endpoint URL, and the trace hash_ids. Replayer sends X-Request-Id: <session_id>:<turn_id>:<chat_id>:<idx> so proxy breakdown rows can be joined to metrics by exact key. Required by Batch 0 (online sequentiality proof) and Batch 1 reuse decomposition; existing metrics.jsonl couldn't establish either. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>	2026-05-25 16:18:52 +08:00
Gahow Wang	9cebdb6b9b	Fix multi-turn replay fidelity: track realized output tokens across all components The replayer and proxy were building multi-turn prompts from trace tokens, but the model generates different output tokens. Subsequent turns had wrong prefix tokens, causing cache misses and invalid experimental measurements. - replay.py: min_tokens=max_tokens for deterministic length, return_token_ids to capture actual output, _apply_realized_prefix for next-turn correction - proxy: extract output token_ids from SSE, record prompt+output as realized prefix in shadow cache, extract _handle_local_request to deduplicate - bench.sh/launch_elastic_p2p.sh: default elastic mode to unified policy - mooncake_connector: only send prompt blocks (not stale output blocks), track failed_recving_block_ids for error recovery Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-24 14:47:51 +08:00
Gahow Wang	7c7f8b951a	replayer: wire --max-inflight-sessions cap into replay loop (B2) Trace-driven dispatch is preserved by default (semaphore=None when the flag is not set), but operators can now cap concurrent sessions to reproduce session-admission scenarios from earlier sweeps without artificial time compression.	2026-05-23 21:04:09 +08:00
Gahow Wang	2c7f7fdaae	replayer: restore optional max_inflight_sessions for backwards compat Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 21:02:26 +08:00
Gahow Wang	0ed1ce200e	metrics: replace round-based percentile with linear interpolation (B5) The previous implementation used round((n-1) * pct), which selects a single element instead of interpolating. Under Python's banker's rounding (round-half-to-even), the p50 of an even-length array came out as the upper-middle or lower-middle element depending on n (e.g. p50 of [1,2,3,4] returned 3 instead of 2.5). The p50 values in the summary JSONs were skewed as a result. Match numpy.percentile's default linear interpolation between the two adjacent sorted values.	2026-05-23 21:00:24 +08:00
Gahow Wang	4089ffd63f	Fix replay methodology: trace-driven dispatch, no artificial limits The replayer was artificially limiting concurrency with --max-inflight-sessions (semaphore) and --time-scale (time compression), producing unrealistically low 1 req/GPU load that masked prefill-decode interference. Replayer changes: - Remove session_sem and time_scale entirely - Each request dispatched at its trace timestamp exactly - Sessions still sequential (turn N+1 waits for turn N completion) - If turn completes late, next turn fires immediately Sampler changes: - Add --sample-ratio for GPU-proportional session sampling - Keep --target-requests for backwards compat - No time compression (preserve original arrival pattern) bench.sh: remove --time-scale and --max-inflight-sessions args Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-23 12:43:41 +08:00
Gahow Wang	1d2eeb4925	Elastic P2P offload: TTFT p50 -49% vs baseline (0.551 vs 1.080) Design: offload HEAVY prefill only when P instance is less loaded than D AND P is not overloaded (< 1.5x avg). Preserves session-sticky on D for future KV reuse. External KV correctly registered in prefix cache. Result (67/200 processed, 75% success): TTFT p50: 0.551s (-49% vs baseline 1.080s) TTFT p90: 4.135s (vs baseline 9.410s, -56%) TPOT p90: 0.074s (same as baseline) E2E p50: 2.938s (-45% vs baseline 5.306s) 25% error rate from ReadTimeout on very large HEAVY requests queuing on P. Needs stricter elastic gate or higher timeout. But successful requests show significant improvement over both baseline and previous P2P. Also: added external_prefix_cache metrics tracking to replayer summary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 13:50:25 +08:00
Gahow Wang	32f09d32cd	Balanced session-sticky routing + agentic workload pattern analysis Routing fix: new sessions placed by cumulative token load (greedy bin packing) with cache-hit tiebreak. Session affinity for turn 2+. Replayer now sends X-Session-Id header for proper session tracking. Agentic workload core patterns (GLM-5.1 trace): - 91% of reusable KV is intra-session (not cross-session) - Session-sticky routing is THE critical optimization - 36% warm requests (1.3k new tokens), 64% cold (17k+) - After cache: effective prefill/decode ratio drops from 61.5x to 28.7x - Cross-session sharing (system prompt) is only 4.8% of tokens Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 01:50:27 +08:00
Gahow Wang	05592e6adc	Agentic workload PD separation analysis with trace-driven benchmarks Systematic study of prefill-decode disaggregation for agentic LLM workloads using production GLM-5.1 coder trace (2.1M requests, 71B input tokens). Key findings: - Cache-aware routing improves TPOT p90 by 15% and APC from 20.8% to 44.7% without PD separation, matching PD-Sep's decode isolation benefit - PD separation adds +72% TTFT overhead (KV transfer) with no TPOT gain when using the same cache-aware scheduler - Prefill remains compute-bound even at 95% KV cache reuse (AI >1000x vs decode AI <2), but absolute FLOPs drop 71% from cache hits - For agentic MoE workloads, cache-aware routing > PD separation Infrastructure: - Trace sampler preserving session structure + hash_ids for prefix sharing - Async trace replayer with streaming TTFT/TPOT/E2E measurement - Unified cache-aware + token-level load-balanced global scheduler proxy supporting both PD-colocated and PD-disaggregated (Mooncake/RDMA) modes - vLLM 0.18.1 scheduler patch for KV transfer abort race condition - Roofline analysis tool for prefill/decode compute characterization Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-21 21:21:57 +08:00
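
The percentile fix in 0ed1ce200e can be illustrated with a minimal sketch. The function names below are hypothetical and are not taken from the repository. It assumes `pct` is a fraction in [0, 1] and uses the (n-1) * pct position rule that numpy.percentile's default linear method uses. The rounded variant is included only to show the old behaviour. Because round-half-to-even is used, it can land on either middle element, depending on the array length.

```python
import math


def percentile_linear(values, pct):
    """Percentile by linear interpolation between the two adjacent sorted values.

    Hypothetical helper; pct is a fraction in [0, 1].
    """
    if not values:
        raise ValueError("percentile of an empty sequence is undefined")
    xs = sorted(values)
    pos = (len(xs) - 1) * pct      # fractional index into the sorted data
    lo = math.floor(pos)
    hi = math.ceil(pos)
    frac = pos - lo                # 0.0 when pos falls exactly on an element
    return xs[lo] + (xs[hi] - xs[lo]) * frac


def percentile_rounded(values, pct):
    """Old-style variant: pick one element via round(), no interpolation."""
    xs = sorted(values)
    return xs[round((len(xs) - 1) * pct)]


if __name__ == "__main__":
    print(percentile_rounded([1, 2, 3, 4], 0.5))  # 3   (round(1.5) == 2)
    print(percentile_linear([1, 2, 3, 4], 0.5))   # 2.5
```
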

13 Commits
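
The two dispatch rules described in 8a6b22c11c can be sketched as follows. This illustrates the rule as stated in the commit message, not the replayer's actual code. The function name and argument names are assumptions, and all times are assumed to be in seconds on a common clock. It covers turn k > 1 only, since turn 1 is dispatched at trace arrival in both modes.

```python
def next_turn_dispatch_time(mode, prev_finished, trace_ts, time_to_parent_chat):
    """Return when turn k (k > 1) of a session should be dispatched.

    Hypothetical helper:
      prev_finished        completion time of turn k-1 in this replay
      trace_ts             absolute timestamp of turn k in the trace
      time_to_parent_chat  recorded gap between turn k-1 finishing and turn k
    """
    if mode == "tracets":
        # Never earlier than the trace timestamp, but if the system is
        # behind, the turn fires as soon as the previous one finishes.
        return max(prev_finished, trace_ts)
    if mode == "thinktime":
        # Always preserve the recorded think-time after the previous turn.
        return prev_finished + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode!r}")


if __name__ == "__main__":
    # System is behind: turn k-1 finished at t=120 but the trace says t=100.
    print(next_turn_dispatch_time("tracets", 120.0, 100.0, 15.0))    # 120.0
    print(next_turn_dispatch_time("thinktime", 120.0, 100.0, 15.0))  # 135.0
```

In the example, `tracets` leaves no gap between the two turns, while `thinktime` keeps the 15 s gap. This is the burst effect the commit message attributes to `tracets` when the system is behind.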