# agentic-kv

Serving agentic LLM workloads by keeping the KV working set in GPU HBM
(GPU-hit-first).

Research outline: [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md).
Evidence + experiments: [`v2/`](v2/).

---

## ⚠️ Benchmarking methodology — read this first

> **Replay agentic traces with `--dispatch-mode thinktime`, not the default
> `tracets`.** It is the faithful, more realistic load, and the dispatch mode
> materially changes the performance you measure.

The replayer offers two ways to time each turn:

| mode | turn-k dispatched at | what it models |
|---|---|---|
| `tracets` (default) | `max(prev_turn_finished, trace_ts)` | absolute production schedule |
| **`thinktime` (use this)** | `prev_turn_finished + time_to_parent_chat` | real closed-loop agent pacing |

**Why it matters.** `tracets` collapses the inter-turn think-time to ~0 whenever
the system falls behind: the trace timestamp is already in the past, so the next
turn fires immediately. That manufactures **artificial request bursts**: spiking
instantaneous concurrency → KV-pool pressure → preemption → inflated tail
latency and wasted throughput. `thinktime` keeps each turn's real gap
(tool-exec + agent think), so the offered load is what a real agent produces.

**Measured (w600 first-300s window, 8×H20, round-robin, 100% completion):**

| metric (N=8) | `tracets` (Mode 1) | **`thinktime` (Mode 2)** | Δ |
|---|---:|---:|---:|
| E2E p90 | 102.8 s | **73.5 s** | **−28%** |
| E2E p99 | 245 s | **227 s** | −7% |
| TTFT p90 | 56.1 s | **39.7 s** | **−29%** |
| system TPS | 111.8 | **119.3** | **+7%** |
| wall-clock | 967 s | **787 s** | −19% |
| TPOT p90 | 0.174 s | 0.188 s | ~flat |

So under realistic capacity (N=8), `tracets` makes the system look worse than it
actually is: `thinktime` measures **E2E p90 28% lower and TTFT p90 29% lower**,
while p99 moves only 7%. Tell-tale: scaling 6→8 instances barely changed the
`tracets` wall-clock (975→967 s; its bursts re-saturate regardless of capacity)
but helped `thinktime` a lot (1125→787 s).
Under heavy saturation (N=6) the two modes converge (E2E p90 ≈ 118–120 s),
because no spare capacity is left for the bursts to disrupt. Decode (TPOT) is
dispatch-independent everywhere.

**Recommendation:** benchmark with `--dispatch-mode thinktime`; use `tracets`
only as an explicit bursty stress case. Full ablation:
[`v2/exp_c_dispatch_ablation/`](v2/exp_c_dispatch_ablation/).

### How to use it

```bash
# 1. annotate a trace with the real per-turn gap (one-time; scans the raw trace)
python scripts/add_time_to_parent.py traces/w600_r0.0015_st30.jsonl traces/w600_ttp.jsonl

# 2. replay closed-loop with faithful think-time
#    (<endpoint> and <model> are placeholders; supply your own values)
python -m replayer --trace traces/w600_ttp.jsonl --endpoint <endpoint> \
    --model <model> --dispatch-mode thinktime
```

`time_to_parent_chat = this_turn.request_ready_time_ms − parent_turn.request_end_time_ms`,
computed from the raw trace and stored per request. Turn 1 has no parent, so it
carries no `time_to_parent_chat` and fires at its trace arrival time. Traces
without the field fall back to `tracets`.

---

## Project map

- [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md): GPU-hit-first paper outline (the thesis).
- [`v2/`](v2/): evidence experiments:
  - `exp_a_tier_latency/`: KV-hit cost by tier (GPU < CPU-local < remote-RDMA < miss).
  - `exp_b_capacity_knee/`: realized APC / latency knee vs GPU capacity.
  - `exp_c_dispatch_ablation/`: the replay-mode study above.
- `replayer/`: trace replayer (`--dispatch-mode`, closed-loop think-time).
- `scripts/add_time_to_parent.py`: trace annotation for `thinktime`.
- `microbench/`, `analysis/`: PD-disagg, routing, workload characterization.
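
To make the two dispatch rules from the benchmarking section concrete, here is a minimal Python sketch of how each one picks a turn's dispatch time. It is an illustration only, not the replayer's actual code: the function names, the millisecond units for every value, and the example numbers are all assumptions, and `trace_ts` is assumed to coincide with the turn's `request_ready_time_ms`.

```python
from typing import Optional


def time_to_parent_chat_ms(request_ready_time_ms: int,
                           parent_request_end_time_ms: int) -> int:
    """Per-turn gap (tool execution + agent think) as recorded in the raw trace."""
    return request_ready_time_ms - parent_request_end_time_ms


def dispatch_tracets(prev_turn_finished_ms: int, trace_ts_ms: int) -> int:
    """`tracets`: follow the absolute trace schedule, never before the previous turn ends."""
    return max(prev_turn_finished_ms, trace_ts_ms)


def dispatch_thinktime(prev_turn_finished_ms: int,
                       trace_ts_ms: int,
                       gap_ms: Optional[int]) -> int:
    """`thinktime`: wait the recorded gap after the previous turn finishes."""
    if gap_ms is None:
        # Trace has no time_to_parent_chat annotation: fall back to `tracets`.
        return dispatch_tracets(prev_turn_finished_ms, trace_ts_ms)
    return prev_turn_finished_ms + gap_ms


# Hypothetical turn: in production the parent ended at 10 s and this turn
# became ready at 14 s, so the recorded gap is 4 s.
gap = time_to_parent_chat_ms(request_ready_time_ms=14_000,
                             parent_request_end_time_ms=10_000)

# During replay the system is behind schedule: the parent finishes at 25 s,
# which is already past this turn's trace timestamp of 14 s.
prev_finished = 25_000
trace_ts = 14_000

tracets_at = dispatch_tracets(prev_finished, trace_ts)
thinktime_at = dispatch_thinktime(prev_finished, trace_ts, gap)

print(tracets_at - prev_finished)    # effective think-time under tracets
print(thinktime_at - prev_finished)  # effective think-time under thinktime
```

With these numbers the effective think-time is 0 ms under `tracets` and 4000 ms under `thinktime`, which is the collapse that produces the artificial bursts. Turn 1 is left out of the sketch because it has no parent and simply fires at its trace arrival time.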