agentic-kvc/README.md
Gahow Wang 8a6b22c11c Replayer think-time dispatch mode + benchmarking guidance
Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that
agentic serving should be benchmarked with `thinktime` (the faithful load).

- `tracets` (old default): turn-k at the absolute trace timestamp, i.e.
  max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the
  system is behind, manufacturing request bursts.
- `thinktime`: turn-1 at trace arrival; turn-k at prev_finished +
  time_to_parent_chat (real production gap). scripts/add_time_to_parent.py
  annotates a trace with that gap from the raw trace's request_ready/end_ms.

exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime
beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because
tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6
(saturated) they converge. So tracets makes the system look ~30% worse on tail
latency than realistic agent pacing. Root README.md carries the headline
guidance; raw per-request metrics gitignored (perf_summary.json kept).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:28:36 +08:00


agentic-kv

Serving agentic LLM workloads by keeping the KV working set in GPU HBM (GPU-hit-first). Research outline: PAPER_OUTLINE.md. Evidence + experiments: v2/.


⚠️ Benchmarking methodology — read this first

Replay agentic traces with --dispatch-mode thinktime, not the default tracets. It is the faithful, more realistic load — and the dispatch mode materially changes the performance you measure.

The replayer offers two ways to time each turn:

| mode | turn-k dispatched at | what it models |
|---|---|---|
| tracets (default) | max(prev_turn_finished, trace_ts) | absolute production schedule |
| thinktime (use this) | prev_turn_finished + time_to_parent_chat | real closed-loop agent pacing |

Why it matters. tracets collapses the inter-turn think-time to ~0 whenever the system falls behind (it fires the next turn immediately because the trace timestamp is already in the past). That manufactures artificial request bursts — spiking instantaneous concurrency → KV-pool pressure → preemption → inflated tail latency and wasted throughput. thinktime keeps each turn's real gap (tool-exec + agent think), so the offered load is what a real agent produces.
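The two timing rules can be sketched in a few lines. This is an illustrative sketch only, not the replayer's actual code: the function name `dispatch_time` and its signature are hypothetical, and all times are assumed to be in the same unit (seconds here, for readability).

```python
from typing import Optional


def dispatch_time(mode: str,
                  prev_turn_finished: float,
                  trace_ts: float,
                  time_to_parent_chat: Optional[float]) -> float:
    """Return when turn-k (k > 1) is dispatched. Hypothetical helper."""
    if mode == "thinktime" and time_to_parent_chat is not None:
        # Closed-loop pacing: keep the recorded gap after the parent turn ends.
        return prev_turn_finished + time_to_parent_chat
    # tracets, or a trace without the field (which falls back to tracets).
    return max(prev_turn_finished, trace_ts)


# System running behind: the parent turn finished at t=50, but the trace
# scheduled the next turn at t=30, with a recorded gap of 12.
prev_finished, trace_ts, gap = 50.0, 30.0, 12.0
for mode in ("tracets", "thinktime"):
    t = dispatch_time(mode, prev_finished, trace_ts, gap)
    print(f"{mode}: effective think-time = {t - prev_finished}")
```

In this example tracets dispatches at the moment the parent turn finishes (effective think-time 0), while thinktime waits out the recorded gap of 12.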

Measured (w600 first-300s window, 8×H20, round-robin, 100% completion):

| metric (N=8) | tracets (Mode 1) | thinktime (Mode 2) | Δ |
|---|---|---|---|
| E2E p90 | 102.8 s | 73.5 s | -28% |
| E2E p99 | 245 s | 227 s | -7% |
| TTFT p90 | 56.1 s | 39.7 s | -29% |
| system TPS | 111.8 | 119.3 | +7% |
| wall-clock | 967 s | 787 s | -19% |
| TPOT p90 | 0.174 s | 0.188 s | ~flat |

So under realistic capacity, tracets makes the system look ~30% worse on tail latency than it actually is. Tell-tale: scaling 6→8 instances barely helped tracets (wall-clock 975→967 s, because its bursts re-saturate regardless of capacity) but helped thinktime a lot (1125→787 s). Under heavy saturation (N=6) the two converge (E2E p90 ≈ 118–120 s), since there is no spare capacity for the bursts to disrupt. Decode (TPOT) is dispatch-independent everywhere.
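The Δ column and the scaling comparison are plain relative changes, which can be rechecked from the reported numbers. The sketch below uses only figures quoted above; `pct_change` is a hypothetical helper name.

```python
def pct_change(baseline: float, new: float) -> float:
    """Relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# N=8 rows from the table: (tracets, thinktime)
n8 = {
    "E2E p90 (s)": (102.8, 73.5),
    "E2E p99 (s)": (245.0, 227.0),
    "TTFT p90 (s)": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall-clock (s)": (967.0, 787.0),
}
for metric, (tracets, thinktime) in n8.items():
    print(f"{metric:15s} {pct_change(tracets, thinktime):+6.1f}%")

# Wall-clock when scaling from 6 to 8 instances: (N=6, N=8)
scaling = {"tracets": (975.0, 967.0), "thinktime": (1125.0, 787.0)}
for mode, (n6, n8_wall) in scaling.items():
    print(f"{mode:10s} 6->8 instances: {pct_change(n6, n8_wall):+6.1f}%")
```

The script prints one decimal place, so its values can differ slightly from the whole-percent figures in the Δ column.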

Recommendation: benchmark with --dispatch-mode thinktime; use tracets only as an explicit bursty stress case. Full ablation: v2/exp_c_dispatch_ablation/.

How to use it

# 1. annotate a trace with the real per-turn gap (one-time; scans the raw trace)
python scripts/add_time_to_parent.py traces/w600_r0.0015_st30.jsonl traces/w600_ttp.jsonl

# 2. replay closed-loop with faithful think-time
python -m replayer --trace traces/w600_ttp.jsonl --endpoint <eps> \
    --model <model> --dispatch-mode thinktime

time_to_parent_chat = this_turn.request_ready_time_ms - parent_turn.request_end_time_ms, computed from the raw trace and stored per request; turn-1 has none (it fires at its trace arrival). Traces without the field fall back to tracets.
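As a rough illustration of that annotation step, the sketch below computes the gap for in-memory records. It is not the code in scripts/add_time_to_parent.py. The linking fields `chat_id` and `parent_chat_id` are assumptions made for this example; only `request_ready_time_ms`, `request_end_time_ms`, and `time_to_parent_chat` come from the description above.

```python
from typing import Dict, Iterable, List


def annotate(records: Iterable[dict]) -> List[dict]:
    """Add time_to_parent_chat (ms) to every record that has a known parent."""
    records = list(records)
    # Index each request's end time by its id (chat_id is an assumed field).
    end_ms: Dict[str, float] = {
        r["chat_id"]: r["request_end_time_ms"] for r in records
    }
    annotated = []
    for r in records:
        r = dict(r)  # copy so the caller's records stay unmodified
        parent = r.get("parent_chat_id")  # assumed field; None for turn-1
        if parent is not None and parent in end_ms:
            r["time_to_parent_chat"] = r["request_ready_time_ms"] - end_ms[parent]
        annotated.append(r)
    return annotated


raw = [
    {"chat_id": "a1", "parent_chat_id": None,
     "request_ready_time_ms": 1_000, "request_end_time_ms": 9_000},
    {"chat_id": "a2", "parent_chat_id": "a1",
     "request_ready_time_ms": 12_500, "request_end_time_ms": 20_000},
]
for rec in annotate(raw):
    print(rec["chat_id"], rec.get("time_to_parent_chat"))
```

Turn-1 records are left without the field, matching the rule that the first turn fires at its trace arrival.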


Project map

  • PAPER_OUTLINE.md — GPU-hit-first paper outline (the thesis).
  • v2/ — evidence experiments:
    • exp_a_tier_latency/ — KV-hit cost by tier (GPU < CPU-local < remote-RDMA < miss).
    • exp_b_capacity_knee/ — realized APC / latency knee vs GPU capacity.
    • exp_c_dispatch_ablation/ — the replay-mode study above.
  • replayer/ — trace replayer (--dispatch-mode, closed-loop think-time).
  • scripts/add_time_to_parent.py — trace annotation for thinktime.
  • microbench/, analysis/ — PD-disagg, routing, workload characterization.