Gahow Wang 8a6b22c11c Replayer think-time dispatch mode + benchmarking guidance

Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that
agentic serving should be benchmarked with `thinktime` (the faithful load).

- `tracets` (old default): turn-k at the absolute trace timestamp, i.e.
  max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the
  system is behind, manufacturing request bursts.
- `thinktime`: turn-1 at trace arrival; turn-k at prev_finished +
  time_to_parent_chat (real production gap). scripts/add_time_to_parent.py
  annotates a trace with that gap from the raw trace's request_ready/end_ms.

exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime
beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because
tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6
(saturated) they converge. So tracets makes the system look ~30% worse on tail
latency than realistic agent pacing. Root README.md carries the headline
guidance; raw per-request metrics gitignored (perf_summary.json kept).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 16:28:36 +08:00

results

analyze.py

README.md

run_ablation.sh

README.md

exp (c) — Replay dispatch mode: `tracets` vs `thinktime`

Which replay mode should we benchmark agentic serving with, and how much does it change the measured performance?

Two dispatch modes (replayer `--dispatch-mode`):

- `tracets` (default): turn-k at the absolute trace timestamp ⇒ effectively max(prev_finished, trace_ts).
- `thinktime`: turn-1 at trace arrival; turn-k at prev_finished + time_to_parent_chat (the real production gap, annotated by scripts/add_time_to_parent.py from the raw trace's request_ready_time_ms/request_end_time_ms).
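The two rules can be written as one small function. This is a minimal sketch of the rules as stated above, not the replayer's actual code: the function name `dispatch_time` and its signature are hypothetical, and all times are assumed to be in seconds on a common clock.

```python
from typing import Optional


def dispatch_time(
    mode: str,
    trace_ts: float,
    prev_finished: Optional[float] = None,
    time_to_parent_chat: float = 0.0,
) -> float:
    """Return when a turn should be sent (hypothetical helper, seconds)."""
    if prev_finished is None:
        # Turn 1 has no parent: both modes dispatch at trace arrival.
        return trace_ts
    if mode == "tracets":
        # Absolute trace timestamp, but never before the parent turn finished.
        return max(prev_finished, trace_ts)
    if mode == "thinktime":
        # Parent completion plus the recorded production think-time gap.
        return prev_finished + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode!r}")


# A system that is behind: the parent finished at t=25 s, but the trace says
# the next turn arrived at t=10 s after a 1.22 s think-time gap.
for mode in ("tracets", "thinktime"):
    t = dispatch_time(mode, trace_ts=10.0, prev_finished=25.0,
                      time_to_parent_chat=1.22)
    print(mode, "dispatches at", t, "realized gap", t - 25.0)
```

When the system keeps up (prev_finished earlier than trace_ts), `tracets` waits for the trace timestamp. When it falls behind, the `max` returns prev_finished and the gap disappears.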

Setup: w600 windowed to first 300 s (366 reqs, 223 multi-turn), round-robin across N H20 instances, both modes on the same instances, 100% completion throughout.

Performance result

| metric | N6 tracets | N6 thinktime | N8 tracets | N8 thinktime |
|---|---|---|---|---|
| system TPS | 110.9 | 96.1 | 111.8 | 119.3 |
| wall (s) | 975 | 1125 | 967 | 787 |
| TTFT p50 / p90 / p99 (s) | 4.4 / 61.8 / 135 | 4.5 / 83.7 / 130 | 2.9 / 56.1 / 115 | 3.1 / 39.7 / 83.5 |
| TPOT p50 / p90 / p99 (s) | 0.039 / 0.242 / 0.96 | 0.037 / 0.264 / 0.69 | 0.037 / 0.174 / 0.89 | 0.037 / 0.188 / 0.85 |
| E2E p50 / p90 / p99 (s) | 17.1 / 118 / 298 | 15.0 / 120 / 338 | 11.9 / 102.8 / 245 | 12.3 / 73.5 / 227 |

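At N=8 (capacity slack), thinktime is clearly better than tracets: E2E p90 −28% (102.8 → 73.5 s), TTFT p90 −29% (56.1 → 39.7 s), TPS +7%, wall −19%. At N=6 (saturated) the two modes converge on E2E p90 (≈ 118–120 s), although thinktime shows lower TPS (96.1 vs 110.9) and a longer wall time (1125 vs 975 s) there. TPOT p50 (decode) is dispatch-independent everywhere (0.037–0.039 s).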
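The N=8 deltas can be rechecked directly from the table. The snippet below is only an arithmetic check; the helper name `pct_change` is illustrative and the numbers are copied from the N8 columns. The percentages quoted in the text are whole-number approximations of these values.

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return 100.0 * (new - old) / old


# (tracets, thinktime) pairs copied from the N=8 columns of the table.
N8 = {
    "E2E p90 (s)": (102.8, 73.5),
    "TTFT p90 (s)": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall (s)": (967, 787),
}

for metric, (tracets, thinktime) in N8.items():
    print(f"{metric:14s} {pct_change(thinktime, tracets):+6.1f}%")
```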

Why — the mechanism (`figs/exp_c_dispatch_ablation.png`)

tracets collapses the realized inter-turn gap to ~0 under load (p50 0.00 s, 75% < 0.5 s): the trace timestamp is already in the past, so it fires the next turn immediately. thinktime preserves the real gap (p50 1.22 s, matching the trace). The figure shows both realized-gap CDFs against the real time_to_parent_chat.

That gap-collapse manufactures bursts → peak concurrency spikes → KV-pool pressure → preemption → inflated tail latency + wasted throughput. The bursts re-saturate the system regardless of capacity, which is why scaling 6→8 instances barely helped tracets (975→967 s) but helped thinktime a lot (1125→787 s). Under saturation (N=6) there is no slack for bursts to harm, so the modes converge.
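The realized inter-turn gap discussed above is the time between a turn's dispatch and its parent turn's completion. The sketch below shows one way to compute it from per-request records. It is not the logic in analyze.py, and the field names (`session_id`, `turn`, `dispatched_s`, `finished_s`) are hypothetical placeholders for whatever the metrics files actually contain.

```python
from collections import defaultdict
from statistics import median


def realized_gaps(records):
    """Gap (s) between each turn's dispatch and its parent's finish.

    `records` is a list of dicts with hypothetical keys:
    session_id, turn, dispatched_s, finished_s.
    """
    by_session = defaultdict(list)
    for rec in records:
        by_session[rec["session_id"]].append(rec)

    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda rec: rec["turn"])
        # Pair every turn with the one before it in the same session.
        for prev, cur in zip(turns, turns[1:]):
            gaps.append(cur["dispatched_s"] - prev["finished_s"])
    return gaps


def summarize(gaps, threshold_s=0.5):
    """Median gap and the fraction of gaps below `threshold_s`."""
    return {
        "p50": median(gaps),
        "frac_below": sum(g < threshold_s for g in gaps) / len(gaps),
    }


# Toy data: one three-turn session and one single-turn session.
toy = [
    {"session_id": "a", "turn": 1, "dispatched_s": 0.0, "finished_s": 5.0},
    {"session_id": "a", "turn": 2, "dispatched_s": 5.0, "finished_s": 9.0},
    {"session_id": "a", "turn": 3, "dispatched_s": 10.2, "finished_s": 12.0},
    {"session_id": "b", "turn": 1, "dispatched_s": 1.0, "finished_s": 4.0},
]
print(summarize(realized_gaps(toy)))
```

Running the same summary on the `tracets` and `thinktime` metrics files would give the kind of p50 and "fraction below 0.5 s" figures quoted above, provided the records expose equivalent dispatch and finish timestamps.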

Conclusion

Benchmark agentic serving with --dispatch-mode thinktime — it is the faithful closed-loop agent load and avoids the tracets burst artifact that makes the system look ~30% worse on tail latency than it is. Use tracets only as an explicit bursty stress case. (See the repo README.md for the headline guidance.)

Caveat: round-robin pays full prefill every turn (no cache reuse), so absolute latencies here are high; a cache-aware policy (LPWL) would lower them and likely widen the thinktime advantage. The raw window is also heavy (E2E in tens of seconds); a lighter load shows a healthier operating point.

Repro

N=8 TRACE=traces/w600_ttp_win.jsonl bash v2/exp_c_dispatch_ablation/run_ablation.sh
python v2/exp_c_dispatch_ablation/analyze.py traces/w600_ttp_win.jsonl \
    v2/exp_c_dispatch_ablation/results/metrics_{tracets,thinktime}.jsonl
