agentic-kvc/v2/exp_c_dispatch_ablation/README.md
Gahow Wang 8a6b22c11c Replayer think-time dispatch mode + benchmarking guidance
Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that
agentic serving should be benchmarked with `thinktime` (the faithful load).

- `tracets` (old default): turn-k at the absolute trace timestamp, i.e.
  max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the
  system is behind, manufacturing request bursts.
- `thinktime`: turn-1 at trace arrival; turn-k at prev_finished +
  time_to_parent_chat (real production gap). scripts/add_time_to_parent.py
  annotates a trace with that gap from the raw trace's request_ready/end_ms.

exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime
beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because
tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6
(saturated) they converge. So tracets makes the system look ~30% worse on tail
latency than realistic agent pacing. Root README.md carries the headline
guidance; raw per-request metrics gitignored (perf_summary.json kept).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:28:36 +08:00


# exp (c) — Replay dispatch mode: `tracets` vs `thinktime`
Which replay mode should we benchmark agentic serving with, and how much does it
change the measured performance?

**Two dispatch modes** (`replayer --dispatch-mode`):

- **`tracets`** (default): turn-k at the absolute trace timestamp ⇒ effectively
  `max(prev_finished, trace_ts)`.
- **`thinktime`**: turn-1 at trace arrival; turn-k at
  `prev_finished + time_to_parent_chat` (the REAL production gap; annotated by
  `scripts/add_time_to_parent.py` from the raw trace's
  `request_ready_time_ms`/`request_end_time_ms`).

Setup: w600 windowed to the first 300 s (366 reqs, 223 multi-turn), round-robin across
N H20 instances, both modes on the same instances, 100% completion throughout.
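
The two dispatch rules above can be written as one small function. This is a minimal sketch of the rule as described here, not the replayer's actual code: the function name `dispatch_time` and its parameters are hypothetical, and all times are assumed to be seconds on a shared replay clock.

```python
from typing import Optional


def dispatch_time(
    mode: str,
    turn: int,
    trace_ts: float,
    prev_finished: Optional[float] = None,
    time_to_parent_chat: Optional[float] = None,
) -> float:
    """Return when to send a turn (hypothetical helper, times in seconds)."""
    if turn == 1 or prev_finished is None:
        # Both modes send the first turn at its trace arrival time.
        return trace_ts
    if mode == "tracets":
        # Absolute trace timestamp, but never before the parent turn finished.
        return max(prev_finished, trace_ts)
    if mode == "thinktime":
        if time_to_parent_chat is None:
            raise ValueError("thinktime needs an annotated time_to_parent_chat")
        # Parent finish time plus the annotated production gap.
        return prev_finished + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode}")


# Illustrative numbers: turn 2 was due at t=10.0 in the trace, but the replayed
# system is behind and turn 1 only finished at t=14.0.
prev = 14.0
for mode in ("tracets", "thinktime"):
    t = dispatch_time(mode, turn=2, trace_ts=10.0,
                      prev_finished=prev, time_to_parent_chat=1.22)
    print(mode, "realized gap:", round(t - prev, 2))
```

With these inputs `tracets` sends immediately (realized gap 0.0) because the trace timestamp is already in the past, while `thinktime` keeps the 1.22 s gap.
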
## Performance result
| metric | N6 tracets | N6 thinktime | N8 tracets | **N8 thinktime** |
|---|---:|---:|---:|---:|
| system TPS | 110.9 | 96.1 | 111.8 | **119.3** |
| wall (s) | 975 | 1125 | 967 | **787** |
| TTFT p50 / p90 / p99 | 4.4 / 61.8 / 135 | 4.5 / 83.7 / 130 | 2.9 / 56.1 / 115 | 3.1 / **39.7** / **83.5** |
| TPOT p50 / p90 / p99 | .039 / .242 / .96 | .037 / .264 / .69 | .037 / .174 / .89 | .037 / .188 / .85 |
| E2E p50 / p90 / p99 | 17.1 / 118 / 298 | 15.0 / 120 / 338 | 11.9 / 102.8 / 245 | 12.3 / **73.5** / **227** |

**At N=8 (capacity slack), `thinktime` is clearly better**: E2E p90 -28%, TTFT p90
-29%, TPS +7%, wall -19%. **At N=6 (saturated) they converge** (E2E p90 ≈ 118–120 s).
TPOT (decode) is dispatch-independent everywhere.
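
The N=8 deltas quoted above follow directly from the table. A short check, using only the table values (the helper name `pct_change` is introduced here for illustration):

```python
def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# (tracets, thinktime) pairs copied from the N=8 columns of the table.
n8 = {
    "E2E p90": (102.8, 73.5),
    "TTFT p90": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall (s)": (967, 787),
}

for name, (tracets, thinktime) in n8.items():
    print(f"{name}: {pct_change(thinktime, tracets):+.1f}%")
# E2E p90: -28.5%
# TTFT p90: -29.2%
# system TPS: +6.7%
# wall (s): -18.6%
```
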
## Why — the mechanism (`figs/exp_c_dispatch_ablation.png`)
`tracets` collapses the realized inter-turn gap to ~0 under load (p50 0.00 s, 75%
< 0.5 s): it fires the next turn immediately because the trace timestamp is in the
past. `thinktime` preserves the real gap (p50 1.22 s = the trace). The figure shows
both realized-gap CDFs against the real `time_to_parent_chat`.

That gap-collapse manufactures **bursts** → peak concurrency spikes → KV-pool
pressure → preemption → inflated tail latency + wasted throughput. The bursts
re-saturate the system regardless of capacity, which is why scaling 6→8 instances
barely helped `tracets` (975→967 s) but helped `thinktime` a lot (1125→787 s).
Under saturation (N=6) there is no slack for bursts to harm, so the modes converge.
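
The two quantities in this argument, realized inter-turn gap and peak concurrency, can both be computed from per-request timing records. The sketch below is not `analyze.py`; it assumes hypothetical record fields `session`, `dispatch`, and `finish` (seconds), and the toy records are made up for illustration.

```python
from collections import defaultdict
from statistics import median


def realized_gaps(records):
    """Gap between each turn's dispatch and the previous turn's finish, per session."""
    by_session = defaultdict(list)
    for r in records:
        by_session[r["session"]].append(r)
    gaps = []
    for turns in by_session.values():
        turns.sort(key=lambda r: r["dispatch"])
        for prev, cur in zip(turns, turns[1:]):
            gaps.append(cur["dispatch"] - prev["finish"])
    return gaps


def peak_concurrency(records):
    """Maximum number of requests in flight at once (sweep over start/end events)."""
    events = []
    for r in records:
        events.append((r["dispatch"], 1))
        events.append((r["finish"], -1))
    # At equal timestamps, process finishes (-1) before dispatches (+1).
    events.sort()
    in_flight = peak = 0
    for _, delta in events:
        in_flight += delta
        peak = max(peak, in_flight)
    return peak


# Toy records (hypothetical field names and values).
records = [
    {"session": "a", "dispatch": 0.0, "finish": 4.0},
    {"session": "a", "dispatch": 4.0, "finish": 7.0},
    {"session": "a", "dispatch": 8.5, "finish": 10.0},
    {"session": "b", "dispatch": 1.0, "finish": 6.0},
    {"session": "b", "dispatch": 6.25, "finish": 9.0},
]

gaps = realized_gaps(records)
print("gaps:", gaps)                      # [0.0, 1.5, 0.25]
print("p50 gap:", median(gaps))           # 0.25
print("share < 0.5 s:", sum(g < 0.5 for g in gaps) / len(gaps))
print("peak concurrency:", peak_concurrency(records))  # 2
```

Run on each mode's metrics, the gap list gives the realized-gap distribution (p50, share below 0.5 s) and the sweep gives the peak concurrency that drives KV-pool pressure.
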
## Conclusion
Benchmark agentic serving with **`--dispatch-mode thinktime`**: it is the faithful
closed-loop agent load and avoids the `tracets` burst artifact that makes the system
look ~30% worse on tail latency than it is. Use `tracets` only as an explicit bursty
stress case. (See the repo [`README.md`](../../README.md) for the headline guidance.)

Caveat: round-robin pays full prefill every turn (no cache reuse), so absolute
latencies here are high; a cache-aware policy (LPWL) would lower them and likely
widen the `thinktime` advantage. The raw window is also heavy (E2E in tens of
seconds); a lighter load shows a healthier operating point.
## Repro
```bash
N=8 TRACE=traces/w600_ttp_win.jsonl bash v2/exp_c_dispatch_ablation/run_ablation.sh
python v2/exp_c_dispatch_ablation/analyze.py traces/w600_ttp_win.jsonl \
v2/exp_c_dispatch_ablation/results/metrics_{tracets,thinktime}.jsonl
```