Replayer think-time dispatch mode + benchmarking guidance

Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that
agentic serving should be benchmarked with `thinktime` (the faithful load).

- `tracets` (old default): dispatches turn-k at its absolute trace timestamp,
  i.e. max(prev_finished, trace_ts). When the system is behind, that timestamp
  is already in the past, so inter-turn think-time collapses to ~0 and
  manufactures request bursts.
- `thinktime`: dispatches turn-1 at trace arrival and turn-k at prev_finished +
  time_to_parent_chat (the real production gap). scripts/add_time_to_parent.py
  annotates a trace with that gap from the raw trace's
  request_ready_time_ms/request_end_time_ms.

exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime
beats tracets: E2E p90 -28% (73.5 vs 102.8 s), TTFT p90 -29%, TPS +7%, because
tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6
(saturated) E2E p90 converges. So at N=8 tracets reports p90 tail latency ~40%
higher (E2E 102.8 vs 73.5 s) than realistic agent pacing does. Root README.md
carries the headline guidance; raw per-request metrics are gitignored
(perf_summary.json is kept).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
v2/exp_c_dispatch_ablation/README.md
@@ -0,0 +1,61 @@
# exp (c) - Replay dispatch mode: `tracets` vs `thinktime`

Which replay mode should we benchmark agentic serving with, and how much does it
change the measured performance?

**Two dispatch modes** (`replayer --dispatch-mode`):
- **`tracets`** (default): turn-k is dispatched at its absolute trace timestamp,
  which is effectively `max(prev_finished, trace_ts)`.
- **`thinktime`**: turn-1 is dispatched at trace arrival; turn-k at
  `prev_finished + time_to_parent_chat` (the real production gap, annotated by
  `scripts/add_time_to_parent.py` from the raw trace's
  `request_ready_time_ms`/`request_end_time_ms`).

Setup: the w600 trace windowed to its first 300 s (366 reqs, 223 multi-turn),
dispatched round-robin across N H20 instances. Both modes run on the same
instances, with 100% completion throughout.
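The two dispatch rules described above can be sketched as a single function. This is a minimal illustration, not the replayer's actual code: the function name `dispatch_time` and its parameters are hypothetical, and times are assumed to be seconds on the replay clock.

```python
from typing import Optional


def dispatch_time(
    mode: str,
    trace_ts: float,
    prev_finished: Optional[float] = None,
    time_to_parent_chat: float = 0.0,
) -> float:
    """Return when a turn should be sent (hypothetical helper, seconds)."""
    if prev_finished is None:
        # Turn 1 has no parent: both modes dispatch at trace arrival.
        return trace_ts
    if mode == "tracets":
        # Absolute trace timestamp, but never before the parent turn finished.
        return max(prev_finished, trace_ts)
    if mode == "thinktime":
        # Parent completion plus the recorded think-time gap.
        return prev_finished + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode!r}")


# The parent finished late (t=20 s) although the trace scheduled the child
# at t=12 s with a 1.5 s think-time gap.
late_tracets = dispatch_time("tracets", trace_ts=12.0, prev_finished=20.0)
late_thinktime = dispatch_time(
    "thinktime", trace_ts=12.0, prev_finished=20.0, time_to_parent_chat=1.5
)
print(late_tracets, late_thinktime)
```

When the parent finishes after the child's trace timestamp, `tracets` sends the child immediately, while `thinktime` still waits for the recorded gap.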
## Performance result

| metric | N6 tracets | N6 thinktime | N8 tracets | **N8 thinktime** |
|---|---:|---:|---:|---:|
| system TPS | 110.9 | 96.1 | 111.8 | **119.3** |
| wall (s) | 975 | 1125 | 967 | **787** |
| TTFT p50 / p90 / p99 (s) | 4.4 / 61.8 / 135 | 4.5 / 83.7 / 130 | 2.9 / 56.1 / 115 | 3.1 / **39.7** / **83.5** |
| TPOT p50 / p90 / p99 (s) | .039 / .242 / .96 | .037 / .264 / .69 | .037 / .174 / .89 | .037 / .188 / .85 |
| E2E p50 / p90 / p99 (s) | 17.1 / 118 / 298 | 15.0 / 120 / 338 | 11.9 / 102.8 / 245 | 12.3 / **73.5** / **227** |

**At N=8 (capacity slack), `thinktime` is clearly better**: E2E p90 −28%, TTFT p90
−29%, TPS +7%, wall −19%. **At N=6 (saturated), E2E p90 converges** (118 vs 120 s),
but `thinktime` has no advantage there: its TTFT p90 is higher (83.7 vs 61.8 s),
its TPS lower (96.1 vs 110.9) and its wall time longer (1125 vs 975 s).
TPOT (decode) is largely dispatch-independent at p50 and p90; p99 differs more at
N=6 (.96 vs .69 s).
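The headline percentages can be re-derived from the N=8 columns of the table. The sketch below is only a sanity check on the arithmetic; `pct_change` is a hypothetical helper and the numbers are copied from the table by hand, not read from `perf_summary.json`.

```python
def pct_change(new: float, base: float) -> float:
    """Relative change of `new` with respect to `base`, in percent."""
    return (new - base) / base * 100.0


# (tracets, thinktime) values copied from the N=8 columns of the table.
n8 = {
    "E2E p90 (s)": (102.8, 73.5),
    "TTFT p90 (s)": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall (s)": (967.0, 787.0),
}

# Change when moving from tracets (base) to thinktime (new).
deltas = {
    name: pct_change(thinktime, tracets)
    for name, (tracets, thinktime) in n8.items()
}
for name, delta in deltas.items():
    print(f"{name}: {delta:+.1f}%")

# The same E2E p90 pair with thinktime as the base: how much tracets inflates it.
inflation = pct_change(102.8, 73.5)
print(f"tracets E2E p90 vs thinktime: {inflation:+.1f}%")
```

The choice of base matters: `thinktime` is about 28 to 29% lower than `tracets` on p90, which is the same gap as `tracets` being about 40% higher than `thinktime`.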
## Why: the mechanism (`figs/exp_c_dispatch_ablation.png`)

Under load, `tracets` collapses the realized inter-turn gap to ~0 (p50 0.00 s, 75%
of gaps < 0.5 s): it fires the next turn immediately because the trace timestamp
is already in the past. `thinktime` preserves the real gap (p50 1.22 s, matching
the trace). The figure shows both realized-gap CDFs against the real
`time_to_parent_chat`.

That gap collapse manufactures **bursts** → peak concurrency spikes → KV-pool
pressure → preemption → inflated tail latency and wasted throughput. The bursts
re-saturate the system regardless of capacity, which is why scaling from 6 to 8
instances barely changed wall time for `tracets` (975→967 s) but cut it
substantially for `thinktime` (1125→787 s). Under saturation (N=6) the system is
already backlogged, so the bursts have no slack left to consume and the modes
converge on E2E p90.
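The gap collapse can be reproduced with a toy single-session replay. This is a sketch under strong assumptions, not the analysis in `analyze.py`: `realized_gaps` is a hypothetical helper, each turn is given a constant service time, and queueing across sessions is ignored.

```python
def realized_gaps(mode, trace_ts, think_gaps, service_time):
    """Replay one session serially; return the gap realized before each follow-up turn.

    trace_ts[k]   : absolute trace timestamp of turn k (seconds)
    think_gaps[k] : recorded gap between turn k-1 finishing and turn k arriving
    service_time  : assumed constant end-to-end time per turn under replay
    """
    gaps = []
    prev_finished = None
    for k, ts in enumerate(trace_ts):
        if prev_finished is None:
            start = ts  # turn 1: trace arrival in both modes
        elif mode == "tracets":
            start = max(prev_finished, ts)
        else:  # "thinktime"
            start = prev_finished + think_gaps[k]
        if prev_finished is not None:
            gaps.append(start - prev_finished)
        prev_finished = start + service_time
    return gaps


# Made-up session: in production each turn took 2 s and the agent paused 1.2 s,
# so turns arrived at t = 0, 3.2, 6.4, 9.6. Under replay a turn takes 10 s.
trace_ts = [0.0, 3.2, 6.4, 9.6]
think_gaps = [0.0, 1.2, 1.2, 1.2]

tracets_gaps = realized_gaps("tracets", trace_ts, think_gaps, service_time=10.0)
thinktime_gaps = realized_gaps("thinktime", trace_ts, think_gaps, service_time=10.0)
print([round(g, 3) for g in tracets_gaps])
print([round(g, 3) for g in thinktime_gaps])
```

With the slow replay every `tracets` gap is zero, because each trace timestamp has already passed when the parent finishes, while `thinktime` keeps the 1.2 s pause. Setting `service_time=2.0` (replay as fast as production) makes both modes reproduce the recorded gap.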
## Conclusion

Benchmark agentic serving with **`--dispatch-mode thinktime`**: it is the faithful
closed-loop agent load and avoids the `tracets` burst artifact, which at N=8
inflates p90 tail latency by roughly 40% (E2E 102.8 vs 73.5 s, TTFT 56.1 vs
39.7 s; equivalently, `thinktime` p90 is 28 to 29% lower). Use `tracets` only as
an explicit bursty stress case. (See the repo [`README.md`](../../README.md) for
the headline guidance.)

Caveat: round-robin pays full prefill every turn (no cache reuse), so absolute
latencies here are high; a cache-aware policy (LPWL) would lower them and likely
widen the `thinktime` advantage. The raw window is also heavy (E2E in tens of
seconds); a lighter load would likely show a healthier operating point.
## Repro

```bash
N=8 TRACE=traces/w600_ttp_win.jsonl bash v2/exp_c_dispatch_ablation/run_ablation.sh
python v2/exp_c_dispatch_ablation/analyze.py traces/w600_ttp_win.jsonl \
    v2/exp_c_dispatch_ablation/results/metrics_{tracets,thinktime}.jsonl
```