Replayer think-time dispatch mode + benchmarking guidance

Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that
agentic serving should be benchmarked with `thinktime` (the faithful load).

- `tracets` (old default): turn-k at the absolute trace timestamp, i.e.
  max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the
  system is behind, manufacturing request bursts.
- `thinktime`: turn-1 at trace arrival; turn-k at prev_finished +
  time_to_parent_chat (real production gap). scripts/add_time_to_parent.py
  annotates a trace with that gap from the raw trace's request_ready/end_ms.

exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime
beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because
tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6
(saturated) they converge. So tracets makes the system look ~30% worse on tail
latency than realistic agent pacing. Root README.md carries the headline
guidance; raw per-request metrics gitignored (perf_summary.json kept).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
77
README.md
Normal file
@@ -0,0 +1,77 @@
# agentic-kv

Serving agentic LLM workloads by keeping the KV working set in GPU HBM
(GPU-hit-first). Research outline: [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md).
Evidence + experiments: [`v2/`](v2/).

---

## ⚠️ Benchmarking methodology: read this first

> **Replay agentic traces with `--dispatch-mode thinktime`, not the default
> `tracets`.** It is the more faithful, realistic load, and the dispatch mode
> materially changes the performance you measure.

The replayer offers two ways to time each turn:

| mode | turn-k dispatched at | what it models |
|---|---|---|
| `tracets` (default) | `max(prev_turn_finished, trace_ts)` | absolute production schedule |
| **`thinktime` (use this)** | `prev_turn_finished + time_to_parent_chat` | real closed-loop agent pacing |

**Why it matters.** `tracets` collapses the inter-turn think-time to ~0 whenever
the system falls behind: the trace timestamp is already in the past, so it fires
the next turn immediately. That manufactures **artificial request bursts**, which
spike instantaneous concurrency → KV-pool pressure → preemption → inflated tail
latency and wasted throughput. `thinktime` keeps each turn's real gap (tool
execution + agent think time), so the offered load is what a real agent produces.
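The two rules in the table can be sketched in a few lines of Python. This is a minimal illustration, not the replayer's actual code: the function name `dispatch_time`, the use of seconds on a shared replay clock, and the example numbers are all assumptions made for the sketch.

```python
from typing import Optional


def dispatch_time(mode: str,
                  prev_finished: Optional[float],
                  trace_ts: float,
                  time_to_parent_chat: Optional[float]) -> float:
    """Return when a turn should be sent (hypothetical helper, seconds)."""
    if prev_finished is None:
        # Turn-1 has no parent, so both modes fire it at its trace arrival.
        return trace_ts
    if mode == "thinktime" and time_to_parent_chat is not None:
        # Closed loop: keep the recorded gap after the parent actually finished.
        return prev_finished + time_to_parent_chat
    # tracets, and the fallback for traces that lack the annotation.
    return max(prev_finished, trace_ts)


# Hypothetical session: in production the next turn arrived at t=15 s, 5 s
# after its parent ended. In the replay the system is behind and the parent
# only finishes at t=40 s.
tracets_at = dispatch_time("tracets", 40.0, 15.0, 5.0)
thinktime_at = dispatch_time("thinktime", 40.0, 15.0, 5.0)
print(tracets_at - 40.0)    # 0.0 -> think-time collapsed, turn fires at once
print(thinktime_at - 40.0)  # 5.0 -> recorded gap preserved
```

When the system keeps up (say the parent finishes at t=12 s), `tracets` waits until t=15 s, so the two modes only diverge once the replay falls behind the trace schedule.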

**Measured (w600 first-300s window, 8×H20, round-robin, 100% completion):**

| metric (N=8) | `tracets` (Mode 1) | **`thinktime` (Mode 2)** | Δ (`thinktime` vs `tracets`) |
|---|---:|---:|---:|
| E2E p90 | 102.8 s | **73.5 s** | **−28%** |
| E2E p99 | 245 s | **227 s** | −7% |
| TTFT p90 | 56.1 s | **39.7 s** | **−29%** |
| system TPS | 111.8 | **119.3** | **+7%** |
| wall-clock | 967 s | **787 s** | −19% |
| TPOT p90 | 0.174 s | 0.188 s | ~flat |

So with capacity slack (N=8), `tracets` makes the system look **~30% worse on
p90 latency** (E2E and TTFT) than it is under realistic pacing; p99 E2E differs
by only 7%. Tell-tale: scaling from 6 to 8 instances barely helped `tracets`
(wall-clock 975→967 s, because its bursts re-saturate regardless of capacity)
but helped `thinktime` a lot (1125→787 s). Under heavy saturation (N=6) the two
converge (E2E p90 ≈ 118–120 s), since there is no slack left for bursts to
consume. Decode speed (TPOT) is essentially dispatch-independent everywhere.
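For readers who want to check the Δ column, the relative changes can be recomputed from the reported numbers. The snippet below is only an arithmetic sketch: the helper name `pct_change` is made up here, the values are copied from the N=8 table, and the 6→8 scaling figures are assumed to be wall-clock seconds (the N=8 endpoints match the wall-clock row).

```python
def pct_change(baseline: float, value: float) -> float:
    """Relative change of value versus baseline, in percent."""
    return (value - baseline) / baseline * 100.0


# (tracets, thinktime) at N=8, copied from the table above.
n8 = {
    "E2E p90 (s)": (102.8, 73.5),
    "E2E p99 (s)": (245.0, 227.0),
    "TTFT p90 (s)": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall-clock (s)": (967.0, 787.0),
}
for name, (tracets, thinktime) in n8.items():
    print(f"{name:15s} {pct_change(tracets, thinktime):+6.1f}%")

# Scaling from 6 to 8 instances, per mode (assumed to be wall-clock seconds).
print(f"tracets   6->8: {pct_change(975.0, 967.0):+.1f}%")
print(f"thinktime 6->8: {pct_change(1125.0, 787.0):+.1f}%")
```

The recomputed values agree with the Δ column to within about one percentage point, and the scaling lines make the contrast explicit: under 1% for `tracets` against roughly 30% for `thinktime`.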

**Recommendation:** benchmark with `--dispatch-mode thinktime`; use `tracets`
only as an explicit bursty stress case. Full ablation:
[`v2/exp_c_dispatch_ablation/`](v2/exp_c_dispatch_ablation/).

### How to use it

```bash
# 1. Annotate a trace with the real per-turn gap (one-time; scans the raw trace).
python scripts/add_time_to_parent.py traces/w600_r0.0015_st30.jsonl traces/w600_ttp.jsonl

# 2. Replay closed-loop with faithful think-time.
python -m replayer --trace traces/w600_ttp.jsonl --endpoint <eps> \
    --model <model> --dispatch-mode thinktime
```

`time_to_parent_chat = this_turn.request_ready_time_ms - parent_turn.request_end_time_ms`
is computed from the raw trace and stored per request. Turn-1 has no parent, so
it carries no such field and fires at its trace arrival. Traces without the
field fall back to `tracets`.
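As a rough illustration of the annotation step, the gap can be derived with a single lookup per request. This is a sketch under assumptions, not the contents of `scripts/add_time_to_parent.py`: the linkage fields `chat_id` and `parent_chat_id`, the function name `annotate`, and the sample records are hypothetical; only `request_ready_time_ms`, `request_end_time_ms`, and `time_to_parent_chat` come from the description above.

```python
import json
from typing import Dict, Iterable, List


def annotate(records: Iterable[dict]) -> List[dict]:
    """Attach time_to_parent_chat (ms) to every record that has a parent."""
    records = list(records)
    # Index end times by id so each child can look up its parent.
    # "chat_id" / "parent_chat_id" are assumed field names.
    end_ms: Dict[str, float] = {
        r["chat_id"]: r["request_end_time_ms"] for r in records
    }
    out = []
    for r in records:
        r = dict(r)  # shallow copy, so the input records stay untouched
        parent = r.get("parent_chat_id")
        if parent is not None and parent in end_ms:
            r["time_to_parent_chat"] = r["request_ready_time_ms"] - end_ms[parent]
        # Turn-1 (no parent) gets no field and replays at its trace arrival.
        out.append(r)
    return out


# Two hypothetical turns of one session, timestamps in milliseconds.
raw = [
    {"chat_id": "a1", "parent_chat_id": None,
     "request_ready_time_ms": 1000, "request_end_time_ms": 9000},
    {"chat_id": "a2", "parent_chat_id": "a1",
     "request_ready_time_ms": 12500, "request_end_time_ms": 20000},
]
for rec in annotate(raw):
    print(json.dumps(rec))
```

Here the second turn became ready 3500 ms after its parent ended, so that is the gap a `thinktime` replay would wait after the parent's replayed completion.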

---

## Project map

- [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md): GPU-hit-first paper outline (the thesis).
- [`v2/`](v2/): evidence experiments
  - `exp_a_tier_latency/`: KV-hit cost by tier (GPU < CPU-local < remote-RDMA < miss).
  - `exp_b_capacity_knee/`: realized APC / latency knee vs GPU capacity.
  - `exp_c_dispatch_ablation/`: the replay-mode study above.
- `replayer/`: trace replayer (`--dispatch-mode`, closed-loop think-time).
- `scripts/add_time_to_parent.py`: trace annotation for `thinktime`.
- `microbench/`, `analysis/`: PD-disagg, routing, workload characterization.