Gahow Wang 8a6b22c11c Replayer think-time dispatch mode + benchmarking guidance
Adds `--dispatch-mode {tracets,thinktime}` to the replayer and documents that
agentic serving should be benchmarked with `thinktime` (the faithful load).

- `tracets` (old default): turn-k at the absolute trace timestamp, i.e.
  max(prev_finished, trace_ts) -- collapses inter-turn think-time to ~0 when the
  system is behind, manufacturing request bursts.
- `thinktime`: turn-1 at trace arrival; turn-k at prev_finished +
  time_to_parent_chat (real production gap). scripts/add_time_to_parent.py
  annotates a trace with that gap from the raw trace's request_ready/end_ms.

exp(c) ablation (v2/exp_c_dispatch_ablation/): at N=8 (capacity slack) thinktime
beats tracets -- E2E p90 -28% (73.5 vs 102.8s), TTFT p90 -29%, TPS +7%, because
tracets' bursts spike concurrency -> KV pressure -> preemption. At N=6
(saturated) they converge. So tracets makes the system look ~30% worse on tail
latency than realistic agent pacing. Root README.md carries the headline
guidance; raw per-request metrics gitignored (perf_summary.json kept).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 16:28:36 +08:00

# agentic-kv
Serving agentic LLM workloads by keeping the KV working set in GPU HBM
(GPU-hit-first). Research outline: [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md).
Evidence + experiments: [`v2/`](v2/).
---
## ⚠️ Benchmarking methodology — read this first
> **Replay agentic traces with `--dispatch-mode thinktime`, not the default
> `tracets`.** It is the faithful, more realistic load — and the dispatch mode
> materially changes the performance you measure.
The replayer offers two ways to time each turn:
| mode | turn-k dispatched at | what it models |
|---|---|---|
| `tracets` (default) | `max(prev_turn_finished, trace_ts)` | absolute production schedule |
| **`thinktime` (use this)** | `prev_turn_finished + time_to_parent_chat` | real closed-loop agent pacing |
**Why it matters.** `tracets` collapses the inter-turn think-time to ~0 whenever
the system falls behind (it fires the next turn immediately because the trace
timestamp is already in the past). That manufactures **artificial request
bursts** — spiking instantaneous concurrency → KV-pool pressure → preemption →
inflated tail latency and wasted throughput. `thinktime` keeps each turn's real
gap (tool-exec + agent think), so the offered load is what a real agent produces.
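
The two timing rules in the table above can be sketched in a few lines. This is a minimal illustration, not the replayer's actual code: the function name `dispatch_time` and its signature are assumptions, and times are plain seconds on the replay clock.

```python
from typing import Optional


def dispatch_time(mode: str, prev_finished: float, trace_ts: float,
                  time_to_parent_chat: Optional[float]) -> float:
    """When to send turn-k (k > 1). Hypothetical helper, for illustration only."""
    if mode == "thinktime" and time_to_parent_chat is not None:
        # Closed-loop pacing: always wait the recorded gap after the parent turn.
        return prev_finished + time_to_parent_chat
    # tracets (also used here when the trace lacks the annotation):
    # never earlier than the trace timestamp, never before the parent finished.
    return max(prev_finished, trace_ts)


# Turn 2 was recorded at t=15 s with a 5 s gap after its parent.
# In replay the system is behind: the parent only finishes at t=40 s.
behind_tracets = dispatch_time("tracets", 40.0, 15.0, 5.0)      # gap collapses to 0
behind_thinktime = dispatch_time("thinktime", 40.0, 15.0, 5.0)  # gap stays 5 s
print(behind_tracets, behind_thinktime)
```

When the replay runs behind the trace, `tracets` returns `prev_finished` itself, so every lagging session fires its next turn at once; `thinktime` still spaces the turns by the recorded gap.
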
**Measured (w600 first-300s window, 8×H20, round-robin, 100% completion):**
| metric (N=8) | `tracets` (Mode 1) | **`thinktime` (Mode 2)** | Δ |
|---|---:|---:|---:|
| E2E p90 | 102.8 s | **73.5 s** | **-28%** |
| E2E p99 | 245 s | **227 s** | -7% |
| TTFT p90 | 56.1 s | **39.7 s** | **-29%** |
| system TPS | 111.8 | **119.3** | **+7%** |
| wall-clock | 967 s | **787 s** | -19% |
| TPOT p90 | 0.174 s | 0.188 s | ~flat |
So under realistic capacity, `tracets` makes the system look **~30% worse on
tail latency** than it actually is. Tell-tale: scaling 6→8 instances barely helped
`tracets` (975→967 s — its bursts re-saturate regardless of capacity) but helped
`thinktime` a lot (1125→787 s). Under heavy saturation (N=6) the two converge
(E2E p90 ≈ 118–120 s), since there is no slack for bursts to harm. Decode (TPOT)
is dispatch-independent everywhere.
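
The Δ column reads as the relative change of `thinktime` against `tracets`, so a negative value means `thinktime` is lower. The short check below recomputes it from the N=8 numbers in the table; the dictionary layout and the `pct_change` helper are illustrative, not part of the repository.

```python
# (tracets, thinktime) pairs copied from the N=8 table above.
rows = {
    "E2E p90": (102.8, 73.5),
    "E2E p99": (245.0, 227.0),
    "TTFT p90": (56.1, 39.7),
    "system TPS": (111.8, 119.3),
    "wall-clock": (967.0, 787.0),
}


def pct_change(baseline: float, new: float) -> float:
    """Relative change of `new` against `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


deltas = {name: pct_change(tracets, thinktime)
          for name, (tracets, thinktime) in rows.items()}

for name, delta in deltas.items():
    print(f"{name:<12} {delta:+.1f}%")
```

The results land within about one percentage point of the whole-number figures quoted in the table.
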
**Recommendation:** benchmark with `--dispatch-mode thinktime`; use `tracets`
only as an explicit bursty stress case. Full ablation:
[`v2/exp_c_dispatch_ablation/`](v2/exp_c_dispatch_ablation/).
### How to use it
```bash
# 1. annotate a trace with the real per-turn gap (one-time; scans the raw trace)
python scripts/add_time_to_parent.py traces/w600_r0.0015_st30.jsonl traces/w600_ttp.jsonl
# 2. replay closed-loop with faithful think-time
python -m replayer --trace traces/w600_ttp.jsonl --endpoint <eps> \
--model <model> --dispatch-mode thinktime
```
`time_to_parent_chat = this_turn.request_ready_time_ms - parent_turn.request_end_time_ms`,
computed from the raw trace and stored per request; turn-1 has none (fires at its
trace arrival). Traces without the field fall back to `tracets`.
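
The annotation step can be pictured as follows. This is a sketch under stated assumptions, not the contents of `scripts/add_time_to_parent.py`: the `chat_id` and `parent_chat_id` keys used to link a turn to its parent are hypothetical, while `request_ready_time_ms`, `request_end_time_ms`, and `time_to_parent_chat` are the names used above.

```python
from typing import Dict, List


def add_time_to_parent(records: List[dict]) -> List[dict]:
    """Return copies of the records, annotated with the gap to the parent turn.

    Assumes each record carries hypothetical `chat_id` / `parent_chat_id` keys.
    """
    # Index every turn's end time so children can look up their parent.
    end_ms: Dict[str, float] = {r["chat_id"]: r["request_end_time_ms"] for r in records}
    annotated = []
    for r in records:
        out = dict(r)
        parent = r.get("parent_chat_id")
        if parent is not None and parent in end_ms:
            # Gap between the parent finishing and this turn becoming ready (ms).
            out["time_to_parent_chat"] = r["request_ready_time_ms"] - end_ms[parent]
        # Turn-1 has no parent, so it gets no field and fires at trace arrival.
        annotated.append(out)
    return annotated


raw = [
    {"chat_id": "a1", "parent_chat_id": None,
     "request_ready_time_ms": 0, "request_end_time_ms": 10_000},
    {"chat_id": "a2", "parent_chat_id": "a1",
     "request_ready_time_ms": 15_000, "request_end_time_ms": 22_000},
]
result = add_time_to_parent(raw)
print(result[1]["time_to_parent_chat"])
```

In this toy trace the second turn becomes ready 5 000 ms after its parent ends, and that value is what `thinktime` would add to the parent's replayed finish time.
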
---
## Project map
- [`PAPER_OUTLINE.md`](PAPER_OUTLINE.md) – GPU-hit-first paper outline (the thesis).
- [`v2/`](v2/) – evidence experiments:
  - `exp_a_tier_latency/` – KV-hit cost by tier (GPU < CPU-local < remote-RDMA < miss).
  - `exp_b_capacity_knee/` – realized APC / latency knee vs GPU capacity.
  - `exp_c_dispatch_ablation/` – the replay-mode study above.
- `replayer/` – trace replayer (`--dispatch-mode`, closed-loop think-time).
- `scripts/add_time_to_parent.py` – trace annotation for `thinktime`.
- `microbench/`, `analysis/` – PD-disagg, routing, workload characterization.