agentic-kvc/v2/exp_d_policy_dispatch/README.md
Gahow Wang 0b180c191e v2 exp(d): expand figure to 6 panels (TTFT/E2E mean+p90, TPS, per-worker GPU util)
Per request: TTFT mean+p90, E2E mean+p90, decode TPS (output goodput; total/
prefill TPS omitted as cache-miss-inflated), and per-worker GPU-util boxplots
(8 workers/arm, tracets vs thinktime) showing utilization level + balance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 21:10:27 +08:00

# exp (d) — 5-policy routing under `tracets` vs `thinktime`
exp (c) showed the **dispatch mode** changes measured performance for a single
round-robin policy, and predicted: *"a cache-aware policy (LPWL) would lower the
latencies and likely **widen** the thinktime advantage."* exp (d) tests that with
the full routing comparison — and finds something stronger: **the dispatch mode
flips which policy wins.**
**Question.** Does the parameter-free LPWL still beat the tuned `unified+A+B`
baseline once we benchmark with the *faithful* `thinktime` load instead of the
`tracets` burst artifact?
## Setup
5 routing policies, each its own **fresh vLLM (cold APC)** on dash0 8×H20,
Qwen3-Coder-30B-A3B, via `scripts/b3_isolated_policy.sh`. **Both dispatch modes
run on the *same* trace** `traces/w600_r0.0015_st30_first600s_ttp.jsonl` (807
reqs, 274 sessions) — the only variable is `REPLAY_DISPATCH_MODE`
(`tracets` ignores the `time_to_parent_chat` field, `thinktime` consumes it).
Analyzer: `scripts/bench_report.py` (summaries in `results/`).
- `leastwork`: **LPWL**, parameter-free (`pending_prefill + max(0, input - cache_hit)`)
- `unified_ab`: unified hybrid, tuned A+B (`of=1.3, lmw=0.01`)
- `unified_def`: unified hybrid, defaults (`of=2.0, lmw=0.0`)
- `lmetric`: P_tokens × BS, no affinity
- `sticky`: hard session affinity
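
The LPWL score can be read as "prefill tokens already queued on a worker, plus the prefill tokens this request would still need there after its cache hit". The sketch below assumes that reading, i.e. that the term inside `max` is `input - cache_hit`. The names `Worker`, `lpwl_score` and `pick_worker`, and the numbers in the usage, are hypothetical and not taken from the router's code.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    pending_prefill: int  # prefill tokens already queued on this worker
    cache_hit: int        # tokens of THIS request's prompt assumed cached here


def lpwl_score(worker: Worker, input_tokens: int) -> int:
    """pending_prefill + max(0, input - cache_hit); lower is better."""
    return worker.pending_prefill + max(0, input_tokens - worker.cache_hit)


def pick_worker(workers: list[Worker], input_tokens: int) -> Worker:
    # min() returns the first minimum, so ties go to list order.
    return min(workers, key=lambda w: lpwl_score(w, input_tokens))


# Hypothetical numbers: a busy worker with a warm cache vs an idle cold one.
workers = [
    Worker("w0", pending_prefill=12_000, cache_hit=9_000),
    Worker("w1", pending_prefill=1_000, cache_hit=0),
]
chosen = pick_worker(workers, input_tokens=10_000)
print(chosen.name, lpwl_score(chosen, 10_000))
```

With these made-up numbers the idle worker wins (score 11000 vs 13000) despite its cold cache, which is the kind of trade-off a single score with no tunable weights has to resolve.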
## Result (ms; `figs/exp_d_policy_dispatch.png`)
The figure has 6 panels: TTFT mean/p90, E2E mean/p90, decode TPS (output
goodput), and per-worker GPU-util boxplots (8 workers/arm). Decode TPS is the
honest throughput metric, since total/prefill TPS is inflated by cache-miss
recompute (e.g. LMetric). On decode TPS, thinktime ≥ tracets for every policy
(the system drains faster with real think-time). The GPU-util boxplots show that
LPWL also keeps the tightest worker balance.
| policy | mode | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| **LPWL** | tracets | 11099 | 9827 | 25366 | 93929 | 33 | 0.650 | **1.49×** |
| **LPWL** | **thinktime** | **6713** | **6788** | **17635** | 69946 | **18** | 0.676 | 1.94× |
| unified+A+B | tracets | 10783 | 8531 | 22063 | 75419 | 21 | 0.667 | 1.54× |
| unified+A+B | thinktime | 9736 | 7131 | 18690 | **63788** | 19 | 0.676 | 2.16× |
| unified default | tracets | 12997 | 8366 | 22819 | 82257 | 20 | 0.693 | 1.56× |
| unified default | thinktime | 11268 | 7975 | 24096 | 72334 | 22 | 0.693 | 2.91× |
| LMetric | tracets | 16492 | 10775 | 27791 | 99231 | 39 | 0.495 | 2.19× |
| LMetric | thinktime | 15607 | 9902 | 27819 | 73672 | 30 | 0.483 | 2.10× |
| sticky | tracets | 15236 | 10139 | 27974 | 82362 | 31 | 0.693 | 2.06× |
| sticky | thinktime | 14838 | 8663 | 24966 | 70933 | 24 | 0.694 | 2.48× |
### Finding 1 — `thinktime` helps every policy, but helps **LPWL the most**
Per-policy `tracets` → `thinktime` change (negative = thinktime better):
| policy | ΔTTFT p90 | ΔE2E mean | ΔTPOT p90 |
|---|---:|---:|---:|
| **LPWL** | **-40%** | **-31%** | **-45%** |
| unified+A+B | -10% | -16% | -10% |
| unified default | -13% | -5% | +10% |
| LMetric | -5% | -8% | -23% |
| sticky | -3% | -15% | -23% |
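
These deltas follow from the Result table as `(thinktime - tracets) / tracets`, so a negative value means thinktime is better. A short sketch that recomputes them; the dictionary layout and the helper name `pct_change` are illustrative, while the numbers are copied from the Result table.

```python
# (TTFT p90, E2E mean, TPOT p90) in ms, copied from the Result table.
RESULTS = {
    "LPWL":            {"tracets": (11099, 9827, 33),  "thinktime": (6713, 6788, 18)},
    "unified+A+B":     {"tracets": (10783, 8531, 21),  "thinktime": (9736, 7131, 19)},
    "unified default": {"tracets": (12997, 8366, 20),  "thinktime": (11268, 7975, 22)},
    "LMetric":         {"tracets": (16492, 10775, 39), "thinktime": (15607, 9902, 30)},
    "sticky":          {"tracets": (15236, 10139, 31), "thinktime": (14838, 8663, 24)},
}


def pct_change(before: float, after: float) -> int:
    """Relative change in percent, rounded to the nearest integer."""
    return round(100 * (after - before) / before)


for policy, modes in RESULTS.items():
    deltas = [pct_change(b, a) for b, a in zip(modes["tracets"], modes["thinktime"])]
    print(f"{policy:16s}" + "".join(f"{d:+5d}%" for d in deltas))
```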
`tracets` collapses the inter-turn think-time to ~0 (exp c), manufacturing bursts
→ peak concurrency → KV pressure → preemption. Those bursts punish exactly the
policy that spreads prefill thinly across hosts (LPWL keeps the tightest request
balance, 1.49×), because under a burst the spread sacrifices locality without the
slack to amortize it. Remove the artifact and LPWL's prefill-aware placement pays.
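
To see why collapsing think-time raises peak concurrency, consider a toy model. It is not the replay harness: each session issues its turns sequentially, each turn occupies the server for a fixed service time, and the next turn starts after a think-time gap. The sketch counts peak in-flight requests with and without the gap. The function name `peak_concurrency` and all numbers are made up for illustration.

```python
def peak_concurrency(starts, turns, service_s, think_s):
    """Peak number of in-flight requests when every session issues `turns`
    sequential requests of `service_s` seconds, separated by `think_s`."""
    events = []
    for start in starts:
        t = start
        for _ in range(turns):
            events.append((t, +1))              # request begins
            events.append((t + service_s, -1))  # request ends
            t += service_s + think_s            # next turn after the think gap
    in_flight = peak = 0
    # Tuple sorting puts ends (-1) before starts (+1) at equal timestamps,
    # so back-to-back requests are not double counted.
    for _, delta in sorted(events):
        in_flight += delta
        peak = max(peak, in_flight)
    return peak


starts = [0, 1, 2, 3]  # four sessions arriving one second apart
burst = peak_concurrency(starts, turns=3, service_s=2, think_s=0)
paced = peak_concurrency(starts, turns=3, service_s=2, think_s=6)
print(burst, paced)
```

In this toy the peak drops from 4 to 2 once the think-time gap is restored, with the same requests and the same arrival order. Only the pacing differs.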
### Finding 2 — the dispatch mode **flips the cross-policy ranking**
- **TTFT p90:** `tracets` → `unified_ab (10.8s) ≈ LPWL (11.1s)`: LPWL only *ties*,
  and is even slightly behind. `thinktime` → **LPWL (6.7s)** < unified_ab (9.7s):
  LPWL is first, **-31%** vs the tuned baseline.
- **E2E mean:** `tracets` → unified_def (8.4s) < unified_ab (8.5s) < **LPWL (9.8s)**:
  LPWL is *3rd, behind both unified variants*. `thinktime` → **LPWL (6.8s)** <
  unified_ab (7.1s) < unified_def (8.0s): LPWL is **first**.
So under artificial `tracets` bursts the parameter-free policy looks tied-or-worse;
under the faithful `thinktime` load it is the clear winner on TTFT and E2E, at
zero knobs and best balance.
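
The ranking flip can be checked mechanically by sorting the policies on each metric, per mode. A small sketch; the values are copied from the Result table (ms), and the helper name `rank` is illustrative.

```python
# (TTFT p90, E2E mean) in ms per policy and dispatch mode, from the Result table.
TABLE = {
    "LPWL":        {"tracets": (11099, 9827),  "thinktime": (6713, 6788)},
    "unified_ab":  {"tracets": (10783, 8531),  "thinktime": (9736, 7131)},
    "unified_def": {"tracets": (12997, 8366),  "thinktime": (11268, 7975)},
    "lmetric":     {"tracets": (16492, 10775), "thinktime": (15607, 9902)},
    "sticky":      {"tracets": (15236, 10139), "thinktime": (14838, 8663)},
}
METRICS = {"TTFT p90": 0, "E2E mean": 1}


def rank(mode: str, metric: str) -> list[str]:
    """Policies ordered best (lowest latency) to worst for one mode/metric."""
    idx = METRICS[metric]
    return sorted(TABLE, key=lambda policy: TABLE[policy][mode][idx])


for metric in METRICS:
    for mode in ("tracets", "thinktime"):
        print(f"{metric:9s} {mode:9s} " + " < ".join(rank(mode, metric)))
```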
## Conclusion
**Benchmark agentic routing with `thinktime`. Under it, the parameter-free LPWL is
the best of the five policies** (TTFT p90 -31%, E2E mean -5% / p90 -6%, best TPOT,
tightest balance vs the *tuned* `unified+A+B`), and the `tracets` burst artifact is
precisely what erases that advantage (it even drops LPWL to 3rd on E2E). This both
confirms exp (c)'s prediction and is independent evidence for the GPU-hit-first
routing story: faithful load rewards keeping the active working set GPU-resident.
## Caveats
- **n = 1 per arm.** The `tracets` ranking here does **not** reproduce the earlier
  dash1 `analysis/lpwl_5policy_600s.md` (which saw LPWL win TTFT p90 by 31% *in
  tracets*); on dash0 `tracets` it is a tie. That is, **`tracets` rankings are
  run/harness-sensitive**; the robust signal is the `thinktime` advantage, which
  appears in *both* environments. Repeat ×3 to bound noise.
- LPWL's one persistent weak spot is **E2E p99** (thinktime 69.9s vs unified_ab
  63.8s): the structural HEAVY+ >50k decode tail, identical across policies, not
  routing-fixable (see `lpwl_5policy_600s.md` κ-ablation).
- `thinktime` advantage is a capacity-slack effect; under saturation the modes
converge (exp c, N=6).
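
One way to act on the "repeat ×3" caveat is to report the mean and min-max spread per arm across repeats, and to trust a cross-policy difference only when the two ranges do not overlap. This is one conservative convention, not something the experiment prescribes. The function names are hypothetical and the sample values are placeholders, not measurements.

```python
from statistics import mean


def summarize(runs):
    """Mean and min-max spread of one metric across repeated runs of an arm."""
    return {"mean": mean(runs), "min": min(runs), "max": max(runs)}


def separated(a_runs, b_runs):
    """True if every run of arm A is lower (better) than every run of arm B."""
    return max(a_runs) < min(b_runs)


# Placeholder values only (ms); NOT measured results.
arm_a = [100.0, 110.0, 105.0]
arm_b = [120.0, 118.0, 130.0]
print(summarize(arm_a), separated(arm_a, arm_b))
```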
## Repro
```bash
# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl + the ttp field;
# first600s = timestamp<600 filter)
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
--output traces/w600_r0.0015_st30_ttp.jsonl \
--window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
--json v2/exp_d_policy_dispatch/results/thinktime.json \
leastwork unified_ab unified_def lmetric sticky
python v2/exp_d_policy_dispatch/plot.py
```