# exp (d) — 5-policy routing under `tracets` vs `thinktime`

exp (c) showed that the **dispatch mode** changes measured performance for a single
round-robin policy, and predicted: *"a cache-aware policy (LPWL) would lower the
latencies and likely **widen** the thinktime advantage."* exp (d) tests that
prediction with the full routing comparison and finds something stronger: **the
dispatch mode flips which policy wins.**

**Question.** Does the parameter-free LPWL still beat the tuned `unified+A+B`
baseline once we benchmark with the *faithful* `thinktime` load instead of the
`tracets` burst artifact?

## Setup

Five routing policies, each on its own **fresh vLLM (cold APC)** on dash0 8×H20,
Qwen3-Coder-30B-A3B, via `scripts/b3_isolated_policy.sh`. **Both dispatch modes
run on the *same* trace**, `traces/w600_r0.0015_st30_first600s_ttp.jsonl` (807
reqs, 274 sessions); the only variable is `REPLAY_DISPATCH_MODE`
(`tracets` ignores the `time_to_parent_chat` field, `thinktime` consumes it).
Analyzer: `scripts/bench_report.py` (summaries in `results/`).

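
To make the difference between the two dispatch modes concrete, here is a minimal sketch of how a replayer might decide when to send a follow-up turn. It is not the harness code: the function `dispatch_time`, the `timestamp` field, and the exact semantics (a `tracets` follow-up waits only for its parent; a `thinktime` follow-up waits an extra `time_to_parent_chat` seconds after the parent completes) are assumptions made for illustration.

```python
from typing import Optional


def dispatch_time(req: dict, mode: str, parent_done_at: Optional[float]) -> float:
    """Return the replay-clock time (seconds) at which `req` may be sent.

    Assumed record fields (hypothetical semantics): `timestamp` is the arrival
    time in the trace, `time_to_parent_chat` is the recorded gap between the
    parent turn and this turn.
    """
    if parent_done_at is None:
        # First turn of a session: both modes follow the trace arrival time.
        return req["timestamp"]
    if mode == "tracets":
        # Ignores time_to_parent_chat: a follow-up waits only for its parent,
        # so when the replay runs slower than the trace the gap shrinks to ~0.
        return max(req["timestamp"], parent_done_at)
    if mode == "thinktime":
        # Consumes time_to_parent_chat: the recorded gap is re-inserted after
        # the parent actually completes in the replay.
        return parent_done_at + req["time_to_parent_chat"]
    raise ValueError(f"unknown dispatch mode: {mode}")


# A follow-up turn whose parent finished late (t=20s) in the replay.
turn = {"timestamp": 12.0, "time_to_parent_chat": 8.0}
tracets_at = dispatch_time(turn, "tracets", parent_done_at=20.0)
thinktime_at = dispatch_time(turn, "thinktime", parent_done_at=20.0)
print(tracets_at, thinktime_at)
```

Under these assumptions the `tracets` turn fires the moment its parent returns, while the `thinktime` turn is held back by the recorded gap, which is the mechanism that spreads out concurrent load.
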
- `leastwork`: **LPWL**, parameter-free (`pending_prefill + max(0, input−cache_hit)`)
- `unified_ab`: unified hybrid, tuned A+B′ (`of=1.3, lmw=0.01`)
- `unified_def`: unified hybrid, defaults (`of=2.0, lmw=0.0`)
- `lmetric`: P_tokens × BS, no affinity
- `sticky`: hard session affinity

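
The LPWL expression in the list above can be sketched as a per-worker score. This is an illustration only: the `Worker` class, the `pick_worker` helper, and the assumption that the router sends the request to the lowest-score worker are hypothetical and not taken from the router source.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    pending_prefill: int  # prefill tokens already queued on this worker
    cache_hit: int        # prompt tokens of the new request cached on this worker


def lpwl_score(w: Worker, input_tokens: int) -> int:
    # pending_prefill + max(0, input - cache_hit): queued prefill work plus
    # the part of this request that would still have to be prefilled.
    return w.pending_prefill + max(0, input_tokens - w.cache_hit)


def pick_worker(workers: list[Worker], input_tokens: int) -> Worker:
    # Assumption: the request goes to the worker with the smallest score.
    return min(workers, key=lambda w: lpwl_score(w, input_tokens))


workers = [
    Worker("w0", pending_prefill=4000, cache_hit=9000),    # score 5000
    Worker("w1", pending_prefill=500, cache_hit=0),        # score 10500
    Worker("w2", pending_prefill=12000, cache_hit=10000),  # score 12000
]
chosen = pick_worker(workers, input_tokens=10000)
print(chosen.name)
```

In this made-up example the warm worker `w0` wins even though `w1` is nearly idle, because the cache hit removes most of the new prefill work; the score has no tunable weights, which is what "parameter-free" refers to.
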
## Result (ms; `figs/exp_d_policy_dispatch.png`)

The figure has 6 panels: TTFT mean/p90, E2E mean/p90, decode TPS (output
goodput), and the per-worker GPU-util box (8 workers/arm). Decode TPS is the
honest throughput metric (total/prefill TPS is inflated by cache-miss recompute,
e.g. LMetric); thinktime ≥ tracets on it everywhere (the system drains faster
with real think-time). The GPU-util box shows LPWL also keeps the tightest
worker balance.

| policy | mode | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| **LPWL** | tracets | 11099 | 9827 | 25366 | 93929 | 33 | 0.650 | **1.49×** |
| **LPWL** | **thinktime** | **6713** | **6788** | **17635** | 69946 | **18** | 0.676 | 1.94× |
| unified+A+B | tracets | 10783 | 8531 | 22063 | 75419 | 21 | 0.667 | 1.54× |
| unified+A+B | thinktime | 9736 | 7131 | 18690 | **63788** | 19 | 0.676 | 2.16× |
| unified default | tracets | 12997 | 8366 | 22819 | 82257 | 20 | 0.693 | 1.56× |
| unified default | thinktime | 11268 | 7975 | 24096 | 72334 | 22 | 0.693 | 2.91× |
| LMetric | tracets | 16492 | 10775 | 27791 | 99231 | 39 | 0.495 | 2.19× |
| LMetric | thinktime | 15607 | 9902 | 27819 | 73672 | 30 | 0.483 | 2.10× |
| sticky | tracets | 15236 | 10139 | 27974 | 82362 | 31 | 0.693 | 2.06× |
| sticky | thinktime | 14838 | 8663 | 24966 | 70933 | 24 | 0.694 | 2.48× |

### Finding 1 — `thinktime` helps every policy, but helps **LPWL the most**

Per-policy `tracets`→`thinktime` change (negative = thinktime better):

| policy | ΔTTFT p90 | ΔE2E mean | ΔTPOT p90 |
|---|---:|---:|---:|
| **LPWL** | **−40%** | **−31%** | **−45%** |
| unified+A+B | −10% | −16% | −10% |
| unified default | −13% | −5% | +10% |
| LMetric | −5% | −8% | −23% |
| sticky | −3% | −15% | −23% |

Every policy improves on TTFT p90 and E2E mean; the only regression in this table
is unified default's TPOT p90 (+10%).

`tracets` collapses the inter-turn think-time to ~0 (exp c), manufacturing bursts
→ peak concurrency → KV pressure → preemption. Those bursts punish exactly the
policy that spreads prefill thinly across hosts (LPWL keeps the tightest request
balance, 1.49×), because under a burst the spread sacrifices locality without the
slack to amortize it. Remove the artifact and LPWL's prefill-aware placement pays
off.

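
The percentages in the per-policy change table can be recomputed from the Result table as `(thinktime − tracets) / tracets`. The snippet below is a small check written for this note, not part of `scripts/bench_report.py`; the dictionary layout and the helper name `pct_change` are illustrative.

```python
# (TTFT p90, E2E mean, TPOT p90) in ms, copied from the Result table.
results = {
    "LPWL":            {"tracets": (11099, 9827, 33),  "thinktime": (6713, 6788, 18)},
    "unified+A+B":     {"tracets": (10783, 8531, 21),  "thinktime": (9736, 7131, 19)},
    "unified default": {"tracets": (12997, 8366, 20),  "thinktime": (11268, 7975, 22)},
    "LMetric":         {"tracets": (16492, 10775, 39), "thinktime": (15607, 9902, 30)},
    "sticky":          {"tracets": (15236, 10139, 31), "thinktime": (14838, 8663, 24)},
}


def pct_change(before: float, after: float) -> int:
    # Relative change in percent, rounded to the nearest integer.
    return round(100 * (after - before) / before)


deltas = {
    policy: tuple(pct_change(b, a) for b, a in zip(m["tracets"], m["thinktime"]))
    for policy, m in results.items()
}

for policy, (ttft, e2e, tpot) in deltas.items():
    print(f"{policy:16s} {ttft:+4d}% {e2e:+4d}% {tpot:+4d}%")
```
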
### Finding 2 — the dispatch mode **flips the cross-policy ranking**

- **TTFT p90:** `tracets` → `unified_ab (10.8s) ≈ LPWL (11.1s)`: LPWL only *ties*,
  and is even slightly behind. `thinktime` → **LPWL (6.7s)** < unified_ab (9.7s):
  LPWL is first, **−31%** vs the tuned baseline.
- **E2E mean:** `tracets` → unified_def (8.4s) < unified_ab (8.5s) < **LPWL (9.8s)**:
  LPWL is *3rd, behind both unified variants*. `thinktime` → **LPWL (6.8s)** <
  unified_ab (7.1s) < unified_def (8.0s): LPWL is **first**.

So under artificial `tracets` bursts the parameter-free policy looks tied-or-worse;
under the faithful `thinktime` load it is the clear winner on TTFT p90 and E2E
mean, with zero knobs and the tightest balance of the `thinktime` arms.

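
The ranking flip can be checked directly by sorting the E2E mean column of the Result table per dispatch mode. This is a small illustrative check; the variable and function names are made up for this note.

```python
# E2E mean (ms) per dispatch mode and policy, copied from the Result table.
e2e_mean = {
    "tracets":   {"LPWL": 9827, "unified_ab": 8531, "unified_def": 8366,
                  "lmetric": 10775, "sticky": 10139},
    "thinktime": {"LPWL": 6788, "unified_ab": 7131, "unified_def": 7975,
                  "lmetric": 9902, "sticky": 8663},
}


def ranking(values: dict[str, int]) -> list[str]:
    # Lower latency is better, so sort policies by ascending E2E mean.
    return sorted(values, key=values.get)


ranks = {mode: ranking(v) for mode, v in e2e_mean.items()}
lpwl_place = {mode: r.index("LPWL") + 1 for mode, r in ranks.items()}

for mode, order in ranks.items():
    print(f"{mode:10s} {' < '.join(order)}  (LPWL is #{lpwl_place[mode]})")
```
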
## Conclusion

**Benchmark agentic routing with `thinktime`. Under it, the parameter-free LPWL is
the best of the five policies**: against the *tuned* `unified+A+B` it has TTFT p90
−31%, E2E mean −5% / p90 −6%, the best TPOT, and the tightest balance. The
`tracets` burst artifact is precisely what erases that advantage (it even drops
LPWL to 3rd on E2E mean). This both confirms exp (c)'s prediction and is
independent evidence for the GPU-hit-first routing story: faithful load rewards
keeping the active working set GPU-resident.

## Caveats

- **n = 1 per arm.** The `tracets` ranking here does **not** reproduce the earlier
  dash1 `analysis/lpwl_5policy_600s.md` (which saw LPWL win TTFT p90 by −31% *in
  tracets*); on dash0 `tracets` it is a tie. That is, **`tracets` rankings are
  run/harness-sensitive**; the robust signal is the `thinktime` advantage, which
  appears in *both* environments. Repeat ×3 to bound noise.
- LPWL's one persistent weak spot is **E2E p99** (thinktime 69.9s vs unified_ab
  63.8s): the structural HEAVY+ >50k decode tail, identical across policies, not
  routing-fixable (see `lpwl_5policy_600s.md` κ-ablation).
- The `thinktime` advantage is a capacity-slack effect; under saturation the modes
  converge (exp c, N=6).

## Repro

```bash
# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
  051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl + the ttp field;
#    first600s = timestamp<600 filter)
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
  --output traces/w600_r0.0015_st30_ttp.jsonl \
  --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
  bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
  --json v2/exp_d_policy_dispatch/results/thinktime.json \
  leastwork unified_ab unified_def lmetric sticky
python v2/exp_d_policy_dispatch/plot.py
```