Per request: TTFT mean+p90, E2E mean+p90, decode TPS (output goodput; total/prefill TPS omitted as cache-miss-inflated), and per-worker GPU-util boxplots (8 workers/arm, tracets vs thinktime) showing utilization level + balance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
exp (d) — 5-policy routing under tracets vs thinktime
exp (c) showed the dispatch mode changes measured performance for a single round-robin policy, and predicted: "a cache-aware policy (LPWL) would lower the latencies and likely widen the thinktime advantage." exp (d) tests that with the full routing comparison — and finds something stronger: the dispatch mode flips which policy wins.
Question. Does the parameter-free LPWL still beat the tuned unified+A+B
baseline once we benchmark with the faithful thinktime load instead of the
tracets burst artifact?
Setup
5 routing policies, each its own fresh vLLM (cold APC) on dash0 8×H20,
Qwen3-Coder-30B-A3B, via scripts/b3_isolated_policy.sh. Both dispatch modes
run on the same trace traces/w600_r0.0015_st30_first600s_ttp.jsonl (807
reqs, 274 sessions) — the only variable is REPLAY_DISPATCH_MODE
(tracets ignores the time_to_parent_chat field, thinktime consumes it).
Analyzer: scripts/bench_report.py (summaries in results/).
- leastwork: LPWL, parameter-free (pending_prefill + max(0, input − cache_hit))
- unified_ab: unified hybrid, tuned A+B′ (of=1.3, lmw=0.01)
- unified_def: unified hybrid, defaults (of=2.0, lmw=0.0)
- lmetric: P_tokens × BS, no affinity
- sticky: hard session affinity
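The LPWL score in the list above can be sketched as follows. This is a minimal illustration of the formula pending_prefill + max(0, input − cache_hit), not the router's actual code: the `WorkerState` class, its field names, and the tie-breaking rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical per-worker snapshot; field names are illustrative only."""
    name: str
    pending_prefill: int       # prefill tokens already queued on this worker
    cached_prefix_tokens: int  # tokens of this request's prompt already cached there


def lpwl_score(worker: WorkerState, input_tokens: int) -> int:
    """pending_prefill + max(0, input - cache_hit): estimated prefill work if routed here."""
    uncached = max(0, input_tokens - worker.cached_prefix_tokens)
    return worker.pending_prefill + uncached


def pick_worker(workers: list[WorkerState], input_tokens: int) -> WorkerState:
    # Lowest score wins; min() keeps the first worker on ties (an assumption).
    return min(workers, key=lambda w: lpwl_score(w, input_tokens))


workers = [
    WorkerState("w0", pending_prefill=4000, cached_prefix_tokens=9000),
    WorkerState("w1", pending_prefill=0, cached_prefix_tokens=0),
]
chosen = pick_worker(workers, input_tokens=10000)
print(chosen.name, lpwl_score(chosen, 10000))  # prints: w0 5000
```

In this toy case the idle worker w1 would have to prefill all 10000 tokens, while the busier w0 only needs 1000 uncached tokens on top of its 4000 queued, so the score prefers the cache hit without any tunable weight.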
Result (ms; figs/exp_d_policy_dispatch.png)
The figure has 6 panels — TTFT mean/p90, E2E mean/p90, decode TPS (output goodput), and the per-worker GPU-util box (8 workers/arm). Decode TPS is the honest throughput metric (total/prefill TPS is inflated by cache-miss recompute, e.g. LMetric); thinktime ≥ tracets on it everywhere (the system drains faster with real think-time). The GPU-util box shows LPWL also keeps the tightest worker balance.
| policy | mode | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---|---|---|---|---|---|---|
| LPWL | tracets | 11099 | 9827 | 25366 | 93929 | 33 | 0.650 | 1.49× |
| LPWL | thinktime | 6713 | 6788 | 17635 | 69946 | 18 | 0.676 | 1.94× |
| unified+A+B | tracets | 10783 | 8531 | 22063 | 75419 | 21 | 0.667 | 1.54× |
| unified+A+B | thinktime | 9736 | 7131 | 18690 | 63788 | 19 | 0.676 | 2.16× |
| unified default | tracets | 12997 | 8366 | 22819 | 82257 | 20 | 0.693 | 1.56× |
| unified default | thinktime | 11268 | 7975 | 24096 | 72334 | 22 | 0.693 | 2.91× |
| LMetric | tracets | 16492 | 10775 | 27791 | 99231 | 39 | 0.495 | 2.19× |
| LMetric | thinktime | 15607 | 9902 | 27819 | 73672 | 30 | 0.483 | 2.10× |
| sticky | tracets | 15236 | 10139 | 27974 | 82362 | 31 | 0.693 | 2.06× |
| sticky | thinktime | 14838 | 8663 | 24966 | 70933 | 24 | 0.694 | 2.48× |
Finding 1 — thinktime helps every policy, but helps LPWL the most
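Per-policy tracets→thinktime change (negative = thinktime better). TTFT p90 and E2E mean improve for every policy; the only regression in these columns is unified default's TPOT p90 (+10%):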
| policy | ΔTTFT p90 | ΔE2E mean | ΔTPOT p90 |
|---|---|---|---|
| LPWL | −40% | −31% | −45% |
| unified+A+B | −10% | −16% | −10% |
| unified default | −13% | −5% | +10% |
| LMetric | −5% | −8% | −23% |
| sticky | −3% | −15% | −23% |
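The percentages in this table follow directly from the Result table. The sketch below recomputes them as (thinktime − tracets) / tracets, rounded to the nearest percent; the dictionary layout and the `pct_change` helper are illustrative, while the numbers are copied from the Result table.

```python
# (tracets, thinktime) pairs in ms, copied from the Result table.
results = {
    "LPWL":            {"TTFT p90": (11099, 6713),  "E2E mean": (9827, 6788),  "TPOT p90": (33, 18)},
    "unified+A+B":     {"TTFT p90": (10783, 9736),  "E2E mean": (8531, 7131),  "TPOT p90": (21, 19)},
    "unified default": {"TTFT p90": (12997, 11268), "E2E mean": (8366, 7975),  "TPOT p90": (20, 22)},
    "LMetric":         {"TTFT p90": (16492, 15607), "E2E mean": (10775, 9902), "TPOT p90": (39, 30)},
    "sticky":          {"TTFT p90": (15236, 14838), "E2E mean": (10139, 8663), "TPOT p90": (31, 24)},
}


def pct_change(tracets: float, thinktime: float) -> float:
    """Relative change from tracets to thinktime, in percent (negative = thinktime better)."""
    return 100.0 * (thinktime - tracets) / tracets


deltas = {
    policy: {metric: round(pct_change(*pair)) for metric, pair in metrics.items()}
    for policy, metrics in results.items()
}

for policy, row in deltas.items():
    print(f"{policy:16s}", "  ".join(f"{m} {v:+d}%" for m, v in row.items()))
```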
tracets collapses the inter-turn think-time to ~0 (exp c), manufacturing bursts
→ peak concurrency → KV pressure → preemption. Those bursts punish exactly the
policy that spreads prefill thinly across hosts (LPWL keeps the tightest request
balance, 1.49×), because under a burst the spread sacrifices locality without the
slack to amortize it. Remove the artifact and LPWL's prefill-aware placement pays.
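The first link of that chain (no think-time → bursts → higher peak concurrency) can be illustrated with a toy model. Everything here is an assumption made for illustration: fixed service time per turn, sessions whose next turn starts exactly one think-time after the previous turn finishes, and invented session offsets. It does not model KV pressure or preemption, and it is not the replay harness.

```python
def turn_intervals(session_starts, turns, service_s, think_s):
    """Busy intervals [begin, end) for every turn of every session (toy model)."""
    intervals = []
    for start in session_starts:
        t = start
        for _ in range(turns):
            intervals.append((t, t + service_s))
            t += service_s + think_s  # next turn waits for the think-time
    return intervals


def peak_concurrency(intervals):
    """Maximum number of simultaneously busy intervals (sweep line)."""
    events = []
    for begin, end in intervals:
        events.append((begin, 1))
        events.append((end, -1))
    events.sort()  # at equal times, -1 sorts before +1, so ends are processed first
    active = peak = 0
    for _, delta in events:
        active += delta
        peak = max(peak, active)
    return peak


starts = [0.0, 2.0, 4.0, 6.0]  # four sessions, staggered by 2 s (made-up values)
burst_peak = peak_concurrency(turn_intervals(starts, turns=3, service_s=2.0, think_s=0.0))
paced_peak = peak_concurrency(turn_intervals(starts, turns=3, service_s=2.0, think_s=6.0))
print(burst_peak, paced_peak)  # prints: 3 1
```

With identical work, collapsing the think-time triples the peak number of in-flight requests in this toy setup; in the real system that extra concurrency is what would translate into KV pressure.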
Finding 2 — the dispatch mode flips the cross-policy ranking
- TTFT p90:
  - tracets → unified_ab (10.8s) ≈ LPWL (11.1s): LPWL only ties, even slightly behind.
  - thinktime → LPWL (6.7s) < unified_ab (9.7s): LPWL is first, −31% vs the tuned baseline.
- E2E mean:
  - tracets → unified_def (8.4s) < unified_ab (8.5s) < LPWL (9.8s): LPWL is 3rd, behind both unified variants.
  - thinktime → LPWL (6.8s) < unified_ab (7.1s) < unified_def (8.0s): LPWL is first.
So under artificial tracets bursts the parameter-free policy looks tied-or-worse;
under the faithful thinktime load it is the clear winner on TTFT and E2E, at
zero knobs and best balance.
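The ranking flip can be checked by sorting the five policies per mode and metric. The sketch below uses the TTFT p90 and E2E mean values (ms) from the Result table; the nested-dictionary layout and the `ranking` helper are illustrative choices, not part of scripts/bench_report.py.

```python
# mode -> policy -> (TTFT p90, E2E mean) in ms, copied from the Result table.
table = {
    "tracets": {
        "LPWL": (11099, 9827), "unified_ab": (10783, 8531), "unified_def": (12997, 8366),
        "LMetric": (16492, 10775), "sticky": (15236, 10139),
    },
    "thinktime": {
        "LPWL": (6713, 6788), "unified_ab": (9736, 7131), "unified_def": (11268, 7975),
        "LMetric": (15607, 9902), "sticky": (14838, 8663),
    },
}
METRIC_INDEX = {"TTFT p90": 0, "E2E mean": 1}


def ranking(mode: str, metric: str) -> list[str]:
    """Policies ordered best (lowest latency) to worst for one mode and metric."""
    idx = METRIC_INDEX[metric]
    return sorted(table[mode], key=lambda policy: table[mode][policy][idx])


for metric in METRIC_INDEX:
    for mode in table:
        order = ranking(mode, metric)
        print(f"{metric:9s} {mode:9s} LPWL rank {order.index('LPWL') + 1}: {' < '.join(order)}")
```

LPWL moves from rank 2 (TTFT p90) and rank 3 (E2E mean) under tracets to rank 1 on both under thinktime.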
Conclusion
Benchmark agentic routing with thinktime. Under it, the parameter-free LPWL is
the best of the five policies — TTFT p90 −31%, E2E mean −5% / p90 −6%, best TPOT,
tightest balance vs the tuned unified+A+B — and the tracets burst artifact is
precisely what erases that advantage (it even drops LPWL to 3rd on E2E). This both
confirms exp (c)'s prediction and is independent evidence for the GPU-hit-first
routing story: faithful load rewards keeping the active working set GPU-resident.
Caveats
- n = 1 per arm. The tracets ranking here does not reproduce the earlier dash1 analysis/lpwl_5policy_600s.md (which saw LPWL win TTFT p90 −31% in tracets); on dash0 tracets it is a tie. That is, tracets rankings are run/harness-sensitive; the robust signal is the thinktime advantage, which appears in both environments. Repeat ×3 to bound noise.
- LPWL's one persistent weak spot is E2E p99 (thinktime 69.9s vs unified_ab 63.8s): the structural HEAVY+ >50k decode tail, identical across policies, not routing-fixable (see lpwl_5policy_600s.md κ-ablation).
- thinktime advantage is a capacity-slack effect; under saturation the modes converge (exp c, N=6).
Repro
# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
    051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl + the ttp field;
#    first600s = timestamp<600 filter)
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
    --output traces/w600_r0.0015_st30_ttp.jsonl \
    --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
    bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
    --json v2/exp_d_policy_dispatch/results/thinktime.json \
    leastwork unified_ab unified_def lmetric sticky
python v2/exp_d_policy_dispatch/plot.py
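Step 2 writes traces/w600_r0.0015_st30_ttp.jsonl while step 3 reads the first600s variant, which the comment describes only as a "timestamp<600 filter". A minimal sketch of such a filter is given below. It is an assumption, not the repository's script: it presumes each JSONL record carries a numeric `timestamp` field in seconds relative to the start of the window, and the function name is hypothetical.

```python
import json


def filter_first_seconds(src_path, dst_path, limit_s=600.0, ts_field="timestamp"):
    """Copy JSONL records whose ts_field is below limit_s; return how many were kept.

    Assumes (not confirmed by the source) that ts_field holds seconds
    relative to the start of the sampled window.
    """
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            record = json.loads(line)
            if float(record[ts_field]) < limit_s:
                # Write the original line unchanged so other fields stay byte-identical.
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept


# Example call with the file names from the steps above:
# filter_first_seconds("traces/w600_r0.0015_st30_ttp.jsonl",
#                      "traces/w600_r0.0015_st30_first600s_ttp.jsonl")
```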