Files

Gahow Wang 0b180c191e v2 exp(d): expand figure to 6 panels (TTFT/E2E mean+p90, TPS, per-worker GPU util)

Per request: TTFT mean+p90, E2E mean+p90, decode TPS (output goodput; total/
prefill TPS omitted as cache-miss-inflated), and per-worker GPU-util boxplots
(8 workers/arm, tracets vs thinktime) showing utilization level + balance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

2026-05-30 21:10:27 +08:00

results

v2 exp(d): 5-policy routing under tracets vs thinktime — ranking flip

2026-05-30 20:59:18 +08:00

plot.py

v2 exp(d): expand figure to 6 panels (TTFT/E2E mean+p90, TPS, per-worker GPU util)

2026-05-30 21:10:27 +08:00

README.md

v2 exp(d): expand figure to 6 panels (TTFT/E2E mean+p90, TPS, per-worker GPU util)

2026-05-30 21:10:27 +08:00

README.md

exp (d) — 5-policy routing under `tracets` vs `thinktime`

exp (c) showed the dispatch mode changes measured performance for a single round-robin policy, and predicted: "a cache-aware policy (LPWL) would lower the latencies and likely widen the thinktime advantage." exp (d) tests that with the full routing comparison — and finds something stronger: the dispatch mode flips which policy wins.

Question. Does the parameter-free LPWL still beat the tuned unified+A+B baseline once we benchmark with the faithful thinktime load instead of the tracets burst artifact?

Setup

5 routing policies, each on its own fresh vLLM instance (cold APC) on dash0 8×H20, Qwen3-Coder-30B-A3B, via scripts/b3_isolated_policy.sh. Both dispatch modes run on the same trace, traces/w600_r0.0015_st30_first600s_ttp.jsonl (807 reqs, 274 sessions), so the only variable is REPLAY_DISPATCH_MODE: tracets ignores the time_to_parent_chat field, thinktime consumes it. Analyzer: scripts/bench_report.py (summaries in results/).

leastwork — LPWL, parameter-free (pending_prefill + max(0, input−cache_hit))
unified_ab — unified hybrid, tuned A+B′ (of=1.3, lmw=0.01)
unified_def — unified hybrid, defaults (of=2.0, lmw=0.0)
lmetric — P_tokens × BS, no affinity
sticky — hard session affinity
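To make the LPWL score concrete, here is a minimal sketch of the formula listed above, pending_prefill + max(0, input − cache_hit). The `Worker` type, its field names, and the "route to the lowest score" rule are assumptions for illustration, not the router's actual code.

```python
from dataclasses import dataclass


@dataclass
class Worker:
    name: str
    pending_prefill: int  # prefill tokens already queued on this worker
    cache_hit: int        # tokens of this request's prompt found in the worker's prefix cache


def lpwl_score(worker: Worker, input_tokens: int) -> int:
    """pending_prefill + max(0, input - cache_hit): prefill work still to be done."""
    return worker.pending_prefill + max(0, input_tokens - worker.cache_hit)


def pick_worker(workers, input_tokens):
    # Assumption: the request goes to the worker with the lowest score.
    return min(workers, key=lambda w: lpwl_score(w, input_tokens))


# Hypothetical state for a 10,000-token prompt.
workers = [
    Worker("w0", pending_prefill=4000, cache_hit=9000),  # busy, but warm cache
    Worker("w1", pending_prefill=0, cache_hit=0),        # idle, but cold cache
    Worker("w2", pending_prefill=1500, cache_hit=6000),
]
chosen = pick_worker(workers, input_tokens=10000)
print(chosen.name, lpwl_score(chosen, 10000))
```

In this made-up state the warm but busy worker wins over the idle cold one, because the score counts only the uncached part of the prompt. No weights are involved, which is what "parameter-free" refers to.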

Result (ms; `figs/exp_d_policy_dispatch.png`)

The figure has 6 panels: TTFT mean/p90, E2E mean/p90, decode TPS (output goodput), and per-worker GPU-util boxplots (8 workers/arm). Decode TPS is the honest throughput metric, because total/prefill TPS is inflated by cache-miss recompute (e.g. LMetric). Decode TPS under thinktime is at least as high as under tracets for every policy (the system drains faster with real think-time). The GPU-util boxplots show LPWL also keeps the tightest worker balance. In the table below, latency columns are in ms; APC and req-bal are ratios.

policy	mode	TTFT p90	E2E mean	E2E p90	E2E p99	TPOT p90	APC	req-bal
LPWL	tracets	11099	9827	25366	93929	33	0.650	1.49×
LPWL	thinktime	6713	6788	17635	69946	18	0.676	1.94×
unified+A+B	tracets	10783	8531	22063	75419	21	0.667	1.54×
unified+A+B	thinktime	9736	7131	18690	63788	19	0.676	2.16×
unified default	tracets	12997	8366	22819	82257	20	0.693	1.56×
unified default	thinktime	11268	7975	24096	72334	22	0.693	2.91×
LMetric	tracets	16492	10775	27791	99231	39	0.495	2.19×
LMetric	thinktime	15607	9902	27819	73672	30	0.483	2.10×
sticky	tracets	15236	10139	27974	82362	31	0.693	2.06×
sticky	thinktime	14838	8663	24966	70933	24	0.694	2.48×

Finding 1 — `thinktime` helps every policy, but helps LPWL the most

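Per-policy tracets→thinktime change (negative = thinktime better). TTFT p90 and E2E mean improve for every policy; the only regression in these three columns is unified default's TPOT p90 (+10%).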

policy	ΔTTFT p90	ΔE2E mean	ΔTPOT p90
LPWL	−40%	−31%	−45%
unified+A+B	−10%	−16%	−10%
unified default	−13%	−5%	+10%
LMetric	−5%	−8%	−23%
sticky	−3%	−15%	−23%
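The percentages above can be recomputed from the result table as (thinktime − tracets) / tracets. A small sketch, with the values copied from the table and only the variable names invented here:

```python
# (tracets, thinktime) pairs in ms, copied from the result table.
RESULTS = {
    "LPWL":            {"TTFT p90": (11099, 6713),  "E2E mean": (9827, 6788),  "TPOT p90": (33, 18)},
    "unified+A+B":     {"TTFT p90": (10783, 9736),  "E2E mean": (8531, 7131),  "TPOT p90": (21, 19)},
    "unified default": {"TTFT p90": (12997, 11268), "E2E mean": (8366, 7975),  "TPOT p90": (20, 22)},
    "LMetric":         {"TTFT p90": (16492, 15607), "E2E mean": (10775, 9902), "TPOT p90": (39, 30)},
    "sticky":          {"TTFT p90": (15236, 14838), "E2E mean": (10139, 8663), "TPOT p90": (31, 24)},
}


def pct_change(tracets: float, thinktime: float) -> int:
    """Relative change in percent, rounded; negative means thinktime is better."""
    return round(100 * (thinktime - tracets) / tracets)


deltas = {
    policy: {metric: pct_change(*pair) for metric, pair in metrics.items()}
    for policy, metrics in RESULTS.items()
}
for policy, row in deltas.items():
    print(policy, row)
```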

tracets collapses the inter-turn think-time to ~0 (exp c), manufacturing bursts → peak concurrency → KV pressure → preemption. Those bursts punish exactly the policy that spreads prefill thinly across hosts (LPWL keeps the tightest request balance, 1.49×), because under a burst the spread sacrifices locality without the slack to amortize it. Remove the artifact and LPWL's prefill-aware placement pays off.
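One way to picture how tracets can collapse think-time is the sketch below. It is an assumed model, not the replay harness's code: it supposes that tracets releases a follow-up turn at its recorded trace timestamp but never before its parent finishes, while thinktime waits time_to_parent_chat seconds after the parent finishes. The function and parameter names other than `time_to_parent_chat` are hypothetical.

```python
def dispatch_time(mode: str, trace_ts: float, parent_done: float,
                  time_to_parent_chat: float) -> float:
    """Replay-clock time (s) at which a follow-up turn is sent, under the assumed model."""
    if mode == "tracets":
        # Assumption: follow the recorded timestamp, but wait for the parent turn.
        return max(trace_ts, parent_done)
    if mode == "thinktime":
        # Consume the recorded think-time, measured from the parent's completion.
        return parent_done + time_to_parent_chat
    raise ValueError(f"unknown dispatch mode: {mode}")


# Hypothetical turn: recorded at t=40 s with 30 s of think-time, but in the
# replay its parent only finishes at t=45 s because the server is slower.
gaps = {
    mode: dispatch_time(mode, trace_ts=40.0, parent_done=45.0,
                        time_to_parent_chat=30.0) - 45.0
    for mode in ("tracets", "thinktime")
}
print(gaps)
```

Under this model, any turn whose parent finishes later in the replay than in the original trace is sent with zero gap under tracets, and many sessions doing this at once produce the burst.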

Finding 2 — the dispatch mode flips the cross-policy ranking

TTFT p90: tracets → unified_ab (10.8s) ≈ LPWL (11.1s): LPWL only ties, and is in fact slightly behind. thinktime → LPWL (6.7s) < unified_ab (9.7s): LPWL is first, −31% vs the tuned baseline.
E2E mean: tracets → unified_def (8.4s) < unified_ab (8.5s) < LPWL (9.8s): LPWL is 3rd, behind both unified variants. thinktime → LPWL (6.8s) < unified_ab (7.1s) < unified_def (8.0s): LPWL is first.

So under artificial tracets bursts the parameter-free policy looks tied-or-worse; under the faithful thinktime load it is the clear winner on TTFT and E2E, with zero tuning knobs and the best balance.
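The flip can be checked by sorting the five policies on each metric per dispatch mode. A short sketch using the TTFT p90 and E2E mean values from the result table (the dictionary layout and helper name are illustrative):

```python
# policy -> mode -> (TTFT p90, E2E mean) in ms, copied from the result table.
TABLE = {
    "LPWL":        {"tracets": (11099, 9827),  "thinktime": (6713, 6788)},
    "unified_ab":  {"tracets": (10783, 8531),  "thinktime": (9736, 7131)},
    "unified_def": {"tracets": (12997, 8366),  "thinktime": (11268, 7975)},
    "lmetric":     {"tracets": (16492, 10775), "thinktime": (15607, 9902)},
    "sticky":      {"tracets": (15236, 10139), "thinktime": (14838, 8663)},
}
METRICS = {"TTFT p90": 0, "E2E mean": 1}


def ranking(mode: str, metric: str) -> list:
    """Policies ordered best (lowest latency) to worst for one mode and metric."""
    idx = METRICS[metric]
    return sorted(TABLE, key=lambda policy: TABLE[policy][mode][idx])


for metric in METRICS:
    for mode in ("tracets", "thinktime"):
        order = ranking(mode, metric)
        print(f"{metric:9s} {mode:10s} LPWL rank {order.index('LPWL') + 1}: {order}")
```

With these numbers LPWL ranks 2nd (TTFT p90) and 3rd (E2E mean) under tracets, and 1st on both under thinktime.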

Conclusion

Benchmark agentic routing with thinktime. Under it, the parameter-free LPWL is the best of the five policies: against the tuned unified+A+B it shows TTFT p90 −31%, E2E mean −5%, E2E p90 −6%, the best TPOT p90, and the tightest balance. The tracets burst artifact is precisely what erases that advantage (it even drops LPWL to 3rd on E2E). This both confirms exp (c)'s prediction and is independent evidence for the GPU-hit-first routing story: faithful load rewards keeping the active working set GPU-resident.

Caveats

n = 1 per arm. The tracets ranking here does not reproduce the earlier dash1 analysis/lpwl_5policy_600s.md (which saw LPWL win TTFT p90 −31% in tracets); on dash0 tracets it is a tie. That is, tracets rankings are run- and harness-sensitive; the robust signal is the thinktime advantage, which appears in both environments. Repeat ×3 to bound noise.
LPWL's one persistent weak spot is E2E p99 (thinktime 69.9s vs unified_ab 63.8s). This is the structural HEAVY+ >50k decode tail, identical across policies and not routing-fixable (see lpwl_5policy_600s.md κ-ablation).
The thinktime advantage is a capacity-slack effect; under saturation the modes converge (exp c, N=6).

Repro

# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
    051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl + the ttp field;
#    first600s = timestamp<600 filter)
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
    --output traces/w600_r0.0015_st30_ttp.jsonl \
    --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
    bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
    --json v2/exp_d_policy_dispatch/results/thinktime.json \
    leastwork unified_ab unified_def lmetric sticky
python v2/exp_d_policy_dispatch/plot.py
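Step 2 mentions that the first600s trace is a timestamp<600 filter, but the commands above do not show that step. A minimal sketch of such a filter, assuming each JSONL record carries a numeric top-level `timestamp` in seconds; the function name and that field layout are assumptions, not the repo's tooling:

```python
import json


def filter_first_seconds(src_path, dst_path, limit_s=600.0, field="timestamp"):
    """Copy JSONL records whose `field` is below `limit_s`; return how many were kept."""
    kept = 0
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            if not line.strip():
                continue  # skip blank lines
            if json.loads(line)[field] < limit_s:
                dst.write(line if line.endswith("\n") else line + "\n")
                kept += 1
    return kept


# Possible usage with the paths from the steps above:
# filter_first_seconds("traces/w600_r0.0015_st30_ttp.jsonl",
#                      "traces/w600_r0.0015_st30_first600s_ttp.jsonl")
```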
