traces/w600_r0.0015_st30_first600s.jsonl: first-600s cut of the shipped w600 trace (807 reqs, 274 sessions, all turn-1s + early later-turns; theoretical APC ceiling ~70% vs 80% full). Faster iteration (~18 min/arm) but a colder, lower-locality regime; whitelisted alongside the parent anonymized trace.

analysis/lpwl_5policy_600s.md: LPWL vs LMetric/sticky/unified/unified+A+B on the 600s trace (dash1 8xH20, cold APC, n=1). LPWL is overall best with zero knobs: TTFT p90 7983 ms vs tuned A+B 11562 ms (-31%), E2E p90 -16%, best request balance; APC 0.648 (emergent affinity, far above LMetric 0.507); only loss is E2E p99 from heavy-class decode concentration. Demonstrates anti-overfit: A+B was tuned on full w600 yet is beaten by the knob-free policy on this regime. Includes the run_5policy_600s.sh repro driver.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LPWL vs 4 baselines — parameter-free routing for agentic workloads
Date: 2026-05-29. Hardware: dash1, 8×H20, Qwen3-Coder-30B-A3B, TP=1,
max_model_len=200000, fresh vLLM per arm (cold APC), --policy via
scripts/b3_isolated_policy.sh. Analyzer: scripts/bench_report.py.
Motivation
unified+A+B carries too many knobs (overload_factor, lmetric_decode_weight,
the 0.5 cache_ratio gate). Goal: a policy derived from the agentic pattern
with no tuned constants, that does not overfit.
LPWL (Least-Prefill-Work-Left)
scripts/cache_aware_proxy.py:pick_instance_leastwork, --policy leastwork:
score_i = pending_prefill_tokens_i + max(0, input_len − cache_hit_i) → argmin
Tie-break: fewest num_requests, then round-robin. Zero hyperparameters.
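A minimal sketch of the scoring and tie-break rule above, not the actual `pick_instance_leastwork` code. The `Instance` dataclass, its field names, and the module-level round-robin counter are assumptions for illustration; in particular, the per-request cache hit is stored on the instance only to keep the example short.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    # Hypothetical per-host view; field names are illustrative only.
    name: str
    pending_prefill_tokens: int  # prefill tokens queued but not yet computed
    num_requests: int            # requests currently in flight
    cache_hit_tokens: int        # prefix tokens of this request cached on this host


_rr = count()  # round-robin cursor, used only for the final tie-break


def lpwl_score(inst: Instance, input_len: int) -> int:
    """score_i = pending_prefill_tokens_i + max(0, input_len - cache_hit_i)."""
    return inst.pending_prefill_tokens + max(0, input_len - inst.cache_hit_tokens)


def pick_leastwork(instances: list[Instance], input_len: int) -> Instance:
    """Argmin of the score; ties go to fewest num_requests, then round-robin."""
    scored = [(lpwl_score(i, input_len), i.num_requests, i) for i in instances]
    best = min(s[:2] for s in scored)
    tied = [inst for score, n, inst in scored if (score, n) == best]
    return tied[next(_rr) % len(tied)]


if __name__ == "__main__":
    hosts = [
        Instance("a", pending_prefill_tokens=0, num_requests=2, cache_hit_tokens=0),
        Instance("b", pending_prefill_tokens=1000, num_requests=1, cache_hit_tokens=9000),
    ]
    # Host a scores 10000, host b scores 1000 + 1000 = 2000, so b is picked.
    print(pick_leastwork(hosts, input_len=10000).name)
```

The sketch contains no constants to tune: every term is a token count observed at routing time.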
Why this shape (straight from the workload):
- Decode is cheap (I/O ~217×) ⇒ the only load worth modeling is outstanding
  prefill token-work. No decode weight; dropping LMetric's `×num_requests`
  also makes an idle-but-decoding host score `input` (its true marginal cost),
  not 0, which fixes the empty-batch degeneracy for free.
- Cache-awareness is the affinity mechanism: a returning session's owner has
  `new_uncached ≈ 0`, so it sticks unless its prefill backlog exceeds the cache
  saving (`input`). The stick-vs-spill crossover is computed from real
  token-work: no `overload_factor`, no `cache_ratio` gate.
- Session skew degrades gracefully: a heavy session inflates its owner's
  `pending_prefill`, auto-diverting other sessions while the heavy one stays put
  (no cold re-prefill).
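The stick-vs-spill crossover can be checked with a few lines of arithmetic. The sketch below compares a session's owner (warm cache, some prefill backlog) against one alternative host with no cache for the session. The function names and token counts are hypothetical and chosen only to illustrate the crossover.

```python
def lpwl_score(pending_prefill: int, input_len: int, cache_hit: int) -> int:
    """Outstanding prefill work on the host plus the uncached part of this request."""
    return pending_prefill + max(0, input_len - cache_hit)


def sticks_to_owner(owner_backlog: int, input_len: int, owner_cache_hit: int,
                    other_backlog: int = 0) -> bool:
    """True if the owner scores strictly lower than a cache-cold alternative.

    An exact tie is left to the tie-break (fewest requests, then round-robin),
    so it is reported as "does not strictly stick".
    """
    owner = lpwl_score(owner_backlog, input_len, owner_cache_hit)
    other = lpwl_score(other_backlog, input_len, 0)
    return owner < other


if __name__ == "__main__":
    # Hypothetical returning turn: 30k-token input, 28k of it cached on the owner.
    # Against an idle cold host (score 30000) the owner sticks while its backlog
    # is below the 28k-token cache saving, and spills once the backlog exceeds it.
    print(sticks_to_owner(10_000, 30_000, 28_000))  # owner 12000 vs 30000
    print(sticks_to_owner(29_000, 30_000, 28_000))  # owner 31000 vs 30000
```

The idle cold host scores `input_len` rather than 0, so the comparison is always between real token-work on both sides.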
Results — 600s trace (w600_r0.0015_st30_first600s.jsonl, 807 reqs)
This is the colder regime (theoretical APC ceiling ≈ 70% vs 80% for full w600).
| policy | knobs | TTFT mean | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---|---|---|---|---|---|---|---|
| LPWL | 0 | 3398 | 7983 | 8116 | 19014 | 87024 | 26 | 0.648 | 1.55× |
| unified+A+B | 3 | 3876 | 11562 | 8199 | 22569 | 74266 | 25 | 0.661 | 1.56× |
| unified default | 2 | 5066 | 16389 | 10481 | 28427 | 96361 | 34 | 0.689 | 2.28× |
| LMetric | 0 | 4809 | 14037 | 10051 | 26726 | 97442 | 32 | 0.507 | 2.11× |
| sticky | 0 | 5758 | 20356 | 10815 | 34734 | 82732 | 28 | 0.696 | 3.86× |
(latencies ms; req-bal = max:min per-worker request count; this batch predates the GPU-capture harness change so per-worker GPU util reads N/A.)
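For reference, the req-bal column can be reproduced from a per-request routing log as sketched below. This is an illustrative helper, not the `bench_report.py` code; the function name and the input format (one worker id per routed request) are assumptions.

```python
from collections import Counter


def request_balance(assignments: list[str], workers: list[str]) -> float:
    """max:min ratio of per-worker request counts.

    `workers` lists every worker so that one receiving zero requests still
    counts; in that case the ratio is reported as infinity.
    """
    counts = Counter(assignments)
    per_worker = [counts.get(w, 0) for w in workers]
    lo, hi = min(per_worker), max(per_worker)
    return float("inf") if lo == 0 else hi / lo


if __name__ == "__main__":
    # Hypothetical log: worker a served 3 requests, worker b served 2.
    print(request_balance(["a", "a", "a", "b", "b"], ["a", "b"]))  # 1.5
```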
Findings
- LPWL is overall best with zero knobs: TTFT mean −12% / p90 −31%, E2E mean ~tie / p90 −16% vs the tuned unified+A+B; best request balance; TPOT p90 within 1 ms of the best (26 vs 25). Only loss is E2E p99 (+17%) from heavy-class decode concentration.
- The baselines bracket the problem and explain why LPWL works: sticky has
  the highest APC (0.696) but worst latency (hot-pin, 3.86× imbalance); LMetric
  has the worst APC (0.507) because `×num_requests` swallows the cache signal.
  LPWL drops exactly that factor, so locality re-emerges (APC 0.648, beside the
  explicit-affinity policies) while balance stays tight: the sweet spot, with
  no gate and no tuning.
- Anti-overfit, demonstrated: unified+A+B was tuned (of=1.3, lmw=0.01) on the full w600; on the colder 600s regime the parameter-free policy beats it by 31% TTFT p90. The tuning did not transfer; LPWL did.
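The headline percentages follow directly from the LPWL and unified+A+B rows of the results table. A short check, with the values copied from the table and illustrative dictionary keys:

```python
# Values in ms, copied from the 600s results table.
lpwl = {"ttft_mean": 3398, "ttft_p90": 7983, "e2e_mean": 8116,
        "e2e_p90": 19014, "e2e_p99": 87024}
unified_ab = {"ttft_mean": 3876, "ttft_p90": 11562, "e2e_mean": 8199,
              "e2e_p90": 22569, "e2e_p99": 74266}


def rel_delta_pct(new: float, base: float) -> float:
    """Relative change of `new` against `base`, in percent (negative = lower)."""
    return 100.0 * (new - base) / base


deltas = {k: round(rel_delta_pct(lpwl[k], unified_ab[k])) for k in lpwl}

if __name__ == "__main__":
    print(deltas)
    # {'ttft_mean': -12, 'ttft_p90': -31, 'e2e_mean': -1, 'e2e_p90': -16, 'e2e_p99': 17}
```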
Per-class TTFT (ms, p90 / p99): LPWL dominates except the floor
| class | LPWL p90 | A+B p90 | LPWL p99 | A+B p99 |
|---|---|---|---|---|
| WARM<5k | 319 | 324 | 1032 | 2092 |
| MED5-20k | 1618 | 1952 | 3013 | 33189 |
| HEAVY20-50k | 4851 | 6198 | 14599 | 29044 |
| HEAVY+>50k | 28942 | 33777 | 52651 | 50778 |
LPWL's only weak class is the workload-inherent HEAVY+>50k floor (≈tied across all policies). Elsewhere it avoids the mid-class tails the unified gate creates when it pins a mid request behind a 50k-token turn on a barely-warm owner.
Full-w600 cross-check (1214 reqs, outputs/lpwl_vs_ab_live/)
On the warmer full trace, LPWL vs unified+A+B is a wash: LPWL wins TTFT p90 (−14%) but loses TPOT (+38%) and per-worker balance. Combined claim across both regimes: LPWL ∈ [tied, clearly-better] vs a tuned baseline, at zero knobs.
Caveats / open work
- n=1 per arm. The 600s −31% TTFT p90 is corroborated by mean/p50/per-class, but repeat to bound run-to-run noise (no 3× repeats yet, by request — quick single set first).
- E2E-p99 deep tail is the one consistent LPWL weak spot (heavy-session decode
  concentration). Proposed knob-free fix: add `+ κ·ongoing_decode` with
  `κ = measured(TPOT/token) / prefill_throughput` (a derived hardware ratio,
  not a tuned scalar). Not yet implemented.
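Since the decode-aware variant is not yet implemented, the following is only a sketch of where the extra term would enter the score. The function name is hypothetical, `kappa` is taken as an input rather than derived (the measurement procedure is left to the proposal above), and the unit of `ongoing_decode` is an assumption.

```python
def lpwl_decode_score(pending_prefill: int, input_len: int, cache_hit: int,
                      ongoing_decode: float, kappa: float) -> float:
    """LPWL score plus a decode term weighted by a measured hardware ratio.

    ongoing_decode: outstanding decode load on the host (unit assumed here,
                    e.g. in-flight decoding work expressed in tokens).
    kappa:          conversion from decode load to prefill-token equivalents,
                    supplied by the caller; kappa = 0 recovers plain LPWL.
    """
    prefill_work = pending_prefill + max(0, input_len - cache_hit)
    return prefill_work + kappa * ongoing_decode


if __name__ == "__main__":
    # Hypothetical numbers: two hosts with equal prefill work, one decode-heavy.
    light = lpwl_decode_score(0, 1000, 0, ongoing_decode=10, kappa=2.0)
    heavy = lpwl_decode_score(0, 1000, 0, ongoing_decode=500, kappa=2.0)
    print(light < heavy)  # the decode-heavy host now scores higher
```

With `kappa` derived from measurement rather than swept, the variant would keep the zero-tuned-constants property the caveat asks for.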
Repro
# 5-policy, 600s trace (≈18 min/arm, ~90 min total)
OUTROOT=.../outputs/policy5_600s \
bash microbench/connector_tax/cache_sweep/run_5policy_600s.sh
# unified report
.venv/bin/python scripts/bench_report.py --root .../outputs/policy5_600s \
leastwork unified_ab unified_def lmetric sticky