# LPWL vs 4 baselines — parameter-free routing for agentic workloads

Date: 2026-05-29. Hardware: dash1, 8×H20, Qwen3-Coder-30B-A3B, TP=1,
max_model_len=200000, fresh vLLM per arm (cold APC), `--policy` via
`scripts/b3_isolated_policy.sh`. Analyzer: `scripts/bench_report.py`.

## Motivation

unified+A+B carries too many knobs (`overload_factor`, `lmetric_decode_weight`,
the 0.5 `cache_ratio` gate). Goal: a policy derived from the agentic *pattern*
with no tuned constants, that does not overfit.

## LPWL (Least-Prefill-Work-Left)

`scripts/cache_aware_proxy.py:pick_instance_leastwork`, `--policy leastwork`:

```
score_i = pending_prefill_tokens_i + max(0, input_len − cache_hit_i)   → argmin
```

Tie-break: fewest `num_requests`, then round-robin. **Zero hyperparameters.**
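
The scoring rule and tie-break above can be sketched as follows. This is a minimal illustration, not the code in `pick_instance_leastwork`: the `Instance` fields, the module-level round-robin cursor, and the helper names are assumptions made for the example, and `cache_hit_tokens` is stored per instance only to keep the sketch short (in practice it depends on the request).

```python
from dataclasses import dataclass
from itertools import count

@dataclass
class Instance:
    name: str
    pending_prefill_tokens: int  # prefill tokens queued or in flight (assumed field)
    num_requests: int            # requests currently on the instance (assumed field)
    cache_hit_tokens: int        # prefix tokens of *this* request cached here (assumed)

_rr = count()  # hypothetical round-robin cursor for the final tie-break

def lpwl_score(inst: Instance, input_len: int) -> int:
    """pending_prefill + max(0, input_len - cache_hit), as in the formula above."""
    new_uncached = max(0, input_len - inst.cache_hit_tokens)
    return inst.pending_prefill_tokens + new_uncached

def pick_leastwork(instances: list[Instance], input_len: int) -> Instance:
    """Argmin of the score; ties go to fewest requests, then round-robin."""
    key = lambda inst: (lpwl_score(inst, input_len), inst.num_requests)
    best = min(key(inst) for inst in instances)
    tied = [inst for inst in instances if key(inst) == best]
    return tied[next(_rr) % len(tied)]

# Illustrative numbers only: a 10k-token request, three candidate instances.
fleet = [
    Instance("a", pending_prefill_tokens=0, num_requests=1, cache_hit_tokens=0),
    Instance("b", pending_prefill_tokens=5_000, num_requests=2, cache_hit_tokens=9_000),
    Instance("c", pending_prefill_tokens=12_000, num_requests=1, cache_hit_tokens=10_000),
]
print(pick_leastwork(fleet, input_len=10_000).name)  # b (scores: a=10000, b=6000, c=12000)
```

Here the idle but cold instance `a` loses to `b`, whose warm cache outweighs its modest backlog, while the fully warm `c` loses because its backlog exceeds the saving.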

Why this shape (straight from the workload):
- Decode is cheap (input:output token ratio, I/O ≈ 217×) ⇒ the only load worth
  modeling is outstanding *prefill* token-work. No decode weight. Dropping
  LMetric's `×num_requests` also makes an idle-but-decoding host score `input`
  (its true marginal cost) rather than 0, which fixes the empty-batch
  degeneracy for free.
- Cache-awareness *is* the affinity mechanism: a returning session's owner has
  `new_uncached = max(0, input_len − cache_hit) ≈ 0`, so the session sticks
  unless the owner's prefill backlog exceeds another host's backlog by more than
  the cache saving (up to `input` when the alternative is cold). The
  stick-vs-spill crossover is computed from real token-work: no
  `overload_factor`, no `cache_ratio` gate.
- Session skew degrades gracefully: a heavy session inflates its owner's
  `pending_prefill`, auto-diverting *other* sessions while the heavy one stays
  put (no cold re-prefill).
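
The stick-vs-spill behaviour in the bullets above follows directly from comparing two scores. The sketch below works through it with made-up token counts; `stick_margin` and its parameters are hypothetical names for illustration, and exact ties (margin 0) are left to the tie-break rule rather than handled here.

```python
def new_uncached(input_len: int, cache_hit: int) -> int:
    """Tokens that still need prefill on a host holding `cache_hit` prefix tokens."""
    return max(0, input_len - cache_hit)

def stick_margin(input_len, owner_backlog, owner_hit, other_backlog, other_hit=0):
    """Positive: the owner scores lower, so the request sticks. Negative: it spills."""
    owner = owner_backlog + new_uncached(input_len, owner_hit)
    other = other_backlog + new_uncached(input_len, other_hit)
    return other - owner

# Returning 30k-token turn, fully cached on its owner; the alternative is idle and cold.
print(stick_margin(30_000, owner_backlog=20_000, owner_hit=30_000, other_backlog=0))  # 10000
print(stick_margin(30_000, owner_backlog=40_000, owner_hit=30_000, other_backlog=0))  # -10000

# A *different* 5k-token session with no cache on the busy owner diverts immediately.
print(stick_margin(5_000, owner_backlog=40_000, owner_hit=0, other_backlog=0))  # -40000
```

With an idle, cold alternative the crossover sits at a backlog equal to the cached input; a busier alternative moves it further in the owner's favour.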

## Results — 600s trace (`w600_r0.0015_st30_first600s.jsonl`, 807 reqs)

This is the colder regime (theoretical APC ceiling ≈ 70% vs 80% for full w600).

| policy | knobs | TTFT mean | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| **LPWL** | **0** | **3398** | **7983** | **8116** | **19014** | 87024 | 26 | 0.648 | **1.55×** |
| unified+A+B | 3 | 3876 | 11562 | 8199 | 22569 | **74266** | **25** | 0.661 | 1.56× |
| unified default | 2 | 5066 | 16389 | 10481 | 28427 | 96361 | 34 | 0.689 | 2.28× |
| LMetric | 0 | 4809 | 14037 | 10051 | 26726 | 97442 | 32 | 0.507 | 2.11× |
| sticky | 0 | 5758 | 20356 | 10815 | 34734 | 82732 | 28 | **0.696** | 3.86× |

(Latencies in ms; APC = prefix-cache hit rate; req-bal = max:min per-worker
request count. This batch predates the GPU-capture harness change, so per-worker
GPU util reads N/A.)

### Findings

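1. **LPWL is overall best with zero knobs:** TTFT mean −12% / p90 −31%, E2E
   mean ~tie (−1%) / p90 −16% vs the tuned unified+A+B; best request balance;
   TPOT within 1 ms of best. The only loss is E2E p99 (+17%), initially
   attributed to heavy-class decode concentration; the κ ablation below
   attributes it to the structural HEAVY+>50k floor instead.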
2. **The baselines bracket the problem and explain why LPWL works:** sticky has
   the highest APC (0.696) but worst latency (hot-pin, 3.86× imbalance); LMetric
   has the worst APC (0.507) because `×num_requests` swallows the cache signal.
   LPWL drops exactly that factor, so locality re-emerges (APC 0.648, beside the
   explicit-affinity policies) while balance stays tight — the sweet spot, no
   gate, no tuning.
3. **Anti-overfit, demonstrated:** unified+A+B was tuned (of=1.3, lmw=0.01) on
   the *full* w600; on the colder 600s regime the parameter-free policy beats it
   by 31% TTFT p90. The tuning did not transfer; LPWL did.
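
The relative deltas quoted in finding 1 can be re-derived from the results table. The snippet below is only a convenience check; the dictionary keys are labels chosen for this example, and the values are copied from the LPWL and unified+A+B rows.

```python
def pct_delta(new: float, base: float) -> float:
    """Relative change of `new` versus `base`, in percent (negative = lower latency)."""
    return 100.0 * (new - base) / base

# Values in ms, copied from the 600s results table.
lpwl = {"ttft_mean": 3398, "ttft_p90": 7983, "e2e_mean": 8116, "e2e_p90": 19014, "e2e_p99": 87024}
unified_ab = {"ttft_mean": 3876, "ttft_p90": 11562, "e2e_mean": 8199, "e2e_p90": 22569, "e2e_p99": 74266}

for metric in lpwl:
    print(f"{metric}: {pct_delta(lpwl[metric], unified_ab[metric]):+.0f}%")
# ttft_mean: -12%
# ttft_p90: -31%
# e2e_mean: -1%
# e2e_p90: -16%
# e2e_p99: +17%
```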

### Per-class TTFT (ms, p90 / p99): LPWL dominates except the floor

| class | LPWL p90 | A+B p90 | LPWL p99 | A+B p99 |
|---|---:|---:|---:|---:|
| WARM<5k | 319 | 324 | 1032 | 2092 |
| MED5-20k | 1618 | 1952 | 3013 | 33189 |
| HEAVY20-50k | 4851 | 6198 | 14599 | 29044 |
| HEAVY+>50k | 28942 | 33777 | 52651 | 50778 |

LPWL's only weak class is the workload-inherent HEAVY+>50k floor (≈tied across
all policies). Elsewhere it avoids the mid-class tails the unified gate creates
when it pins a mid request behind a 50k-token turn on a barely-warm owner.

## Full-w600 cross-check (1214 reqs, `outputs/lpwl_vs_ab_live/`)

On the warmer full trace, LPWL vs unified+A+B is a wash: LPWL wins TTFT p90
(−14%) but loses TPOT (+38%) and per-worker balance. Combined claim across both
regimes: **LPWL ∈ [tied, clearly-better] vs a tuned baseline, at zero knobs.**

## Ablation: derived-κ decode term (`leastwork_kappa`) — NET-NEGATIVE

Tested the proposed knob-free fix for LPWL's E2E-p99: `--policy leastwork_kappa`,
`score = (pending_prefill + new_uncached) × (1 + κ·ongoing_decode_tokens)`, with
κ = 2.5e-6 *derived* from hardware (KV ~100 KB/tok ÷ HBM 4 TB/s ÷ TPOT 10 ms on
H20+Qwen3-30B-A3B), not trace-tuned. Same 600s trace, fresh vLLM, cold APC.
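
As a sanity check on the derivation, the arithmetic behind κ and the multiplicative score can be written out as below. The constants are the rounded figures quoted above; the function name and the example token counts are hypothetical and are not taken from the proxy code or the trace.

```python
import math

KV_BYTES_PER_TOKEN = 100e3  # ~100 KB of KV cache per token (figure quoted above)
HBM_BYTES_PER_SEC = 4e12    # 4 TB/s HBM bandwidth (figure quoted above)
TPOT_SEC = 10e-3            # 10 ms time per output token (figure quoted above)

# Time to stream one token's KV (25 ns) expressed as a fraction of one decode step.
kappa = KV_BYTES_PER_TOKEN / HBM_BYTES_PER_SEC / TPOT_SEC
print(f"{kappa:.2e}")  # 2.50e-06

def kappa_score(pending_prefill, new_uncached, ongoing_decode_tokens, k=kappa):
    """(pending_prefill + new_uncached) * (1 + k * ongoing_decode_tokens)."""
    return (pending_prefill + new_uncached) * (1 + k * ongoing_decode_tokens)

# Hypothetical host: 10k tokens of prefill work, 100k tokens of ongoing decode context.
score = kappa_score(10_000, 0, 100_000)
print(math.isclose(score, 12_500))  # True: a 25% penalty on the prefill-only score
```

Even a large decode context only scales the prefill score by a modest factor, which is small next to the tens of thousands of tokens a cold re-prefill adds.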

| metric | leastwork | leastwork_kappa | Δ |
|---|---:|---:|---:|
| TTFT p90 | 7983 | 9390 | +18% (worse) |
| TTFT p99 | 44891 | 42370 | −6% |
| E2E p90 | 19014 | 21674 | +14% (worse) |
| E2E p99 | 87024 | 90155 | +4% (did NOT fix) |
| APC | 0.648 | 0.647 | tie |
| req-balance | 1.55× | 1.97× | worse |

**Verdict: decode-awareness is the wrong lever for agentic workloads.** The κ term is
correct physics aimed at a negligible effect (decode is cheap, output p50 ≈ 80 tokens),
so it mostly bounces heavy requests off their cache-owner → cold re-prefill
elsewhere → new hotspots (balance degrades 1.55×→1.97×). It does NOT fix E2E-p99
because that tail is the **structural HEAVY+>50k floor** (per-class p99 ≈51–52k
for *all* policies), not decode interference — i.e. not routing-fixable. This is
a negative result that *justifies* LPWL's omission of any decode term. The policy
is kept in-tree as a documented ablation; do not revive without a decode-heavy
regime. (First run on the GPU-capturing harness: per-worker GPU util mean 42–83%,
1.95× spread — it even shows the κ-induced imbalance.)

## Caveats / open work

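- n=1 per arm. The 600s −31% TTFT p90 is corroborated by mean/p50/per-class, but
  needs repeats to bound run-to-run noise (no 3× repeats yet, by request: a quick
  single set came first).
- E2E-p99 deep tail is the one consistent LPWL weak spot. The knob-free fix
  proposed for it, a derived-κ decode term, was tested in multiplicative form as
  `leastwork_kappa` in the ablation above and was net-negative; that ablation
  attributes the tail to the structural HEAVY+>50k floor rather than decode
  interference. The additive form, `+ κ·ongoing_decode` with
  `κ = measured(TPOT/token) / prefill_throughput` (a derived hardware ratio, not
  a tuned scalar), remains unimplemented.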

## Repro

```bash
# 5-policy, 600s trace (≈18 min/arm, ~90 min total)
OUTROOT=.../outputs/policy5_600s \
    bash microbench/connector_tax/cache_sweep/run_5policy_600s.sh
# unified report
.venv/bin/python scripts/bench_report.py --root .../outputs/policy5_600s \
    leastwork unified_ab unified_def lmetric sticky
```