agentic-kvc/analysis/lpwl_5policy_600s.md
Gahow Wang a0db3cbe77 Add leastwork_kappa decode-aware ablation (net-negative, documented)
--policy leastwork_kappa + --kappa (default 2.5e-6, derived from KV ~100KB/tok
/ HBM 4TB/s / TPOT 10ms on H20+Qwen3-30B-A3B): score = prefill_work * (1 +
kappa * ongoing_decode_tokens), modelling decode as a fractional throughput tax
on a new prefill.

Result on the 600s trace: NET-NEGATIVE vs plain leastwork — TTFT p90 +18%,
E2E p90 +14%, balance 1.55x->1.97x, and it does NOT fix the E2E-p99 it targeted.
Decode is too cheap in agentic (output p50~80) for the term to help; it just
bounces heavy reqs off their cache-owner into cold re-prefill. The E2E-p99 tail
is the structural HEAVY+>50k floor (per-class p99 ~51-52k for ALL policies), not
decode interference. Kept in-tree as a documented ablation justifying LPWL's
omission of any decode term; do not revive without a decode-heavy regime.
See analysis/lpwl_5policy_600s.md.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 17:07:23 +08:00


LPWL vs 4 baselines — parameter-free routing for agentic workloads

Date: 2026-05-29. Hardware: dash1, 8×H20, Qwen3-Coder-30B-A3B, TP=1, max_model_len=200000, fresh vLLM per arm (cold APC), --policy via scripts/b3_isolated_policy.sh. Analyzer: scripts/bench_report.py.

Motivation

unified+A+B carries too many knobs (overload_factor, lmetric_decode_weight, the 0.5 cache_ratio gate). Goal: a policy derived from the agentic pattern with no tuned constants, that does not overfit.

LPWL (Least-Prefill-Work-Left)

scripts/cache_aware_proxy.py:pick_instance_leastwork, --policy leastwork:

score_i = pending_prefill_tokens_i + max(0, input_len - cache_hit_i)   → argmin

Tie-break: fewest num_requests, then round-robin. Zero hyperparameters.
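
The scoring rule and tie-break above can be sketched as follows. This is a minimal illustration, not the code in scripts/cache_aware_proxy.py: the `Worker` dataclass, the `cache_hit` mapping, and the module-level round-robin counter are all assumptions introduced here.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Worker:  # hypothetical container, not the proxy's real state object
    name: str
    pending_prefill_tokens: int  # prefill tokens queued but not yet computed
    num_requests: int            # in-flight requests, used only to break ties


_rr = count()  # round-robin cursor for the last-resort tie-break


def pick_leastwork(workers, input_len, cache_hit):
    """Pick the worker with the least prefill work left after this request.

    cache_hit maps worker name -> prefix tokens already cached on that worker
    (assumed shape; a missing entry means no cached prefix).
    """
    def key(w):
        new_uncached = max(0, input_len - cache_hit.get(w.name, 0))
        # Primary: outstanding prefill token-work. Secondary: fewest requests.
        return (w.pending_prefill_tokens + new_uncached, w.num_requests)

    best = min(key(w) for w in workers)
    tied = [w for w in workers if key(w) == best]
    # Final tie-break: rotate among the workers that are still tied.
    return tied[next(_rr) % len(tied)]
```

Under these assumptions, a worker with nothing queued and no cached prefix scores `input_len` rather than 0, which is the marginal-cost behaviour the section describes.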

Why this shape (straight from the workload):

  • Decode is cheap (I/O ~217×) ⇒ the only load worth modeling is outstanding prefill token-work. No decode weight; dropping LMetric's ×num_requests also makes an idle-but-decoding host score input (its true marginal cost), not 0 — fixing the empty-batch degeneracy for free.
  • Cache-awareness is the affinity mechanism: a returning session's owner has new_uncached ≈ 0, so it sticks unless its prefill backlog exceeds the cache saving (input). The stick-vs-spill crossover is computed from real token-work — no overload_factor, no cache_ratio gate.
  • Session skew degrades gracefully: a heavy session inflates its owner's pending_prefill, auto-diverting other sessions while the heavy one stays put (no cold re-prefill).
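
The stick-vs-spill crossover in the bullets above can be made concrete with a small sketch. The function names and token counts are illustrative assumptions, and the `<=` comparison stands in for the real tie-break.

```python
def lpwl_score(pending_prefill, input_len, cache_hit):
    """Prefill token-work a worker would have left after accepting the request."""
    return pending_prefill + max(0, input_len - cache_hit)


def sticks_to_owner(owner_backlog, owner_hit, other_backlog, input_len):
    """True when the cache owner scores no worse than a cold alternative."""
    owner = lpwl_score(owner_backlog, input_len, owner_hit)
    other = lpwl_score(other_backlog, input_len, 0)  # cold worker: no cached prefix
    return owner <= other


# Illustrative numbers: a 40k-token turn with 39k tokens cached on the owner,
# compared against an idle cold worker.
stay = sticks_to_owner(owner_backlog=30_000, owner_hit=39_000,
                       other_backlog=0, input_len=40_000)
spill = sticks_to_owner(owner_backlog=45_000, owner_hit=39_000,
                        other_backlog=0, input_len=40_000)
```

Here `stay` is true (31,000 vs 40,000) and `spill` is false (46,000 vs 40,000): the session leaves its owner only once the owner's extra backlog exceeds the 39,000 tokens of cache saving.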

Results — 600s trace (w600_r0.0015_st30_first600s.jsonl, 807 reqs)

This is the colder regime (theoretical APC ceiling ≈ 70% vs 80% for full w600).

| policy | knobs | TTFT mean | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---|---|---|---|---|---|---|---|
| LPWL | 0 | 3398 | 7983 | 8116 | 19014 | 87024 | 26 | 0.648 | 1.55× |
| unified+A+B | 3 | 3876 | 11562 | 8199 | 22569 | 74266 | 25 | 0.661 | 1.56× |
| unified default | 2 | 5066 | 16389 | 10481 | 28427 | 96361 | 34 | 0.689 | 2.28× |
| LMetric | 0 | 4809 | 14037 | 10051 | 26726 | 97442 | 32 | 0.507 | 2.11× |
| sticky | 0 | 5758 | 20356 | 10815 | 34734 | 82732 | 28 | 0.696 | 3.86× |

(latencies ms; req-bal = max:min per-worker request count; this batch predates the GPU-capture harness change so per-worker GPU util reads N/A.)
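
As a sketch of the req-bal definition in the note above (max:min per-worker request count): the helper name, the made-up counts, and the handling of an idle worker are assumptions, not taken from scripts/bench_report.py.

```python
def req_balance(per_worker_requests):
    """Ratio of the busiest worker's request count to the least busy one's."""
    lo, hi = min(per_worker_requests), max(per_worker_requests)
    # Assumption: a worker that received nothing makes the ratio unbounded.
    return float("inf") if lo == 0 else hi / lo


# Made-up counts for four workers, chosen only to show the arithmetic.
example = req_balance([31, 20, 25, 24])  # 31 / 20
```

A perfectly even spread gives 1.0, so larger values mean more skew.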

Findings

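  1. LPWL is overall best with zero knobs: TTFT mean −12% / p90 −31%, E2E mean ~tie / p90 −16% vs the tuned unified+A+B; best request balance; TPOT tied-best. Only loss is E2E p99 (+17%), initially attributed to heavy-class decode concentration (the κ ablation below instead attributes this tail to the structural HEAVY+>50k floor).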
  2. The baselines bracket the problem and explain why LPWL works: sticky has the highest APC (0.696) but worst latency (hot-pin, 3.86× imbalance); LMetric has the worst APC (0.507) because ×num_requests swallows the cache signal. LPWL drops exactly that factor, so locality re-emerges (APC 0.648, beside the explicit-affinity policies) while balance stays tight — the sweet spot, no gate, no tuning.
  3. Anti-overfit, demonstrated: unified+A+B was tuned (of=1.3, lmw=0.01) on the full w600; on the colder 600s regime the parameter-free policy beats it by 31% TTFT p90. The tuning did not transfer; LPWL did.
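
The headline percentages in finding 1 can be recomputed from the results table. The sketch below uses only the LPWL and unified+A+B rows from that table; the dictionary layout and helper name are illustrative.

```python
# Latencies in ms, copied from the 600s results table.
lpwl = {"ttft_mean": 3398, "ttft_p90": 7983, "e2e_mean": 8116,
        "e2e_p90": 19014, "e2e_p99": 87024}
unified_ab = {"ttft_mean": 3876, "ttft_p90": 11562, "e2e_mean": 8199,
              "e2e_p90": 22569, "e2e_p99": 74266}


def rel_delta_pct(new, base):
    """Signed percentage change of new relative to base."""
    return 100.0 * (new - base) / base


# Negative means LPWL is faster than unified+A+B on that metric.
deltas = {k: round(rel_delta_pct(lpwl[k], unified_ab[k])) for k in lpwl}
```

Rounded to whole percent this gives −12 and −31 for TTFT mean and p90, −1 for E2E mean, −16 for E2E p90, and +17 for E2E p99.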

Per-class TTFT (ms, mean / p50 / p90 / p99) — LPWL dominates except the floor

| class | LPWL p90 | A+B p90 | LPWL p99 | A+B p99 |
|---|---|---|---|---|
| WARM<5k | 319 | 324 | 1032 | 2092 |
| MED5-20k | 1618 | 1952 | 3013 | 33189 |
| HEAVY20-50k | 4851 | 6198 | 14599 | 29044 |
| HEAVY+>50k | 28942 | 33777 | 52651 | 50778 |

LPWL's only weak class is the workload-inherent HEAVY+>50k floor (≈tied across all policies). Elsewhere it avoids the mid-class tails the unified gate creates when it pins a mid request behind a 50k-token turn on a barely-warm owner.

Full-w600 cross-check (1214 reqs, outputs/lpwl_vs_ab_live/)

On the warmer full trace, LPWL vs unified+A+B is a wash: LPWL wins TTFT p90 (−14%) but loses TPOT (+38%) and per-worker balance. Combined claim across both regimes: LPWL ∈ [tied, clearly-better] vs a tuned baseline, at zero knobs.

Ablation: derived-κ decode term (leastwork_kappa) — NET-NEGATIVE

Tested the proposed knob-free fix for LPWL's E2E-p99: --policy leastwork_kappa, score = (pending_prefill + new_uncached) × (1 + κ·ongoing_decode_tokens), with κ = 2.5e-6 derived from hardware (KV ~100 KB/tok ÷ HBM 4 TB/s ÷ TPOT 10 ms on H20+Qwen3-30B-A3B), not trace-tuned. Same 600s trace, fresh vLLM, cold APC.

| metric | leastwork | leastwork_kappa | Δ |
|---|---|---|---|
| TTFT p90 | 7983 | 9390 | +18% (worse) |
| TTFT p99 | 44891 | 42370 | −6% |
| E2E p90 | 19014 | 21674 | +14% (worse) |
| E2E p99 | 87024 | 90155 | +4% (did NOT fix) |
| APC | 0.648 | 0.647 | tie |
| req-balance | 1.55× | 1.97× | worse |

Verdict: decode-awareness is the wrong lever for agentic. The κ term is correct physics aimed at a negligible effect (decode is cheap, output p50≈80), so it mostly bounces heavy requests off their cache-owner → cold re-prefill elsewhere → new hotspots (balance degrades 1.55×→1.97×). It does NOT fix E2E-p99 because that tail is the structural HEAVY+>50k floor (per-class p99 ≈51-52k for all policies), not decode interference, i.e. not routing-fixable. This is a negative result that justifies LPWL's omission of any decode term. The policy is kept in-tree as a documented ablation; do not revive without a decode-heavy regime. (First run on the GPU-capturing harness: per-worker GPU util mean 42-83%, 1.95× spread, which also shows the κ-induced imbalance.)
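
The κ derivation and the ablated score can be checked with a short sketch. The constants are the round figures quoted in this section; the function name and the example token counts are illustrative assumptions, and `ongoing_decode_tokens` is treated as an opaque per-worker count.

```python
import math

KV_BYTES_PER_TOKEN = 100e3   # ~100 KB of KV cache per token
HBM_BYTES_PER_S = 4e12       # ~4 TB/s HBM bandwidth
TPOT_S = 10e-3               # ~10 ms per output token

# Fraction of one decode step spent reading a single token's KV from HBM.
KAPPA = KV_BYTES_PER_TOKEN / HBM_BYTES_PER_S / TPOT_S
assert math.isclose(KAPPA, 2.5e-6)


def kappa_score(pending_prefill, new_uncached, ongoing_decode_tokens, kappa=KAPPA):
    """Prefill work scaled by a decode tax, as in the leastwork_kappa formula."""
    return (pending_prefill + new_uncached) * (1 + kappa * ongoing_decode_tokens)


# Illustrative near-tie: a warm owner with a 20k backlog and 200k decode tokens
# versus an idle cold worker facing a 30k-token input.
owner = kappa_score(20_000, 1_000, 200_000)   # 21,000 x 1.5
cold = kappa_score(0, 30_000, 0)              # 30,000 x 1.0
```

With these made-up numbers the owner would win under plain leastwork (21,000 vs 30,000) but loses once the 1.5× multiplier is applied (31,500 vs 30,000), which is the bounce-to-cold-re-prefill behaviour the verdict describes.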

Caveats / open work

  • n=1 per arm. The 600s −31% TTFT p90 is corroborated by mean/p50/per-class, but the runs need repeating to bound run-to-run noise (no 3× repeats yet, by request: a quick single set came first).
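  • E2E-p99 deep tail is the one consistent LPWL weak spot. The proposed knob-free fix was to add + κ·ongoing_decode with κ = measured(TPOT/token) / prefill_throughput (a derived hardware ratio, not a tuned scalar). A derived-κ variant has since been tested as leastwork_kappa (see the ablation above) and was net-negative; the tail is the structural HEAVY+>50k floor rather than heavy-session decode concentration, so it remains open.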

Repro

# 5-policy, 600s trace (≈18 min/arm, ~90 min total)
OUTROOT=.../outputs/policy5_600s \
    bash microbench/connector_tax/cache_sweep/run_5policy_600s.sh
# unified report
.venv/bin/python scripts/bench_report.py --root .../outputs/policy5_600s \
    leastwork unified_ab unified_def lmetric sticky