agentic-kvc/v2/exp_d_policy_dispatch
Gahow Wang 0b180c191e v2 exp(d): expand figure to 6 panels (TTFT/E2E mean+p90, TPS, per-worker GPU util)
Per request: TTFT mean+p90, E2E mean+p90, decode TPS (output goodput; total/
prefill TPS omitted as cache-miss-inflated), and per-worker GPU-util boxplots
(8 workers/arm, tracets vs thinktime) showing utilization level + balance.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-30 21:10:27 +08:00

exp (d) — 5-policy routing under tracets vs thinktime

exp (c) showed the dispatch mode changes measured performance for a single round-robin policy, and predicted: "a cache-aware policy (LPWL) would lower the latencies and likely widen the thinktime advantage." exp (d) tests that with the full routing comparison — and finds something stronger: the dispatch mode flips which policy wins.

Question. Does the parameter-free LPWL still beat the tuned unified+A+B baseline once we benchmark with the faithful thinktime load instead of the tracets burst artifact?

Setup

5 routing policies, each its own fresh vLLM (cold APC) on dash0 8×H20, Qwen3-Coder-30B-A3B, via scripts/b3_isolated_policy.sh. Both dispatch modes run on the same trace traces/w600_r0.0015_st30_first600s_ttp.jsonl (807 reqs, 274 sessions) — the only variable is REPLAY_DISPATCH_MODE (tracets ignores the time_to_parent_chat field, thinktime consumes it). Analyzer: scripts/bench_report.py (summaries in results/).

  • leastwork — LPWL, parameter-free (pending_prefill + max(0, input − cache_hit))
  • unified_ab — unified hybrid, tuned A+B (of=1.3, lmw=0.01)
  • unified_def — unified hybrid, defaults (of=2.0, lmw=0.0)
  • lmetric — P_tokens × BS, no affinity
  • sticky — hard session affinity
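
The LPWL score in the list above can be sketched as follows. This is a minimal illustration of the stated formula only, not the router's actual code: the `WorkerState` class, its field names, and the tie-breaking rule are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class WorkerState:
    """Hypothetical per-worker view, evaluated for one incoming request."""
    name: str
    pending_prefill: int  # prompt tokens already queued on this worker
    cache_hit: int        # tokens of the incoming prompt found in this worker's prefix cache


def lpwl_score(worker: WorkerState, input_tokens: int) -> int:
    # Formula from the policy list: pending_prefill + max(0, input - cache_hit).
    return worker.pending_prefill + max(0, input_tokens - worker.cache_hit)


def pick_worker(workers: list[WorkerState], input_tokens: int) -> WorkerState:
    # Route to the lowest score. min() keeps the first worker on ties (an assumption).
    return min(workers, key=lambda w: lpwl_score(w, input_tokens))


workers = [
    WorkerState("w0", pending_prefill=2_000, cache_hit=9_000),  # busy, but warm cache
    WorkerState("w1", pending_prefill=0, cache_hit=0),          # idle, but cold cache
]
chosen = pick_worker(workers, input_tokens=10_000)
print(chosen.name, lpwl_score(chosen, 10_000))
```

In this toy case the warm worker wins despite its queue, because the uncached part of the prompt dominates the score on the idle worker. There are no weights to tune, which is what "parameter-free" refers to.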

Result (ms; figs/exp_d_policy_dispatch.png)

The figure has 6 panels — TTFT mean/p90, E2E mean/p90, decode TPS (output goodput), and the per-worker GPU-util box (8 workers/arm). Decode TPS is the honest throughput metric (total/prefill TPS is inflated by cache-miss recompute, e.g. LMetric); thinktime ≥ tracets on it everywhere (the system drains faster with real think-time). The GPU-util box shows LPWL also keeps the tightest worker balance.

| policy | mode | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---|---|---|---|---|---|---|
| LPWL | tracets | 11099 | 9827 | 25366 | 93929 | 33 | 0.650 | 1.49× |
| LPWL | thinktime | 6713 | 6788 | 17635 | 69946 | 18 | 0.676 | 1.94× |
| unified+A+B | tracets | 10783 | 8531 | 22063 | 75419 | 21 | 0.667 | 1.54× |
| unified+A+B | thinktime | 9736 | 7131 | 18690 | 63788 | 19 | 0.676 | 2.16× |
| unified default | tracets | 12997 | 8366 | 22819 | 82257 | 20 | 0.693 | 1.56× |
| unified default | thinktime | 11268 | 7975 | 24096 | 72334 | 22 | 0.693 | 2.91× |
| LMetric | tracets | 16492 | 10775 | 27791 | 99231 | 39 | 0.495 | 2.19× |
| LMetric | thinktime | 15607 | 9902 | 27819 | 73672 | 30 | 0.483 | 2.10× |
| sticky | tracets | 15236 | 10139 | 27974 | 82362 | 31 | 0.693 | 2.06× |
| sticky | thinktime | 14838 | 8663 | 24966 | 70933 | 24 | 0.694 | 2.48× |
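
Why decode TPS rather than total TPS: a rough sketch, assuming the APC column is a prefix-cache hit ratio and that every uncached prompt token is recomputed in prefill. The token counts and wall time are invented for illustration; only the two hit ratios (0.676 and 0.483) come from the table.

```python
def throughput(output_tokens: int, prompt_tokens: int, hit_ratio: float, wall_s: float) -> dict:
    # Prompt tokens that miss the cache must be prefilled again (assumption).
    recomputed_prefill = prompt_tokens * (1.0 - hit_ratio)
    return {
        "decode_tps": output_tokens / wall_s,                        # output goodput
        "total_tps": (recomputed_prefill + output_tokens) / wall_s,  # counts recompute too
    }


# Same hypothetical workload, two hit ratios taken from the APC column.
high_hit = throughput(output_tokens=100_000, prompt_tokens=1_000_000, hit_ratio=0.676, wall_s=600.0)
low_hit = throughput(output_tokens=100_000, prompt_tokens=1_000_000, hit_ratio=0.483, wall_s=600.0)

print(high_hit)
print(low_hit)
```

Under these assumptions the low-hit arm reports the higher total TPS while delivering exactly the same output tokens per second, so total TPS rewards wasted prefill and decode TPS does not.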

Finding 1 — thinktime helps every policy, but helps LPWL the most

Per-policy tracets → thinktime change (negative = thinktime better):

| policy | ΔTTFT p90 | ΔE2E mean | ΔTPOT p90 |
|---|---|---|---|
| LPWL | −40% | −31% | −45% |
| unified+A+B | −10% | −16% | −10% |
| unified default | −13% | −5% | +10% |
| LMetric | −5% | −8% | −23% |
| sticky | −3% | −15% | −23% |
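
The deltas above follow directly from the results table. A short sketch of the arithmetic, with the (tracets, thinktime) values copied from that table; the dictionary layout and function name are just for this example:

```python
# (tracets, thinktime) pairs in ms, copied from the results table.
RESULTS = {
    "LPWL":            {"TTFT p90": (11099, 6713),  "E2E mean": (9827, 6788),  "TPOT p90": (33, 18)},
    "unified+A+B":     {"TTFT p90": (10783, 9736),  "E2E mean": (8531, 7131),  "TPOT p90": (21, 19)},
    "unified default": {"TTFT p90": (12997, 11268), "E2E mean": (8366, 7975),  "TPOT p90": (20, 22)},
    "LMetric":         {"TTFT p90": (16492, 15607), "E2E mean": (10775, 9902), "TPOT p90": (39, 30)},
    "sticky":          {"TTFT p90": (15236, 14838), "E2E mean": (10139, 8663), "TPOT p90": (31, 24)},
}


def pct_change(tracets: float, thinktime: float) -> float:
    # Relative change going from tracets to thinktime; negative means thinktime is better.
    return 100.0 * (thinktime - tracets) / tracets


for policy, metrics in RESULTS.items():
    cells = [f"Δ{name} {pct_change(*pair):+.0f}%" for name, pair in metrics.items()]
    print(f"{policy:16s}", "  ".join(cells))
```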

tracets collapses the inter-turn think-time to ~0 (exp c), manufacturing bursts → peak concurrency → KV pressure → preemption. Those bursts punish exactly the policy that spreads prefill thinly across hosts (LPWL keeps the tightest request balance, 1.49×), because under a burst the spread sacrifices locality without the slack to amortize it. Remove the artifact and LPWL's prefill-aware placement pays.
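
The burst mechanism can be illustrated with a toy model. This is not the replay harness: it assumes each session issues its turns one after another, models tracets as a zero inter-turn gap (per the text above) and thinktime as a fixed gap, and uses made-up session starts, service times, and gap lengths.

```python
def turn_intervals(session_start: float, service_s: float, n_turns: int, gap_s: float):
    """In-flight intervals [start, end) for one session's sequential turns."""
    intervals, t = [], session_start
    for _ in range(n_turns):
        intervals.append((t, t + service_s))
        t += service_s + gap_s  # next turn starts gap_s after this one finishes
    return intervals


def peak_concurrency(intervals) -> int:
    """Maximum number of overlapping intervals, via an event sweep."""
    events = []
    for start, end in intervals:
        events.append((start, 1))
        events.append((end, -1))
    events.sort()  # at equal times, -1 sorts before +1: intervals are half-open
    peak = current = 0
    for _, delta in events:
        current += delta
        peak = max(peak, current)
    return peak


def replay(gap_s: float) -> int:
    all_intervals = []
    for session_start in (0.0, 2.0, 4.0):  # three hypothetical sessions
        all_intervals += turn_intervals(session_start, service_s=2.0, n_turns=3, gap_s=gap_s)
    return peak_concurrency(all_intervals)


print("gap 0s (tracets-like): peak =", replay(0.0))
print("gap 5s (thinktime-like): peak =", replay(5.0))
```

With the gap removed, the three sessions overlap and peak in-flight requests rises from 1 to 3 in this toy setup. Higher peak concurrency is the step that leads to KV pressure and preemption in the chain described above.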

Finding 2 — the dispatch mode flips the cross-policy ranking

  • TTFT p90: tracets → unified_ab (10.8s) ≈ LPWL (11.1s): LPWL only ties, and is even slightly behind. thinktime → LPWL (6.7s) < unified_ab (9.7s): LPWL is first, −31% vs the tuned baseline.
  • E2E mean: tracets → unified_def (8.4s) < unified_ab (8.5s) < LPWL (9.8s): LPWL is 3rd, behind both unified variants. thinktime → LPWL (6.8s) < unified_ab (7.1s) < unified_def (8.0s): LPWL is first.

So under artificial tracets bursts the parameter-free policy looks tied-or-worse; under the faithful thinktime load it is the clear winner on TTFT and E2E, at zero knobs and best balance.
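
The ranking flip can be checked by sorting the table values per dispatch mode. The numbers below are copied from the results table (ms); the policy keys follow the names in the Setup list, with LPWL standing in for leastwork.

```python
TTFT_P90_MS = {
    "tracets":   {"LPWL": 11099, "unified_ab": 10783, "unified_def": 12997, "lmetric": 16492, "sticky": 15236},
    "thinktime": {"LPWL": 6713,  "unified_ab": 9736,  "unified_def": 11268, "lmetric": 15607, "sticky": 14838},
}
E2E_MEAN_MS = {
    "tracets":   {"LPWL": 9827, "unified_ab": 8531, "unified_def": 8366, "lmetric": 10775, "sticky": 10139},
    "thinktime": {"LPWL": 6788, "unified_ab": 7131, "unified_def": 7975, "lmetric": 9902,  "sticky": 8663},
}


def ranking(values: dict) -> list:
    # Lower latency is better, so sort ascending by value.
    return sorted(values, key=values.get)


for metric_name, table in (("TTFT p90", TTFT_P90_MS), ("E2E mean", E2E_MEAN_MS)):
    for mode, values in table.items():
        print(f"{metric_name:9s} {mode:10s}", " < ".join(ranking(values)))
```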

Conclusion

Benchmark agentic routing with thinktime. Under it, the parameter-free LPWL is the best of the five policies (vs the tuned unified+A+B: TTFT p90 −31%, E2E mean −5% / p90 −6%, best TPOT, tightest balance), and the tracets burst artifact is precisely what erases that advantage (it even drops LPWL to 3rd on E2E mean). This both confirms exp (c)'s prediction and is independent evidence for the GPU-hit-first routing story: faithful load rewards keeping the active working set GPU-resident.

Caveats

  • n = 1 per arm. The tracets ranking here does not reproduce the earlier dash1 analysis/lpwl_5policy_600s.md (which saw LPWL win TTFT p90 by 31% in tracets); on dash0 tracets it is a tie. That is, tracets rankings are run/harness-sensitive; the robust signal is the thinktime advantage, which appears in both environments. Repeat ×3 to bound noise.
  • LPWL's one persistent weak spot is E2E p99 (thinktime 69.9s vs unified_ab 63.8s) — the structural HEAVY+ >50k decode tail, identical across policies, not routing-fixable (see lpwl_5policy_600s.md κ-ablation).
  • thinktime advantage is a capacity-slack effect; under saturation the modes converge (exp c, N=6).

Repro

# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
    051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl + the ttp field;
#    first600s = timestamp<600 filter)
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
    --output traces/w600_r0.0015_st30_ttp.jsonl \
    --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
    bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
    --json v2/exp_d_policy_dispatch/results/thinktime.json \
    leastwork unified_ab unified_def lmetric sticky
python v2/exp_d_policy_dispatch/plot.py