agentic-kvc/v2/exp_d_policy_dispatch/README.md

# exp (d) — 5-policy routing under `tracets` vs `thinktime`

exp (c) showed that the **dispatch mode** changes measured performance for a single
round-robin policy, and predicted: *"a cache-aware policy (LPWL) would lower the
latencies and likely **widen** the thinktime advantage."* exp (d) tests that
prediction with the full routing comparison and finds something stronger: **the
dispatch mode flips which policy wins.**

**Question.** Does the parameter-free LPWL still beat the tuned `unified+A+B`
baseline once we benchmark with the *faithful* `thinktime` load instead of the
`tracets` burst artifact?

## Setup

Five routing policies, each run against its own **fresh vLLM instance (cold APC)**
on dash0 (8×H20) with Qwen3-Coder-30B-A3B, via `scripts/b3_isolated_policy.sh`.
**Both dispatch modes run on the *same* trace**,
`traces/w600_r0.0015_st30_first600s_ttp.jsonl` (807 requests, 274 sessions), so
the only variable is `REPLAY_DISPATCH_MODE`: `tracets` ignores the
`time_to_parent_chat` field, `thinktime` consumes it.
Analyzer: `scripts/bench_report.py` (summaries in `results/`).

- `leastwork` — **LPWL**, parameter-free (`pending_prefill + max(0, input−cache_hit)`)
- `unified_ab` — unified hybrid, tuned A+B′ (`of=1.3, lmw=0.01`)
- `unified_def` — unified hybrid, defaults (`of=2.0, lmw=0.0`)
- `lmetric` — P_tokens × BS, no affinity
- `sticky` — hard session affinity
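
The LPWL score in the list above can be read as "work this worker would still owe if the request landed here". The following is a minimal sketch of that formula only, not the router's actual code: the `Worker` class, its field names, and the token counts are hypothetical, and storing the cache hit on the worker is a simplification (in practice it depends on the request).

```python
from dataclasses import dataclass


@dataclass
class Worker:
    # All fields are hypothetical names chosen for this illustration.
    name: str
    pending_prefill: int  # prefill tokens already queued on this worker
    cache_hit: int        # prompt tokens of *this* request the worker has cached


def lpwl_score(worker: Worker, input_tokens: int) -> int:
    """pending_prefill + max(0, input - cache_hit), as stated in the policy list."""
    return worker.pending_prefill + max(0, input_tokens - worker.cache_hit)


def pick_worker(workers, input_tokens):
    # min() returns the first minimum, so ties fall back to list order.
    return min(workers, key=lambda w: lpwl_score(w, input_tokens))


workers = [
    Worker("w0", pending_prefill=12000, cache_hit=9000),  # score 13000
    Worker("w1", pending_prefill=2000, cache_hit=0),      # score 12000
    Worker("w2", pending_prefill=6000, cache_hit=9500),   # score 6500
]
choice = pick_worker(workers, input_tokens=10000)
print(choice.name)
```

In this toy case the idle worker `w1` loses to `w2`, because `w2`'s cache hit removes more work than its queue adds. The score has no weights to tune, which is what "parameter-free" refers to.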

## Result (ms; `figs/exp_d_policy_dispatch.png`)

The figure has 6 panels: TTFT mean/p90, E2E mean/p90, decode TPS (output
goodput), and a per-worker GPU-util box plot (8 workers/arm). Decode TPS is the
honest throughput metric, because total/prefill TPS is inflated by cache-miss
recompute (e.g. for LMetric). On decode TPS, thinktime ≥ tracets for every
policy: the system drains faster with real think-time. The GPU-util box also
shows LPWL keeping the tightest worker balance.

| policy | mode | TTFT p90 | E2E mean | E2E p90 | E2E p99 | TPOT p90 | APC | req-bal |
|---|---|---:|---:|---:|---:|---:|---:|---:|
| **LPWL** | tracets | 11099 | 9827 | 25366 | 93929 | 33 | 0.650 | **1.49×** |
| **LPWL** | **thinktime** | **6713** | **6788** | **17635** | 69946 | **18** | 0.676 | 1.94× |
| unified+A+B | tracets | 10783 | 8531 | 22063 | 75419 | 21 | 0.667 | 1.54× |
| unified+A+B | thinktime | 9736 | 7131 | 18690 | **63788** | 19 | 0.676 | 2.16× |
| unified default | tracets | 12997 | 8366 | 22819 | 82257 | 20 | 0.693 | 1.56× |
| unified default | thinktime | 11268 | 7975 | 24096 | 72334 | 22 | 0.693 | 2.91× |
| LMetric | tracets | 16492 | 10775 | 27791 | 99231 | 39 | 0.495 | 2.19× |
| LMetric | thinktime | 15607 | 9902 | 27819 | 73672 | 30 | 0.483 | 2.10× |
| sticky | tracets | 15236 | 10139 | 27974 | 82362 | 31 | 0.693 | 2.06× |
| sticky | thinktime | 14838 | 8663 | 24966 | 70933 | 24 | 0.694 | 2.48× |

### Finding 1 — `thinktime` helps every policy, but helps **LPWL the most**

Per-policy `tracets`→`thinktime` change (negative = thinktime better). The only
regression among these three metrics is unified default's TPOT p90 (+10%):

| policy | ΔTTFT p90 | ΔE2E mean | ΔTPOT p90 |
|---|---:|---:|---:|
| **LPWL** | **−40%** | **−31%** | **−45%** |
| unified+A+B | −10% | −16% | −10% |
| unified default | −13% | −5% | +10% |
| LMetric | −5% | −8% | −23% |
| sticky | −3% | −15% | −23% |
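
The percentages above are plain relative changes against the `tracets` value. As a check, the short sketch below recomputes the ΔTTFT p90 column from the Result table; the dictionary layout and the `pct_change` helper are illustrative and not part of `scripts/bench_report.py`.

```python
# (tracets, thinktime) TTFT p90 in ms, copied from the Result table.
ttft_p90 = {
    "LPWL": (11099, 6713),
    "unified+A+B": (10783, 9736),
    "unified default": (12997, 11268),
    "LMetric": (16492, 15607),
    "sticky": (15236, 14838),
}


def pct_change(tracets: float, thinktime: float) -> int:
    """Relative change in percent, rounded; negative means thinktime is better."""
    return round(100 * (thinktime - tracets) / tracets)


deltas = {policy: pct_change(*pair) for policy, pair in ttft_p90.items()}
for policy, delta in deltas.items():
    print(f"{policy:16s} {delta:+d}%")
```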

`tracets` collapses the inter-turn think-time to ~0 (exp c), manufacturing bursts
→ peak concurrency → KV pressure → preemption. Those bursts punish exactly the
policy that spreads prefill thinly across hosts (LPWL keeps the tightest request
balance, 1.49×), because under a burst the spread sacrifices locality without the
slack to amortize it. Remove the artifact and LPWL's prefill-aware placement pays off.
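
To make the burst mechanism concrete, here is a toy model of when a follow-up turn gets sent under each mode. It is an assumption-laden sketch, not the replay harness: the `dispatch_time` function, the rule that `tracets` never sends a turn before its parent finishes, and the reading of `time_to_parent_chat` as "seconds between the parent finishing and the child being sent" are all assumptions made for illustration.

```python
def dispatch_time(mode: str, trace_ts: float, parent_done: float,
                  think_time: float) -> float:
    """Time (s) at which a follow-up turn is sent, under a toy model."""
    if mode == "tracets":
        # Assumed: send at the recorded timestamp, but not before the
        # parent turn has finished on the replay cluster.
        return max(trace_ts, parent_done)
    if mode == "thinktime":
        # Assumed: wait the recorded think-time after the parent turn
        # actually finishes on the replay cluster.
        return parent_done + think_time
    raise ValueError(f"unknown mode: {mode}")


# Hypothetical turn: in the original trace the parent finished at t=40 s and
# the child was stamped at t=60 s (20 s of think-time). In the replay the
# parent is slower and finishes at t=75 s.
trace_ts, parent_done, think_time = 60.0, 75.0, 20.0

gap_tracets = dispatch_time("tracets", trace_ts, parent_done, think_time) - parent_done
gap_thinktime = dispatch_time("thinktime", trace_ts, parent_done, think_time) - parent_done
print(gap_tracets, gap_thinktime)
```

Under these assumptions, any turn whose parent finishes later than the recorded timestamp is sent with zero gap in `tracets`, so delayed sessions pile up into a burst, while `thinktime` keeps the recorded gap.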

### Finding 2 — the dispatch mode **flips the cross-policy ranking**

- **TTFT p90:** `tracets` → `unified_ab (10.8s) ≈ LPWL (11.1s)`: LPWL only *ties*,
  and is in fact slightly behind. `thinktime` → **LPWL (6.7s)** < unified_ab (9.7s):
  LPWL is first, **−31%** vs the tuned baseline.
- **E2E mean:** `tracets` → unified_def (8.4s) < unified_ab (8.5s) < **LPWL (9.8s)**
  — LPWL is *3rd, behind both unified variants*. `thinktime` → **LPWL (6.8s)** <
  unified_ab (7.1s) < unified_def (8.0s): LPWL is **first**.

So under artificial `tracets` bursts the parameter-free policy looks tied-or-worse;
under the faithful `thinktime` load it is the clear winner on TTFT and E2E, with
zero tuning knobs and the best request balance.
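
The flip can be reproduced directly from the E2E mean column of the Result table. The sketch below sorts the five policies per mode; the dictionary layout and the `ranking` helper are illustrative, and the keys reuse the policy names from Setup.

```python
# E2E mean in ms, copied from the Result table.
e2e_mean = {
    "leastwork":   {"tracets": 9827,  "thinktime": 6788},
    "unified_ab":  {"tracets": 8531,  "thinktime": 7131},
    "unified_def": {"tracets": 8366,  "thinktime": 7975},
    "lmetric":     {"tracets": 10775, "thinktime": 9902},
    "sticky":      {"tracets": 10139, "thinktime": 8663},
}


def ranking(mode: str) -> list:
    """Policies ordered best (lowest E2E mean) to worst for one dispatch mode."""
    return sorted(e2e_mean, key=lambda policy: e2e_mean[policy][mode])


rank_tracets = ranking("tracets")
rank_thinktime = ranking("thinktime")
print("tracets:  ", rank_tracets)
print("thinktime:", rank_thinktime)
```

`leastwork` (LPWL) moves from third to first, while `lmetric` and `sticky` keep the last two places in both modes.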

## Conclusion

**Benchmark agentic routing with `thinktime`. Under it, the parameter-free LPWL is
the best of the five policies.** Against the *tuned* `unified+A+B` it improves
TTFT p90 by 31%, E2E mean by 5%, and E2E p90 by 6%, and it has the best TPOT p90
and the tightest request balance of any policy. The `tracets` burst artifact is
precisely what erases that advantage (it even drops LPWL to 3rd on E2E mean). This
both confirms exp (c)'s prediction and is independent evidence for the
GPU-hit-first routing story: faithful load rewards keeping the active working set
GPU-resident.

## Caveats

- **n = 1 per arm.** The `tracets` ranking here does **not** reproduce the earlier
  dash1 `analysis/lpwl_5policy_600s.md` (which saw LPWL win TTFT p90 −31% *in
  tracets*); on dash0 `tracets` it is a tie. In other words, **`tracets` rankings
  are run/harness-sensitive**; the robust signal is the `thinktime` advantage,
  which appears in *both* environments. Repeat ×3 to bound noise.
- LPWL's one persistent weak spot is **E2E p99** (thinktime 69.9s vs unified_ab
  63.8s). This is the structural HEAVY+ >50k decode tail, which is identical
  across policies and not routing-fixable (see `lpwl_5policy_600s.md` κ-ablation).
- `thinktime` advantage is a capacity-slack effect; under saturation the modes
  converge (exp c, N=6).

## Repro
```bash
# 1. annotate the full trace with time_to_parent_chat (dash0; once)
python scripts/add_ttp_streaming.py 051315-051317.jsonl 051315-051317-ttp.jsonl \
    051315-051317-raw.jsonl
# 2. resample (same seed reproduces traces/w600_r0.0015_st30.jsonl plus the ttp field);
#    the first600s file used in step 3 is this output filtered to timestamp < 600
python scripts/sample_trace.py --input 051315-051317-ttp.jsonl \
    --output traces/w600_r0.0015_st30_ttp.jsonl \
    --window-seconds 600 --sample-ratio 0.0015 --max-single-turn-ratio 0.30 --seed 42
# 3. run both modes x 5 policies (~3.5 h, fresh vLLM/arm)
TRACE_FILE=traces/w600_r0.0015_st30_first600s_ttp.jsonl \
    bash microbench/connector_tax/cache_sweep/run_5policy_both_modes.sh
# 4. report + plot
python scripts/bench_report.py --root outputs/policy5_600s_thinktime_<date> \
    --json v2/exp_d_policy_dispatch/results/thinktime.json \
    leastwork unified_ab unified_def lmetric sticky
python v2/exp_d_policy_dispatch/plot.py
```