Gahow Wang 559faa1e26 B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT
The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops
to 2.26x at 65k. The naive reading is "interference gets weaker for
huge prefills"; the actual mechanism is a regime shift, and reading
TPOT p90 alone is misleading.

Three superimposed effects:

1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that
   chunked-prefill keeps interleaving decode steps, so overlapping
   decodes trickle tokens out at painful per-token rates. A 65k
   prefill is long enough that overlapping decodes are *fully*
   blocked for ~10s; once they break through, the injection is
   winding down and subsequent iterations run unobstructed. The
   cost lands on the TTFT clock (14s) instead of inflating TPOT.

2. Bimodal TPOT distribution. At 65k overlap, decodes split into
   "blocked entire prefill then normal rate" and "trickled slowly
   through prefill chunks". p99 sits on the second population and
   grows 59 -> 169.5 ms; p90 sits on the first and shrinks.

3. "Clean" stops being clean. With 4x ~10s injections in 60s, the
   110 "clean" decodes at 65k are squeezed into 2-3s recovery
   pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (~39%), inflating
   the denominator of the ratio and so deflating the idx.
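
As a rough arithmetic check on effect 3, the sketch below assumes the
idx is overlapped TPOT p90 divided by clean TPOT p90 (this message does
not spell out the definition) and that 6.9 ms is the clean baseline the
65k cell would otherwise be compared against. The function and constant
names are illustrative, not taken from the repo.

```python
def interference_idx(overlap_ms: float, clean_ms: float) -> float:
    """Assumed definition of 'idx': overlapped TPOT p90 / clean TPOT p90."""
    return overlap_ms / clean_ms


# Figures quoted in this commit message.
CLEAN_P90_BASELINE_MS = 6.9
CLEAN_P90_AT_65K_MS = 9.6
IDX_AT_65K = 2.26

# Overlapped p90 implied by the reported idx under the assumed definition.
implied_overlap_ms = IDX_AT_65K * CLEAN_P90_AT_65K_MS

# Same numerator, but divided by the lower clean baseline.
rebased_idx = interference_idx(implied_overlap_ms, CLEAN_P90_BASELINE_MS)

# Relative rise of the clean p90 itself.
clean_rise_pct = (CLEAN_P90_AT_65K_MS / CLEAN_P90_BASELINE_MS - 1) * 100

print(f"implied overlap p90: {implied_overlap_ms:.2f} ms")
print(f"idx rebased to the lower clean baseline: {rebased_idx:.2f}x")
print(f"clean p90 rise: {clean_rise_pct:.0f}%")
```

Under these assumptions the denominator shift alone moves the 65k idx
from 2.26x to roughly 3.1x, so it accounts for only part of the drop
from 7.89x; effects 1 and 2 carry the rest.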

window_1_results.md adds a new B2 subsection laying out the
mechanism with the per-cell data table and the explicit reading
rule: headline interference metric is TTFT idx (monotone); TPOT
p99 is the right tail indicator; TPOT p90 alone is unsafe across
regime shifts. Direct implication: TTFT and TPOT need separate
SLO thresholds under PD-colo, because they measure costs from
different points in the request lifecycle and the cost migration
between them is workload-dependent.
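
A minimal sketch of why p90 alone can mislead on a bimodal TPOT
distribution. The sample values (10 ms and 170 ms) and the population
splits are synthetic, chosen only to illustrate the reading rule; they
are not measurements from B2, and `percentile` and `mixed_tpot` are
hypothetical helpers.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest sample with >= q% of data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def mixed_tpot(n_fast, n_slow, fast_ms=10.0, slow_ms=170.0):
    """Synthetic two-population TPOT sample (all values are assumptions)."""
    return [fast_ms] * n_fast + [slow_ms] * n_slow


# Regime A: 20% of decodes trickle slowly through prefill chunks.
mostly_trickled = mixed_tpot(n_fast=80, n_slow=20)
# Regime B: only 5% trickle; the rest are blocked, then run at normal rate.
mostly_blocked = mixed_tpot(n_fast=95, n_slow=5)

for name, sample in [("A", mostly_trickled), ("B", mostly_blocked)]:
    print(name, "p90 =", percentile(sample, 90), "p99 =", percentile(sample, 99))
```

In this toy setup p90 falls from the slow population to the fast one as
the slow share drops below 10%, while p99 stays on the slow population
in both regimes, which is the pattern the reading rule guards against.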

current_results/characterization_claim_matrix.md adds a new
supported claim for the cost migration, listed against the existing
B2 evidence. current_results/reviewer_risk_register.md adds a
low-severity entry warning future readers off TPOT p90 alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-26 00:35:45 +08:00