B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT

The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops to 2.26x at 65k. The naive reading is "interference gets weaker for huge prefills"; the actual mechanism is a regime shift, and reading TPOT p90 alone is misleading. Three superimposed effects:

1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that chunked-prefill keeps interleaving decode steps, so overlapping decodes trickle tokens out at painful per-token rates. A 65k prefill is long enough that overlapping decodes are *fully* blocked for ~10s; once they break through, the injection is winding down and subsequent iterations run unobstructed. The cost lands on the TTFT clock (14s) instead of inflating TPOT.
2. Bimodal TPOT distribution. At 65k overlap, decodes split into "blocked entire prefill then normal rate" and "trickled slowly through prefill chunks". p99 sits on the second population and grows 59 -> 169.5 ms; p90 sits on the first and shrinks.
3. "Clean" stops being clean. With 4x ~10s injections in 60s, the 110 "clean" decodes at 65k are squeezed into 2-3s recovery pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (40%), shrinking the denominator of the ratio.

window_1_results.md adds a new B2 subsection laying out the mechanism with the per-cell data table and the explicit reading rule: headline interference metric is TTFT idx (monotone); TPOT p99 is the right tail indicator; TPOT p90 alone is unsafe across regime shifts.

Direct implication: TTFT and TPOT need separate SLO thresholds under PD-colo, because they measure costs from different points in the request lifecycle and the cost migration between them is workload-dependent.

current_results/characterization_claim_matrix.md adds a new supported claim for the cost migration, listed against the existing B2 evidence.

current_results/reviewer_risk_register.md adds a low-severity entry warning future readers off TPOT p90 alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

@@ -13,6 +13,7 @@ sweep, B2 PD-colo interference microbench).
| Cache-aware LMetric leaves a measurable locality gap (22.7 pp). | `supported` | lmetric achieved 56.9% vs intra-session ceiling 79.6%; B3 sweep window_1_results/b3_policy_comparison.json. | — | sticky data shows the gap can be recovered by harder affinity. |
| Hybrid affinity (`unified`) breaks the locality-vs-latency tradeoff. | `supported` | unified APC 79.4% (99.7% of the 79.6% intra-session ceiling) and TTFT p90 7.24 s (lmetric: 15.6 s). | — | unified concentrates load on a single very hot worker (engine_4 at 37.7 s p90); hotspot_index 3.35. |
| Same-worker prefill-decode interference is causal, not correlation. | `supported` | B2 microbench: different-worker control idx 0.92-1.02 across 32× prefill-size variation; same-worker TTFT idx scales 2.15× (2k) → 218× (65k). window_1_results/b2_sweep_summary.json. | — | Synthetic decode load (256-token prompts at 4 req/s) bounds the realism; production behavior is layered on top of B3. |
| The cost of same-worker prefill interference migrates from TPOT to TTFT as prefill size grows past the chunked-prefill horizon. | `supported` | B2 same-worker TPOT p90 idx peaks at 32k (7.89×) and *drops* at 65k (2.26×), while TTFT idx grows monotonically (94.6× → 218×) and TPOT p99 grows monotonically (59 → 169.5 ms). See window_1_results.md "TPOT idx peaks at 32k, not 65k". | — | SLO thresholds for TTFT and TPOT cannot be the same under PD-colo; this should be reflected in B4 SRR sweep design. |
| Hard session affinity (`sticky`) inflates same-worker prefill-decode interference. | `supported` | sticky interference_index 13.65 vs lmetric 6.53; sticky's slow-request breakdown 57% same-worker overlap vs lmetric 23%. | — | Confirms the B2 causal claim observed at the system level. |
| Heavy-tail sessions are a contributor to hot-spot but not the sole cause. | `supported` | Cap-8 trace (37% requests dropped) reduces hotspot_index only 13% (2.24 → 1.94). | Run capped under unified to see whether unified's hotspot also persists. | Reviewer might counter that cap=8 is too soft; a stricter cap could be tried. |
| SRR per policy under SLO is not yet measured. | `not_yet_supported` | B3 was driven by trace timestamps with strict session sequentiality; saturation is reached but not parameterized. | Run B4 with the A4 open-loop Poisson loadgen, per-class SLO, 5 policies × λ binary search. | Without B4 the paper cannot claim "policy X sustains higher load than Y". |
@@ -13,3 +13,4 @@ Updated 2026-05-25 after Window 1.
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
| Reading B2 same-worker interference from TPOT p90 alone gives a non-monotone curve | low | TPOT p90 idx peaks at 32k (7.89×) then drops at 65k (2.26×) even though TTFT idx grows monotonically (94.6× → 218×) and TPOT p99 grows monotonically (59 → 169.5 ms). The drop is regime shift (cost migrates from TPOT to TTFT once prefill blocks first-token long enough), not interference relief. | Reports must lead with TTFT idx; TPOT p99 is the right tail indicator for TPOT. See window_1_results.md §"TPOT idx peaks at 32k, not 65k". |
@@ -116,15 +116,45 @@ Setup: 2 vLLM instances on GPU 0 (decode endpoint) and GPU 1 (prefill endpoint).
| same | 32k | 67 | 173 | **7.89** | **94.6×** |
| same | 65k | 130 | 110 | 2.26* | **218×** |

\*65k TPOT idx is suppressed because n_overlap > n_clean: by the time the 65k prefill is finishing, the 4-second gap to the next injection has already started decoding overlap. The "clean" decodes left are the ones that randomly hit the brief gaps between injections.
\*65k TPOT idx is non-monotone; see §"TPOT idx peaks at 32k, not 65k" below.

Figures: `fig_b2_tpot_vs_prefill.png`, `fig_b2_ttft_vs_prefill.png`.

**Why this matters**
- The `different-worker` control sits at idx ≈ 1.0 across 32× variation in prefill size. This is the cleanest possible disproof of "any prefill anywhere hurts decode": prefill on a *different* worker is invisible to the decode worker.
- The `same-worker` curve is monotone in prefill size for TTFT (218× at 65k) and monotone-up-to-32k for TPOT (7.89×). The two ablations together establish causation: prefill-decode interference is a same-worker phenomenon and scales sharply with prefill mass.
- The `same-worker` TTFT curve is monotone in prefill size all the way to 218× at 65k. TPOT p90 is monotone only up to 32k (7.89×), then drops at 65k — this is not "interference relaxing", it is the cost regime shifting from TPOT to TTFT (see below).
- This is the mechanism behind the B3 sticky interference jump (13.65) and unified's single hot worker (engine_4 at 37.7 s TTFT p90).
### TPOT idx peaks at 32k, not 65k — regime shift, not relief

The naïve reading of the table is "interference gets worse up to 32k, then drops at 65k". That reading is wrong: the cost shifts from per-token rate (TPOT) to first-token wait (TTFT), and the idx ratio (`p90 overlap / p90 clean`) happens to compress the visible cost. Three superimposed effects are at work.

Detail for the `same` variant across the regime boundary:

```
                        32k     65k     change
n_overlap                67     130     +94%   (most decodes now overlap)
n_clean                 173     110     -36%
TPOT p50 overlap (ms)  12.2    20.1     +1.6x
TPOT p90 overlap (ms)  54.8    21.7     -2.5x  <- "improves"
TPOT p99 overlap (ms)  59.0   169.5     +2.9x  <- tail explodes
TTFT p90 overlap (s)   4.17   14.06     +3.4x
TPOT p90 clean (ms)     6.9     9.6     +40%
```

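
The idx values and the change column can be sanity-checked from the table itself. The sketch below assumes that TPOT p90 idx is the overlap p90 divided by the clean p90 (the definition implied by the "denominator" remark under Mechanism 3). It works from the rounded table values, so the 32k cell comes out near 7.94 rather than the reported 7.89. The dictionary keys and function names are illustrative and are not taken from the B2 scripts.

```python
# Values copied from the 32k / 65k same-worker table (rounded as reported there).
CELLS = {
    "32k": {"tpot_p90_overlap_ms": 54.8, "tpot_p99_overlap_ms": 59.0,
            "ttft_p90_overlap_s": 4.17, "tpot_p90_clean_ms": 6.9},
    "65k": {"tpot_p90_overlap_ms": 21.7, "tpot_p99_overlap_ms": 169.5,
            "ttft_p90_overlap_s": 14.06, "tpot_p90_clean_ms": 9.6},
}


def tpot_p90_idx(cell):
    """Assumed definition: overlap TPOT p90 divided by clean TPOT p90."""
    return cell["tpot_p90_overlap_ms"] / cell["tpot_p90_clean_ms"]


def ratio(metric):
    """Value at 65k relative to the value at 32k for one metric."""
    return CELLS["65k"][metric] / CELLS["32k"][metric]


idx_32k = tpot_p90_idx(CELLS["32k"])
idx_65k = tpot_p90_idx(CELLS["65k"])

print(f"TPOT p90 idx: 32k={idx_32k:.2f}, 65k={idx_65k:.2f}")
for metric in ("tpot_p90_overlap_ms", "tpot_p99_overlap_ms",
               "ttft_p90_overlap_s", "tpot_p90_clean_ms"):
    print(f"{metric}: 65k/32k = {ratio(metric):.2f}")
```

Only the p90 overlap ratio falls below 1; the p99, TTFT, and clean-baseline ratios all rise, which is the pattern the three mechanisms explain.
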
**Mechanism 1: Cost shifts from TPOT to TTFT.** TPOT is measured only *after* a request starts emitting tokens. A 32k prefill (~5 s on H20) is short enough that vLLM's chunked-prefill scheduler keeps interleaving decode steps, so overlapping decodes trickle tokens out at painfully slow per-token rates (TPOT p90 54.8 ms). A 65k prefill (~10 s) is long enough that many overlapping decodes get *zero* tokens for nearly the whole prefill window; by the time they break through, the injection is winding down, so subsequent decode iterations run unobstructed. The cost lands on the TTFT clock (TTFT p90 14.06 s) instead of inflating TPOT.

**Mechanism 2: Bimodal TPOT distribution hides under p90.** At 65k overlap, two populations of decodes coexist:
- decodes blocked for the entire prefill (high TTFT, then normal per-token rate)
- decodes that did trickle slowly through prefill chunks (low TTFT, high TPOT)

The p99 jump from 59 to 169.5 ms shows that the second population is *worse* at 65k than at 32k. p90 happens to fall on the first (fast-after-block) population.

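
A toy sample makes the p90/p99 split concrete. The sketch below is purely illustrative: only the sample size (130, the 65k n_overlap) comes from the table, while the population split and the per-population TPOT values are invented, and `percentile` is a hypothetical nearest-rank helper rather than the estimator used by the B2 scripts.

```python
def percentile(values, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    # Integer form of ceil(p * n / 100), which avoids float rounding.
    rank = max(1, (p * len(ordered) + 99) // 100)
    return ordered[rank - 1]


# Invented split: 122 decodes blocked for the whole prefill and then running
# at a near-normal rate, 8 decodes that trickled through the prefill chunks.
blocked_then_fast = [20.0] * 122
trickled = [170.0] * 8
sample = blocked_then_fast + trickled

p90 = percentile(sample, 90)
p99 = percentile(sample, 99)
print(f"p90={p90} ms, p99={p99} ms")

# If the slow population exceeded 10% of the sample, p90 would land on it.
heavier_tail = [20.0] * 110 + [170.0] * 20
p90_heavier = percentile(heavier_tail, 90)
print(f"p90 with a 20-decode slow population: {p90_heavier} ms")
```

As long as the slow population stays under 10% of the overlapping decodes, p90 reports the fast population and only p99 sees the tail, which is why the reading rule below prefers p99 for TPOT.
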
**Mechanism 3: "Clean" stops being clean.** With 4 × ~10 s injections spread across 60 s (about 40 s of injection time and 20 s of gaps), there are very few moments when the worker is truly idle. The 110 "clean" decodes at 65k are squeezed into 2-3 s pockets in which the system is either recovering from the previous injection or about to be hit by the next. TPOT p90 clean rises 6.9 → 9.6 ms, so the denominator of the idx ratio drifts up by about 40%.

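
The arithmetic behind this effect is short enough to write out. The sketch below uses the approximate 10 s injection length from the text and the rounded p90 values from the table. The "fixed baseline" idx, which reuses the 32k clean p90 as the denominator, is a hypothetical counterfactual added for illustration and is not a metric reported by B2.

```python
WINDOW_S = 60.0        # measurement window
N_INJECTIONS = 4       # prefill injections per window
INJECTION_S = 10.0     # approximate duration of one 65k prefill

busy_s = N_INJECTIONS * INJECTION_S
idle_s = WINDOW_S - busy_s
duty_cycle = busy_s / WINDOW_S

TPOT_P90_OVERLAP_65K_MS = 21.7
TPOT_P90_CLEAN_32K_MS = 6.9
TPOT_P90_CLEAN_65K_MS = 9.6

# idx as computed against the 65k cell's own (inflated) clean baseline.
idx_own_baseline = TPOT_P90_OVERLAP_65K_MS / TPOT_P90_CLEAN_65K_MS
# Counterfactual: the same numerator against the quieter 32k clean baseline.
idx_fixed_baseline = TPOT_P90_OVERLAP_65K_MS / TPOT_P90_CLEAN_32K_MS

print(f"busy {busy_s:.0f} s, idle {idle_s:.0f} s, duty cycle {duty_cycle:.0%}")
print(f"idx vs own clean baseline:  {idx_own_baseline:.2f}")
print(f"idx vs 32k clean baseline:  {idx_fixed_baseline:.2f}")
```

Under these assumptions the baseline drift alone moves the 65k idx from roughly 3.1 to 2.26, so part of the apparent drop is a denominator effect rather than a change in the overlapping decodes.
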
**Reading rule for B2**: TTFT idx is the headline interference metric; it is monotone and reflects the user-visible "no tokens for N seconds" latency. **TPOT p99** is the right tail-sensitivity indicator (also monotone). **TPOT p90 is non-monotone across regime shifts and should not be used alone.** This has direct implications for SLO design: TTFT and TPOT cannot share the same violation threshold under PD-colo interference, because they measure costs from *different* points in the request lifecycle and the cost migration between them is workload-dependent.

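
A minimal sketch of what separate thresholds look like in a violation check follows. The `Slo` and `Request` types, the `violations` helper, and the threshold values (5 s TTFT, 30 ms TPOT) are all hypothetical; the two example requests simply reuse the 32k and 65k overlap p90 values from the table as stand-ins for individual requests.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    ttft_s: float   # hypothetical threshold on time to first token
    tpot_ms: float  # hypothetical threshold on time per output token


@dataclass(frozen=True)
class Request:
    ttft_s: float
    tpot_ms: float


def violations(req, slo):
    """Return which SLO dimensions the request violates, checked independently."""
    violated = []
    if req.ttft_s > slo.ttft_s:
        violated.append("ttft")
    if req.tpot_ms > slo.tpot_ms:
        violated.append("tpot")
    return violated


slo = Slo(ttft_s=5.0, tpot_ms=30.0)          # illustrative values only
like_32k = Request(ttft_s=4.17, tpot_ms=54.8)
like_65k = Request(ttft_s=14.06, tpot_ms=21.7)

print("32k-like:", violations(like_32k, slo))
print("65k-like:", violations(like_65k, slo))
```

With these illustrative thresholds the 32k-like request fails only on TPOT and the 65k-like request fails only on TTFT, so a check that looked at TPOT alone would pass the request that waited longest for its first token.
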
This is also a finding the paper should call out: **once same-worker prefill grows beyond a TTFT-block threshold, overlapping decodes stop paying in per-token rate and pay in queueing delay instead**. The system looks faster on per-token metrics, while users experience longer waits.

## What Window 1 does *not* answer

These need Window 2 (B4 SRR sweep + B5 failure attribution near SRR boundary):