Gahow Wang 559faa1e26 B2 finding: TPOT idx peaks at 32k, not 65k — cost migrates to TTFT

The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops
to 2.26x at 65k. The naive reading is "interference gets weaker for
huge prefills"; the actual mechanism is a regime shift, and reading
TPOT p90 alone is misleading.

Three superimposed effects:

1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that
   chunked-prefill keeps interleaving decode steps, so overlapping
   decodes trickle tokens out at painful per-token rates. A 65k
   prefill is long enough that overlapping decodes are *fully*
   blocked for ~10s; once they break through, the injection is
   winding down and subsequent iterations run unobstructed. The
   cost lands on the TTFT clock (14s) instead of inflating TPOT.

2. Bimodal TPOT distribution. At 65k overlap, decodes split into
   "blocked entire prefill then normal rate" and "trickled slowly
   through prefill chunks". p99 sits on the second population and
   grows 59 -> 169.5 ms; p90 sits on the first and shrinks.

3. "Clean" stops being clean. With 4x ~10s injections in 60s, the
   110 "clean" decodes at 65k are squeezed into 2-3s recovery
   pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (~40%), inflating
   the denominator of the ratio and so shrinking the idx.
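
The p90/p99 divergence described in effects 1 and 2 can be reproduced on
toy numbers. The sketch below is illustrative only: the TPOT samples are
synthetic (not B2 measurements), `percentile` and `interference_idx` are
hypothetical helpers, and it assumes the idx is the overlap percentile
divided by the clean percentile at the same quantile.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list, q an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def interference_idx(overlap_ms, clean_ms, q):
    """Assumed definition: overlap percentile / clean percentile."""
    return percentile(overlap_ms, q) / percentile(clean_ms, q)


# Synthetic TPOT samples in ms. These are NOT measured values.
clean = [5.0] * 100

# "Trickle" regime: every overlapping decode is slowed by interleaved chunks.
trickle_regime = [40.0] * 100

# "Blocked" regime: most decodes wait out the prefill and then run at a
# near-normal rate; a small slow population still trickles through chunks.
blocked_regime = [10.0] * 92 + [100.0] * 8

for name, overlap in [("trickle", trickle_regime), ("blocked", blocked_regime)]:
    p90_idx = interference_idx(overlap, clean, 90)
    p99_idx = interference_idx(overlap, clean, 99)
    print(f"{name}: p90 idx = {p90_idx:.1f}x, p99 idx = {p99_idx:.1f}x")
```

On this toy data the p90 idx falls between the two regimes while the p99
idx rises, because the 90th-percentile rank lands inside the fast
population once that population holds more than 90% of the samples.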

window_1_results.md adds a new B2 subsection laying out the
mechanism with the per-cell data table and the explicit reading
rule: headline interference metric is TTFT idx (monotone); TPOT
p99 is the right tail indicator; TPOT p90 alone is unsafe across
regime shifts. Direct implication: TTFT and TPOT need separate
SLO thresholds under PD-colo, because they measure costs from
different points in the request lifecycle and the cost migration
between them is workload-dependent.
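
A minimal sketch of what separate thresholds could look like in an
analyzer. Everything here is an assumption for illustration: the
`SloThresholds` type, the `slo_violations` helper, and the threshold
values are hypothetical and do not come from the Window 1 scripts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloThresholds:
    """Independent per-request limits, in milliseconds (hypothetical)."""
    ttft_ms: float
    tpot_ms: float


def slo_violations(ttft_ms, tpot_ms, slo):
    """Return which SLOs a request violates, checking each clock separately."""
    violations = []
    if ttft_ms > slo.ttft_ms:
        violations.append("ttft")
    if tpot_ms > slo.tpot_ms:
        violations.append("tpot")
    return violations


# Placeholder thresholds; real values would come from the serving SLO.
slo = SloThresholds(ttft_ms=2000.0, tpot_ms=50.0)

# A request blocked behind a long prefill: slow first token, normal decode.
print(slo_violations(ttft_ms=14000.0, tpot_ms=9.6, slo=slo))

# A request that trickled through prefill chunks: fast first token, slow decode.
print(slo_violations(ttft_ms=500.0, tpot_ms=169.5, slo=slo))
```

Checking the two clocks independently keeps a request whose cost moved
onto TTFT from looking healthy just because its TPOT is low, and vice
versa.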

current_results/characterization_claim_matrix.md adds a new
supported claim for the cost migration, listed against the existing
B2 evidence. current_results/reviewer_risk_register.md adds a
low-severity entry warning future readers off TPOT p90 alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-26 00:35:45 +08:00


Reviewer Risk Register

Updated 2026-05-25 after Window 1.

| Risk | Severity | Evidence | Mitigation |
|---|---|---|---|
| ~~Session sequentiality not proven~~ | resolved | A1 instrumentation lands per-request t_dispatch/t_first_token/t_finish unix timestamps + proxy_request_id. Smoke validation 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
| ~~Cache reuse decomposition incomplete~~ | resolved | Real reuse decomposition computed in `window_1_results/lmetric_reuse.json` from joined records carrying session_id + hash_ids + cached_tokens. | n/a |
| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn cached_tokens distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. `unified` and `capped` are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
| Unified missing `interference_index` due to analyzer truncate-write bug | medium | The original `b3_analyze.sh` unconditionally ran `slice_engine_state.py` on each policy and wrote with `open("w")`, overwriting unified's correctly-written engine_state with the empty-window slice from the (hot-sweep) shared dir. | Fixed in commit `df32499`. B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is `VLLM::EngineCor`; `pkill -f "vllm serve"` misses it. Killed manually on 2026-05-25; cleanup logic in `b3_sweep.sh` and `b3_isolated_policy.sh` now also targets `EngineCore`. | n/a |
| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
| Reading B2 same-worker interference from TPOT p90 alone gives a non-monotone curve | low | TPOT p90 idx peaks at 32k (7.89×) then drops at 65k (2.26×) even though TTFT idx grows monotonically (94.6× → 218×) and TPOT p99 grows monotonically (59 → 169.5 ms). The drop is a regime shift (cost migrates from TPOT to TTFT once prefill blocks first-token long enough), not interference relief. | Reports must lead with TTFT idx; TPOT p99 is the right tail indicator for TPOT. See window_1_results.md §"TPOT idx peaks at 32k, not 65k". |
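
The truncate-write failure mode in the `interference_index` row is a general hazard of opening an output file with `open("w")` before checking that there is anything to write. The sketch below shows one common guard, assuming JSON-serializable records. It is not the fix applied in `df32499`, and `write_slice_if_nonempty` is a hypothetical helper, not a function in `slice_engine_state.py`.

```python
import json
import os
import tempfile


def write_slice_if_nonempty(path, records):
    """Replace `path` with `records` only when the slice has data.

    Returns True if the file was written, False if the existing file
    was left untouched because the slice was empty.
    """
    if not records:
        # An empty window must not clobber a previously written result.
        return False

    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(records, handle)
        # Swap the finished temp file into place in one step, so the
        # target is never left truncated or half-written.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
    return True
```

Writing to a temporary file in the same directory and swapping it in with `os.replace` also means a crash mid-write leaves the previous output intact.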
