The B2 same-worker TPOT p90 idx is non-monotone: 7.89x at 32k drops to 2.26x at 65k. The naive reading is "interference gets weaker for huge prefills"; the actual mechanism is a regime shift, and reading TPOT p90 alone is misleading.

Three superimposed effects:

1. Cost migration TPOT -> TTFT. A 32k prefill is short enough that chunked-prefill keeps interleaving decode steps, so overlapping decodes trickle tokens out at painful per-token rates. A 65k prefill is long enough that overlapping decodes are *fully* blocked for ~10s; once they break through, the injection is winding down and subsequent iterations run unobstructed. The cost lands on the TTFT clock (14s) instead of inflating TPOT.

2. Bimodal TPOT distribution. At 65k overlap, decodes split into "blocked entire prefill then normal rate" and "trickled slowly through prefill chunks". p99 sits on the second population and grows 59 -> 169.5 ms; p90 sits on the first and shrinks.

3. "Clean" stops being clean. With 4x ~10s injections in 60s, the 110 "clean" decodes at 65k are squeezed into 2-3s recovery pockets. TPOT p90 clean rises 6.9 -> 9.6 ms (40%), which inflates the denominator of the ratio and shrinks the idx further.

window_1_results.md adds a new B2 subsection laying out the mechanism with the per-cell data table and the explicit reading rule: headline interference metric is TTFT idx (monotone); TPOT p99 is the right tail indicator; TPOT p90 alone is unsafe across regime shifts.

Direct implication: TTFT and TPOT need separate SLO thresholds under PD-colo, because they measure costs from different points in the request lifecycle and the cost migration between them is workload-dependent.

current_results/characterization_claim_matrix.md adds a new supported claim for the cost migration, listed against the existing B2 evidence.

current_results/reviewer_risk_register.md adds a low-severity entry warning future readers off TPOT p90 alone.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Reviewer Risk Register
Updated 2026-05-25 after Window 1.
| Risk | Severity | Evidence | Mitigation |
|---|---|---|---|
| | resolved | A1 instrumentation lands per-request t_dispatch/t_first_token/t_finish unix timestamps + proxy_request_id. Smoke validation 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
| | resolved | Real reuse decomposition computed in window_1_results/lmetric_reuse.json from joined records carrying session_id + hash_ids + cached_tokens. | — |
| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn cached_tokens distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. unified and capped are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
| Unified missing interference_index due to analyzer truncate-write bug | medium | The original b3_analyze.sh unconditionally ran slice_engine_state.py on each policy and used open("w"), overwriting unified's correctly written engine_state with the empty-window slice from the (hot-sweep) shared dir. | Fixed in commit df32499. B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is VLLM::EngineCor; pkill -f "vllm serve" misses it. Killed manually on 2026-05-25; cleanup logic in b3_sweep.sh and b3_isolated_policy.sh now also targets EngineCore. | — |
| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
| Reading B2 same-worker interference from TPOT p90 alone gives a non-monotone curve | low | TPOT p90 idx peaks at 32k (7.89×) then drops at 65k (2.26×) even though TTFT idx grows monotonically (94.6× → 218×) and TPOT p99 grows monotonically (59 → 169.5 ms). The drop is regime shift (cost migrates from TPOT to TTFT once prefill blocks first-token long enough), not interference relief. | Reports must lead with TTFT idx; TPOT p99 is the right tail indicator for TPOT. See window_1_results.md §"TPOT idx peaks at 32k, not 65k". |
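
The last row's reading rule can be illustrated with a small sketch. It assumes the interference idx is the overlap percentile divided by the clean percentile, and it uses a nearest-rank percentile; both are assumptions about the analyzer, not confirmed here. The TPOT samples are hypothetical values chosen to mimic the two regimes, not measured data.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile of a non-empty list; q is an integer in 1..100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def interference_idx(overlap, clean, q):
    """Assumed definition: overlap percentile divided by clean percentile."""
    return percentile(overlap, q) / percentile(clean, q)


# Hypothetical per-request TPOT samples in ms (illustrative only).
clean = [7.0] * 100

# Moderate prefill: chunked prefill keeps interleaving decode steps,
# so half of the overlapping decodes trickle out slowly.
moderate = [7.0] * 50 + [55.0] * 50

# Huge prefill: most decodes are blocked, then run at near-normal rate
# (their cost moved to TTFT); a small second population trickles very slowly.
huge = [9.0] * 92 + [170.0] * 8

for name, overlap in (("moderate", moderate), ("huge", huge)):
    p90_idx = interference_idx(overlap, clean, 90)
    p99_idx = interference_idx(overlap, clean, 99)
    print(f"{name}: TPOT p90 idx={p90_idx:.2f}, TPOT p99 idx={p99_idx:.2f}")
```

In this toy data the p90 idx falls between the two regimes while the p99 idx rises, because p90 lands on the blocked-then-normal population and p99 lands on the slow-trickle population. That is the same shape as the non-monotone curve described in the row above, which is why p99 is the safer tail indicator across a regime shift.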