Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
Flips seven entries from "not_yet_supported" / "partially_supported" to "supported" with pointers into window_1_results/. New entries cover per-session sequentiality, KV per request, real reuse decomposition, theoretical APC ceiling, the LMetric locality gap, Unified breaking the locality-vs-latency tradeoff, B2 causal interference proof, sticky's interference inflation, and the partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
New "Allowed For Routing-Policy Comparison" section pins the five B3 policy directories. New "Allowed For PD-colo Interference" section pins the B2 sweep. Legacy section retained for the pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
Marks the two old "high"-severity risks (sequentiality / reuse decomposition) as resolved; adds new entries for the APC contamination empirics, the b3_analyze.sh truncate-write bug that cost unified's interference index, the GPU-0 EngineCore ghost cleanup, the saturated-replay caveat for trace-timestamp dispatch, and the synthetic B2 decode workload.

current_results/all_figures_index.md
Adds the 8 new Window 1 figures alongside the existing 6 from the legacy summarize_runs run.

current_results/reproduction_commands.sh
Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2.9 KiB
Reviewer Risk Register
Updated 2026-05-25 after Window 1.
| Risk | Severity | Evidence | Mitigation |
|---|---|---|---|
| Sequentiality | resolved | A1 instrumentation records per-request t_dispatch/t_first_token/t_finish Unix timestamps plus proxy_request_id. Smoke validation on 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
| Reuse decomposition | resolved | Real reuse decomposition computed in window_1_results/lmetric_reuse.json from joined records carrying session_id + hash_ids + cached_tokens. | — |
| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn cached_tokens distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. unified and capped are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
| Unified missing interference_index due to analyzer truncate-write bug | medium | The original b3_analyze.sh unconditionally ran slice_engine_state.py on each policy and wrote with open("w"), overwriting unified's correctly written engine_state with the empty-window slice from the shared (hot-sweep) dir. | Fixed in commit df32499. B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is VLLM::EngineCor; pkill -f "vllm serve" misses it. Killed manually on 2026-05-25; cleanup logic in b3_sweep.sh and b3_isolated_policy.sh now also targets EngineCore. | — |
| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | A reviewer might prefer cap=2 or cap=4 to test the "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if a reviewer pushes back; the underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
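
The truncate-write failure in the interference_index row can be guarded against in general. The sketch below is not the fix from commit df32499. It is a minimal illustration under two assumptions: that engine_state is a JSON list of samples, and that a helper named `write_engine_state` (hypothetical) is the only writer. It refuses to replace a populated file with an empty slice, and it writes through a temporary file so a crash cannot leave a truncated result.

```python
import json
import os
import tempfile


def write_engine_state(path, samples):
    """Write samples to path unless that would replace data with nothing.

    Hypothetical helper; the JSON-list file format is an assumption.
    Returns True if the file was written, False if the write was skipped.
    """
    has_existing = os.path.exists(path) and os.path.getsize(path) > 0
    if not samples and has_existing:
        # An empty slice must not clobber a populated engine_state file.
        return False
    # Write to a temp file in the same directory, then rename over the
    # target, so readers never observe a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(samples, handle)
    os.replace(tmp_path, path)
    return True
```

With this guard, an analyzer pass that slices an empty window for an isolated policy would skip the write and leave the existing file untouched.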
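
The APC contamination row relies on the first-turn cached_tokens distribution. One way such a check could be computed from the joined records is sketched below. The field names session_id, t_dispatch, and cached_tokens come from the register; the function name, the choice to count sessions rather than tokens, and treating any nonzero first-turn cached_tokens as contamination are assumptions, not the analysis actually used.

```python
def first_turn_cached_fraction(records):
    """Fraction of sessions whose earliest request reports cached_tokens > 0.

    Assumes each record is a dict with session_id, t_dispatch, cached_tokens,
    and that the earliest t_dispatch in a session marks its first turn.
    """
    first_by_session = {}
    for rec in records:
        sid = rec["session_id"]
        best = first_by_session.get(sid)
        if best is None or rec["t_dispatch"] < best["t_dispatch"]:
            first_by_session[sid] = rec
    if not first_by_session:
        return 0.0
    hits = sum(1 for rec in first_by_session.values() if rec["cached_tokens"] > 0)
    return hits / len(first_by_session)


# Toy records for illustration only; not measured data.
example = [
    {"session_id": "a", "t_dispatch": 1.0, "cached_tokens": 0},
    {"session_id": "a", "t_dispatch": 2.0, "cached_tokens": 512},
    {"session_id": "b", "t_dispatch": 1.5, "cached_tokens": 128},
    {"session_id": "c", "t_dispatch": 3.0, "cached_tokens": 0},
    {"session_id": "d", "t_dispatch": 0.5, "cached_tokens": 0},
]
print(first_turn_cached_fraction(example))
```

A first turn can also hit the cache legitimately, for example through a prompt prefix shared across sessions, so this count is an upper bound on cross-policy contamination rather than a direct measurement of it.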