Gahow Wang 4722883903 Audit package refresh: Window 1 supported claims + risk register

Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported"
  to "supported" with pointers into window_1_results/. New entries
  cover per-session sequentiality, KV per request, real reuse
  decomposition, theoretical APC ceiling, the LMetric locality gap,
  Unified breaking the locality-vs-latency tradeoff, B2 causal
  interference proof, sticky's interference inflation, and the
  partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
  "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five
  B3 policy directories. New "Allowed For PD-colo Interference"
  section pins the B2 sweep. Legacy section retained for the
  pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC
  contamination empirics, the b3_analyze.sh truncate-write bug that
  cost unified's interference index, the GPU-0 EngineCore ghost
  cleanup, the saturated-replay caveat for trace-timestamp dispatch,
  and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
  only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

2026-05-25 23:25:27 +08:00

2.9 KiB

Reviewer Risk Register

Updated 2026-05-25 after Window 1.

| Risk | Severity | Evidence | Mitigation |
| --- | --- | --- | --- |
| ~~Session sequentiality not proven~~ | resolved | A1 instrumentation records per-request `t_dispatch`/`t_first_token`/`t_finish` Unix timestamps plus `proxy_request_id`. Smoke validation on 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits it. |
| ~~Cache reuse decomposition incomplete~~ | resolved | Real reuse decomposition computed in `window_1_results/lmetric_reuse.json` from joined records carrying `session_id` + `hash_ids` + `cached_tokens`. | n/a |
| APC across hot-sweep policies may be contaminated by prior policy runs | low | The first-turn `cached_tokens` distribution shows < 1% empirical contamination; the `load_only` and `sticky` vLLMs were not restarted between policies. `unified` and `capped` are isolated cold-start runs. | Window 2 will isolate each policy launch by default; document in the paper that lmetric/load_only/sticky reflect a "warm-cache" condition. |
| Unified missing `interference_index` due to analyzer truncate-write bug | medium | The original `b3_analyze.sh` unconditionally ran `slice_engine_state.py` on each policy, and the script wrote with `open("w")`, overwriting unified's correctly written engine_state with the empty-window slice from the (hot-sweep) shared dir. | Fixed in commit `df32499`. The B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
| GPU 0 ghost memory after vLLM crash | low | The EngineCore subprocess name is `VLLM::EngineCor`; `pkill -f "vllm serve"` misses it. Killed manually on 2026-05-25; cleanup logic in `b3_sweep.sh` and `b3_isolated_policy.sh` now also targets `EngineCore`. | n/a |
| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600 s trace took 49 min to dispatch: the system saturates and the dispatch window expands. | Window 2 uses the A4 open-loop Poisson loadgen with an explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | A reviewer might prefer cap=2 or cap=4 to test the "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if a reviewer pushes back; the underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
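
The APC contamination figure in the register comes from the first-turn `cached_tokens` distribution. A minimal sketch of one way to compute such a rate from joined records is shown here. It assumes each record is a dict carrying `session_id`, `t_dispatch`, and `cached_tokens` (fields named in this register), and it treats the earliest-dispatched request of a session as its first turn. The function name `first_turn_hit_rate` and the record layout are illustrative assumptions, not the project's actual analysis code.

```python
def first_turn_hit_rate(records):
    """Fraction of sessions whose first turn reports any cached tokens.

    Assumption: the first turn of a session is its earliest t_dispatch.
    """
    first = {}
    for r in records:
        sid = r["session_id"]
        # Keep only the earliest-dispatched request seen so far per session.
        if sid not in first or r["t_dispatch"] < first[sid]["t_dispatch"]:
            first[sid] = r
    if not first:
        return 0.0
    hits = sum(1 for r in first.values() if r["cached_tokens"] > 0)
    return hits / len(first)


# Hypothetical sample: four sessions, one of which has a first-turn cache hit.
sample = [
    {"session_id": "a", "t_dispatch": 1.0, "cached_tokens": 0},
    {"session_id": "a", "t_dispatch": 2.0, "cached_tokens": 512},
    {"session_id": "b", "t_dispatch": 1.5, "cached_tokens": 0},
    {"session_id": "c", "t_dispatch": 3.0, "cached_tokens": 64},
    {"session_id": "d", "t_dispatch": 0.5, "cached_tokens": 0},
]
print(first_turn_hit_rate(sample))  # 0.25
```

A first-turn hit counted this way may also include legitimate prefix sharing between sessions of the same run, so the rate is best read as an upper bound on cross-policy contamination.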

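The truncate-write failure mode in the register (an empty slice written with `open("w")` over a good file) can be guarded against in general. The sketch below shows one defensive pattern: refuse to replace an existing file with an empty result, and write via a temporary file plus `os.replace` so a failed write never leaves a truncated file behind. The helper name `write_slice` and its parameters are hypothetical; this is not a description of the fix in commit `df32499`.

```python
import json
import os
import tempfile


def write_slice(path, rows, overwrite_with_empty=False):
    """Write rows to path as JSON; return True if the file was written.

    An empty rows list does not overwrite an existing file unless
    overwrite_with_empty is set.
    """
    if not rows and os.path.exists(path) and not overwrite_with_empty:
        return False  # keep the existing, presumably good, file
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as fh:
            json.dump(rows, fh)
        os.replace(tmp, path)  # swap the finished file in over the target path
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
    return True


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "engine_state.json")
    print(write_slice(target, [{"step": 1}]))  # True
    print(write_slice(target, []))             # False: good data is kept
    with open(target) as fh:
        print(json.load(fh))                   # [{'step': 1}]
```

Skipping the write silently is a design choice; a real analyzer might prefer to log or raise when a slice comes back empty, so the missing window is noticed.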