agentic-kvc/analysis/characterization/current_results/reviewer_risk_register.md
Gahow Wang 4722883903 Audit package refresh: Window 1 supported claims + risk register
Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported"
  to "supported" with pointers into window_1_results/. New entries
  cover per-session sequentiality, KV per request, real reuse
  decomposition, theoretical APC ceiling, the LMetric locality gap,
  Unified breaking the locality-vs-latency tradeoff, B2 causal
  interference proof, sticky's interference inflation, and the
  partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay
  "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five
  B3 policy directories. New "Allowed For PD-colo Interference"
  section pins the B2 sweep. Legacy section retained for the
  pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC
  contamination empirics, the b3_analyze.sh truncate-write bug that
  cost unified's interference index, the GPU-0 EngineCore ghost
  cleanup, the saturated-replay caveat for trace-timestamp dispatch,
  and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE;
  only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:27 +08:00


# Reviewer Risk Register

Updated 2026-05-25 after Window 1.

| Risk | Severity | Evidence | Mitigation |
| --- | --- | --- | --- |
| Session sequentiality not proven | resolved | A1 instrumentation lands per-request `t_dispatch`/`t_first_token`/`t_finish` unix timestamps + `proxy_request_id`. Smoke validation 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
| Cache reuse decomposition incomplete | resolved | Real reuse decomposition computed in `window_1_results/lmetric_reuse.json` from joined records carrying `session_id` + `hash_ids` + `cached_tokens`. | |
| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn `cached_tokens` distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. unified and capped are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
| Unified missing `interference_index` due to analyzer truncate-write bug | medium | The original `b3_analyze.sh` unconditionally `slice_engine_state.py`'d each policy and used `open("w")`, overwriting unified's correctly-written engine_state with the empty-window slice from the (hot-sweep) shared dir. | Fixed in commit df32499. B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is `VLLM::EngineCor`; `pkill -f "vllm serve"` misses it. | Killed manually on 2026-05-25; cleanup logic in `b3_sweep.sh` and `b3_isolated_policy.sh` now also targets EngineCore. |
| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
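
The truncate-write risk above comes from opening an output file with `"w"` before knowing whether the new slice has any data. A minimal sketch of one way to guard against that follows. It is not the fix from commit df32499, whose contents are not shown here; the function name `write_slice_guarded` and the JSON-list format are assumptions made for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def write_slice_guarded(path: Path, records: list) -> bool:
    """Write `records` as JSON to `path` unless that would clobber data.

    Hypothetical helper: returns True if the file was written and False
    if the write was skipped to protect an existing non-empty file.
    """
    has_existing_data = path.exists() and path.stat().st_size > 0
    if not records and has_existing_data:
        # An empty slice must not overwrite an earlier non-empty result.
        return False

    # Write to a temp file in the same directory, then rename over the
    # target, so the target is never observed in a truncated state.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as tmp:
        json.dump(records, tmp)
    os.replace(tmp_name, path)
    return True
```

With this shape, an analyzer pass that produces an empty-window slice leaves a previously written engine-state file untouched, while a non-empty slice still replaces it.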
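
The open-loop Poisson mitigation above decouples offered load from trace timing by drawing arrival times from an explicit rate. A minimal sketch of that idea, assuming exponential inter-arrival gaps and a fixed seed for repeatability, is shown below. The A4 loadgen itself is not shown in this register, so the name `poisson_arrival_times` and its parameters are hypothetical.

```python
import random


def poisson_arrival_times(rate_per_s: float, duration_s: float, seed: int = 0) -> list:
    """Return sorted arrival offsets (seconds) for a Poisson process.

    Hypothetical helper: inter-arrival gaps are exponential with mean
    1 / rate_per_s, and arrivals at or beyond duration_s are dropped.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # local RNG so runs are reproducible
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate_per_s)
        if t >= duration_s:
            return arrivals
        arrivals.append(t)
```

Because the schedule is computed up front and does not wait on responses, a saturated server cannot stretch the dispatch window the way closed-loop, session-sequential replay can.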
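
The APC contamination check above looks at `cached_tokens` on first turns. One possible way to compute such a figure from joined records is sketched below. It assumes each record is a dict with `session_id`, `t_dispatch`, and `cached_tokens`, and it counts any first-turn cache hit as possible contamination. Both are assumptions, since the actual analysis script and its definition are not shown here, and legitimately shared cross-session prefixes would also be counted.

```python
def first_turn_cache_hit_fraction(records: list) -> float:
    """Fraction of sessions whose earliest request had cached_tokens > 0.

    Hypothetical helper: the first turn of a session is taken to be the
    record with the smallest t_dispatch for that session_id.
    """
    first_by_session = {}
    for rec in records:
        sid = rec["session_id"]
        current = first_by_session.get(sid)
        if current is None or rec["t_dispatch"] < current["t_dispatch"]:
            first_by_session[sid] = rec
    if not first_by_session:
        return 0.0
    hits = sum(1 for rec in first_by_session.values() if rec["cached_tokens"] > 0)
    return hits / len(first_by_session)
```

A value near zero on a warm-cache run would be consistent with little carry-over from earlier policy runs, under the assumptions stated above.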