Audit package refresh: Window 1 supported claims + risk register

Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported" to
  "supported" with pointers into window_1_results/. New entries cover
  per-session sequentiality, KV per request, real reuse decomposition,
  theoretical APC ceiling, the LMetric locality gap, Unified breaking the
  locality-vs-latency tradeoff, B2 causal interference proof, sticky's
  interference inflation, and the partial heavy-tail / hot-spot story.
  B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five B3
  policy directories. New "Allowed For PD-colo Interference" section pins
  the B2 sweep. Legacy section retained for the pre-instrumentation
  200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC contamination
  empirics, the b3_analyze.sh truncate-write bug that cost unified's
  interference index, the GPU-0 EngineCore ghost cleanup, the
  saturated-replay caveat for trace-timestamp dispatch, and the synthetic
  B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the
  legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only
  B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-25 23:25:27 +08:00
parent 0c3220cbb8
commit 4722883903
6 changed files with 166 additions and 121 deletions
--- a/analysis/characterization/current_results/all_figures_index.md
+++ b/analysis/characterization/current_results/all_figures_index.md
@@ -1,54 +1,29 @@
 # Figures Index

-Generated by:
-
-```bash
-.venv/bin/python analysis/characterization/plot_current_results.py
-```
+## Window 0 (pre-Window-1 audit, legacy runs)

 | Figure | Intended Claim |
 |---|---|
 | [fig_full_trace_workload.png](figures/fig_full_trace_workload.png) | Full GLM-5.1 trace is long-input, short-output, and high input/output ratio. |
 | [fig_session_skew.png](figures/fig_session_skew.png) | Session input-token mass is highly skewed; top sessions dominate work. |
-| [fig_pdsep_vs_combined.png](figures/fig_pdsep_vs_combined.png) | Existing static PD-sep A/B regresses TTFT/E2E vs combined. |
+| [fig_pdsep_vs_combined.png](figures/fig_pdsep_vs_combined.png) | Static PD-sep regresses TTFT/E2E vs combined (legacy 200-req A/B). |
 | [fig_elastic_vs_baseline.png](figures/fig_elastic_vs_baseline.png) | Existing elastic transfer-based run does not improve TTFT/TPOT over high-contention baseline. |
-| [fig_gpu_balance.png](figures/fig_gpu_balance.png) | Existing runs show GPU-util imbalance, but more data is needed for hot-spot causality. |
-| [fig_claim_status.png](figures/fig_claim_status.png) | Current audit separates supported, partial, and unsupported claims. |
+| [fig_gpu_balance.png](figures/fig_gpu_balance.png) | Existing runs show GPU-util imbalance; not sufficient for hot-spot causal claim. |
+| [fig_claim_status.png](figures/fig_claim_status.png) | Audit separates supported / partial / unsupported claims. |

-## Figure Previews
+## Window 1 (B1' + B3 + B2)

-### Full Trace Workload
+Generated by `analysis/characterization/render_window1_figures.py`.
+Source data: `analysis/characterization/window_1_results/`.

-Full GLM-5.1 trace is long-input, short-output, and high input/output ratio.
-
-![Full Trace Workload](figures/fig_full_trace_workload.png)
-
-### Session Skew
-
-Session input-token mass is highly skewed; top sessions dominate work.
-
-![Session Skew](figures/fig_session_skew.png)
-
-### PD-Sep vs Combined
-
-Existing static PD-sep A/B regresses TTFT/E2E vs combined.
-
-![PD-Sep vs Combined](figures/fig_pdsep_vs_combined.png)
-
-### Elastic vs Baseline
-
-Existing elastic transfer-based run does not improve TTFT/TPOT over high-contention baseline.
-
-![Elastic vs Baseline](figures/fig_elastic_vs_baseline.png)
-
-### GPU Balance
-
-Existing runs show GPU-util imbalance, but more data is needed for hot-spot causality.
-
-![GPU Balance](figures/fig_gpu_balance.png)
-
-### Claim Status
-
-Current audit separates supported, partial, and unsupported claims.
-
-![Claim Status](figures/fig_claim_status.png)
+| Figure | Intended Claim |
+|---|---|
+| [fig_kv_footprint_cdf.png](../window_1_results/figures/fig_kv_footprint_cdf.png) | KV per request for Qwen3-Coder-30B-A3B: p50/p90/p99 = 1.83/8.04/11.49 GiB; p99 takes 12% of H20 HBM. |
+| [fig_reuse_decomposition.png](../window_1_results/figures/fig_reuse_decomposition.png) | Cached_tokens decompose 93.2% intra / 5.7% cross / 1.1% shared on w600 lmetric run. |
+| [fig_b3_apc_vs_upper.png](../window_1_results/figures/fig_b3_apc_vs_upper.png) | Per-policy APC achieved vs theoretical intra-session ceiling (79.6%). |
+| [fig_b3_apc_vs_hotspot.png](../window_1_results/figures/fig_b3_apc_vs_hotspot.png) | Locality-vs-hotspot tradeoff across policies; unified dominates the frontier. |
+| [fig_b3_latency_bars.png](../window_1_results/figures/fig_b3_latency_bars.png) | TTFT / TPOT / E2E p90 bars per policy. |
+| [fig_b3_per_worker_ttft_p90.png](../window_1_results/figures/fig_b3_per_worker_ttft_p90.png) | Per-worker TTFT p90 distribution per policy; sticky's engine_3 and unified's engine_4 are the hot workers. |
+| [fig_b3_failure_breakdown.png](../window_1_results/figures/fig_b3_failure_breakdown.png) | Slow-request cause stacked bar per policy. |
+| [fig_b2_tpot_vs_prefill.png](../window_1_results/figures/fig_b2_tpot_vs_prefill.png) | TPOT during decode under same-worker prefill injection scales with prefill size; different-worker control flat. |
+| [fig_b2_ttft_vs_prefill.png](../window_1_results/figures/fig_b2_ttft_vs_prefill.png) | TTFT shows the same monotone same-worker scaling, peaking at 218× for 65k injection. |
--- a/analysis/characterization/current_results/characterization_claim_matrix.md
+++ b/analysis/characterization/current_results/characterization_claim_matrix.md
@@ -1,11 +1,19 @@
 # Characterization Claim Matrix

+Updated 2026-05-25 after Window 1 (B1' KV-footprint + reuse, B3 5-policy
+sweep, B2 PD-colo interference microbench).
+
 | Claim | Status | Supporting Data | Needed Next | Reviewer Risk |
 |---|---|---|---|---|
-| Batch 0 substrate audit is only partially complete for existing runs. | `partially_supported` | metrics.jsonl lacks actual dispatch/finish timestamps in current artifacts. | Add request dispatch and finish/error timestamps to future replayer/proxy metrics. | Cannot use these runs to prove online per-session sequentiality. |
-| Batch 1 workload shape can be characterized from formatted traces and metrics. | `supported_for_trace_shape` | Full compact trace CPU summary in `full_trace_summary.json`: input p50/p90/p99 = 20k/87.9k/125.5k, output p50/p90/p99 = 80/811/6.6k, top 1% sessions hold 46.5% of input-token mass. | Add cache-hit joined records for actual reuse decomposition. | Actual cache reuse decomposition needs cached_tokens joined with hash_ids. |
-| Static PD separation is worse than combined in existing 200-request GPU A/B. | `supported_by_existing_artifact` | outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep metrics.summary.json. | Refresh with PD matrix, multiple seeds, cudagraph-enabled methodology. | Legacy run has no per-stage TTFT breakdown and no step-level KV occupancy. |
-| Elastic transfer-based migration does not improve high-contention 500-request run. | `supported_by_existing_artifact` | outputs/contention_16s_ts10 vs outputs/contention_16s_elastic metrics.summary.json and gpu_util.csv. | Attribute whether failure is trigger quality, transfer overhead, or wrong load regime. | Existing metrics lack actual sequentiality proof and per-request transfer waterfall. |
-| PD-colo prefill/decode interference is not yet directly proven by step-level data in this package. | `not_yet_supported` | No decode-step and prefill-overlap timestamp artifact found in summarized runs. | Run Batch 2 controlled same-worker/different-worker injection with step timestamps. | Cannot claim interference as causal without Batch 2. |
-| Session hot-spot residual imbalance is suggested but not fully attributed. | `partially_supported` | gpu_util.csv shows per-GPU mean-util imbalance in existing runs. | Collect per-worker queue delay, session-to-worker map, and per-session token mass per worker. | GPU util imbalance alone is not enough to prove session hot-spot. |
-| SRR is not measured by existing fixed-request runs. | `not_yet_supported` | No arrival-rate sweep artifacts found. | Implement Batch 4 Poisson session-arrival SRR sweep. | Latency-at-one-load cannot support sustainable throughput claim. |
+| Per-session sequentiality is enforced when replayer + proxy carry the new join fields. | `supported` | A1 unix timestamps (t_dispatch/t_first_token/t_finish_unix) and X-Request-Id passthrough; smoke validation 2026-05-25 confirmed 30/30 join coverage. | Use this stack for all Window 2 B4/B5 SRR runs. | Legacy outputs/ without these fields still cannot be re-classified as `online_realistic`. |
+| Agentic workload is long-input / short-output / heavy-tail session mass. | `supported` | Full trace CPU summary (full_trace_summary.json): input p50/p90/p99 = 20k/87.9k/125.5k; top 1% sessions hold 46.5% of input mass. Full trace 2.11M requests, 1.31M sessions. | — | Sample trace (w600) percentiles inherit from this full trace but should not be extrapolated. |
+| KV per request for Qwen3-Coder-30B-A3B is 98304 B/token; p50/p90/p99 footprint = 1.83/8.04/11.49 GiB. | `supported` | window_1_results/kv_footprint_summary.json; computed from model config and full trace input lengths. | — | Assumes bf16; would scale for fp8/int8 quant. |
+| Workload reuse is overwhelmingly intra-session. | `supported` | Real reuse decomposition from w600 lmetric run: intra 93.2%, cross 5.7%, shared 1.1% (window_1_results/lmetric_reuse.json). Theoretical any-vs-intra ceiling gap 0.7 pp. | — | Trace-specific; ChatGPT-style workloads with long system prompts would shift toward shared-prefix. |
+| Theoretical APC ceiling on w600 trace is 79.6% (intra) / 80.3% (any-session). | `supported` | window_1_results/apc_upper_w600.json from block-level trie walk on `hash_ids`. | — | Assumes infinite per-worker cache (no eviction); achieved values must be read as fraction of this ceiling. |
+| Cache-aware LMetric leaves a measurable locality gap (22.7 pp). | `supported` | lmetric achieved 56.9% vs intra-session ceiling 79.6%; B3 sweep window_1_results/b3_policy_comparison.json. | — | sticky data shows the gap can be recovered by harder affinity. |
+| Hybrid affinity (`unified`) breaks the locality-vs-latency tradeoff. | `supported` | unified APC 79.4% (97% of intra ceiling) AND TTFT p90 7.24 s (lmetric is 15.6 s). | — | unified concentrates a single very hot worker (engine_4 at 37.7 s p90); hotspot_index 3.35. |
+| Same-worker prefill-decode interference is causal, not correlation. | `supported` | B2 microbench: different-worker control idx 0.92-1.02 across 32× prefill-size variation; same-worker TTFT idx scales 2.15× (2k) → 218× (65k). window_1_results/b2_sweep_summary.json. | — | Synthetic decode load (256-token prompts at 4 req/s) bounds the realism; production behavior is layered on top of B3. |
+| Hard session affinity (`sticky`) inflates same-worker prefill-decode interference. | `supported` | sticky interference_index 13.65 vs lmetric 6.53; sticky's slow-request breakdown 57% same-worker overlap vs lmetric 23%. | — | Confirms the B2 causal claim observed at the system level. |
+| Heavy-tail sessions contribute to hot-spotting but are not the sole cause. | `supported` | Cap-8 trace (37% of requests dropped) reduces hotspot_index by only 13% (2.24 → 1.94). | Run the capped trace under unified to see whether unified's hotspot also persists. | A reviewer might counter that cap=8 is too soft; a stricter cap could be tried. |
+| SRR per policy under SLO is not yet measured. | `not_yet_supported` | B3 was driven by trace timestamps with strict session sequentiality; saturation is reached but not parameterized. | Run B4 with the A4 open-loop Poisson loadgen, per-class SLO, 5 policies × λ binary search. | Without B4 the paper cannot claim "policy X sustains higher load than Y". |
+| Failure attribution near SRR boundary is not yet measured. | `not_yet_supported` | B5 protocol exists; no runs. | After B4, rerun each policy at 0.9× / 1.0× / 1.1× of its SRR_max with the same instrumentation, label slow requests. | The current `joined_analysis.label_slow_requests` is the labeler; needs SRR boundaries to point at. |
--- a/analysis/characterization/current_results/main_claim_allowed_runs.md
+++ b/analysis/characterization/current_results/main_claim_allowed_runs.md
@@ -1,66 +1,76 @@
 # Main-Claim Allowed Runs

-Status: current audit gate
+Status: post-Window-1 audit gate
 Date: 2026-05-25

 ## Allowed For Workload-Shape Claims

-These artifacts can support trace/workload characterization claims:
-
 - `dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`
-  - Compact formatted full trace.
-  - CPU summary recorded in `full_trace_summary.json`.
-  - Supports long-input/short-output and session token-mass skew claims.
-  - Does not prove runtime cache hits or online sequentiality.
+  - Compact formatted full trace (2.11M requests / 1.31M sessions).
+  - CPU summary in `current_results/full_trace_summary.json` and
+    Window 1 KV footprint in `window_1_results/kv_footprint_summary.json`.
+  - Supports: long-input / short-output / heavy-tail token mass /
+    KV per request distribution.
+
 - `traces/w600_r0.0015_st30.jsonl`
-  - Local sampled trace.
-  - Useful for local dry runs and figure generation.
-  - Not the canonical full-trace source.
+  - 1214 requests / 274 sessions / 53.3 M tokens.
+  - APC theoretical bounds in `window_1_results/apc_upper_w600.json`.
+  - Routing-policy comparison trace used by B3.
+
+## Allowed For Routing-Policy Comparison Claims
+
+These five runs share the same model and 8-instance topology, and all replay
+the w600 trace (`capped` uses its cap-8 derivative); they support per-policy
+claims about APC, hotspot, interference, latency, and failure breakdown.
+
+- `outputs/b3_sweep_20260525_095043/lmetric/`  — main baseline
+- `outputs/b3_sweep_20260525_095043/load_only/` — control: no cache / no affinity
+- `outputs/b3_sweep_20260525_095043/sticky/`   — control: hard affinity
+- `outputs/b3_sweep_20260525_095043/unified/`  — hybrid (interference index
+  unavailable; see the analyzer-bug entry in `reviewer_risk_register.md`)
+- `outputs/b3_sweep_20260525_095043/capped/`   — lmetric on cap-8 trace
+
+Aggregated comparison: `outputs/b3_sweep_20260525_095043/b3_policy_comparison.json`.
+Rendered figures: `analysis/characterization/window_1_results/figures/fig_b3_*.png`.
+
+## Allowed For PD-colo Interference Causal Claims
+
+- `outputs/b2_microbench/sweep/{same,different}/p{2048,8192,16384,32768,65536}/`
+  - Decode-load + prefill-injection microbench.
+  - `b2_sweep_summary.json` aggregates per-cell TPOT and TTFT
+    (overlap vs clean), indexed by `prefill_size × variant`.
+  - Different-worker control idx ≈ 1.0 across 32× variation;
+    same-worker idx scales monotonically.

 ## Allowed For Legacy Baseline Sanity Claims

-These existing runs can support sanity-level comparisons, but not final
-paper-grade SRR claims:
+These older runs predate Window 1 instrumentation. They can still support
+"static PD-sep was worse than combined on this fixed-request workload"
+type claims, but **not** the new SRR or per-policy comparisons.

- `outputs/gpu_ab_combined`
- `outputs/gpu_ab_pdsep`
- `outputs/contention_16s_ts10`
- `outputs/contention_16s_elastic`
- `outputs/combined_1000req`
- `outputs/exp3_pd_sep_tp1_mooncake`
+- `outputs/gpu_ab_combined`, `outputs/gpu_ab_pdsep`
+- `outputs/contention_16s_ts10`, `outputs/contention_16s_elastic`
+- `outputs/combined_1000req`, `outputs/exp3_pd_sep_tp1_mooncake`

-Allowed claims:
+## NOT Allowed For Main Claims

- Static PD-sep was worse than combined in these existing fixed-request runs.
- Elastic transfer-based migration did not improve the summarized 500-request
-  high-contention run.
- GPU-util imbalance exists in these artifacts.
+The following need new runs:

-Disallowed claims:
+- **B4 SRR sweep**: arrival-rate sweep with open-loop Poisson session
+  arrivals and per-class SLO. No data yet.
+- **B5 failure attribution near SRR boundary**: depends on B4.
+- **Production interference under cache_aware proxy**: B2 used direct
+  endpoints; the production routing might shift the same-worker
+  collision profile.

- Online SRR.
- Per-session sequentiality.
- Causal attribution of prefill/decode interference.
- Causal attribution of session hot spots from GPU utilization alone.
+## Required Upgrade Path

-## Not Yet Allowed For Main Claims
+For Window 2 (B4 + B5), the existing stack already meets the needs:
+- A1 unix timestamps on every metric row ✓
+- A2 worker_state snapshots ✓
+- A3 step-level engine_state (works in isolated runs since `df32499`) ✓
+- A4 open-loop Poisson loadgen ✓
+- A5 joined_analysis + failure labels ✓

-The following need fresh instrumentation or fresh runs:
-
- Batch 2 prefill/decode interference.
- Batch 3 session hot-spot root cause.
- Batch 4 sustainable request rate.
- Batch 5 failure attribution near SRR boundary.
-
-## Required Upgrade Before Paper-Grade Claims
-
-Future main-claim runs must include:
-
- per-request actual dispatch timestamp;
- per-request finish/error timestamp;
- route decision and selected worker;
- per-worker queue delay;
- per-worker KV occupancy;
- per-worker APC/cache-hit snapshot;
- attempted/completed/error/goodput counters;
- session-causal load generation.
+No new instrumentation is required. The only software gap was that `b3_analyze.sh`
+must use per-policy engine_state when present (fixed in commit `df32499`).
--- a/analysis/characterization/current_results/reproduction_commands.sh
+++ b/analysis/characterization/current_results/reproduction_commands.sh
@@ -1,17 +1,62 @@
 #!/usr/bin/env bash
 set -euo pipefail

-# Rebuild this current-results audit package.
-python3 analysis/characterization/summarize_runs.py --output-dir analysis/characterization/current_results --runs outputs/gpu_ab_combined outputs/gpu_ab_pdsep outputs/contention_16s_ts10 outputs/contention_16s_elastic outputs/combined_1000req outputs/exp3_pd_sep_tp1_mooncake
+# Window 0 audit refresh (legacy run summaries).
+python3 analysis/characterization/summarize_runs.py \
+    --output-dir analysis/characterization/current_results \
+    --runs outputs/gpu_ab_combined outputs/gpu_ab_pdsep \
+           outputs/contention_16s_ts10 outputs/contention_16s_elastic \
+           outputs/combined_1000req outputs/exp3_pd_sep_tp1_mooncake

-# Example Batch 0/1 local trace analysis.
+# B1' Per-request KV footprint on the full trace (runs on dash0 directly,
+# CPU-only; the formatted full trace is hundreds of GiB).
 python3 analysis/characterization/analyze.py \
-  --trace traces/w600_r0.0015_st30.jsonl \
-  --kv-bytes-per-token 98304 \
-  --task-name w600_local_full_trace \
-  --overwrite
+    --trace ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
+    --kv-bytes-per-token 98304 \
+    --task-name full_trace_with_kv \
+    --output-root outputs/characterization \
+    --overwrite

-# CPU-only full compact trace summary was computed on dash0 from:
-# /home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl
-# Recompute either by running analyze.py on dash0, or by copying that compact
-# formatted JSONL locally. Do not use the 487G raw file directly.
+# w600 trace APC theoretical bound.
+python3 scripts/compute_apc_upper_bound.py \
+    --trace traces/w600_r0.0015_st30.jsonl \
+    --out outputs/apc_upper_w600.json
+
+# B3 5-policy routing sweep on dash0 (8 × TP1 instances).
+#   First three policies share one vLLM lifecycle (hot-cache, fast):
+bash scripts/b3_sweep.sh                  # writes outputs/b3_sweep_<TS>/
+
+#   Last two run isolated with cold vLLM:
+bash scripts/b3_isolated_policy.sh unified \
+    traces/w600_r0.0015_st30.jsonl \
+    outputs/b3_sweep_<TS>/unified
+
+python3 scripts/build_capped_trace.py \
+    --input traces/w600_r0.0015_st30.jsonl \
+    --output outputs/b3_sweep_<TS>/capped/trace.jsonl \
+    --max-turns 8
+
+bash scripts/b3_isolated_policy.sh lmetric \
+    outputs/b3_sweep_<TS>/capped/trace.jsonl \
+    outputs/b3_sweep_<TS>/capped
+
+# B3 analysis (joined records + indices) and report.
+bash scripts/b3_analyze.sh outputs/b3_sweep_<TS>
+python3 scripts/render_b3_report.py --sweep-dir outputs/b3_sweep_<TS>
+
+# B2 PD-colo interference microbench. Launch 2 vLLM instances on
+# ports 8100 and 8101 with --enable-prompt-tokens-details first, then:
+python3 scripts/b2_interference.py \
+    --decode-endpoint http://127.0.0.1:8100 \
+    --prefill-endpoint http://127.0.0.1:8101 \
+    --model <model-path> \
+    --out-dir outputs/b2_microbench/sweep \
+    --prefill-sizes 2048,8192,16384,32768,65536 \
+    --variants different,same
+python3 analysis/characterization/b2_sweep_analysis.py \
+    --sweep-dir outputs/b2_microbench/sweep
+
+# Window 1 figure rendering (CPU only).
+python3 analysis/characterization/render_window1_figures.py \
+    --results-dir analysis/characterization/window_1_results \
+    --out-dir analysis/characterization/window_1_results/figures
--- a/analysis/characterization/current_results/reviewer_risk_register.md
+++ b/analysis/characterization/current_results/reviewer_risk_register.md
@@ -1,8 +1,15 @@
 # Reviewer Risk Register

+Updated 2026-05-25 after Window 1.
+
 | Risk | Severity | Evidence | Mitigation |
 |---|---|---|---|
-| Session sequentiality not proven | `high` | Current metrics include trace timestamp and latency but not actual dispatch/finish wall-clock timestamps. | Add dispatch/finish timestamps and run Batch 0 before SRR claims. |
-| Legacy PD-sep data may not match final methodology | `medium` | PD matrix scaffold exists separately; some old runs used earlier flags/methodology. | Use fresh PD matrix for paper-grade claims. |
-| GPU util is not a sufficient hot-spot proof | `medium` | Existing artifacts have gpu_util.csv but lack per-worker queue and session ownership. | Add route-decision and per-worker queue logs for Batch 3. |
-| Cache reuse decomposition is incomplete without joined hash/cache-hit data | `medium` | Trace has hash_ids; metrics have cached_tokens; request IDs may not join across all artifacts. | Emit hash_ids/session_id/cached_tokens in the same per-request record. |
+| ~~Session sequentiality not proven~~ | resolved | A1 instrumentation lands per-request t_dispatch/t_first_token/t_finish unix timestamps + proxy_request_id. Smoke validation 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
+| ~~Cache reuse decomposition incomplete~~ | resolved | Real reuse decomposition computed in `window_1_results/lmetric_reuse.json` from joined records carrying session_id + hash_ids + cached_tokens. | — |
+| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn cached_tokens distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. `unified` and `capped` are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
+| Unified is missing `interference_index` due to an analyzer truncate-write bug | medium | The original `b3_analyze.sh` ran `slice_engine_state.py` on every policy unconditionally and used `open("w")`, overwriting unified's correctly written engine_state with the empty-window slice from the (hot-sweep) shared dir. | Fixed in commit `df32499`. The B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
+| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is `VLLM::EngineCor`; `pkill -f "vllm serve"` misses it. Killed manually on 2026-05-25; cleanup logic in `b3_sweep.sh` and `b3_isolated_policy.sh` now also targets `EngineCore`. | — |
+| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and workload-shape claims use the 2.11M-request full trace; the real reuse decomposition comes from the w600 lmetric run. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
+| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
+| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
+| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
--- a/analysis/characterization_todo_for_interns.md
+++ b/analysis/characterization_todo_for_interns.md
@@ -4,17 +4,17 @@ Status: execution checklist for interns
 Date: 2026-05-25
 Last progress audit: 2026-05-25

-## Progress Snapshot (2026-05-25)
+## Progress Snapshot (2026-05-25, post-Window-1)

 | Batch | State | Evidence |
 |---|---|---|
-| B0 Substrate audit | tool DONE, legacy runs partial | `analysis/characterization/analyze.py` implements per-session concurrency/arrival/inter-turn analyzer; legacy `metrics.jsonl` lacks dispatch/finish timestamps so actual sequentiality cannot be proven on old runs (correctly labeled in `current_results/`) |
-| B1 Workload characterization | trace-shape DONE, reuse pending | `current_results/full_trace_summary.json` covers 2.11M req / 1.31M sessions from `051315-051317.jsonl`; KV-footprint and reuse decomposition still require `--kv-bytes-per-token` rerun and cached_tokens+hash_ids joined records |
-| B2 PD interference | protocol DONE, run pending | `analysis/characterization/protocols.md` Batch 2 section ready; needs fresh GPU run with decode-step + prefill-chunk timestamps |
-| B3 Hot-spot imbalance | partial; needs new instrumentation | Legacy `gpu_util.csv` shows imbalance but lacks per-worker queue delay and session→worker map |
-| B4 SRR sweep | NOT DONE | No arrival-rate sweep artifacts; depends on session-causal open-loop loadgen |
-| B5 Failure attribution | NOT DONE | Depends on B2/B4 outputs |
-| B6 Audit package | scaffold DONE | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` + 6 figures committed |
+| B0 Substrate audit | **DONE for new runs**, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
+| B1 Workload characterization | **DONE** | `window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any) |
+| B2 PD interference | **DONE** | `outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
+| B3 5-policy routing sweep | **DONE** | `outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
+| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
+| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
+| B6 Audit package | **DONE for Window 1** | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures |

 Reusable assets already in repo: