diff --git a/analysis/characterization/current_results/all_figures_index.md b/analysis/characterization/current_results/all_figures_index.md
index 536e607..1349221 100644
--- a/analysis/characterization/current_results/all_figures_index.md
+++ b/analysis/characterization/current_results/all_figures_index.md
@@ -1,54 +1,29 @@
 # Figures Index
 
-Generated by:
-
-```bash
-.venv/bin/python analysis/characterization/plot_current_results.py
-```
+## Window 0 (pre-Window-1 audit, legacy runs)
 
 | Figure | Intended Claim |
 |---|---|
 | [fig_full_trace_workload.png](figures/fig_full_trace_workload.png) | Full GLM-5.1 trace is long-input, short-output, and high input/output ratio. |
 | [fig_session_skew.png](figures/fig_session_skew.png) | Session input-token mass is highly skewed; top sessions dominate work. |
-| [fig_pdsep_vs_combined.png](figures/fig_pdsep_vs_combined.png) | Existing static PD-sep A/B regresses TTFT/E2E vs combined. |
+| [fig_pdsep_vs_combined.png](figures/fig_pdsep_vs_combined.png) | Static PD-sep regresses TTFT/E2E vs combined (legacy 200-req A/B). |
 | [fig_elastic_vs_baseline.png](figures/fig_elastic_vs_baseline.png) | Existing elastic transfer-based run does not improve TTFT/TPOT over high-contention baseline. |
-| [fig_gpu_balance.png](figures/fig_gpu_balance.png) | Existing runs show GPU-util imbalance, but more data is needed for hot-spot causality. |
-| [fig_claim_status.png](figures/fig_claim_status.png) | Current audit separates supported, partial, and unsupported claims. |
+| [fig_gpu_balance.png](figures/fig_gpu_balance.png) | Existing runs show GPU-util imbalance; not sufficient for hot-spot causal claim. |
+| [fig_claim_status.png](figures/fig_claim_status.png) | Audit separates supported / partial / unsupported claims. |
 
-## Figure Previews
+## Window 1 (B1' + B3 + B2)
 
-### Full Trace Workload
+Generated by `analysis/characterization/render_window1_figures.py`.
+Source data: `analysis/characterization/window_1_results/`.
 
-Full GLM-5.1 trace is long-input, short-output, and high input/output ratio.
-
-![Full Trace Workload](figures/fig_full_trace_workload.png)
-
-### Session Skew
-
-Session input-token mass is highly skewed; top sessions dominate work.
-
-![Session Skew](figures/fig_session_skew.png)
-
-### PD-Sep vs Combined
-
-Existing static PD-sep A/B regresses TTFT/E2E vs combined.
-
-![PD-Sep vs Combined](figures/fig_pdsep_vs_combined.png)
-
-### Elastic vs Baseline
-
-Existing elastic transfer-based run does not improve TTFT/TPOT over high-contention baseline.
-
-![Elastic vs Baseline](figures/fig_elastic_vs_baseline.png)
-
-### GPU Balance
-
-Existing runs show GPU-util imbalance, but more data is needed for hot-spot causality.
-
-![GPU Balance](figures/fig_gpu_balance.png)
-
-### Claim Status
-
-Current audit separates supported, partial, and unsupported claims.
-
-![Claim Status](figures/fig_claim_status.png)
+| Figure | Intended Claim |
+|---|---|
+| [fig_kv_footprint_cdf.png](../window_1_results/figures/fig_kv_footprint_cdf.png) | KV per request for Qwen3-Coder-30B-A3B: p50/p90/p99 = 1.83/8.04/11.49 GiB; p99 takes 12% of H20 HBM. |
+| [fig_reuse_decomposition.png](../window_1_results/figures/fig_reuse_decomposition.png) | Cached_tokens decompose 93.2% intra / 5.7% cross / 1.1% shared on w600 lmetric run. |
+| [fig_b3_apc_vs_upper.png](../window_1_results/figures/fig_b3_apc_vs_upper.png) | Per-policy APC achieved vs theoretical intra-session ceiling (79.6%). |
+| [fig_b3_apc_vs_hotspot.png](../window_1_results/figures/fig_b3_apc_vs_hotspot.png) | Locality-vs-hotspot tradeoff across policies; unified dominates the frontier. |
+| [fig_b3_latency_bars.png](../window_1_results/figures/fig_b3_latency_bars.png) | TTFT / TPOT / E2E p90 bars per policy. |
+| [fig_b3_per_worker_ttft_p90.png](../window_1_results/figures/fig_b3_per_worker_ttft_p90.png) | Per-worker TTFT p90 distribution per policy; sticky's engine_3 and unified's engine_4 are the hot workers. |
+| [fig_b3_failure_breakdown.png](../window_1_results/figures/fig_b3_failure_breakdown.png) | Slow-request cause stacked bar per policy. |
+| [fig_b2_tpot_vs_prefill.png](../window_1_results/figures/fig_b2_tpot_vs_prefill.png) | TPOT during decode under same-worker prefill injection scales with prefill size; different-worker control flat. |
+| [fig_b2_ttft_vs_prefill.png](../window_1_results/figures/fig_b2_ttft_vs_prefill.png) | TTFT shows the same monotone same-worker scaling, peaking at 218× for 65k injection. |
diff --git a/analysis/characterization/current_results/characterization_claim_matrix.md b/analysis/characterization/current_results/characterization_claim_matrix.md
index cfe0d5b..c8fea9f 100644
--- a/analysis/characterization/current_results/characterization_claim_matrix.md
+++ b/analysis/characterization/current_results/characterization_claim_matrix.md
@@ -1,11 +1,19 @@
 # Characterization Claim Matrix
 
+Updated 2026-05-25 after Window 1 (B1' KV-footprint + reuse, B3 5-policy
+sweep, B2 PD-colo interference microbench).
+
 | Claim | Status | Supporting Data | Needed Next | Reviewer Risk |
 |---|---|---|---|---|
-| Batch 0 substrate audit is only partially complete for existing runs. | `partially_supported` | metrics.jsonl lacks actual dispatch/finish timestamps in current artifacts. | Add request dispatch and finish/error timestamps to future replayer/proxy metrics. | Cannot use these runs to prove online per-session sequentiality. |
-| Batch 1 workload shape can be characterized from formatted traces and metrics. | `supported_for_trace_shape` | Full compact trace CPU summary in `full_trace_summary.json`: input p50/p90/p99 = 20k/87.9k/125.5k, output p50/p90/p99 = 80/811/6.6k, top 1% sessions hold 46.5% of input-token mass. | Add cache-hit joined records for actual reuse decomposition. | Actual cache reuse decomposition needs cached_tokens joined with hash_ids. |
-| Static PD separation is worse than combined in existing 200-request GPU A/B. | `supported_by_existing_artifact` | outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep metrics.summary.json. | Refresh with PD matrix, multiple seeds, cudagraph-enabled methodology. | Legacy run has no per-stage TTFT breakdown and no step-level KV occupancy. |
-| Elastic transfer-based migration does not improve high-contention 500-request run. | `supported_by_existing_artifact` | outputs/contention_16s_ts10 vs outputs/contention_16s_elastic metrics.summary.json and gpu_util.csv. | Attribute whether failure is trigger quality, transfer overhead, or wrong load regime. | Existing metrics lack actual sequentiality proof and per-request transfer waterfall. |
-| PD-colo prefill/decode interference is not yet directly proven by step-level data in this package. | `not_yet_supported` | No decode-step and prefill-overlap timestamp artifact found in summarized runs. | Run Batch 2 controlled same-worker/different-worker injection with step timestamps. | Cannot claim interference as causal without Batch 2. |
-| Session hot-spot residual imbalance is suggested but not fully attributed. | `partially_supported` | gpu_util.csv shows per-GPU mean-util imbalance in existing runs. | Collect per-worker queue delay, session-to-worker map, and per-session token mass per worker. | GPU util imbalance alone is not enough to prove session hot-spot. |
-| SRR is not measured by existing fixed-request runs. | `not_yet_supported` | No arrival-rate sweep artifacts found. | Implement Batch 4 Poisson session-arrival SRR sweep. | Latency-at-one-load cannot support sustainable throughput claim. |
+| Per-session sequentiality is enforced when replayer + proxy carry the new join fields. | `supported` | A1 unix timestamps (t_dispatch/t_first_token/t_finish_unix) and X-Request-Id passthrough; smoke validation 2026-05-25 confirmed 30/30 join coverage. | Use this stack for all Window 2 B4/B5 SRR runs. | Legacy outputs/ without these fields still cannot be re-classified as `online_realistic`. |
+| Agentic workload is long-input / short-output / heavy-tail session mass. | `supported` | Full trace CPU summary (full_trace_summary.json): input p50/p90/p99 = 20k/87.9k/125.5k; top 1% sessions hold 46.5% of input mass. Full trace 2.11M requests, 1.31M sessions. | — | Sample trace (w600) percentiles inherit from this full trace but should not be extrapolated. |
+| KV per request for Qwen3-Coder-30B-A3B is 98304 B/token; p50/p90/p99 footprint = 1.83/8.04/11.49 GiB. | `supported` | window_1_results/kv_footprint_summary.json; computed from model config and full trace input lengths. | — | Assumes bf16; would scale for fp8/int8 quant. |
+| Workload reuse is overwhelmingly intra-session. | `supported` | Real reuse decomposition from w600 lmetric run: intra 93.2%, cross 5.7%, shared 1.1% (window_1_results/lmetric_reuse.json). Theoretical any-vs-intra ceiling gap 0.7 pp. | — | Trace-specific; ChatGPT-style workloads with long system prompts would shift toward shared-prefix. |
+| Theoretical APC ceiling on w600 trace is 79.6% (intra) / 80.3% (any-session). | `supported` | window_1_results/apc_upper_w600.json from block-level trie walk on `hash_ids`. | — | Assumes infinite per-worker cache (no eviction); achieved values must be read as fraction of this ceiling. |
+| Cache-aware LMetric leaves a measurable locality gap (22.7 pp). | `supported` | lmetric achieved 56.9% vs intra-session ceiling 79.6%; B3 sweep window_1_results/b3_policy_comparison.json. | — | sticky data shows the gap can be recovered by harder affinity. |
+| Hybrid affinity (`unified`) breaks the locality-vs-latency tradeoff. | `supported` | unified APC 79.4% (97% of intra ceiling) AND TTFT p90 7.24 s (lmetric is 15.6 s). | — | unified concentrates a single very hot worker (engine_4 at 37.7 s p90); hotspot_index 3.35. |
+| Same-worker prefill-decode interference is causal, not correlation. | `supported` | B2 microbench: different-worker control idx 0.92-1.02 across 32× prefill-size variation; same-worker TTFT idx scales 2.15× (2k) → 218× (65k). window_1_results/b2_sweep_summary.json. | — | Synthetic decode load (256-token prompts at 4 req/s) bounds the realism; production behavior is layered on top of B3. |
+| Hard session affinity (`sticky`) inflates same-worker prefill-decode interference. | `supported` | sticky interference_index 13.65 vs lmetric 6.53; sticky's slow-request breakdown 57% same-worker overlap vs lmetric 23%. | — | Confirms the B2 causal claim observed at the system level. |
+| Heavy-tail sessions are a contributor to hot-spot but not the sole cause. | `supported` | Cap-8 trace (37% requests dropped) reduces hotspot_index only 13% (2.24 → 1.94). | Run capped under unified to see whether unified's hotspot also persists. | Reviewer might counter that cap=8 is too soft; a stricter cap could be tried. |
+| SRR per policy under SLO is not yet measured. | `not_yet_supported` | B3 was driven by trace timestamps with strict session sequentiality; saturation is reached but not parameterized. | Run B4 with the A4 open-loop Poisson loadgen, per-class SLO, 5 policies × λ binary search. | Without B4 the paper cannot claim "policy X sustains higher load than Y". |
+| Failure attribution near SRR boundary is not yet measured. | `not_yet_supported` | B5 protocol exists; no runs. | After B4, rerun each policy at 0.9× / 1.0× / 1.1× of its SRR_max with the same instrumentation, label slow requests. | The current `joined_analysis.label_slow_requests` is the labeler; needs SRR boundaries to point at. |
diff --git a/analysis/characterization/current_results/main_claim_allowed_runs.md b/analysis/characterization/current_results/main_claim_allowed_runs.md
index 56fa7fe..2f1f36b 100644
--- a/analysis/characterization/current_results/main_claim_allowed_runs.md
+++ b/analysis/characterization/current_results/main_claim_allowed_runs.md
@@ -1,66 +1,76 @@
 # Main-Claim Allowed Runs
 
-Status: current audit gate
+Status: post-Window-1 audit gate
 Date: 2026-05-25
 
 ## Allowed For Workload-Shape Claims
 
-These artifacts can support trace/workload characterization claims:
-
 - `dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`
-  - Compact formatted full trace.
-  - CPU summary recorded in `full_trace_summary.json`.
-  - Supports long-input/short-output and session token-mass skew claims.
-  - Does not prove runtime cache hits or online sequentiality.
+  - Compact formatted full trace (2.11M requests / 1.31M sessions).
+  - CPU summary in `current_results/full_trace_summary.json` and
+    Window 1 KV footprint in `window_1_results/kv_footprint_summary.json`.
+  - Supports: long-input / short-output / heavy-tail token mass /
+    KV per request distribution.
+
 - `traces/w600_r0.0015_st30.jsonl`
-  - Local sampled trace.
-  - Useful for local dry runs and figure generation.
-  - Not the canonical full-trace source.
+  - 1214 requests / 274 sessions / 53.3 M tokens.
+  - APC theoretical bounds in `window_1_results/apc_upper_w600.json`.
+  - Routing-policy comparison trace used by B3.
+
+## Allowed For Routing-Policy Comparison Claims
+
+These five runs share an identical trace, model, and 8-instance topology;
+they support all per-policy claims about APC, hotspot, interference,
+latency, failure breakdown.
+
+- `outputs/b3_sweep_20260525_095043/lmetric/` — main baseline
+- `outputs/b3_sweep_20260525_095043/load_only/` — control: no cache / no affinity
+- `outputs/b3_sweep_20260525_095043/sticky/` — control: hard affinity
+- `outputs/b3_sweep_20260525_095043/unified/` — hybrid (interference index
+  unavailable; see note in claim matrix)
+- `outputs/b3_sweep_20260525_095043/capped/` — lmetric on cap-8 trace
+
+Aggregated comparison: `outputs/b3_sweep_20260525_095043/b3_policy_comparison.json`.
+Rendered figures: `analysis/characterization/window_1_results/figures/fig_b3_*.png`.
+
+## Allowed For PD-colo Interference Causal Claims
+
+- `outputs/b2_microbench/sweep/{same,different}/p{2048,8192,16384,32768,65536}/`
+  - Decode-load + prefill-injection microbench.
+  - `b2_sweep_summary.json` aggregates per-cell TPOT and TTFT
+    (overlap vs clean), indexed by `prefill_size × variant`.
+  - Different-worker control idx ≈ 1.0 across 32× variation;
+    same-worker idx scales monotonically.
 
 ## Allowed For Legacy Baseline Sanity Claims
 
-These existing runs can support sanity-level comparisons, but not final
-paper-grade SRR claims:
+These older runs predate Window 1 instrumentation. They can still support
+"static PD-sep was worse than combined on this fixed-request workload"
+type claims, but **not** the new SRR or per-policy comparisons.
 
-- `outputs/gpu_ab_combined`
-- `outputs/gpu_ab_pdsep`
-- `outputs/contention_16s_ts10`
-- `outputs/contention_16s_elastic`
-- `outputs/combined_1000req`
-- `outputs/exp3_pd_sep_tp1_mooncake`
+- `outputs/gpu_ab_combined`, `outputs/gpu_ab_pdsep`
+- `outputs/contention_16s_ts10`, `outputs/contention_16s_elastic`
+- `outputs/combined_1000req`, `outputs/exp3_pd_sep_tp1_mooncake`
 
-Allowed claims:
+## NOT Allowed For Main Claims
 
-- Static PD-sep was worse than combined in these existing fixed-request runs.
-- Elastic transfer-based migration did not improve the summarized 500-request
-  high-contention run.
-- GPU-util imbalance exists in these artifacts.
+The following need new runs:
 
-Disallowed claims:
+- **B4 SRR sweep**: arrival-rate sweep with open-loop Poisson session
+  arrivals and per-class SLO. No data yet.
+- **B5 failure attribution near SRR boundary**: depends on B4.
+- **Production interference under cache_aware proxy**: B2 used direct
+  endpoints; the production routing might shift the same-worker
+  collision profile.
 
-- Online SRR.
-- Per-session sequentiality.
-- Causal attribution of prefill/decode interference.
-- Causal attribution of session hot spots from GPU utilization alone.
+## Required Upgrade Path
 
-## Not Yet Allowed For Main Claims
+For Window 2 (B4 + B5), the existing stack already meets the needs:
+- A1 unix timestamps on every metric row ✓
+- A2 worker_state snapshots ✓
+- A3 step-level engine_state (works in isolated runs since `df32499`) ✓
+- A4 open-loop Poisson loadgen ✓
+- A5 joined_analysis + failure labels ✓
 
-The following need fresh instrumentation or fresh runs:
-
-- Batch 2 prefill/decode interference.
-- Batch 3 session hot-spot root cause.
-- Batch 4 sustainable request rate.
-- Batch 5 failure attribution near SRR boundary.
-
-## Required Upgrade Before Paper-Grade Claims
-
-Future main-claim runs must include:
-
-- per-request actual dispatch timestamp;
-- per-request finish/error timestamp;
-- route decision and selected worker;
-- per-worker queue delay;
-- per-worker KV occupancy;
-- per-worker APC/cache-hit snapshot;
-- attempted/completed/error/goodput counters;
-- session-causal load generation.
+No new instrumentation required. The only software gap is `b3_analyze.sh`
+must use per-policy engine_state when present (fixed at commit `df32499`).
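Reviewer note (not part of the patch): the `98304` bytes-per-token constant and the p50/p90/p99 KV footprints quoted in this change can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check, not the `analyze.py` implementation. The layer count, KV-head count, and head dimension are assumed values for Qwen3-Coder-30B-A3B and should be verified against the model's `config.json`; bf16 is assumed, as the claim matrix states; the token counts are the rounded full-trace input percentiles, so results match the reported figures only up to rounding.

```python
# Sanity-check sketch for the KV-footprint numbers quoted in this change.
# ASSUMPTIONS (not taken from the repo): the three attention-shape constants
# below are believed to match Qwen3-Coder-30B-A3B; verify in config.json.
NUM_LAYERS = 48       # assumed num_hidden_layers
NUM_KV_HEADS = 4      # assumed num_key_value_heads (GQA)
HEAD_DIM = 128        # assumed head_dim
BYTES_PER_ELEM = 2    # bf16, as stated in the claim matrix

GIB = 1024 ** 3


def kv_bytes_per_token(layers=NUM_LAYERS, kv_heads=NUM_KV_HEADS,
                       head_dim=HEAD_DIM, dtype_bytes=BYTES_PER_ELEM):
    """Bytes of KV cache per token: one K and one V vector per layer per KV head."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes


def footprint_gib(tokens, bytes_per_token):
    """KV footprint in GiB for a request with `tokens` cached tokens."""
    return tokens * bytes_per_token / GIB


if __name__ == "__main__":
    bpt = kv_bytes_per_token()
    print("bytes/token:", bpt)
    # Rounded full-trace input-length percentiles from the claim matrix.
    for label, tokens in [("p50", 20_000), ("p90", 87_900), ("p99", 125_500)]:
        print(label, f"{footprint_gib(tokens, bpt):.2f} GiB")
```

Under these assumptions the per-token figure comes out to 98304 B, matching the `--kv-bytes-per-token` flag used in `reproduction_commands.sh`. A quantized KV cache (fp8/int8) would change `BYTES_PER_ELEM` and scale every footprint proportionally.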
diff --git a/analysis/characterization/current_results/reproduction_commands.sh b/analysis/characterization/current_results/reproduction_commands.sh
index f427e6f..b07d33c 100644
--- a/analysis/characterization/current_results/reproduction_commands.sh
+++ b/analysis/characterization/current_results/reproduction_commands.sh
@@ -1,17 +1,62 @@
 #!/usr/bin/env bash
 set -euo pipefail
 
-# Rebuild this current-results audit package.
-python3 analysis/characterization/summarize_runs.py --output-dir analysis/characterization/current_results --runs outputs/gpu_ab_combined outputs/gpu_ab_pdsep outputs/contention_16s_ts10 outputs/contention_16s_elastic outputs/combined_1000req outputs/exp3_pd_sep_tp1_mooncake
+# Window 0 audit refresh (legacy run summaries).
+python3 analysis/characterization/summarize_runs.py \
+    --output-dir analysis/characterization/current_results \
+    --runs outputs/gpu_ab_combined outputs/gpu_ab_pdsep \
+           outputs/contention_16s_ts10 outputs/contention_16s_elastic \
+           outputs/combined_1000req outputs/exp3_pd_sep_tp1_mooncake
 
-# Example Batch 0/1 local trace analysis.
+# B1' Per-request KV footprint on the full trace (runs on dash0 directly,
+# CPU-only; the formatted full trace is hundreds of GiB).
 python3 analysis/characterization/analyze.py \
-  --trace traces/w600_r0.0015_st30.jsonl \
-  --kv-bytes-per-token 98304 \
-  --task-name w600_local_full_trace \
-  --overwrite
+    --trace ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
+    --kv-bytes-per-token 98304 \
+    --task-name full_trace_with_kv \
+    --output-root outputs/characterization \
+    --overwrite
 
-# CPU-only full compact trace summary was computed on dash0 from:
-# /home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl
-# Recompute either by running analyze.py on dash0, or by copying that compact
-# formatted JSONL locally. Do not use the 487G raw file directly.
+# w600 trace APC theoretical bound.
+python3 scripts/compute_apc_upper_bound.py \
+    --trace traces/w600_r0.0015_st30.jsonl \
+    --out outputs/apc_upper_w600.json
+
+# B3 5-policy routing sweep on dash0 (8 × TP1 instances).
+# First three policies share one vLLM lifecycle (hot-cache, fast):
+bash scripts/b3_sweep.sh  # writes outputs/b3_sweep_/
+
+# Last two run isolated with cold vLLM:
+bash scripts/b3_isolated_policy.sh unified \
+    traces/w600_r0.0015_st30.jsonl \
+    outputs/b3_sweep_/unified
+
+python3 scripts/build_capped_trace.py \
+    --input traces/w600_r0.0015_st30.jsonl \
+    --output outputs/b3_sweep_/capped/trace.jsonl \
+    --max-turns 8
+
+bash scripts/b3_isolated_policy.sh lmetric \
+    outputs/b3_sweep_/capped/trace.jsonl \
+    outputs/b3_sweep_/capped
+
+# B3 analysis (joined records + indices) and report.
+bash scripts/b3_analyze.sh outputs/b3_sweep_
+python3 scripts/render_b3_report.py --sweep-dir outputs/b3_sweep_
+
+# B2 PD-colo interference microbench. Launch 2 vLLM instances on
+# ports 8100 and 8101 with --enable-prompt-tokens-details first, then:
+python3 scripts/b2_interference.py \
+    --decode-endpoint http://127.0.0.1:8100 \
+    --prefill-endpoint http://127.0.0.1:8101 \
+    --model \
+    --out-dir outputs/b2_microbench/sweep \
+    --prefill-sizes 2048,8192,16384,32768,65536 \
+    --variants different,same
+python3 analysis/characterization/b2_sweep_analysis.py \
+    --sweep-dir outputs/b2_microbench/sweep
+
+# Window 1 figure rendering (CPU only).
+python3 analysis/characterization/render_window1_figures.py \
+    --results-dir analysis/characterization/window_1_results \
+    --out-dir analysis/characterization/window_1_results/figures
diff --git a/analysis/characterization/current_results/reviewer_risk_register.md b/analysis/characterization/current_results/reviewer_risk_register.md
index d89d693..0997c7c 100644
--- a/analysis/characterization/current_results/reviewer_risk_register.md
+++ b/analysis/characterization/current_results/reviewer_risk_register.md
@@ -1,8 +1,15 @@
 # Reviewer Risk Register
 
+Updated 2026-05-25 after Window 1.
+
 | Risk | Severity | Evidence | Mitigation |
 |---|---|---|---|
-| Session sequentiality not proven | `high` | Current metrics include trace timestamp and latency but not actual dispatch/finish wall-clock timestamps. | Add dispatch/finish timestamps and run Batch 0 before SRR claims. |
-| Legacy PD-sep data may not match final methodology | `medium` | PD matrix scaffold exists separately; some old runs used earlier flags/methodology. | Use fresh PD matrix for paper-grade claims. |
-| GPU util is not a sufficient hot-spot proof | `medium` | Existing artifacts have gpu_util.csv but lack per-worker queue and session ownership. | Add route-decision and per-worker queue logs for Batch 3. |
-| Cache reuse decomposition is incomplete without joined hash/cache-hit data | `medium` | Trace has hash_ids; metrics have cached_tokens; request IDs may not join across all artifacts. | Emit hash_ids/session_id/cached_tokens in the same per-request record. |
+| ~~Session sequentiality not proven~~ | resolved | A1 instrumentation lands per-request t_dispatch/t_first_token/t_finish unix timestamps + proxy_request_id. Smoke validation 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
+| ~~Cache reuse decomposition incomplete~~ | resolved | Real reuse decomposition computed in `window_1_results/lmetric_reuse.json` from joined records carrying session_id + hash_ids + cached_tokens. | — |
+| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn cached_tokens distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. `unified` and `capped` are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
+| Unified missing `interference_index` due to analyzer truncate-write bug | medium | The original `b3_analyze.sh` unconditionally `slice_engine_state.py`'d each policy and used `open("w")`, overwriting unified's correctly-written engine_state with the empty-window slice from the (hot-sweep) shared dir. | Fixed in commit `df32499`. B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
+| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is `VLLM::EngineCor`; `pkill -f "vllm serve"` misses it. Killed manually on 2026-05-25; cleanup logic in `b3_sweep.sh` and `b3_isolated_policy.sh` now also targets `EngineCore`. | — |
+| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
+| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
+| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
+| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
diff --git a/analysis/characterization_todo_for_interns.md b/analysis/characterization_todo_for_interns.md
index 0317092..e75772f 100644
--- a/analysis/characterization_todo_for_interns.md
+++ b/analysis/characterization_todo_for_interns.md
@@ -4,17 +4,17 @@ Status: execution checklist for interns
 Date: 2026-05-25
 Last progress audit: 2026-05-25
 
-## Progress Snapshot (2026-05-25)
+## Progress Snapshot (2026-05-25, post-Window-1)
 
 | Batch | State | Evidence |
 |---|---|---|
-| B0 Substrate audit | tool DONE, legacy runs partial | `analysis/characterization/analyze.py` implements per-session concurrency/arrival/inter-turn analyzer; legacy `metrics.jsonl` lacks dispatch/finish timestamps so actual sequentiality cannot be proven on old runs (correctly labeled in `current_results/`) |
-| B1 Workload characterization | trace-shape DONE, reuse pending | `current_results/full_trace_summary.json` covers 2.11M req / 1.31M sessions from `051315-051317.jsonl`; KV-footprint and reuse decomposition still require `--kv-bytes-per-token` rerun and cached_tokens+hash_ids joined records |
-| B2 PD interference | protocol DONE, run pending | `analysis/characterization/protocols.md` Batch 2 section ready; needs fresh GPU run with decode-step + prefill-chunk timestamps |
-| B3 Hot-spot imbalance | partial; needs new instrumentation | Legacy `gpu_util.csv` shows imbalance but lacks per-worker queue delay and session→worker map |
-| B4 SRR sweep | NOT DONE | No arrival-rate sweep artifacts; depends on session-causal open-loop loadgen |
-| B5 Failure attribution | NOT DONE | Depends on B2/B4 outputs |
-| B6 Audit package | scaffold DONE | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` + 6 figures committed |
+| B0 Substrate audit | **DONE for new runs**, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
+| B1 Workload characterization | **DONE** | `window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any) |
+| B2 PD interference | **DONE** | `outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
+| B3 5-policy routing sweep | **DONE** | `outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
+| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
+| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
+| B6 Audit package | **DONE for Window 1** | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures |
 
 Reusable assets already in repo: