Audit package refresh: Window 1 supported claims + risk register
Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported" to "supported" with pointers into window_1_results/. New entries cover per-session sequentiality, KV per request, real reuse decomposition, theoretical APC ceiling, the LMetric locality gap, Unified breaking the locality-vs-latency tradeoff, B2 causal interference proof, sticky's interference inflation, and the partial heavy-tail / hot-spot story. B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five B3 policy directories. New "Allowed For PD-colo Interference" section pins the B2 sweep. Legacy section retained for the pre-instrumentation 200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse decomposition) as resolved; adds new entries for the APC contamination empirics, the b3_analyze.sh truncate-write bug that cost unified's interference index, the GPU-0 EngineCore ghost cleanup, the saturated-replay caveat for trace-timestamp dispatch, and the synthetic B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the legacy summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -1,54 +1,29 @@
|
||||
# Figures Index
|
||||
|
||||
Generated by:
|
||||
|
||||
```bash
|
||||
.venv/bin/python analysis/characterization/plot_current_results.py
|
||||
```
|
||||
## Window 0 (pre-Window-1 audit, legacy runs)
|
||||
|
||||
| Figure | Intended Claim |
|---|---|
| [fig_full_trace_workload.png](figures/fig_full_trace_workload.png) | Full GLM-5.1 trace is long-input, short-output, and high input/output ratio. |
| [fig_session_skew.png](figures/fig_session_skew.png) | Session input-token mass is highly skewed; top sessions dominate work. |
| [fig_pdsep_vs_combined.png](figures/fig_pdsep_vs_combined.png) | Existing static PD-sep A/B regresses TTFT/E2E vs combined. |
| [fig_pdsep_vs_combined.png](figures/fig_pdsep_vs_combined.png) | Static PD-sep regresses TTFT/E2E vs combined (legacy 200-req A/B). |
| [fig_elastic_vs_baseline.png](figures/fig_elastic_vs_baseline.png) | Existing elastic transfer-based run does not improve TTFT/TPOT over high-contention baseline. |
| [fig_gpu_balance.png](figures/fig_gpu_balance.png) | Existing runs show GPU-util imbalance, but more data is needed for hot-spot causality. |
| [fig_claim_status.png](figures/fig_claim_status.png) | Current audit separates supported, partial, and unsupported claims. |
| [fig_gpu_balance.png](figures/fig_gpu_balance.png) | Existing runs show GPU-util imbalance; not sufficient for hot-spot causal claim. |
| [fig_claim_status.png](figures/fig_claim_status.png) | Audit separates supported / partial / unsupported claims. |
|
||||
|
||||
## Figure Previews
|
||||
## Window 1 (B1' + B3 + B2)
|
||||
|
||||
### Full Trace Workload
|
||||
Generated by `analysis/characterization/render_window1_figures.py`.
|
||||
Source data: `analysis/characterization/window_1_results/`.
|
||||
|
||||
Full GLM-5.1 trace is long-input, short-output, and high input/output ratio.
|
||||
|
||||

|
||||
|
||||
### Session Skew
|
||||
|
||||
Session input-token mass is highly skewed; top sessions dominate work.
|
||||
|
||||

|
||||
|
||||
### PD-Sep vs Combined
|
||||
|
||||
Existing static PD-sep A/B regresses TTFT/E2E vs combined.
|
||||
|
||||

|
||||
|
||||
### Elastic vs Baseline
|
||||
|
||||
Existing elastic transfer-based run does not improve TTFT/TPOT over high-contention baseline.
|
||||
|
||||

|
||||
|
||||
### GPU Balance
|
||||
|
||||
Existing runs show GPU-util imbalance, but more data is needed for hot-spot causality.
|
||||
|
||||

|
||||
|
||||
### Claim Status
|
||||
|
||||
Current audit separates supported, partial, and unsupported claims.
|
||||
|
||||

|
||||
| Figure | Intended Claim |
|---|---|
| [fig_kv_footprint_cdf.png](../window_1_results/figures/fig_kv_footprint_cdf.png) | KV per request for Qwen3-Coder-30B-A3B: p50/p90/p99 = 1.83/8.04/11.49 GiB; p99 takes 12% of H20 HBM. |
| [fig_reuse_decomposition.png](../window_1_results/figures/fig_reuse_decomposition.png) | Cached_tokens decompose 93.2% intra / 5.7% cross / 1.1% shared on w600 lmetric run. |
| [fig_b3_apc_vs_upper.png](../window_1_results/figures/fig_b3_apc_vs_upper.png) | Per-policy APC achieved vs theoretical intra-session ceiling (79.6%). |
| [fig_b3_apc_vs_hotspot.png](../window_1_results/figures/fig_b3_apc_vs_hotspot.png) | Locality-vs-hotspot tradeoff across policies; unified dominates the frontier. |
| [fig_b3_latency_bars.png](../window_1_results/figures/fig_b3_latency_bars.png) | TTFT / TPOT / E2E p90 bars per policy. |
| [fig_b3_per_worker_ttft_p90.png](../window_1_results/figures/fig_b3_per_worker_ttft_p90.png) | Per-worker TTFT p90 distribution per policy; sticky's engine_3 and unified's engine_4 are the hot workers. |
| [fig_b3_failure_breakdown.png](../window_1_results/figures/fig_b3_failure_breakdown.png) | Slow-request cause stacked bar per policy. |
| [fig_b2_tpot_vs_prefill.png](../window_1_results/figures/fig_b2_tpot_vs_prefill.png) | TPOT during decode under same-worker prefill injection scales with prefill size; different-worker control flat. |
| [fig_b2_ttft_vs_prefill.png](../window_1_results/figures/fig_b2_ttft_vs_prefill.png) | TTFT shows the same monotone same-worker scaling, peaking at 218× for 65k injection. |
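
The KV-footprint figures quoted in the first row can be sanity-checked with simple arithmetic. The sketch below is a minimal illustration, not the package's `analyze.py`: the layer count, KV-head count, and head dimension are assumptions chosen because they reproduce the 98304 B/token value used in this package under bf16, and the input lengths are the rounded full-trace percentiles (20k / 87.9k / 125.5k tokens), so the p90 result can differ from the reported 8.04 GiB in the last digit.

```python
GIB = 1024 ** 3


def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """KV-cache bytes per token: one K and one V vector per layer per KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_footprint_gib(input_tokens: int, bytes_per_token: int) -> float:
    """KV-cache footprint of a single request, in GiB."""
    return input_tokens * bytes_per_token / GIB


# Assumed model shape (48 layers, 4 KV heads, head_dim 128) with bf16 (2 bytes).
BYTES_PER_TOKEN = kv_bytes_per_token(48, 4, 128, dtype_bytes=2)

if __name__ == "__main__":
    print("bytes/token:", BYTES_PER_TOKEN)
    # Rounded full-trace input-length percentiles from the claim matrix.
    for label, tokens in [("p50", 20_000), ("p90", 87_900), ("p99", 125_500)]:
        print(label, round(kv_footprint_gib(tokens, BYTES_PER_TOKEN), 2), "GiB")
```

Because the footprint is linear in `dtype_bytes`, an fp8 or int8 KV cache would halve every value, which matches the reviewer-risk note on quantization.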
|
||||
|
||||
@@ -1,11 +1,19 @@
|
||||
# Characterization Claim Matrix
|
||||
|
||||
Updated 2026-05-25 after Window 1 (B1' KV-footprint + reuse, B3 5-policy
|
||||
sweep, B2 PD-colo interference microbench).
|
||||
|
||||
| Claim | Status | Supporting Data | Needed Next | Reviewer Risk |
|---|---|---|---|---|
| Batch 0 substrate audit is only partially complete for existing runs. | `partially_supported` | metrics.jsonl lacks actual dispatch/finish timestamps in current artifacts. | Add request dispatch and finish/error timestamps to future replayer/proxy metrics. | Cannot use these runs to prove online per-session sequentiality. |
| Batch 1 workload shape can be characterized from formatted traces and metrics. | `supported_for_trace_shape` | Full compact trace CPU summary in `full_trace_summary.json`: input p50/p90/p99 = 20k/87.9k/125.5k, output p50/p90/p99 = 80/811/6.6k, top 1% sessions hold 46.5% of input-token mass. | Add cache-hit joined records for actual reuse decomposition. | Actual cache reuse decomposition needs cached_tokens joined with hash_ids. |
| Static PD separation is worse than combined in existing 200-request GPU A/B. | `supported_by_existing_artifact` | outputs/gpu_ab_combined vs outputs/gpu_ab_pdsep metrics.summary.json. | Refresh with PD matrix, multiple seeds, cudagraph-enabled methodology. | Legacy run has no per-stage TTFT breakdown and no step-level KV occupancy. |
| Elastic transfer-based migration does not improve high-contention 500-request run. | `supported_by_existing_artifact` | outputs/contention_16s_ts10 vs outputs/contention_16s_elastic metrics.summary.json and gpu_util.csv. | Attribute whether failure is trigger quality, transfer overhead, or wrong load regime. | Existing metrics lack actual sequentiality proof and per-request transfer waterfall. |
| PD-colo prefill/decode interference is not yet directly proven by step-level data in this package. | `not_yet_supported` | No decode-step and prefill-overlap timestamp artifact found in summarized runs. | Run Batch 2 controlled same-worker/different-worker injection with step timestamps. | Cannot claim interference as causal without Batch 2. |
| Session hot-spot residual imbalance is suggested but not fully attributed. | `partially_supported` | gpu_util.csv shows per-GPU mean-util imbalance in existing runs. | Collect per-worker queue delay, session-to-worker map, and per-session token mass per worker. | GPU util imbalance alone is not enough to prove session hot-spot. |
| SRR is not measured by existing fixed-request runs. | `not_yet_supported` | No arrival-rate sweep artifacts found. | Implement Batch 4 Poisson session-arrival SRR sweep. | Latency-at-one-load cannot support sustainable throughput claim. |
| Per-session sequentiality is enforced when replayer + proxy carry the new join fields. | `supported` | A1 unix timestamps (t_dispatch/t_first_token/t_finish_unix) and X-Request-Id passthrough; smoke validation 2026-05-25 confirmed 30/30 join coverage. | Use this stack for all Window 2 B4/B5 SRR runs. | Legacy outputs/ without these fields still cannot be re-classified as `online_realistic`. |
| Agentic workload is long-input / short-output / heavy-tail session mass. | `supported` | Full trace CPU summary (full_trace_summary.json): input p50/p90/p99 = 20k/87.9k/125.5k; top 1% sessions hold 46.5% of input mass. Full trace 2.11M requests, 1.31M sessions. | — | Sample trace (w600) percentiles inherit from this full trace but should not be extrapolated. |
| KV per request for Qwen3-Coder-30B-A3B is 98304 B/token; p50/p90/p99 footprint = 1.83/8.04/11.49 GiB. | `supported` | window_1_results/kv_footprint_summary.json; computed from model config and full trace input lengths. | — | Assumes bf16; would scale for fp8/int8 quant. |
| Workload reuse is overwhelmingly intra-session. | `supported` | Real reuse decomposition from w600 lmetric run: intra 93.2%, cross 5.7%, shared 1.1% (window_1_results/lmetric_reuse.json). Theoretical any-vs-intra ceiling gap 0.7 pp. | — | Trace-specific; ChatGPT-style workloads with long system prompts would shift toward shared-prefix. |
| Theoretical APC ceiling on w600 trace is 79.6% (intra) / 80.3% (any-session). | `supported` | window_1_results/apc_upper_w600.json from block-level trie walk on `hash_ids`. | — | Assumes infinite per-worker cache (no eviction); achieved values must be read as fraction of this ceiling. |
| Cache-aware LMetric leaves a measurable locality gap (22.7 pp). | `supported` | lmetric achieved 56.9% vs intra-session ceiling 79.6%; B3 sweep window_1_results/b3_policy_comparison.json. | — | sticky data shows the gap can be recovered by harder affinity. |
| Hybrid affinity (`unified`) breaks the locality-vs-latency tradeoff. | `supported` | unified APC 79.4% (97% of intra ceiling) AND TTFT p90 7.24 s (lmetric is 15.6 s). | — | unified concentrates a single very hot worker (engine_4 at 37.7 s p90); hotspot_index 3.35. |
| Same-worker prefill-decode interference is causal, not correlation. | `supported` | B2 microbench: different-worker control idx 0.92-1.02 across 32× prefill-size variation; same-worker TTFT idx scales 2.15× (2k) → 218× (65k). window_1_results/b2_sweep_summary.json. | — | Synthetic decode load (256-token prompts at 4 req/s) bounds the realism; production behavior is layered on top of B3. |
| Hard session affinity (`sticky`) inflates same-worker prefill-decode interference. | `supported` | sticky interference_index 13.65 vs lmetric 6.53; sticky's slow-request breakdown 57% same-worker overlap vs lmetric 23%. | — | Confirms the B2 causal claim observed at the system level. |
| Heavy-tail sessions are a contributor to hot-spot but not the sole cause. | `supported` | Cap-8 trace (37% requests dropped) reduces hotspot_index only 13% (2.24 → 1.94). | Run capped under unified to see whether unified's hotspot also persists. | Reviewer might counter that cap=8 is too soft; a stricter cap could be tried. |
| SRR per policy under SLO is not yet measured. | `not_yet_supported` | B3 was driven by trace timestamps with strict session sequentiality; saturation is reached but not parameterized. | Run B4 with the A4 open-loop Poisson loadgen, per-class SLO, 5 policies × λ binary search. | Without B4 the paper cannot claim "policy X sustains higher load than Y". |
| Failure attribution near SRR boundary is not yet measured. | `not_yet_supported` | B5 protocol exists; no runs. | After B4, rerun each policy at 0.9× / 1.0× / 1.1× of its SRR_max with the same instrumentation, label slow requests. | The current `joined_analysis.label_slow_requests` is the labeler; needs SRR boundaries to point at. |
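
The APC-ceiling row describes a block-level trie walk over `hash_ids` under an infinite, never-evicting cache. A minimal sketch of that idea follows; it is an illustration under stated assumptions, not the contents of `scripts/compute_apc_upper_bound.py`. It assumes each record carries `session_id` and an ordered `hash_ids` list, that records are processed in arrival order, and that the ceiling is counted in blocks rather than tokens. The helper names are hypothetical.

```python
from typing import Iterable


def _match_and_insert(root: dict, blocks: list) -> int:
    """Walk the trie along `blocks`; return how many leading blocks were already
    present, inserting the missing suffix so later requests can hit it."""
    node, matched = root, 0
    for block in blocks:
        child = node.get(block)
        if child is None:
            # First miss: every later block lands in fresh nodes, so no more hits.
            child = node[block] = {}
        else:
            matched += 1
        node = child
    return matched


def apc_upper_bound(requests: Iterable[dict]) -> dict:
    """Hit-rate ceilings with an infinite cache: `intra` only reuses blocks from
    the same session, `any` reuses blocks from every earlier request."""
    any_trie: dict = {}
    session_tries: dict = {}
    total = hits_any = hits_intra = 0
    for req in requests:
        blocks = list(req["hash_ids"])
        total += len(blocks)
        hits_any += _match_and_insert(any_trie, blocks)
        session_trie = session_tries.setdefault(req["session_id"], {})
        hits_intra += _match_and_insert(session_trie, blocks)
    if total == 0:
        return {"intra": 0.0, "any": 0.0}
    return {"intra": hits_intra / total, "any": hits_any / total}


if __name__ == "__main__":
    # Toy trace, purely illustrative.
    toy = [
        {"session_id": "s1", "hash_ids": [1, 2, 3]},
        {"session_id": "s1", "hash_ids": [1, 2, 3, 4]},
        {"session_id": "s2", "hash_ids": [1, 2, 9]},
    ]
    print(apc_upper_bound(toy))
```

The gap between the two numbers is the share of reuse that only cross-session sharing can deliver, which is the quantity the matrix reports as the any-vs-intra ceiling gap.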
|
||||
|
||||
@@ -1,66 +1,76 @@
|
||||
# Main-Claim Allowed Runs
|
||||
|
||||
Status: current audit gate
|
||||
Status: post-Window-1 audit gate
|
||||
Date: 2026-05-25
|
||||
|
||||
## Allowed For Workload-Shape Claims
|
||||
|
||||
These artifacts can support trace/workload characterization claims:
|
||||
|
||||
- `dash0:/home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl`
|
||||
- Compact formatted full trace.
|
||||
- CPU summary recorded in `full_trace_summary.json`.
|
||||
- Supports long-input/short-output and session token-mass skew claims.
|
||||
- Does not prove runtime cache hits or online sequentiality.
|
||||
- Compact formatted full trace (2.11M requests / 1.31M sessions).
|
||||
- CPU summary in `current_results/full_trace_summary.json` and
|
||||
Window 1 KV footprint in `window_1_results/kv_footprint_summary.json`.
|
||||
- Supports: long-input / short-output / heavy-tail token mass /
|
||||
KV per request distribution.
|
||||
|
||||
- `traces/w600_r0.0015_st30.jsonl`
|
||||
- Local sampled trace.
|
||||
- Useful for local dry runs and figure generation.
|
||||
- Not the canonical full-trace source.
|
||||
- 1214 requests / 274 sessions / 53.3 M tokens.
|
||||
- APC theoretical bounds in `window_1_results/apc_upper_w600.json`.
|
||||
- Routing-policy comparison trace used by B3.
|
||||
|
||||
## Allowed For Routing-Policy Comparison Claims
|
||||
|
||||
These five runs share the same model and 8-instance topology, and all but
`capped` (which replays the cap-8 variant) use the identical w600 trace;
they support the per-policy claims about APC, hotspot, interference,
latency, and failure breakdown.
|
||||
|
||||
- `outputs/b3_sweep_20260525_095043/lmetric/` — main baseline
|
||||
- `outputs/b3_sweep_20260525_095043/load_only/` — control: no cache / no affinity
|
||||
- `outputs/b3_sweep_20260525_095043/sticky/` — control: hard affinity
|
||||
- `outputs/b3_sweep_20260525_095043/unified/` — hybrid (interference index
unavailable; see the analyzer truncate-write entry in the risk register)
|
||||
- `outputs/b3_sweep_20260525_095043/capped/` — lmetric on cap-8 trace
|
||||
|
||||
Aggregated comparison: `outputs/b3_sweep_20260525_095043/b3_policy_comparison.json`.
|
||||
Rendered figures: `analysis/characterization/window_1_results/figures/fig_b3_*.png`.
|
||||
|
||||
## Allowed For PD-colo Interference Causal Claims
|
||||
|
||||
- `outputs/b2_microbench/sweep/{same,different}/p{2048,8192,16384,32768,65536}/`
|
||||
- Decode-load + prefill-injection microbench.
|
||||
- `b2_sweep_summary.json` aggregates per-cell TPOT and TTFT
|
||||
(overlap vs clean), indexed by `prefill_size × variant`.
|
||||
- Different-worker control idx ≈ 1.0 across 32× variation;
|
||||
same-worker idx scales monotonically.
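
The per-cell "idx" values above read as a ratio of latency under prefill overlap to latency in the clean window. The sketch below shows one way such an index could be computed per `prefill_size × variant` cell. It is an assumption-laden illustration: the use of the median as the summary statistic, the `overlap` / `clean` key names, and both function names are hypothetical, and `b2_sweep_analysis.py` may aggregate differently.

```python
from statistics import median
from typing import Mapping, Sequence

Cell = tuple[int, str]  # (prefill_size, variant), e.g. (65536, "same")


def interference_index(overlap: Sequence[float], clean: Sequence[float]) -> float:
    """Median latency while a prefill overlaps, divided by the clean median.
    A value near 1.0 means the injected prefill did not slow decoding down."""
    return median(overlap) / median(clean)


def index_by_cell(
    samples: Mapping[Cell, Mapping[str, Sequence[float]]]
) -> dict[Cell, float]:
    """Compute the index for every (prefill_size, variant) cell."""
    return {
        cell: interference_index(s["overlap"], s["clean"])
        for cell, s in samples.items()
    }


if __name__ == "__main__":
    # Toy latencies in seconds, purely illustrative (not measured values).
    toy = {
        (2048, "same"): {"overlap": [0.2, 0.4, 0.6], "clean": [0.1, 0.2, 0.3]},
        (2048, "different"): {"overlap": [0.1, 0.2, 0.3], "clean": [0.1, 0.2, 0.3]},
    }
    for cell, idx in index_by_cell(toy).items():
        print(cell, round(idx, 2))
```

Comparing the `same` and `different` cells at the same prefill size is what separates worker-local interference from load that would slow any endpoint.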
|
||||
|
||||
## Allowed For Legacy Baseline Sanity Claims
|
||||
|
||||
These existing runs can support sanity-level comparisons, but not final
|
||||
paper-grade SRR claims:
|
||||
These older runs predate Window 1 instrumentation. They can still support
|
||||
"static PD-sep was worse than combined on this fixed-request workload"
|
||||
type claims, but **not** the new SRR or per-policy comparisons.
|
||||
|
||||
- `outputs/gpu_ab_combined`
|
||||
- `outputs/gpu_ab_pdsep`
|
||||
- `outputs/contention_16s_ts10`
|
||||
- `outputs/contention_16s_elastic`
|
||||
- `outputs/combined_1000req`
|
||||
- `outputs/exp3_pd_sep_tp1_mooncake`
|
||||
- `outputs/gpu_ab_combined`, `outputs/gpu_ab_pdsep`
|
||||
- `outputs/contention_16s_ts10`, `outputs/contention_16s_elastic`
|
||||
- `outputs/combined_1000req`, `outputs/exp3_pd_sep_tp1_mooncake`
|
||||
|
||||
Allowed claims:
|
||||
## NOT Allowed For Main Claims
|
||||
|
||||
- Static PD-sep was worse than combined in these existing fixed-request runs.
|
||||
- Elastic transfer-based migration did not improve the summarized 500-request
|
||||
high-contention run.
|
||||
- GPU-util imbalance exists in these artifacts.
|
||||
The following need new runs:
|
||||
|
||||
Disallowed claims:
|
||||
- **B4 SRR sweep**: arrival-rate sweep with open-loop Poisson session
|
||||
arrivals and per-class SLO. No data yet.
|
||||
- **B5 failure attribution near SRR boundary**: depends on B4.
|
||||
- **Production interference under cache_aware proxy**: B2 used direct
|
||||
endpoints; the production routing might shift the same-worker
|
||||
collision profile.
|
||||
|
||||
- Online SRR.
|
||||
- Per-session sequentiality.
|
||||
- Causal attribution of prefill/decode interference.
|
||||
- Causal attribution of session hot spots from GPU utilization alone.
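
For the B4 item above, "open-loop Poisson session arrivals" means session start times are drawn independently of how fast the system responds. A minimal sketch of such a schedule, assuming exponentially distributed inter-arrival gaps at a fixed rate, is shown below; the function name and parameters are hypothetical and this is not the A4 loadgen itself.

```python
import random


def poisson_arrival_times(rate_per_s: float, duration_s: float,
                          seed: int = 0) -> list[float]:
    """Session start times (seconds) for a homogeneous Poisson process.

    Gaps between consecutive arrivals are exponential with mean 1 / rate_per_s,
    so the schedule is fixed up front and never waits for completions.
    """
    if rate_per_s <= 0:
        raise ValueError("rate_per_s must be positive")
    rng = random.Random(seed)  # seeded so a sweep point can be replayed exactly
    now, times = 0.0, []
    while True:
        now += rng.expovariate(rate_per_s)
        if now >= duration_s:
            return times
        times.append(now)


if __name__ == "__main__":
    starts = poisson_arrival_times(rate_per_s=2.0, duration_s=60.0, seed=1)
    print(len(starts), "sessions scheduled; first start at", round(starts[0], 3))
```

Only session starts are open-loop in this sketch; turns inside a session would still be dispatched sequentially to keep the load session-causal.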
|
||||
## Required Upgrade Path
|
||||
|
||||
## Not Yet Allowed For Main Claims
|
||||
For Window 2 (B4 + B5), the existing stack already meets the needs:
|
||||
- A1 unix timestamps on every metric row ✓
|
||||
- A2 worker_state snapshots ✓
|
||||
- A3 step-level engine_state (works in isolated runs since `df32499`) ✓
|
||||
- A4 open-loop Poisson loadgen ✓
|
||||
- A5 joined_analysis + failure labels ✓
|
||||
|
||||
The following need fresh instrumentation or fresh runs:
|
||||
|
||||
- Batch 2 prefill/decode interference.
|
||||
- Batch 3 session hot-spot root cause.
|
||||
- Batch 4 sustainable request rate.
|
||||
- Batch 5 failure attribution near SRR boundary.
|
||||
|
||||
## Required Upgrade Before Paper-Grade Claims
|
||||
|
||||
Future main-claim runs must include:
|
||||
|
||||
- per-request actual dispatch timestamp;
|
||||
- per-request finish/error timestamp;
|
||||
- route decision and selected worker;
|
||||
- per-worker queue delay;
|
||||
- per-worker KV occupancy;
|
||||
- per-worker APC/cache-hit snapshot;
|
||||
- attempted/completed/error/goodput counters;
|
||||
- session-causal load generation.
|
||||
No new instrumentation required. The only software gap was that `b3_analyze.sh`
must use per-policy engine_state when present (fixed at commit `df32499`).
|
||||
|
||||
@@ -1,17 +1,62 @@
|
||||
#!/usr/bin/env bash
|
||||
set -euo pipefail
|
||||
|
||||
# Rebuild this current-results audit package.
|
||||
python3 analysis/characterization/summarize_runs.py --output-dir analysis/characterization/current_results --runs outputs/gpu_ab_combined outputs/gpu_ab_pdsep outputs/contention_16s_ts10 outputs/contention_16s_elastic outputs/combined_1000req outputs/exp3_pd_sep_tp1_mooncake
|
||||
# Window 0 audit refresh (legacy run summaries).
|
||||
python3 analysis/characterization/summarize_runs.py \
|
||||
--output-dir analysis/characterization/current_results \
|
||||
--runs outputs/gpu_ab_combined outputs/gpu_ab_pdsep \
|
||||
outputs/contention_16s_ts10 outputs/contention_16s_elastic \
|
||||
outputs/combined_1000req outputs/exp3_pd_sep_tp1_mooncake
|
||||
|
||||
# Example Batch 0/1 local trace analysis.
|
||||
# B1' Per-request KV footprint on the full trace (runs on dash0 directly,
|
||||
# CPU-only; the formatted full trace is hundreds of GiB).
|
||||
python3 analysis/characterization/analyze.py \
|
||||
--trace traces/w600_r0.0015_st30.jsonl \
|
||||
--kv-bytes-per-token 98304 \
|
||||
--task-name w600_local_full_trace \
|
||||
--overwrite
|
||||
--trace ~/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl \
|
||||
--kv-bytes-per-token 98304 \
|
||||
--task-name full_trace_with_kv \
|
||||
--output-root outputs/characterization \
|
||||
--overwrite
|
||||
|
||||
# CPU-only full compact trace summary was computed on dash0 from:
|
||||
# /home/admin/cpfs/wjh/ali-trace/trace-glm5.1-formatted/051315-051317.jsonl
|
||||
# Recompute either by running analyze.py on dash0, or by copying that compact
|
||||
# formatted JSONL locally. Do not use the 487G raw file directly.
|
||||
# w600 trace APC theoretical bound.
|
||||
python3 scripts/compute_apc_upper_bound.py \
|
||||
--trace traces/w600_r0.0015_st30.jsonl \
|
||||
--out outputs/apc_upper_w600.json
|
||||
|
||||
# B3 5-policy routing sweep on dash0 (8 × TP1 instances).
|
||||
# First three policies share one vLLM lifecycle (hot-cache, fast):
|
||||
bash scripts/b3_sweep.sh # writes outputs/b3_sweep_<TS>/
|
||||
|
||||
# Last two run isolated with cold vLLM:
|
||||
bash scripts/b3_isolated_policy.sh unified \
|
||||
traces/w600_r0.0015_st30.jsonl \
|
||||
outputs/b3_sweep_<TS>/unified
|
||||
|
||||
python3 scripts/build_capped_trace.py \
|
||||
--input traces/w600_r0.0015_st30.jsonl \
|
||||
--output outputs/b3_sweep_<TS>/capped/trace.jsonl \
|
||||
--max-turns 8
|
||||
|
||||
bash scripts/b3_isolated_policy.sh lmetric \
|
||||
outputs/b3_sweep_<TS>/capped/trace.jsonl \
|
||||
outputs/b3_sweep_<TS>/capped
|
||||
|
||||
# B3 analysis (joined records + indices) and report.
|
||||
bash scripts/b3_analyze.sh outputs/b3_sweep_<TS>
|
||||
python3 scripts/render_b3_report.py --sweep-dir outputs/b3_sweep_<TS>
|
||||
|
||||
# B2 PD-colo interference microbench. Launch 2 vLLM instances on
|
||||
# ports 8100 and 8101 with --enable-prompt-tokens-details first, then:
|
||||
python3 scripts/b2_interference.py \
|
||||
--decode-endpoint http://127.0.0.1:8100 \
|
||||
--prefill-endpoint http://127.0.0.1:8101 \
|
||||
--model <model-path> \
|
||||
--out-dir outputs/b2_microbench/sweep \
|
||||
--prefill-sizes 2048,8192,16384,32768,65536 \
|
||||
--variants different,same
|
||||
python3 analysis/characterization/b2_sweep_analysis.py \
|
||||
--sweep-dir outputs/b2_microbench/sweep
|
||||
|
||||
# Window 1 figure rendering (CPU only).
|
||||
python3 analysis/characterization/render_window1_figures.py \
|
||||
--results-dir analysis/characterization/window_1_results \
|
||||
--out-dir analysis/characterization/window_1_results/figures
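
The `--max-turns 8` step in the capped-trace command can be pictured as a per-session turn cap. The sketch below is a hypothetical reading, not the contents of `scripts/build_capped_trace.py`: it assumes one JSON record per line with a `session_id` field, assumes records appear in trace order, and keeps the first `max_turns` requests of each session while dropping the rest.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def cap_turns(records: Iterable[dict], max_turns: int) -> Iterator[dict]:
    """Yield at most `max_turns` records per session, preserving trace order."""
    turns_seen: Counter = Counter()
    for rec in records:
        sid = rec["session_id"]
        if turns_seen[sid] < max_turns:
            turns_seen[sid] += 1
            yield rec


def cap_trace_file(src_path: str, dst_path: str, max_turns: int) -> tuple[int, int]:
    """Write the capped trace as JSONL; return (kept, dropped) request counts."""
    with open(src_path) as src:
        records = [json.loads(line) for line in src if line.strip()]
    kept = list(cap_turns(records, max_turns))
    with open(dst_path, "w") as dst:
        for rec in kept:
            dst.write(json.dumps(rec) + "\n")
    return len(kept), len(records) - len(kept)
```

The `(kept, dropped)` return value makes the dropped share explicit, which is the number a stricter cap would need to report alongside its hotspot index.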
|
||||
|
||||
@@ -1,8 +1,15 @@
|
||||
# Reviewer Risk Register
|
||||
|
||||
Updated 2026-05-25 after Window 1.
|
||||
|
||||
| Risk | Severity | Evidence | Mitigation |
|---|---|---|---|
| Session sequentiality not proven | `high` | Current metrics include trace timestamp and latency but not actual dispatch/finish wall-clock timestamps. | Add dispatch/finish timestamps and run Batch 0 before SRR claims. |
| Legacy PD-sep data may not match final methodology | `medium` | PD matrix scaffold exists separately; some old runs used earlier flags/methodology. | Use fresh PD matrix for paper-grade claims. |
| GPU util is not a sufficient hot-spot proof | `medium` | Existing artifacts have gpu_util.csv but lack per-worker queue and session ownership. | Add route-decision and per-worker queue logs for Batch 3. |
| Cache reuse decomposition is incomplete without joined hash/cache-hit data | `medium` | Trace has hash_ids; metrics have cached_tokens; request IDs may not join across all artifacts. | Emit hash_ids/session_id/cached_tokens in the same per-request record. |
| ~~Session sequentiality not proven~~ | resolved | A1 instrumentation lands per-request t_dispatch/t_first_token/t_finish unix timestamps + proxy_request_id. Smoke validation 2026-05-25 confirms 30/30 join coverage. | All Window 1 runs already use this; Window 2 inherits. |
| ~~Cache reuse decomposition incomplete~~ | resolved | Real reuse decomposition computed in `window_1_results/lmetric_reuse.json` from joined records carrying session_id + hash_ids + cached_tokens. | — |
| APC across hot-sweep policies may be contaminated by prior policy runs | low | First-turn cached_tokens distribution shows < 1% empirical contamination; load_only and sticky vLLMs were not restarted between policies. `unified` and `capped` are isolated cold-start. | Window 2 will isolate each policy launch by default; document in paper that lmetric/load_only/sticky reflect "warm-cache" condition. |
| Unified missing `interference_index` due to analyzer truncate-write bug | medium | The original `b3_analyze.sh` unconditionally ran `slice_engine_state.py` on each policy and wrote the result with `open("w")`, overwriting unified's correctly written engine_state with the empty-window slice from the shared (hot-sweep) dir. | Fixed in commit `df32499`. B2 microbench provides the cleaner same-vs-different interference proof, so we do not need to rerun unified. |
| GPU 0 ghost memory after vLLM crash | low | EngineCore subprocess name is `VLLM::EngineCor`; `pkill -f "vllm serve"` misses it. Killed manually on 2026-05-25; cleanup logic in `b3_sweep.sh` and `b3_isolated_policy.sh` now also targets `EngineCore`. | — |
| w600 trace is a 1k-request sample, not the full GLM-5.1 trace | low | All B3 + B2 percentiles are on this sample. Full-trace KV-footprint and reuse claims use the 2.11M-request full trace. | Window 2 SRR sweep uses w600; full-trace SRR would need a larger sample and more GPU budget. |
| Trace-timestamp dispatch with strict session sequentiality stretches replay wall time | medium | lmetric's 600s trace dispatched over 49 min; system over-saturates and the dispatch window expands. | Window 2 uses A4 open-loop Poisson loadgen with explicit arrival rate, decoupling load level from trace structure. |
| Capped cap=8 may be too soft | low | Reviewer might prefer cap=2 or cap=4 to test "no multi-turn" extreme. Cap=8 was chosen to sit between turns/session p90 (1) and p99 (18). | Re-run with a stricter cap if reviewer pushes back; underlying capped script is parameterized. |
| B2 microbench uses synthetic short-prompt decode load (256 tokens) | low | This bounds the realism of the "decode" workload. Production decode tokens come from prior turns of long context. | The signal magnitude is robust enough that prompt length shouldn't qualitatively change conclusions; B3 sticky's failure breakdown is the production-trace confirmation. |
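
The truncate-write risk above comes down to opening an existing, correct file in `"w"` mode and replacing it with an empty slice. A defensive pattern is sketched below. It is a hypothetical illustration of the "use per-policy engine_state when present" rule, not the change made in commit `df32499`; the function name, file names, and the write-to-temp-then-rename step are assumptions.

```python
import os
import tempfile
from pathlib import Path


def write_slice_if_absent(target: Path, sliced_lines: list[str]) -> bool:
    """Write a sliced engine_state file only when it cannot destroy better data.

    Returns True if `target` was written, False if it was left untouched.
    """
    if target.exists() and target.stat().st_size > 0:
        return False  # a per-policy engine_state already exists: keep it
    if not sliced_lines:
        return False  # an empty window is never worth writing
    # Write to a sibling temp file, then rename, so a crash cannot truncate target.
    tmp = target.parent / (target.name + ".tmp")
    tmp.write_text("".join(sliced_lines))
    os.replace(tmp, target)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        isolated = Path(workdir) / "engine_state.jsonl"
        isolated.write_text('{"step": 1}\n')
        # The existing per-policy file survives an attempted overwrite.
        print(write_slice_if_absent(isolated, []), isolated.read_text().strip())
```

Refusing the empty slice and refusing to clobber a non-empty file are two separate guards; either one alone would have preserved the isolated run's data in the scenario the register describes.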
|
||||
|
||||
@@ -4,17 +4,17 @@ Status: execution checklist for interns
|
||||
Date: 2026-05-25
|
||||
Last progress audit: 2026-05-25
|
||||
|
||||
## Progress Snapshot (2026-05-25)
|
||||
## Progress Snapshot (2026-05-25, post-Window-1)
|
||||
|
||||
| Batch | State | Evidence |
|---|---|---|
| B0 Substrate audit | tool DONE, legacy runs partial | `analysis/characterization/analyze.py` implements per-session concurrency/arrival/inter-turn analyzer; legacy `metrics.jsonl` lacks dispatch/finish timestamps so actual sequentiality cannot be proven on old runs (correctly labeled in `current_results/`) |
| B1 Workload characterization | trace-shape DONE, reuse pending | `current_results/full_trace_summary.json` covers 2.11M req / 1.31M sessions from `051315-051317.jsonl`; KV-footprint and reuse decomposition still require `--kv-bytes-per-token` rerun and cached_tokens+hash_ids joined records |
| B2 PD interference | protocol DONE, run pending | `analysis/characterization/protocols.md` Batch 2 section ready; needs fresh GPU run with decode-step + prefill-chunk timestamps |
| B3 Hot-spot imbalance | partial; needs new instrumentation | Legacy `gpu_util.csv` shows imbalance but lacks per-worker queue delay and session→worker map |
| B4 SRR sweep | NOT DONE | No arrival-rate sweep artifacts; depends on session-causal open-loop loadgen |
| B5 Failure attribution | NOT DONE | Depends on B2/B4 outputs |
| B6 Audit package | scaffold DONE | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` + 6 figures committed |
| B0 Substrate audit | **DONE for new runs**, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
| B1 Workload characterization | **DONE** | `window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any) |
| B2 PD interference | **DONE** | `outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
| B3 5-policy routing sweep | **DONE** | `outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
| B6 Audit package | **DONE for Window 1** | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures |
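
The B4 row's "λ binary search per policy" can be sketched as a bisection over arrival rate. This is a minimal illustration under stated assumptions: `meets_slo` is a hypothetical callback standing in for "run the policy at rate λ and check the per-class SLO", and SLO attainment is assumed to be monotone in λ (if a rate fails, every higher rate fails too). Real runs are noisy, so repeated trials per rate would likely be needed.

```python
from typing import Callable


def find_srr_max(meets_slo: Callable[[float], bool], lo: float, hi: float,
                 tol: float = 0.05) -> float:
    """Largest arrival rate in [lo, hi] that still meets the SLO, within `tol`.

    Invariant during the search: `lo` is known to pass and `hi` is known to fail.
    """
    if not meets_slo(lo):
        raise ValueError("SLO is already violated at the lower bound")
    if meets_slo(hi):
        return hi  # the bracket never reached saturation; widen `hi` and retry
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if meets_slo(mid):
            lo = mid
        else:
            hi = mid
    return lo


if __name__ == "__main__":
    # Stand-in for a real load test: pretend the system saturates at 3.3 sessions/s.
    pretend_capacity = 3.3
    srr = find_srr_max(lambda lam: lam <= pretend_capacity, lo=0.5, hi=8.0, tol=0.01)
    print("estimated SRR_max:", round(srr, 2))
```

Each bisection step costs one full load test per policy, so the tolerance directly sets the GPU budget; the returned rate is also the anchor for the 0.9× / 1.0× / 1.1× B5 reruns.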
|
||||
|
||||
Reusable assets already in repo: