Refresh the standing audit package now that B1' / B2 / B3 are complete.

current_results/characterization_claim_matrix.md
  Flips seven entries from "not_yet_supported" / "partially_supported" to
  "supported" with pointers into window_1_results/. New entries cover
  per-session sequentiality, KV per request, real reuse decomposition,
  theoretical APC ceiling, the LMetric locality gap, Unified breaking the
  locality-vs-latency tradeoff, B2 causal interference proof, sticky's
  interference inflation, and the partial heavy-tail / hot-spot story.
  B4 SRR + B5 attribution stay "not_yet_supported" (Window 2 work).

current_results/main_claim_allowed_runs.md
  New "Allowed For Routing-Policy Comparison" section pins the five B3
  policy directories. New "Allowed For PD-colo Interference" section pins
  the B2 sweep. Legacy section retained for the pre-instrumentation
  200/500/1000-req runs.

current_results/reviewer_risk_register.md
  Marks the two old "high"-severity risks (sequentiality / reuse
  decomposition) as resolved; adds new entries for the APC contamination
  empirics, the b3_analyze.sh truncate-write bug that cost unified's
  interference index, the GPU-0 EngineCore ghost cleanup, the
  saturated-replay caveat for trace-timestamp dispatch, and the synthetic
  B2 decode workload.

current_results/all_figures_index.md
  Adds the 8 new Window 1 figures alongside the existing 6 from the legacy
  summarize_runs run.

current_results/reproduction_commands.sh
  Records the full B3 + B2 + figure pipeline.

analysis/characterization_todo_for_interns.md
  Updates the Progress Snapshot table: B0, B1, B2, B3, B6 all DONE; only
  B4 and B5 remain (Window 2).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
685 lines
21 KiB
Markdown
# Agentic Workload Characterization TODO

Status: execution checklist for interns
Date: 2026-05-25
Last progress audit: 2026-05-25

## Progress Snapshot (2026-05-25, post-Window-1)

| Batch | State | Evidence |
|---|---|---|
| B0 Substrate audit | **DONE for new runs**, legacy still partial | A1+A2 instrumentation lands per-request unix timestamps and X-Request-Id passthrough; B3 sweep 2026-05-25 achieves 100% join coverage on all 5 policy runs |
| B1 Workload characterization | **DONE** | `window_1_results/kv_footprint_summary.json` (98304 B/token, p99 = 11.49 GiB); real reuse decomposition (`lmetric_reuse.json`: 93.2% intra-session, 5.7% cross, 1.1% shared); theoretical APC ceilings (`apc_upper_w600.json`: 79.6% intra / 80.3% any) |
| B2 PD interference | **DONE** | `outputs/b2_microbench/sweep/` 5 × 2 cells. Different-worker control idx 0.92-1.02 across 32× prefill size variation; same-worker TTFT idx scales 2.15× → 218×. Causal proof complete. |
| B3 5-policy routing sweep | **DONE** | `outputs/b3_sweep_20260525_095043/` lmetric/load_only/sticky (warm-cache) + unified/capped (isolated cold-start). Aggregated in `b3_policy_comparison.json`. Unified hits APC 79.4% (97% of ceiling) AND TTFT p90 7.24 s. |
| B4 SRR sweep | NOT DONE | Window 2 task. A4 loadgen + per-class SLO + λ binary search per policy. |
| B5 Failure attribution | NOT DONE | Window 2 task. Depends on B4 SRR boundaries. |
| B6 Audit package | **DONE for Window 1** | `current_results/{characterization_claim_matrix.md, all_figures_index.md, reviewer_risk_register.md, main_claim_allowed_runs.md, reproduction_commands.sh}` refreshed; Window 1 results aggregated in `window_1_results.md` + 8 PNG figures |

Reusable assets already in repo:

- `analysis/characterization/analyze.py`: B0+B1 CPU-only analyzer
- `analysis/characterization/summarize_runs.py`: existing-run audit producing the B6 scaffold
- `analysis/characterization/plot_current_results.py`: figure regeneration script
- `analysis/characterization/protocols.md`: B2–B6 protocol with required instrumentation, sweep, pass condition
- `analysis/characterization/current_results/`: current audit package (claim matrix + risk register + allowed-runs gate + 6 PNG figures)

Hard gates still blocking main claims:

1. Replayer/proxy must emit per-request dispatch + finish/error wall-clock timestamps (blocks B0 actual sequentiality, B4 SRR validity).
2. Per-request record must carry `session_id` + `hash_ids` + `cached_tokens` jointly (blocks B1 reuse decomposition).
3. Engine/proxy must log decode-step and prefill-chunk timestamps with worker id (blocks B2 interference index).
4. Proxy must log route decision, chosen worker, candidate scores, per-worker queue/KV/APC snapshot (blocks B3 hot-spot proof).

## 0. Purpose

We are not starting from the assumption that Unified routing or PUSH
migration is already the answer.

The first goal is to build a rigorous characterization package that proves:

1. which dimensions make agentic serving different;
2. where static PD-disaggregation works poorly;
3. where PD-colocation/cache-aware routing still has residual failure modes;
4. how these failure modes reduce sustainable request rate under SLO.

Only after these facts are established should we refine the positive system
design.

Primary system goal:

```text
maximize sustainable request rate under SLO
```

Prefill-decode interference and session hot-spot imbalance are mechanisms
that may reduce the sustainable request rate (SRR). They are not the final
metric by themselves.

## 1. Global Delivery Rules

Every task must produce data, figures, and an audit trail. A task is not
complete if it only produces a written conclusion.

Use this output layout:

```text
outputs/characterization/<date>/<task_name>/
├── manifest.json
├── raw/
├── summary.json
├── summary.md
├── figures/
└── audit.md
```

Required fields in `manifest.json`:

```json
{
  "git_commit": "",
  "host": "",
  "gpu_type": "",
  "gpu_count": 0,
  "trace_path": "",
  "trace_sha256": "",
  "policy": "",
  "launch_command": "",
  "request_limit": null,
  "time_scale": null,
  "session_sampling_method": "",
  "session_sequential": true,
  "start_time": "",
  "end_time": ""
}
```

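
The manifest above could be filled in along the following lines. This is a minimal sketch, not the repo's tooling: the helper names `sha256_of_file`, `build_manifest`, and `missing_fields` are hypothetical, and the sketch assumes the trace is a single local file whose SHA-256 is taken over its raw bytes.

```python
import hashlib
import json
import socket
from datetime import datetime, timezone
from pathlib import Path

# Field names copied from the required manifest.json layout.
REQUIRED_FIELDS = [
    "git_commit", "host", "gpu_type", "gpu_count", "trace_path",
    "trace_sha256", "policy", "launch_command", "request_limit",
    "time_scale", "session_sampling_method", "session_sequential",
    "start_time", "end_time",
]


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so large traces are not loaded into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(trace_path, **overrides):
    """Return a manifest dict; callers override fields the sketch cannot infer."""
    manifest = {
        "git_commit": "",
        "host": socket.gethostname(),
        "gpu_type": "",
        "gpu_count": 0,
        "trace_path": str(trace_path),
        "trace_sha256": sha256_of_file(trace_path),
        "policy": "",
        "launch_command": "",
        "request_limit": None,
        "time_scale": None,
        "session_sampling_method": "",
        "session_sequential": True,
        "start_time": datetime.now(timezone.utc).isoformat(),
        "end_time": "",
    }
    unknown = set(overrides) - set(REQUIRED_FIELDS)
    if unknown:
        raise KeyError(f"unknown manifest fields: {sorted(unknown)}")
    manifest.update(overrides)
    return manifest


def missing_fields(manifest):
    """List required fields that are absent from a manifest dict."""
    return [name for name in REQUIRED_FIELDS if name not in manifest]
```

A run script would call `build_manifest(trace, policy=..., launch_command=...)` at start, set `end_time` on exit, and write the result with `json.dump`. Rejecting unknown keys keeps manifests comparable across runs.
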
Every comparison must report:

- attempted requests
- completed requests
- errors / timeouts
- goodput
- TTFT p50/p90/p99
- E2E p50/p90/p99
- TPOT p50/p90/p99
- per-worker queue metrics
- per-worker GPU utilization
- per-worker KV occupancy if available
- per-worker APC / cache-hit metrics

Every figure must be reproducible from raw data by a script committed or
saved alongside the artifact.

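
As an illustration of the request-level part of that report, the sketch below computes counts, goodput, and latency percentiles from per-request records. The record fields (`status`, `ttft_s`, `e2e_s`, `tpot_s`) and the nearest-rank percentile definition are assumptions made for this example; the per-worker metrics are out of scope here.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile; returns None when there are no samples."""
    if not values:
        return None
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(records, duration_s):
    """Report counts and goodput next to success-only latency percentiles."""
    attempted = len(records)
    completed = [r for r in records if r["status"] == "ok"]
    summary = {
        "attempted": attempted,
        "completed": len(completed),
        "errors": attempted - len(completed),
        # Goodput counts only successful requests per second of wall clock.
        "goodput_rps": len(completed) / duration_s,
    }
    for metric in ("ttft_s", "e2e_s", "tpot_s"):
        samples = [r[metric] for r in completed]
        for pct in (50, 90, 99):
            summary[f"{metric}_p{pct}"] = percentile(samples, pct)
    return summary
```

Percentiles here are taken over successes only, which is why the attempted, completed, and error counts are emitted in the same dict: a policy that drops slow requests cannot look better without the drop being visible.
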
## 2. Batch 0: Benchmark Substrate Audit

Status: analyzer DONE (`analyze.py`); legacy-run sequentiality claim BLOCKED by missing dispatch/finish timestamps in `metrics.jsonl`. New replayer must add those fields before any `online_realistic` classification is allowed.

### Goal

Prove the load generator and trace replay are valid before trusting any
performance result.

The most important invariant:

```text
For online agentic serving, each session must have at most one in-flight turn.
Turn N+1 must not be sent before turn N completes.
```

### TODO

1. Implement or run an analyzer that reconstructs per-session request
   intervals:
   - dispatch timestamp
   - first-token timestamp
   - finish timestamp
   - error / timeout timestamp
2. Compute max concurrent in-flight turns per session.
3. Compute session start-time distribution.
4. Compute turn inter-arrival distribution.
5. Classify each existing run as one of:
   - `online_realistic`
   - `burst_stress`
   - `synthetic_microbench`
   - `invalid_for_online_claim`
6. For any run where session sequentiality is violated, write down exactly
   which claim it can still support.

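
TODO items 1 and 2 amount to a sweep over dispatch and finish events per session. The sketch below shows one way to do it; it is not the logic in `analyze.py`, and the record fields `session_id`, `dispatch_ts`, and `finish_ts` are assumed names. For failed requests, `finish_ts` is assumed to hold the error or timeout timestamp.

```python
from collections import defaultdict


def max_inflight_per_session(requests):
    """Return {session_id: peak number of simultaneously in-flight turns}."""
    events = defaultdict(list)
    for req in requests:
        events[req["session_id"]].append((req["dispatch_ts"], 1))
        events[req["session_id"]].append((req["finish_ts"], -1))

    peaks = {}
    for session_id, session_events in events.items():
        # At equal timestamps, process finishes (-1) before dispatches (+1),
        # so a turn sent exactly when the previous one ends is not a violation.
        session_events.sort(key=lambda event: (event[0], event[1]))
        current = peak = 0
        for _, delta in session_events:
            current += delta
            peak = max(peak, current)
        peaks[session_id] = peak
    return peaks


def sequentiality_violations(requests):
    """Sessions that break the one-in-flight-turn invariant, sorted by id."""
    peaks = max_inflight_per_session(requests)
    return sorted(sid for sid, peak in peaks.items() if peak > 1)
```

A run with an empty violation list satisfies `max_inflight_per_session == 1`; any other run needs an explicit stress or invalid label before its numbers are used.
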
### Data Artifacts

- `session_concurrency.json`
- `session_arrival_stats.json`
- `turn_interval_stats.json`
- `trace_profile.json`
- `invalid_runs.md`

### Figures

- session start-time CDF
- per-session max in-flight histogram
- turns per session CDF
- turn inter-arrival CDF

### Audit Checks

The `audit.md` must answer:

1. Does the main trace satisfy `max_inflight_per_session == 1`?
2. If not, is the run explicitly labeled as stress or invalid?
3. Are attempted/completed/error counts included?
4. Are latency percentiles computed only over successes, and if so, is
   goodput also reported?

### Pass Criteria

- Main online-serving experiments must have `max_inflight_per_session == 1`.
- Any violation must be clearly labeled and excluded from SRR claims.

## 3. Batch 1: Workload Characterization

Status: trace-shape items (1, 2, 3, 6, 8) DONE on full 7200 s GLM-5.1 trace; recorded in `current_results/full_trace_summary.json`. Items 4 (KV footprint), 5 (reuse decomposition), 7 (uncached append delta) are PENDING because they need `--kv-bytes-per-token` for the production model and joinable `cached_tokens`+`hash_ids` per request.

### Goal

Establish agentic workload facts independent of any proposed system.

Required facts:

1. long input, short output;
2. large per-request KV footprint;
3. reuse is mostly intra-session;
4. session token mass is heavy-tailed;
5. total prompt length and effective uncached prefill work are different.

### TODO

1. Compute input token CDF.
2. Compute output token CDF.
3. Compute input/output ratio.
4. Estimate KV footprint per request:

   ```text
   kv_bytes_per_request = input_tokens * kv_bytes_per_token
   ```

5. Decompose reusable KV into:
   - intra-session reuse
   - cross-session reuse
   - shared/system-prefix reuse
6. Compute session-level skew:
   - turns per session
   - cumulative input tokens per session
   - cumulative output tokens per session
   - cumulative uncached tokens per session
   - top-k session contribution
7. Compute append / effective-prefill distribution:

   ```text
   uncached_tokens = input_tokens - cached_tokens
   ```

8. Compare total input length vs uncached tokens.

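
The two formulas in items 4 and 7, plus the top-k contribution in item 6, are small enough to sketch directly. The default of 98304 bytes per token is the value quoted in the Progress Snapshot table; whether it applies to another model is an assumption to check. The function names are hypothetical.

```python
import math

# Value quoted in the Progress Snapshot (kv_footprint_summary.json).
KV_BYTES_PER_TOKEN = 98304


def kv_bytes_per_request(input_tokens, kv_bytes_per_token=KV_BYTES_PER_TOKEN):
    """KV footprint estimate: input_tokens * kv_bytes_per_token."""
    return input_tokens * kv_bytes_per_token


def uncached_tokens(input_tokens, cached_tokens):
    """Effective prefill work: tokens not served from the prefix cache."""
    return input_tokens - cached_tokens


def top_session_share(session_tokens, top_fraction):
    """Fraction of total token mass held by the heaviest sessions.

    session_tokens maps session id -> cumulative tokens;
    top_fraction is e.g. 0.01 for the top 1% of sessions.
    """
    totals = sorted(session_tokens.values(), reverse=True)
    k = max(1, math.ceil(top_fraction * len(totals)))
    return sum(totals[:k]) / sum(totals)
```

For example, `kv_bytes_per_request(1024)` is 96 MiB at this rate, which makes the gap between total prompt length and `uncached_tokens` the quantity that decides how much prefill work a turn actually adds.
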
### Data Artifacts

- `workload_summary.json`
- `kv_footprint_summary.json`
- `reuse_decomposition.json`
- `session_skew.json`
- `append_delta_stats.json`

### Figures

- input/output token CDF
- input/output ratio CDF
- KV footprint CDF
- reuse decomposition stacked bar
- turns per session CDF
- per-session token mass Lorenz curve
- top-k sessions token contribution bar
- total input vs uncached tokens scatter

### Audit Checks

The `audit.md` must answer:

1. What are input p50/p90/p99?
2. What are output p50/p90/p99?
3. What is the estimated KV footprint p50/p90/p99?
4. What fraction of reuse is intra-session?
5. What fraction of total token mass comes from top 1% / 5% sessions?
6. Are long prompts often small appends after cache reuse?

### Pass Criteria

The batch passes only if these facts can be stated numerically with raw data
links and plotted figures.

## 4. Batch 2: PD-Colo Prefill-Decode Interference Proof

Status: protocol DONE (`analysis/characterization/protocols.md` §"Batch 2 Protocol"); execution NOT STARTED; needs new engine instrumentation for decode-step and prefill-chunk timestamps.

### Goal

Prove that PD-colocation can suffer from prefill-decode interference under
high load, and quantify how much this affects TPOT, decode queueing, and SLO.

Hypothesis:

```text
When heavy uncached prefill overlaps with active decode on the same worker,
decode TPOT and/or decode queue delay increases.
```

### TODO

1. Run controlled microbenchmarks:
   - decode-only steady load;
   - decode load plus same-worker heavy prefill injection;
   - decode load plus different-worker heavy prefill injection.
2. Sweep uncached prefill sizes:
   - 2k
   - 8k
   - 16k
   - 32k
   - 64k
3. If supported, sweep chunked prefill size.
4. Log timestamps for:
   - decode steps;
   - prefill start/end;
   - prefill chunks;
   - queue admission;
   - request completion.
5. In trace replay, label decode steps by whether they overlap with
   same-worker prefill.
6. Compute:

   ```text
   interference_index =
     TPOT_p90(decode steps overlapping same-worker prefill)
     / TPOT_p90(decode steps without same-worker prefill)
   ```

7. Compare same-worker vs different-worker controls.

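
Items 5 and 6 can be sketched as an interval-overlap labeling step followed by a ratio of percentiles. This is an illustration under stated assumptions, not the protocol's implementation: each decode step and prefill is assumed to be a dict with `worker`, `start`, and `end`, the step duration stands in for per-token TPOT, and the percentile is nearest-rank.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile over a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def label_overlap(steps, prefills):
    """Mark each decode step that intersects a prefill on the same worker."""
    spans_by_worker = {}
    for prefill in prefills:
        spans_by_worker.setdefault(prefill["worker"], []).append(
            (prefill["start"], prefill["end"])
        )
    labeled = []
    for step in steps:
        spans = spans_by_worker.get(step["worker"], [])
        # Two half-open intervals intersect when each starts before the other ends.
        hit = any(step["start"] < end and start < step["end"] for start, end in spans)
        labeled.append({**step, "overlap": hit})
    return labeled


def interference_index(labeled_steps, pct=90):
    """TPOT percentile of overlapping steps divided by that of clean steps."""
    overlapping = [s["end"] - s["start"] for s in labeled_steps if s["overlap"]]
    clean = [s["end"] - s["start"] for s in labeled_steps if not s["overlap"]]
    if not overlapping or not clean:
        return None  # the index is undefined without both populations
    return percentile(overlapping, pct) / percentile(clean, pct)
```

Running the same computation with prefills injected on a different worker gives the control value for item 7; an index near 1.0 there and well above 1.0 in the same-worker case is the pattern the hypothesis predicts.
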
### Data Artifacts

- `interference_microbench_summary.json`
- `decode_step_timeseries.csv`
- `prefill_overlap_events.jsonl`
- `interference_index.json`
- `trace_overlap_summary.json`

### Figures

- TPOT time series with prefill overlap annotation
- interference index vs uncached prefill size
- same-worker vs different-worker TPOT boxplot
- chunk size vs TTFT/TPOT tradeoff
- trace replay overlap vs non-overlap TPOT comparison

### Audit Checks

The `audit.md` must answer:

1. Is the interference observed on the same worker?
2. Is the different-worker control significantly weaker?
3. Does interference grow with uncached prefill size?
4. Does the phenomenon appear in real trace replay, not only microbench?
5. Could the result be explained by global load instead of local colocation?

### Pass Criteria

- Same-worker overlap must measurably increase TPOT or decode queue delay.
- The effect must be weaker or absent in the different-worker control.
- The effect must be visible in at least one trace replay setting.

## 5. Batch 3: Session Hot-Spot Residual Imbalance Proof

Status: protocol DONE; partial signal from legacy `gpu_util.csv` (GPU-util imbalance visible) but causal proof NOT STARTED; needs per-worker queue/KV/APC and session→worker map from instrumented proxy.

### Goal

Prove that cache-aware/LMetric is a strong baseline but still leaves residual
hot-worker imbalance due to session skew and locality.

Hypothesis:

```text
Cache-aware routing preserves locality by attracting future turns to cached
workers. This is usually good, but heavy-tailed sessions can create hot
workers whose queue delay/SLO violations are much worse than the median
worker even when other workers still have headroom.
```

### TODO

1. Run the same session-causal trace with:
   - corrected LMetric/cache-aware;
   - load-only routing;
   - hard sticky routing;
   - current Unified hybrid, if available.
2. For each worker, record:
   - assigned session count;
   - cumulative input tokens;
   - cumulative uncached tokens;
   - cumulative output tokens;
   - request queue delay;
   - decode queue delay;
   - GPU utilization;
   - KV occupancy;
   - APC / cache-hit rate;
   - SLO violations.
3. For each session, record:
   - worker set used;
   - primary worker;
   - cumulative token mass;
   - number of turns;
   - latency contribution;
   - whether it appears in slow-request set.
4. Create a session-mass capped or equalized replay:
   - cap max session turns or token mass;
   - rerun LMetric/cache-aware;
   - compare hot-spot index.
5. Compute:

   ```text
   hotspot_index =
     max_worker_queue_delay_p90 / median_worker_queue_delay_p90
   ```

6. Compute locality/load tradeoff:

   ```text
   locality_gain = APC(policy) - APC(load_only)
   imbalance_cost =
     max_worker_latency_p90(policy) - median_worker_latency_p90(policy)
   ```

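
The three quantities in items 5 and 6 reduce to a few lines once per-worker p90 values exist. The sketch below assumes each input is a plain dict mapping a worker id to an already-computed p90 value, and that APC is expressed as a fraction; the function names mirror the formulas but are otherwise hypothetical.

```python
from statistics import median


def hotspot_index(worker_queue_delay_p90):
    """max over workers / median over workers of queue-delay p90."""
    values = list(worker_queue_delay_p90.values())
    mid = median(values)
    if mid == 0:
        return None  # undefined when the median worker has no queue delay
    return max(values) / mid


def locality_gain(apc_policy, apc_load_only):
    """APC(policy) - APC(load_only)."""
    return apc_policy - apc_load_only


def imbalance_cost(worker_latency_p90):
    """max over workers minus median over workers of latency p90."""
    values = list(worker_latency_p90.values())
    return max(values) - median(values)
```

Plotting `locality_gain` against `hotspot_index` per policy gives the tradeoff figure listed for this batch. Using the median rather than the mean in the denominator keeps one hot worker from hiding itself by inflating the baseline.
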
### Data Artifacts

- `worker_balance_summary.json`
- `session_to_worker_map.json`
- `session_mass_summary.json`
- `routing_policy_comparison.json`
- `hotspot_index.json`
- `capped_session_replay_summary.json`

### Figures

- per-worker queue delay bar
- per-worker token mass bar
- GPU utilization timeline by worker
- KV occupancy timeline by worker
- APC vs queue delay scatter
- top sessions contribution bar
- policy tradeoff plot: APC vs hotspot_index
- original vs session-capped hot-spot comparison

### Audit Checks

The `audit.md` must answer:

1. Does LMetric/cache-aware still show worker-level skew?
2. Are SLO violations concentrated on hot workers or hot sessions?
3. Does load-only routing improve balance but reduce APC/locality?
4. Does hard sticky improve locality but worsen hot-spot/HOL?
5. Does session-mass capping reduce hot spots?

### Pass Criteria

- LMetric/cache-aware must be shown as strong but imperfect.
- There must be measurable residual hot-worker imbalance.
- The imbalance must correlate with session token mass or locality.

## 6. Batch 4: Sustainable Request Rate Sweep

Status: protocol DONE; execution NOT STARTED; requires open-loop session-causal loadgen and policy-comparable arrival process.

### Goal

Connect interference and hot-spot mechanisms to the final metric:

```text
SRR(SLO) = max arrival rate satisfying SLO in steady state
```

### TODO

1. Define provisional SLO thresholds. Use configurable values, for example:

   ```text
   TTFT_p90 <= T_ttft
   E2E_p90 <= T_e2e
   TPOT_p90 <= T_tpot
   error_rate <= epsilon
   queue length stable
   KV occupancy stable
   ```

2. Implement arrival-rate sweep:
   - Poisson session arrivals;
   - session-internal sequentiality;
   - warmup window;
   - steady-state measurement window.
3. For each arrival rate `lambda`, run:
   - PD-colo cache-aware/LMetric;
   - static PD-disagg;
   - current Unified hybrid;
   - optional hard sticky;
   - optional load-only.
4. Find maximum sustainable lambda for each policy.
5. Report instability reasons:
   - SLO violation;
   - queue growth;
   - KV occupancy growth;
   - error/timeout growth.

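
The sweep in items 1 to 4 can be sketched as an SLO predicate, a Poisson session-arrival generator, and a bisection over `lambda`. This is a sketch under assumptions: `run_at_rate` is a hypothetical callable that runs one experiment and returns a summary dict with the keys shown, the stability checks are assumed to be precomputed booleans, and bisection is valid only if SLO compliance is monotone in the arrival rate.

```python
import random


def meets_slo(summary, slo):
    """Apply the provisional SLO thresholds to one run summary."""
    return (
        summary["ttft_p90"] <= slo["T_ttft"]
        and summary["e2e_p90"] <= slo["T_e2e"]
        and summary["tpot_p90"] <= slo["T_tpot"]
        and summary["error_rate"] <= slo["epsilon"]
        and summary["queue_stable"]
        and summary["kv_stable"]
    )


def poisson_session_arrivals(rate, duration_s, seed=0):
    """Session start times of a Poisson process with `rate` sessions/second."""
    rng = random.Random(seed)  # seeded so every policy sees the same arrivals
    now, arrivals = 0.0, []
    while True:
        now += rng.expovariate(rate)  # exponential inter-arrival gaps
        if now >= duration_s:
            return arrivals
        arrivals.append(now)


def find_srr(run_at_rate, slo, lo, hi, tol):
    """Largest rate in [lo, hi] meeting the SLO, to within `tol`.

    Returns None if even `lo` violates the SLO, and `hi` if the whole
    range is sustainable.
    """
    if not meets_slo(run_at_rate(lo), slo):
        return None
    if meets_slo(run_at_rate(hi), slo):
        return hi
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if meets_slo(run_at_rate(mid), slo):
            lo = mid
        else:
            hi = mid
    return lo
```

Because each probe is a full run, the number of bisection steps is the dominant cost; a coarse `tol` first, then a finer pass around the boundary, keeps the sweep affordable. Reusing one seed across policies gives the "same arrival process" the audit checks ask for.
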
### Data Artifacts

- `srr_curve.json`
- `lambda_runs/<lambda>/summary.json`
- `slo_violation_reason.json`
- `goodput_vs_arrival_rate.json`
- `stability_summary.json`

### Figures

- SRR bar chart
- TTFT p90 vs arrival rate
- E2E p90 vs arrival rate
- TPOT p90 vs arrival rate
- goodput vs arrival rate
- error rate vs arrival rate
- queue length over time near failure point
- KV occupancy over time near failure point

### Audit Checks

The `audit.md` must answer:

1. Are session arrivals open-loop and Poisson?
2. Is session-internal sequentiality enforced?
3. How long are warmup and steady-state windows?
4. Is SRR failure persistent rather than transient?
5. Are completed/requested counts reported at every lambda?
6. Are policies compared on the same trace and same arrival process?

### Pass Criteria

- Each policy must have a measured SRR under the same SLO.
- Failure must be attributed to persistent SLO violation, queue growth, KV
  growth, or error growth.
- Data must be session-causal.

## 7. Batch 5: Failure Attribution Near SRR Boundary

Status: protocol DONE; execution NOT STARTED; depends on B2 instrumentation and B4 SRR boundary.

### Goal

At and around the PD-colo/LMetric failure point, determine whether SLO
violations are caused by prefill-decode interference, session hot spots, KV
pressure, cache misses, or other mechanisms.

### TODO

1. Select three arrival rates:

   ```text
   lambda = 0.9 * SRR
   lambda = 1.0 * SRR
   lambda = 1.1 * SRR
   ```

2. For every slow or SLO-violating request, assign labels:
   - same-worker prefill overlap;
   - hot worker queue;
   - high KV occupancy;
   - cache miss / large uncached append;
   - transfer wait;
   - P queue wait;
   - D admission wait;
   - unknown.
3. Produce per-request waterfall for representative slow requests.
4. Produce per-worker timeline around failure windows.
5. Summarize cause distribution.

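
Items 2 and 5 can be illustrated with a rule-based labeler and a cause-distribution summary. Everything specific here is an assumption for the example: the request fields, the threshold keys, and the cutoffs are hypothetical, and only four of the eight labels are covered (the transfer, P queue, and D admission labels would follow the same pattern).

```python
from collections import Counter


def attribute(request, thresholds):
    """Return every label that applies to one slow request, or ["unknown"]."""
    labels = []
    if request.get("prefill_overlap", False):
        labels.append("same_worker_prefill_overlap")
    if request.get("worker_queue_delay_s", 0.0) > thresholds["hot_queue_s"]:
        labels.append("hot_worker_queue")
    if request.get("kv_occupancy", 0.0) > thresholds["kv_occupancy"]:
        labels.append("high_kv_occupancy")
    if request.get("uncached_tokens", 0) > thresholds["uncached_tokens"]:
        labels.append("large_uncached_append")
    return labels or ["unknown"]


def failure_breakdown(slow_requests, thresholds):
    """Fraction of slow requests carrying each label.

    A request may carry several labels, so the fractions can sum to more
    than 1.0; the "unknown" share is what remains unexplained.
    """
    counts = Counter()
    for request in slow_requests:
        counts.update(attribute(request, thresholds))
    total = len(slow_requests)
    return {label: count / total for label, count in counts.items()}
```

Keeping labels non-exclusive avoids forcing a single root cause onto requests that were, for instance, both on a hot worker and overlapped by prefill; the case studies are where a primary cause gets argued.
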
### Data Artifacts

- `slow_request_attribution.jsonl`
- `failure_breakdown.json`
- `case_studies.md`
- `worker_failure_windows.json`

### Figures

- SLO violation cause stacked bar
- slow request waterfall
- worker timeline near failure
- prefill/decode/KV/queue stacked breakdown
- failure cause vs arrival rate

### Audit Checks

The `audit.md` must answer:

1. What fraction of slow requests overlap same-worker prefill?
2. What fraction are on hot workers?
3. What fraction happen under high KV occupancy?
4. What fraction are large uncached append requests?
5. For PD-disagg/Unified migration, how much time is transfer/P queue/D wait?
6. What remains unexplained?

### Pass Criteria

The batch must answer:

1. Why PD-colo/LMetric hits its SRR limit.
2. Why static PD-disagg hits its SRR limit.
3. If Unified/PUSH underperforms, whether the cause is trigger quality, cost
   model, transfer overhead, wrong load regime, or something else.

## 8. Batch 6: Audit Package

Status: scaffold DONE: all five final artifacts exist under `analysis/characterization/current_results/` and are regenerated by `summarize_runs.py` + `plot_current_results.py`. Future B2–B5 outputs must be merged into the same package by re-running `summarize_runs.py` after new runs.

### Goal

Make the whole characterization package reviewable by a strict systems
reviewer.

### TODO

1. Write a claim matrix:

   ```text
   claim -> data artifact -> figure -> script -> caveat -> reviewer risk
   ```

2. Write a figure index:
   - figure filename;
   - source data;
   - generation command;
   - intended claim.
3. Write a reviewer risk register:
   - loadgen validity risks;
   - trace representativeness risks;
   - metric bias risks;
   - implementation-specific risks;
   - generalization risks.
4. Write a reproduction script or command list.
5. Mark experiments that cannot support main claims.

### Final Artifacts

- `characterization_claim_matrix.md`
- `all_figures_index.md`
- `reviewer_risk_register.md`
- `reproduction_commands.sh`
- `main_claim_allowed_runs.md`

### Audit Checks

The final package must satisfy:

1. Every claim links to raw data.
2. Every figure can be regenerated.
3. Every experiment has a manifest.
4. Every caveat is explicit.
5. Invalid or stress-only runs are not used for online-serving claims.

## 9. Priority Order

### Priority 1

Do these first:

1. Batch 0: Benchmark Substrate Audit
2. Batch 1: Workload Characterization
3. Batch 3: Session Hot-Spot Residual Imbalance Proof

Reason:

These define whether the trace and routing problem are real. Without them,
SRR sweeps and system experiments are not trustworthy.

### Priority 2

Do these after the substrate and workload facts are stable:

1. Batch 2: PD-Colo Prefill-Decode Interference Proof
2. Batch 5: Failure Attribution Near SRR Boundary

Reason:

These explain the mechanisms behind SLO/SRR failure and determine what the
positive system should actually fix.

### Priority 3

Do these after instrumentation and attribution are ready:

1. Batch 4: Sustainable Request Rate Sweep
2. Batch 6: Audit Package

Reason:

SRR sweeps are expensive. They should run only after trace validity,
logging, and attribution labels are ready.

## 10. Non-Negotiable Reviewer Rules

1. Do not use session-nonsequential loadgen for online-serving claims.
2. Do not compare latency percentiles without attempted/completed/error counts.
3. Do not use APC alone as a success metric.
4. Do not use average GPU utilization as proof of load balance.
5. Do not compare policies on different traces unless explicitly labeled.
6. Do not hide failed requests or timeouts.
7. Do not claim Unified/PUSH is the answer before failure attribution proves
   the relevant bottleneck and cost budget.
8. Treat corrected LMetric/cache-aware PD-colo as the main baseline.
9. Treat static PD-disagg as an important baseline, not a strawman.
10. Every result must be reproducible from raw artifacts and commands.