# Experiment Workflow

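Use this workflow to design, run, analyze, and audit systems experiments. The goal is that every performance claim can be traced back to its workload, baseline, metric, platform, statistics, and raw artifacts.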

## Inputs

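- claim: what the experiment is meant to demonstrate.
- system boundary: which system, component, interface, or workload is being evaluated.
- artifact: an experiment plan, script, log, CSV, figure, or paper paragraph.
- stage: `design / run / analyze / audit`.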

## Stage: Design

Invoke:

- `skills/benchmark-crime-auditor.md`
- mode: `experiment-design-review`

Required outputs:

- headline claim.
- workload matrix.
- baseline matrix.
- metric definitions.
- platform specification template.
- run protocol.
- expected failure or degradation cases.

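Design checks:

- Every claim has at least one direct metric.
- Every headline claim has a fair baseline.
- Every optimization mechanism has an ablation.
- Every key assumption has a sensitivity test.
- At least one scenario is included where the mechanism could plausibly lose.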
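The design checks above can be partly automated as a lint over the experiment plan. The following is a minimal sketch, assuming a hypothetical plan layout: the keys `claims`, `mechanisms`, `assumptions`, `workloads` and their fields are illustrative and not a format this workflow defines. It can only check that a baseline is present; whether the baseline is fair still needs human review.

```python
def check_design(plan: dict) -> list:
    """Return the violated design checks; an empty list means the plan passes.

    The plan layout (key and field names) is an assumption for illustration.
    """
    problems = []
    for claim in plan.get("claims", []):
        if not claim.get("metrics"):
            problems.append(f"claim '{claim['name']}' has no direct metric")
        # Presence only: fairness of the baseline cannot be checked mechanically.
        if claim.get("headline") and not claim.get("baseline"):
            problems.append(f"headline claim '{claim['name']}' has no baseline")
    for mech in plan.get("mechanisms", []):
        if not mech.get("ablation"):
            problems.append(f"mechanism '{mech['name']}' has no ablation")
    for assumption in plan.get("assumptions", []):
        if not assumption.get("sensitivity_test"):
            problems.append(f"assumption '{assumption['name']}' has no sensitivity test")
    if not any(w.get("expected_to_lose") for w in plan.get("workloads", [])):
        problems.append("no workload where the mechanism is expected to lose")
    return problems
```

An empty or incomplete plan fails loudly, which is the intended behavior before any run time is spent.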

## Stage: Run

Must record:

- git commit / binary hash.
- command line.
- config.
- machine fingerprint.
- OS/kernel/compiler/runtime versions.
- timestamp.
- raw output path.
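One way to make this record routine is to capture it in the harness at the start of every run. The sketch below uses only the Python standard library; the function names and the output field names are assumptions for illustration. The machine fingerprint here is deliberately minimal, and the compiler version of the system under test is not discoverable from Python, so it is left as a caller-supplied value.

```python
import hashlib
import platform
import subprocess
from datetime import datetime, timezone
from typing import Optional


def sha256_of(path: str) -> str:
    """Hash a file (for example the benchmark binary) in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def git_commit() -> Optional[str]:
    """Return the current commit hash, or None outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def capture_run_metadata(command, config, raw_output_path,
                         binary_path=None, compiler_version=None):
    """Collect the per-run record as a plain dict (field names are illustrative)."""
    return {
        "git_commit": git_commit(),
        "binary_sha256": sha256_of(binary_path) if binary_path else None,
        "command_line": list(command),
        "config": config,
        "machine": {
            "node": platform.node(),
            "arch": platform.machine(),
            "processor": platform.processor(),
        },
        "os": platform.platform(),
        "kernel": platform.release(),
        "compiler": compiler_version,  # supplied by the caller, not detected
        "runtime": "Python " + platform.python_version(),
        "timestamp_utc": datetime.now(timezone.utc).isoformat(),
        "raw_output_path": raw_output_path,
    }
```

Writing this dict next to the raw output (for example as JSON) keeps the metadata and the measurements from drifting apart.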

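Run discipline:

- Keep warmup and measured runs separate.
- Repeat each configuration multiple times and record a trial id.
- Run in forward and reverse order, or randomize the run order.
- Validate the data.
- Record resource utilization alongside end-to-end results.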
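The ordering and labeling rules above can be fixed before anything executes by generating the schedule up front. This is a minimal sketch, assuming a hypothetical `build_schedule` helper and entry layout; it randomizes the configuration order within each repetition and tags warmup entries so they can never be mixed into the measured results.

```python
import random


def build_schedule(configs, trials, warmups, seed):
    """Return an ordered list of runs with explicit phase and trial id.

    Warmup entries carry trial_id None so analysis code cannot pick them up
    by accident. The seed should be stored with the run record so the order
    is reproducible.
    """
    rng = random.Random(seed)
    schedule = []
    trial_id = 0
    for rep in range(trials):
        order = list(configs)
        rng.shuffle(order)  # fresh random order for every repetition
        for cfg in order:
            for _ in range(warmups):
                schedule.append(
                    {"trial_id": None, "rep": rep, "config": cfg, "phase": "warmup"}
                )
            schedule.append(
                {"trial_id": trial_id, "rep": rep, "config": cfg, "phase": "measured"}
            )
            trial_id += 1
    return schedule
```

Randomizing per repetition, rather than once, is a design choice: it spreads slow drift (thermal state, background load) across configurations instead of letting it line up with one of them.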

## Stage: Analyze

Must compute:

- Absolute numbers.
- Relative numbers.
- Variance or confidence intervals.
- Geometric mean, when aggregating normalized scores.
- Per-unit resource cost, for example cycles/op, ms/request, J/op, bytes/request.
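A small sketch of these computations using only the standard library follows. The function names are hypothetical, and the percentile bootstrap is one choice of confidence interval among several; the workflow does not prescribe a method.

```python
import random
import statistics


def summarize(samples, n_boot=2000, seed=0):
    """Mean, sample standard deviation, and a 95% percentile-bootstrap CI of the mean."""
    rng = random.Random(seed)
    boot_means = sorted(
        statistics.fmean(rng.choices(samples, k=len(samples)))
        for _ in range(n_boot)
    )
    lo = boot_means[int(0.025 * n_boot)]
    hi = boot_means[int(0.975 * n_boot) - 1]
    return {
        "mean": statistics.fmean(samples),
        "stdev": statistics.stdev(samples),
        "ci95": (lo, hi),
    }


def speedups(baseline_time, system_time):
    """Per-workload normalized score: baseline time divided by system time."""
    return {w: baseline_time[w] / system_time[w] for w in baseline_time}


def geomean(ratios):
    """Aggregate normalized scores with the geometric mean, not the arithmetic mean."""
    return statistics.geometric_mean(list(ratios.values()))


def per_unit_cost(total_resource, units_of_work):
    """For example total cycles / operations, or total joules / operations."""
    return total_resource / units_of_work
```

The geometric mean is used for ratios because it gives the same ranking regardless of which system is chosen as the normalization reference, which the arithmetic mean of ratios does not.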

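Analysis checks:

- Do not use throughput degradation as a direct proxy for overhead.
- Do not look only at the mean; check the tail.
- Do not show only the winning workloads.
- Every figure answers one specific question.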
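The first two checks are easier to see with numbers. The sketch below uses made-up values purely for illustration: a baseline that is not CPU-bound can absorb extra work with only a small throughput drop, so the per-operation CPU cost tells a very different story. The helper names are hypothetical, and the percentile uses the nearest-rank definition.

```python
import math
import statistics


def cpu_seconds_per_op(throughput_ops_s, cpu_utilization, cores=1):
    """CPU time consumed per operation, from throughput and utilization."""
    return cpu_utilization * cores / throughput_ops_s


def relative_change(base, new):
    return (new - base) / base


def percentile(samples, p):
    """Nearest-rank percentile for p in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Illustrative numbers only, not measurements.
base_tput, base_util = 100_000, 0.40  # ops/s, fraction of one core
sys_tput, sys_util = 95_000, 0.80

throughput_change = relative_change(base_tput, sys_tput)
overhead = relative_change(
    cpu_seconds_per_op(base_tput, base_util),
    cpu_seconds_per_op(sys_tput, sys_util),
)
print(f"throughput change: {throughput_change:+.1%}")
print(f"CPU cost per op change: {overhead:+.1%}")

# A mean can hide a tail: 98 fast requests and 2 slow ones (ms, illustrative).
latencies_ms = [1] * 98 + [100] * 2
print("mean:", statistics.fmean(latencies_ms))
print("p50:", percentile(latencies_ms, 50), "p99:", percentile(latencies_ms, 99))
```

In this illustration the throughput drop is small while the CPU cost per operation roughly doubles, and the median latency is unchanged while the 99th percentile is two orders of magnitude higher.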

## Stage: Audit

Invoke:

- `skills/benchmark-crime-auditor.md`
- mode: `pre-submission` or `claim-spot-check`

Outputs:

- audit table.
- blocking crimes.
- required reruns.
- claim rewrite suggestions.

## Minimal Experiment Record

```yaml
claim:
system:
baseline:
workload:
metric:
platform:
commands:
raw_data:
analysis_script:
figure:
statistics:
known_limits:
```
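A record is only useful if every field is filled in. As a minimal sketch, assuming the record above has already been parsed into a Python dict (the YAML parsing step is not shown, and `missing_fields` is a hypothetical helper), a completeness check could look like this:

```python
REQUIRED_FIELDS = (
    "claim", "system", "baseline", "workload", "metric", "platform",
    "commands", "raw_data", "analysis_script", "figure", "statistics",
    "known_limits",
)


def missing_fields(record: dict) -> list:
    """Return required fields that are absent or empty, in template order."""
    return [
        name for name in REQUIRED_FIELDS
        if record.get(name) in (None, "", [], {})
    ]
```

Running this check in the audit stage flags records whose claim cannot yet be traced to raw data or to an analysis script.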