gahow/agentic-ctx

Gahow Wang af6ba2aa16 initial commit

2026-05-13 21:36:34 +08:00

Experiment Workflow

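A workflow for designing, running, analyzing, and auditing systems experiments. The goal is that every performance claim can be traced back to its workload, baseline, metric, platform, statistics, and raw artifacts.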

Inputs

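claim: what is to be demonstrated.
system boundary: which system, component, interface, or workload is being evaluated.
artifact: experiment plan, scripts, logs, CSVs, figures, or paper paragraphs.
stage: design / run / analyze / audit.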

Stage: Design

Invoke:

skills/benchmark-crime-auditor.md
mode: experiment-design-review

Required outputs:

headline claim.
workload matrix.
baseline matrix.
metric definitions.
platform specification template.
run protocol.
expected failure or degradation cases.

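Design checks:

Every claim has at least one direct metric.
Every headline claim has a fair baseline.
Every optimization mechanism has an ablation.
Every key assumption has a sensitivity test.
At least one scenario is included where, given how the mechanism works, the system could plausibly lose.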
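
Some of the design checks above can be screened mechanically before a human review. The following is a minimal sketch, assuming a hypothetical plan layout: the dictionary keys (`claims`, `mechanisms`, `assumptions`, `workloads`) and their sub-fields are illustrative and are not defined by this workflow. It only checks that a baseline is present; whether the baseline is fair still needs judgment.

```python
def check_design(plan):
    """Return a list of design-check violations for a hypothetical plan dict."""
    problems = []
    for claim in plan.get("claims", []):
        # Every claim needs at least one direct metric.
        if not claim.get("metrics"):
            problems.append(f"claim '{claim['name']}' has no direct metric")
        # Headline claims additionally need a baseline (fairness is not checked here).
        if claim.get("headline") and not claim.get("baseline"):
            problems.append(f"headline claim '{claim['name']}' has no baseline")
    for mech in plan.get("mechanisms", []):
        if not mech.get("ablation"):
            problems.append(f"mechanism '{mech['name']}' has no ablation")
    for assumption in plan.get("assumptions", []):
        if not assumption.get("sensitivity_test"):
            problems.append(f"assumption '{assumption['name']}' has no sensitivity test")
    # At least one workload should be flagged as a case the mechanism may lose.
    if not any(w.get("expected_to_lose") for w in plan.get("workloads", [])):
        problems.append("no workload where the mechanism is expected to lose")
    return problems
```

An empty return value means only that the plan is structurally complete, not that the experiment design is sound.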

Stage: Run

Must record:

git commit / binary hash.
command line.
config.
machine fingerprint.
OS/kernel/compiler/runtime versions.
timestamp.
raw output path.
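
The fields above can be captured automatically at the start of each run. The following is a minimal sketch using only the Python standard library; the function names and the record layout are assumptions for illustration, and the "machine fingerprint" here is just what `platform.uname()` reports (a real fingerprint would likely add CPU model, memory, and firmware settings).

```python
import hashlib
import platform
import subprocess
from datetime import datetime, timezone


def git_commit():
    """Return the current git commit hash, or None outside a git checkout."""
    try:
        out = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            capture_output=True, text=True, check=True,
        )
        return out.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return None


def file_sha256(path):
    """Hash a binary in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_record(argv, config, raw_output_path):
    """Collect the per-run metadata into one JSON-serializable dict."""
    uname = platform.uname()
    return {
        "git_commit": git_commit(),
        "command_line": list(argv),
        "config": config,
        "machine": {"node": uname.node, "arch": uname.machine},
        "os": uname.system,
        "kernel": uname.release,
        "runtime": "python " + platform.python_version(),
        "compiler": platform.python_compiler(),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "raw_output_path": raw_output_path,
    }
```

Writing this record next to the raw output (for example as a JSON file with the same stem) keeps each result tied to the environment that produced it.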

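Run discipline:

Keep warmup runs separate from measured runs.
Repeat multiple times and record a trial id.
Run in forward/reverse or randomized order.
Validate the data.
Record resource utilization alongside end-to-end results.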
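
The run discipline above can be sketched as a small harness. This is an illustration under stated assumptions, not a prescribed implementation: `run_trials`, `validate`, and the row fields are hypothetical names, each variant is assumed to be a Python callable, and process CPU time stands in for "resource utilization".

```python
import random
import time


def run_trials(variants, workload, warmup=3, trials=5, seed=0):
    """Measure each variant once per trial, in a shuffled order per trial.

    variants: dict mapping variant name to a callable that takes the workload.
    Returns one row dict per measured run; warmup runs are never recorded.
    """
    rng = random.Random(seed)  # fixed seed so the run order is reproducible
    names = list(variants)
    for name in names:
        for _ in range(warmup):
            variants[name](workload)
    rows = []
    for trial_id in range(trials):
        order = names[:]
        rng.shuffle(order)  # randomized order guards against ordering effects
        for position, name in enumerate(order):
            wall_start = time.perf_counter()
            cpu_start = time.process_time()
            result = variants[name](workload)
            rows.append({
                "trial_id": trial_id,
                "position": position,
                "variant": name,
                "wall_seconds": time.perf_counter() - wall_start,
                "cpu_seconds": time.process_time() - cpu_start,
                "result": result,
            })
    return rows


def validate(rows):
    """Data validation: every variant must produce the same result."""
    return all(row["result"] == rows[0]["result"] for row in rows)
```

For example, `run_trials({"builtin": sum, "loop": lambda xs: sum(x for x in xs)}, list(range(1000)))` yields ten rows, each tagged with its trial id and its position within the trial.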

Stage: Analyze

Must compute:

Absolute numbers.
Relative numbers.
Variance or confidence intervals.
Geometric mean, when aggregating normalized scores.
Per-unit resource cost, e.g. cycles/op, ms/request, J/op, bytes/request.
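
The quantities above can be computed with the Python standard library. The following is a minimal sketch; the function names are illustrative, and the confidence interval uses a normal approximation (z = 1.96), which is an assumption that becomes optimistic for small trial counts, where a t-quantile is more appropriate.

```python
import math
import statistics


def summarize(samples):
    """Absolute numbers: mean, sample stdev, and an approximate 95% CI half-width."""
    n = len(samples)
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples) if n > 1 else 0.0
    half_width = 1.96 * stdev / math.sqrt(n)  # normal approximation
    return {"n": n, "mean": mean, "stdev": stdev, "ci95_half_width": half_width}


def speedup(baseline_seconds, system_seconds):
    """Relative number: values above 1.0 mean the system is faster."""
    return baseline_seconds / system_seconds


def aggregate_normalized(ratios):
    """Aggregate normalized scores with the geometric mean."""
    return statistics.geometric_mean(ratios)


def per_unit_cost(total_resource, operations):
    """Per-unit resource cost, e.g. total cycles / ops or total joules / ops."""
    return total_resource / operations
```

The geometric mean matters because normalized ratios are multiplicative: a 2x speedup on one workload and a 2x slowdown on another give `aggregate_normalized([2.0, 0.5])` of 1.0, whereas the arithmetic mean of the same ratios is 1.25 and would suggest a net win.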

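Analysis checks:

Do not use throughput degradation as a direct proxy for overhead.
Do not look only at averages; check the tail.
Do not show only the winning workloads.
Every figure answers one specific question.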
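
The first two checks can be illustrated with a short worked example. The numbers below are invented for illustration and are not measurements; the function names are hypothetical. The sketch assumes a baseline that is not CPU-bound, so added CPU work barely changes throughput even though the per-request cost roughly doubles.

```python
import math
import statistics


def throughput_degradation(baseline_rps, system_rps):
    """Fractional drop in throughput relative to the baseline."""
    return 1.0 - system_rps / baseline_rps


def cpu_ms_per_request(cpu_utilization, throughput_rps):
    """CPU milliseconds per request, with utilization as a fraction of one core."""
    return cpu_utilization / throughput_rps * 1000.0


def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Invented numbers: baseline at 30% CPU and 1000 req/s, system at 60% CPU and 950 req/s.
drop = throughput_degradation(1000, 950)  # about 5%
overhead = cpu_ms_per_request(0.60, 950) / cpu_ms_per_request(0.30, 1000) - 1.0  # about 110%

# Invented latencies: 98 requests at 1 ms and 2 requests at 100 ms.
latencies_ms = [1.0] * 98 + [100.0] * 2
mean_ms = statistics.mean(latencies_ms)  # pulled up by two outliers
p50_ms = percentile(latencies_ms, 50)    # the typical request
p99_ms = percentile(latencies_ms, 99)    # the tail
```

Reporting only the throughput drop would understate the added cost by a wide margin, and reporting only the mean latency would hide that the median request and the tail differ by two orders of magnitude.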

Stage: Audit

Invoke:

skills/benchmark-crime-auditor.md
mode: pre-submission or claim-spot-check

Output:

audit table.
blocking crimes.
required reruns.
claim rewrite suggestions.

Minimal Experiment Record

claim:
system:
baseline:
workload:
metric:
platform:
commands:
raw_data:
analysis_script:
figure:
statistics:
known_limits:
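
A record like the one above is only useful if every field is filled in. The following is a minimal sketch of a completeness check, assuming the record has already been loaded into a Python dict (the loading step and the helper names are assumptions, not part of this workflow):

```python
REQUIRED_FIELDS = (
    "claim", "system", "baseline", "workload", "metric", "platform",
    "commands", "raw_data", "analysis_script", "figure", "statistics",
    "known_limits",
)


def is_blank(value):
    """Treat None, whitespace-only strings, and empty containers as missing."""
    if value is None:
        return True
    if isinstance(value, str):
        return not value.strip()
    if isinstance(value, (list, tuple, dict, set)):
        return len(value) == 0
    return False


def missing_fields(record):
    """Return the required fields that are absent or blank, in template order."""
    return [name for name in REQUIRED_FIELDS if is_blank(record.get(name))]
```

An audit step could refuse to accept a result while `missing_fields(record)` is non-empty.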