initial commit

2026-05-13 21:36:34 +08:00
commit af6ba2aa16
11 changed files with 1113 additions and 0 deletions
--- a/references/benchmarking-crimes.md
+++ b/references/benchmarking-crimes.md
@@ -0,0 +1,73 @@
+# Benchmarking Crimes Reference
+
+Source: Gernot Heiser, “Systems Benchmarking Crimes”
+https://gernot-heiser.org/benchmarking-crimes.html
+
+This file is a compact local index. Use `skills/benchmark-crime-auditor.md` for the operational checklist.
+
+## Crime Index
+
+### 1. Selective Benchmarking
+
+- Not evaluating potential performance degradation.
+- Benchmark subsetting without strong justification.
+- Selecting a data set that hides deficiencies.
+
+Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.
+
+### 2. Improper Handling of Benchmark Results
+
+- Pretending microbenchmarks represent overall performance.
+- Treating an x% throughput degradation as an x% overhead, ignoring changes in CPU load.
+- Downplaying overheads through wrong arithmetic or wrong denominators.
+- No indication of statistical significance.
+- Taking the arithmetic mean of normalized benchmark scores instead of the geometric mean.
+
+Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, and overheads are not interchangeable.
+
+### 3. Using the Wrong Benchmarks
+
+- Benchmarking simplified simulated systems without validating assumptions.
+- Inappropriate or misleading benchmarks.
+- Using the same data for calibration and validation.
+
+Core idea: the workload must stress the phenomenon behind the claim.
+
+### 4. Improper Comparison of Benchmark Results
+
+- No proper baseline.
+- Evaluating only against yourself.
+- Unfair benchmarking of competitors.
+- Inflating gains by not comparing against the state of the art.
+
+Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.
+
+### 5. Missing Information
+
+- Missing platform specification.
+- Missing sub-benchmark results.
+- Relative numbers only.
+
+Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.
+
+## Best Practice Index
+
+- Ensure quiescence before timing.
+- Make benchmarking part of regression testing.
+- Document commands, platform, versions, and configuration.
+- Verify transferred or stored data.
+- Use different data per run when stale data or caching could matter.
+- Repeat each data point both consecutively and separated in time.
+- Invert measurement order to detect interference.
+- Include non-regular and pathological data points.
+- Compare configurations at exactly the same data points.
+- Run several trials and report variance.
+- Use principled outlier handling.
+- Warm up before timed runs.
+- Use enough iterations that clock granularity becomes negligible.
+- Eliminate loop overhead from the measured time.
+- Inspect generated machine code for low-level timing loops when needed.
+
+## Local Policy
+
+In this agent context, benchmarking crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, an unfair baseline, wrong overhead accounting, or omission of the state of the art is `Blocking` until fixed.