# Benchmarking Crimes Reference

Source: Gernot Heiser, “Systems Benchmarking Crimes”
https://gernot-heiser.org/benchmarking-crimes.html

This file is a compact local index. Use `skills/benchmark-crime-auditor.md` for the operational checklist.

## Crime Index

### 1. Selective Benchmarking

- Not evaluating potential performance degradation.
- Benchmark subsetting without strong justification.
- Selective data set hiding deficiencies.

Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.

### 2. Improper Handling of Benchmark Results

- Pretending microbenchmarks represent overall performance.
- Treating throughput degradation as equal to overhead.
- Downplaying overheads through wrong arithmetic or wrong denominators.
- No indication of statistical significance.
- Arithmetic mean over normalized benchmark scores.

Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, averages, and overheads are not interchangeable.

### 3. Using the Wrong Benchmarks

- Benchmarking simplified simulated systems without validating assumptions.
- Inappropriate or misleading benchmarks.
- Using the same data for calibration and validation.

Core idea: the workload must stress the phenomenon behind the claim.

### 4. Improper Comparison of Benchmark Results

- No proper baseline.
- Only evaluate against yourself.
- Unfair benchmarking of competitors.
- Inflating gains by not comparing against the state of the art.

Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.

### 5. Missing Information

- Missing platform specification.
- Missing sub-benchmark results.
- Relative numbers only.

Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.

## Best Practice Index

- Ensure quiescence before timing.
- Make benchmarking part of regression testing.
- Document commands, platform, versions, and configuration.
- Verify transferred or stored data.
- Use different data per run when stale data or caching could matter.
- Repeat points consecutively and separated in time.
- Invert measurement order to detect interference.
- Include non-regular and pathological data points.
- Compare configurations at exactly the same data points.
- Run several trials and report variance.
- Use principled outlier handling.
- Warm up before timed runs.
- Use enough iterations to reduce clock granularity.
- Eliminate loop overhead.
- Inspect generated machine code for low-level timing loops when needed.

## Local Policy

In this agent context, benchmark crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, unfair baseline, wrong overhead accounting, or SOTA omission is `Blocking` until fixed.
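
## Worked Example: Crime 2 Arithmetic

The arithmetic behind two items in crime 2 (arithmetic mean over normalized scores, and treating throughput degradation as overhead) can be sketched in a few lines of Python. This is an illustrative sketch only: the function names and all numbers are hypothetical and are not taken from the source page. It assumes run times where lower is better and CPU utilisation given as a fraction.

```python
import math


def arithmetic_mean(values):
    """Plain average; misleading when applied to normalized ratios."""
    return sum(values) / len(values)


def geometric_mean(ratios):
    """Geometric mean of positive ratios; independent of the chosen baseline."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


def cost_per_unit_overhead(base_throughput, base_cpu, new_throughput, new_cpu):
    """Relative increase in CPU cost per unit of work.

    CPU utilisation is a fraction in (0, 1]; throughput may use any
    consistent unit. Cost per unit of work is utilisation / throughput.
    """
    base_cost = base_cpu / base_throughput
    new_cost = new_cpu / new_throughput
    return new_cost / base_cost - 1.0


# Hypothetical run times in seconds for systems A and B on two benchmarks.
times_a = [1.0, 4.0]
times_b = [2.0, 2.0]

# Normalize each system against the other.
b_relative_to_a = [b / a for a, b in zip(times_a, times_b)]  # [2.0, 0.5]
a_relative_to_b = [a / b for a, b in zip(times_a, times_b)]  # [0.5, 2.0]

# The arithmetic mean says each system is 25% slower than the other,
# which cannot both be true. The geometric mean is about 1.0 either way.
am_b = arithmetic_mean(b_relative_to_a)
am_a = arithmetic_mean(a_relative_to_b)
gm_b = geometric_mean(b_relative_to_a)
gm_a = geometric_mean(a_relative_to_b)

# Hypothetical case: throughput drops from 100 to 90 units/s while CPU
# utilisation rises from 60% to 100%.
throughput_drop = 1.0 - 90.0 / 100.0  # about 0.10
true_overhead = cost_per_unit_overhead(100.0, 0.60, 90.0, 1.00)  # about 0.85

if __name__ == "__main__":
    print(f"arithmetic means: {am_b:.2f} vs {am_a:.2f}")
    print(f"geometric means:  {gm_b:.2f} vs {gm_a:.2f}")
    print(f"throughput drop:  {throughput_drop:.0%}")
    print(f"cost overhead:    {true_overhead:.0%}")
```

In this made-up case, reporting the throughput drop alone would understate the per-unit processing cost by a wide margin, which is why the audit asks for utilisation alongside throughput.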