Benchmarking Crimes Reference

Source: Gernot Heiser, “Systems Benchmarking Crimes” https://gernot-heiser.org/benchmarking-crimes.html

This file is a compact local index. Use skills/benchmark-crime-auditor.md for the operational checklist.

Crime Index

1. Selective Benchmarking

Not evaluating potential performance degradation.
Benchmark subsetting without strong justification.
Selective data set hiding deficiencies.

Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.

2. Improper Handling of Benchmark Results

Pretending microbenchmarks represent overall performance.
Treating throughput degradation as equal to overhead.
Downplaying overheads through wrong arithmetic or wrong denominators.
No indication of statistical significance.
Arithmetic mean over normalized benchmark scores.

Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, and overheads are not interchangeable.
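Two of the items above come down to arithmetic, so a small worked sketch may help. The numbers and function names here are hypothetical and chosen only for illustration. The first part assumes overhead is measured as CPU cost per unit of work (utilisation divided by throughput); the second shows why an arithmetic mean over normalized scores can make each of two systems look better than the other, while the geometric mean does not depend on which system is the reference.

```python
import math


def per_unit_overhead(base_throughput, base_cpu, new_throughput, new_cpu):
    """Relative increase in CPU cost per unit of work.

    Cost per unit is CPU utilisation divided by throughput, so the result
    differs from the throughput drop whenever utilisation also changes.
    """
    base_cost = base_cpu / base_throughput
    new_cost = new_cpu / new_throughput
    return new_cost / base_cost - 1.0


def geometric_mean(ratios):
    """Geometric mean of positive normalized scores."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Hypothetical case: throughput falls from 100 to 90 units/s while CPU
# utilisation rises from 60% to 100%.
throughput_drop = 1.0 - 90.0 / 100.0
overhead = per_unit_overhead(100.0, 0.60, 90.0, 1.00)
print(f"throughput drop: {throughput_drop:.0%}")
print(f"per-unit overhead: {overhead:.0%}")

# Hypothetical scores of system B normalized to A, and of A normalized to B.
b_over_a = [2.0, 0.5]
a_over_b = [1.0 / r for r in b_over_a]

# The arithmetic mean is above 1 in both directions, which is contradictory.
print("arithmetic:", sum(b_over_a) / len(b_over_a), sum(a_over_b) / len(a_over_b))
# The geometric mean gives the same verdict whichever system is the reference.
print("geometric:", geometric_mean(b_over_a), geometric_mean(a_over_b))
```

In the first case a 10% throughput drop corresponds to a per-unit cost increase of about 85%, which is why the two figures should not be reported as if they were the same quantity.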

3. Using the Wrong Benchmarks

Benchmarking simplified simulated systems without validating assumptions.
Inappropriate or misleading benchmarks.
Using the same data for calibration and validation.

Core idea: the workload must stress the phenomenon behind the claim.

4. Improper Comparison of Benchmark Results

No proper baseline.
Only evaluate against yourself.
Unfair benchmarking of competitors.
Inflating gains by not comparing against the state of the art.

Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.

5. Missing Information

Missing platform specification.
Missing sub-benchmark results.
Relative numbers only.

Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.
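One possible way to keep absolute numbers and platform details attached to every result is to emit them together in a single record. The sketch below is an assumption about how that could look, not a prescribed format: the field names and the `make_report` helper are hypothetical, and only standard-library calls are used. A real report would also need hardware details (CPU model, memory, clock settings) that the standard library does not expose portably.

```python
import json
import os
import platform
import sys


def describe_platform():
    """Collect basic platform facts available from the standard library."""
    return {
        "system": platform.system(),
        "release": platform.release(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
        "command": sys.argv,
    }


def make_report(name, baseline_seconds, candidate_seconds):
    """Bundle absolute timings, the derived ratio, and platform details."""
    return {
        "benchmark": name,
        # Absolute values first, so the ratio can be sanity-checked.
        "baseline_seconds": baseline_seconds,
        "candidate_seconds": candidate_seconds,
        "ratio": candidate_seconds / baseline_seconds,
        "platform": describe_platform(),
    }


# Hypothetical timings, for illustration only.
report = make_report("example", 2.0, 1.5)
print(json.dumps(report, indent=2))
```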

Best Practice Index

Ensure quiescence before timing.
Make benchmarking part of regression testing.
Document commands, platform, versions, and configuration.
Verify transferred or stored data.
Use different data per run when stale data or caching could matter.
Repeat each data point both back-to-back and separated in time.
Invert measurement order to detect interference.
Include non-regular and pathological data points.
Compare configurations at exactly the same data points.
Run several trials and report variance.
Use principled outlier handling.
Warm up before timed runs.
Use enough iterations that clock granularity is negligible relative to the measured interval.
Eliminate loop overhead.
Inspect generated machine code for low-level timing loops when needed.
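Several of the practices above (warm-up, enough iterations per timed interval, several trials with variance, inverted measurement order) can be combined in a small harness. The following is a minimal sketch under stated assumptions, not the operational checklist: the function names, parameters, and defaults are hypothetical, and it does not subtract loop overhead, ensure quiescence, or handle outliers.

```python
import statistics
import time


def time_once(fn, iterations):
    """Return mean nanoseconds per call over `iterations` back-to-back calls.

    Timing many calls in one interval keeps clock granularity small
    relative to the measured duration.
    """
    start = time.perf_counter_ns()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


def benchmark(configs, trials=5, iterations=1000, warmup=100):
    """Time each named callable in `configs` and summarise the trials."""
    names = list(configs)
    samples = {name: [] for name in names}
    for trial in range(trials):
        # Invert the measurement order on every other trial so that
        # order-dependent interference shows up in the variance.
        order = names if trial % 2 == 0 else list(reversed(names))
        for name in order:
            fn = configs[name]
            for _ in range(warmup):  # untimed warm-up calls
                fn()
            samples[name].append(time_once(fn, iterations))
    return {
        name: {
            "mean_ns": statistics.mean(vals),
            "stdev_ns": statistics.stdev(vals) if len(vals) > 1 else 0.0,
            "min_ns": min(vals),
            "trials": len(vals),
        }
        for name, vals in samples.items()
    }


# Hypothetical workloads, for illustration only.
results = benchmark(
    {"small": lambda: sum(range(10)), "large": lambda: sum(range(1000))},
    trials=3,
    iterations=200,
    warmup=20,
)
for name, stats in results.items():
    print(name, stats)
```

Reporting the standard deviation next to the mean lets a reader judge whether a difference between configurations exceeds run-to-run noise.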

Local Policy

In this agent context, benchmarking crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, an unfair baseline, wrong overhead accounting, or omission of the state of the art is Blocking until fixed.
