agentic-ctx/references/benchmarking-crimes.md

# Benchmarking Crimes Reference

Source: Gernot Heiser, “Systems Benchmarking Crimes”
https://gernot-heiser.org/benchmarking-crimes.html

This file is a compact local index. Use `skills/benchmark-crime-auditor.md` for the operational checklist.

## Crime Index

### 1. Selective Benchmarking

- Not evaluating potential performance degradation.
- Benchmark subsetting without strong justification.
- Selecting data sets that hide deficiencies.

Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.
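
As an illustration of the core idea, the following minimal sketch compares a baseline and a candidate across every benchmark in a suite and reports regressions alongside improvements. The function name `classify_suite`, the benchmark names, the runtimes, and the 5% tolerance are all hypothetical and chosen only for this example.

```python
def classify_suite(baseline, candidate, tolerance=0.05):
    """Split a suite into improved, regressed, and neutral benchmarks.

    Both inputs map benchmark name -> runtime in seconds (lower is better).
    The tolerance is an assumed noise band, not a recommended value.
    """
    # Refuse silent subsetting: both systems must cover the same benchmarks.
    if baseline.keys() != candidate.keys():
        missing = sorted(baseline.keys() ^ candidate.keys())
        raise ValueError(f"suites differ, unexplained subsetting: {missing}")

    report = {"improved": [], "regressed": [], "neutral": []}
    for name in sorted(baseline):
        change = candidate[name] / baseline[name] - 1.0  # relative runtime change
        if change < -tolerance:
            report["improved"].append(name)
        elif change > tolerance:
            report["regressed"].append(name)
        else:
            report["neutral"].append(name)
    return report


# Made-up numbers: one win, one regression, one change within the noise band.
baseline = {"scan": 2.00, "join": 4.00, "sort": 1.00}
candidate = {"scan": 1.20, "join": 4.10, "sort": 1.30}
report = classify_suite(baseline, candidate)
print(report)
```

Reporting the `regressed` list next to the headline win is the point: the reader sees the cost outside the target regime instead of having to guess it.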

### 2. Improper Handling of Benchmark Results

- Pretending microbenchmarks represent overall performance.
- Treating a throughput degradation of x% as an overhead of x%, without accounting for the change in CPU load.
- Downplaying overheads through wrong arithmetic or wrong denominators.
- No indication of statistical significance.
- Taking the arithmetic mean of normalized benchmark scores instead of the geometric mean.

Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, and overheads are not interchangeable.
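
Two of the items above are easy to show with arithmetic. The sketch below uses illustrative numbers (a 10% throughput drop while CPU load rises from 60% to 100%, and two made-up normalized scores); the helper names `processing_overhead` and `geometric_mean` are assumptions for this example only.

```python
import math


def processing_overhead(base_tput, base_cpu, new_tput, new_cpu):
    """Relative increase in CPU cost per unit of work."""
    base_cost = base_cpu / base_tput  # CPU fraction spent per unit of throughput
    new_cost = new_cpu / new_tput
    return new_cost / base_cost - 1.0


def geometric_mean(ratios):
    """Geometric mean of positive ratios, computed via logarithms."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Throughput falls by 10%, but CPU load rises from 60% to 100%.
tput_drop = 1.0 - 90.0 / 100.0
overhead = processing_overhead(100.0, 0.60, 90.0, 1.00)
print(f"throughput drop: {tput_drop:.0%}, cost per unit of work: +{overhead:.0%}")

# Normalized scores: system A relative to B on two benchmarks, and the inverse.
a_over_b = [2.0, 0.5]
b_over_a = [1.0 / r for r in a_over_b]

# The arithmetic mean says each system beats the other, depending on the reference.
arith_ab = sum(a_over_b) / len(a_over_b)
arith_ba = sum(b_over_a) / len(b_over_a)

# The geometric mean gives the same answer whichever system is the reference.
geo_ab = geometric_mean(a_over_b)
geo_ba = geometric_mean(b_over_a)
print(arith_ab, arith_ba, geo_ab, geo_ba)
```

With these numbers the cost per unit of work rises by roughly 85%, far more than the 10% throughput drop suggests, and both arithmetic means come out at 1.25 while both geometric means are 1.0.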

### 3. Using the Wrong Benchmarks

- Benchmarking simplified simulated systems without validating assumptions.
- Inappropriate or misleading benchmarks.
- Using the same data for calibration and validation.

Core idea: the workload must stress the phenomenon behind the claim.
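
For the calibration and validation item, a minimal sketch of keeping the two data sets disjoint is shown here. The linear cost model, the helper names, and the measurements are hypothetical; the point is only that the error is reported on data the model was not fitted to.

```python
def fit_cost_per_item(samples):
    """Least-squares fit of runtime = k * size through the origin."""
    num = sum(size * runtime for size, runtime in samples)
    den = sum(size * size for size, _ in samples)
    return num / den


def max_relative_error(k, samples):
    """Worst-case relative prediction error of the model on the given samples."""
    return max(abs(k * size - runtime) / runtime for size, runtime in samples)


# Made-up (size, runtime in seconds) measurements.
measurements = [
    (100, 1.0), (200, 2.1), (400, 3.9),
    (800, 8.2), (1600, 15.8), (3200, 32.5),
]

# Calibrate on one half, validate on the other; the halves must not overlap.
calibration = measurements[0::2]
validation = measurements[1::2]
assert not set(calibration) & set(validation)

k = fit_cost_per_item(calibration)
validation_error = max_relative_error(k, validation)
print(f"k = {k:.5f} s/item, worst validation error = {validation_error:.1%}")
```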

### 4. Improper Comparison of Benchmark Results

- No proper baseline.
- Evaluating only against yourself.
- Unfair benchmarking of competitors.
- Inflating gains by not comparing against the state of the art.

Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.

### 5. Missing Information

- Missing platform specification.
- Missing sub-benchmark results.
- Relative numbers only.

Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.
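
One way to avoid the "relative numbers only" and "missing platform specification" items is to store absolute values and a platform description next to every ratio. The sketch below uses only Python standard-library calls; the record layout, the field names, and the numbers are assumptions for illustration.

```python
import json
import os
import platform
import sys


def describe_platform():
    """Collect basic platform facts available from the standard library."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "processor": platform.processor(),
        "cpu_count": os.cpu_count(),
        "python": platform.python_version(),
        "command": sys.argv,
    }


def make_record(name, baseline_s, candidate_s):
    """Keep the absolute runtimes alongside the derived ratio."""
    return {
        "benchmark": name,
        "baseline_seconds": baseline_s,
        "candidate_seconds": candidate_s,
        "speedup": baseline_s / candidate_s,
        "platform": describe_platform(),
    }


# Made-up runtimes for a hypothetical "sort" benchmark.
record = make_record("sort", baseline_s=1.30, candidate_s=1.00)
print(json.dumps(record, indent=2))
```

A real report would likely also need hardware details that the standard library does not expose, such as memory size, clock rates, and storage; those would have to come from platform-specific tools.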

## Best Practice Index

- Ensure quiescence before timing.
- Make benchmarking part of regression testing.
- Document commands, platform, versions, and configuration.
- Verify transferred or stored data.
- Use different data per run when stale data or caching could matter.
- Repeat each data point both consecutively and separated in time.
- Invert measurement order to detect interference.
- Include non-regular and pathological data points.
- Compare configurations at exactly the same data points.
- Run several trials and report variance.
- Use principled outlier handling.
- Warm up before timed runs.
- Use enough iterations that clock granularity is negligible relative to the measured interval.
- Eliminate loop overhead, for example by timing an empty loop and subtracting it.
- Inspect generated machine code for low-level timing loops when needed.
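
Several of these practices (warm-up, enough iterations, loop-overhead elimination, repeated trials, reported variance) can be combined in a small timing harness. The following is a minimal sketch, not a substitute for a proper benchmarking tool; `time_trials`, `workload`, and the iteration, trial, and warm-up counts are hypothetical choices.

```python
import statistics
import time


def time_trials(func, iterations=10_000, trials=7, warmup=2):
    """Return the per-call time in nanoseconds for each timed trial."""

    def run_loop(body):
        # Many iterations per reading keep clock granularity small
        # relative to the measured interval.
        start = time.perf_counter_ns()
        for _ in range(iterations):
            body()
        return time.perf_counter_ns() - start

    def empty():
        pass

    # Untimed warm-up runs before any measurement is recorded.
    for _ in range(warmup):
        run_loop(func)

    results = []
    for _ in range(trials):
        # Time an empty body and subtract it to remove loop and call overhead.
        overhead = run_loop(empty)
        total = run_loop(func)
        results.append(max(total - overhead, 0) / iterations)
    return results


def workload():
    # Stand-in for the code under test.
    sum(range(100))


per_call_ns = time_trials(workload)
mean_ns = statistics.mean(per_call_ns)
stdev_ns = statistics.stdev(per_call_ns)
print(f"{mean_ns:.1f} ns/call, stdev {stdev_ns:.1f} ns over {len(per_call_ns)} trials")
```

The sketch does not cover quiescence, measurement-order inversion, or outlier handling, and in an interpreted language the advice about inspecting machine code does not apply directly.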

## Local Policy

In this agent context, benchmark crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, an unfair baseline, wrong overhead accounting, or omission of the state of the art (SOTA) is `Blocking` until fixed.