agentic-ctx/references/benchmarking-crimes.md
2026-05-13 21:36:34 +08:00

Benchmarking Crimes Reference

Source: Gernot Heiser, “Systems Benchmarking Crimes” https://gernot-heiser.org/benchmarking-crimes.html

This file is a compact local index. Use skills/benchmark-crime-auditor.md for the operational checklist.

Crime Index

1. Selective Benchmarking

  • Not evaluating potential performance degradation.
  • Benchmark subsetting without strong justification.
  • Selective data set hiding deficiencies.

Core idea: a fair performance evaluation must show both improvement in the target regime and acceptable behavior elsewhere.
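One way to act on this is to report every benchmark in the suite, not only the ones that improved, and to surface anything that was silently dropped. The sketch below assumes per-benchmark runtimes in seconds (lower is better) and a 2% noise tolerance; the function name `summarize`, the benchmark names, and the numbers are all hypothetical and not taken from the source.

```python
def summarize(baseline, candidate, tolerance=0.02):
    """Classify every shared benchmark and list the ones missing from candidate.

    Both arguments map benchmark name -> runtime in seconds (lower is better).
    The tolerance is an assumed noise band, not a recommended value.
    """
    missing = sorted(set(baseline) - set(candidate))
    rows = []
    for name in sorted(set(baseline) & set(candidate)):
        ratio = candidate[name] / baseline[name]
        if ratio > 1.0 + tolerance:
            verdict = "regression"
        elif ratio < 1.0 - tolerance:
            verdict = "improvement"
        else:
            verdict = "neutral"
        rows.append((name, ratio, verdict))
    return rows, missing


# Hypothetical results: one win, one loss, one benchmark that was not run.
baseline = {"compile": 10.0, "fileio": 5.0, "network": 2.0}
candidate = {"compile": 7.0, "fileio": 6.0}

rows, missing = summarize(baseline, candidate)
for name, ratio, verdict in rows:
    print(f"{name}: {ratio:.2f}x baseline time ({verdict})")
print("not run:", missing)
```

A non-empty `missing` list is a prompt to justify the subsetting, and any `regression` row belongs in the write-up next to the headline improvement.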

2. Improper Handling of Benchmark Results

  • Pretending microbenchmarks represent overall performance.
  • Treating throughput degradation as equal to overhead.
  • Downplaying overheads through wrong arithmetic or wrong denominators.
  • No indication of statistical significance.
  • Arithmetic mean over normalized benchmark scores.

Core idea: performance numbers must preserve the meaning of the measured quantity. Ratios, means, and overheads are not interchangeable.
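Two of the items above can be made concrete with a little arithmetic. The sketch below uses illustrative numbers only: a system whose throughput drops by 10% while CPU load rises from 60% to 100%, and a pair of normalized scores. The helper name `cost_overhead` is hypothetical.

```python
from statistics import geometric_mean, mean


def cost_overhead(base_throughput, base_cpu_load, new_throughput, new_cpu_load):
    """Relative increase in CPU cost per unit of work done."""
    base_cost = base_cpu_load / base_throughput
    new_cost = new_cpu_load / new_throughput
    return new_cost / base_cost - 1.0


# Throughput falls from 100 to 90 units/s while CPU load goes from 60% to 100%.
throughput_drop = 1.0 - 90.0 / 100.0
overhead = cost_overhead(100.0, 0.60, 90.0, 1.00)
print(f"throughput drop: {throughput_drop:.0%}, cost overhead: {overhead:.0%}")

# Normalized scores of system A relative to B on two benchmarks, and the inverse.
a_over_b = [2.0, 0.5]
b_over_a = [1.0 / x for x in a_over_b]

# The arithmetic mean depends on which system is chosen as the reference.
print("arithmetic:", mean(a_over_b), mean(b_over_a))
# The geometric mean gives the same verdict in both directions.
print("geometric: ", geometric_mean(a_over_b), geometric_mean(b_over_a))
```

With these inputs the cost per unit of work rises by roughly 85%, far more than the 10% throughput drop suggests. The arithmetic mean comes out at 1.25 in both directions, so each system appears to beat the other, while the geometric mean is 1.0 either way.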

3. Using the Wrong Benchmarks

  • Benchmarking simplified simulated systems without validating assumptions.
  • Inappropriate or misleading benchmarks.
  • Using the same data for calibration and validation.

Core idea: the workload must stress the phenomenon behind the claim.
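For the calibration and validation item, a minimal sketch of keeping the two data sets disjoint is shown below. It assumes the inputs are a list of trace identifiers; the function name `split_traces`, the 50/50 split, and the trace names are hypothetical.

```python
import random


def split_traces(traces, validation_fraction=0.5, seed=0):
    """Shuffle deterministically, then cut into disjoint calibration/validation sets."""
    shuffled = list(traces)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - validation_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical trace identifiers.
traces = [f"trace-{i:02d}" for i in range(10)]
calibration, validation = split_traces(traces)

# A model tuned on `calibration` must be scored only on `validation`.
overlap = set(calibration) & set(validation)
print("calibration:", calibration)
print("validation: ", validation)
print("overlap:    ", overlap)
```

Fixing the seed keeps the split reproducible, so the same partition can be documented alongside the results.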

4. Improper Comparison of Benchmark Results

  • No proper baseline.
  • Only evaluate against yourself.
  • Unfair benchmarking of competitors.
  • Inflating gains by not comparing against the state of the art.

Core idea: a comparison is only meaningful if the reference point is the one readers need to judge the claim.

5. Missing Information

  • Missing platform specification.
  • Missing sub-benchmark results.
  • Relative numbers only.

Core idea: readers need enough absolute and contextual information to reproduce, sanity-check, and interpret results.
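As a rough illustration, a result line can carry the absolute numbers next to the ratio, and a platform record can be captured with the standard library. The function names `platform_record` and `report` and the timings are hypothetical; a real report would likely need more detail (CPU model, memory, kernel and compiler versions) than this sketch collects.

```python
import platform


def platform_record():
    """Collect a minimal platform description from the standard library."""
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": platform.python_version(),
    }


def report(name, baseline_s, candidate_s):
    """Format absolute runtimes together with the relative number."""
    ratio = candidate_s / baseline_s
    return (
        f"{name}: baseline {baseline_s:.3f} s, "
        f"candidate {candidate_s:.3f} s ({ratio:.2f}x)"
    )


record = platform_record()
line = report("sort", 2.0, 1.5)  # hypothetical timings
print(record)
print(line)
```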

Best Practice Index

  • Ensure quiescence before timing.
  • Make benchmarking part of regression testing.
  • Document commands, platform, versions, and configuration.
  • Verify transferred or stored data.
  • Use different data per run when stale data or caching could matter.
  • Repeat each data point both consecutively and separated in time.
  • Invert measurement order to detect interference.
  • Include non-regular and pathological data points.
  • Compare configurations at exactly the same data points.
  • Run several trials and report variance.
  • Use principled outlier handling.
  • Warm up before timed runs.
  • Use enough iterations that clock granularity becomes negligible.
  • Eliminate loop overhead.
  • Inspect generated machine code for low-level timing loops when needed.
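Several of these practices can be combined in a small timing harness. The sketch below is a minimal illustration, not a complete tool: it warms up before timing, runs many iterations per sample, repeats trials, alternates the measurement order between trials, and reports mean and standard deviation. The names `time_once` and `measure` and the two example workloads are hypothetical. Loop overhead, quiescence, data verification, and outlier handling are deliberately left out.

```python
import statistics
import time


def time_once(fn, iterations):
    """Average wall-clock seconds per call over a batch of iterations."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    return (time.perf_counter() - start) / iterations


def measure(configs, trials=5, warmup=2, iterations=1000):
    """Return {name: (mean_seconds, stdev_seconds)} for each configuration.

    Requires trials >= 2 so that a standard deviation can be computed.
    """
    samples = {name: [] for name in configs}
    order = list(configs)
    for trial in range(trials):
        # Invert the order on odd trials to expose order-dependent interference.
        run_order = order if trial % 2 == 0 else list(reversed(order))
        for name in run_order:
            fn = configs[name]
            for _ in range(warmup):
                fn()  # untimed warm-up calls
            samples[name].append(time_once(fn, iterations))
    return {
        name: (statistics.mean(s), statistics.stdev(s))
        for name, s in samples.items()
    }


# Hypothetical workloads standing in for two configurations under test.
results = measure(
    {
        "sum_range": lambda: sum(range(100)),
        "sorted_list": lambda: sorted([5, 3, 8, 1, 9, 2]),
    },
    trials=3,
    iterations=200,
)
for name, (avg, dev) in results.items():
    print(f"{name}: {avg * 1e6:.2f} us/call (stdev {dev * 1e6:.2f} us)")
```

Both configurations are timed with the same trial count and iteration count, so they are compared at the same data points.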

Local Policy

In this agent context, benchmark crimes are handled as research-validity issues. A headline claim affected by selective benchmarking, an unfair baseline, wrong overhead accounting, or omission of the state of the art is Blocking until fixed.
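The blocking rule can be written as a small lookup. This is only a sketch: the crime identifiers, the function name `severity`, and the "Unclassified" label for every other case are assumptions, since the policy above defines only the Blocking case.

```python
# Crimes that block a headline claim under the local policy (identifiers are hypothetical).
BLOCKING_CRIMES = frozenset(
    {
        "selective_benchmarking",
        "unfair_baseline",
        "wrong_overhead_accounting",
        "sota_omission",
    }
)


def severity(crime, affects_headline_claim):
    """Return "Blocking" only for the case the policy defines explicitly."""
    if affects_headline_claim and crime in BLOCKING_CRIMES:
        return "Blocking"
    # The policy does not say how other findings are graded.
    return "Unclassified"


print(severity("unfair_baseline", affects_headline_claim=True))
print(severity("missing_platform_specification", affects_headline_claim=True))
```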