Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix):
comparisons of mechanism A vs. B on the same trace must be
paired on the same trial mask, with errors and aborts surfaced
rather than silently dropped.
How it differs from scripts/analysis/compare_no_error.py:
- works on raw request-metrics.jsonl (not pre-aggregated
summary.json) so it can recompute paired masks
- reports 95% bootstrap CIs for mean / p50 / p90
- reports the intersection size and the per-side failure
count within the intersection, so the reader can see how
many rows were dropped from the comparison and whether the
candidate's win came from selection effects
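The pairing step described in the list above can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the `request_id` and `error` field names, and the helper names `load_rows`, `is_ok`, and `pair_rows`, are assumptions made for the example; only `latency_s` and the `paired_size` / `candidate_fail_in_common` counters come from this description.

```python
import json


def load_rows(lines):
    """Index JSONL rows by request id (hypothetical `request_id` field)."""
    rows = {}
    for line in lines:
        line = line.strip()
        if line:
            row = json.loads(line)
            rows[row["request_id"]] = row
    return rows


def is_ok(row, metric):
    # Assumption: a failed or aborted row has a truthy `error` field
    # or is missing the metric value.
    return not row.get("error") and row.get(metric) is not None


def pair_rows(baseline, candidate, metric="latency_s"):
    """Pair rows on request ids present on both sides and count failures."""
    common = sorted(baseline.keys() & candidate.keys())
    base_fail = sum(not is_ok(baseline[k], metric) for k in common)
    cand_fail = sum(not is_ok(candidate[k], metric) for k in common)
    # Only rows that succeeded on BOTH sides enter the comparison.
    pairs = [
        (baseline[k][metric], candidate[k][metric])
        for k in common
        if is_ok(baseline[k], metric) and is_ok(candidate[k], metric)
    ]
    return {
        "common_size": len(common),
        "baseline_fail_in_common": base_fail,
        "candidate_fail_in_common": cand_fail,
        "paired_size": len(pairs),
        "pairs": pairs,
    }
```

Reporting the failure counts next to `paired_size` is what lets a reader judge whether a win was earned on the shared rows or produced by one side dropping its hard requests.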
Stdlib only: random.Random for the bootstrap, no scipy/numpy.
Defaults to 2000 bootstrap iterations; the seed is configurable
for reproducibility.
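A stdlib-only percentile bootstrap over the paired deltas might look like the sketch below. It assumes a plain percentile interval and a nearest-rank percentile helper; the names `percentile` and `bootstrap_ci` are hypothetical, and the script may compute its intervals differently.

```python
import random
import statistics


def percentile(values, q):
    """Nearest-rank percentile of `values` for q in [0, 1]."""
    ordered = sorted(values)
    idx = int(round(q * (len(ordered) - 1)))
    return ordered[min(len(ordered) - 1, max(0, idx))]


def bootstrap_ci(deltas, stat, iterations=2000, seed=0, alpha=0.05):
    """Percentile bootstrap CI for `stat` over paired deltas."""
    rng = random.Random(seed)  # seeded so reruns give identical intervals
    n = len(deltas)
    estimates = []
    for _ in range(iterations):
        # Resample whole pairs (represented by their deltas) with replacement.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        estimates.append(stat(sample))
    return percentile(estimates, alpha / 2), percentile(estimates, 1 - alpha / 2)


# Usage: 95% intervals for the mean and the p90 of the deltas.
deltas = [-5.0, -4.0, -6.5, -3.0, -5.5, -4.5]
mean_ci = bootstrap_ci(deltas, statistics.fmean, seed=42)
p90_ci = bootstrap_ci(deltas, lambda s: percentile(s, 0.90), seed=42)
```

Resampling the deltas rather than each side independently keeps the pairing intact, which is the point of comparing on a shared mask.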
Verified locally on a synthetic 20-row pair (5s constant
delta + one candidate failure): correctly reports
paired_size=19, candidate_fail_in_common=1, mean delta
-5.000s, 19/0/0 win/loss/tie.
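The synthetic check can be approximated with the sketch below. It assumes the delta is candidate minus baseline and that a "win" means the candidate's value is lower; the `summarize` helper and the way the fixture is built are illustrative, not the script's own test.

```python
import statistics


def summarize(pairs):
    """Mean delta and win/loss/tie counts for (baseline, candidate) pairs."""
    deltas = [cand - base for base, cand in pairs]
    wins = sum(d < 0 for d in deltas)    # candidate lower than baseline
    losses = sum(d > 0 for d in deltas)  # candidate higher than baseline
    ties = len(deltas) - wins - losses
    return {
        "paired_size": len(deltas),
        "mean_delta": statistics.fmean(deltas),
        "win_loss_tie": (wins, losses, ties),
    }


# 20 synthetic rows with a constant 5 s improvement; row 7 is treated as a
# candidate failure and therefore excluded from the paired set.
pairs = [(10.0 + i, 5.0 + i) for i in range(20) if i != 7]
summary = summarize(pairs)
print(f"paired_size={summary['paired_size']} "
      f"mean delta {summary['mean_delta']:.3f}s "
      f"win/loss/tie {summary['win_loss_tie']}")
```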
CLI:
scripts/analysis/paired_compare.py \\
--baseline outputs/run-dp/request-metrics.jsonl \\
--candidate outputs/run-kvc/request-metrics.jsonl \\
[--metric latency_s|ttft_s|tpot_s] \\
[--bootstrap 5000] [--seed 42] [--json]