agentic-pd-hybrid/scripts
Gahow Wang dbb9eee471 feat(analysis): paired comparison with bootstrap CI
Implements docs/EVALUATION_PROTOCOL_ZH.md §2.2 (M2 fix):
comparisons of mechanism A vs B on the same trace must be
paired on the same trial mask, with errors and aborts
surfaced rather than silently dropped.

How it differs from scripts/analysis/compare_no_error.py:
  - works on raw request-metrics.jsonl (not pre-aggregated
    summary.json) so it can recompute paired masks
  - reports 95% bootstrap CIs for mean / p50 / p90
  - reports the intersection size and the per-side failure
    counts within it, so the reader can see how many rows
    were dropped from the comparison and whether the
    candidate's win came from selection effects
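
A minimal sketch of the pairing step described above, not the script's
actual code. The field names `request_id` and `error`, and the helper
names `load_rows`, `is_ok`, and `paired_mask`, are assumptions made for
illustration; the real request-metrics.jsonl schema may differ.

```python
import json


def load_rows(lines):
    """Parse JSONL lines into {request_id: row}. `request_id` is an assumed key."""
    rows = {}
    for line in lines:
        line = line.strip()
        if line:
            row = json.loads(line)
            rows[row["request_id"]] = row
    return rows


def is_ok(row, metric):
    """A row counts as successful if it has no error and carries the metric.

    The `error` field is a hypothetical way of marking failures and aborts.
    """
    return not row.get("error") and row.get(metric) is not None


def paired_mask(baseline, candidate, metric="latency_s"):
    """Pair on request ids present in both runs and surface failures per side."""
    common = sorted(baseline.keys() & candidate.keys())
    paired = [
        k for k in common
        if is_ok(baseline[k], metric) and is_ok(candidate[k], metric)
    ]
    return {
        "common_size": len(common),
        "baseline_fail_in_common": sum(
            not is_ok(baseline[k], metric) for k in common
        ),
        "candidate_fail_in_common": sum(
            not is_ok(candidate[k], metric) for k in common
        ),
        "paired_size": len(paired),
        "paired_ids": paired,
    }
```

Returning the failure counts next to `paired_size` keeps the dropped rows
visible: a large `candidate_fail_in_common` suggests the candidate's
surviving rows may be an easier subset than the baseline's.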

stdlib only — random.Random for bootstrap, no scipy/numpy.
Default 2000 bootstrap iterations; seed is configurable
for reproducibility.
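
For illustration, a percentile bootstrap over paired deltas can be written
with the standard library alone. This is a sketch under assumptions: the
function name `bootstrap_ci`, its parameters, and the choice of the
percentile method are hypothetical, and the script may compute its
intervals differently.

```python
import random
import statistics


def bootstrap_ci(deltas, stat, iterations=2000, seed=0, level=0.95):
    """Percentile bootstrap CI for `stat` over paired deltas.

    Resamples the deltas with replacement using a seeded random.Random,
    so the same seed always yields the same interval.
    """
    if not deltas:
        raise ValueError("need at least one paired delta")
    rng = random.Random(seed)
    n = len(deltas)
    estimates = sorted(
        stat(rng.choices(deltas, k=n)) for _ in range(iterations)
    )
    alpha = (1.0 - level) / 2.0
    lo_idx = int(alpha * iterations)
    hi_idx = min(iterations - 1, int((1.0 - alpha) * iterations))
    return estimates[lo_idx], estimates[hi_idx]


# Example: 95% CI for the mean and the median (p50) of some deltas.
deltas = [-5.2, -4.9, -5.1, -4.8, -5.0, -5.3, -4.7]
mean_ci = bootstrap_ci(deltas, statistics.fmean, seed=42)
p50_ci = bootstrap_ci(deltas, statistics.median, seed=42)
```

Resampling whole deltas, rather than each side independently, preserves
the pairing between baseline and candidate rows.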

Verified locally on a synthetic 20-row pair (5s constant
delta + one candidate failure): correctly reports
paired_size=19, candidate_fail_in_common=1, mean delta
-5.000s, 19/0/0 win/loss/tie.
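
A synthetic check of this kind might look like the sketch below. The row
layout, the request ids, and the convention that delta = candidate -
baseline (so a negative delta counts as a candidate win) are assumptions
for illustration, not taken from the script.

```python
import statistics

# 20 paired rows with a constant 5 s gap; values chosen so the
# subtraction is exact in floating point.
baseline = {
    f"req-{i:02d}": {"latency_s": 10.0 + i, "error": None} for i in range(20)
}
candidate = {
    f"req-{i:02d}": {"latency_s": 5.0 + i, "error": None} for i in range(20)
}
# Inject one candidate failure (hypothetical error marker).
candidate["req-07"] = {"latency_s": None, "error": "aborted"}


def ok(row):
    return row["error"] is None and row["latency_s"] is not None


common = sorted(baseline.keys() & candidate.keys())
paired = [k for k in common if ok(baseline[k]) and ok(candidate[k])]
deltas = [candidate[k]["latency_s"] - baseline[k]["latency_s"] for k in paired]

summary = {
    "paired_size": len(paired),
    "candidate_fail_in_common": sum(not ok(candidate[k]) for k in common),
    "mean_delta_s": statistics.fmean(deltas),
    "win": sum(d < 0 for d in deltas),   # candidate faster
    "loss": sum(d > 0 for d in deltas),  # candidate slower
    "tie": sum(d == 0 for d in deltas),
}
print(summary)
```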

CLI:
  scripts/analysis/paired_compare.py \
      --baseline outputs/run-dp/request-metrics.jsonl \
      --candidate outputs/run-kvc/request-metrics.jsonl \
      [--metric latency_s|ttft_s|tpot_s] \
      [--bootstrap 5000] [--seed 42] [--json]
2026-05-12 23:57:57 +08:00