Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix):
headline numbers must be accompanied by stratified
breakdowns so reviewers can see which slice the gains
come from.
The script reads one or more request-metrics.jsonl files
and buckets rows along four orthogonal dimensions:
- turn_id : {1, 2-5, 6-20, 21+}
- input_length : {<=8K, 8K-64K, >64K}
- overlap_ratio : {<=0.3, 0.3-0.7, >0.7}
- append_tokens : {<=128, 128-1K, 1K-8K, >8K}
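
A minimal sketch of how the bucketing above could work, not the script's actual code. The `BUCKETS` table and `bucket_label` helper are hypothetical names. It assumes K means 1024 and that a value equal to a boundary falls in the lower bucket; neither is stated in the bucket list.

```python
import bisect

# Hypothetical bucket tables: (ascending upper bounds, labels).
# Assumptions: K = 1024, and each upper bound is inclusive.
BUCKETS = {
    "turn_id": ([1, 5, 20], ["1", "2-5", "6-20", "21+"]),
    "input_length": ([8 * 1024, 64 * 1024], ["<=8K", "8K-64K", ">64K"]),
    "overlap_ratio": ([0.3, 0.7], ["<=0.3", "0.3-0.7", ">0.7"]),
    "append_tokens": ([128, 1024, 8 * 1024], ["<=128", "128-1K", "1K-8K", ">8K"]),
}


def bucket_label(dimension, value):
    """Return the label of the bucket that `value` falls into."""
    bounds, labels = BUCKETS[dimension]
    # bisect_left gives the index of the first bound >= value, so a value
    # equal to a bound lands in the lower bucket; values above every bound
    # get the last label.
    return labels[bisect.bisect_left(bounds, value)]
```

Each row gets one label per dimension, so the four breakdowns can be reported side by side.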
Per bucket: n, n_ok, err_pct, latency/ttft mean+p50+p90+p99.
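
The per-bucket statistics can be sketched with the standard library alone. This is an illustration under assumptions, not the script's implementation: the row fields `bucket`, `ok`, `latency`, and `ttft` are hypothetical names, percentiles use the nearest-rank method, and latency/ttft statistics are computed over successful rows only.

```python
import math
from collections import defaultdict


def percentile(sorted_values, pct):
    """Nearest-rank percentile of a non-empty, ascending list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(rows):
    """Group rows by their (hypothetical) 'bucket' field and aggregate."""
    groups = defaultdict(list)
    for row in rows:
        groups[row["bucket"]].append(row)

    summary = {}
    for label, members in groups.items():
        ok_rows = [r for r in members if r.get("ok")]
        n, n_ok = len(members), len(ok_rows)
        stats = {"n": n, "n_ok": n_ok,
                 "err_pct": round(100 * (n - n_ok) / n, 1)}
        for metric in ("latency", "ttft"):
            # Assumption: failed rows carry no usable timing, so skip them.
            values = sorted(r[metric] for r in ok_rows
                            if r.get(metric) is not None)
            if values:
                stats[metric] = {
                    "mean": sum(values) / len(values),
                    "p50": percentile(values, 50),
                    "p90": percentile(values, 90),
                    "p99": percentile(values, 99),
                }
        summary[label] = stats
    return summary
```

With small buckets the high percentiles collapse onto the maximum value, which is one reason to report n next to them.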
Output is markdown by default; pass --json for machine-readable output.
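
A sketch of the output switch, assuming `--json` is a plain boolean flag. The `render` and `render_markdown` helpers and the table columns are hypothetical, and the summary shape is an assumption, not the script's actual output format.

```python
import argparse
import json


def render_markdown(summary):
    """Render a {bucket: stats} mapping as a small markdown table."""
    lines = ["| bucket | n | n_ok | err_pct |", "|---|---|---|---|"]
    for label, stats in summary.items():
        lines.append(
            f"| {label} | {stats['n']} | {stats['n_ok']} | {stats['err_pct']:.1f} |"
        )
    return "\n".join(lines)


def render(summary, argv):
    """Pick the output format from the command-line arguments."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--json", action="store_true",
                        help="emit JSON instead of markdown")
    args = parser.parse_args(argv)
    if args.json:
        return json.dumps(summary, indent=2)
    return render_markdown(summary)
```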
stdlib only, no pandas/numpy. Verified on a synthetic
5-row jsonl: the turn_id=1 bucket, which contains one
error, correctly reports err_pct = 33.3%.
Why this script and not pandas:
- the existing scripts/analysis/* scripts are stdlib-only,
and this one stays consistent with them
- reviewers can run it on the artifact without
pip-installing anything beyond pytest
- speed is not a concern: it runs in <1s on the largest
existing sweep (4449 rows)
Usage is shown in docs/EVALUATION_PROTOCOL_ZH.md §3.