Commit Graph

4021f27ee2 feat(analysis): stratified latency / TTFT reporter
Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix):
headline numbers must be accompanied by stratified
breakdowns so reviewers can see which slice the gains
come from.

The script reads one or more request-metrics.jsonl files
and buckets rows along four orthogonal dimensions:
  - turn_id        : {1, 2-5, 6-20, 21+}
  - input_length   : {<=8K, 8K-64K, >64K}
  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens  : {<=128, 128-1K, 1K-8K, >8K}
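
The bucket edges above could be expressed roughly as follows. This is a
minimal sketch, not the script's actual code: the function names are
hypothetical, the row keys are assumed to match the dimension names, "K" is
assumed to mean 1024, and each range is assumed to include its upper edge
(so 8K lands in "<=8K"), none of which the commit message states.

```python
K = 1024  # assumption: "8K" means 8 * 1024 tokens; the commit does not say


def bucket_turn_id(turn_id: int) -> str:
    """Map a turn index to one of {1, 2-5, 6-20, 21+}."""
    if turn_id <= 1:
        return "1"
    if turn_id <= 5:
        return "2-5"
    if turn_id <= 20:
        return "6-20"
    return "21+"


def bucket_input_length(tokens: int) -> str:
    """Map an input length in tokens to one of {<=8K, 8K-64K, >64K}."""
    if tokens <= 8 * K:
        return "<=8K"
    if tokens <= 64 * K:
        return "8K-64K"
    return ">64K"


def bucket_overlap_ratio(ratio: float) -> str:
    """Map an overlap ratio to one of {<=0.3, 0.3-0.7, >0.7}."""
    if ratio <= 0.3:
        return "<=0.3"
    if ratio <= 0.7:
        return "0.3-0.7"
    return ">0.7"


def bucket_append_tokens(tokens: int) -> str:
    """Map an append size in tokens to one of {<=128, 128-1K, 1K-8K, >8K}."""
    if tokens <= 128:
        return "<=128"
    if tokens <= K:
        return "128-1K"
    if tokens <= 8 * K:
        return "1K-8K"
    return ">8K"


def bucket_row(row: dict) -> dict:
    """Return the bucket label of one metrics row along each dimension."""
    # Hypothetical row keys; the real jsonl schema is not shown in the commit.
    return {
        "turn_id": bucket_turn_id(row["turn_id"]),
        "input_length": bucket_input_length(row["input_length"]),
        "overlap_ratio": bucket_overlap_ratio(row["overlap_ratio"]),
        "append_tokens": bucket_append_tokens(row["append_tokens"]),
    }
```

Since the dimensions are independent, each row gets one label per dimension
and can be counted once in each of the four breakdowns.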

Per bucket: n, n_ok, err_pct, and mean/p50/p90/p99 for both latency and TTFT.
Output is markdown by default; pass --json for machine-readable output.
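
A stdlib-only sketch of the per-bucket summary, for illustration. The row
keys ("ok", "latency_ms", "ttft_ms"), the nearest-rank percentile method, and
the choice to compute timing stats over successful rows only are all
assumptions; the commit message does not specify them.

```python
import math
from statistics import mean


def percentile(sorted_vals, q):
    """Nearest-rank percentile of an ascending list; q is in (0, 100]."""
    if not sorted_vals:
        return None
    rank = max(1, math.ceil(q / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize(rows):
    """Compute n, n_ok, err_pct and mean/p50/p90/p99 for one bucket."""
    n = len(rows)
    # Hypothetical schema: a row counts as successful when row["ok"] is truthy.
    ok_rows = [r for r in rows if r.get("ok")]
    n_ok = len(ok_rows)
    summary = {
        "n": n,
        "n_ok": n_ok,
        "err_pct": round(100 * (n - n_ok) / n, 1) if n else 0.0,
    }
    for metric in ("latency_ms", "ttft_ms"):  # assumed field names
        vals = sorted(r[metric] for r in ok_rows if r.get(metric) is not None)
        summary[metric] = {
            "mean": mean(vals) if vals else None,
            "p50": percentile(vals, 50),
            "p90": percentile(vals, 90),
            "p99": percentile(vals, 99),
        }
    return summary
```

With three rows in a bucket and one of them failed, summarize() gives
n=3, n_ok=2 and err_pct=33.3, and the timing stats cover only the two
successful rows.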

stdlib only: no pandas/numpy. Verified on a synthetic
5-row jsonl (the turn=1 bucket, which contains one error,
correctly reports err% = 33.3%).

Why a stdlib script and not pandas:
  - the existing scripts/analysis/* are stdlib-only;
    this one stays consistent with them
  - reviewers can run it on the artifact without
    pip-installing anything beyond pytest
  - speed is not a concern: it runs in <1s on the largest
    existing sweep (4449 rows)

Usage is shown in docs/EVALUATION_PROTOCOL_ZH.md §3.
2026-05-12 23:57:13 +08:00