agentic-pd-hybrid

gahow/agentic-pd-hybrid

Commit Graph

Author	SHA1	Message	Date

Gahow Wang

4021f27ee2

feat(analysis): stratified latency / TTFT reporter

Implements docs/EVALUATION_PROTOCOL_ZH.md §1.3 (M3 fix):
headline numbers must be accompanied by stratified
breakdowns so reviewers can see which slice the gains
come from.

The script reads one or more request-metrics.jsonl files
and buckets rows along four orthogonal dimensions:
  - turn_id        : {1, 2-5, 6-20, 21+}
  - input_length   : {<=8K, 8K-64K, >64K}
  - overlap_ratio  : {<=0.3, 0.3-0.7, >0.7}
  - append_tokens  : {<=128, 128-1K, 1K-8K, >8K}
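
The four bucket schemes above can be sketched as a single lookup table. This is a minimal illustration, not the script's actual code: the names `BUCKETS` and `bucket_label` are hypothetical, "K" is assumed to mean 1024, and upper edges are assumed to be inclusive (the message does not say which side a boundary value falls on).

```python
from bisect import bisect_left

K = 1024  # assumption: "8K" means 8 * 1024 tokens

# dimension -> (inclusive upper edges, bucket labels).
# Edge inclusivity is an assumption; the commit message leaves it open.
BUCKETS = {
    "turn_id": ([1, 5, 20], ["1", "2-5", "6-20", "21+"]),
    "input_length": ([8 * K, 64 * K], ["<=8K", "8K-64K", ">64K"]),
    "overlap_ratio": ([0.3, 0.7], ["<=0.3", "0.3-0.7", ">0.7"]),
    "append_tokens": ([128, 1 * K, 8 * K], ["<=128", "128-1K", "1K-8K", ">8K"]),
}


def bucket_label(dimension, value):
    """Return the label of the bucket that `value` falls into."""
    edges, labels = BUCKETS[dimension]
    # bisect_left counts the edges strictly below `value`, so a value equal
    # to an edge stays in the lower bucket (inclusive upper edge).
    return labels[bisect_left(edges, value)]
```

Keeping edges and labels side by side means adding a bucket is a one-line change, and every dimension shares the same lookup.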

Per bucket: n, n_ok, err_pct, and latency/TTFT mean, p50, p90, p99.
Output is markdown by default; pass --json for machine-readable output.
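
A per-bucket summary along these lines could be computed with the standard library alone. The sketch below is illustrative only: the row fields `ok` and `latency`, the helper names, and the nearest-rank percentile definition are all assumptions, since the commit message does not say how the script defines them.

```python
import math
from statistics import fmean


def percentile(sorted_values, pct):
    """Nearest-rank percentile of an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarize(rows, metric="latency"):
    """Summarize one bucket; latency stats use successful rows only."""
    # Hypothetical schema: each row has a boolean "ok" and a numeric metric.
    ok = sorted(r[metric] for r in rows if r.get("ok"))
    n, n_ok = len(rows), len(ok)
    out = {
        "n": n,
        "n_ok": n_ok,
        "err_pct": round(100 * (n - n_ok) / n, 1) if n else 0.0,
    }
    if ok:
        out.update(
            mean=fmean(ok),
            p50=percentile(ok, 50),
            p90=percentile(ok, 90),
            p99=percentile(ok, 99),
        )
    return out
```

Nearest-rank is used here because it needs no interpolation and always returns an observed value; an interpolating definition would give slightly different numbers on small buckets.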

stdlib only; no pandas/numpy. Verified on a synthetic
5-row jsonl (the turn=1 bucket, with one error, correctly
reports an err_pct of 33.3%).
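
A check of this kind can be reproduced in a few lines. The five rows below are made up for illustration (they are not the author's fixture), and the `turn_id` and `ok` field names and both function names are assumptions.

```python
import io
import json
from collections import defaultdict

# Hypothetical 5-row fixture: three turn=1 rows, one of which failed.
SYNTHETIC = """\
{"turn_id": 1, "ok": true}
{"turn_id": 1, "ok": true}
{"turn_id": 1, "ok": false}
{"turn_id": 3, "ok": true}
{"turn_id": 7, "ok": true}
"""


def turn_bucket(turn_id):
    """Map a turn number to the turn_id buckets named in the message."""
    if turn_id <= 1:
        return "1"
    if turn_id <= 5:
        return "2-5"
    if turn_id <= 20:
        return "6-20"
    return "21+"


def err_pct_by_turn(lines):
    """Return {bucket: error percentage} for an iterable of JSONL lines."""
    counts = defaultdict(lambda: [0, 0])  # bucket -> [n, n_err]
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the file
        row = json.loads(line)
        entry = counts[turn_bucket(row["turn_id"])]
        entry[0] += 1
        entry[1] += 0 if row["ok"] else 1
    return {b: round(100 * e / n, 1) for b, (n, e) in counts.items()}


result = err_pct_by_turn(io.StringIO(SYNTHETIC))
```

Because the function accepts any iterable of lines, the same code works on an open file handle for a real request-metrics.jsonl.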

Why a stdlib script rather than pandas:
  - the existing scripts/analysis/* are stdlib-only;
    this keeps them consistent
  - reviewers can run it on the artifact without
    pip-installing anything beyond pytest
  - speed is not a concern; it runs in <1s on the largest
    existing sweep (4449 rows)

Usage shown in EVALUATION_PROTOCOL_ZH §3.

2026-05-12 23:57:13 +08:00

1 Commit