xserv/tools/bench/summarize_fullq.py
Gahow Wang d5dcf1a5ab bench: PP harness (xserv --pp vs llama.cpp -sm layer)
runner/servers: add --pp for both engines (xserv --pp N; llama.cpp
-sm layer over N GPUs). New drivers: pp_final.sh (sequential latency +
per-GPU VRAM + byte-exact correctness), pp_diag.sh (single x2 vs pp4 x2
determinism control), pp_quality_full.sh / pp_llama_47.sh (AIME+GSM8K
matrix, xserv on 0-3 || llama on 4-7), summarize_pp/summarize_fullq,
pp_time.py latency probe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:59 +08:00

18 lines
964 B
Python

"""Summarize the full quality matrix: bench-out/fullq-{xserv,llama}-pp{1,2,4}.
Prints one row per (engine, pp, task) with accuracy + latency."""
import glob, json, os, sys

# Results root; override with the first CLI argument.
base = sys.argv[1] if len(sys.argv) > 1 else "bench-out"
print("%-6s %-3s %-9s %-8s %6s %9s %9s %10s" %
      ("engine","PP","task","correct","acc%","mean_tok","TTFT_ms","TPOT_ms"))
for eng in ("xserv","llama"):
    for pp in (1,2,4):
        files = sorted(glob.glob(os.path.join(base, f"fullq-{eng}-pp{pp}", "comparison-*.json")))
        if not files:
            print(f"{eng:<6} {pp:<3} (no results)")
            continue
        # Use the lexicographically last comparison file for this run.
        with open(files[-1]) as fh:
            d = json.load(fh)
        for r in d.get("quality",{}).get("summary",[]):
            print("%-6s %-3d %-9s %-8s %5.1f%% %9.0f %9.1f %10.2f" % (
                eng, pp, r["task"], f'{r["n_correct"]}/{r["n_total"]}',
                r["accuracy"]*100, r.get("mean_completion_tokens",0),
                r.get("mean_ttft_ms",0), r.get("mean_tpot_ms",0)))
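
The script above reads only a handful of keys from each comparison-*.json, so the input shape it relies on can be sketched with a synthetic file. The following is a minimal illustration, not part of the harness: the helper names `latest_summary` and `format_row`, the file name comparison-0001.json, and every numeric value are hypothetical and made up for the example. Only the directory pattern and the key names (`quality`, `summary`, `task`, `n_correct`, `n_total`, `accuracy`, `mean_completion_tokens`, `mean_ttft_ms`, `mean_tpot_ms`) come from the script itself.

```python
import glob
import json
import os
import tempfile


def latest_summary(base, eng, pp):
    """Hypothetical helper: summary rows from the last comparison-*.json, or None."""
    pattern = os.path.join(base, f"fullq-{eng}-pp{pp}", "comparison-*.json")
    files = sorted(glob.glob(pattern))
    if not files:
        return None
    with open(files[-1]) as fh:
        return json.load(fh).get("quality", {}).get("summary", [])


def format_row(eng, pp, r):
    """Hypothetical helper: same row format string the script prints."""
    return "%-6s %-3d %-9s %-8s %5.1f%% %9.0f %9.1f %10.2f" % (
        eng, pp, r["task"], f'{r["n_correct"]}/{r["n_total"]}',
        r["accuracy"] * 100, r.get("mean_completion_tokens", 0),
        r.get("mean_ttft_ms", 0), r.get("mean_tpot_ms", 0))


# Synthetic input: made-up values, not measured results.
sample = {"quality": {"summary": [{
    "task": "gsm8k", "n_correct": 3, "n_total": 4, "accuracy": 0.75,
    "mean_completion_tokens": 200, "mean_ttft_ms": 12.5, "mean_tpot_ms": 8.25,
}]}}

with tempfile.TemporaryDirectory() as base:
    run_dir = os.path.join(base, "fullq-xserv-pp2")
    os.makedirs(run_dir)
    with open(os.path.join(run_dir, "comparison-0001.json"), "w") as fh:
        json.dump(sample, fh)
    rows = latest_summary(base, "xserv", 2)     # directory exists
    missing = latest_summary(base, "llama", 4)  # directory absent
    line = format_row("xserv", 2, rows[0])

print(line)
```

Because the script takes `files[-1]` after `sorted()`, the file it picks is the lexicographically last name, which matches the newest run only if the comparison-* suffix sorts chronologically (for example a zero-padded counter or an ISO-style timestamp).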