runner/servers: add --pp for both engines (xserv --pp N; llama.cpp -sm layer over N GPUs). New drivers: pp_final.sh (sequential latency + per-GPU VRAM + byte-exact correctness), pp_diag.sh (single x2 vs pp4 x2 determinism control), pp_quality_full.sh / pp_llama_47.sh (AIME+GSM8K matrix, xserv on 0-3 || llama on 4-7), summarize_pp/summarize_fullq, pp_time.py latency probe. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
18 lines
964 B
Python
"""Summarize the full quality matrix: bench-out/fullq-{xserv,llama}-pp{1,2,4}.
Prints one row per (engine, pp, task) with accuracy + latency."""
import glob, json, os, sys

# Results root; override with the first command-line argument.
base = sys.argv[1] if len(sys.argv) > 1 else "bench-out"

print("%-6s %-3s %-9s %-8s %6s %9s %9s %10s" %
      ("engine", "PP", "task", "correct", "acc%", "mean_tok", "TTFT_ms", "TPOT_ms"))
for eng in ("xserv", "llama"):
    for pp in (1, 2, 4):
        files = sorted(glob.glob(os.path.join(base, f"fullq-{eng}-pp{pp}", "comparison-*.json")))
        if not files:
            print(f"{eng:<6} {pp:<3} (no results)")
            continue
        # Use the last comparison file in sorted (lexicographic) order.
        with open(files[-1]) as fh:
            d = json.load(fh)
        for r in d.get("quality", {}).get("summary", []):
            print("%-6s %-3d %-9s %-8s %5.1f%% %9.0f %9.1f %10.2f" % (
                eng, pp, r["task"], f'{r["n_correct"]}/{r["n_total"]}',
                r["accuracy"] * 100, r.get("mean_completion_tokens", 0),
                r.get("mean_ttft_ms", 0), r.get("mean_tpot_ms", 0)))
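The script never states the layout of a `comparison-*.json` file, so the sketch below infers it from the keys the inner loop reads. The `sample` dict and every value in it are invented for illustration, and `format_row` and `ROW_FMT` are hypothetical names, not part of the repository. It shows how one `quality.summary` entry becomes an output row, and that the three latency/token fields may be absent, in which case they print as 0.

```python
# Hypothetical input shape, inferred only from the keys the script reads.
# All numbers are made up for illustration; they are not benchmark results.
sample = {
    "quality": {
        "summary": [
            {"task": "gsm8k", "n_correct": 45, "n_total": 50, "accuracy": 0.9,
             "mean_completion_tokens": 312.4, "mean_ttft_ms": 41.5,
             "mean_tpot_ms": 9.8},
            # Optional fields omitted: .get(..., 0) makes them print as 0.
            {"task": "aime", "n_correct": 3, "n_total": 30, "accuracy": 0.1},
        ]
    }
}

# Same row format string as the script's inner print.
ROW_FMT = "%-6s %-3d %-9s %-8s %5.1f%% %9.0f %9.1f %10.2f"


def format_row(eng, pp, r):
    """Format one summary entry the way the script's inner loop does."""
    return ROW_FMT % (
        eng, pp, r["task"], f'{r["n_correct"]}/{r["n_total"]}',
        r["accuracy"] * 100, r.get("mean_completion_tokens", 0),
        r.get("mean_ttft_ms", 0), r.get("mean_tpot_ms", 0))


rows = [format_row("xserv", 4, r) for r in sample["quality"]["summary"]]
for row in rows:
    print(row)
```

Note that `task`, `n_correct`, `n_total`, and `accuracy` are accessed with `[]`, so an entry missing any of them would raise `KeyError`; only the token and latency means are treated as optional.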