Gahow Wang 6309dc1181 docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report
GSM8K (1000 problems, 512 gen-tokens):
  baseline: 935/1000 correct (93.5%), 13.33 ms/tok
  spec:     933/1000 correct (93.3%),  8.97 ms/tok
  agreement: 975/1000 (97.5%)
  speedup_e2e = 1.4861x
  disagreements: 25 (baseline wins 9, spec wins 7, both wrong 9)
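The GSM8K breakdown above can be cross-checked with a little arithmetic. The sketch below is only a consistency check on the reported counts, not the evaluation script. It assumes that "agreement" means both runs produced the same final answer, that "baseline wins" and "spec wins" count disagreements where exactly one run was correct, and that two correct answers always agree. All variable and function names are hypothetical.

```python
# Figures copied from the GSM8K report above.
TOTAL = 1000
BASELINE_CORRECT = 935
SPEC_CORRECT = 933
AGREEMENT = 975
BASELINE_WINS, SPEC_WINS, BOTH_WRONG = 9, 7, 9


def check_breakdown():
    """Recompute the disagreement total and test it against the other counts."""
    disagreements = BASELINE_WINS + SPEC_WINS + BOTH_WRONG
    # Assumption: every problem the baseline solved is either a shared
    # success or a "baseline wins" disagreement.
    both_correct = BASELINE_CORRECT - BASELINE_WINS
    return {
        "disagreements": disagreements,
        "agreement_matches": TOTAL - disagreements == AGREEMENT,
        "spec_matches": both_correct + SPEC_WINS == SPEC_CORRECT,
    }


if __name__ == "__main__":
    print(check_breakdown())
```

Under those assumptions the three categories sum to the 25 reported disagreements, and the win counts account for the 2-problem gap between baseline and spec.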

AIME2025 (30 problems, 2048 gen-tokens):
  baseline: 5/30 correct (16.7%),  17.18 ms/tok
  spec:     4/30 correct (13.3%),  11.64 ms/tok
  speedup_e2e = 1.4754x

Speedup is task-invariant (~1.48x on both suites, matching draft
acceptance ~21%). GSM8K accuracy is within 0.2 pp of baseline,
i.e. lossless in the same sense as vLLM and SGLang. The AIME gap is
a single problem out of 30 (5 vs 4 correct) and reflects the target
model being at its accuracy floor, not spec degradation.
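The headline numbers can be re-derived from the per-suite figures. This is a minimal sketch that assumes speedup is the ratio of baseline to spec per-token latency. The reported speedup_e2e values may come from unrounded end-to-end timings, so agreement is expected only to about three decimal places. The helper names are hypothetical.

```python
def speedup(baseline_ms_per_tok, spec_ms_per_tok):
    """Ratio of baseline to speculative per-token latency."""
    return baseline_ms_per_tok / spec_ms_per_tok


def accuracy_delta_pp(baseline_correct, spec_correct, total):
    """Accuracy gap (baseline minus spec) in percentage points."""
    return 100.0 * (baseline_correct - spec_correct) / total


# Inputs copied from the report above.
gsm8k_speedup = speedup(13.33, 8.97)
aime_speedup = speedup(17.18, 11.64)
gsm8k_delta = accuracy_delta_pp(935, 933, 1000)
aime_delta = accuracy_delta_pp(5, 4, 30)

if __name__ == "__main__":
    print(f"GSM8K: {gsm8k_speedup:.4f}x, delta {gsm8k_delta:.1f} pp")
    print(f"AIME:  {aime_speedup:.4f}x, delta {aime_delta:.1f} pp")
```

With only 30 AIME problems, each problem is worth about 3.3 pp, so a one-problem difference shows up as a much larger percentage gap than the 2-problem difference on GSM8K.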
2026-07-02 12:54:20 +08:00