Gahow Wang
6309dc1181
docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report
GSM8K (1000 problems, 512 gen-tokens):
baseline: 935/1000 correct (93.5%), 13.33 ms/tok
spec: 933/1000 correct (93.3%), 8.97 ms/tok
agreement: 975/1000 (97.5%)
speedup_e2e = 1.4861x
disagreements: 25 (baseline wins 9, spec wins 7, both wrong 9)
AIME2025 (30 problems, 2048 gen-tokens):
baseline: 5/30 correct (16.7%), 17.18 ms/tok
spec: 4/30 correct (13.3%), 11.64 ms/tok
speedup_e2e = 1.4754x
Speedup is task-invariant (1.48x on both suites, matching draft
acceptance ~21%). GSM8K accuracy is within 0.2 pp of baseline —
lossless in the same sense as vLLM and SGLang. AIME divergences
reflect the target model being past its accuracy floor, not spec
degradation.
2026-07-02 12:54:20 +08:00
..
2026-06-12 20:12:37 +08:00
2026-06-12 20:12:37 +08:00
2026-05-21 20:59:45 +08:00
2026-05-21 20:59:45 +08:00
2026-05-21 20:59:45 +08:00
2026-05-21 21:07:24 +08:00
2026-05-21 21:17:23 +08:00
2026-05-21 22:04:00 +08:00
2026-05-21 22:04:00 +08:00
2026-05-21 22:04:00 +08:00
2026-05-21 23:39:41 +08:00
2026-05-22 17:53:28 +08:00
2026-05-22 18:51:29 +08:00
2026-05-22 18:51:29 +08:00
2026-05-22 13:15:27 +08:00
2026-05-22 18:51:29 +08:00
2026-05-23 00:39:27 +08:00
2026-05-28 21:32:14 +08:00
2026-05-29 11:10:03 +08:00
2026-05-29 18:57:09 +08:00
2026-06-12 17:02:59 +08:00
2026-06-12 16:29:10 +08:00
2026-06-12 20:12:37 +08:00
2026-07-01 14:16:30 +08:00
2026-07-01 14:16:30 +08:00
2026-07-01 15:35:11 +08:00
2026-07-01 16:53:37 +08:00
2026-07-01 20:46:28 +08:00
2026-07-02 12:54:20 +08:00