Gahow Wang 6309dc1181 docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report

GSM8K (1000 problems, 512 gen-tokens):
  baseline: 935/1000 correct (93.5%), 13.33 ms/tok
  spec:     933/1000 correct (93.3%),  8.97 ms/tok
  agreement: 975/1000 (97.5%)
  speedup_e2e = 1.4861x
  disagreements: 25 (baseline wins 9, spec wins 7, both wrong 9)

AIME2025 (30 problems, 2048 gen-tokens):
  baseline: 5/30 correct (16.7%),  17.18 ms/tok
  spec:     4/30 correct (13.3%),  11.64 ms/tok
  speedup_e2e = 1.4754x

Speedup is task-invariant (1.48x on both suites, matching draft
acceptance ~21%). GSM8K accuracy is within 0.2 pp of baseline —
lossless in the same sense as vLLM and SGLang. AIME divergences
reflect the target model being past its accuracy floor, not spec
degradation.

2026-07-02 12:54:20 +08:00


Phase 27 — Speculative Decoding Quality: Task-Level Correctness at Scale

Goal: show that tree-drafting speculative decoding preserves task-level output quality even though batched-verify BF16 rounding differences make the token-by-token comparison fail (matched=false).

TL;DR

Suite	N	baseline_acc	spec_acc	agreement	tpot base→spec	speedup
GSM8K	1000	93.50%	93.30%	97.50%	13.33 → 8.97 ms	1.486×
AIME2025	30	16.67%	13.33%	23.33%	17.18 → 11.64 ms	1.475×

Speedup is model+workload driven, not accuracy-driven — the same 1.47-1.49× shows up on high-accuracy chat math (GSM8K) and on saturated long-reasoning math the model can't actually solve (AIME).
GSM8K: on 1000 problems, spec accuracy is within 0.2 pp of baseline (933 vs 935 correct). Where the two disagree (25 of 1000): baseline wins 9 times, spec wins 7 times, they're both wrong 9 times. Net effect on aggregate accuracy is a wash.
AIME: at 8B params Qwen3 sits near its accuracy floor on this suite (16.67% = 5/30). Divergences here reflect both trajectories wandering through low-probability sequences; agreement drops to 23% (7/30), yet spec is only 1 problem behind baseline (4 vs 5 correct).

Why AIME agreement is low but speedup unchanged

AIME2025 pushes Qwen3-8B well outside its competence. Both baseline and spec generate long, meandering, often-wrong reasoning; small BF16 rounding differences in tree-verify snowball across ~2000 gen-tokens into completely different (still-wrong) answers. This is expected: when the target distribution has no dominant mode, the top-1 argmax is decided by tiny logit gaps, so batched-verify rounding can flip it, and every later token is then conditioned on the flipped one.

Crucially, speedup_e2e = 1.475× on AIME matches 1.486× on GSM8K to within ~1%. The wall-clock benefit does not depend on the task being solvable; it depends on the EAGLE3 draft acceptance rate (21.2% on GSM8K, 20.3% on AIME) and on the batched-verify cost model.

How the test was run

The extended bench-eagle3 (from Phase 27) accepts any JSON file with the {id, problem, answer} schema, so both suites run through the same binary and the same code paths.

# GSM8K — 1000 problems, gen_tokens=512, max_seq_len=1024
./target/release/bench-eagle3 \
    /opt/wjh/models/qwen3-8b \
    /dashscope-tmp/wjh/models/qwen3-8b-eagle3 \
    --gsm8k tools/bench/data/gsm8k.json \
    --tree --prompts 1000 --gen-tokens 512 --max-seq-len 1024

# AIME2025 — 30 problems, gen_tokens=2048, max_seq_len=4096
./target/release/bench-eagle3 \
    /opt/wjh/models/qwen3-8b \
    /dashscope-tmp/wjh/models/qwen3-8b-eagle3 \
    --gsm8k tools/bench/data/aime2025.json \
    --tree --prompts 30 --gen-tokens 2048 --max-seq-len 4096
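
Before a long run it can be worth checking that a new problem file really has the {id, problem, answer} shape. The following is a minimal sketch, not part of bench-eagle3: the function name `validate_problems` is hypothetical, and the top-level JSON array layout is an assumption, since the document only names the three keys.

```python
import json

REQUIRED_KEYS = ("id", "problem", "answer")


def validate_problems(text):
    """Parse a problem file and return its records; raise ValueError on a bad shape."""
    data = json.loads(text)
    # Assumption: the file is a top-level JSON array of objects.
    if not isinstance(data, list):
        raise ValueError("expected a top-level JSON array")
    for i, rec in enumerate(data):
        if not isinstance(rec, dict):
            raise ValueError(f"record {i} is not an object")
        missing = [k for k in REQUIRED_KEYS if k not in rec]
        if missing:
            raise ValueError(f"record {i} is missing keys: {missing}")
    return data


# Illustrative record, not taken from gsm8k.json or aime2025.json.
sample = '[{"id": "demo-1", "problem": "What is 2 + 3?", "answer": "5"}]'
records = validate_problems(sample)
print(len(records), "records OK")
```

In practice the text would come from reading a file such as tools/bench/data/aime2025.json.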

Chat template used (build_chat_prompt, math-solver system prompt):

<|im_start|>system
You are a careful math problem solver. Solve the problem step by step. Put your final numeric answer inside \boxed{}.
<|im_end|>
<|im_start|>user
{problem}
<|im_end|>
<|im_start|>assistant
<think>

</think>
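
For readers who want to reproduce the prompt outside the bench binary, here is a hedged Python re-rendering of the template above, plus one plausible way to pull the final answer out of a completion. The real build_chat_prompt lives in the bench code and is not shown in this document; the whitespace below follows the template as displayed, and `extract_boxed_answer` and `answers_equal` are hypothetical helpers, not the extraction logic the bench actually uses.

```python
import re

SYSTEM_PROMPT = (
    "You are a careful math problem solver. Solve the problem step by step. "
    "Put your final numeric answer inside \\boxed{}."
)


def build_chat_prompt(problem):
    """Render the chat template shown above, ending in a pre-filled empty think block."""
    return (
        "<|im_start|>system\n"
        f"{SYSTEM_PROMPT}\n"
        "<|im_end|>\n"
        "<|im_start|>user\n"
        f"{problem}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
        "<think>\n\n</think>\n\n"
    )


# Matches \boxed{...} with no nested braces; nested LaTeX is out of scope here.
_BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_boxed_answer(completion):
    """Return the contents of the last \\boxed{...} in the completion, or None."""
    matches = _BOXED.findall(completion)
    if not matches:
        return None
    return matches[-1].replace(",", "").replace("$", "").strip()


def answers_equal(a, b):
    """Compare two extracted answers numerically when possible, else as strings."""
    if a is None or b is None:
        return False
    try:
        return float(a) == float(b)
    except ValueError:
        return a == b


prompt = build_chat_prompt("What is 2 + 3?")
answer = extract_boxed_answer("2 + 3 = 5, so the answer is \\boxed{5}.")
print(answer, answers_equal(answer, "5.0"))
```

Taking the last boxed expression is a design choice: a step-by-step solution may box intermediate values before the final one.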

GSM8K result (1000 problems)

--- SUMMARY ---
prompts=1000 matched=false
acceptance_rate=0.2120 accepted=125326 proposed=591156 target_steps=149789
baseline_tpot_ms=13.331 baseline_tok_s=75.013
spec_tpot_ms=8.971 spec_tok_s=111.474 speedup_e2e=1.4861
gsm8k: baseline_acc=0.9350 (935/1000) spec_acc=0.9330 (933/1000) agreement=0.9750 (975/1000)

Disagreement analysis (25/1000 questions where extracted answers differ):

baseline correct, spec wrong: 9
spec correct, baseline wrong: 7
both wrong (different wrong answers): 9

The counts are essentially symmetric — spec is not systematically worse.
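
The symmetry claim can be checked with an exact McNemar (sign) test on the two discordant counts, 9 and 7; the 9 both-wrong cases do not change either side's correctness and drop out. This is a sketch added for illustration, assuming each discordant problem is an independent coin flip under the null hypothesis; the function name `exact_mcnemar_p` is hypothetical.

```python
from math import comb


def exact_mcnemar_p(baseline_only, spec_only):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = baseline_only + spec_only
    if n == 0:
        return 1.0
    k = min(baseline_only, spec_only)
    # Probability of a split at least this lopsided under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_gsm8k = exact_mcnemar_p(9, 7)  # baseline-only correct, spec-only correct
print(f"GSM8K exact McNemar p = {p_gsm8k:.3f}")
```

With 9 vs 7 the p-value comes out at roughly 0.80, so the data give no evidence that either side wins more often.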

AIME2025 result (30 problems, 2048 gen-tokens)

--- SUMMARY ---
prompts=30 matched=false
acceptance_rate=0.2034 accepted=23511 proposed=115596 target_steps=28959
baseline_tpot_ms=17.177 baseline_tok_s=58.219
spec_tpot_ms=11.642 spec_tok_s=85.896 speedup_e2e=1.4754
gsm8k: baseline_acc=0.1667 (5/30) spec_acc=0.1333 (4/30) agreement=0.2333 (7/30)

Note: the gsm8k prefix on the last summary line is hardcoded in the bench binary; the data here is AIME2025, wrapped in the same chat template.
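
The derived fields in the summary can be recomputed from the raw counters as a sanity check. The sketch below parses the AIME summary block and assumes acceptance_rate = accepted / proposed, speedup_e2e = baseline_tpot_ms / spec_tpot_ms, and tok/s = 1000 / tpot_ms; those definitions are inferred from the logged numbers, not read from the bench source, and `parse_summary` and `derived_metrics` are hypothetical names.

```python
import re

AIME_SUMMARY = """\
prompts=30 matched=false
acceptance_rate=0.2034 accepted=23511 proposed=115596 target_steps=28959
baseline_tpot_ms=17.177 baseline_tok_s=58.219
spec_tpot_ms=11.642 spec_tok_s=85.896 speedup_e2e=1.4754
gsm8k: baseline_acc=0.1667 (5/30) spec_acc=0.1333 (4/30) agreement=0.2333 (7/30)
"""


def parse_summary(text):
    """Collect every key=value pair in the summary into a dict of strings."""
    return dict(re.findall(r"(\w+)=(\S+)", text))


def derived_metrics(fields):
    """Recompute the derived numbers from the raw counters and timings."""
    accepted = int(fields["accepted"])
    proposed = int(fields["proposed"])
    base_tpot = float(fields["baseline_tpot_ms"])
    spec_tpot = float(fields["spec_tpot_ms"])
    return {
        "acceptance_rate": accepted / proposed,
        "speedup_e2e": base_tpot / spec_tpot,
        "baseline_tok_s": 1000.0 / base_tpot,
        "spec_tok_s": 1000.0 / spec_tpot,
    }


fields = parse_summary(AIME_SUMMARY)
metrics = derived_metrics(fields)
for name, value in metrics.items():
    print(f"{name}: recomputed {value:.4f}, logged {fields[name]}")
```

Small differences in the last digit are expected because the logged tpot values are themselves rounded.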

Disagreement analysis (23/30 questions differ):

baseline correct, spec wrong: 1
spec correct, baseline wrong: 0
both wrong (different wrong answers): 22
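
The three buckets used in both disagreement tables follow from comparing each extracted answer with the reference answer. A minimal sketch of that classification is below; the rows are made-up examples rather than problems from either run, and the function name `classify` and the bucket labels are hypothetical.

```python
from collections import Counter


def classify(baseline_ans, spec_ans, gold):
    """Assign one problem to an agreement or disagreement bucket."""
    if baseline_ans == spec_ans:
        return "agree"
    if baseline_ans == gold:
        return "baseline_correct_spec_wrong"
    if spec_ans == gold:
        return "spec_correct_baseline_wrong"
    return "both_wrong_different"


# (baseline answer, spec answer, reference answer); illustrative values only.
rows = [
    ("12", "12", "12"),  # same answer, both correct
    ("7", "7", "9"),     # same answer, both wrong: still counts as agreement
    ("12", "15", "12"),  # baseline wins
    ("3", "8", "8"),     # spec wins
    ("4", "6", "5"),     # both wrong with different answers
]

counts = Counter(classify(*row) for row in rows)
for bucket, n in sorted(counts.items()):
    print(f"{bucket}: {n}")
```

Agreement here means identical extracted answers, which is why two runs can agree while both being wrong.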

Absolute performance

metric	baseline	tree-spec
GSM8K tpot	13.33 ms	8.97 ms
GSM8K tok/s	75.0	111.5
AIME tpot	17.18 ms	11.64 ms
AIME tok/s	58.2	85.9

AIME's absolute tpot is higher than GSM8K's because the average KV length is larger (avg completion ~1500 tokens vs ~350 for GSM8K), which slows the paged attention kernel roughly linearly. Both suites see nearly the same relative speedup (1.475× vs 1.486×), which suggests the EAGLE3 tree-drafting benefit holds across this range of context lengths rather than depending on it.

Interpretation

The Phase 26 matched=false flag has now been characterized on 1030 real problems (1000 GSM8K + 30 AIME2025):

On solvable tasks (GSM8K): spec accuracy is within noise (Δacc = -0.2 pp on 1000 samples, 95% CI easily includes zero). This is what vLLM and SGLang call "lossless" speculative decoding.
On hard tasks (AIME): both baseline and spec meander through wrong answers; agreement collapses because the argmax distribution is nearly flat. Speedup is preserved.
Draft acceptance is the invariant: acceptance_rate = 21.2% (GSM8K) vs 20.3% (AIME) — nearly identical, because EAGLE3's draft quality depends on target distribution predictability, which is similar for both math-formatted chat prompts.
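
The confidence-interval claim for GSM8K can be worked through from the disagreement counts alone. Because both systems answer the same 1000 problems, the accuracy difference depends only on the discordant pairs (9 baseline-only correct, 7 spec-only correct). The sketch below uses a Wald interval for paired proportions, which is one reasonable choice and not necessarily the method the author had in mind; `paired_accuracy_ci` is a hypothetical name.

```python
from math import sqrt


def paired_accuracy_ci(n, baseline_only, spec_only, z=1.96):
    """Wald CI for (spec accuracy - baseline accuracy) on paired samples."""
    diff = (spec_only - baseline_only) / n
    # Variance of a paired difference: only discordant pairs contribute.
    se = sqrt(baseline_only + spec_only - (spec_only - baseline_only) ** 2 / n) / n
    return diff, diff - z * se, diff + z * se


diff, lo, hi = paired_accuracy_ci(1000, baseline_only=9, spec_only=7)
print(f"delta = {diff * 100:+.2f} pp, 95% CI = [{lo * 100:+.2f}, {hi * 100:+.2f}] pp")
```

For GSM8K this gives a difference of -0.2 pp with an interval of roughly -1.0 pp to +0.6 pp, which contains zero. The same formula is too coarse to say much at N = 30, so it is not a useful check for AIME.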

Speculative decoding here is correctness-preserving in expectation, not bit-exact: individual outputs can differ token by token while aggregate accuracy stays within noise. This is the same guarantee production systems ship.

What was NOT changed

No changes to kernels, attention, KV cache, EAGLE3 head, or the tree drafting policy (still γ=2 top-3 as in commit 2fe903e).
Bench binary already supported --gsm8k <path> from commit 264c004; we simply pointed it at both gsm8k.json and aime2025.json.

Files touched

docs/27-speculative-quality-gsm8k.md — rewritten with 1000-scale GSM8K and 30-problem AIME2025 results.

Reproduction

# on dash5 (5090)
cd /opt/wjh/projects/xserv
./target/release/bench-eagle3 /opt/wjh/models/qwen3-8b \
    /dashscope-tmp/wjh/models/qwen3-8b-eagle3 \
    --gsm8k tools/bench/data/gsm8k.json \
    --tree --prompts 1000 --gen-tokens 512 --max-seq-len 1024
# ~90 minutes wall-clock on 5090

./target/release/bench-eagle3 /opt/wjh/models/qwen3-8b \
    /dashscope-tmp/wjh/models/qwen3-8b-eagle3 \
    --gsm8k tools/bench/data/aime2025.json \
    --tree --prompts 30 --gen-tokens 2048 --max-seq-len 4096
# ~11 minutes wall-clock on 5090
