xserv

gahow/xserv


Commit Graph


Author: Gahow Wang
SHA1: 264c004662

eagle3: GSM8K quality benchmark proves tree-spec is correctness-preserving

Adds --gsm8k mode to bench-eagle3: chat-templated prompts, per-problem
answer extraction, side-by-side baseline vs tree-spec accuracy comparison.

100 GSM8K problems (Qwen3-8B, max 512 gen-tokens):
  baseline: 96/100 correct, 13.30 ms/tok
  spec:     98/100 correct,  9.02 ms/tok
  agreement: 97/100
  speedup_e2e = 1.4754x
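The reported speedup_e2e does not exactly equal the ratio of the two printed per-token latencies, which is expected if it was computed from unrounded timings. The sketch below is a quick consistency check under that assumption: it treats each printed ms/tok value as rounded to two decimals and checks that 1.4754 falls inside the implied range. The function name and the rounding assumption are illustrative and not part of bench-eagle3.

```python
def ratio_bounds(baseline_ms, spec_ms, half_step=0.005):
    """Range of baseline/spec ratios consistent with two-decimal rounding.

    Assumption: each printed latency is within +/- half_step of the true value.
    """
    lo = (baseline_ms - half_step) / (spec_ms + half_step)
    hi = (baseline_ms + half_step) / (spec_ms - half_step)
    return lo, hi


baseline_ms = 13.30   # ms/tok, from the commit message
spec_ms = 9.02        # ms/tok, from the commit message
reported = 1.4754     # speedup_e2e, from the commit message

point = baseline_ms / spec_ms
lo, hi = ratio_bounds(baseline_ms, spec_ms)

print(f"ratio of rounded latencies: {point:.4f}")
print(f"range allowed by rounding:  [{lo:.4f}, {hi:.4f}]")
print(f"reported value in range:    {lo < reported < hi}")
```

If the reported value fell outside this range, that would suggest the two figures were measured over different token counts or runs.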

Where the two disagree (3 cases), spec was correct in 2 and baseline in
none, so spec is never strictly worse than baseline on this sample. This
closes the "matched=false is a correctness bug" question: matched=false
only means that BF16 batched-verify rounding produces different token IDs
on ~half of steps; at the task level, output quality is preserved (or
slightly better).
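The "never strictly worse" claim can be checked from the four reported counts alone. The sketch below reconstructs the per-problem outcome table, assuming that "agreement" means both runs extracted the same final answer, so two agreeing runs are either both right or both wrong. The function and key names are hypothetical and not taken from bench-eagle3.

```python
def outcome_table(n, baseline_correct, spec_correct, agreement, spec_correct_in_disagreements):
    """Split n problems into both-correct / spec-only / baseline-only / both-wrong.

    Assumption: agreement is on the extracted answer, so when the runs
    disagree and spec is correct, baseline must be wrong.
    """
    disagreements = n - agreement
    spec_only = spec_correct_in_disagreements
    both_correct = spec_correct - spec_only
    baseline_only = baseline_correct - both_correct
    both_wrong = n - both_correct - spec_only - baseline_only
    # Disagreements not explained by a one-sided win are both-wrong cases
    # where the two runs produced different wrong answers.
    both_wrong_differ = disagreements - spec_only - baseline_only
    return {
        "both_correct": both_correct,
        "spec_only": spec_only,
        "baseline_only": baseline_only,
        "both_wrong": both_wrong,
        "both_wrong_differ": both_wrong_differ,
    }


# Counts from the commit message: 96 and 98 correct, 97 agreements, spec right in 2 of 3 disagreements.
table = outcome_table(100, 96, 98, 97, 2)
print(table)
```

Under that assumption the counts are consistent only with zero baseline-only wins: 96 problems solved by both, 2 solved by spec alone, and 2 missed by both, one of which produced different wrong answers.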

2026-07-02 10:29:33 +08:00

1 Commits