Adds --gsm8k mode to bench-eagle3: chat-templated prompts, per-problem
answer extraction, side-by-side baseline vs tree-spec accuracy comparison.

100 GSM8K problems (Qwen3-8B, max 512 gen-tokens):

  baseline:  96/100 correct, 13.30 ms/tok
  spec:      98/100 correct,  9.02 ms/tok
  agreement: 97/100
  speedup_e2e = 1.4754x

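As a sanity check, the ratio of the two rounded per-token latencies can be
recomputed by hand. This is only a sketch: the variable names are
illustrative, and it assumes speedup_e2e is a plain baseline/spec time
ratio. The rounded figures give about 1.4745x, so the reported 1.4754x
presumably comes from unrounded end-to-end timings.

```python
# Rounded per-token latencies copied from the results above.
baseline_ms_per_tok = 13.30
spec_ms_per_tok = 9.02

# Assumption: speedup is baseline time divided by spec time.
ratio = baseline_ms_per_tok / spec_ms_per_tok
print(f"{ratio:.4f}x")  # 1.4745x
```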
In the 3 cases where the two disagree, spec was correct in 2, and in no
case was spec wrong where baseline was right. This closes the
"matched=false is a correctness bug" question: matched=false only means
that BF16 rounding in batched verify yields different token IDs on ~half
of steps. At the task level, output quality is preserved (or slightly
better on this sample).
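
The accuracy and agreement counts can be cross-checked with a small tally.
This is a hedged sketch, not the bench-eagle3 implementation: Row and tally
are hypothetical names, and the rows are synthetic, chosen only to
reproduce the aggregate counts reported above (they are not the actual
per-problem answers).

```python
from dataclasses import dataclass


@dataclass
class Row:
    gold: str      # reference answer
    baseline: str  # answer extracted from baseline decoding
    spec: str      # answer extracted from tree-spec decoding


def tally(rows):
    """Accuracy, agreement, and a breakdown of the disagreeing problems."""
    disagree = [r for r in rows if r.baseline != r.spec]
    return {
        "baseline_correct": sum(r.baseline == r.gold for r in rows),
        "spec_correct": sum(r.spec == r.gold for r in rows),
        "agreement": len(rows) - len(disagree),
        "disagree_spec_correct": sum(r.spec == r.gold for r in disagree),
        "disagree_baseline_correct": sum(r.baseline == r.gold for r in disagree),
    }


# Synthetic rows that match the reported totals (illustration only).
rows = (
    [Row("1", "1", "1")] * 96    # same answer, both correct
    + [Row("1", "0", "0")] * 1   # same answer, both wrong
    + [Row("1", "0", "1")] * 2   # disagree, spec correct
    + [Row("1", "0", "2")] * 1   # disagree, both wrong
)
stats = tally(rows)
print(stats)
```

With 97 agreements and spec correct in 2 of the 3 disagreements, the totals
96 and 98 only add up if baseline was correct in none of the disagreeing
cases, which is what "never strictly worse" requires.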