xserv/docs at main - xserv - Local Gitea

gahow/xserv

Gahow Wang 6309dc1181 docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report

GSM8K (1000 problems, 512 gen-tokens):
  baseline: 935/1000 correct (93.5%), 13.33 ms/tok
  spec:     933/1000 correct (93.3%),  8.97 ms/tok
  agreement: 975/1000 (97.5%)
  speedup_e2e = 1.4861x
  disagreements: 25 (baseline wins 9, spec wins 7, both wrong 9)

AIME2025 (30 problems, 2048 gen-tokens):
  baseline: 5/30 correct (16.7%),  17.18 ms/tok
  spec:     4/30 correct (13.3%),  11.64 ms/tok
  speedup_e2e = 1.4754x

Speedup is nearly identical on both suites (1.49x on GSM8K, 1.48x on
AIME2025), matching draft acceptance of ~21%. GSM8K accuracy is within
0.2 pp of baseline (93.3% vs 93.5%), which is lossless in the same
sense as vLLM and SGLang. AIME2025 divergences (4/30 vs 5/30 correct)
reflect the target model being past its accuracy floor, not
degradation from speculative decoding.
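
The headline figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the xserv repository: the helper names `speedup` and `accuracy_pct` are hypothetical, and it assumes `speedup_e2e` is roughly the ratio of baseline to speculative ms/tok. The rounded ms/tok values reproduce the reported speedups only to about three decimal places, so any remaining gap is presumably rounding.

```python
# Cross-check of the numbers quoted in the commit message.
# All inputs are copied from the report; the helper names are hypothetical.

def speedup(baseline_ms_per_tok: float, spec_ms_per_tok: float) -> float:
    """Ratio of baseline to speculative per-token latency."""
    return baseline_ms_per_tok / spec_ms_per_tok


def accuracy_pct(correct: int, total: int) -> float:
    """Accuracy as a percentage."""
    return 100.0 * correct / total


# GSM8K: 1000 problems
gsm8k_speedup = speedup(13.33, 8.97)
gsm8k_gap_pp = accuracy_pct(935, 1000) - accuracy_pct(933, 1000)

# AIME2025: 30 problems
aime_speedup = speedup(17.18, 11.64)

# Disagreement breakdown: baseline wins, spec wins, both wrong
baseline_wins, spec_wins, both_wrong = 9, 7, 9
disagreements = baseline_wins + spec_wins + both_wrong

# Internal consistency of the GSM8K counts:
# agreements + disagreements cover every problem, and the net
# difference in correct answers equals baseline wins minus spec wins.
counts_consistent = (975 + disagreements == 1000) and (
    935 - 933 == baseline_wins - spec_wins
)

print(f"GSM8K speedup    ~ {gsm8k_speedup:.4f}x")
print(f"AIME2025 speedup ~ {aime_speedup:.4f}x")
print(f"GSM8K accuracy gap = {gsm8k_gap_pp:.1f} pp")
print(f"GSM8K counts consistent: {counts_consistent}")
```

The consistency check is worth keeping alongside a report like this: if the agreement count, the disagreement breakdown, and the two accuracy figures ever stop adding up, the evaluation harness has likely miscounted a problem.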

2026-07-02 12:54:20 +08:00

docs: Phase 21 — decode CUDA graph + GPU argmax results

2026-06-12 20:12:37 +08:00

00-roadmap.md

docs: Phase 21 — decode CUDA graph + GPU argmax results

2026-06-12 20:12:37 +08:00

01-cuda-ffi.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

02-tensor.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

03-gemm.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

04-transformer-kernels.md

phase 4: transformer core kernels

2026-05-21 21:07:24 +08:00

05-attention.md

phase 5: naive multi-head attention

2026-05-21 21:17:23 +08:00

06-model-loading.md

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00

07-tokenizer.md

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00

08-gpt2.md

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00

09-kv-cache.md

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

10-qwen3.md

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

11-paged-attention.md

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

12-continuous-batching.md

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

13-http-api.md

docs: split Phase 12 and Phase 13 into separate design documents

2026-05-22 13:15:27 +08:00

14-flash-attention.md

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

15-performance.md

docs: Phase 15 design doc + benchmark report

2026-05-23 00:39:27 +08:00

16-llama-cpp-comparison.md

docs: update llama.cpp comparison with 8192 results (OOM fixed)

2026-05-28 21:32:14 +08:00

17-tensor-parallelism.md

docs: Phase 17 tensor parallelism design

2026-05-29 11:10:03 +08:00

18-pipeline-parallelism.md

docs: Phase 18 pipeline parallelism — design + benchmark results

2026-05-29 18:57:09 +08:00

19-gpt-oss-moe.md

docs: fill the Phase 19 gap, refresh README/roadmap to actual state

2026-06-12 17:02:59 +08:00

20-sparse-moe.md

moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2)

2026-06-12 16:29:10 +08:00

21-cuda-graph-decode.md

docs: Phase 21 — decode CUDA graph + GPU argmax results

2026-06-12 20:12:37 +08:00

22-speculative-decoding.md

speculative: Qwen3 draft-model v0 with paged verify parity

2026-07-01 14:16:30 +08:00

23-speculative-verify-parity.md

speculative: Qwen3 draft-model v0 with paged verify parity

2026-07-01 14:16:30 +08:00

24-speculative-batched-verify.md

docs: Phase 24 investigation notes and revised speedup plan

2026-07-01 15:35:11 +08:00

25-speculative-methods-comparison.md

docs: Phase 25 — three speculative-decoding paradigms compared

2026-07-01 16:53:37 +08:00

26-eagle3-bug-hunt.md

docs: Phase 26 epilogue 2 — tree kernel landed; KV remap is the remaining blocker

2026-07-01 20:46:28 +08:00

27-speculative-quality-gsm8k.md

docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report

2026-07-02 12:54:20 +08:00