-
6309dc1181
docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report
main
Gahow Wang
2026-07-02 12:54:20 +08:00
-
264c004662
eagle3: GSM8K quality benchmark proves tree-spec is correctness-preserving
Gahow Wang
2026-07-02 10:29:33 +08:00
-
2fe903ecea
eagle3: extend tree to top-3 siblings — speedup_e2e = 1.20×
Gahow Wang
2026-07-02 00:24:57 +08:00
-
aac9ace144
eagle3: tree drafting with top-2 siblings — speedup_e2e = 1.17× 🎉
Gahow Wang
2026-07-02 00:09:30 +08:00
-
6da0972740
speculative: copy_kv_position primitive for tree drafting KV remap
Gahow Wang
2026-07-01 23:09:35 +08:00
-
40d8a29e33
docs: Phase 26 epilogue 2 — tree kernel landed; KV remap is the remaining blocker
Gahow Wang
2026-07-01 20:46:28 +08:00
-
fd392f7fbb
attention: tree-aware paged_decode_attention_tree kernel + wrapper
Gahow Wang
2026-07-01 20:45:55 +08:00
-
10a98539d0
eagle3: coverage + top-3 diagnostic; acceptance ceiling analysis
Gahow Wang
2026-07-01 20:19:28 +08:00
-
cc3bc2188c
docs: Phase 26 epilogue — speedup_e2e = 1.10x achieved
Gahow Wang
2026-07-01 19:59:03 +08:00
-
06a798cab9
eagle3: cuBLAS-GEMM verify path — speedup_e2e > 1 achieved 🎉
Gahow Wang
2026-07-01 19:58:23 +08:00
-
9a1af0adee
docs: Phase 26 — EAGLE3 implementation follow-up + bug hunt log
Gahow Wang
2026-07-01 19:18:37 +08:00
-
d2c55c47b2
eagle3: γ≥2 correctness fixes + per-slot diagnostic
Gahow Wang
2026-07-01 19:16:31 +08:00
-
14925154a3
eagle3: γ≥2 recursive drafting + batched verify with hooks
Gahow Wang
2026-07-01 18:01:55 +08:00
-
a24621fa6a
eagle3: proper residual chain + stateful KV cache
Gahow Wang
2026-07-01 17:50:49 +08:00
-
68b55fa1e6
eagle3: γ=1 speculative bench + first end-to-end measurement
Gahow Wang
2026-07-01 17:32:53 +08:00
-
8f11d6e5cd
eagle3: fix EAGLE_HOOK_LAYERS to [2, 18, 33] for Qwen3-8B
Gahow Wang
2026-07-01 17:29:00 +08:00
-
e04a8ffb18
speculative: EAGLE3 draft head implementation (Phase 25 step 1)
Gahow Wang
2026-07-01 17:23:22 +08:00
-
6485c87c5b
docs: Phase 25 — three speculative-decoding paradigms compared
Gahow Wang
2026-07-01 16:53:37 +08:00
-
a77239c0c8
speculative: Qwen3 decode graph + gamma sweep (Phase 24 step 2)
Gahow Wang
2026-07-01 16:32:17 +08:00
-
e5734b41fa
speculative: batched-GEMV kernel for verify path (Phase 24 step 1)
Gahow Wang
2026-07-01 16:13:37 +08:00
-
42e13f33dd
docs: Phase 24 investigation notes and revised speedup plan
Gahow Wang
2026-07-01 15:35:11 +08:00
-
fcf531a9b2
style: rustfmt server engine files
Gahow Wang
2026-07-01 15:13:35 +08:00
-
d96ee0766c
server: sampling-param validation, finish_reason normalization, backpressure
Gahow Wang
2026-07-01 15:13:24 +08:00
-
ce10e4a998
sampling: NaN-safe sample() top-k/top-p path
Gahow Wang
2026-07-01 15:13:19 +08:00
-
5f060902f6
cuda: fix remaining int32-address and nondeterministic-reduction bugs
Gahow Wang
2026-07-01 15:13:07 +08:00
-
a67753f516
softmax: cap block size at 512 threads
Gahow Wang
2026-07-01 14:15:58 +08:00
-
f5ec10c2c3
xserv-cli: expose sampling params and greedy repetition penalty
Gahow Wang
2026-07-01 14:15:50 +08:00
-
ce7229f4fe
speculative: Qwen3 draft-model v0 with paged verify parity
Gahow Wang
2026-07-01 14:15:39 +08:00
-
5b350ee5f0
cuda: deterministic BF16 gemv + paged attention reductions
Gahow Wang
2026-07-01 14:14:55 +08:00
-
0314b4f3ac
server: non-blocking stream send — stop one slow client stalling the batch
Gahow Wang
2026-07-01 12:37:32 +08:00
-
cfbd64d206
cuda: fix int32 overflow in MoE dense kernels; surface launch errors in release
Gahow Wang
2026-07-01 12:37:21 +08:00
-
531cd3fe08
style: format Rust workspace
Gahow Wang
2026-06-18 18:11:58 +08:00
-
013465fc06
docs: Phase 21 — decode CUDA graph + GPU argmax results
Gahow Wang
2026-06-12 20:12:37 +08:00
-
8414f8d1e6
sampling: GPU argmax fast path for greedy decode
Gahow Wang
2026-06-12 20:12:37 +08:00
-
34224c7c93
gpt-oss: replay the whole batch=1 decode step as one CUDA graph
Gahow Wang
2026-06-12 20:12:37 +08:00
-
4088f49b7d
cuda: infrastructure for whole-step CUDA graph capture
Gahow Wang
2026-06-12 20:12:37 +08:00
-
2a92f268a9
docs: fill the Phase 19 gap, refresh README/roadmap to actual state
Gahow Wang
2026-06-12 17:02:59 +08:00
-
5343391dbd
review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings
Gahow Wang
2026-06-12 17:02:59 +08:00
-
1897b2e17a
gpt-oss: drop debug syncs from forward; GPU broadcast bias-add
Gahow Wang
2026-06-12 17:02:59 +08:00
-
63f5599717
server: serve gpt-oss on a single GPU via the TP engine (world=1)
Gahow Wang
2026-06-12 16:29:10 +08:00
-
fb20178992
moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2)
Gahow Wang
2026-06-12 16:29:10 +08:00
-
cf1e9e41db
tools: single-stream decode benchmark vs llama.cpp
Gahow Wang
2026-06-12 15:01:42 +08:00
-
d33220498a
quantization: MXFP4 W4A16 expert weights (memory-optimization foundation)
Gahow Wang
2026-06-12 15:01:42 +08:00
-
e631a71b68
quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48
Gahow Wang
2026-06-12 01:23:29 +08:00
-
24c49c31c2
tools: warm-server FP8 vs BF16 benchmark + results doc
Gahow Wang
2026-06-12 00:58:46 +08:00
-
5a16225c1f
quantization: cache cuBLASLt FP8 plan per shape — fix per-expert heuristic churn
Gahow Wang
2026-06-12 00:58:46 +08:00
-
3a530956af
tools: add FP8 vs BF16 benchmark and GSM8K eval harness
Gahow Wang
2026-06-08 00:25:50 +08:00
-
76487b7963
quantization: W8A8 FP8 compute via cuBLASLt tensor cores
Gahow Wang
2026-06-07 20:38:26 +08:00
-
9f1fbbb98b
quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights
Gahow Wang
2026-06-07 19:33:07 +08:00
-
e1eb77baa4
xserv-chat: fix unclosed <think> on early termination and flush analysis tokens
Gahow Wang
2026-06-03 01:01:41 +08:00
-
34e9bee375
xserv-chat: render gpt-oss analysis as a Qwen3-style <think> block
Gahow Wang
2026-06-02 21:37:28 +08:00
-
3b9e32e6cd
kernels: fix uninitialized shared-memory read in M=1 decode GEMV
Gahow Wang
2026-06-02 17:18:37 +08:00
-
5157b2cd30
kernels: fix NaN in flash-attention sinks on fully-masked window tiles
Gahow Wang
2026-06-02 16:09:43 +08:00
-
ea5d8ba7ea
xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT)
Gahow Wang
2026-06-02 15:39:24 +08:00
-
c0a81c84e7
server: canonical harmony system message in gpt-oss fallback
Gahow Wang
2026-06-02 15:19:50 +08:00
-
3d6bb1918e
xserv-chat: fix gpt-oss harmony chat (canonical system prompt + routing)
Gahow Wang
2026-06-02 15:19:07 +08:00
-
f2e60218b4
xserv-chat: harmony channel routing + repetition penalty for gpt-oss
Gahow Wang
2026-06-02 12:40:17 +08:00
-
3ee8df2c0f
xserv-chat: filter harmony control tokens + stop at <|end|> for gpt-oss
Gahow Wang
2026-06-02 12:05:07 +08:00
-
ae08896f46
xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug
Gahow Wang
2026-06-02 00:58:10 +08:00
-
1d0ec32e8d
server: Jinja chat template rendering via minijinja
Gahow Wang
2026-05-31 13:23:18 +08:00
-
4368e79695
model: fused GPU MoE kernel — eliminate CPU roundtrip
Gahow Wang
2026-05-31 13:22:59 +08:00
-
377a04b81f
tokenizer: read pre-tokenizer regex from tokenizer.json
Gahow Wang
2026-05-31 13:22:35 +08:00
-
241009a96c
docs: remove TO-BE-FIXED.md — all listed issues have been resolved
Gahow Wang
2026-05-30 21:20:48 +08:00
-
0c6135aea3
bench-gpt-oss: teacher-forced diagnostics + --prompt flag
Gahow Wang
2026-05-31 00:56:46 +08:00
-
ffd90ce7fb
server: emit harmony developer instructions block for gpt-oss
Gahow Wang
2026-05-31 00:56:39 +08:00
-
3c9d5e260e
server: harmony termination via is_eos + TP repetition penalty
Gahow Wang
2026-05-31 00:56:33 +08:00
-
99b212e6c1
model/sampling: NaN-safe argmax + optional repetition penalty
Gahow Wang
2026-05-31 00:56:27 +08:00
-
e11f15e009
tokenizer: support multiple end-of-generation tokens
Gahow Wang
2026-05-31 00:56:21 +08:00
-
9c98c169ff
kernels: flash attention with gpt-oss sinks + sliding window
Gahow Wang
2026-05-31 00:56:10 +08:00
-
5cb3cf28f9
server: add gpt-oss chat template for proper prompt formatting
Gahow Wang
2026-05-30 15:43:29 +08:00
-
15c51f143e
server: support GptOss in TP engine + benchmark script
Gahow Wang
2026-05-30 15:39:44 +08:00
-
d29c39d74e
fix: GEMV NaN bug — skip custom kernel for small N (<256)
Gahow Wang
2026-05-30 15:20:04 +08:00
-
9ad91a4a92
phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2
Gahow Wang
2026-05-30 15:18:01 +08:00
-
46bfb59f30
Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference
Gahow Wang
2026-05-30 13:13:05 +08:00
-
9a01c60100
server: GPU argmax fast path for greedy decode
Gahow Wang
2026-05-30 12:50:47 +08:00
-
c679f618fd
model: fuse QKV/gate_up projections, batched decode ops
Gahow Wang
2026-05-30 12:50:39 +08:00
-
cc4bd4cfe5
paged-kv: kernel-based scatter + fix data_ptr offset bug
Gahow Wang
2026-05-30 12:50:28 +08:00
-
13ae3de69e
kernels: reshape_and_cache, GPU argmax, single-launch GEMV
Gahow Wang
2026-05-30 12:50:17 +08:00
-
6ce21345be
cuda: add cached_trim() to release pooled GPU buffers
Gahow Wang
2026-05-30 12:50:04 +08:00
-
1ab6ca9c09
tensor: add narrow() view and relax is_contiguous for size-1 dims
Gahow Wang
2026-05-30 12:49:57 +08:00
-
11e0154e4d
docs: Phase 18 pipeline parallelism — design + benchmark results
Gahow Wang
2026-05-29 18:46:06 +08:00
-
d5dcf1a5ab
bench: PP harness (xserv --pp vs llama.cpp -sm layer)
Gahow Wang
2026-05-29 18:45:59 +08:00
-
824cc58daa
server: pipeline-parallel HTTP engine (--pp N)
Gahow Wang
2026-05-29 18:45:52 +08:00
-
da3aaa134a
model: pipeline-parallel Qwen3 (from_weights_pp + stage forward)
Gahow Wang
2026-05-29 18:45:47 +08:00
-
859c0cc0b6
distributed: NCCL P2P primitives (PpContext + send/recv)
Gahow Wang
2026-05-29 18:45:42 +08:00
-
c2362df1f1
fix(xserv-chat): UTF-8/CJK-aware line input
Gahow Wang
2026-05-29 11:36:54 +08:00
-
7b8b520cda
docs: TP=1/2/4 xserv vs llama.cpp benchmark results
Gahow Wang
2026-05-29 11:10:52 +08:00
-
a4a171d425
bench: TP sweep harness (xserv --tp, llama row-split, concurrent groups)
Gahow Wang
2026-05-29 11:10:43 +08:00
-
95eb61d639
server: tensor-parallel HTTP engine (--tp N)
Gahow Wang
2026-05-29 11:10:33 +08:00
-
f17011129e
model: tensor-parallel Qwen3 (sharded weights + AllReduce)
Gahow Wang
2026-05-29 11:10:24 +08:00
-
453520d622
distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)
Gahow Wang
2026-05-29 11:10:14 +08:00
-
76fffb3b68
docs: Phase 17 tensor parallelism design
Gahow Wang
2026-05-29 11:10:03 +08:00
-
14a44b503e
docs: add Chinese README (overview + usage)
Gahow Wang
2026-05-28 21:38:20 +08:00
-
80157e614a
docs: update llama.cpp comparison with 8192 results (OOM fixed)
Gahow Wang
2026-05-28 21:32:14 +08:00
-
fc1900a745
server: VRAM-sized KV pool + vLLM-style swap scheduler
Gahow Wang
2026-05-28 19:59:06 +08:00
-
d52baa0006
model: paged KV cache with CPU swap pool, decode graph, qwen3 updates
Gahow Wang
2026-05-28 19:58:54 +08:00
-
4c3f914459
kernels/cuda: paged-attention kernel, dispatch, pinned host memory
Gahow Wang
2026-05-28 19:58:36 +08:00
-
3f1c3d429a
docs: llama.cpp vs xserv benchmark results + summary
Gahow Wang
2026-05-28 15:06:21 +08:00
-
950ccf3822
bench: fix llama.cpp per-slot context (was 1/parallel of intended)
Gahow Wang
2026-05-28 15:06:12 +08:00
-
7cb9ee3870
bench: run one server at a time, match thinking mode, fix tools package
Gahow Wang
2026-05-28 11:40:07 +08:00