Commit Graph

  • 6309dc1181 docs: Phase 27 scaled-up — GSM8K 1000 + AIME2025 30 quality report (main) Gahow Wang 2026-07-02 12:54:20 +08:00
  • 264c004662 eagle3: GSM8K quality benchmark proves tree-spec is correctness-preserving Gahow Wang 2026-07-02 10:29:33 +08:00
  • 2fe903ecea eagle3: extend tree to top-3 siblings — speedup_e2e = 1.20× Gahow Wang 2026-07-02 00:24:57 +08:00
  • aac9ace144 eagle3: tree drafting with top-2 siblings — speedup_e2e = 1.17× 🎉 Gahow Wang 2026-07-02 00:09:30 +08:00
  • 6da0972740 speculative: copy_kv_position primitive for tree drafting KV remap Gahow Wang 2026-07-01 23:09:35 +08:00
  • 40d8a29e33 docs: Phase 26 epilogue 2 — tree kernel landed; KV remap is the remaining blocker Gahow Wang 2026-07-01 20:46:28 +08:00
  • fd392f7fbb attention: tree-aware paged_decode_attention_tree kernel + wrapper Gahow Wang 2026-07-01 20:45:55 +08:00
  • 10a98539d0 eagle3: coverage + top-3 diagnostic; acceptance ceiling analysis Gahow Wang 2026-07-01 20:19:28 +08:00
  • cc3bc2188c docs: Phase 26 epilogue — speedup_e2e = 1.10x achieved Gahow Wang 2026-07-01 19:59:03 +08:00
  • 06a798cab9 eagle3: cuBLAS-GEMM verify path — speedup_e2e > 1 achieved 🎉 Gahow Wang 2026-07-01 19:58:23 +08:00
  • 9a1af0adee docs: Phase 26 — EAGLE3 implementation follow-up + bug hunt log Gahow Wang 2026-07-01 19:18:37 +08:00
  • d2c55c47b2 eagle3: γ≥2 correctness fixes + per-slot diagnostic Gahow Wang 2026-07-01 19:16:31 +08:00
  • 14925154a3 eagle3: γ≥2 recursive drafting + batched verify with hooks Gahow Wang 2026-07-01 18:01:55 +08:00
  • a24621fa6a eagle3: proper residual chain + stateful KV cache Gahow Wang 2026-07-01 17:50:49 +08:00
  • 68b55fa1e6 eagle3: γ=1 speculative bench + first end-to-end measurement Gahow Wang 2026-07-01 17:32:53 +08:00
  • 8f11d6e5cd eagle3: fix EAGLE_HOOK_LAYERS to [2, 18, 33] for Qwen3-8B Gahow Wang 2026-07-01 17:29:00 +08:00
  • e04a8ffb18 speculative: EAGLE3 draft head implementation (Phase 25 step 1) Gahow Wang 2026-07-01 17:23:22 +08:00
  • 6485c87c5b docs: Phase 25 — three speculative-decoding paradigms compared Gahow Wang 2026-07-01 16:53:37 +08:00
  • a77239c0c8 speculative: Qwen3 decode graph + gamma sweep (Phase 24 step 2) Gahow Wang 2026-07-01 16:32:17 +08:00
  • e5734b41fa speculative: batched-GEMV kernel for verify path (Phase 24 step 1) Gahow Wang 2026-07-01 16:13:37 +08:00
  • 42e13f33dd docs: Phase 24 investigation notes and revised speedup plan Gahow Wang 2026-07-01 15:35:11 +08:00
  • fcf531a9b2 style: rustfmt server engine files Gahow Wang 2026-07-01 15:13:35 +08:00
  • d96ee0766c server: sampling-param validation, finish_reason normalization, backpressure Gahow Wang 2026-07-01 15:13:24 +08:00
  • ce10e4a998 sampling: NaN-safe sample() top-k/top-p path Gahow Wang 2026-07-01 15:13:19 +08:00
  • 5f060902f6 cuda: fix remaining int32-address and nondeterministic-reduction bugs Gahow Wang 2026-07-01 15:13:07 +08:00
  • a67753f516 softmax: cap block size at 512 threads Gahow Wang 2026-07-01 14:15:58 +08:00
  • f5ec10c2c3 xserv-cli: expose sampling params and greedy repetition penalty Gahow Wang 2026-07-01 14:15:50 +08:00
  • ce7229f4fe speculative: Qwen3 draft-model v0 with paged verify parity Gahow Wang 2026-07-01 14:15:39 +08:00
  • 5b350ee5f0 cuda: deterministic BF16 gemv + paged attention reductions Gahow Wang 2026-07-01 14:14:55 +08:00
  • 0314b4f3ac server: non-blocking stream send — stop one slow client stalling the batch Gahow Wang 2026-07-01 12:37:32 +08:00
  • cfbd64d206 cuda: fix int32 overflow in MoE dense kernels; surface launch errors in release Gahow Wang 2026-07-01 12:37:21 +08:00
  • 531cd3fe08 style: format Rust workspace Gahow Wang 2026-06-18 18:11:58 +08:00
  • 013465fc06 docs: Phase 21 — decode CUDA graph + GPU argmax results Gahow Wang 2026-06-12 20:12:37 +08:00
  • 8414f8d1e6 sampling: GPU argmax fast path for greedy decode Gahow Wang 2026-06-12 20:12:37 +08:00
  • 34224c7c93 gpt-oss: replay the whole batch=1 decode step as one CUDA graph Gahow Wang 2026-06-12 20:12:37 +08:00
  • 4088f49b7d cuda: infrastructure for whole-step CUDA graph capture Gahow Wang 2026-06-12 20:12:37 +08:00
  • 2a92f268a9 docs: fill the Phase 19 gap, refresh README/roadmap to actual state Gahow Wang 2026-06-12 17:02:59 +08:00
  • 5343391dbd review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings Gahow Wang 2026-06-12 17:02:59 +08:00
  • 1897b2e17a gpt-oss: drop debug syncs from forward; GPU broadcast bias-add Gahow Wang 2026-06-12 17:02:59 +08:00
  • 63f5599717 server: serve gpt-oss on a single GPU via the TP engine (world=1) Gahow Wang 2026-06-12 16:29:10 +08:00
  • fb20178992 moe: sparse top-k decode — compute only routed experts (1.8x, beats llama TP=2) Gahow Wang 2026-06-12 16:29:10 +08:00
  • cf1e9e41db tools: single-stream decode benchmark vs llama.cpp Gahow Wang 2026-06-12 15:01:42 +08:00
  • d33220498a quantization: MXFP4 W4A16 expert weights (memory-optimization foundation) Gahow Wang 2026-06-12 15:01:42 +08:00
  • e631a71b68 quantization: single strided-batched FP8 MoE GEMM — cut per-token launches ~768→48 Gahow Wang 2026-06-12 01:23:29 +08:00
  • 24c49c31c2 tools: warm-server FP8 vs BF16 benchmark + results doc Gahow Wang 2026-06-12 00:58:46 +08:00
  • 5a16225c1f quantization: cache cuBLASLt FP8 plan per shape — fix per-expert heuristic churn Gahow Wang 2026-06-12 00:58:46 +08:00
  • 3a530956af tools: add FP8 vs BF16 benchmark and GSM8K eval harness Gahow Wang 2026-06-08 00:25:50 +08:00
  • 76487b7963 quantization: W8A8 FP8 compute via cuBLASLt tensor cores Gahow Wang 2026-06-07 20:38:26 +08:00
  • 9f1fbbb98b quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights Gahow Wang 2026-06-07 19:33:07 +08:00
  • e1eb77baa4 xserv-chat: fix unclosed <think> on early termination and flush analysis tokens Gahow Wang 2026-06-03 01:01:41 +08:00
  • 34e9bee375 xserv-chat: render gpt-oss analysis as a Qwen3-style <think> block Gahow Wang 2026-06-02 21:37:28 +08:00
  • 3b9e32e6cd kernels: fix uninitialized shared-memory read in M=1 decode GEMV Gahow Wang 2026-06-02 17:18:37 +08:00
  • 5157b2cd30 kernels: fix NaN in flash-attention sinks on fully-masked window tiles Gahow Wang 2026-06-02 16:09:43 +08:00
  • ea5d8ba7ea xserv-chat: render gpt-oss multi-turn as canonical harmony (drop CoT) Gahow Wang 2026-06-02 15:39:24 +08:00
  • c0a81c84e7 server: canonical harmony system message in gpt-oss fallback Gahow Wang 2026-06-02 15:19:50 +08:00
  • 3d6bb1918e xserv-chat: fix gpt-oss harmony chat (canonical system prompt + routing) Gahow Wang 2026-06-02 15:19:07 +08:00
  • f2e60218b4 xserv-chat: harmony channel routing + repetition penalty for gpt-oss Gahow Wang 2026-06-02 12:40:17 +08:00
  • 3ee8df2c0f xserv-chat: filter harmony control tokens + stop at <|end|> for gpt-oss Gahow Wang 2026-06-02 12:05:07 +08:00
  • ae08896f46 xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug Gahow Wang 2026-06-02 00:58:10 +08:00
  • 1d0ec32e8d server: Jinja chat template rendering via minijinja Gahow Wang 2026-05-31 13:23:18 +08:00
  • 4368e79695 model: fused GPU MoE kernel — eliminate CPU roundtrip Gahow Wang 2026-05-31 13:22:59 +08:00
  • 377a04b81f tokenizer: read pre-tokenizer regex from tokenizer.json Gahow Wang 2026-05-31 13:22:35 +08:00
  • 241009a96c docs: remove TO-BE-FIXED.md — all listed issues have been resolved Gahow Wang 2026-05-30 21:20:48 +08:00
  • 0c6135aea3 bench-gpt-oss: teacher-forced diagnostics + --prompt flag Gahow Wang 2026-05-31 00:56:46 +08:00
  • ffd90ce7fb server: emit harmony developer instructions block for gpt-oss Gahow Wang 2026-05-31 00:56:39 +08:00
  • 3c9d5e260e server: harmony termination via is_eos + TP repetition penalty Gahow Wang 2026-05-31 00:56:33 +08:00
  • 99b212e6c1 model/sampling: NaN-safe argmax + optional repetition penalty Gahow Wang 2026-05-31 00:56:27 +08:00
  • e11f15e009 tokenizer: support multiple end-of-generation tokens Gahow Wang 2026-05-31 00:56:21 +08:00
  • 9c98c169ff kernels: flash attention with gpt-oss sinks + sliding window Gahow Wang 2026-05-31 00:56:10 +08:00
  • 5cb3cf28f9 server: add gpt-oss chat template for proper prompt formatting Gahow Wang 2026-05-30 15:43:29 +08:00
  • 15c51f143e server: support GptOss in TP engine + benchmark script Gahow Wang 2026-05-30 15:39:44 +08:00
  • d29c39d74e fix: GEMV NaN bug — skip custom kernel for small N (<256) Gahow Wang 2026-05-30 15:20:04 +08:00
  • 9ad91a4a92 phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2 Gahow Wang 2026-05-30 15:18:01 +08:00
  • 46bfb59f30 Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference Gahow Wang 2026-05-30 13:13:05 +08:00
  • 9a01c60100 server: GPU argmax fast path for greedy decode Gahow Wang 2026-05-30 12:50:47 +08:00
  • c679f618fd model: fuse QKV/gate_up projections, batched decode ops Gahow Wang 2026-05-30 12:50:39 +08:00
  • cc4bd4cfe5 paged-kv: kernel-based scatter + fix data_ptr offset bug Gahow Wang 2026-05-30 12:50:28 +08:00
  • 13ae3de69e kernels: reshape_and_cache, GPU argmax, single-launch GEMV Gahow Wang 2026-05-30 12:50:17 +08:00
  • 6ce21345be cuda: add cached_trim() to release pooled GPU buffers Gahow Wang 2026-05-30 12:50:04 +08:00
  • 1ab6ca9c09 tensor: add narrow() view and relax is_contiguous for size-1 dims Gahow Wang 2026-05-30 12:49:57 +08:00
  • 11e0154e4d docs: Phase 18 pipeline parallelism — design + benchmark results Gahow Wang 2026-05-29 18:46:06 +08:00
  • d5dcf1a5ab bench: PP harness (xserv --pp vs llama.cpp -sm layer) Gahow Wang 2026-05-29 18:45:59 +08:00
  • 824cc58daa server: pipeline-parallel HTTP engine (--pp N) Gahow Wang 2026-05-29 18:45:52 +08:00
  • da3aaa134a model: pipeline-parallel Qwen3 (from_weights_pp + stage forward) Gahow Wang 2026-05-29 18:45:47 +08:00
  • 859c0cc0b6 distributed: NCCL P2P primitives (PpContext + send/recv) Gahow Wang 2026-05-29 18:45:42 +08:00
  • c2362df1f1 fix(xserv-chat): UTF-8/CJK-aware line input Gahow Wang 2026-05-29 11:36:54 +08:00
  • 7b8b520cda docs: TP=1/2/4 xserv vs llama.cpp benchmark results Gahow Wang 2026-05-29 11:10:52 +08:00
  • a4a171d425 bench: TP sweep harness (xserv --tp, llama row-split, concurrent groups) Gahow Wang 2026-05-29 11:10:43 +08:00
  • 95eb61d639 server: tensor-parallel HTTP engine (--tp N) Gahow Wang 2026-05-29 11:10:33 +08:00
  • f17011129e model: tensor-parallel Qwen3 (sharded weights + AllReduce) Gahow Wang 2026-05-29 11:10:24 +08:00
  • 453520d622 distributed: NCCL tensor-parallel primitives (TpContext + AllReduce) Gahow Wang 2026-05-29 11:10:14 +08:00
  • 76fffb3b68 docs: Phase 17 tensor parallelism design Gahow Wang 2026-05-29 11:10:03 +08:00
  • 14a44b503e docs: add Chinese README (overview + usage) Gahow Wang 2026-05-28 21:38:20 +08:00
  • 80157e614a docs: update llama.cpp comparison with 8192 results (OOM fixed) Gahow Wang 2026-05-28 21:32:14 +08:00
  • fc1900a745 server: VRAM-sized KV pool + vLLM-style swap scheduler Gahow Wang 2026-05-28 19:59:06 +08:00
  • d52baa0006 model: paged KV cache with CPU swap pool, decode graph, qwen3 updates Gahow Wang 2026-05-28 19:58:54 +08:00
  • 4c3f914459 kernels/cuda: paged-attention kernel, dispatch, pinned host memory Gahow Wang 2026-05-28 19:58:36 +08:00
  • 3f1c3d429a docs: llama.cpp vs xserv benchmark results + summary Gahow Wang 2026-05-28 15:06:21 +08:00
  • 950ccf3822 bench: fix llama.cpp per-slot context (was 1/parallel of intended) Gahow Wang 2026-05-28 15:06:12 +08:00
  • 7cb9ee3870 bench: run one server at a time, match thinking mode, fix tools package Gahow Wang 2026-05-28 11:40:07 +08:00