xserv/docs at 5157b2cd3026783900589dfc5698552124977b49

gahow/xserv

Gahow Wang ae08896f46 xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug

- Add ChatModel enum dispatching between Qwen3 and GptOss based on
  config.is_moe(), following the TP engine pattern.
- Add --tp N flag for tensor-parallel inference (required for 39GB
  gpt-oss-20b which doesn't fit on a single 32GB GPU).
- Add gpt-oss harmony chat template with channel/message format.
- Replace hardcoded is_stop_token() with tokenizer.is_eos() for
  multi-model EOS support.
- Restore gpt-oss hardcoded prompt template in server api.rs, lost
  during the Jinja template refactor.
- Fix GEMV race condition: the K-split kernel zeroed the FP32
  accumulator inside the kernel (block k=0) while other blocks
  atomicAdd'd concurrently. Pre-zero with cudaMemsetAsync instead.
- Update benchmark docs with post-fix results.
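
The model dispatch and EOS changes above could look roughly like the sketch below. It is illustrative only: apart from the names `ChatModel`, `Qwen3`, `GptOss`, `is_moe()` and `is_eos()`, which the commit message mentions, every type, field and function here (`Config::num_experts`, `Tokenizer::eos_ids`, `from_config`, `trim_at_eos`) is a hypothetical stand-in and not taken from the xserv source.

```rust
/// Hypothetical stand-in for the model config; the real struct is not shown in the source.
struct Config {
    /// Assumed field: number of MoE experts, 0 for a dense model.
    num_experts: usize,
}

impl Config {
    /// Assumed definition: a model counts as MoE when it has at least one expert.
    fn is_moe(&self) -> bool {
        self.num_experts > 0
    }
}

/// Hypothetical tokenizer that carries its own list of end-of-sequence ids.
struct Tokenizer {
    eos_ids: Vec<u32>,
}

impl Tokenizer {
    /// Per-model EOS check, replacing a single hardcoded stop-token test.
    fn is_eos(&self, id: u32) -> bool {
        self.eos_ids.contains(&id)
    }
}

/// Enum dispatch between the two supported model families.
#[derive(Debug, PartialEq)]
enum ChatModel {
    Qwen3,
    GptOss,
}

impl ChatModel {
    /// Pick the model family from the config (hypothetical constructor).
    fn from_config(config: &Config) -> Self {
        if config.is_moe() {
            ChatModel::GptOss
        } else {
            ChatModel::Qwen3
        }
    }
}

/// Keep generated tokens up to, but not including, the first EOS token.
fn trim_at_eos(tokens: &[u32], tokenizer: &Tokenizer) -> Vec<u32> {
    tokens
        .iter()
        .copied()
        .take_while(|&id| !tokenizer.is_eos(id))
        .collect()
}
```

Asking the tokenizer for EOS membership keeps the generation loop identical across models; only the id list differs per model.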

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-06-02 00:58:10 +08:00

xserv-chat: support gpt-oss-20b with TP; fix GEMV precision bug

2026-06-02 00:58:10 +08:00

00-roadmap.md

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

01-cuda-ffi.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

02-tensor.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

03-gemm.md

docs: add design docs + takeaways for Phase 2 and Phase 3

2026-05-21 20:59:45 +08:00

04-transformer-kernels.md

phase 4: transformer core kernels

2026-05-21 21:07:24 +08:00

05-attention.md

phase 5: naive multi-head attention

2026-05-21 21:17:23 +08:00

06-model-loading.md

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00

07-tokenizer.md

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00

08-gpt2.md

phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)

2026-05-21 22:04:00 +08:00

09-kv-cache.md

phase 9: KV cache + autoregressive generation

2026-05-21 23:39:41 +08:00

10-qwen3.md

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

2026-05-22 17:53:28 +08:00

11-paged-attention.md

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

12-continuous-batching.md

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

13-http-api.md

docs: split Phase 12 and Phase 13 into separate design documents

2026-05-22 13:15:27 +08:00

14-flash-attention.md

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

15-performance.md

docs: Phase 15 design doc + benchmark report

2026-05-23 00:39:27 +08:00

16-llama-cpp-comparison.md

docs: update llama.cpp comparison with 8192 results (OOM fixed)

2026-05-28 21:32:14 +08:00

17-tensor-parallelism.md

docs: Phase 17 tensor parallelism design

2026-05-29 11:10:03 +08:00

18-pipeline-parallelism.md

docs: Phase 18 pipeline parallelism — design + benchmark results

2026-05-29 18:57:09 +08:00