Commit Graph

120 Commits

Author SHA1 Message Date
Gahow Wang
1d0ec32e8d server: Jinja chat template rendering via minijinja
Load the model's chat_template.jinja (or tokenizer_config.json
chat_template field) at startup and render it with minijinja instead of
hardcoded per-model prompt builders.

Custom Jinja functions: strftime_now (date formatting), raise_exception
(template validation errors).  Falls back to Qwen3 ChatML template if
no Jinja template is found.

Removes the hardcoded build_prompt_gpt_oss() — the model's own template
now drives prompt formatting, matching llama.cpp's behavior exactly.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 13:23:18 +08:00
Gahow Wang
4368e79695 model: fused GPU MoE kernel — eliminate CPU roundtrip
Replace the per-token CPU-routed MoE forward with an all-GPU path:

  1. moe_topk_softmax: GPU top-k + softmax (was CPU sort + softmax)
  2. moe_replicate: broadcast input to all local experts
  3. cublasGemmStridedBatchedEx: batched expert matmul (was per-expert cuBLAS)
  4. moe_weighted_sum: FP32-accumulated weighted sum on GPU (was GPU→CPU→F32→BF16→GPU)

Expert weights stored as contiguous 3D tensors for strided batched GEMM.
Zero CPU↔GPU transfers per MoE layer (was ~40 per token per layer).

Also: configurable geglu_alpha, LayerNorm bias auto-detect, unused-weight
diagnostic at load time.

GSM8K 30-problem: 11/30 → 23/30 (76.7%) vs llama.cpp 30/30 (100%).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 13:22:59 +08:00
Gahow Wang
377a04b81f tokenizer: read pre-tokenizer regex from tokenizer.json
Parse the model's `pre_tokenizer` section to extract its Split regex
instead of hardcoding the GPT-2 pattern.  The gpt-oss-20b model uses
a GPT-4-style regex that produces different word boundaries, causing a
1-token prompt mismatch vs HuggingFace (136 → 135 tokens, now aligned).

Unsupported lookahead `(?!\S)` is stripped — it only affects trailing
whitespace edge cases.  Falls back to the old GPT-2/Qwen heuristic if
the model regex fails to compile or is absent.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 13:22:35 +08:00
Gahow Wang
241009a96c docs: remove TO-BE-FIXED.md — all listed issues have been resolved
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-31 01:06:26 +08:00
0c6135aea3 bench-gpt-oss: teacher-forced diagnostics + --prompt flag
Add --prompt to override the fixed prompt, and two teacher-forced
diagnostics: --forced runs prefill over prompt+oracle ids and reports
per-position top-1 agreement; --forced-decode walks the oracle trajectory
through the decode path with per-position agreement bucketed by position,
to localize long-context decode divergence from the reference.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:46 +08:00
ffd90ce7fb server: emit harmony developer instructions block for gpt-oss
Route caller-supplied system messages into a harmony 'developer'
instructions block (<|start|>developer<|message|># Instructions...),
keeping the fixed system/meta block for the channel declaration. Harmony
puts user instructions on the developer role, not system.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:39 +08:00
3c9d5e260e server: harmony termination via is_eos + TP repetition penalty
Use tokenizer.is_eos() (multi-eos) for generation termination in both PP
and TP engines instead of a single eos id, so gpt-oss stops on <|return|>
/<|call|>/<|endoftext|>.

In the TP engine, optionally apply a repetition penalty on the greedy
decode path (XSERV_REP_PENALTY>1 over XSERV_REP_WINDOW recent tokens; off
by default) to break greedy repetition loops.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:33 +08:00
99b212e6c1 model/sampling: NaN-safe argmax + optional repetition penalty
Make argmax skip NaN logits (warn once) instead of panicking the engine
thread on a single NaN. Add sample_greedy_penalized() applying an
HF-style repetition penalty over recent ids on the greedy path, to break
greedy repetition loops on reasoning models without touching the forward
pass.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:27 +08:00
e11f15e009 tokenizer: support multiple end-of-generation tokens
Track an ordered eos_token_ids list (not just one id) and add is_eos().
gpt-oss/harmony ends the assistant turn on <|return|> and also treats
<|call|> and <|endoftext|> as terminators (generation_config.json
eos_token_id = [200002, 199999, 200012]); single-eos families are
unchanged.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:21 +08:00
9c98c169ff kernels: flash attention with gpt-oss sinks + sliding window
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel) and supports an optional sliding-window mask matching HF gpt-oss.

Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:10 +08:00
Gahow Wang
5cb3cf28f9 server: add gpt-oss chat template for proper prompt formatting
The gpt-oss model requires a specific prompt format with <|start|>,
<|message|>, <|end|>, <|channel|> tokens. Without this, the model
produces degenerate output. Auto-detected via config.model_type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:43:29 +08:00
Gahow Wang
15c51f143e server: support GptOss in TP engine + benchmark script
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
  config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
  vs llama.cpp (BF16 GGUF) on GSM8K quality + speed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:39:44 +08:00
Gahow Wang
d29c39d74e fix: GEMV NaN bug — skip custom kernel for small N (<256)
The custom launch_gemv_bf16 kernel produces NaN when output dimension N
is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for
N < 256. Also removes the padding workaround in gpt_oss MoE forward.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:20:04 +08:00
Gahow Wang
9ad91a4a92 phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2
Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params,
32 experts × top-4 routing). Key additions:

- ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window,
  attention_bias, explicit head_dim, rope_scaling, swiglu_limit)
- YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation
  and attention_scaling = 0.1*ln(factor)+1
- Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation)
- Paged attention with sinks + sliding window kernel variant
- GptOss model struct with expert-parallel TP (split 32 experts across ranks)
- bench-gpt-oss binary for TP inference benchmarking

Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT.
Model generates topically-coherent output (needs chat template for quality).

Known issues:
- Custom GEMV kernel produces NaN with small N (workaround: pad to M=2)
- Prefill doesn't use attention sinks (uses standard flash attention)
- Output quality requires chat template formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:18:01 +08:00
Gahow Wang
46bfb59f30 Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage holds layers [s*L, (s+1)*L), stage 0 owns embedding, last
stage owns norm/lm_head. v1 serial (one request at a time) — correctness
+ per-GPU memory savings (~1/N). Refactors model to unfused QKV/gate_up
projections and removes unused kernels (argmax, reshape_and_cache).
2026-05-30 13:13:05 +08:00
Gahow Wang
9a01c60100 server: GPU argmax fast path for greedy decode
When all active sequences use temperature=0, run argmax on the GPU and
only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size]
BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches
fall back to the existing CPU path.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:47 +08:00
Gahow Wang
c679f618fd model: fuse QKV/gate_up projections, batched decode ops
Weight fusion at load time:
- q/k/v_proj → single qkv_proj_wt, GEMV once then narrow() to split
- gate/up_proj → single gate_up_proj_wt, same pattern
- Reduces GEMV calls from 7 to 4 per layer (36 layers → 108 fewer launches)

Batched decode refactor (forward_decode_paged):
- Per-head RMSNorm: reshape to [B*H, D], one rmsnorm call
- Batched RoPE: one call for all sequences
- Batched KV scatter: one reshape_and_cache kernel per layer
- Eliminates the per-sequence loop entirely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:39 +08:00
Gahow Wang
cc4bd4cfe5 paged-kv: kernel-based scatter + fix data_ptr offset bug
Replace the Rust cudaMemcpy loop in append_tokens() with the new
reshape_and_cache kernel. Add append_tokens_batched() for the decode
path using the batched variant.

Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that
tensor offset is respected. The old code silently read from storage base
(element 0) instead of the tensor's logical start, which produced wrong
results when K/V tensors were narrow() views into a fused QKV buffer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:28 +08:00
Gahow Wang
13ae3de69e kernels: reshape_and_cache, GPU argmax, single-launch GEMV
Three new CUDA kernels and one rewrite:

- reshape_and_cache: scatter K/V into paged pool in a single kernel per
  layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
  Includes both single-sequence (prefill) and batched (decode) variants.

- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
  decode now only D2H-transfers B×4 bytes (token ids) instead of the
  full [B, vocab] logits tensor.

- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
  the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:17 +08:00
Gahow Wang
6ce21345be cuda: add cached_trim() to release pooled GPU buffers
Exposes the caching allocator's trim() through a public free function.
Called after weight fusion during model loading to free temporary buffers
that would otherwise sit in the pool and cause OOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:04 +08:00
Gahow Wang
1ab6ca9c09 tensor: add narrow() view and relax is_contiguous for size-1 dims
narrow(dim, start, len) creates a zero-copy slice along any dimension.
is_contiguous() now ignores stride mismatches on dimensions of size 1,
since those dimensions are never stepped. This avoids unnecessary GPU
strided copies when slicing fused projection outputs at batch=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:49:57 +08:00
11e0154e4d docs: Phase 18 pipeline parallelism — design + benchmark results
docs/18-pipeline-parallelism.md: PP design (layer split, NCCL P2P,
per-stage KV, engine/threading model).
docs/benchmarks/pp-sweep.md: measured on dash5 (8x RTX 5090, Qwen3-8B
BF16) — single-stream latency + per-GPU VRAM (~1/N), byte-exact
correctness (single x2 vs pp4 x2 control), and the full AIME-30 +
GSM8K-30 quality matrix (xserv & llama.cpp PP=1/2/4): GSM8K 29/30 in
every cell, TPOT flat across PP.
README: multi-card (TP/PP) section + roadmap to Phase 18.
gitignore: /.claude/ runtime state.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:57:09 +08:00
d5dcf1a5ab bench: PP harness (xserv --pp vs llama.cpp -sm layer)
runner/servers: add --pp for both engines (xserv --pp N; llama.cpp
-sm layer over N GPUs). New drivers: pp_final.sh (sequential latency +
per-GPU VRAM + byte-exact correctness), pp_diag.sh (single x2 vs pp4 x2
determinism control), pp_quality_full.sh / pp_llama_47.sh (AIME+GSM8K
matrix, xserv on 0-3 || llama on 4-7), summarize_pp/summarize_fullq,
pp_time.py latency probe.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:59 +08:00
824cc58daa server: pipeline-parallel HTTP engine (--pp N)
pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling +
stop logic) on the calling thread, worker stage threads for 1..P. Each
step the coordinator embeds + runs its layers, then the hidden state is
handed stage->stage over NCCL P2P; the last stage samples and returns
the token to stage 0 over an in-process channel. v1 is serial (one
request, one token/step) — correctness first; throughput via microbatch
overlap is future work.

main: wire --pp N (mutually exclusive with --tp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:52 +08:00
da3aaa134a model: pipeline-parallel Qwen3 (from_weights_pp + stage forward)
Layer-wise split: each stage loads only its contiguous layer range
[s*L, (s+1)*L); stage 0 keeps embed_tokens, the last stage keeps
norm/lm_head (others get a 1x1 placeholder). Heads are NOT split
(PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/
forward_layers_decode that take and return the [tokens, hidden] hidden
state; per-stage PagedKVCache is indexed by local layer id.

sampling: derive Clone on SamplingParams (carried in the PP command enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:47 +08:00
859c0cc0b6 distributed: NCCL P2P primitives (PpContext + send/recv)
Add ncclSend/ncclRecv FFI and a PpContext that initializes a NCCL
communicator across P pipeline stages and hands the hidden state to
neighbour stages on the null stream. Mirrors TpContext; the collective
differs (point-to-point hand-off vs in-layer AllReduce).

tests/sendrecv.rs: 2-GPU stage0->stage1 send/recv smoke test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:42 +08:00
c2362df1f1 fix(xserv-chat): UTF-8/CJK-aware line input
Cooked-mode read_line() left line editing to the terminal, so Backspace on a
multi-byte 汉字/かな/한글 deleted a byte (or behaved inconsistently across TTYs).
Replace with a raw-mode reader (libc termios): Backspace pops a whole char,
multi-byte input is reassembled from its continuation bytes, and a full-line
redraw renders double-width glyphs correctly. Non-TTY input falls back to a
plain read; raw mode is restored after each line. libc is already a locked
transitive dep, so this builds offline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:36:54 +08:00
7b8b520cda docs: TP=1/2/4 xserv vs llama.cpp benchmark results
AIME 2025 + GSM8K at TP=1/2/4. Quality on par across engines/TP. Opposite
perf scaling: xserv TPOT improves with TP (21->17->15ms) while llama.cpp
row-split regresses over PCIe (10->19->20ms), crossing over so xserv is faster
at TP=4. Includes the clean same-path bench-tp scaling (58/76/86 tok/s).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:52 +08:00
a4a171d425 bench: TP sweep harness (xserv --tp, llama row-split, concurrent groups)
runner/servers gain --tp (xserv --tp N; llama.cpp --split-mode row) and
--llama-devices so llama can run on a disjoint GPU group. run_tp_parallel.sh
runs xserv (GPU 0..N-1) and llama.cpp (GPU 4..4+N-1) concurrently per TP,
matching the box's 0-3 / 4-7 PHB groups. summarize_tp.py tabulates the sweep.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:43 +08:00
95eb61d639 server: tensor-parallel HTTP engine (--tp N)
tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token
commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled
token always comes from rank 0, so it's correct for greedy and stochastic
sampling. Serial single-request path (sufficient for the quality benchmark).
--tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:33 +08:00
f17011129e model: tensor-parallel Qwen3 (sharded weights + AllReduce)
from_weights_tp shards each rank's weights (column-split q/k/v/gate/up,
row-split o/down; replicate norms/embed/lm_head) and the paged forward uses
local head counts + AllReduces after o_proj and down_proj. PagedKVCache::new_tp
sizes the pool for the rank's local KV heads (KV is sharded too). TP=1 is the
identity path. New bench-tp binary runs E2E multi-GPU generation per TP degree.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:24 +08:00
453520d622 distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)
New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per
thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so
it orders naturally with the model's kernels. 2-GPU AllReduce test included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:14 +08:00
76fffb3b68 docs: Phase 17 tensor parallelism design
Megatron-style TP for Qwen3 on the 8x5090 (no-NVLink, PCIe) box: column/row
split per layer, 2 AllReduces/layer, multi-thread one-rank-per-GPU model,
NCCL, sharded weights, and the incremental implementation + verification plan.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:03 +08:00
14a44b503e docs: add Chinese README (overview + usage)
Project intro, architecture, build, basic usage (HTTP server / CLI / bench),
and the llama.cpp comparison workflow.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 21:38:20 +08:00
80157e614a docs: update llama.cpp comparison with 8192 results (OOM fixed)
Re-ran the full comparison at --max-seq-len 8192 now that xserv handles it:
- OOM finding resolved — pool sized to available VRAM + vLLM-style host swap;
  8192 runs with 0 swap events (swap is the overload safety net).
- Quality at parity with equal context: AIME 20.0% vs 20.0%, GSM8K 98% vs 96%.
- Speed unchanged relative to llama.cpp (~0.42-0.60x); TPOT is bandwidth-bound.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 21:32:14 +08:00
fc1900a745 server: VRAM-sized KV pool + vLLM-style swap scheduler
Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory:

- Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the
  worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192.
- Scheduler tracks waiting/running/swapped sets: block-aware admission,
  swap-in of resumable sequences when blocks free, and preemption of the
  newest running sequence to host when the pool can't cover a decode step.
- --swap-space-gb (default 8) sizes the pinned host swap pool;
  XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping.
- api: poison-tolerant lock + clean 503 when the engine thread is gone,
  instead of cascading mutex-poison panics.

Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a
forced 40-block pool drives 48 lossless swap-out/swap-in cycles under
concurrency with coherent output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:59:06 +08:00
d52baa0006 model: paged KV cache with CPU swap pool, decode graph, qwen3 updates
- paged_kv_cache: new block-paged KV cache; adds a pinned-host swap pool with
  a second BlockAllocator, per-sequence Location {Gpu,Cpu}, and lossless
  swap_out/swap_in (block-granular D2H/H2D) for vLLM-style preemption.
  bytes_per_block helper exposes per-block cost for VRAM-based sizing.
- decode_graph: CUDA-graph decode path.
- qwen3/gpt2/kv_cache: paged prefill/decode forward + related updates.
- tokenizer/bins: BPE updates, new xserv-chat CLI, bench-qwen3 tweaks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:58:54 +08:00
4c3f914459 kernels/cuda: paged-attention kernel, dispatch, pinned host memory
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
  activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
  pool backing) and offset-aware D2H/H2D copies used to move KV blocks
  between the GPU pool and pinned host memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:58:36 +08:00
3f1c3d429a docs: llama.cpp vs xserv benchmark results + summary
Record what the new baseline adds (llama.cpp pinned b9371, same BF16 weights,
AIME 2025 + GSM8K) and the measured results: performance (xserv ~0.45-0.61x
llama.cpp throughput) and quality parity (GSM8K 94% vs 96%, AIME 23.3% vs 20%
after the context fix), plus the findings the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 15:06:21 +08:00
950ccf3822 bench: fix llama.cpp per-slot context (was 1/parallel of intended)
llama.cpp divides total -c across --parallel slots, so -c 4096 --parallel 4
gave each request only 1024 tokens — truncating long AIME generations before
the boxed answer and making xserv look artificially better (20% vs 3.3%).
Set total -c = max_seq_len * n_parallel so per-slot context equals xserv's
per-sequence max_seq_len. Also drop --log-disable; its startup log reports the
per-slot n_ctx that catches exactly this misconfiguration.

After the fix, AIME is at parity (xserv 23.3% vs llama.cpp 20.0%), matching the
GSM8K parity and confirming the gap was a config artifact, not engine quality.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 15:06:12 +08:00
7cb9ee3870 bench: run one server at a time, match thinking mode, fix tools package
Refinements from end-to-end bring-up on the GPU host:

- Run each system start→suites→stop in sequence. Two BF16 8B models don't
  co-reside on one 32GB GPU, and a resident idle engine would distort the
  other's latency/throughput.
- Match generation mode: xserv hardcodes Qwen3 thinking off, so send
  chat_template_kwargs={enable_thinking:false} to llama.cpp via a per-endpoint
  extra_body. --enable-thinking opts back into thinking mode.
- Add tools/__init__.py so `python3 -m tools.bench.runner` resolves our package
  instead of a site-packages `tools` (nvfuser ships one that shadowed it).
- Document offline-GPU-host workflow, thinking-match, and the xserv 8192 OOM
  finding that the bench surfaced.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:40:07 +08:00
49c7653222 tools: add llama.cpp comparison baseline + standard benchmark suite
Vendor llama.cpp as a submodule pinned to b9371 and add a one-click
benchmark driver that compares xserv against it on identical workloads:

- setup-llama-cpp.sh: network-optional CUDA build (SM120); convert-to-gguf.sh
  converts the same safetensors to BF16 GGUF for an apples-to-apples baseline.
- tools/bench/: black-box OpenAI-API driver measuring TTFT/TPOT/throughput
  (single-stream + concurrent) and response quality on AIME 2025 + GSM8K.
- fetch_datasets.py pulls datasets to local JSON (GPU host has no network);
  task loaders prefer the local JSON.
- sync-and-build.sh: `bench` subcommand transfers source + datasets to the
  GPU host via tar-over-ssh (no rsync there), builds, and runs the suite.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 11:18:52 +08:00
9bb5c5c328 tools: add correctness + performance test scripts for Qwen3-8B
- test_correctness.py: compare prefill logits top-20 vs HF transformers
- bench_server.py: HTTP API benchmark (throughput, streaming, concurrent, EOS leak check)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:13:49 +08:00
986a289616 fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090
P0 fixes (blocking usability):
- FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul)
- FIX-16: EOS token no longer leaks into API responses
- FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256)
- FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400

P1 fixes (bugs & performance):
- FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat)
- FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety
- FIX-09: tokenizer byte_fallback graceful degradation (was panic)
- FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf)
- FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm
- FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches

P2 fixes (improvements):
- FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations
- FIX-23: RoPE cache no longer artificially capped at 8192 positions

Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:13:43 +08:00
a67e724119 docs: Phase 15 design doc + benchmark report
Design document (docs/15-performance.md):
- Roofline analysis: 112 tok/s theoretical at 1.79 TB/s
- Bottleneck quantification: cuBLAS M=1 GEMV at 8% bandwidth → 77% of step time
- Six optimizations with rationale, implementation details, and expected impact
- Ablation table with per-optimization delta measurements
- Remaining 55% roofline gap breakdown with next-step priorities

Benchmark report (docs/benchmarks/phase15-performance.md):
- Full ablation: 12.9 → 50.3 tok/s across 6 optimizations
- Per-prompt detail (8 prompts, 46-51 tok/s range)
- Concurrent throughput analysis (batch=4 vs serial)
- Phase-over-phase tracking from Phase 8 to Phase 15 (2.5 → 50.3 tok/s)
- Correctness verification (9/10 top-1 match, 52/52 API pass)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 00:39:27 +08:00
d5532ef209 phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline)
Two optimizations:

1. Tensor::empty() — skip cudaMemset for output tensors
   All kernel wrappers that fully overwrite their output now use
   Tensor::empty() instead of Tensor::zeros(). Eliminates ~756
   cudaMemset calls per decode step (21 per layer × 36 layers).
   Improvement: 46.6 → 50.3 tok/s (+8%).

2. CUDA Graph infrastructure (for future use)
   Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate,
   cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the
   forward pass due to variable kv_len, but provides foundation for
   future graph-based decode optimization.

Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode):

| Optimization | tok/s | vs HF | Roofline |
|-------------|-------|-------|----------|
| Phase 14 baseline | 12.9 | 36% | 12% |
| + Fused kernels | 13.2 | 37% | 12% |
| + Batched decode | 13.2 (serial) | 37% | 12% |
| + Custom GEMV | 46.6 | 130% | 42% |
| + Tensor::empty | 50.3 | 140% | 45% |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 23:57:34 +08:00
e207523e21 phase 15: custom GEMV kernel — 46.6 tok/s serial (3.5x improvement, 130% of HF)
Custom bandwidth-optimized GEMV kernel for M=1 BF16 decode, replacing
cuBLAS which achieves only ~8% bandwidth utilization for tiny M=1 GEMMs.

Kernel design (csrc/gemm/gemv.cu):
- K-split tiled: TILE_N=128, TILE_K=256, Grid=(N/128, K/256)=512 blocks
- High occupancy: 512 blocks / 170 SMs = ~3 blocks/SM
- Coalesced memory access: adjacent threads read adjacent columns of W
- Shared memory for x vector (avoids redundant global reads)
- FP32 accumulation via atomicAdd (K-split partial sums)
- Separate fp32→bf16 conversion kernel

Integration:
- matmul() auto-dispatches to custom GEMV when M==1 && dtype==BF16
- Batched decode (M>1) continues to use cuBLAS
- Caching allocator provides FP32 temp buffer (pooled, no per-call malloc)

Ablation results (dash5, RTX 5090, Qwen3-8B BF16):

| Config | tok/s | vs HF (36) | vs roofline (112) |
|--------|-------|-----------|-------------------|
| Phase 14 (cuBLAS M=1) | 13.2 | 37% | 12% |
| + Custom GEMV (M=1) | 46.6 | 130% | 42% |
| Concurrent batch=4 | 28.2 | 78% | — |

Single-request throughput now EXCEEDS HuggingFace transformers by 30%.
The custom GEMV achieves ~42% of the theoretical roofline (vs 12% before).

Note: concurrent batch=4 (28.2 tok/s) is slower than serial (46.6 tok/s)
because the per-seq attention/reshape overhead in batched decode outweighs
the cuBLAS M=4 benefit when the custom GEMV already handles M=1 efficiently.
Engine should prefer serial decode when custom GEMV is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 22:22:31 +08:00
876d3f5d6a phase 15: batched decode forward — 35 tok/s (97% of HF transformers)
Implement batched decode that processes multiple sequences' tokens in one
forward pass. The key insight: cuBLAS M=4 GEMM is dramatically faster
than 4× M=1 GEMV due to better TensorCore utilization and amortized
kernel launch overhead.

New method Qwen3::forward_decode_batch(&tokens, &positions, &mut caches):
- Batched embedding, norm, projections, FFN: [B, hidden] × [hidden, X]
  → one cuBLAS call per weight matrix instead of B calls
- Per-sequence attention: RoPE, KV cache, decode_attention remain per-seq
  (each has different position and KV length)
- Row extraction (row_view) and concatenation (concat_rows) for
  batched↔per-seq transitions

Engine Step 4b:
- batch_size >= 2: extracts caches via std::mem::replace, calls
  forward_decode_batch, restores caches, samples per-sequence
- batch_size == 1: falls back to per-seq forward_gpu_cache (no overhead)

Ablation results (dash5, RTX 5090, Qwen3-8B BF16):

| Scenario | Throughput | vs HF |
|----------|-----------|-------|
| Serial (batch=1) | 13.2 tok/s | 37% |
| Concurrent (batch=4) | 35.1 tok/s | 97% |
| HF transformers | 36.0 tok/s | 100% |

The 2.66x throughput improvement (13.2 → 35.1) for concurrent requests
comes from cuBLAS going from 1008 M=1 GEMVs to 252 M=4 GEMMs per step,
which cuBLAS handles ~4x more efficiently on TensorCores.

Milestone ④ target (50% of vLLM/HF throughput) achieved with 97%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 20:07:43 +08:00
9783fcf410 phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm
Three performance optimizations targeting decode throughput:

1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
   - Specialized kernel for Q_len=1 (decode step)
   - 256 threads parallelize across KV sequence dimension
   - Online softmax with block-level warp-shuffle reduction
   - Replaces FA2 kernel which wasted 63/64 threads for decode
   - flash_attention() auto-dispatches when q_len==1

2. Fused SiLU×Mul (csrc/activation/activations.cu):
   - Single kernel: out = silu(gate) * up
   - Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
   - Eliminates intermediate tensor allocation

3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
   - Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
   - Saves 1 full HBM round-trip per attention block
   - Eliminates separate add + rmsnorm kernel pair

Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
  marginal benefit because the bottleneck is cuBLAS GEMV overhead:
  252 weight matrix reads × ~32MB each = 15.5 GB per decode step.
  Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
  longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
  overhead, or custom GEMV kernels to replace cuBLAS for M=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 19:40:56 +08:00
6cc1c9332d docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty
Phase 14 (Flash Attention):
- Design doc: FA2 algorithm, SM120 hardware constraints (FA4 incompatible),
  kernel config (BR=BC=64, 32KB smem), GQA mapping, causal tile-skip,
  known limitations and optimization roadmap
- Benchmark doc: correctness (9/10 top-1 match, identical to pre-FA baseline),
  performance tracking (6.9→10.3→12.9 tok/s across phases), memory savings
  analysis, remaining bottleneck breakdown

Phase 11 doc: title corrected from "Paged Attention" to "GPU-Resident KV Cache"
with explicit note that paged allocation was not implemented.

Phase 12 doc: "当前状态" updated from "未实现" to reflect actual state —
iteration-level scheduling implemented + verified (6.0x concurrent speedup),
batched GPU forward explicitly marked as not yet implemented.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 18:51:29 +08:00