96 Commits

Author SHA1 Message Date
e11f15e009 tokenizer: support multiple end-of-generation tokens
Track an ordered eos_token_ids list (not just one id) and add is_eos().
gpt-oss/harmony ends the assistant turn on <|return|> and also treats
<|call|> and <|endoftext|> as terminators (generation_config.json
eos_token_id = [200002, 199999, 200012]); single-eos families are
unchanged.
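
A minimal sketch of the idea, assuming a small holder type for the terminator ids. `EosTokens`, `new`, and `primary` are hypothetical names for illustration; only `is_eos()` and the `eos_token_ids` list are named in this commit.

```rust
/// Hypothetical holder for end-of-generation token ids (illustrative only).
pub struct EosTokens {
    /// Kept in the order given by generation_config.json; the first id is
    /// treated as the primary EOS.
    ids: Vec<u32>,
}

impl EosTokens {
    /// Builds the ordered list, dropping duplicates but keeping first-seen order.
    pub fn new(ids: Vec<u32>) -> Self {
        let mut ordered = Vec::new();
        for id in ids {
            if !ordered.contains(&id) {
                ordered.push(id);
            }
        }
        Self { ids: ordered }
    }

    /// The primary EOS id, which is all a single-eos family needs.
    pub fn primary(&self) -> Option<u32> {
        self.ids.first().copied()
    }

    /// True if `token` is any of the configured terminators.
    pub fn is_eos(&self, token: u32) -> bool {
        self.ids.contains(&token)
    }
}
```

With the gpt-oss list `[200002, 199999, 200012]`, all three ids stop generation, while a single-id list behaves as before.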

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:21 +08:00
9c98c169ff kernels: flash attention with gpt-oss sinks + sliding window
Add flash_attention_sinks_bf16 prefill kernel that folds the per-head
attention sink into the softmax denominator (exactly as the decode sink
kernel does) and supports an optional sliding-window mask matching HF gpt-oss.

Wire it through xserv-kernels (flash_attention_sinks) and use it in
GptOss prefill, replacing the post-hoc sink approximation for an exact
match against the reference math.
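
For reference, a CPU sketch of what "folding the sink into the softmax denominator" means for one query row. `softmax_with_sink` is a hypothetical helper, not the kernel's API; it assumes the sink is a single extra logit per head that adds to the denominator but carries no value vector.

```rust
/// Softmax over `scores` with one extra sink logit in the denominator.
/// The sink has no value vector, so the returned weights sum to less than 1
/// whenever the sink logit is finite.
pub fn softmax_with_sink(scores: &[f32], sink: f32) -> Vec<f32> {
    // The max must include the sink so that every exponent is <= 0.
    let max = scores.iter().cloned().fold(sink, f32::max);
    let denom: f32 =
        (sink - max).exp() + scores.iter().map(|s| (s - max).exp()).sum::<f32>();
    scores.iter().map(|s| (s - max).exp() / denom).collect()
}
```

Passing `f32::NEG_INFINITY` as the sink recovers the ordinary softmax, which is a convenient way to compare against a sink-free reference.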

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-31 00:56:10 +08:00
Gahow Wang
5cb3cf28f9 server: add gpt-oss chat template for proper prompt formatting
The gpt-oss model requires a specific prompt format with <|start|>,
<|message|>, <|end|>, <|channel|> tokens. Without this, the model
produces degenerate output. Auto-detected via config.model_type.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:43:29 +08:00
Gahow Wang
15c51f143e server: support GptOss in TP engine + benchmark script
- tp_engine.rs: TpModel enum dispatches between Qwen3 and GptOss based on
  config.is_moe(). Server auto-detects model type on startup.
- tools/run_gpt_oss_bench.sh: one-click benchmark comparing xserv (TP=2)
  vs llama.cpp (BF16 GGUF) on GSM8K quality + speed

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:39:44 +08:00
Gahow Wang
d29c39d74e fix: GEMV NaN bug — skip custom kernel for small N (<256)
The custom launch_gemv_bf16 kernel produces NaN when output dimension N
is small (e.g. N=32 for the MoE router). Fall back to cuBLAS GemmEx for
N < 256. Also removes the padding workaround in gpt_oss MoE forward.
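
The resulting dispatch rule can be sketched as follows. The enum and function names are hypothetical; the conditions (M == 1, BF16, N >= 256) are taken from this commit and the earlier custom-GEMV commit.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Dtype {
    F32,
    Bf16,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MatmulPath {
    CustomGemv,
    Cublas,
}

/// Smallest output dimension for which the custom GEMV kernel is used.
pub const GEMV_MIN_N: usize = 256;

/// Picks the kernel for an [m, k] x [k, n] product (illustrative only).
pub fn select_matmul_path(m: usize, n: usize, dtype: Dtype) -> MatmulPath {
    if m == 1 && dtype == Dtype::Bf16 && n >= GEMV_MIN_N {
        MatmulPath::CustomGemv
    } else {
        // Small N (for example the N=32 MoE router), batched decode (M > 1)
        // and non-BF16 inputs all go through cuBLAS.
        MatmulPath::Cublas
    }
}
```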

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:20:04 +08:00
Gahow Wang
9ad91a4a92 phase19: MoE support — gpt-oss-20b end-to-end inference with TP=2
Add Mixture-of-Experts support for the gpt-oss-20b model (20.9B params,
32 experts × top-4 routing). Key additions:

- ModelConfig: MoE fields (num_local_experts, layer_types, sliding_window,
  attention_bias, explicit head_dim, rope_scaling, swiglu_limit)
- YaRN RoPE: RopeCache::new_yarn() with correct frequency interpolation
  and attention_scaling = 0.1*ln(factor)+1
- Custom GLU kernel: gpt_oss_glu_bf16 (clamped sigmoid gate activation)
- Paged attention with sinks + sliding window kernel variant
- GptOss model struct with expert-parallel TP (split 32 experts across ranks)
- bench-gpt-oss binary for TP inference benchmarking
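
The YaRN attention scaling listed above is a one-line formula. A minimal sketch, where the function name is hypothetical and the guard for `factor <= 1` is an assumption borrowed from the usual YaRN reference formulation rather than something stated in this commit:

```rust
/// attention_scaling = 0.1 * ln(factor) + 1, as given in the commit message.
/// Assumption: a factor of 1 or less means no context extension, so the
/// scaling is left at 1.0.
pub fn yarn_attention_scaling(factor: f64) -> f64 {
    if factor <= 1.0 {
        1.0
    } else {
        0.1 * factor.ln() + 1.0
    }
}
```

For an example factor of 32 this evaluates to roughly 1.3466.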

Verified on dash5 with 2x RTX 5090: 63.6 tok/s decode, ~160ms TTFT.
Model generates topically-coherent output (needs chat template for quality).

Known issues:
- Custom GEMV kernel produces NaN with small N (workaround: pad to M=2)
- Prefill doesn't use attention sinks (uses standard flash attention)
- Output quality requires chat template formatting

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 15:18:01 +08:00
Gahow Wang
46bfb59f30 Merge branch 'phase18-pipeline-parallelism': pipeline-parallel inference
Adds --pp N for layer-wise pipeline parallelism via NCCL P2P send/recv.
Each stage s holds layers [s*L, (s+1)*L), where L is the number of layers
per stage; stage 0 owns the embedding and the last stage owns norm/lm_head.
v1 is serial (one request at a time): the goals are correctness and
per-GPU memory savings (each GPU holds ~1/N of the layers). Refactors the
model to unfused QKV/gate_up projections and removes unused kernels
(argmax, reshape_and_cache).
2026-05-30 13:13:05 +08:00
Gahow Wang
9a01c60100 server: GPU argmax fast path for greedy decode
When all active sequences use temperature=0, run argmax on the GPU and
only D2H the token ids (~B×4 bytes) instead of the full [B, vocab_size]
BF16 logits (~1.2 MB at B=4, Qwen3 vocab=152K). Mixed-sampling batches
fall back to the existing CPU path.
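
The transfer sizes quoted above can be checked with a few lines, together with a CPU reference for the per-row argmax. All names are hypothetical, and the exact vocabulary size of 151,936 is an assumption (the commit rounds it to 152K).

```rust
/// Bytes moved device-to-host when copying full BF16 logits of shape [batch, vocab].
pub fn logits_bytes(batch: usize, vocab: usize) -> usize {
    batch * vocab * 2 // BF16 is 2 bytes per element
}

/// Bytes moved device-to-host when copying only one 4-byte token id per sequence.
pub fn token_id_bytes(batch: usize) -> usize {
    batch * 4
}

/// CPU reference for a row-wise argmax over row-major logits.
/// Ties resolve to the lowest index.
pub fn argmax_rows(logits: &[f32], vocab: usize) -> Vec<u32> {
    assert!(vocab > 0 && logits.len() % vocab == 0);
    logits
        .chunks_exact(vocab)
        .map(|row| {
            let mut best = 0usize;
            for (i, &v) in row.iter().enumerate() {
                if v > row[best] {
                    best = i;
                }
            }
            best as u32
        })
        .collect()
}
```

At B=4 the full-logits copy is 1,215,488 bytes (about 1.2 MB) against 16 bytes for the token ids.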

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:47 +08:00
Gahow Wang
c679f618fd model: fuse QKV/gate_up projections, batched decode ops
Weight fusion at load time:
- q/k/v_proj → single qkv_proj_wt, GEMV once then narrow() to split
- gate/up_proj → single gate_up_proj_wt, same pattern
- Reduces GEMV calls from 7 to 4 per layer (36 layers → 108 fewer launches)

Batched decode refactor (forward_decode_paged):
- Per-head RMSNorm: reshape to [B*H, D], one rmsnorm call
- Batched RoPE: one call for all sequences
- Batched KV scatter: one reshape_and_cache kernel per layer
- Eliminates the per-sequence loop entirely

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:39 +08:00
Gahow Wang
cc4bd4cfe5 paged-kv: kernel-based scatter + fix data_ptr offset bug
Replace the Rust cudaMemcpy loop in append_tokens() with the new
reshape_and_cache kernel. Add append_tokens_batched() for the decode
path using the batched variant.

Fix: use data_ptr() instead of storage().gpu_buffer().as_ptr() so that
tensor offset is respected. The old code silently read from storage base
(element 0) instead of the tensor's logical start, which produced wrong
results when K/V tensors were narrow() views into a fused QKV buffer.
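
A CPU analogue of the bug, assuming a toy view type. `TensorView` and its methods are hypothetical stand-ins for the real tensor and its `data_ptr()` / storage accessors.

```rust
/// Toy 1-D view into shared storage (illustrative only).
#[derive(Clone, Copy)]
pub struct TensorView<'a> {
    storage: &'a [f32],
    offset: usize,
    len: usize,
}

impl<'a> TensorView<'a> {
    pub fn from_storage(storage: &'a [f32]) -> Self {
        Self { storage, offset: 0, len: storage.len() }
    }

    /// Zero-copy slice: same storage, shifted logical start.
    pub fn narrow(&self, start: usize, len: usize) -> Self {
        assert!(start + len <= self.len, "narrow out of range");
        Self { storage: self.storage, offset: self.offset + start, len }
    }

    /// Old behaviour: read `len` elements from storage element 0, ignoring the offset.
    pub fn read_from_storage_base(&self) -> &'a [f32] {
        let storage: &'a [f32] = self.storage;
        &storage[..self.len]
    }

    /// Fixed behaviour: start at the view's logical first element.
    pub fn read_from_data_ptr(&self) -> &'a [f32] {
        let storage: &'a [f32] = self.storage;
        &storage[self.offset..self.offset + self.len]
    }
}
```

With a fused buffer `[q0, q1, k0, k1, v0, v1]`, the K view `narrow(2, 2)` reads `[k0, k1]` through the fixed path but `[q0, q1]` through the old one.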

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:28 +08:00
Gahow Wang
13ae3de69e kernels: reshape_and_cache, GPU argmax, single-launch GEMV
Three new CUDA kernels and one rewrite:

- reshape_and_cache: scatter K/V into paged pool in a single kernel per
  layer, replacing the Rust-side per-token per-head cudaMemcpy loop.
  Includes both single-sequence (prefill) and batched (decode) variants.

- argmax: GPU-side BF16 argmax with warp-shuffle reduction. Greedy
  decode now only D2H-transfers B×4 bytes (token ids) instead of the
  full [B, vocab] logits tensor.

- GEMV rewrite: fused zero-init inside the K-split kernel eliminates
  the cudaMemsetAsync call, reducing launches from 3 to 2 per GEMV.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:17 +08:00
Gahow Wang
6ce21345be cuda: add cached_trim() to release pooled GPU buffers
Exposes the caching allocator's trim() through a public free function.
Called after weight fusion during model loading to free temporary buffers
that would otherwise sit in the pool and cause OOM.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:50:04 +08:00
Gahow Wang
1ab6ca9c09 tensor: add narrow() view and relax is_contiguous for size-1 dims
narrow(dim, start, len) creates a zero-copy slice along any dimension.
is_contiguous() now ignores stride mismatches on dimensions of size 1,
since those dimensions are never stepped. This avoids unnecessary GPU
strided copies when slicing fused projection outputs at batch=1.
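
A sketch of both pieces on plain shape/stride vectors, assuming row-major layout with strides counted in elements. These free functions are hypothetical; the real methods live on the tensor type.

```rust
/// Zero-copy narrow: shrink one dimension, keep the strides, shift the offset.
pub fn narrow(
    shape: &[usize],
    strides: &[usize],
    offset: usize,
    dim: usize,
    start: usize,
    len: usize,
) -> (Vec<usize>, Vec<usize>, usize) {
    assert!(dim < shape.len() && start + len <= shape[dim]);
    let mut new_shape = shape.to_vec();
    new_shape[dim] = len;
    (new_shape, strides.to_vec(), offset + start * strides[dim])
}

/// Row-major contiguity check that ignores strides of size-1 dimensions,
/// because such a dimension is never stepped.
pub fn is_contiguous(shape: &[usize], strides: &[usize]) -> bool {
    assert_eq!(shape.len(), strides.len());
    let mut expected = 1usize;
    for (&dim, &stride) in shape.iter().zip(strides).rev() {
        if dim != 1 && stride != expected {
            return false;
        }
        expected *= dim;
    }
    true
}
```

Slicing a `[1, 12]` fused output into `[1, 4]` leaves strides `[12, 1]`: a strict check would reject the leading stride, while the relaxed one accepts it and no copy is needed.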

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:49:57 +08:00
824cc58daa server: pipeline-parallel HTTP engine (--pp N)
pp_engine::run_pp: stage-0 coordinator (scheduler/tokenizer/sampling +
stop logic) on the calling thread, worker stage threads for 1..P. Each
step the coordinator embeds + runs its layers, then the hidden state is
handed stage->stage over NCCL P2P; the last stage samples and returns
the token to stage 0 over an in-process channel. v1 is serial (one
request, one token/step) — correctness first; throughput via microbatch
overlap is future work.

main: wire --pp N (mutually exclusive with --tp).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:52 +08:00
da3aaa134a model: pipeline-parallel Qwen3 (from_weights_pp + stage forward)
Layer-wise split: each stage loads only its contiguous layer range
[s*L, (s+1)*L); stage 0 keeps embed_tokens, the last stage keeps
norm/lm_head (others get a 1x1 placeholder). Heads are NOT split
(PP is orthogonal to TP). Adds embed/head and forward_layers_prefill/
forward_layers_decode that take and return the [tokens, hidden] hidden
state; per-stage PagedKVCache is indexed by local layer id.
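
The layer-to-stage mapping can be sketched as below. Function names are hypothetical, and the sketch assumes the layer count divides evenly across stages.

```rust
use std::ops::Range;

/// Global layer range [s*L, (s+1)*L) owned by `stage`, where L = num_layers / num_stages.
pub fn stage_layers(stage: usize, num_stages: usize, num_layers: usize) -> Range<usize> {
    assert!(num_stages > 0 && stage < num_stages);
    assert!(num_layers % num_stages == 0, "sketch assumes an even split");
    let per_stage = num_layers / num_stages;
    stage * per_stage..(stage + 1) * per_stage
}

/// Stage 0 keeps the token embedding.
pub fn owns_embedding(stage: usize) -> bool {
    stage == 0
}

/// The last stage keeps the final norm and lm_head.
pub fn owns_head(stage: usize, num_stages: usize) -> bool {
    stage + 1 == num_stages
}

/// Index into the per-stage KV cache for a global layer id, if this stage owns it.
pub fn local_layer_id(
    global_layer: usize,
    stage: usize,
    num_stages: usize,
    num_layers: usize,
) -> Option<usize> {
    let range = stage_layers(stage, num_stages, num_layers);
    if range.contains(&global_layer) {
        Some(global_layer - range.start)
    } else {
        None
    }
}
```

With 36 layers and two stages, stage 1 owns layers 18..36 and stores global layer 20 at local index 2.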

sampling: derive Clone on SamplingParams (carried in the PP command enum).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:47 +08:00
859c0cc0b6 distributed: NCCL P2P primitives (PpContext + send/recv)
Add ncclSend/ncclRecv FFI and a PpContext that initializes a NCCL
communicator across P pipeline stages and hands the hidden state to
neighbour stages on the null stream. Mirrors TpContext; the collective
differs (point-to-point hand-off vs in-layer AllReduce).

tests/sendrecv.rs: 2-GPU stage0->stage1 send/recv smoke test.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-05-29 18:45:42 +08:00
c2362df1f1 fix(xserv-chat): UTF-8/CJK-aware line input
Cooked-mode read_line() left line editing to the terminal, so Backspace on a
multi-byte 汉字/かな/한글 deleted a byte (or behaved inconsistently across TTYs).
Replace with a raw-mode reader (libc termios): Backspace pops a whole char,
multi-byte input is reassembled from its continuation bytes, and a full-line
redraw renders double-width glyphs correctly. Non-TTY input falls back to a
plain read; raw mode is restored after each line. libc is already a locked
transitive dep, so this builds offline.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:36:54 +08:00
95eb61d639 server: tensor-parallel HTTP engine (--tp N)
tp_engine: rank-0 coordinator owns the scheduler and broadcasts per-token
commands (Register/Prefill/Decode/Free) to worker rank threads; the sampled
token always comes from rank 0, so it's correct for greedy and stochastic
sampling. Serial single-request path (sufficient for the quality benchmark).
--tp N selects it; TP=1 keeps the existing single-GPU Engine unchanged.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:33 +08:00
f17011129e model: tensor-parallel Qwen3 (sharded weights + AllReduce)
from_weights_tp shards each rank's weights (column-split q/k/v/gate/up,
row-split o/down; replicate norms/embed/lm_head) and the paged forward uses
local head counts + AllReduces after o_proj and down_proj. PagedKVCache::new_tp
sizes the pool for the rank's local KV heads (KV is sharded too). TP=1 is the
identity path. New bench-tp binary runs E2E multi-GPU generation per TP degree.
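
Why an AllReduce is needed after the row-split projections can be seen in a small CPU sketch: each rank multiplies its slice of the input by its rows of the weight, and the partial outputs sum to the full result. All names are hypothetical, and a plain sum over vectors stands in for the NCCL AllReduce.

```rust
/// y = x * W for a row-major W of shape [k, n].
pub fn matvec(x: &[f32], w: &[f32], k: usize, n: usize) -> Vec<f32> {
    assert_eq!(x.len(), k);
    assert_eq!(w.len(), k * n);
    let mut y = vec![0.0f32; n];
    for i in 0..k {
        for j in 0..n {
            y[j] += x[i] * w[i * n + j];
        }
    }
    y
}

/// Row-split: rank r holds rows [r*k/tp, (r+1)*k/tp) of W and the matching slice of x.
pub fn row_split_partial(
    x: &[f32],
    w: &[f32],
    k: usize,
    n: usize,
    rank: usize,
    tp: usize,
) -> Vec<f32> {
    assert!(tp > 0 && rank < tp && k % tp == 0);
    let local_k = k / tp;
    let lo = rank * local_k;
    matvec(&x[lo..lo + local_k], &w[lo * n..(lo + local_k) * n], local_k, n)
}

/// Stand-in for AllReduce(sum) across ranks.
pub fn all_reduce_sum(partials: &[Vec<f32>]) -> Vec<f32> {
    let mut out = vec![0.0f32; partials[0].len()];
    for partial in partials {
        for (o, v) in out.iter_mut().zip(partial) {
            *o += *v;
        }
    }
    out
}
```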

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:24 +08:00
453520d622 distributed: NCCL tensor-parallel primitives (TpContext + AllReduce)
New xserv-distributed crate: hand-written NCCL FFI, TpContext (one rank per
thread, bound to one GPU), and in-place BF16 AllReduce on the null stream so
it orders naturally with the model's kernels. 2-GPU AllReduce test included.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-29 11:10:14 +08:00
fc1900a745 server: VRAM-sized KV pool + vLLM-style swap scheduler
Fixes the paged-KV OOM at large --max-seq-len and adds elastic memory:

- Size the GPU block pool to available VRAM (cudaMemGetInfo) instead of the
  worst-case blocks_per_seq * max_batch * 2 reservation, which OOM'd at 8192.
- Scheduler tracks waiting/running/swapped sets: block-aware admission,
  swap-in of resumable sequences when blocks free, and preemption of the
  newest running sequence to host when the pool can't cover a decode step.
- --swap-space-gb (default 8) sizes the pinned host swap pool;
  XSERV_MAX_KV_BLOCKS forces a small pool to exercise swapping.
- api: poison-tolerant lock + clean 503 when the engine thread is gone,
  instead of cascading mutex-poison panics.
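
A sketch of VRAM-based pool sizing. The per-block formula (K and V for every layer), the reserve margin, and both function signatures are assumptions for illustration; the commit only states that the pool is sized from cudaMemGetInfo using a bytes_per_block helper.

```rust
/// Bytes needed for one KV block across all layers, assuming each block stores
/// `block_size` tokens of K and V for every layer and KV head.
pub fn bytes_per_block(
    block_size: u64,
    num_layers: u64,
    num_kv_heads: u64,
    head_dim: u64,
    dtype_bytes: u64,
) -> u64 {
    2 * num_layers * num_kv_heads * head_dim * block_size * dtype_bytes
}

/// Number of blocks that fit in the free VRAM after keeping `reserve_bytes`
/// back for activations and workspace (the reserve is a hypothetical knob).
pub fn gpu_block_count(free_bytes: u64, reserve_bytes: u64, block_bytes: u64) -> u64 {
    assert!(block_bytes > 0);
    free_bytes.saturating_sub(reserve_bytes) / block_bytes
}
```

For a Qwen3-8B-like shape (36 layers, 8 KV heads, head_dim 128, BF16) and an assumed block size of 16 tokens, one block costs 2,359,296 bytes, so 9 GiB of usable VRAM holds 4096 blocks.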

Verified on RTX 5090: serves at --max-seq-len 8192 (previously OOM), and a
forced 40-block pool drives 48 lossless swap-out/swap-in cycles under
concurrency with coherent output.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:59:06 +08:00
d52baa0006 model: paged KV cache with CPU swap pool, decode graph, qwen3 updates
- paged_kv_cache: new block-paged KV cache; adds a pinned-host swap pool with
  a second BlockAllocator, per-sequence Location {Gpu,Cpu}, and lossless
  swap_out/swap_in (block-granular D2H/H2D) for vLLM-style preemption.
  bytes_per_block helper exposes per-block cost for VRAM-based sizing.
- decode_graph: CUDA-graph decode path.
- qwen3/gpt2/kv_cache: paged prefill/decode forward + related updates.
- tokenizer/bins: BPE updates, new xserv-chat CLI, bench-qwen3 tweaks.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:58:54 +08:00
4c3f914459 kernels/cuda: paged-attention kernel, dispatch, pinned host memory
CUDA layer for the paged-KV + swap work:
- csrc: new paged_attention.cu plus updates across attention/gemm/norm/
  activation/embedding/reduce kernels and common.cuh.
- xserv-kernels: new dispatch module and kernel-binding updates.
- xserv-cuda: cudaMallocHost/FreeHost bindings + PinnedBuffer (host swap
  pool backing) and offset-aware D2H/H2D copies used to move KV blocks
  between the GPU pool and pinned host memory.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-28 19:58:36 +08:00
986a289616 fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090
P0 fixes (blocking usability):
- FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul)
- FIX-16: EOS token no longer leaks into API responses
- FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256)
- FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400

P1 fixes (bugs & performance):
- FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat)
- FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety
- FIX-09: tokenizer byte_fallback graceful degradation (was panic)
- FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf)
- FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm
- FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches

P2 fixes (improvements):
- FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations
- FIX-23: RoPE cache no longer artificially capped at 8192 positions

Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:13:43 +08:00
d5532ef209 phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline)
Two optimizations:

1. Tensor::empty() — skip cudaMemset for output tensors
   All kernel wrappers that fully overwrite their output now use
   Tensor::empty() instead of Tensor::zeros(). Eliminates ~756
   cudaMemset calls per decode step (21 per layer × 36 layers).
   Improvement: 46.6 → 50.3 tok/s (+8%).

2. CUDA Graph infrastructure (for future use)
   Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate,
   cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the
   forward pass due to variable kv_len, but provides foundation for
   future graph-based decode optimization.

Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode):

| Optimization | tok/s | vs HF (36 tok/s) | vs roofline (112 tok/s) |
|-------------|-------|-------|----------|
| Phase 14 baseline | 12.9 | 36% | 12% |
| + Fused kernels | 13.2 | 37% | 12% |
| + Batched decode | 13.2 (serial) | 37% | 12% |
| + Custom GEMV | 46.6 | 130% | 42% |
| + Tensor::empty | 50.3 | 140% | 45% |

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 23:57:34 +08:00
e207523e21 phase 15: custom GEMV kernel — 46.6 tok/s serial (3.5x improvement, 130% of HF)
Custom bandwidth-optimized GEMV kernel for M=1 BF16 decode, replacing
cuBLAS which achieves only ~8% bandwidth utilization for tiny M=1 GEMMs.

Kernel design (csrc/gemm/gemv.cu):
- K-split tiled: TILE_N=128, TILE_K=256, Grid=(N/128, K/256)=512 blocks
- High occupancy: 512 blocks / 170 SMs = ~3 blocks/SM
- Coalesced memory access: adjacent threads read adjacent columns of W
- Shared memory for x vector (avoids redundant global reads)
- FP32 accumulation via atomicAdd (K-split partial sums)
- Separate fp32→bf16 conversion kernel
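
The K-split tiling above can be emulated on the CPU to make the accumulation pattern concrete. This is a sketch, not the CUDA kernel: it assumes W is stored row-major as [K, N], uses f32 throughout, and replaces atomicAdd with a plain add since the loop is sequential.

```rust
pub const TILE_N: usize = 128;
pub const TILE_K: usize = 256;

/// Grid size (blocks along N, blocks along K), rounding partial tiles up.
pub fn grid_dims(n: usize, k: usize) -> (usize, usize) {
    ((n + TILE_N - 1) / TILE_N, (k + TILE_K - 1) / TILE_K)
}

/// y[col] = sum over row of x[row] * w[row * n + col], computed tile by tile.
pub fn gemv_k_split(x: &[f32], w: &[f32], k: usize, n: usize) -> Vec<f32> {
    assert_eq!(x.len(), k);
    assert_eq!(w.len(), k * n);
    let (grid_n, grid_k) = grid_dims(n, k);
    let mut acc = vec![0.0f32; n]; // FP32 accumulator, zero-initialised
    for block_k in 0..grid_k {
        for block_n in 0..grid_n {
            // One "thread block": columns [n0, n1) and rows [k0, k1).
            let (n0, n1) = (block_n * TILE_N, ((block_n + 1) * TILE_N).min(n));
            let (k0, k1) = (block_k * TILE_K, ((block_k + 1) * TILE_K).min(k));
            for col in n0..n1 {
                let mut partial = 0.0f32;
                for row in k0..k1 {
                    partial += x[row] * w[row * n + col];
                }
                acc[col] += partial; // atomicAdd in the GPU version
            }
        }
    }
    acc
}
```

A 4096×4096 weight is one shape that gives the 512-block grid quoted above: (4096/128) × (4096/256) = 32 × 16.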

Integration:
- matmul() auto-dispatches to custom GEMV when M==1 && dtype==BF16
- Batched decode (M>1) continues to use cuBLAS
- Caching allocator provides FP32 temp buffer (pooled, no per-call malloc)

Ablation results (dash5, RTX 5090, Qwen3-8B BF16):

| Config | tok/s | vs HF (36 tok/s) | vs roofline (112 tok/s) |
|--------|-------|-----------|-------------------|
| Phase 14 (cuBLAS M=1) | 13.2 | 37% | 12% |
| + Custom GEMV (M=1) | 46.6 | 130% | 42% |
| Concurrent batch=4 | 28.2 | 78% | — |

Single-request throughput now EXCEEDS HuggingFace transformers by 30%.
The custom GEMV achieves ~42% of the theoretical roofline (vs 12% before).

Note: concurrent batch=4 (28.2 tok/s) is slower than serial (46.6 tok/s)
because the per-seq attention/reshape overhead in batched decode outweighs
the cuBLAS M=4 benefit when the custom GEMV already handles M=1 efficiently.
Engine should prefer serial decode when custom GEMV is available.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 22:22:31 +08:00
876d3f5d6a phase 15: batched decode forward — 35 tok/s (97% of HF transformers)
Implement batched decode that processes multiple sequences' tokens in one
forward pass. The key insight: cuBLAS M=4 GEMM is dramatically faster
than 4× M=1 GEMV due to better TensorCore utilization and amortized
kernel launch overhead.

New method Qwen3::forward_decode_batch(&tokens, &positions, &mut caches):
- Batched embedding, norm, projections, FFN: [B, hidden] × [hidden, X]
  → one cuBLAS call per weight matrix instead of B calls
- Per-sequence attention: RoPE, KV cache, decode_attention remain per-seq
  (each has different position and KV length)
- Row extraction (row_view) and concatenation (concat_rows) for
  batched↔per-seq transitions

Engine Step 4b:
- batch_size >= 2: extracts caches via std::mem::replace, calls
  forward_decode_batch, restores caches, samples per-sequence
- batch_size == 1: falls back to per-seq forward_gpu_cache (no overhead)

Ablation results (dash5, RTX 5090, Qwen3-8B BF16):

| Scenario | Throughput | vs HF |
|----------|-----------|-------|
| Serial (batch=1) | 13.2 tok/s | 37% |
| Concurrent (batch=4) | 35.1 tok/s | 97% |
| HF transformers | 36.0 tok/s | 100% |

The 2.66x throughput improvement (13.2 → 35.1) for concurrent requests
comes from cuBLAS going from 1008 M=1 GEMVs to 252 M=4 GEMMs per step,
which cuBLAS handles ~4x more efficiently on TensorCores.

Milestone ④ target (50% of vLLM/HF throughput) achieved with 97%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 20:07:43 +08:00
9783fcf410 phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm
Three performance optimizations targeting decode throughput:

1. Decode Attention Kernel (csrc/attention/flash_attention.cu):
   - Specialized kernel for Q_len=1 (decode step)
   - 256 threads parallelize across KV sequence dimension
   - Online softmax with block-level warp-shuffle reduction
   - Replaces FA2 kernel which wasted 63/64 threads for decode
   - flash_attention() auto-dispatches when q_len==1

2. Fused SiLU×Mul (csrc/activation/activations.cu):
   - Single kernel: out = silu(gate) * up
   - Saves 1 HBM read + 1 HBM write per FFN layer (N elements)
   - Eliminates intermediate tensor allocation

3. Fused Add+RMSNorm (csrc/normalization/rmsnorm.cu):
   - Single kernel: (normed, sum) = (rmsnorm(x+residual), x+residual)
   - Saves 1 full HBM round-trip per attention block
   - Eliminates separate add + rmsnorm kernel pair
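
CPU references for the two fused element-wise kernels, useful as a correctness oracle. Function names, the `weight` and `eps` parameters, and the use of f32 instead of BF16 are assumptions of this sketch.

```rust
/// out[i] = silu(gate[i]) * up[i], with silu(x) = x / (1 + exp(-x)).
pub fn silu_mul(gate: &[f32], up: &[f32]) -> Vec<f32> {
    assert_eq!(gate.len(), up.len());
    gate.iter()
        .zip(up)
        .map(|(&g, &u)| g / (1.0 + (-g).exp()) * u)
        .collect()
}

/// Returns (normed, sum) = (rmsnorm(x + residual), x + residual) in one pass
/// over the inputs, so the residual sum is not recomputed by a second kernel.
pub fn add_rmsnorm(
    x: &[f32],
    residual: &[f32],
    weight: &[f32],
    eps: f32,
) -> (Vec<f32>, Vec<f32>) {
    assert!(!x.is_empty());
    assert_eq!(x.len(), residual.len());
    assert_eq!(x.len(), weight.len());
    let sum: Vec<f32> = x.iter().zip(residual).map(|(&a, &b)| a + b).collect();
    let mean_sq = sum.iter().map(|&v| v * v).sum::<f32>() / sum.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();
    let normed: Vec<f32> = sum
        .iter()
        .zip(weight)
        .map(|(&v, &w)| v * inv_rms * w)
        .collect();
    (normed, sum)
}
```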

Performance analysis:
- At current short sequences (max 79 tokens), these optimizations provide
  marginal benefit because the bottleneck is cuBLAS GEMV overhead:
  252 weight matrix reads (~32MB for each 4096×4096 projection, more for
  the FFN matrices), ~15.5 GB in total per decode step.
  Theoretical minimum at 1.79 TB/s = 8.7ms, actual ~78ms (9x gap).
- The fused kernels and decode attention will show larger gains at
  longer sequences where attention and element-wise ops dominate.
- Next optimization target: CUDA Graphs to eliminate kernel launch
  overhead, or custom GEMV kernels to replace cuBLAS for M=1.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 19:40:56 +08:00
d67dda404e phase 14: Flash Attention 2 for SM120 (RTX 5090)
Implement Flash Attention 2 forward kernel targeting SM120 (CC 12.0).
FA4 requires TMEM (only on data-center Blackwell SM100), so FA2 is the
correct target for consumer Blackwell GPUs like the RTX 5090.

CUDA kernel (csrc/attention/flash_attention.cu):
- Online softmax with tiled Q/K/V — O(1) extra memory, no S×S matrix
- Tile sizes: BR=BC=64, head_dim up to 128 (runtime parameter)
- BF16 input, FP32 accumulation, BF16 output
- Native GQA: kv_head = q_head / (num_q_heads / num_kv_heads)
- Causal mask with tile-level skip optimization
- Shared memory: 32 KB (Q_tile 16KB + KV_tile 16KB, fits in 48KB default)
- Grid: (q_tiles, batch × num_q_heads), Block: 128 threads
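
Two of the ideas in this list, the GQA head mapping and online softmax over K/V tiles, can be sketched on the CPU for a single query row. This is a reference under assumptions (f32 math, no causal mask, hypothetical function names), not a port of the kernel.

```rust
/// GQA mapping from the commit: kv_head = q_head / (num_q_heads / num_kv_heads).
pub fn kv_head_for(q_head: usize, num_q_heads: usize, num_kv_heads: usize) -> usize {
    assert!(num_kv_heads > 0 && num_q_heads % num_kv_heads == 0);
    q_head / (num_q_heads / num_kv_heads)
}

fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Attention output for one query, streaming over K/V in tiles of `tile` rows.
/// Only a running max, a running denominator and one output row are kept,
/// so no full score row or S×S matrix is materialised.
pub fn attend_online(q: &[f32], keys: &[Vec<f32>], values: &[Vec<f32>], tile: usize) -> Vec<f32> {
    assert!(tile > 0 && !keys.is_empty() && keys.len() == values.len());
    let scale = 1.0 / (q.len() as f32).sqrt();
    let mut running_max = f32::NEG_INFINITY;
    let mut denom = 0.0f32;
    let mut acc = vec![0.0f32; values[0].len()];
    for (k_tile, v_tile) in keys.chunks(tile).zip(values.chunks(tile)) {
        let scores: Vec<f32> = k_tile.iter().map(|k| scale * dot(q, k)).collect();
        let tile_max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let new_max = running_max.max(tile_max);
        // Rescale what was accumulated under the old max (0.0 on the first tile).
        let correction = (running_max - new_max).exp();
        denom *= correction;
        for a in acc.iter_mut() {
            *a *= correction;
        }
        for (&s, v) in scores.iter().zip(v_tile) {
            let p = (s - new_max).exp();
            denom += p;
            for (a, &x) in acc.iter_mut().zip(v) {
                *a += p * x;
            }
        }
        running_max = new_max;
    }
    acc.iter().map(|&a| a / denom).collect()
}
```

The result should not depend on the tile size beyond floating-point rounding, which makes tile size 1 versus the full length a convenient self-check.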

Integration:
- flash_attention() Rust wrapper in xserv-kernels with shape/dtype validation
- Qwen3 forward_gpu_cache uses flash_attention directly (no repeat_kv_gpu)
- Eliminates repeat_kv memory allocation + copy per layer per step
- Naive attention() preserved for testing/comparison

Validated on dash5 (RTX 5090, CUDA 12.9):
- Correctness: 9/10 top-1 match vs HF (identical to pre-FA baseline)
- Throughput: 12.9 tok/s (up from 10.3, +25% improvement)
- Now at 35% of HF transformers baseline (up from 30%)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 18:27:39 +08:00
ee68d3565d fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul
Strict code review identified 30+ issues across correctness, performance,
and architecture. This commit addresses 14 of them with verified fixes,
restructures Phase 12 for honest continuous batching, and updates Phase 14
to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).

Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)

Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams

Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)

Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO
  for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)

Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:53:28 +08:00
d8493bd70f phase 12: implement real continuous batching scheduler
Rewrote engine.rs from scratch:
- Scheduler loop: admit → prefill → decode → finish → check new requests
- Multiple sequences run concurrently (max_batch_size configurable)
- Each sequence has independent GpuKVCache
- Non-blocking try_recv() for new requests during decode iterations
- Dynamic join: new requests enter batch immediately, don't wait for others

Verified with concurrent test (tools/test_concurrent.py):
- 3 concurrent requests: wall_time=3.8s, concurrency_ratio=2.82x ✓
- 5 concurrent requests: wall_time=6.1s, concurrency_ratio=4.04x ✓
- All outputs are coherent and correct

Design doc (docs/12-continuous-batching.md) fully rewritten with:
- Detailed scheduler loop pseudocode
- Data structures (Sequence, Scheduler)
- Acceptance criteria with specific test cases
- Clear separation from Phase 13 (HTTP layer)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 13:44:26 +08:00
da043554ba phase 12+13: HTTP API server with OpenAI-compatible endpoint (Milestone ③)
New crate: xserv-server
- Engine thread: loads Qwen3-8B, processes requests sequentially
- axum HTTP server: /health, /v1/models, /v1/chat/completions
- tokio::sync::mpsc channel between API and engine threads
- Non-streaming JSON response (streaming SSE to be added later)

API is OpenAI-compatible:
  POST /v1/chat/completions {"messages": [...], "max_tokens": N}
  → {"choices": [{"message": {"content": "..."}}]}

Verified: "Hi" → ", I'm" (3 tokens), model runs correctly via HTTP.

Key learnings:
- std::sync::mpsc::SyncSender is Send but NOT Sync → wrap in Mutex for Arc<AppState>
- MutexGuard must not live across await points (scope carefully)
- axum 0.8 Extension<Arc<T>> requires T: Send + Sync

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 12:55:19 +08:00
2be27d6d94 perf: GPU transpose/reshape/repeat_kv kernels (eliminate CPU round-trips)
New CUDA kernels (csrc/embedding/transpose.cu):
- reshape_heads_bf16: [S, H*D] → [1, H, S, D]
- merge_heads_bf16: [1, H, S, D] → [S, H*D]
- transpose_hsd_to_shd_bf16: [1, H, S, D] → [S, H, D] (for RoPE)
- transpose_shd_to_hsd_bf16: [S, H, D] → [1, H, S, D] (from RoPE)
- repeat_kv_bf16: [1, KV_H, S, D] → [1, KV_H*n_rep, S, D]

Rust wrappers (xserv-kernels/src/transpose.rs):
- reshape_heads_gpu, merge_heads_gpu, transpose_for/from_rope_gpu, repeat_kv_gpu

Qwen3 forward_gpu_cache now uses all GPU kernels — zero CPU data round-trips.

Result: 50/50 self-consistent, 3-5% faster (TBT 142→137ms)
Remaining bottleneck: ~900 device::synchronize() calls + 252 cuBLAS handle
creations per token (Phase 15 targets)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 12:01:07 +08:00
2d48f25e66 phase 11: GPU-resident KV cache
- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
  (not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor

Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck —
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)

TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 11:50:12 +08:00
be5c64ea8a phase 10: GPU add/mul kernels + BF16 precision analysis
Kernel additions:
- add_f32/bf16, mul_f32/bf16 CUDA kernels (element-wise, on GPU)
- Refactored activation.rs with dispatch_unary/dispatch_binary helpers
- Qwen3 and GPT-2 now use GPU add/mul instead of CPU round-trips

GPT-2 add_bias also moved to GPU (broadcast via tile + GPU add)

BF16 precision analysis (docs/benchmarks/phase10-qwen3.md):
- Root cause: separate attention kernels materialize BF16 intermediates
  (QK^T→BF16→scale→BF16→mask→BF16→softmax→BF16 vs HF's fused FP32 path)
- HF's own SDPA and Eager paths also differ by ~0.125 logit
- xserv vs HF: ~1-2 logit systematic offset, but same top-1 in 84% of cases
- Industry standard for BF16: top-5 overlap (we achieve 100%)
- Fix path: Flash Attention (Phase 14) to fuse attention in FP32

Performance: TTFT 138→119ms, TBT 144→137ms (GPU ops faster than CPU)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 11:35:26 +08:00
268e40d764 phase 10: add Qwen3-8B benchmark + performance fix
Benchmark infrastructure:
- bench-qwen3 binary: 50 prompts × 20 tokens with KV cache
- bench_compare_qwen3.py: comparison against HF transformers (BF16)

Performance fix:
- Precompute transposed weights at model load time (eliminated per-token
  weight transpose CPU round-trip: was 252 transposes × 32MB each = 8GB/token)
- Result: from "infinite" (>10 min/token) to 144ms/token

Results (50 prompts):
- Prefill top-1: 42/50 (84%), top-5: 50/50 (100%) vs HF transformers
- Greedy sequence: 0/50 exact match (BF16 precision drift over 36 layers)
- Performance: TTFT=138ms, TBT=144ms, 6.9 tok/s (HF: 21ms, 45.6 tok/s)
- All outputs are coherent English/Chinese

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 10:25:33 +08:00
246ae1c590 phase 10: Qwen3-8B support (Milestone ②)
Qwen3 model (qwen3.rs):
- RMSNorm + QK normalization (per-head q_norm/k_norm)
- GQA: 32 Q heads, 8 KV heads, repeat_kv for attention
- SwiGLU FFN: gate_proj → SiLU → * up_proj → down_proj
- RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout
- BF16 forward pass, [out,in] weight layout via linear_t
- No attention bias (attention_bias=false)

Tokenizer fixes:
- Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table
- MergeEntry supports both string and array formats
- Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese)

KVCache refactored:
- Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16
- append_kv_tensor/get_kv_tensors use Tensor directly

CLI updated:
- Auto-detects model type from config.json (gpt2 vs qwen3)
- Supports both GPT-2 (F32) and Qwen3 (BF16)

Verified: Qwen3-8B generates coherent English and Chinese on single RTX 5090.
61/61 tests pass, with no GPT-2 performance regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 00:46:37 +08:00
64084d3489 phase 9: KV cache + autoregressive generation
- KVCache: per-layer, per-head storage with append + reconstruct
- forward_with_cache: prefill (full prompt) + decode (single token) modes
- Fixed data layout bug: per-head vectors avoid cross-head interleaving
- CLI updated to use KV cache by default
- bench-gpt2 supports --no-cache flag for comparison

Benchmark results (50 prompts × 20 tokens):
- KV cache vs no-cache: 50/50 bit-identical (cache is correct)
- 18x speedup: TTFT 400→24ms, TBT 407→22ms, throughput 2.5→44 tok/s
- vs HF transformers: 40/50 match (10 are FP divergence, avg logit gap 0.20)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 23:39:41 +08:00
cb12250ef0 phase 8: add benchmark framework + baseline results
- bench-gpt2 binary: runs 50 prompts, measures TTFT/TBT per prompt, outputs JSON
- bench_compare.py: compares xserv vs transformers token-by-token + timing
- Baseline results: 50/50 correctness, 400ms TTFT / 407ms TBT (100x slower than PyTorch)
- Bottlenecks documented: no KV cache, CPU round-trips, cuBLAS handle churn

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 23:29:41 +08:00
e1e75fc7f6 phase 6+7+8: model loading, BPE tokenizer, GPT-2 inference (Milestone ①)
Phase 6 — Model Loading (xserv-model):
- safetensors parser with single/sharded file support
- ModelConfig with dual naming (GPT-2 n_embd/n_head + modern HF naming)
- Weight loading flow: safetensors → mmap → CPU Tensor → GPU

Phase 7 — BPE Tokenizer (xserv-tokenizer):
- Full BPE encode/decode from tokenizer.json
- GPT-2 byte-to-unicode mapping (printable ASCII identity + shifted bytes)
- Pre-tokenization regex, special token handling
- Chat template support structure
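
The byte-to-unicode mapping mentioned above follows the published GPT-2 scheme: three printable byte ranges map to themselves, and the remaining bytes are shifted to code points starting at 256. A sketch with hypothetical function names:

```rust
use std::collections::HashMap;

/// GPT-2 byte-to-unicode table: identity for the printable ranges
/// 33..=126, 161..=172 and 174..=255; every other byte b gets 256 + n,
/// where n counts the non-printable bytes seen so far.
pub fn byte_to_unicode() -> [char; 256] {
    let printable = |b: u32| {
        (33..=126).contains(&b) || (161..=172).contains(&b) || (174..=255).contains(&b)
    };
    let mut table = ['\0'; 256];
    let mut shifted = 0u32;
    for b in 0u32..256 {
        let code_point = if printable(b) {
            b
        } else {
            let cp = 256 + shifted;
            shifted += 1;
            cp
        };
        table[b as usize] =
            char::from_u32(code_point).expect("code points up to 0x143 are valid chars");
    }
    table
}

/// Inverse lookup used when decoding tokens back to raw bytes.
pub fn unicode_to_byte() -> HashMap<char, u8> {
    byte_to_unicode()
        .iter()
        .enumerate()
        .map(|(b, &c)| (c, b as u8))
        .collect()
}
```

Under this table a space (byte 32) becomes U+0120 and a newline (byte 10) becomes U+010A, which is why those characters show up in GPT-2 style vocabularies.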

Phase 8 — GPT-2 Complete Inference:
- GPT-2 model definition: wte, wpe, 12 transformer blocks, ln_f
- Forward pass: embedding → (LayerNorm → MHA → residual → LayerNorm → MLP → residual) × 12 → LN → logits
- QKV split with correct [batch, heads, seq, dim] layout (fixed reshape bug)
- Greedy sampling from last-position logits
- Interactive CLI: xserv-cli <model-dir> [--max-tokens N]
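
Illustrative sketch of greedy sampling from last-position logits, assuming the logits arrive as a flat row-major [seq_len, vocab_size] slice on the CPU. The function name greedy_next_token is hypothetical. Ties resolve to the lowest token id.

```rust
/// Return the argmax token id over the final position's logits.
/// Returns None if the slice is not a whole number of vocab-sized rows.
pub fn greedy_next_token(logits: &[f32], vocab_size: usize) -> Option<usize> {
    if vocab_size == 0 || logits.is_empty() || logits.len() % vocab_size != 0 {
        return None;
    }
    // Only the last row matters for next-token prediction.
    let last = &logits[logits.len() - vocab_size..];
    let mut best = 0;
    for (i, &x) in last.iter().enumerate() {
        if x > last[best] {
            best = i;
        }
    }
    Some(best)
}
```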

Verified: GPT-2 124M generates coherent English text on RTX 5090.
"The future of AI is uncertain. The future of AI is uncertain..."
"Once upon a time, the world was a place of great beauty..."

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 22:04:00 +08:00
6035ffdc0b phase 5: naive multi-head attention
- Batched GEMM via cublasGemmStridedBatchedEx
- Causal mask CUDA kernel (F32 + BF16)
- Element-wise scale CUDA kernel (F32 + BF16)
- attention() composing: batched_matmul + scale + causal_mask + softmax
- Fixed infinite recursion between to_device and contiguous (GPU contiguous now goes through a CPU round-trip)
- 5 attention tests passing (max_err < 3e-7 F32)
- Total: 61 tests passing across all crates
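
Illustrative CPU reference for the attention() composition above (matmul, scale, causal mask, softmax, matmul) for a single head. It is a sketch under assumptions, not the CUDA path: inputs are row-major [seq, dim] slices, the scale is 1/sqrt(dim), and the causal mask is applied by only visiting keys j <= i, which is equivalent to setting masked scores to negative infinity before the softmax. The name causal_attention is hypothetical.

```rust
/// Single-head causal attention on the CPU.
/// q, k, v: row-major [seq, dim]. Returns row-major [seq, dim].
pub fn causal_attention(q: &[f32], k: &[f32], v: &[f32], seq: usize, dim: usize) -> Vec<f32> {
    assert_eq!(q.len(), seq * dim);
    assert_eq!(k.len(), seq * dim);
    assert_eq!(v.len(), seq * dim);
    let scale = 1.0 / (dim as f32).sqrt();
    let mut out = vec![0.0f32; seq * dim];
    let mut scores = vec![0.0f32; seq];
    for i in 0..seq {
        // Scaled dot products against keys 0..=i (causal mask).
        let mut max = f32::NEG_INFINITY;
        for j in 0..=i {
            let dot: f32 = (0..dim).map(|d| q[i * dim + d] * k[j * dim + d]).sum();
            scores[j] = dot * scale;
            max = max.max(scores[j]);
        }
        // Numerically safe softmax: subtract the row max before exp.
        let mut denom = 0.0f32;
        for s in scores[..=i].iter_mut() {
            *s = (*s - max).exp();
            denom += *s;
        }
        // Weighted sum of value rows.
        for j in 0..=i {
            let w = scores[j] / denom;
            for d in 0..dim {
                out[i * dim + d] += w * v[j * dim + d];
            }
        }
    }
    out
}
```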

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:17:23 +08:00
c8e8153702 phase 4: transformer core kernels
CUDA kernels (csrc/):
- common.cuh: shared warp_reduce_sum/max, block_reduce_sum/max
- normalization/rmsnorm.cu: RMSNorm (F32 + BF16)
- normalization/layernorm.cu: LayerNorm with Welford (F32 + BF16)
- activation/activations.cu: GELU tanh-approx + SiLU (F32 + BF16)
- reduce/softmax.cu: safe softmax, 3-pass (F32 + BF16)
- embedding/embedding.cu: gather lookup (F32 + BF16)
- embedding/rope.cu: RoPE in-place + precomputed cos/sin cache (F32 + BF16)
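
Illustrative CPU reference for the Welford-based LayerNorm listed above, as a sketch of the math only. The CUDA kernel's reduction structure is not shown, and the function names are hypothetical. Welford's update computes mean and variance in one pass without the cancellation risk of the sum-of-squares formula.

```rust
/// One-pass mean and population variance using Welford's algorithm.
pub fn welford_mean_var(xs: &[f32]) -> (f32, f32) {
    let mut mean = 0.0f32;
    let mut m2 = 0.0f32; // running sum of squared deviations from the mean
    for (i, &x) in xs.iter().enumerate() {
        let delta = x - mean;
        mean += delta / (i as f32 + 1.0);
        m2 += delta * (x - mean);
    }
    let var = if xs.is_empty() { 0.0 } else { m2 / xs.len() as f32 };
    (mean, var)
}

/// LayerNorm over one row: (x - mean) / sqrt(var + eps) * gamma + beta.
pub fn layernorm(xs: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(xs.len(), gamma.len());
    assert_eq!(xs.len(), beta.len());
    let (mean, var) = welford_mean_var(xs);
    let inv_std = 1.0 / (var + eps).sqrt();
    xs.iter()
        .zip(gamma)
        .zip(beta)
        .map(|((&x, &g), &b)| (x - mean) * inv_std * g + b)
        .collect()
}
```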

Rust wrappers (xserv-kernels/src/):
- rmsnorm.rs, layernorm.rs, activation.rs, softmax.rs, embedding.rs, rope.rs
- RopeCache struct with GPU-side precomputation

Tests: 12 new tests (ops_test.rs), all passing with good precision:
- F32: max_err between 1e-9 and 1e-6
- BF16: max_err between 2e-3 and 7e-3
Total: 29 kernel tests + 27 prior = 56 tests passing

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:07:24 +08:00
d77f921a12 phase 3: GEMM kernels (naive, tiled, cuBLAS)
- Naive GEMM kernel: one thread per output element (F32 + BF16)
- Tiled GEMM kernel: 32x32 shared memory tiles (F32 + BF16)
- cuBLAS wrapper: cublasGemmEx with row-major trick
- GemmBackend enum for runtime backend selection
- CublasContext RAII handle
- Made error::check public for cross-crate use
- 17 GEMM tests: small/medium/rect sizes, all backends, F32+BF16
- Cross-backend consistency verified (naive vs tiled vs cuBLAS)
- All 44 tests pass across all crates
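
Illustrative sketch of the row-major trick mentioned above. cuBLAS assumes column-major storage, and a row-major M×N matrix is bit-for-bit a column-major N×M matrix (its transpose). So row-major C = A·B can be obtained by asking a column-major GEMM for Cᵀ = Bᵀ·Aᵀ, which means swapping the operands and swapping m with n, with no data movement. The CPU function gemm_col_major below is a hypothetical stand-in for the cuBLAS call, not its actual signature.

```rust
/// Column-major reference GEMM without transposes: C(m x n) = A(m x k) * B(k x n).
/// lda/ldb/ldc are leading dimensions (the stride between columns).
#[allow(clippy::too_many_arguments)]
fn gemm_col_major(
    m: usize, n: usize, k: usize,
    a: &[f32], lda: usize,
    b: &[f32], ldb: usize,
    c: &mut [f32], ldc: usize,
) {
    for j in 0..n {
        for i in 0..m {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i + p * lda] * b[p + j * ldb];
            }
            c[i + j * ldc] = acc;
        }
    }
}

/// Row-major C(m x n) = A(m x k) * B(k x n) using only the column-major GEMM.
pub fn matmul_row_major(a: &[f32], b: &[f32], m: usize, n: usize, k: usize) -> Vec<f32> {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    let mut c = vec![0.0f32; m * n];
    // Pass B first, then A, and swap m with n. The leading dimensions are
    // the row lengths of the row-major matrices: n for B, k for A, n for C.
    gemm_col_major(n, m, k, b, n, a, k, &mut c, n);
    c
}
```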

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:48:05 +08:00
a83971fa25 phase 2: tensor abstraction layer
- DType enum (F32, F16, BF16) with TensorDType trait
- Shape utilities: contiguous_strides, broadcast_shape, broadcast_strides
- Storage with Arc reference counting (CPU Vec<u8> or GPU GpuBuffer)
- Device enum (Cpu, Cuda(id)) with to_device transfer
- Tensor type with strided layout: reshape, transpose, squeeze, unsqueeze
- contiguous() copies non-contiguous views to contiguous layout
- from_slice, zeros, ones constructors
- as_slice<T> for typed CPU read access, data_ptr for GPU kernel launch
- CPU↔GPU roundtrip verified
- All 27 tests pass (12 cuda + 4 shape + 11 tensor)
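
Illustrative sketch of two of the shape utilities named above, assuming row-major layout with element (not byte) strides and NumPy-style broadcasting rules. The function names match the bullet list, but the signatures are assumptions.

```rust
/// Row-major strides in elements: the last dimension has stride 1.
pub fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![0; shape.len()];
    let mut acc = 1;
    // Walk dimensions from innermost to outermost.
    for (s, &dim) in strides.iter_mut().zip(shape).rev() {
        *s = acc;
        acc *= dim;
    }
    strides
}

/// Broadcast two shapes, aligning from the right.
/// A dimension of 1 stretches to match the other; anything else must be equal.
pub fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = vec![0; rank];
    for i in 0..rank {
        // Missing leading dimensions count as 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[rank - 1 - i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        };
    }
    Some(out)
}
```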

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:45:22 +08:00
c8f7bc0c3c phase 0+1: fix Rust 2024 edition compat + memory query
- unsafe extern "C" blocks (Rust 2024 requirement)
- unsafe blocks inside unsafe fn bodies
- Use cudaMemGetInfo for accurate GPU memory reporting
- Remove the cc crate's "cuda" feature (no such feature exists; CUDA support is built in)
- All 12 tests pass on RTX 5090 (CC 12.0, 170 SMs, 32GB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:40:49 +08:00
9806b4db35 phase 0+1: project scaffold + xserv-cuda crate
- Cargo workspace with xserv-cuda crate
- CUDA FFI bindings (cudart: memory, stream, device, error)
- GpuBuffer RAII wrapper with H2D/D2H/D2D copy
- CudaStream wrapper with RAII Drop
- CachingAllocator with size-bucketed free lists
- PinnedBuffer for page-locked host memory
- Device info query via cudaDeviceGetAttribute
- Vector-add CUDA kernel smoke test
- Integration test suite (11 tests)
- build.rs: cc crate compiles .cu for SM 12.0
- sync-and-build.sh for remote build on dash5
- Roadmap doc (docs/00-roadmap.md) and Phase 0+1 design doc
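
Illustrative host-side sketch of a caching allocator with size-bucketed free lists, not the xserv-cuda implementation. It assumes requests are rounded up to a power of two with a 256-byte minimum, and it uses a Vec<u8> as a stand-in for a device allocation. Block, release and fresh_allocs are hypothetical names. Freed blocks go back to their bucket's free list instead of to the driver, so a later request of similar size avoids a fresh allocation.

```rust
use std::collections::HashMap;

/// Stand-in for a device buffer; real code would hold a GPU pointer.
pub struct Block {
    pub bucket: usize,
    pub data: Vec<u8>,
}

#[derive(Default)]
pub struct CachingAllocator {
    free_lists: HashMap<usize, Vec<Block>>, // bucket size -> cached blocks
    pub fresh_allocs: usize,                // how often the cache missed
}

impl CachingAllocator {
    /// Round a request up to its bucket size (assumed policy).
    fn bucket_for(size: usize) -> usize {
        size.max(256).next_power_of_two()
    }

    /// Reuse a cached block of the right bucket, or allocate a new one.
    pub fn alloc(&mut self, size: usize) -> Block {
        let bucket = Self::bucket_for(size);
        if let Some(block) = self.free_lists.get_mut(&bucket).and_then(Vec::pop) {
            return block;
        }
        self.fresh_allocs += 1;
        Block { bucket, data: vec![0u8; bucket] }
    }

    /// Return a block to its bucket's free list for later reuse.
    pub fn release(&mut self, block: Block) {
        self.free_lists.entry(block.bucket).or_default().push(block);
    }
}
```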

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 18:40:22 +08:00