531cd3fe08
style: format Rust workspace
2026-06-18 18:11:58 +08:00
9f1fbbb98b
quantization: add FP8 E4M3 W8A16 for gpt-oss MoE expert weights
Store expert gate_up_proj and down_proj weights in FP8 E4M3 (1 byte/elem)
with per-expert FP32 scale factors. At inference, a fused CUDA kernel
dequantizes to BF16 before the existing cuBLAS batched GEMM.
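For reference, the dequantization step can be sketched on the CPU as below. This is a minimal illustration, not the contents of dequant_fp8.cu: the function names, the `[num_experts, elems_per_expert]` layout, and the FP32 output (the kernel emits BF16) are assumptions, as is the "FN" flavour of E4M3 (bias 7, no infinities, S.1111.111 = NaN).

```rust
/// Decode one FP8 E4M3 byte: 1 sign bit, 4 exponent bits, 3 mantissa bits, bias 7.
/// Assumes the "FN" variant: no infinities, and S.1111.111 encodes NaN.
fn fp8_e4m3_to_f32(byte: u8) -> f32 {
    let sign = if byte & 0x80 != 0 { -1.0f32 } else { 1.0f32 };
    let exp = ((byte >> 3) & 0x0F) as i32;
    let mant_bits = byte & 0x07;
    if exp == 0x0F && mant_bits == 0x07 {
        return f32::NAN;
    }
    let mant = mant_bits as f32;
    let magnitude = if exp == 0 {
        // Subnormal: no implicit leading 1, fixed exponent of 2^-6.
        (mant / 8.0) * 2f32.powi(-6)
    } else {
        (1.0 + mant / 8.0) * 2f32.powi(exp - 7)
    };
    sign * magnitude
}

/// Dequantize weights stored as [num_experts, elems_per_expert] FP8 bytes,
/// multiplying each expert's values by that expert's FP32 scale.
/// (Hypothetical layout and signature, for illustration only.)
fn dequant_per_expert(weights: &[u8], scales: &[f32], elems_per_expert: usize) -> Vec<f32> {
    assert!(elems_per_expert > 0);
    assert_eq!(weights.len(), scales.len() * elems_per_expert);
    weights
        .chunks(elems_per_expert)
        .zip(scales)
        .flat_map(|(expert, &scale)| expert.iter().map(move |&b| fp8_e4m3_to_f32(b) * scale))
        .collect()
}
```

With one scale per expert, the scale multiply is a single broadcast per expert slice, which is what makes fusing it into the dequant kernel cheap.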
Results on gpt-oss-20b (50-problem GSM8K subset):
- FP8 TP=1: 47/50 = 94.0% (single RTX 5090, ~25 GB VRAM)
- BF16 TP=2: 47/50 = 94.0% (requires 2× RTX 5090, ~39 GB total)
No measurable accuracy degradation. Model size: 41.8 GB → 22.7 GB (−46%).
New files:
- tools/quantize_fp8.py: offline BF16→FP8 conversion script
- csrc/quantization/dequant_fp8.cu: per-expert-scale dequant kernel
- crates/xserv-kernels/src/quantization.rs: Rust FFI wrapper
- tools/eval_gsm8k_batch.sh: GSM8K accuracy evaluation harness
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-06-07 19:33:07 +08:00
Gahow Wang
1ab6ca9c09
tensor: add narrow() view and relax is_contiguous for size-1 dims
narrow(dim, start, len) creates a zero-copy slice along any dimension.
is_contiguous() now ignores stride mismatches on dimensions of size 1,
since those dimensions are never stepped. This avoids unnecessary GPU
strided copies when slicing fused projection outputs at batch=1.
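The two changes can be sketched on a bare shape/stride view. The `View` struct below is a hypothetical stand-in for the real Tensor type, with strides and offset counted in elements; only the `narrow` and `is_contiguous` names come from this commit.

```rust
/// Hypothetical strided view: shape, strides (in elements), and a start offset.
#[derive(Debug, Clone, PartialEq)]
struct View {
    shape: Vec<usize>,
    strides: Vec<usize>,
    offset: usize,
}

impl View {
    /// Row-major view over a freshly allocated buffer.
    fn contiguous(shape: &[usize]) -> Self {
        let mut strides = vec![1; shape.len()];
        for i in (0..shape.len().saturating_sub(1)).rev() {
            strides[i] = strides[i + 1] * shape[i + 1];
        }
        View { shape: shape.to_vec(), strides, offset: 0 }
    }

    /// Zero-copy slice: only the offset and one extent change, strides are kept.
    fn narrow(&self, dim: usize, start: usize, len: usize) -> Self {
        assert!(start + len <= self.shape[dim], "narrow out of bounds");
        let mut v = self.clone();
        v.offset += start * self.strides[dim];
        v.shape[dim] = len;
        v
    }

    /// Relaxed check: a dimension of size 1 is never stepped, so its stride is ignored.
    fn is_contiguous(&self) -> bool {
        let mut expected = 1;
        for (&size, &stride) in self.shape.iter().zip(&self.strides).rev() {
            if size == 1 {
                continue;
            }
            if stride != expected {
                return false;
            }
            expected *= size;
        }
        true
    }
}
```

For a fused projection output of shape [1, 12], `narrow(1, 4, 4)` yields shape [1, 4] with strides [12, 1]. A strict check would reject the leading stride of 12; the relaxed check accepts it, so no copy is needed at batch=1. At batch=2 the same slice is genuinely strided and still reports non-contiguous.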
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-30 12:49:57 +08:00
986a289616
fix: 12 bug fixes from comprehensive review — 51 tok/s verified on RTX 5090
P0 fixes (blocking usability):
- FIX-01: thread-local cuBLAS handle (was creating/destroying per matmul)
- FIX-16: EOS token no longer leaks into API responses
- FIX-17: max_seq_len configurable via --max-seq-len (default 2048, was hardcoded 256)
- FIX-18: max_tokens clamped to available seq space, prompt overflow returns 400
P1 fixes (bugs & performance):
- FIX-07: CachingAllocator wired into all hot paths (to_device, embedding, rope, concat)
- FIX-08: CudaDeviceProp buffer increased to 32KB for CUDA 12.9 safety
- FIX-09: tokenizer byte_fallback graceful degradation (was panic)
- FIX-19: causal mask uses -INFINITY instead of -1e9 (BF16 supports inf)
- FIX-20: LayerNorm rewritten to numerically stable two-pass algorithm
- FIX-21: min block size guard (32 threads) for LayerNorm/RMSNorm launches
P2 fixes (improvements):
- FIX-22: Option<GpuKVCache> + take() eliminates dummy KV cache allocations
- FIX-23: RoPE cache no longer artificially capped at 8192 positions
Verified on dash5 (RTX 5090): 51 tok/s batch=1, 74 tok/s 2-concurrent, 1.7-3.3x HF transformers.
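As an illustration of FIX-20, a two-pass LayerNorm computes the mean first and then the variance of the centered values, which avoids the cancellation that a single-pass E[x²] − mean² form can suffer. The sketch below is a CPU reference over one row in FP32; the signature is an assumption and does not mirror the CUDA kernel.

```rust
/// Two-pass LayerNorm over a single row (CPU reference, hypothetical signature).
fn layer_norm(x: &[f32], gamma: &[f32], beta: &[f32], eps: f32) -> Vec<f32> {
    assert!(!x.is_empty());
    assert_eq!(x.len(), gamma.len());
    assert_eq!(x.len(), beta.len());
    let n = x.len() as f32;

    // Pass 1: mean.
    let mean = x.iter().sum::<f32>() / n;

    // Pass 2: variance of the centered values (no large-number subtraction).
    let var = x.iter().map(|&v| (v - mean) * (v - mean)).sum::<f32>() / n;
    let inv_std = 1.0 / (var + eps).sqrt();

    x.iter()
        .zip(gamma)
        .zip(beta)
        .map(|((&v, &g), &b)| (v - mean) * inv_std * g + b)
        .collect()
}
```

Shifting every input by a large constant should leave the output essentially unchanged, which is a quick sanity check for the stability claim.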
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-23 14:13:43 +08:00
d5532ef209
phase 15: Tensor::empty + CUDA Graph infra — 50.3 tok/s (140% of HF, 45% roofline)
Two optimizations:
1. Tensor::empty(): skip cudaMemset for output tensors
   All kernel wrappers that fully overwrite their output now use
   Tensor::empty() instead of Tensor::zeros(). Eliminates ~756
   cudaMemset calls per decode step (21 per layer × 36 layers).
   Improvement: 46.6 → 50.3 tok/s (+8%).
2. CUDA Graph infrastructure (for future use)
   Added FFI bindings (cudaStreamBeginCapture, cudaGraphInstantiate,
   cudaGraphLaunch) and RAII CudaGraph wrapper. Not yet used in the
   forward pass due to variable kv_len, but provides foundation for
   future graph-based decode optimization.
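The Tensor::empty() saving can be modelled with a toy allocator that counts memset calls. Everything here is a hypothetical host-side stand-in (`Device`, `Tensor`, `scale_into`); the real code allocates GPU memory, and only the 21 × 36 figures come from this commit.

```rust
/// Hypothetical stand-in for a device allocator that counts memset calls.
#[derive(Default)]
struct Device {
    memsets: usize,
}

/// Hypothetical stand-in for an output tensor.
struct Tensor {
    data: Vec<f32>,
}

impl Device {
    /// Allocate and zero-fill: models cudaMalloc followed by cudaMemset.
    fn zeros(&mut self, len: usize) -> Tensor {
        self.memsets += 1;
        Tensor { data: vec![0.0; len] }
    }

    /// Allocate only: contents are unspecified until a kernel writes them.
    fn empty(&mut self, len: usize) -> Tensor {
        Tensor { data: Vec::with_capacity(len) }
    }
}

/// A "kernel" that fully overwrites its output, so prior contents never matter.
fn scale_into(out: &mut Tensor, input: &[f32], factor: f32) {
    out.data.clear();
    out.data.extend(input.iter().map(|&v| v * factor));
}

/// Count memsets for one decode step: 21 output tensors per layer, 36 layers.
fn memsets_per_decode_step(use_empty: bool) -> usize {
    const LAYERS: usize = 36;
    const OUTPUTS_PER_LAYER: usize = 21;
    let mut dev = Device::default();
    let input = [1.0f32, 2.0, 3.0];
    for _ in 0..LAYERS * OUTPUTS_PER_LAYER {
        let mut out = if use_empty { dev.empty(3) } else { dev.zeros(3) };
        scale_into(&mut out, &input, 2.0);
        debug_assert_eq!(out.data, [2.0, 4.0, 6.0]);
    }
    dev.memsets
}
```

The results are identical either way; only the number of redundant fills differs, which is why the change is safe for wrappers that overwrite every output element.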
Ablation (dash5, RTX 5090, Qwen3-8B BF16, serial decode):
| Optimization | tok/s | vs HF | Roofline |
|-------------|-------|-------|----------|
| Phase 14 baseline | 12.9 | 36% | 12% |
| + Fused kernels | 13.2 | 37% | 12% |
| + Batched decode | 13.2 (serial) | 37% | 12% |
| + Custom GEMV | 46.6 | 130% | 42% |
| + Tensor::empty | 50.3 | 140% | 45% |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 23:57:34 +08:00
ee68d3565d
fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul
Strict code review identified 30+ issues across correctness, performance,
and architecture. This commit addresses 14 of them with verified fixes,
restructures Phase 12 for honest continuous batching, and updates Phase 14
to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).
Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)
Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams
Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)
Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO
for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)
Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline
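For FIX-09, the `<0xNN>` byte-fallback mapping can be sketched as follows. The helper names are hypothetical, and the uppercase two-digit hex spelling is an assumption based on the common SentencePiece-style convention rather than on this tokenizer's vocabulary.

```rust
/// Encode text that has no vocabulary entry as one `<0xNN>` token per UTF-8 byte.
/// (Hypothetical helper; uppercase hex is an assumed convention.)
fn byte_fallback_tokens(text: &str) -> Vec<String> {
    text.bytes().map(|b| format!("<0x{:02X}>", b)).collect()
}

/// Parse a `<0xNN>` token back to its byte, or None if the token has another form.
fn parse_byte_token(token: &str) -> Option<u8> {
    let hex = token.strip_prefix("<0x")?.strip_suffix('>')?;
    if hex.len() != 2 || !hex.bytes().all(|c| c.is_ascii_hexdigit()) {
        return None;
    }
    u8::from_str_radix(hex, 16).ok()
}
```

Returning `None` for anything that is not a well-formed byte token lets the caller degrade gracefully instead of panicking.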
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 17:53:28 +08:00
2d48f25e66
phase 11: GPU-resident KV cache
- GpuKVCache: pre-allocated GPU buffers, D2D copy append at offset
- Per-head strided layout [num_kv_heads, max_seq_len, head_dim]
- Fixed critical bug: seq_len must advance AFTER all layers write
(not inside the loop per-layer)
- GpuBuffer::copy_from_device_at for offset-based D2D copy
- Tensor::from_storage constructor for wrapping raw GPU buffers
- Exported Storage and Dims from xserv-tensor
Correctness: GPU KV cache vs CPU KV cache = 50/50 bit-identical
Performance: ~neutral (KV cache was never the main bottleneck —
reshape/merge/transpose CPU round-trips dominate for Qwen3-8B)
TTFT: 122ms, TBT: 142ms, 7.0 tok/s (marginal change from 7.3)
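The seq_len ordering bug can be illustrated with a host-side model of the cache. This is a sketch under assumptions: the `KvCache` struct and its methods are hypothetical, only keys are stored (values would mirror them), and `Vec<f32>` stands in for the pre-allocated GPU buffers and D2D copies.

```rust
/// Host-side model of a per-layer cache laid out as [num_kv_heads, max_seq_len, head_dim].
struct KvCache {
    num_kv_heads: usize,
    max_seq_len: usize,
    head_dim: usize,
    seq_len: usize,
    keys: Vec<Vec<f32>>, // one pre-allocated buffer per layer
}

impl KvCache {
    fn new(num_layers: usize, num_kv_heads: usize, max_seq_len: usize, head_dim: usize) -> Self {
        let keys = vec![vec![0.0; num_kv_heads * max_seq_len * head_dim]; num_layers];
        KvCache { num_kv_heads, max_seq_len, head_dim, seq_len: 0, keys }
    }

    /// Copy `new_k` ([num_kv_heads, new_tokens, head_dim]) into one layer at the
    /// current offset. Deliberately does NOT advance seq_len.
    fn write_layer(&mut self, layer: usize, new_k: &[f32], new_tokens: usize) {
        assert!(self.seq_len + new_tokens <= self.max_seq_len);
        assert_eq!(new_k.len(), self.num_kv_heads * new_tokens * self.head_dim);
        let row = new_tokens * self.head_dim;
        for h in 0..self.num_kv_heads {
            let dst = (h * self.max_seq_len + self.seq_len) * self.head_dim;
            let src = h * row;
            self.keys[layer][dst..dst + row].copy_from_slice(&new_k[src..src + row]);
        }
    }

    /// One forward step: every layer writes at the same offset, then seq_len
    /// advances exactly once.
    fn append_step(&mut self, per_layer: &[Vec<f32>], new_tokens: usize) {
        for (layer, k) in per_layer.iter().enumerate() {
            self.write_layer(layer, k, new_tokens);
        }
        self.seq_len += new_tokens;
    }
}
```

If the increment sat inside the loop, layer 1 would write one position later than layer 0, and every subsequent layer would drift further.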
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-22 11:50:12 +08:00
6035ffdc0b
phase 5: naive multi-head attention
- Batched GEMM via cublasGemmStridedBatchedEx
- Causal mask CUDA kernel (F32 + BF16)
- Element-wise scale CUDA kernel (F32 + BF16)
- attention() composing: batched_matmul + scale + causal_mask + softmax
- Fixed to_device/contiguous infinite recursion (GPU contiguous via CPU round-trip)
- 5 attention tests passing (max_err < 3e-7 F32)
- Total: 61 tests passing across all crates
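The composition listed above (matmul, scale, causal mask, softmax, then the weighted sum over V) can be written as a single-head CPU reference. This is a sketch only: the signature is hypothetical, it runs in FP32 on row-major `[seq, dim]` slices, and it masks with negative infinity rather than a large finite constant.

```rust
/// Naive single-head causal attention over row-major [seq, dim] inputs
/// (CPU reference with a hypothetical signature).
fn attention(q: &[f32], k: &[f32], v: &[f32], seq: usize, dim: usize) -> Vec<f32> {
    assert_eq!(q.len(), seq * dim);
    assert_eq!(k.len(), seq * dim);
    assert_eq!(v.len(), seq * dim);
    let scale = 1.0 / (dim as f32).sqrt();
    let mut out = vec![0.0f32; seq * dim];

    for i in 0..seq {
        // 1. scores = q_i . k_j   2. scale   3. causal mask (j > i is hidden)
        let mut scores: Vec<f32> = (0..seq)
            .map(|j| {
                let dot: f32 = (0..dim).map(|d| q[i * dim + d] * k[j * dim + d]).sum();
                let scaled = dot * scale;
                if j > i { f32::NEG_INFINITY } else { scaled }
            })
            .collect();

        // 4. softmax with max subtraction; exp(-inf) is exactly 0 for masked slots.
        let max = scores.iter().cloned().fold(f32::NEG_INFINITY, f32::max);
        let mut sum = 0.0f32;
        for s in scores.iter_mut() {
            *s = (*s - max).exp();
            sum += *s;
        }

        // 5. output row = sum_j weight_j * v_j
        for j in 0..seq {
            let w = scores[j] / sum;
            for d in 0..dim {
                out[i * dim + d] += w * v[j * dim + d];
            }
        }
    }
    out
}
```

Row 0 can only see position 0, so its output equals the first row of V; that property makes a convenient unit test for the mask.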
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 21:17:23 +08:00
a83971fa25
phase 2: tensor abstraction layer
- DType enum (F32, F16, BF16) with TensorDType trait
- Shape utilities: contiguous_strides, broadcast_shape, broadcast_strides
- Storage with Arc reference counting (CPU Vec<u8> or GPU GpuBuffer)
- Device enum (Cpu, Cuda(id)) with to_device transfer
- Tensor type with strided layout: reshape, transpose, squeeze, unsqueeze
- contiguous() copies non-contiguous views to contiguous layout
- from_slice, zeros, ones constructors
- as_slice<T> for typed CPU read access, data_ptr for GPU kernel launch
- CPU↔GPU roundtrip verified
- All 27 tests pass (12 cuda + 4 shape + 11 tensor)
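The three shape utilities named above can be sketched as below. The function names come from this commit, but the signatures and the NumPy-style trailing-dimension broadcasting rule are assumptions, and strides are counted in elements.

```rust
/// Row-major strides (in elements) for a shape.
fn contiguous_strides(shape: &[usize]) -> Vec<usize> {
    let mut strides = vec![1; shape.len()];
    for i in (0..shape.len().saturating_sub(1)).rev() {
        strides[i] = strides[i + 1] * shape[i + 1];
    }
    strides
}

/// Broadcast two shapes, aligning trailing dimensions; None if incompatible.
fn broadcast_shape(a: &[usize], b: &[usize]) -> Option<Vec<usize>> {
    let rank = a.len().max(b.len());
    let mut out = vec![0; rank];
    for i in 0..rank {
        // Missing leading dimensions behave like size 1.
        let da = if i < a.len() { a[a.len() - 1 - i] } else { 1 };
        let db = if i < b.len() { b[b.len() - 1 - i] } else { 1 };
        out[rank - 1 - i] = match (da, db) {
            (x, y) if x == y => x,
            (1, y) => y,
            (x, 1) => x,
            _ => return None,
        };
    }
    Some(out)
}

/// Strides for viewing `shape` as `target`: broadcast dimensions get stride 0,
/// so every index along them reads the same element.
fn broadcast_strides(shape: &[usize], strides: &[usize], target: &[usize]) -> Vec<usize> {
    assert!(target.len() >= shape.len());
    assert_eq!(shape.len(), strides.len());
    let pad = target.len() - shape.len();
    (0..target.len())
        .map(|i| {
            if i < pad || (shape[i - pad] == 1 && target[i] != 1) {
                0
            } else {
                strides[i - pad]
            }
        })
        .collect()
}
```

A stride of 0 is what lets a broadcast operand stay a zero-copy view instead of being materialized at the target shape.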
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
2026-05-21 19:45:22 +08:00