xserv

gahow/xserv

Fork 0

Commit Graph

Author	SHA1	Message	Date
Gahow Wang	ee68d3565d	fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul Strict code review identified 30+ issues across correctness, performance, and architecture. This commit addresses 14 of them with verified fixes, restructures Phase 12 for honest continuous batching, and updates Phase 14 to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4). Bug fixes: - FIX-01: Global cuBLAS handle (thread-local singleton, was per-call) - FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels - FIX-03: Qwen3 ChatML template (was plain text concatenation) - FIX-04: EOS token from tokenizer (was hardcoded 151645) - FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0)) - FIX-06: unsqueeze stride preserves contiguous layout - FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding) - FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic) Feature additions: - FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible) - FIX-11: Correct usage statistics (prompt/completion/total tokens) - FIX-13: Temperature / top-k / top-p sampling with SamplingParams Performance improvements: - FIX-07: Caching allocator wired up (thread-local pool, pooled flag) - FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw) - FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip) Architecture: - Phase 12 engine restructured: prefill/decode separation, honest TODO for batched GPU forward (requires Flash Attention) - Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090) - Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096) Validated on dash5 (8x RTX 5090): - 52/52 API prompts pass (EN/CN/code), SSE streaming verified - Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap - 8 concurrent requests: 5.99x scheduling speedup (batch_size=4) - Throughput: 10.3 tok/s (serial), 30% of HF baseline Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 17:53:28 +08:00
Gahow Wang	246ae1c590	phase 10: Qwen3-8B support (Milestone ②) Qwen3 model (qwen3.rs): - RMSNorm + QK normalization (per-head q_norm/k_norm) - GQA: 32 Q heads, 8 KV heads, repeat_kv for attention - SwiGLU FFN: gate_proj → SiLU → * up_proj → down_proj - RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout - BF16 forward pass, [out,in] weight layout via linear_t - No attention bias (attention_bias=false) Tokenizer fixes: - Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table - MergeEntry supports both string and array formats - Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese) KVCache refactored: - Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16 - append_kv_tensor/get_kv_tensors use Tensor directly CLI updated: - Auto-detects model type from config.json (gpt2 vs qwen3) - Supports both GPT-2 (F32) and Qwen3 (BF16) Verified: Qwen3-8B generates coherent English and Chinese on single RTX 5090. 61/61 tests pass, GPT-2 performance no regression. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-05-22 00:46:37 +08:00

Author

SHA1

Message

Date

Gahow Wang

ee68d3565d

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul

Strict code review identified 30+ issues across correctness, performance,
and architecture. This commit addresses 14 of them with verified fixes,
restructures Phase 12 for honest continuous batching, and updates Phase 14
to target FA2 (RTX 5090 SM120 lacks TMEM required by FA4).

Bug fixes:
- FIX-01: Global cuBLAS handle (thread-local singleton, was per-call)
- FIX-02: Remove 19 unnecessary cudaDeviceSynchronize calls from kernels
- FIX-03: Qwen3 ChatML template (was plain text concatenation)
- FIX-04: EOS token from tokenizer (was hardcoded 151645)
- FIX-05: Storage tracks actual GPU device ordinal (was always Cuda(0))
- FIX-06: unsqueeze stride preserves contiguous layout
- FIX-08: CudaDeviceProp replaced with heap buffer (was UB-prone padding)
- FIX-09: Tokenizer byte_fallback to <0xNN> tokens (was panic)

Feature additions:
- FIX-10: SSE streaming (/v1/chat/completions, OpenAI-compatible)
- FIX-11: Correct usage statistics (prompt/completion/total tokens)
- FIX-13: Temperature / top-k / top-p sampling with SamplingParams

Performance improvements:
- FIX-07: Caching allocator wired up (thread-local pool, pooled flag)
- FIX-12: KV cache staging buffers (zero-alloc get_kv_len via borrow_raw)
- FIX-14: GPU strided copy kernel (eliminates contiguous() CPU round-trip)

Architecture:
- Phase 12 engine restructured: prefill/decode separation, honest TODO
  for batched GPU forward (requires Flash Attention)
- Phase 14 updated: FA2 for SM120 (FA4 requires TMEM, absent on 5090)
- Qwen3-7B → Qwen3-8B typo fixed across all docs (36 layers, hidden 4096)

Validated on dash5 (8x RTX 5090):
- 52/52 API prompts pass (EN/CN/code), SSE streaming verified
- Logits match HF transformers 9/10 top-1, 4.0/5 avg top-5 overlap
- 8 concurrent requests: 5.99x scheduling speedup (batch_size=4)
- Throughput: 10.3 tok/s (serial), 30% of HF baseline

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 17:53:28 +08:00

Gahow Wang

246ae1c590

phase 10: Qwen3-8B support (Milestone ②)

Qwen3 model (qwen3.rs):
- RMSNorm + QK normalization (per-head q_norm/k_norm)
- GQA: 32 Q heads, 8 KV heads, repeat_kv for attention
- SwiGLU FFN: gate_proj → SiLU → * up_proj → down_proj
- RoPE with transpose for [1,H,S,D] ↔ [S,H,D] layout
- BF16 forward pass, [out,in] weight layout via linear_t
- No attention bias (attention_bias=false)

Tokenizer fixes:
- Fixed unicode_to_byte: shifted bytes now use correct inverse lookup table
- MergeEntry supports both string and array formats
- Both GPT-2 and Qwen3 tokenizers work correctly (English + Chinese)

KVCache refactored:
- Dtype-agnostic: stores raw bytes per-head, works for F32 and BF16
- append_kv_tensor/get_kv_tensors use Tensor directly

CLI updated:
- Auto-detects model type from config.json (gpt2 vs qwen3)
- Supports both GPT-2 (F32) and Qwen3 (BF16)

Verified: Qwen3-8B generates coherent English and Chinese on single RTX 5090.
61/61 tests pass, GPT-2 performance no regression.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 00:46:37 +08:00

2 Commits