xserv

Go to file

Gahow Wang 876d3f5d6a phase 15: batched decode forward — 35 tok/s (97% of HF transformers)

Implement batched decode that processes multiple sequences' tokens in one
forward pass. The key insight: cuBLAS M=4 GEMM is dramatically faster
than 4× M=1 GEMV due to better TensorCore utilization and amortized
kernel launch overhead.

New method Qwen3::forward_decode_batch(&tokens, &positions, &mut caches):
- Batched embedding, norm, projections, FFN: [B, hidden] × [hidden, X]
  → one cuBLAS call per weight matrix instead of B calls
- Per-sequence attention: RoPE, KV cache, decode_attention remain per-seq
  (each has different position and KV length)
- Row extraction (row_view) and concatenation (concat_rows) for
  batched↔per-seq transitions

Engine Step 4b:
- batch_size >= 2: extracts caches via std::mem::replace, calls
  forward_decode_batch, restores caches, samples per-sequence
- batch_size == 1: falls back to per-seq forward_gpu_cache (no overhead)

Ablation results (dash5, RTX 5090, Qwen3-8B BF16):

| Scenario | Throughput | vs HF |
|----------|-----------|-------|
| Serial (batch=1) | 13.2 tok/s | 37% |
| Concurrent (batch=4) | 35.1 tok/s | 97% |
| HF transformers | 36.0 tok/s | 100% |

The 2.66x throughput improvement (13.2 → 35.1) for concurrent requests
comes from cuBLAS going from 1008 M=1 GEMVs to 252 M=4 GEMMs per step,
which cuBLAS handles ~4x more efficiently on TensorCores.

Milestone ④ target (50% of vLLM/HF throughput) achieved with 97%.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

2026-05-22 20:07:43 +08:00

crates

phase 15: batched decode forward — 35 tok/s (97% of HF transformers)

2026-05-22 20:07:43 +08:00

csrc

phase 15: decode attention kernel + fused silu_mul + fused add_rmsnorm

2026-05-22 19:40:56 +08:00

docs

docs: Phase 14 design doc + benchmark, fix Phase 11/12 honesty

2026-05-22 18:51:29 +08:00

tools

fix: comprehensive review + 14 bug fixes + Phase 12/14 overhaul