876d3f5d6a4d8edfe2027e5d96c08d987aa345b0
Implement batched decode that processes multiple sequences' tokens in one forward pass. The key insight: cuBLAS M=4 GEMM is dramatically faster than 4× M=1 GEMV due to better TensorCore utilization and amortized kernel launch overhead. New method Qwen3::forward_decode_batch(&tokens, &positions, &mut caches): - Batched embedding, norm, projections, FFN: [B, hidden] × [hidden, X] → one cuBLAS call per weight matrix instead of B calls - Per-sequence attention: RoPE, KV cache, decode_attention remain per-seq (each has different position and KV length) - Row extraction (row_view) and concatenation (concat_rows) for batched↔per-seq transitions Engine Step 4b: - batch_size >= 2: extracts caches via std::mem::replace, calls forward_decode_batch, restores caches, samples per-sequence - batch_size == 1: falls back to per-seq forward_gpu_cache (no overhead) Ablation results (dash5, RTX 5090, Qwen3-8B BF16): | Scenario | Throughput | vs HF | |----------|-----------|-------| | Serial (batch=1) | 13.2 tok/s | 37% | | Concurrent (batch=4) | 35.1 tok/s | 97% | | HF transformers | 36.0 tok/s | 100% | The 2.66x throughput improvement (13.2 → 35.1) for concurrent requests comes from cuBLAS going from 1008 M=1 GEMVs to 252 M=4 GEMMs per step, which cuBLAS handles ~4x more efficiently on TensorCores. Milestone ④ target (50% of vLLM/HF throughput) achieved with 97%. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Description
No description provided
Languages
Rust
67.5%
Python
15.1%
Cuda
13.5%
Shell
3.9%