review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings

- --pp with gpt-oss now fails with a clear message instead of a cryptic
  missing-weight panic inside the Qwen3-only PP engine.
- Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4); the
  uint4-vectorized kernels would silently drop a tail otherwise.
- Document the topk_ids buffer holding i32 under an F32 dtype label
  (DType has no I32).
- Drop unused imports/locals and the cuBLASLt scale-mode constants
  orphaned by the strided-batched FP8 rework (e631a71).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:02:59 +08:00
parent 1897b2e17a
commit 5343391dbd
8 changed files with 21 additions and 17 deletions
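
Note on the K-alignment asserts from the commit message: a uint4 load moves 16 bytes, which covers 16 one-byte FP8 values or 32 four-bit MXFP4 values, so a loop that only issues whole vector loads skips K % 16 (or K % 32) trailing elements. The host-side sketch below illustrates that arithmetic and a matching guard. It is not the code from this commit: the names elems_per_vec, dropped_tail and check_k_aligned are hypothetical, and it assumes the kernels read packed weights strictly in 16-byte units.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Assumption: one vector load is a uint4, i.e. 16 bytes.
constexpr std::size_t kVecBytes = 16;

// Elements covered by one 16-byte load.
// bits_per_elem is 8 for FP8 and 4 for MXFP4 payload values.
constexpr std::size_t elems_per_vec(std::size_t bits_per_elem) {
    return kVecBytes * 8 / bits_per_elem;
}

// Trailing elements that a loop over whole vectors would never touch.
constexpr std::size_t dropped_tail(std::size_t k, std::size_t bits_per_elem) {
    return k % elems_per_vec(bits_per_elem);
}

// Hypothetical wrapper-side guard: reject K values that leave a tail
// instead of letting the kernel compute a silently wrong result.
inline void check_k_aligned(std::size_t k, std::size_t bits_per_elem,
                            const char* what) {
    if (dropped_tail(k, bits_per_elem) != 0) {
        throw std::invalid_argument(
            std::string(what) + ": K=" + std::to_string(k) +
            " must be a multiple of " +
            std::to_string(elems_per_vec(bits_per_elem)));
    }
}
```

With these definitions, check_k_aligned(100, 8, "fp8") throws because 4 elements would be skipped, while check_k_aligned(64, 4, "mxfp4") passes.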
--- a/csrc/moe/moe_kernels.cu
+++ b/csrc/moe/moe_kernels.cu
@@ -93,11 +93,9 @@ __global__ void moe_replicate_bf16_kernel(
    int total = local_experts * num_tokens * hidden;
    if (idx >= total) return;

-    int expert = idx / (num_tokens * hidden);
    int remainder = idx % (num_tokens * hidden);
    // x_rep[expert, token, dim] = x[token, dim]
    x_rep[idx] = x[remainder];
-    (void)expert; // suppress unused warning
 }

 // ============================================================