review cleanups: pp+gpt-oss guard, sparse GEMV asserts, warnings

- --pp with gpt-oss now fails with a clear message instead of a
  cryptic missing-weight panic inside the Qwen3-only PP engine.
- Sparse GEMV wrappers assert K%16==0 (FP8) / K%32==0 (MXFP4); otherwise
  the uint4-vectorized kernels would silently drop the tail.
- Document that the topk_ids buffer holds i32 values under an F32 dtype
  label (DType has no I32).
- Drop unused imports/locals and the cuBLASLt scale-mode constants
  orphaned by the strided-batched FP8 rework (e631a71).
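The K alignment requirement follows from the load width. A uint4 load moves 16 bytes, which is 16 FP8 values (1 byte each) or 32 MXFP4 values (4 bits each). The C++ sketch below works through that arithmetic and shows a host-side guard. `elems_per_vec`, `check_gemv_k` and `elems_covered_by_vector_loop` are hypothetical names, not the repository's wrappers, and the real kernels may have further constraints (for example MXFP4 scale blocks) that this sketch ignores.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Bytes moved by one uint4 load (4 x 32-bit words).
constexpr std::size_t kVecBytes = 16;

// Elements covered by one 16-byte load for a given element width in bits.
// FP8: 8 bits -> 16 elements; MXFP4: 4 bits -> 32 elements.
std::size_t elems_per_vec(std::size_t bits_per_elem) {
    assert(bits_per_elem != 0 && (kVecBytes * 8) % bits_per_elem == 0);
    return kVecBytes * 8 / bits_per_elem;
}

// Hypothetical host-side guard mirroring the K%16 (FP8) / K%32 (MXFP4) asserts.
void check_gemv_k(std::size_t k, std::size_t bits_per_elem) {
    const std::size_t step = elems_per_vec(bits_per_elem);
    if (k % step != 0) {
        throw std::invalid_argument("K=" + std::to_string(k) +
                                    " is not a multiple of " + std::to_string(step));
    }
}

// Failure mode the guard prevents: a loop that only issues whole vector
// loads covers floor(K / step) * step elements and never reads the tail.
std::size_t elems_covered_by_vector_loop(std::size_t k, std::size_t bits_per_elem) {
    const std::size_t step = elems_per_vec(bits_per_elem);
    std::size_t covered = 0;
    for (std::size_t v = 0; v < k / step; ++v) {
        covered += step;  // one uint4 load's worth of elements
    }
    return covered;
}
```

With K=40 and FP8, the vector-only loop covers 32 elements and skips the last 8. That is the silent tail drop the new asserts reject up front.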

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
2026-06-12 17:02:59 +08:00
parent 1897b2e17a
commit 5343391dbd
8 changed files with 21 additions and 17 deletions


@@ -93,11 +93,9 @@ __global__ void moe_replicate_bf16_kernel(
int total = local_experts * num_tokens * hidden;
if (idx >= total) return;
int expert = idx / (num_tokens * hidden);
int remainder = idx % (num_tokens * hidden);
// x_rep[expert, token, dim] = x[token, dim]
x_rep[idx] = x[remainder];
(void)expert; // suppress unused warning
}
// ============================================================
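In `moe_replicate_bf16_kernel` above, the `expert` index computed from `idx` never feeds the address. Each expert's slice of `x_rep` is a full copy of `x`, so `idx % (num_tokens * hidden)` alone selects the source element. The following is a minimal CPU reference of that mapping. It assumes `float` in place of bf16, and `moe_replicate_ref` is a hypothetical name used only for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for the replication mapping (float stands in for bf16).
// Each expert slice of x_rep is a full copy of x, so the expert index is
// never needed to compute the source address.
std::vector<float> moe_replicate_ref(const std::vector<float>& x,
                                     int local_experts, int num_tokens, int hidden) {
    const int slice = num_tokens * hidden;      // elements per expert copy
    assert(static_cast<int>(x.size()) == slice);
    const int total = local_experts * slice;    // same bound as the kernel
    std::vector<float> x_rep(static_cast<std::size_t>(total));
    for (int idx = 0; idx < total; ++idx) {     // one iteration per CUDA thread
        x_rep[idx] = x[idx % slice];            // x_rep[e, t, d] = x[t, d]
    }
    return x_rep;
}
```

Because the copy is independent of `e`, dropping the `expert` local (and the `(void)` cast that silenced it) does not change the kernel's output.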