# Task 02: Row Softmax
## 1. Problem Statement
Implement a numerically stable row-wise softmax and compare a naive multi-pass implementation with a fused single-pass kernel.
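In the numerically stable form, the row maximum is subtracted before exponentiating. The result is mathematically identical to the unshifted definition, but every exponent is at most zero, so `exp` cannot overflow:

$$
\operatorname{softmax}(x)_{j} = \frac{e^{x_j - m}}{\sum_{k} e^{x_k - m}}, \qquad m = \max_{k} x_k
$$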
## 2. Expected Input/Output Shapes
- Input: a 2D tensor `[num_rows, num_cols]`
- Output: a 2D tensor with the same shape
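A minimal reference implementation consistent with these shapes (a sketch in PyTorch; the function name is illustrative, not part of the lab's scaffolding):

```python
import torch

def row_softmax_reference(x: torch.Tensor) -> torch.Tensor:
    """Numerically stable row-wise softmax for a [num_rows, num_cols] tensor."""
    # Subtracting the per-row max caps the largest exponent at exp(0) = 1.
    row_max = x.max(dim=1, keepdim=True).values   # [num_rows, 1]
    exps = torch.exp(x - row_max)                 # [num_rows, num_cols]
    return exps / exps.sum(dim=1, keepdim=True)   # same shape as the input
```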
## 3. Performance Intuition
Softmax is usually bandwidth-bound: the arithmetic is cheap relative to the data movement, so performance is dictated by how many times each element crosses the memory bus, and an unfused implementation reads every element several times.
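A back-of-the-envelope estimate makes the imbalance concrete. The numbers below are illustrative assumptions, not measurements, and the flop cost of `exp` varies by hardware:

```python
# Rough arithmetic-intensity estimate for fp32 row softmax.
num_rows, num_cols = 4096, 4096
elements = num_rows * num_cols
bytes_moved = 2 * elements * 4        # best case: each element read once and written once
flops = elements * 5                  # ~max, subtract, exp, add, divide per element
print(f"arithmetic intensity ~ {flops / bytes_moved:.2f} flop/byte")  # ~0.62
```

Even in the ideal fused case the kernel performs well under one flop per byte, far below the compute-to-bandwidth ratio of recent GPUs, so memory traffic is the resource to economize.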
## 4. Memory Access Discussion
A naive implementation may read rows multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.
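The three passes are easiest to see in a deliberately naive loop; this sketch illustrates the access pattern only and is not code you would ship:

```python
import torch

def row_softmax_three_pass(x: torch.Tensor) -> torch.Tensor:
    """Deliberately naive: every row of x is traversed three times."""
    out = torch.empty_like(x)
    for r in range(x.shape[0]):
        row = x[r]
        m = row.max()                        # pass 1: read the row to find its max
        denom = torch.exp(row - m).sum()     # pass 2: read it again for the exp-sum
        out[r] = torch.exp(row - m) / denom  # pass 3: read it a third time to normalize
    return out
```

The row max and the exp-sum are the two scalars worth keeping on chip; everything else can be recomputed from the row as it streams through.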
## 5. What Triton Is Abstracting
Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.
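A sketch of that pattern, assuming one program per row and a `BLOCK_SIZE` of at least `num_cols`; the names and the one-block-per-row layout are assumptions for illustration, not the lab's scaffolding:

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, num_cols, row_stride, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)                       # one program instance per row
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < num_cols
    # Masked load: out-of-range lanes become -inf so they vanish from both reductions.
    x = tl.load(x_ptr + row * row_stride + cols, mask=mask, other=float('-inf'))
    x = x - tl.max(x, axis=0)                    # block-wide max reduction
    num = tl.exp(x)                              # exp(-inf) = 0 for the masked lanes
    denom = tl.sum(num, axis=0)                  # block-wide sum reduction
    tl.store(out_ptr + row * row_stride + cols, num / denom, mask=mask)
```

Launched on a one-dimensional grid of `num_rows` programs, with `BLOCK_SIZE` rounded up to a power of two (e.g. via `triton.next_power_of_2`), everything between the load and the store stays in registers.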
## 6. What CUDA Makes Explicit
CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.
## 7. Reflection Questions
- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?
## 8. Implementation Checklist
- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test large positive and negative values
- Compare against `torch.softmax` (see the check sketched after this list)
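The last two checklist items can be exercised together. A minimal check, assuming the kernel under test is exposed as a Python callable named `row_softmax` (a hypothetical name):

```python
import torch

def check_row_softmax(row_softmax, num_rows=128, num_cols=1000, device="cuda"):
    x = torch.randn(num_rows, num_cols, device=device)
    # Extreme rows exercise the max-subtraction path: without it, exp overflows to inf.
    x[0, :] = 1e4
    x[1, :] = -1e4
    expected = torch.softmax(x, dim=1)
    torch.testing.assert_close(row_softmax(x), expected, rtol=1e-5, atol=1e-6)
    print("row softmax matches torch.softmax")
```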