Task 02: Row Softmax
1. Problem Statement
Implement a numerically stable row-wise softmax and compare the naive (multi-pass) and fused (single-pass) approaches.
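A minimal PyTorch reference (an assumed baseline for validation, not the kernel itself) makes the stability trick explicit: softmax is invariant to subtracting any per-row constant, so subtracting the row max keeps exp() from overflowing.

```python
import torch

def row_softmax_reference(x: torch.Tensor) -> torch.Tensor:
    # softmax(x) == softmax(x - c) for any per-row constant c, so subtract
    # the row max to keep exp() finite on large positive inputs.
    row_max = x.max(dim=1, keepdim=True).values
    exp = (x - row_max).exp()
    return exp / exp.sum(dim=1, keepdim=True)
```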
2. Expected Input/Output Shapes
- Input: a 2D tensor of shape [num_rows, num_cols]
- Output: a 2D tensor with the same shape
3. Performance Intuition
Softmax is often bandwidth-bound because each element is read several times unless you fuse work carefully. The arithmetic is cheap relative to the data movement.
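A rough back-of-envelope estimate (the shapes and op count below are illustrative, not measured) shows why: even a perfectly fused fp32 kernel moves about 8 bytes per element for only a handful of floating-point operations, far below the compute-to-bandwidth ratio of modern GPUs.

```python
# Illustrative arithmetic-intensity estimate for a fused fp32 row softmax.
num_rows, num_cols = 4096, 4096
bytes_moved = 2 * num_rows * num_cols * 4   # read each input once, write each output once
flops = 5 * num_rows * num_cols             # roughly: max, subtract, exp, sum, divide per element
print(f"{flops / bytes_moved:.2f} FLOP/byte")   # ~0.6 FLOP/byte, deep in the bandwidth-bound regime
```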
4. Memory Access Discussion
A naive implementation may read rows multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.
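To make that traffic concrete, here is a deliberately naive sketch (hypothetical, for illustration) in which each pass rereads the row from global memory; a fused kernel keeps the row, or at least the running max and sum, on chip and touches global memory only once per input and output element.

```python
import torch

def row_softmax_three_pass(x: torch.Tensor) -> torch.Tensor:
    # Pass 1: read every row to find its max.
    row_max = x.max(dim=1, keepdim=True).values
    # Pass 2: read every row again to accumulate the sum of exponentials.
    row_sum = (x - row_max).exp().sum(dim=1, keepdim=True)
    # Pass 3: read every row a third time to normalize and write the result.
    return (x - row_max).exp() / row_sum
```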
5. What Triton Is Abstracting
Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.
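A minimal Triton sketch, assuming one program per row and a BLOCK_SIZE that is a power of two at least as large as num_cols (the names and launch parameters are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                       n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one row; padded lanes get -inf so they vanish under max and exp.
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)          # stability: subtract the row max
    num = tl.exp(x)
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols, num / denom, mask=mask)
```

A launch such as row_softmax_kernel[(num_rows,)](y, x, x.stride(0), y.stride(0), num_cols, BLOCK_SIZE=triton.next_power_of_2(num_cols)) gives one program per row; the block-level max and sum reductions are expressed as ordinary tensor operations.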
6. What CUDA Makes Explicit
CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.
7. Reflection Questions
- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?
8. Implementation Checklist
- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test large positive and negative values
- Compare against torch.softmax (see the test sketch below)
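A small test sketch against torch.softmax with deliberately large-magnitude inputs (the helper name, shapes, and tolerances are assumptions, not part of the task scaffold):

```python
import torch

def check_against_torch(softmax_fn, num_rows=1024, num_cols=781, device="cuda"):
    # Scale inputs so an unstable implementation would overflow or return NaN.
    x = torch.randn(num_rows, num_cols, device=device) * 100.0
    expected = torch.softmax(x, dim=1)
    actual = softmax_fn(x)
    torch.testing.assert_close(actual, expected, rtol=1e-4, atol=1e-5)
    print("row softmax matches torch.softmax")
```

Using a non-power-of-two column count also exercises the masked tail of the row block.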