# Task 02: Row Softmax

## 1. Problem Statement

Implement a row-wise softmax with numerical stability and compare naive and fused viewpoints.

## 2. Expected Input/Output Shapes

- Input: a 2D tensor `[num_rows, num_cols]`
- Output: a 2D tensor with the same shape

## 3. Performance Intuition

Softmax is often bandwidth-bound because each element is read several times unless you fuse work carefully. The arithmetic is cheap relative to the data movement.

## 4. Memory Access Discussion

A naive implementation may read each row multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.

## 5. What Triton Is Abstracting

Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.

## 6. What CUDA Makes Explicit

CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.

## 7. Reflection Questions

- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?

## 8. Implementation Checklist

- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test with large positive and negative values
- Compare against `torch.softmax`
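
## 9. Reference Sketch

For the first and last checklist items, the sketch below shows one way a numerically stable reference row softmax might look in PyTorch: subtract the per-row max (so the stable formula is `exp(x - max(x)) / sum(exp(x - max(x)))` per row), exponentiate, and normalize by the row sum. The function name `row_softmax_reference` and the test shapes are illustrative, not part of the task scaffolding.

```python
import torch

def row_softmax_reference(x: torch.Tensor) -> torch.Tensor:
    """Numerically stable row-wise softmax over a 2D tensor [num_rows, num_cols]."""
    # Subtract the per-row max so the largest exponent is exp(0) = 1,
    # which avoids overflow for large positive inputs.
    row_max = x.max(dim=1, keepdim=True).values
    exp = torch.exp(x - row_max)
    # Normalize each row by its sum of exponentials.
    return exp / exp.sum(dim=1, keepdim=True)

if __name__ == "__main__":
    # Large magnitudes stress numerical stability (checklist item 4).
    x = torch.randn(4, 8) * 100.0
    ref = row_softmax_reference(x)
    torch.testing.assert_close(ref, torch.softmax(x, dim=1))
    print("matches torch.softmax")
```

A fused Triton or CUDA kernel should reproduce these values while keeping the row max, the exponentials, and the row sum on chip instead of round-tripping them through global memory.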