Task 02: Row Softmax
1. Problem Statement
Implement a numerically stable row-wise softmax and compare the naive (multi-pass) and fused (single-pass) approaches.
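A minimal PyTorch reference (an assumed baseline for validation, not the kernel itself) makes the stability trick explicit: softmax is invariant to subtracting any per-row constant, so subtracting the row max keeps exp() from overflowing.

```python
import torch

def row_softmax_reference(x: torch.Tensor) -> torch.Tensor:
    # softmax(x) == softmax(x - c) for any per-row constant c, so subtract
    # the row max to keep exp() finite on large positive inputs.
    row_max = x.max(dim=1, keepdim=True).values
    exp = (x - row_max).exp()
    return exp / exp.sum(dim=1, keepdim=True)
```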
2. Expected Input/Output Shapes
- Input: a 2D tensor of shape [num_rows, num_cols]
- Output: a 2D tensor with the same shape
3. Performance Intuition
Softmax is often bandwidth-bound because each element is read several times unless you fuse work carefully. The arithmetic is cheap relative to the data movement.
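A rough back-of-envelope estimate (the shapes and op count below are illustrative, not measured) shows why: even a perfectly fused fp32 kernel moves about 8 bytes per element for only a handful of floating-point operations, far below the compute-to-bandwidth ratio of modern GPUs.

```python
# Illustrative arithmetic-intensity estimate for a fused fp32 row softmax.
num_rows, num_cols = 4096, 4096
bytes_moved = 2 * num_rows * num_cols * 4   # read each input once, write each output once
flops = 5 * num_rows * num_cols             # roughly: max, subtract, exp, sum, divide per element
print(f"{flops / bytes_moved:.2f} FLOP/byte")   # ~0.6 FLOP/byte, deep in the bandwidth-bound regime
```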
4. Memory Access Discussion
A naive implementation may read rows multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.
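To make that traffic concrete, here is a deliberately naive sketch (hypothetical, for illustration) in which each pass rereads the row from global memory; a fused kernel keeps the row, or at least the running max and sum, on chip and touches global memory only once per input and output element.

```python
import torch

def row_softmax_three_pass(x: torch.Tensor) -> torch.Tensor:
    # Pass 1: read every row to find its max.
    row_max = x.max(dim=1, keepdim=True).values
    # Pass 2: read every row again to accumulate the sum of exponentials.
    row_sum = (x - row_max).exp().sum(dim=1, keepdim=True)
    # Pass 3: read every row a third time to normalize and write the result.
    return (x - row_max).exp() / row_sum
```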
5. What Triton Is Abstracting
Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.
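A minimal Triton sketch, assuming one program per row and a BLOCK_SIZE that is a power of two at least as large as num_cols (the names and launch parameters are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def row_softmax_kernel(out_ptr, in_ptr, in_row_stride, out_row_stride,
                       n_cols, BLOCK_SIZE: tl.constexpr):
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    # Load one row; padded lanes get -inf so they vanish under max and exp.
    x = tl.load(in_ptr + row * in_row_stride + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)          # stability: subtract the row max
    num = tl.exp(x)
    denom = tl.sum(num, axis=0)
    tl.store(out_ptr + row * out_row_stride + cols, num / denom, mask=mask)
```

A launch such as row_softmax_kernel[(num_rows,)](y, x, x.stride(0), y.stride(0), num_cols, BLOCK_SIZE=triton.next_power_of_2(num_cols)) gives one program per row; the block-level max and sum reductions are expressed as ordinary tensor operations.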
6. What CUDA Makes Explicit
CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.
7. Reflection Questions
- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?
8. Implementation Checklist
- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test large positive and negative values
- Compare against torch.softmax (see the test sketch below)
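A small test sketch against torch.softmax with deliberately large-magnitude inputs (the helper name, shapes, and tolerances are assumptions, not part of the task scaffold):

```python
import torch

def check_against_torch(softmax_fn, num_rows=1024, num_cols=781, device="cuda"):
    # Scale inputs so an unstable implementation would overflow or return NaN.
    x = torch.randn(num_rows, num_cols, device=device) * 100.0
    expected = torch.softmax(x, dim=1)
    actual = softmax_fn(x)
    torch.testing.assert_close(actual, expected, rtol=1e-4, atol=1e-5)
    print("row softmax matches torch.softmax")
```

Using a non-power-of-two column count also exercises the masked tail of the row block.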