# Task 02: Row Softmax

## 1. Problem Statement

Implement a row-wise softmax with numerical stability and compare naive and fused viewpoints.

## 2. Expected Input/Output Shapes

- Input: a 2D tensor `[num_rows, num_cols]`
- Output: a 2D tensor with the same shape

## 3. Performance Intuition

Softmax is often bandwidth-bound because each element is read several times unless you fuse work carefully. The arithmetic is cheap relative to the data movement.

## 4. Memory Access Discussion

A naive implementation may read each row multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.

## 5. What Triton Is Abstracting

Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.

## 6. What CUDA Makes Explicit

CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.

## 7. Reflection Questions

- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?

## 8. Implementation Checklist

- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test with large positive and negative values
- Compare against `torch.softmax`
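
## 9. Reference Sketch

For the first and last checklist items, the sketch below shows one way a numerically stable reference row softmax might look in PyTorch: subtract the per-row max (so the stable formula is `exp(x - max(x)) / sum(exp(x - max(x)))` per row), exponentiate, and normalize by the row sum. The function name `row_softmax_reference` and the test shapes are illustrative, not part of the task scaffolding.

```python
import torch

def row_softmax_reference(x: torch.Tensor) -> torch.Tensor:
    """Numerically stable row-wise softmax over a 2D tensor [num_rows, num_cols]."""
    # Subtract the per-row max so the largest exponent is exp(0) = 1,
    # which avoids overflow for large positive inputs.
    row_max = x.max(dim=1, keepdim=True).values
    exp = torch.exp(x - row_max)
    # Normalize each row by its sum of exponentials.
    return exp / exp.sum(dim=1, keepdim=True)

if __name__ == "__main__":
    # Large magnitudes stress numerical stability (checklist item 4).
    x = torch.randn(4, 8) * 100.0
    ref = row_softmax_reference(x)
    torch.testing.assert_close(ref, torch.softmax(x, dim=1))
    print("matches torch.softmax")
```

A fused Triton or CUDA kernel should reproduce these values while keeping the row max, the exponentials, and the row sum on chip instead of round-tripping them through global memory.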