Initial project scaffold
tasks/02_row_softmax/spec.md (Normal file, 40 lines added)
@@ -0,0 +1,40 @@
# Task 02: Row Softmax
## 1. Problem Statement
Implement a row-wise softmax with numerical stability, and compare the naive multi-pass approach with a fused one.
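As a concrete starting point, here is a minimal PyTorch reference sketch (the function name is illustrative, not part of the scaffold) showing the max-subtraction trick that keeps the exponentials from overflowing:

```python
import torch


def reference_row_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the per-row max so exp() cannot overflow for large entries;
    # the shift cancels in the ratio, so the result equals an unshifted softmax.
    row_max = x.max(dim=1, keepdim=True).values
    num = (x - row_max).exp()
    return num / num.sum(dim=1, keepdim=True)
```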
## 2. Expected Input/Output Shapes
- Input: a 2D tensor `[num_rows, num_cols]`
- Output: a 2D tensor with the same shape
## 3. Performance Intuition
Softmax is usually bandwidth-bound: the arithmetic is cheap relative to the data movement, and an unfused implementation makes things worse by reading each element from global memory several times.
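As a rough back-of-the-envelope check (the per-element operation count below is an assumption, not a measurement), even a perfectly fused float32 softmax performs well under one flop per byte moved:

```python
# Roofline-style estimate for a fused float32 row softmax.
# Assumes one read and one write per element, and roughly 4 flops per element
# (subtract max, exponentiate, accumulate, divide); both figures are assumptions.
num_rows, num_cols = 4096, 4096
elements = num_rows * num_cols
bytes_moved = elements * 4 * 2        # 4 bytes per float32, one read + one write
flops = elements * 4
print(f"arithmetic intensity: {flops / bytes_moved:.2f} flop/byte")  # ~0.5
```

At roughly half a flop per byte, memory traffic dominates on any modern GPU, and an unfused version only lowers that ratio further.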
## 4. Memory Access Discussion
A naive implementation may read rows multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.
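To make the redundant traffic concrete, here is a deliberately naive sketch in which each of the three passes re-reads the row from (what would be) global memory; a fused kernel keeps the row max and the running sum on chip instead:

```python
import torch


def naive_three_pass_softmax(x: torch.Tensor) -> torch.Tensor:
    # Pass 1: read every row to find its maximum.
    row_max = x.max(dim=1, keepdim=True).values
    # Pass 2: read every row again to accumulate the sum of exponentials.
    denom = (x - row_max).exp().sum(dim=1, keepdim=True)
    # Pass 3: read every row a third time to normalize the output.
    return (x - row_max).exp() / denom
```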
## 5. What Triton Is Abstracting
Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.
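A minimal Triton sketch of that pattern, assuming the whole row fits in a single block padded up to a power of two; the task's scaffold may split these steps differently:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, num_cols, x_row_stride, out_row_stride,
                       BLOCK_SIZE: tl.constexpr):
    # One program instance handles one full row.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < num_cols

    # Masked load: out-of-range lanes read -inf so they cannot win the max
    # and contribute exactly zero to the sum of exponentials.
    x = tl.load(x_ptr + row * x_row_stride + cols, mask=mask, other=float("-inf"))

    row_max = tl.max(x, axis=0)          # max reduction
    num = tl.exp(x - row_max)            # stable exponentials
    denom = tl.sum(num, axis=0)          # sum reduction
    tl.store(out_ptr + row * out_row_stride + cols, num / denom, mask=mask)


def row_softmax(x: torch.Tensor) -> torch.Tensor:
    num_rows, num_cols = x.shape
    out = torch.empty_like(x)
    block_size = triton.next_power_of_2(num_cols)
    row_softmax_kernel[(num_rows,)](x, out, num_cols, x.stride(0), out.stride(0),
                                    BLOCK_SIZE=block_size)
    return out
```

With one program per row, both reductions stay in registers, which is exactly the on-chip reuse the previous section asks about.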
## 6. What CUDA Makes Explicit
CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.
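CUDA itself is out of scope for this sketch, but the decision it forces can be mimicked in plain Python: each "thread" in a one-block-per-row layout reduces a strided slice of the row, and the block-wide combines below are the steps a real kernel would stage through shared memory behind explicit `__syncthreads()` barriers (the thread count and strided partitioning are illustrative assumptions):

```python
import numpy as np


def block_per_row_softmax(x: np.ndarray, threads_per_block: int = 8) -> np.ndarray:
    # Assumes num_cols >= threads_per_block so every "thread" owns some elements.
    out = np.empty_like(x)
    for row in range(x.shape[0]):                 # one "block" per row
        r = x[row]
        # Each thread computes a partial max over its strided slice of the row.
        partial_max = [r[t::threads_per_block].max() for t in range(threads_per_block)]
        row_max = max(partial_max)                # block-wide combine (shared memory + barrier)
        # Each thread computes a partial sum of exponentials, then combine again.
        partial_sum = [np.exp(r[t::threads_per_block] - row_max).sum()
                       for t in range(threads_per_block)]
        denom = sum(partial_sum)                  # second combine, second barrier
        out[row] = np.exp(r - row_max) / denom
    return out
```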
## 7. Reflection Questions
- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?
## 8. Implementation Checklist
- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test inputs containing large positive and negative values
- Compare against `torch.softmax` (a test sketch follows this list)
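A small correctness check along those lines (the helper name, sizes, and tolerances are illustrative, not part of the scaffold):

```python
import torch


def check_against_torch(softmax_fn, device: str = "cuda") -> None:
    # Mix moderate values with extreme positive and negative entries so the
    # max-subtraction path is actually exercised.
    x = torch.randn(128, 1000, device=device) * 100.0
    x[0, 0] = 1e4
    x[1, 1] = -1e4
    expected = torch.softmax(x, dim=1)
    torch.testing.assert_close(softmax_fn(x), expected, rtol=1e-4, atol=1e-5)
```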