Initial project scaffold
tasks/02_row_softmax/spec.md (Normal file, 40 lines added)
@@ -0,0 +1,40 @@
# Task 02: Row Softmax
## 1. Problem Statement
Implement a row-wise softmax with numerical stability, and compare the naive multi-pass approach with a fused one.
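As a concrete starting point, here is a minimal PyTorch reference sketch (the function name is illustrative, not part of the scaffold) showing the max-subtraction trick that keeps the exponentials from overflowing:

```python
import torch


def reference_row_softmax(x: torch.Tensor) -> torch.Tensor:
    # Subtract the per-row max so exp() cannot overflow for large entries;
    # the shift cancels in the ratio, so the result equals an unshifted softmax.
    row_max = x.max(dim=1, keepdim=True).values
    num = (x - row_max).exp()
    return num / num.sum(dim=1, keepdim=True)
```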
## 2. Expected Input/Output Shapes
- Input: a 2D tensor `[num_rows, num_cols]`
- Output: a 2D tensor with the same shape
## 3. Performance Intuition
Softmax is usually bandwidth-bound: the arithmetic is cheap relative to the data movement, and an unfused implementation makes things worse by reading each element from global memory several times.
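As a rough back-of-the-envelope check (the per-element operation count below is an assumption, not a measurement), even a perfectly fused float32 softmax performs well under one flop per byte moved:

```python
# Roofline-style estimate for a fused float32 row softmax.
# Assumes one read and one write per element, and roughly 4 flops per element
# (subtract max, exponentiate, accumulate, divide); both figures are assumptions.
num_rows, num_cols = 4096, 4096
elements = num_rows * num_cols
bytes_moved = elements * 4 * 2        # 4 bytes per float32, one read + one write
flops = elements * 4
print(f"arithmetic intensity: {flops / bytes_moved:.2f} flop/byte")  # ~0.5
```

At roughly half a flop per byte, memory traffic dominates on any modern GPU, and an unfused version only lowers that ratio further.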
## 4. Memory Access Discussion
A naive implementation may read rows multiple times: once for the max, once for the sum of exponentials, and once for normalization. Think about which intermediate values can stay on chip.
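To make the redundant traffic concrete, here is a deliberately naive sketch in which each of the three passes re-reads the row from (what would be) global memory; a fused kernel keeps the row max and the running sum on chip instead:

```python
import torch


def naive_three_pass_softmax(x: torch.Tensor) -> torch.Tensor:
    # Pass 1: read every row to find its maximum.
    row_max = x.max(dim=1, keepdim=True).values
    # Pass 2: read every row again to accumulate the sum of exponentials.
    denom = (x - row_max).exp().sum(dim=1, keepdim=True)
    # Pass 3: read every row a third time to normalize the output.
    return (x - row_max).exp() / denom
```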
## 5. What Triton Is Abstracting
Triton makes it easy to load a row block, apply masked operations, and reduce across the block with tensor-style code.
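A minimal Triton sketch of that pattern, assuming the whole row fits in a single block padded up to a power of two; the task's scaffold may split these steps differently:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def row_softmax_kernel(x_ptr, out_ptr, num_cols, x_row_stride, out_row_stride,
                       BLOCK_SIZE: tl.constexpr):
    # One program instance handles one full row.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < num_cols

    # Masked load: out-of-range lanes read -inf so they cannot win the max
    # and contribute exactly zero to the sum of exponentials.
    x = tl.load(x_ptr + row * x_row_stride + cols, mask=mask, other=float("-inf"))

    row_max = tl.max(x, axis=0)          # max reduction
    num = tl.exp(x - row_max)            # stable exponentials
    denom = tl.sum(num, axis=0)          # sum reduction
    tl.store(out_ptr + row * out_row_stride + cols, num / denom, mask=mask)


def row_softmax(x: torch.Tensor) -> torch.Tensor:
    num_rows, num_cols = x.shape
    out = torch.empty_like(x)
    block_size = triton.next_power_of_2(num_cols)
    row_softmax_kernel[(num_rows,)](x, out, num_cols, x.stride(0), out.stride(0),
                                    BLOCK_SIZE=block_size)
    return out
```

With one program per row, both reductions stay in registers, which is exactly the on-chip reuse the previous section asks about.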
## 6. What CUDA Makes Explicit
CUDA forces you to decide where the row reduction lives: one block per row, multiple warps per row, or a tiled strategy. Shared-memory use and synchronization become explicit design choices.
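CUDA itself is out of scope for this sketch, but the decision it forces can be mimicked in plain Python: each "thread" in a one-block-per-row layout reduces a strided slice of the row, and the block-wide combines below are the steps a real kernel would stage through shared memory behind explicit `__syncthreads()` barriers (the thread count and strided partitioning are illustrative assumptions):

```python
import numpy as np


def block_per_row_softmax(x: np.ndarray, threads_per_block: int = 8) -> np.ndarray:
    # Assumes num_cols >= threads_per_block so every "thread" owns some elements.
    out = np.empty_like(x)
    for row in range(x.shape[0]):                 # one "block" per row
        r = x[row]
        # Each thread computes a partial max over its strided slice of the row.
        partial_max = [r[t::threads_per_block].max() for t in range(threads_per_block)]
        row_max = max(partial_max)                # block-wide combine (shared memory + barrier)
        # Each thread computes a partial sum of exponentials, then combine again.
        partial_sum = [np.exp(r[t::threads_per_block] - row_max).sum()
                       for t in range(threads_per_block)]
        denom = sum(partial_sum)                  # second combine, second barrier
        out[row] = np.exp(r - row_max) / denom
    return out
```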
## 7. Reflection Questions
- Why is max subtraction required for stable softmax?
- Why is softmax often bandwidth-bound rather than compute-bound?
- Which intermediate quantities would you prefer not to write back to global memory?
## 8. Implementation Checklist
- Validate the reference row softmax
- Fill in Triton row loading, max reduction, sum reduction, and normalization
- Fill in the CUDA reduction structure
- Test inputs containing large positive and negative values
- Compare against `torch.softmax` (a test sketch follows this list)
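A small correctness check along those lines (the helper name, sizes, and tolerances are illustrative, not part of the scaffold):

```python
import torch


def check_against_torch(softmax_fn, device: str = "cuda") -> None:
    # Mix moderate values with extreme positive and negative entries so the
    # max-subtraction path is actually exercised.
    x = torch.randn(128, 1000, device=device) * 100.0
    x[0, 0] = 1e4
    x[1, 1] = -1e4
    expected = torch.softmax(x, dim=1)
    torch.testing.assert_close(softmax_fn(x), expected, rtol=1e-4, atol=1e-5)
```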