Initial project scaffold
tasks/01_vector_add/spec.md

# Task 01: Vector Add
## 1. Problem Statement
Implement `out[i] = x[i] + y[i]` in both Triton and CUDA, then compare both against the PyTorch reference.
## 2. Expected Input/Output Shapes
- Input: two tensors with identical 1D or flattened shapes
- Output: one tensor with the same shape
## 3. Performance Intuition
Vector add is simple enough that launch overhead and memory bandwidth dominate quickly. It is a good place to learn indexing before the math becomes interesting.
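To make "memory bandwidth dominates" concrete, a back-of-envelope lower bound on runtime, assuming float32 elements and a hypothetical device with 1 TB/s of memory bandwidth (both numbers are illustrative, not from the task):

```python
n = 1 << 28                      # 2^28 elements (~268M), an illustrative size
bytes_moved = 3 * 4 * n          # read x, read y, write out; 4 bytes per float32
bandwidth = 1e12                 # assumed 1 TB/s peak bandwidth (hypothetical device)
t_min = bytes_moved / bandwidth  # seconds; no compute tuning can beat this floor
print(f"{bytes_moved / 1e9:.1f} GB moved, >= {t_min * 1e3:.2f} ms at 1 TB/s")
```

Because the kernel does one add per 12 bytes of traffic, its arithmetic intensity is tiny; measured time should track this floor, not FLOP throughput.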
## 4. Memory Access Discussion
This kernel should read `x[i]` and `y[i]` once and write `out[i]` once. The main thing to inspect is whether neighboring threads or lanes access neighboring elements.
## 5. What Triton Is Abstracting
Triton lets you express one block of contiguous offsets with `program_id` and `tl.arange`, then apply a mask on the tail.
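That indexing model can be emulated in NumPy. This sketch is not real Triton (which would use `tl.program_id`, `tl.arange`, and masked `tl.load`/`tl.store` on device pointers); it just shows what each program instance computes:

```python
import numpy as np

def triton_style_add(x: np.ndarray, y: np.ndarray, BLOCK: int = 4) -> np.ndarray:
    out = np.empty_like(x)
    n = x.size
    num_programs = (n + BLOCK - 1) // BLOCK        # ceil-div grid, like triton.cdiv(n, BLOCK)
    for pid in range(num_programs):                # each iteration = one program instance
        offsets = pid * BLOCK + np.arange(BLOCK)   # tl.arange(0, BLOCK) shifted by program_id
        mask = offsets < n                         # guards the ragged final block
        idx = offsets[mask]
        out[idx] = x[idx] + y[idx]                 # masked load, add, masked store
    return out
```

Note that the loop over `pid` is sequential here purely for emulation; on the GPU every program instance runs concurrently.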
## 6. What CUDA Makes Explicit
CUDA makes you compute `global_idx` from block and thread indices yourself and write the boundary check explicitly.
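The same ownership rule, spelled out the way a CUDA kernel would write it. The nested Python loops stand in for the parallel launch (in real CUDA, `block_idx` and `thread_idx` come from `blockIdx.x` and `threadIdx.x`, not loops, and the grid/block sizes here are illustrative):

```python
def cuda_style_add(x: list, y: list, block_dim: int = 4) -> list:
    out = [0.0] * len(x)
    n = len(x)
    grid_dim = (n + block_dim - 1) // block_dim              # ceil-div, as in <<<grid, block>>>
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            global_idx = block_idx * block_dim + thread_idx  # blockIdx.x * blockDim.x + threadIdx.x
            if global_idx < n:                               # explicit bounds check on the tail
                out[global_idx] = x[global_idx] + y[global_idx]
    return out
```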
## 7. Reflection Questions
- What is the exact correspondence between `program_id` and `blockIdx.x` here?
- Why is a mask or bounds check required on the final block?
- How would the ownership change if one thread handled multiple elements?
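For the last question, the common answer is a grid-stride loop: each thread owns every `stride`-th element rather than exactly one. A Python emulation of that ownership pattern (in CUDA the stride would be `blockDim.x * gridDim.x`; `total_threads` here is a stand-in):

```python
def grid_stride_add(x: list, y: list, total_threads: int = 4) -> list:
    out = [0.0] * len(x)
    n = len(x)
    for tid in range(total_threads):        # tid plays the role of the global thread index
        i = tid
        while i < n:                        # thread tid owns elements tid, tid + stride, ...
            out[i] = x[i] + y[i]
            i += total_threads              # stride = blockDim.x * gridDim.x in CUDA
    return out
```

The per-thread bounds check becomes the loop condition, so no separate mask is needed, and the launch size can stay fixed while `n` grows.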
## 8. Implementation Checklist
- Confirm the reference implementation
- Fill in the Triton masked loads, add, and store
- Fill in the CUDA thread ownership and store
- Test small and non-multiple-of-block-size shapes
- Benchmark bandwidth on larger vectors
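For the last checklist item, the real benchmark should time the GPU kernels themselves (for example with `triton.testing.do_bench` or CUDA events). This CPU-side NumPy sketch only shows the effective-bandwidth arithmetic you would apply to those timings:

```python
import time
import numpy as np

n = 1 << 22
x = np.random.rand(n).astype(np.float32)
y = np.random.rand(n).astype(np.float32)

start = time.perf_counter()
out = x + y                                # kernel under test (here: NumPy on CPU)
elapsed = time.perf_counter() - start

bytes_moved = 3 * x.nbytes                 # read x, read y, write out
gbps = bytes_moved / elapsed / 1e9
print(f"{gbps:.1f} GB/s effective bandwidth")
```

Compare the number you measure against the device's peak memory bandwidth; for vector add, a well-written kernel on large inputs should land at a substantial fraction of peak.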