Initial project scaffold

2026-04-10 13:22:19 +00:00
commit 7fa69b1354
94 changed files with 3964 additions and 0 deletions

docs/blackwell_notes.md Normal file

@@ -0,0 +1,20 @@
# Blackwell Notes
This repository targets a Blackwell-style workflow, but keeps the build configuration explicit because local toolchain support may differ across systems.
## Build Guidance
- Prefer explicit architecture selection over hidden defaults.
- Use `KERNEL_LAB_CUDA_ARCH=120` for Python-side build helpers when your local environment supports it (see the sketch after this list).
- Use `-DCMAKE_CUDA_ARCHITECTURES=120` with CMake for direct native builds.
- If your toolkit does not yet accept the exact architecture value you want, adjust the build flag rather than editing the kernels.
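As a concrete illustration of the first two bullets, a Python-side helper might translate the environment variable into an explicit `nvcc` flag. The function below is a hypothetical sketch, not part of this repo's build scripts:
```python
import os

def cuda_arch_flags(default_arch: str = "120") -> list[str]:
    """Hypothetical helper: map KERNEL_LAB_CUDA_ARCH onto an explicit nvcc flag."""
    arch = os.environ.get("KERNEL_LAB_CUDA_ARCH", default_arch)
    # One explicit -gencode entry instead of a hidden toolchain default.
    return [f"-gencode=arch=compute_{arch},code=sm_{arch}"]

# For example, the result could be passed as extra_cuda_cflags to
# torch.utils.cpp_extension.load(...) when the local toolkit accepts the value.
```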
## What To Watch On A New GPU Generation
- compiler support for the target architecture
- PyTorch wheel compatibility
- Triton support level
- driver/toolkit mismatch
- profiler tool compatibility
Treat environment validation as part of the lab, not as a one-time setup nuisance.
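The repository's `tools/check_env.py` is the authoritative check; the sketch below only illustrates the kind of facts worth confirming, using standard PyTorch and Triton introspection:
```python
import torch

def check_environment() -> None:
    # PyTorch wheel / toolkit compatibility: the CUDA version this torch build targets.
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        # Compiler / architecture support: the compute capability of the local GPU.
        major, minor = torch.cuda.get_device_capability()
        print("device:", torch.cuda.get_device_name(), f"(sm_{major}{minor})")
    try:
        import triton
        print("triton:", triton.__version__)  # Triton support level
    except ImportError:
        print("triton: not installed")
```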

docs/cuda_execution_model.md Normal file

@@ -0,0 +1,40 @@
# CUDA Execution Model
## How To Read A CUDA Kernel
Use this short checklist every time:
1. Find the logical work unit.
Ask what one thread, warp, or block is responsible for.
2. Decode the index math.
Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets; a plain-Python emulation of this index math follows the checklist.
3. Inspect the memory accesses.
Separate global loads, shared memory loads, stores, and reductions.
4. Find synchronization points.
Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check boundary conditions.
Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation.
Make sure the math, masking, and shape conventions still match.
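Steps 2 and 5 can be rehearsed in plain Python before reading real CUDA. The sketch below emulates the index math of a one-dimensional vector-add launch on the CPU; no GPU or CUDA toolkit is involved:
```python
def emulated_vector_add(x, y, block_dim=256):
    """Plain-Python emulation of 1D CUDA index math (illustration only)."""
    n = len(x)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim     # ceil-divide, like the launch config
    for block_idx in range(grid_dim):               # blockIdx.x
        for thread_idx in range(block_dim):         # threadIdx.x
            i = block_idx * block_dim + thread_idx  # derived global offset (step 2)
            if i < n:                               # boundary check (step 5)
                out[i] = x[i] + y[i]
    return out
```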
## Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context
CUDA makes several things explicit that Triton abstracts:
- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
## Reading Order For This Lab
- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence
- `flash_attention_fwd.cu`: composition of multiple ideas

docs/flashattention_notes.md Normal file

@@ -0,0 +1,28 @@
# FlashAttention Notes
FlashAttention-style kernels are useful because the naive attention pipeline materializes large score tensors in global memory and spends much of its memory bandwidth just moving them.
## The Core Idea
Instead of:
1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`
you process attention block by block and keep more intermediate state on chip.
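For reference, the naive pipeline is only a few lines of PyTorch. Shapes and names here are illustrative (one head, no masking); the repository's own reference implementation may differ in detail:
```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) tensors for a single head, no masking.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale  # full score matrix is materialized
    probs = torch.softmax(scores, dim=-1)       # softmax needs every element of each row
    return probs @ v                            # scores are read again to weight V
```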
## Why Online Softmax Matters
Blockwise processing changes the normalization problem. You cannot assume you have seen the full row. The running max / running sum recurrence lets you update normalization state incrementally without losing numerical stability.
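A minimal sketch of that recurrence over one row, in plain Python, assuming only that the row arrives in fixed-size blocks:
```python
import math

def online_softmax_state(row, block=4):
    """Running max m and running sum s, updated block by block.

    After the loop, softmax(row)[i] == exp(row[i] - m) / s, matching a
    full-row numerically stable softmax.
    """
    m, s = float("-inf"), 0.0
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        m_new = max(m, max(chunk))
        # Rescale the old sum into the new reference frame, then add the new block.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    return m, s
```
The FlashAttention forward pass applies the same rescaling to the partially accumulated output, not just to the running sum.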
## What This Lab Covers
- forward pass only
- small-shape correctness first
- optional causal masking
- side-by-side Triton and CUDA skeletons
This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.

docs/gpu_basics.md Normal file

@@ -0,0 +1,30 @@
# GPU Basics
This lab assumes you are learning GPU kernels as structured data-parallel programs.
## Core Ideas
- GPU throughput comes from massive parallelism, not a single fast thread.
- Launch geometry determines which logical elements each thread or program instance owns.
- Global memory is large and slow relative to on-chip storage.
- Kernel design is often about reducing memory traffic and increasing reuse.
## Terms To Keep Straight
- thread: the smallest execution entity in CUDA
- warp: a hardware scheduling group, usually 32 threads
- block: a cooperating group of threads with shared memory access
- grid: the full launch of all blocks
- program instance: Triton's block-level work abstraction
## Mental Model For This Repo
Each task asks the same questions in both Triton and CUDA:
- What data does one unit of work own?
- How is that ownership computed from launch indices?
- Which reads are coalesced or contiguous?
- Which intermediate values must be reduced?
- Which values should be reused on chip?
Keep a notebook. Write down the answers before you code.
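As a worked example, a row softmax answers those questions like this: one unit of work owns one row, ownership is simply the row index, the row read is contiguous, and the row max and row sum are the reduced intermediates. A NumPy sketch of that structure (not a kernel, just the ownership and reduction pattern):
```python
import numpy as np

def row_softmax_by_row_ownership(x):
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for row in range(x.shape[0]):  # in a real kernel, this loop is the launch grid
        v = x[row]                 # contiguous read of the owned row
        m = v.max()                # first reduced intermediate: the row max
        e = np.exp(v - m)          # values a kernel would keep on chip for reuse
        out[row] = e / e.sum()     # second reduced intermediate: the row sum
    return out
```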

docs/profiling_guide.md Normal file

@@ -0,0 +1,87 @@
# Profiling Guide
## Profile One Kernel At A Time
Good profiling starts narrow:
- one implementation
- one shape
- one dtype
- one device
- one command you can rerun
If you profile a full training script too early, you will not know which kernel you are looking at.
## Why Warmup Matters
The first iterations may include:
- lazy module loading
- JIT compilation
- cache effects
- allocator setup
Warm up first, then measure.
## Why Synchronization Matters
GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.
Use `torch.cuda.synchronize()` around timed regions.
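A minimal timing helper along those lines, assuming the workload is a Python callable that launches GPU work (the helper name is illustrative, not part of this repo):
```python
import time
import torch

def time_gpu_callable(fn, *args, warmup=10, reps=50):
    """Median wall-clock seconds per call of fn(*args) on the current CUDA device."""
    for _ in range(warmup):          # absorb compilation, caching, and allocator setup
        fn(*args)
    torch.cuda.synchronize()         # make sure all warmup work has finished
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()     # wait for the kernel, not just the launch
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]    # report the median, not only the minimum
```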
## How To Avoid Misleading Timings
- keep shapes fixed
- use multiple repetitions
- report median, not only minimum
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU
## First Metrics To Inspect
- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding
## Practical `ncu` Examples
```bash
ncu --set full --target-processes all \
python bench/bench_vector_add.py --device cuda --mode cuda
```
```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
python bench/bench_softmax.py --device cuda --mode triton
```
## Practical `nsys` Examples
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/attention_triton \
python bench/bench_attention.py --device cuda --mode triton
```
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/matmul_cuda \
python bench/bench_matmul.py --device cuda --mode cuda
```
## Checklist Before Trusting A Benchmark Result
- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference?
- Were shapes and dtypes identical?
- Was one implementation silently skipped, or did it fall back to the CPU?
- Did you report median time over several repetitions?
- Is the measured quantity bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?

docs/roadmap.md Normal file

@@ -0,0 +1,75 @@
# Roadmap
## Week 1 Study Plan
Day 1:
- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py`
Day 2:
- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in vector add indexing TODOs in Triton and CUDA
- Run `pytest -q tasks/01_vector_add/test_task.py`
Day 3:
- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first
- Compare against the CUDA skeleton and map the reduction strategy
Day 4:
- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path, prioritizing correctness over speed
Day 5:
- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally
Day 6:
- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen
Day 7:
- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization
## Recommended TODO Order
1. Environment checks
2. Vector add Triton
3. Vector add CUDA
4. Row softmax Triton
5. Row softmax CUDA
6. Tiled matmul Triton
7. Tiled matmul CUDA
8. Online softmax Triton
9. Online softmax CUDA
10. Flash attention forward Triton
11. Flash attention forward CUDA
12. PyTorch custom op binding
13. Profiling passes and benchmark validation
## What To Focus On First
- Correctness on tiny shapes
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement
Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.

docs/triton_vs_cuda.md Normal file

@@ -0,0 +1,30 @@
# Triton Vs CUDA
## Concept Mapping Table
| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic over index tensors | per-thread pointer and stride arithmetic | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
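To make the first few rows concrete, here is a vector add in the usual Triton tutorial form; the repository's `tasks/01_vector_add/triton_skeleton.py` may organize the same ideas differently:
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                 # CUDA analogue: blockIdx.x
    offs = pid * BLOCK + tl.arange(0, BLOCK)    # analogue: blockIdx.x * blockDim.x + threadIdx.x
    mask = offs < n                             # analogue: if (idx < n)
    x = tl.load(x_ptr + offs, mask=mask)        # masked load instead of a guarded read
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)  # masked store instead of a guarded write
```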
## How To Compare Side By Side
1. Start from the reference PyTorch function and identify the mathematical operator.
2. In the Triton version, ask what one program instance owns.
3. In the CUDA version, ask what one block and one thread own.
4. Match the memory reads and writes, not just the variable names.
5. Write down where reduction state lives in each version.
6. For tiled code, identify when data moves from global memory to on-chip storage.
7. Only then compare performance.
## Rule Of Thumb
Triton usually compresses the "how" so you can focus on the blocked tensor math. CUDA exposes the "how" directly, which is why it is valuable to study both.