Initial project scaffold

2026-04-10 13:22:19 +00:00
commit 7fa69b1354
94 changed files with 3964 additions and 0 deletions

docs/blackwell_notes.md Normal file

@@ -0,0 +1,20 @@
# Blackwell Notes
This repository targets a Blackwell-style workflow, but keeps the build configuration explicit because local toolchain support may differ across systems.
## Build Guidance
- Prefer explicit architecture selection over hidden defaults.
- Use `KERNEL_LAB_CUDA_ARCH=120` for Python-side build helpers when your local environment supports it (see the sketch after this list).
- Use `-DCMAKE_CUDA_ARCHITECTURES=120` with CMake for direct native builds.
- If your toolkit does not yet accept the exact architecture value you want, adjust the build flag rather than editing the kernels.
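As a concrete illustration of the first two bullets, a Python-side helper might translate the environment variable into an explicit `nvcc` flag. The function below is a hypothetical sketch, not part of this repo's build scripts:
```python
import os

def cuda_arch_flags(default_arch: str = "120") -> list[str]:
    """Hypothetical helper: map KERNEL_LAB_CUDA_ARCH onto an explicit nvcc flag."""
    arch = os.environ.get("KERNEL_LAB_CUDA_ARCH", default_arch)
    # One explicit -gencode entry instead of a hidden toolchain default.
    return [f"-gencode=arch=compute_{arch},code=sm_{arch}"]

# For example, the result could be passed as extra_cuda_cflags to
# torch.utils.cpp_extension.load(...) when the local toolkit accepts the value.
```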
## What To Watch On A New GPU Generation
- compiler support for the target architecture
- PyTorch wheel compatibility
- Triton support level
- driver/toolkit mismatch
- profiler tool compatibility
Treat environment validation as part of the lab, not as a one-time setup nuisance.
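The repository's `tools/check_env.py` is the authoritative check; the sketch below only illustrates the kind of facts worth confirming, using standard PyTorch and Triton introspection:
```python
import torch

def check_environment() -> None:
    # PyTorch wheel / toolkit compatibility: the CUDA version this torch build targets.
    print("torch:", torch.__version__, "| built for CUDA:", torch.version.cuda)
    print("cuda available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        # Compiler / architecture support: the compute capability of the local GPU.
        major, minor = torch.cuda.get_device_capability()
        print("device:", torch.cuda.get_device_name(), f"(sm_{major}{minor})")
    try:
        import triton
        print("triton:", triton.__version__)  # Triton support level
    except ImportError:
        print("triton: not installed")
```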

docs/cuda_execution_model.md Normal file

@@ -0,0 +1,40 @@
# CUDA Execution Model
## How To Read A CUDA Kernel
Use this short checklist every time:
1. Find the logical work unit.
Ask what one thread, warp, or block is responsible for.
2. Decode the index math.
Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets; a plain-Python emulation of this index math follows the checklist.
3. Inspect the memory accesses.
Separate global loads, shared memory loads, stores, and reductions.
4. Find synchronization points.
Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check boundary conditions.
Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation.
Make sure the math, masking, and shape conventions still match.
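Steps 2 and 5 can be rehearsed in plain Python before reading real CUDA. The sketch below emulates the index math of a one-dimensional vector-add launch on the CPU; no GPU or CUDA toolkit is involved:
```python
def emulated_vector_add(x, y, block_dim=256):
    """Plain-Python emulation of 1D CUDA index math (illustration only)."""
    n = len(x)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim     # ceil-divide, like the launch config
    for block_idx in range(grid_dim):               # blockIdx.x
        for thread_idx in range(block_dim):         # threadIdx.x
            i = block_idx * block_dim + thread_idx  # derived global offset (step 2)
            if i < n:                               # boundary check (step 5)
                out[i] = x[i] + y[i]
    return out
```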
## Execution Hierarchy
- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context
CUDA makes several things explicit that Triton abstracts:
- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices
## Reading Order For This Lab
- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence
- `flash_attention_fwd.cu`: composition of multiple ideas

docs/flashattention_notes.md Normal file

@@ -0,0 +1,28 @@
# FlashAttention Notes
FlashAttention-style kernels are useful because the naive attention pipeline materializes large score tensors in global memory and spends much of its memory bandwidth just moving them.
## The Core Idea
Instead of:
1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`
you process attention block by block and keep more intermediate state on chip.
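For reference, the naive pipeline is only a few lines of PyTorch. Shapes and names here are illustrative (one head, no masking); the repository's own reference implementation may differ in detail:
```python
import torch

def naive_attention(q, k, v):
    # q, k, v: (seq_len, head_dim) tensors for a single head, no masking.
    scale = q.shape[-1] ** -0.5
    scores = (q @ k.transpose(-2, -1)) * scale  # full score matrix is materialized
    probs = torch.softmax(scores, dim=-1)       # softmax needs every element of each row
    return probs @ v                            # scores are read again to weight V
```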
## Why Online Softmax Matters
Blockwise processing changes the normalization problem. You cannot assume you have seen the full row. The running max / running sum recurrence lets you update normalization state incrementally without losing numerical stability.
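A minimal sketch of that recurrence over one row, in plain Python, assuming only that the row arrives in fixed-size blocks:
```python
import math

def online_softmax_state(row, block=4):
    """Running max m and running sum s, updated block by block.

    After the loop, softmax(row)[i] == exp(row[i] - m) / s, matching a
    full-row numerically stable softmax.
    """
    m, s = float("-inf"), 0.0
    for start in range(0, len(row), block):
        chunk = row[start:start + block]
        m_new = max(m, max(chunk))
        # Rescale the old sum into the new reference frame, then add the new block.
        s = s * math.exp(m - m_new) + sum(math.exp(x - m_new) for x in chunk)
        m = m_new
    return m, s
```
The FlashAttention forward pass applies the same rescaling to the partially accumulated output, not just to the running sum.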
## What This Lab Covers
- forward pass only
- small-shape correctness first
- optional causal masking
- side-by-side Triton and CUDA skeletons
This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.

docs/gpu_basics.md Normal file

@@ -0,0 +1,30 @@
# GPU Basics
This lab assumes you are learning GPU kernels as structured data-parallel programs.
## Core Ideas
- GPU throughput comes from massive parallelism, not a single fast thread.
- Launch geometry determines which logical elements each thread or program instance owns.
- Global memory is large and slow relative to on-chip storage.
- Kernel design is often about reducing memory traffic and increasing reuse.
## Terms To Keep Straight
- thread: the smallest execution entity in CUDA
- warp: a hardware scheduling group, usually 32 threads
- block: a cooperating group of threads with shared memory access
- grid: the full launch of all blocks
- program instance: Triton's block-level work abstraction
## Mental Model For This Repo
Each task asks the same questions in both Triton and CUDA:
- What data does one unit of work own?
- How is that ownership computed from launch indices?
- Which reads are coalesced or contiguous?
- Which intermediate values must be reduced?
- Which values should be reused on chip?
Keep a notebook. Write down the answers before you code.
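As a worked example, a row softmax answers those questions like this: one unit of work owns one row, ownership is simply the row index, the row read is contiguous, and the row max and row sum are the reduced intermediates. A NumPy sketch of that structure (not a kernel, just the ownership and reduction pattern):
```python
import numpy as np

def row_softmax_by_row_ownership(x):
    x = np.asarray(x, dtype=np.float32)
    out = np.empty_like(x)
    for row in range(x.shape[0]):  # in a real kernel, this loop is the launch grid
        v = x[row]                 # contiguous read of the owned row
        m = v.max()                # first reduced intermediate: the row max
        e = np.exp(v - m)          # values a kernel would keep on chip for reuse
        out[row] = e / e.sum()     # second reduced intermediate: the row sum
    return out
```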

docs/profiling_guide.md Normal file

@@ -0,0 +1,87 @@
# Profiling Guide
## Profile One Kernel At A Time
Good profiling starts narrow:
- one implementation
- one shape
- one dtype
- one device
- one command you can rerun
If you profile a full training script too early, you will not know which kernel you are looking at.
## Why Warmup Matters
The first iterations may include:
- lazy module loading
- JIT compilation
- cache effects
- allocator setup
Warm up first, then measure.
## Why Synchronization Matters
GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.
Use `torch.cuda.synchronize()` around timed regions.
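A minimal timing helper along those lines, assuming the workload is a Python callable that launches GPU work (the helper name is illustrative, not part of this repo):
```python
import time
import torch

def time_gpu_callable(fn, *args, warmup=10, reps=50):
    """Median wall-clock seconds per call of fn(*args) on the current CUDA device."""
    for _ in range(warmup):          # absorb compilation, caching, and allocator setup
        fn(*args)
    torch.cuda.synchronize()         # make sure all warmup work has finished
    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn(*args)
        torch.cuda.synchronize()     # wait for the kernel, not just the launch
        times.append(time.perf_counter() - start)
    times.sort()
    return times[len(times) // 2]    # report the median, not only the minimum
```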
## How To Avoid Misleading Timings
- keep shapes fixed
- use multiple repetitions
- report median, not only minimum
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU
## First Metrics To Inspect
- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding
## Practical `ncu` Examples
```bash
ncu --set full --target-processes all \
python bench/bench_vector_add.py --device cuda --mode cuda
```
```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
python bench/bench_softmax.py --device cuda --mode triton
```
## Practical `nsys` Examples
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/attention_triton \
python bench/bench_attention.py --device cuda --mode triton
```
```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/matmul_cuda \
python bench/bench_matmul.py --device cuda --mode cuda
```
## Checklist Before Trusting A Benchmark Result
- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference?
- Were shapes and dtypes identical?
- Was one implementation silently skipped, or did it fall back to the CPU?
- Did you report median time over several repetitions?
- Is the measured quantity bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?

docs/roadmap.md Normal file

@@ -0,0 +1,75 @@
# Roadmap
## Week 1 Study Plan
Day 1:
- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py`
Day 2:
- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in vector add indexing TODOs in Triton and CUDA
- Run `pytest -q tasks/01_vector_add/test_task.py`
Day 3:
- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first
- Compare against the CUDA skeleton and map the reduction strategy
Day 4:
- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path, prioritizing correctness over speed
Day 5:
- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally
Day 6:
- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen
Day 7:
- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization
## Recommended TODO Order
1. Environment checks
2. Vector add Triton
3. Vector add CUDA
4. Row softmax Triton
5. Row softmax CUDA
6. Tiled matmul Triton
7. Tiled matmul CUDA
8. Online softmax Triton
9. Online softmax CUDA
10. Flash attention forward Triton
11. Flash attention forward CUDA
12. PyTorch custom op binding
13. Profiling passes and benchmark validation
## What To Focus On First
- Correctness on tiny shapes
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement
Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.

docs/triton_vs_cuda.md Normal file

@@ -0,0 +1,30 @@
# Triton Vs CUDA
## Concept Mapping Table
| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic over index tensors | per-thread pointer and stride arithmetic | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
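To make the first few rows concrete, here is a vector add in the usual Triton tutorial form; the repository's `tasks/01_vector_add/triton_skeleton.py` may organize the same ideas differently:
```python
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                 # CUDA analogue: blockIdx.x
    offs = pid * BLOCK + tl.arange(0, BLOCK)    # analogue: blockIdx.x * blockDim.x + threadIdx.x
    mask = offs < n                             # analogue: if (idx < n)
    x = tl.load(x_ptr + offs, mask=mask)        # masked load instead of a guarded read
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)  # masked store instead of a guarded write
```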
## How To Compare Side By Side
1. Start from the reference PyTorch function and identify the mathematical operator.
2. In the Triton version, ask what one program instance owns.
3. In the CUDA version, ask what one block and one thread own.
4. Match the memory reads and writes, not just the variable names.
5. Write down where reduction state lives in each version.
6. For tiled code, identify when data moves from global memory to on-chip storage.
7. Only then compare performance.
## Rule Of Thumb
Triton usually compresses the "how" so you can focus on the blocked tensor math. CUDA exposes the "how" directly, which is why it is valuable to study both.