Initial project scaffold
docs/blackwell_notes.md (new file, 20 lines)

# Blackwell Notes

This repository targets a Blackwell-style workflow, but keeps the build configuration explicit because local toolchain support may differ across systems.

## Build Guidance

- Prefer explicit architecture selection over hidden defaults.
- Use `KERNEL_LAB_CUDA_ARCH=120` for Python-side build helpers when your local environment supports it (see the sketch after this list).
- Use `-DCMAKE_CUDA_ARCHITECTURES=120` with CMake for direct native builds.
- If your toolkit does not yet accept the exact architecture value you want, adjust the build flag rather than editing the kernels.
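
To make the first flag concrete, here is a minimal sketch of how a Python-side build helper might consume `KERNEL_LAB_CUDA_ARCH`. The extension name, source path, and flag spelling are illustrative assumptions; the repo's actual helper may wire this differently.

```python
# Illustrative sketch only: one way a build helper could turn
# KERNEL_LAB_CUDA_ARCH into an nvcc architecture flag. Assumes the .cu
# file defines its own pybind bindings so load() can import it.
import os
from torch.utils.cpp_extension import load

arch = os.environ.get("KERNEL_LAB_CUDA_ARCH", "120")

ext = load(
    name="vector_add_ext",
    sources=["kernels/cuda/src/vector_add.cu"],
    extra_cuda_cflags=[f"-gencode=arch=compute_{arch},code=sm_{arch}"],
    verbose=True,
)
```
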
## What To Watch On A New GPU Generation

- compiler support for the target architecture
- PyTorch wheel compatibility
- Triton support level
- driver/toolkit version mismatches
- profiler tool compatibility

Treat environment validation as part of the lab, not as a one-time setup nuisance. A minimal check is sketched below.
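
A check along these lines (a sketch in the spirit of `tools/check_env.py`, not the script itself) surfaces most wheel and driver mismatches before you touch a kernel:

```python
# Minimal environment sanity check (illustrative sketch).
import torch

def check_env() -> None:
    print("PyTorch:", torch.__version__)
    print("CUDA available:", torch.cuda.is_available())
    if torch.cuda.is_available():
        print("CUDA toolkit (build):", torch.version.cuda)
        major, minor = torch.cuda.get_device_capability()
        print(f"Device: {torch.cuda.get_device_name(0)} (sm_{major}{minor})")

if __name__ == "__main__":
    check_env()
```
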
docs/cuda_execution_model.md (new file, 40 lines)

# CUDA Execution Model

## How To Read A CUDA Kernel

Use this short checklist every time:

1. Find the logical work unit.
   Ask what one thread, warp, or block is responsible for.
2. Decode the index math.
   Look for `blockIdx`, `threadIdx`, `blockDim`, and any derived offsets.
3. Inspect the memory accesses.
   Separate global loads, shared-memory loads, stores, and reductions.
4. Find synchronization points.
   Every `__syncthreads()` should protect a clear shared-memory phase boundary.
5. Check boundary conditions.
   Out-of-range reads and stores are a common first bug.
6. Compare against the reference implementation (a sketch of such a check follows this list).
   Make sure the math, masking, and shape conventions still match.
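
Step 6 is worth mechanizing early. A minimal sketch, assuming the compiled extension exposes a `vector_add(x, y)` binding (the name is a placeholder):

```python
# Sketch of a reference comparison on a tiny shape. `my_ext.vector_add`
# stands in for whatever binding your build produces.
import torch

def check_against_reference(my_ext, n: int = 37) -> None:
    # A size that is not a multiple of the block size exercises step 5 too.
    x = torch.randn(n, device="cuda", dtype=torch.float32)
    y = torch.randn(n, device="cuda", dtype=torch.float32)

    expected = x + y                   # PyTorch reference
    actual = my_ext.vector_add(x, y)   # kernel under test

    torch.testing.assert_close(actual, expected)
    print("match on shape", tuple(x.shape))
```
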
## Execution Hierarchy

- Grid: all blocks launched for one kernel
- Block: a cooperating team of threads
- Thread: one scalar execution context

CUDA makes several things explicit that Triton abstracts:

- manual thread/block decomposition
- pointer arithmetic
- shared-memory allocation and reuse
- synchronization
- launch configuration choices (sketched below)
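
To illustrate the last item: the launch-configuration choice for a 1D kernel is, in both frameworks, a ceiling division done on the Python side. A small sketch, with an arbitrary block size:

```python
# Launch-geometry sketch: how many blocks / program instances cover n elements.
def ceil_div(a: int, b: int) -> int:
    return (a + b - 1) // b

n = 1_000_000
BLOCK = 256

num_blocks = ceil_div(n, BLOCK)   # CUDA: <<<num_blocks, BLOCK>>>
grid = (num_blocks,)              # Triton: kernel[grid](..., BLOCK_SIZE=BLOCK)

print(num_blocks)                 # 3907: the last block is only partially full,
                                  # which is exactly why bounds checks and masks exist
```
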
## Reading Order For This Lab

- `vector_add.cu`: pure indexing
- `row_softmax.cu`: reduction structure
- `tiled_matmul.cu`: shared-memory tiling
- `online_softmax.cu`: stateful reduction recurrence
- `flash_attention_fwd.cu`: composition of multiple ideas

docs/flashattention_notes.md (new file, 28 lines)

# FlashAttention Notes

FlashAttention-style kernels are useful because the naive attention pipeline materializes large score tensors and spends too much bandwidth moving them.

## The Core Idea

Instead of:

1. computing the full score matrix
2. writing it out
3. running softmax
4. reading it back
5. multiplying by `V`

you process attention block by block and keep more intermediate state on chip.
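
For reference, the naive pipeline looks roughly like this in PyTorch (a minimal single-head sketch without masking or batching):

```python
# Naive attention reference: materializes the full (N, N) score matrix.
import math
import torch

def naive_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (N, d) for one head
    scores = q @ k.T / math.sqrt(q.shape[-1])   # full (N, N) tensor in global memory
    probs = torch.softmax(scores, dim=-1)        # read it back, normalize, write again
    return probs @ v                             # one more full pass over (N, N)
```
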
## Why Online Softmax Matters

Blockwise processing changes the normalization problem. You cannot assume you have seen the full row. The running max / running sum recurrence lets you update normalization state incrementally without losing numerical stability.
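
A minimal sketch of that recurrence, written in plain PyTorch over one row so the state stays visible (the block size is arbitrary; FlashAttention keeps the same kind of state per query block):

```python
# Online softmax over one row processed in blocks.
# State: running max m and running sum of shifted exponentials l.
import torch

def online_softmax_row(row: torch.Tensor, block: int = 128) -> torch.Tensor:
    m = row.new_tensor(float("-inf"))
    l = row.new_tensor(0.0)
    for start in range(0, row.numel(), block):
        x = row[start:start + block]
        m_new = torch.maximum(m, x.max())
        l = l * torch.exp(m - m_new) + torch.exp(x - m_new).sum()
        m = m_new
    # A second pass normalizes with the final (m, l); FlashAttention instead
    # folds this rescaling into the output accumulator as blocks arrive.
    return torch.exp(row - m) / l
```
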
## What This Lab Covers

- forward pass only
- small-shape correctness first
- optional causal masking
- side-by-side Triton and CUDA skeletons

This repo intentionally stops short of a polished production FlashAttention implementation. The point is to expose the algorithmic structure.

docs/gpu_basics.md (new file, 30 lines)

# GPU Basics

This lab assumes you are learning GPU kernels as structured data-parallel programs.

## Core Ideas

- GPU throughput comes from massive parallelism, not a single fast thread.
- Launch geometry determines which logical elements each thread or program instance owns.
- Global memory is large and slow relative to on-chip storage.
- Kernel design is often about reducing memory traffic and increasing reuse.

## Terms To Keep Straight

- thread: the smallest execution entity in CUDA
- warp: a hardware scheduling group, usually 32 threads
- block: a cooperating group of threads with shared memory access
- grid: the full launch of all blocks
- program instance: Triton's block-level work abstraction

## Mental Model For This Repo

Each task asks the same questions in both Triton and CUDA (a small ownership example follows the list):

- What data does one unit of work own?
- How is that ownership computed from launch indices?
- Which reads are coalesced or contiguous?
- Which intermediate values must be reduced?
- Which values should be reused on chip?
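
To make the first two questions concrete, here is a tiny ownership sketch for a 1D elementwise kernel (the names and block size are illustrative):

```python
# Ownership sketch: which elements does block / program instance `pid` own?
def owned_indices(pid: int, block: int, n: int) -> range:
    start = pid * block
    stop = min(start + block, n)   # the last block may be partial
    return range(start, stop)

idx = owned_indices(3, 256, 1000)
print(idx.start, idx.stop - 1)     # 768 999
```
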

Keep a notebook. Write down the answers before you code.

docs/profiling_guide.md (new file, 87 lines)

# Profiling Guide

## Profile One Kernel At A Time

Good profiling starts narrow:

- one implementation
- one shape
- one dtype
- one device
- one command you can rerun

If you profile a full training script too early, you will not know which kernel you are looking at.

## Why Warmup Matters

The first iterations may include:

- lazy module loading
- JIT compilation
- cache effects
- allocator setup

Warm up first, then measure.

## Why Synchronization Matters

GPU work is asynchronous with respect to Python. If you do not synchronize before stopping a timer, you usually measure launch overhead instead of kernel runtime.

Use `torch.cuda.synchronize()` around timed regions, as in the sketch below.
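
A minimal timing harness that applies both points, warmup and synchronized repetition, might look like this; `fn` is a placeholder for the launch you actually want to measure:

```python
# Warmup + synchronized timing sketch.
import statistics
import time
import torch

def time_gpu(fn, warmup: int = 10, reps: int = 50) -> float:
    for _ in range(warmup):          # absorb JIT, caching, allocator setup
        fn()
    torch.cuda.synchronize()

    times = []
    for _ in range(reps):
        start = time.perf_counter()
        fn()
        torch.cuda.synchronize()     # wait for the kernel, not just the launch
        times.append(time.perf_counter() - start)
    return statistics.median(times)  # the median, not only the minimum
```
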

## How To Avoid Misleading Timings

- keep shapes fixed
- use multiple repetitions
- report median, not only minimum
- separate correctness from performance testing
- compare implementations under the same dtype and device conditions
- check that all inputs are already on the GPU

## First Metrics To Inspect

- kernel duration
- achieved memory throughput
- occupancy
- DRAM transactions or bandwidth
- shared-memory throughput when tiling is relevant
- eligible warps per cycle when investigating latency hiding

## Practical `ncu` Examples

```bash
ncu --set full --target-processes all \
python bench/bench_vector_add.py --device cuda --mode cuda
```

```bash
ncu --metrics sm__throughput.avg.pct_of_peak_sustained_elapsed,\
dram__throughput.avg.pct_of_peak_sustained_elapsed \
python bench/bench_softmax.py --device cuda --mode triton
```

## Practical `nsys` Examples

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/attention_triton \
python bench/bench_attention.py --device cuda --mode triton
```

```bash
nsys profile --trace=cuda,nvtx,osrt --sample=none \
-o profile-output/matmul_cuda \
python bench/bench_matmul.py --device cuda --mode cuda
```

## Checklist Before Trusting A Benchmark Result

- Was there a warmup phase?
- Was the device synchronized before and after timing?
- Did all implementations run the same math?
- Were outputs checked against a reference?
- Were shapes and dtypes identical?
- Was one implementation silently skipped or falling back to CPU?
- Did you report median time over several repetitions?
- Is the measured quantity bandwidth-bound or compute-bound?
- Did you accidentally include setup or compilation time?

docs/roadmap.md (new file, 75 lines)

# Roadmap

## Week 1 Study Plan

Day 1:

- Run `tools/check_env.py`
- Read `docs/gpu_basics.md`
- Read `docs/cuda_execution_model.md`
- Inspect `reference/torch_vector_add.py`
- Implement or partially implement `tasks/01_vector_add/triton_skeleton.py`

Day 2:

- Read `docs/triton_vs_cuda.md`
- Inspect `kernels/cuda/src/vector_add.cu`
- Fill in the vector add indexing TODOs in Triton and CUDA
- Run `pytest -q tasks/01_vector_add/test_task.py`

Day 3:

- Read `reference/torch_row_softmax.py`
- Read `tasks/02_row_softmax/spec.md`
- Implement numerically stable row softmax in Triton first (a reference sketch follows)
- Compare against the CUDA skeleton and map the reduction strategy
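
Here, "numerically stable" means shifting by the row maximum before exponentiating. A minimal PyTorch sketch of the target behavior (the repo's reference file may differ in details):

```python
# Numerically stable row softmax reference (what the Triton/CUDA kernels
# should reproduce on small shapes).
import torch

def row_softmax_reference(x: torch.Tensor) -> torch.Tensor:
    # x: (rows, cols); softmax over each row
    row_max = x.max(dim=-1, keepdim=True).values
    e = torch.exp(x - row_max)           # shift by the max for stability
    return e / e.sum(dim=-1, keepdim=True)
```
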
Day 4:

- Study `tasks/03_tiled_matmul/spec.md`
- Draw the tile decomposition on paper
- Implement one matmul tile path with correctness as the only priority

Day 5:

- Read `docs/flashattention_notes.md`
- Read `tasks/04_online_softmax/spec.md`
- Derive the running max / running sum recurrence informally

Day 6:

- Inspect `tasks/05_flash_attention_fwd/spec.md`
- Trace the PyTorch reference line by line
- Annotate where Q/K/V loads, score computation, normalization, and output accumulation happen

Day 7:

- Read `docs/profiling_guide.md`
- Run one benchmark and one profiler command
- Write down which numbers changed after warmup and synchronization

## Recommended TODO Order

1. Environment checks
2. Vector add (Triton)
3. Vector add (CUDA)
4. Row softmax (Triton)
5. Row softmax (CUDA)
6. Tiled matmul (Triton)
7. Tiled matmul (CUDA)
8. Online softmax (Triton)
9. Online softmax (CUDA)
10. Flash attention forward (Triton)
11. Flash attention forward (CUDA)
12. PyTorch custom op binding
13. Profiling passes and benchmark validation

## What To Focus On First

- Correctness on tiny shapes
- Clear index math
- Explicit shape assumptions
- Numerically stable reductions
- Repeatable measurement

Do not chase peak performance before you can explain the memory traffic and launch geometry of your kernel.

docs/triton_vs_cuda.md (new file, 30 lines)

# Triton Vs CUDA

## Concept Mapping Table

| Triton concept | CUDA concept | What to notice |
| --- | --- | --- |
| `tl.program_id(axis=0)` | `blockIdx.x` and block ownership | Both assign a chunk of logical work to a block-scale unit |
| `tl.arange(0, BLOCK)` | `threadIdx.x` or manual lane-local offsets | Triton expresses vectors of indices directly |
| masked `tl.load` / `tl.store` | explicit `if (idx < n)` checks | Same boundary problem, different syntax |
| blocked tensor operations | thread/block decomposition plus loops | Triton lifts index sets into tensor expressions |
| pointer arithmetic in element units | byte-addressed pointer math and indexing | CUDA makes layout mechanics more visible |
| implicit vectorized math | manual scalar or vector intrinsics | Triton often reads like array algebra |
| autotuned launch parameters | manual block-size tuning | Both still depend on the memory hierarchy |
| block pointers and tile views | shared memory tiles and cooperative loads | The same reuse idea shows up with different APIs |
| reduction combinators | warp/block reductions | Same algorithmic structure, different implementation burden |
| masks and predicates | control flow and bounds checks | Divergence and predication still matter |
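
The first three rows of this table show up together even in the simplest kernel. A sketch of a Triton vector add, essentially the canonical tutorial kernel, makes the mapping concrete (names and block size are placeholders):

```python
# Triton side of the mapping: program_id -> block ownership,
# tl.arange -> per-lane offsets, mask -> the `if (idx < n)` bounds check.
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)                  # which block of work am I?
    offsets = pid * BLOCK + tl.arange(0, BLOCK)  # element indices this program owns
    mask = offsets < n                           # boundary condition, as a predicate
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)
```

The CUDA counterpart spells the same three steps per thread: `int idx = blockIdx.x * blockDim.x + threadIdx.x;` followed by an `if (idx < n)` guard around the load and store.
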
## How To Compare Side By Side

1. Start from the reference PyTorch function and identify the mathematical operator.
2. In the Triton version, ask what one program instance owns.
3. In the CUDA version, ask what one block and one thread own.
4. Match the memory reads and writes, not just the variable names.
5. Write down where reduction state lives in each version.
6. For tiled code, identify when data moves from global memory to on-chip storage.
7. Only then compare performance.

## Rule Of Thumb

Triton usually compresses the "how" so you can focus on the blocked tensor math. CUDA exposes the "how" directly, which is why it is valuable to study both.