# kernel-lab

`kernel-lab` is a learning-first GPU kernel workbook for studying the same operator across four layers:

1. PyTorch reference code
2. Triton kernels
3. Native CUDA C++ kernels
4. PyTorch custom operator integration

The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.
## Why This Repo Exists

This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.

Each operator exists for a reason:

- `vector_add`: launch geometry, indexing, bounds checks
- `row_softmax`: reductions, numerical stability, bandwidth limits
- `tiled_matmul`: tiling, data reuse, memory hierarchy
- `online_softmax`: running max / running sum recurrence
- `flash_attention_fwd`: blockwise attention, masking, online normalization
- `pytorch_custom_op`: how kernels get surfaced as framework operators
- `profiling`: how to measure what actually happened
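The `online_softmax` recurrence named above is the least obvious entry, so here is a plain-Python sketch of it (no GPU required): maintain a running max and a running sum, rescaling the sum whenever the max grows, and compare against the usual two-pass stable softmax.

```python
import math

def online_softmax(xs):
    """One-pass softmax: keep a running max and a running sum that is
    rescaled whenever the max grows, then normalize at the end."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in xs:
        new_max = max(running_max, x)
        # Rescale the accumulated sum to the new max, then fold in this element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in xs]

def two_pass_softmax(xs):
    """Reference: subtract the max, exponentiate, normalize."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]
```

The same rescaling trick is what lets flash attention normalize blockwise without ever materializing the full row.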
## Learning Roadmap

Start with the environment sanity task, then implement kernels in this order:

1. `tasks/00_env_sanity`
2. `tasks/01_vector_add`
3. `tasks/02_row_softmax`
4. `tasks/03_tiled_matmul`
5. `tasks/04_online_softmax`
6. `tasks/05_flash_attention_fwd`
7. `tasks/06_pytorch_custom_op`
8. `tasks/07_profiling`

The detailed week-1 plan and implementation order live in `docs/roadmap.md`.
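As a preview of the loop structure that `tasks/03_tiled_matmul` asks for, here is the tiling idea in plain Python (lists of lists, with tile bounds clamped at the edges; real kernels would stage each tile in shared memory instead of re-reading it):

```python
def tiled_matmul(a, b, tile=2):
    """Multiply dense matrices tile by tile. Each (i0, j0, k0) triple
    processes one tile x tile block, mirroring how a GPU kernel stages
    blocks of A and B before accumulating into C."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                # min(...) clamps partial tiles at the matrix edges.
                for i in range(i0, min(i0 + tile, n)):
                    for j in range(j0, min(j0 + tile, m)):
                        for kk in range(k0, min(k0 + tile, k)):
                            c[i][j] += a[i][kk] * b[kk][j]
    return c
```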
## Triton To CUDA Mapping

The core mental model is:

- Triton `program_id` maps to CUDA block-level work assignment
- Triton blocked tensor operations map to manual thread/block index arithmetic
- Triton masks map to explicit boundary checks in CUDA
- Triton load/store helpers abstract pointer math that CUDA exposes directly
- Triton hides synchronization details that CUDA requires you to reason about

See `docs/triton_vs_cuda.md` for a longer concept table.
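The first three bullets can be made concrete on the CPU. In this plain-Python sketch of `vector_add`, `pid` plays the role of Triton's `tl.program_id(0)` (CUDA's `blockIdx.x`), the offset vector stands in for the blocked-tensor / `blockIdx.x * blockDim.x + threadIdx.x` arithmetic, and the mask is the out-of-bounds guard. The names are illustrative, not the lab's actual API:

```python
def simulated_vector_add(x, y, block_size=4):
    n = len(x)
    out = [0.0] * n
    num_programs = (n + block_size - 1) // block_size  # ceil-div grid size
    for pid in range(num_programs):          # Triton: tl.program_id(0) / CUDA: blockIdx.x
        # Per-lane global indices for this program's block of work.
        offsets = [pid * block_size + i for i in range(block_size)]
        # Triton mask / CUDA `if (idx < n)` boundary check.
        mask = [off < n for off in offsets]
        for off, ok in zip(offsets, mask):
            if ok:                           # masked load/store
                out[off] = x[off] + y[off]
    return out
```

On a GPU the `pid` loop runs in parallel across blocks and the per-lane loop across threads; the index arithmetic and mask are what you actually write in the kernels.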
## Repository Layout

```text
docs/       concept notes, roadmap, profiling guidance
reference/  plain PyTorch reference implementations
kernels/    Triton and CUDA learner skeletons
tasks/      workbook specs, TODO skeletons, task-local tests and benches
tests/      repository-wide checks and correctness scaffolding
bench/      cross-implementation benchmark harnesses
tools/      environment checks, profiling helpers, comparison scripts
```
## Environment Assumptions

- Python 3.10+
- PyTorch with CUDA support
- Triton installed if you want to run Triton tasks
- CUDA toolkit installed if you want to build the native extension
- A recent NVIDIA driver and a Blackwell-capable software stack

Architecture targeting is configurable:

- `KERNEL_LAB_CUDA_ARCH=120` for the Python extension loading helpers
- `-DCMAKE_CUDA_ARCHITECTURES=120` for direct CMake builds

If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.
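A loading helper might consume the variable along these lines (a sketch; `resolve_cuda_arch` is an illustrative name, not the repo's API):

```python
import os

def resolve_cuda_arch(default="120"):
    """Hypothetical helper: read the target architecture from the
    environment, falling back to Blackwell (sm_120) when unset."""
    return os.environ.get("KERNEL_LAB_CUDA_ARCH", default)

# e.g. turn the value into an nvcc -gencode flag:
arch = resolve_cuda_arch()
gencode = f"-gencode=arch=compute_{arch},code=sm_{arch}"
```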
## Install

```bash
uv sync
```

If you want commands to run inside the uv-managed environment without activating it manually, use `uv run`, for example:

```bash
uv run pytest -q
uv run python tools/check_env.py
```
## Run Environment Checks

```bash
uv run python tools/check_env.py
uv run python tools/print_device_info.py
```
## Run Tests

The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.

```bash
uv run pytest -q
```

You can also use:

```bash
make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh
```
## Run Benchmarks

Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.

```bash
uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add
```

Or run the helper:

```bash
uv run ./tools/run_all_benchmarks.sh
```
## Build The CUDA Extension

Two paths are provided:

1. CMake-first native build:

```bash
cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
cmake --build build/cuda -j
```

2. Python-driven extension loading for lab experiments:

```bash
uv run python tasks/06_pytorch_custom_op/extension_skeleton.py
```

The binding and CUDA source files build a minimal extension skeleton. The learner is expected to fill in operator registration and kernel dispatch details.
## Profile Kernels

Start from one kernel, one shape, one implementation:

```bash
uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton
```

See `docs/profiling_guide.md` for warmup, synchronization, and the first metrics to inspect.
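The two recurring themes there, warmup and synchronization, fit in a few lines. A CPU-only sketch of the measurement loop (`bench` is an illustrative name; on a GPU you would also synchronize before reading the clock, since kernel launches are asynchronous):

```python
import time

def bench(fn, warmup=3, iters=10):
    """Minimal timing loop: run warmup iterations first so one-time
    costs (JIT compilation, allocator growth) do not pollute the
    measurement. On GPU, call torch.cuda.synchronize() before each
    perf_counter() read so pending kernels are actually finished."""
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters  # mean seconds per call
```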
## How To Use The Workbook

- Read the `spec.md` for the current task.
- Run the reference implementation and tests first.
- Read the Triton and CUDA skeletons side by side.
- Fill in one TODO at a time.
- Re-run correctness tests before looking at benchmark numbers.
- Only profile after the kernel is correct on small shapes.

This repo is designed to make the learning path visible. The TODOs are the point.