Initial project scaffold
# kernel-lab

`kernel-lab` is a learning-first GPU kernel workbook for studying the same operator across four layers:

1. PyTorch reference code
2. Triton kernels
3. Native CUDA C++ kernels
4. PyTorch custom operator integration

The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.

## Why This Repo Exists

This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.

Each operator exists for a reason:

- `vector_add`: launch geometry, indexing, bounds checks
- `row_softmax`: reductions, numerical stability, bandwidth limits
- `tiled_matmul`: tiling, data reuse, memory hierarchy
- `online_softmax`: running max / running sum recurrence (see the sketch after this list)
- `flash_attention_fwd`: blockwise attention, masking, online normalization
- `pytorch_custom_op`: how kernels get surfaced as framework operators
- `profiling`: how to measure what actually happened

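The `online_softmax` recurrence is worth internalizing before the flash-attention task. Below is a minimal sketch in plain PyTorch that processes a 1-D tensor chunk by chunk; the chunk size and function name are illustrative, not the lab's API:

```python
import torch

def online_softmax_denominator(x: torch.Tensor, chunk: int = 4):
    """Single-pass running max / running sum over chunks of x.

    Maintains (m, s) so that s == sum(exp(x_seen - m)) after every chunk;
    whenever a new chunk raises the max, the old sum is rescaled.
    """
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    for block in x.split(chunk):
        m_new = torch.maximum(m, block.max())
        s = s * torch.exp(m - m_new) + torch.exp(block - m_new).sum()
        m = m_new
    return m, s  # softmax(x) == exp(x - m) / s
```

The same rescaling, applied blockwise to attention scores, is the heart of `flash_attention_fwd`.
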
## Learning Roadmap

Start with the environment sanity task, then implement kernels in this order:

1. `tasks/00_env_sanity`
2. `tasks/01_vector_add`
3. `tasks/02_row_softmax`
4. `tasks/03_tiled_matmul`
5. `tasks/04_online_softmax`
6. `tasks/05_flash_attention_fwd`
7. `tasks/06_pytorch_custom_op`
8. `tasks/07_profiling`

The detailed week-1 plan and implementation order live in `docs/roadmap.md`.

## Triton To CUDA Mapping

The core mental model is:

- Triton `program_id` maps to CUDA block-level work assignment
- Triton blocked tensor operations map to manual thread/block index arithmetic
- Triton masks map to explicit boundary checks in CUDA
- Triton load/store helpers abstract pointer math that CUDA exposes directly
- Triton hides synchronization details that CUDA requires you to reason about

See `docs/triton_vs_cuda.md` for a longer concept table.

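To make the mapping concrete, here is a minimal Triton `vector_add` sketch in the style of the upstream Triton tutorial, with comments naming the CUDA counterpart of each line. The interface is illustrative, not the lab skeleton's:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                             # CUDA: blockIdx.x
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)   # CUDA: blockIdx.x * blockDim.x + threadIdx.x
    mask = offsets < n_elements                             # CUDA: explicit `if (i < n)` bounds check
    x = tl.load(x_ptr + offsets, mask=mask)                 # CUDA: pointer arithmetic written by hand
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one Triton program per 1024-element block
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Line for line, the CUDA version is the classic `if (i < n) out[i] = x[i] + y[i];` pattern with the index computed from `blockIdx` and `threadIdx`.
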
## Repository Layout

```text
docs/       concept notes, roadmap, profiling guidance
reference/  plain PyTorch reference implementations
kernels/    Triton and CUDA learner skeletons
tasks/      workbook specs, TODO skeletons, task-local tests and benches
tests/      repository-wide checks and correctness scaffolding
bench/      cross-implementation benchmark harnesses
tools/      environment checks, profiling helpers, comparison scripts
```

## Environment Assumptions

- Python 3.10+
- PyTorch with CUDA support
- Triton installed if you want to run Triton tasks
- CUDA toolkit installed if you want to build the native extension
- A recent NVIDIA driver and a Blackwell-capable software stack

Architecture targeting is configurable:

- `KERNEL_LAB_CUDA_ARCH=120` for Python extension loading helpers
- `-DCMAKE_CUDA_ARCHITECTURES=120` for direct CMake builds

If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.

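As an illustration of how a Python-side loading helper can consume `KERNEL_LAB_CUDA_ARCH`, here is a sketch built on `torch.utils.cpp_extension.load`; the extension name and source paths are hypothetical, not this repo's actual files:

```python
import os
from torch.utils.cpp_extension import load

# KERNEL_LAB_CUDA_ARCH comes from the environment; 120 targets Blackwell.
arch = os.environ.get("KERNEL_LAB_CUDA_ARCH", "120")

# Hypothetical paths for illustration; the real loaders live in this repo.
ext = load(
    name="kernel_lab_ext",
    sources=["kernels/cuda/binding.cpp", "kernels/cuda/vector_add.cu"],
    extra_cuda_cflags=[f"-gencode=arch=compute_{arch},code=sm_{arch}"],
    verbose=True,
)
```

Keeping the architecture in one environment variable means a toolchain that does not yet know Blackwell can be worked around without touching kernel source.
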
## Install

```bash
uv sync
```

If you want commands to run inside the uv-managed environment without activating it manually, use `uv run`, for example:

```bash
uv run pytest -q
uv run python tools/check_env.py
```

## Run Environment Checks

```bash
uv run python tools/check_env.py
uv run python tools/print_device_info.py
```

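The exact checks are defined by `tools/check_env.py` itself; the sketch below shows the kind of probes such a script typically performs (the printed fields are illustrative):

```python
import torch

# Confirm a CUDA-capable PyTorch build and report the visible device.
print("torch:", torch.__version__, "| cuda runtime:", torch.version.cuda)
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"
major, minor = torch.cuda.get_device_capability(0)
print("device:", torch.cuda.get_device_name(0), f"| sm_{major}{minor}")

try:
    import triton
    print("triton:", triton.__version__)
except ImportError:
    print("triton: not installed (Triton tasks will skip)")
```
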
## Run Tests

The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.

```bash
uv run pytest -q
```

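The graceful skipping is plain pytest machinery. A sketch of the pattern, assuming nothing about this repo's actual test files:

```python
import pytest
import torch

# Module-level guard: skip every test here if Triton is not installed.
pytest.importorskip("triton")

# Per-test guard: skip (not fail) when no CUDA device is present.
@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA device")
def test_vector_add_matches_reference():
    ...  # compare the learner kernel against the PyTorch reference here
```
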
You can also use:

```bash
make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh
```

## Run Benchmarks

Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.

```bash
uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add
```

Or run the helper:

```bash
uv run ./tools/run_all_benchmarks.sh
```

## Build The CUDA Extension

Two paths are provided:

1. CMake-first native build:

```bash
cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
cmake --build build/cuda -j
```

2. Python-driven extension loading for lab experiments:

```bash
uv run python tasks/06_pytorch_custom_op/extension_skeleton.py
```

The binding and CUDA source files build a minimal extension skeleton. The learner is expected to fill in operator registration and kernel dispatch details.

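On the Python side, one common registration route is `torch.library.custom_op`; here is a sketch under that assumption (the op name is hypothetical, and the lab's skeleton may register from C++ instead):

```python
import torch

# "kernel_lab::vector_add" is a hypothetical op name for illustration.
# mutates_args=() declares the op functional (no in-place mutation).
@torch.library.custom_op("kernel_lab::vector_add", mutates_args=())
def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a + b  # stand-in body; the lab dispatches to its CUDA kernel here

# FakeTensor rule so the op composes with torch.compile / meta tracing.
@vector_add.register_fake
def _(a, b):
    return torch.empty_like(a)
```

Once registered, the op is callable as `torch.ops.kernel_lab.vector_add` and visible to `torch.compile`.
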
## Profile Kernels

Start from one kernel, one shape, one implementation:

```bash
uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton
```

See `docs/profiling_guide.md` for warmup, synchronization, and first metrics to inspect.

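The guide's first two concerns, warmup and synchronization, look like this in a minimal timing harness (the benchmarked function is a placeholder):

```python
import torch

def bench(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return mean milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):          # warm up: JIT compilation, caches, clocks
        fn(*args)
    torch.cuda.synchronize()         # drain queued work before timing starts
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()         # wait for the end event to complete
    return start.elapsed_time(end) / iters

x = torch.randn(1 << 20, device="cuda")
print(f"{bench(torch.sigmoid, x):.4f} ms")
```
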
## How To Use The Workbook

- Read the `spec.md` for the current task.
- Run the reference implementation and tests first.
- Read the Triton and CUDA skeletons side by side.
- Fill in one TODO at a time.
- Re-run correctness tests before looking at benchmark numbers.
- Only profile after the kernel is correct on small shapes.

This repo is designed to make the learning path visible. The TODOs are the point.