# kernel-lab

`kernel-lab` is a learning-first GPU kernel workbook for studying the same operator across four layers:

1. PyTorch reference code
2. Triton kernels
3. Native CUDA C++ kernels
4. PyTorch custom operator integration

The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.

## Why This Repo Exists

This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.

Each operator exists for a reason:

- `vector_add`: launch geometry, indexing, bounds checks
- `row_softmax`: reductions, numerical stability, bandwidth limits
- `tiled_matmul`: tiling, data reuse, memory hierarchy
- `online_softmax`: running max / running sum recurrence
- `flash_attention_fwd`: blockwise attention, masking, online normalization
- `pytorch_custom_op`: how kernels get surfaced as framework operators
- `profiling`: how to measure what actually happened

## Learning Roadmap

Start with the environment sanity task, then implement kernels in this order:

1. `tasks/00_env_sanity`
2. `tasks/01_vector_add`
3. `tasks/02_row_softmax`
4. `tasks/03_tiled_matmul`
5. `tasks/04_online_softmax`
6. `tasks/05_flash_attention_fwd`
7. `tasks/06_pytorch_custom_op`
8. `tasks/07_profiling`

The detailed week-1 plan and implementation order live in `docs/roadmap.md`.

## Triton To CUDA Mapping

The core mental model is (see the sketch after this list for a concrete instance):

- Triton `program_id` maps to CUDA block-level work assignment
- Triton blocked tensor operations map to manual thread/block index arithmetic
- Triton masks map to explicit boundary checks in CUDA
- Triton load/store helpers abstract pointer math that CUDA exposes directly
- Triton hides synchronization details that CUDA requires you to reason about

See `docs/triton_vs_cuda.md` for a longer concept table.
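To make the mapping concrete, here is a minimal Triton `vector_add` sketch with the corresponding CUDA concepts noted inline. The names and block size are illustrative only, not the repo's actual skeleton (that lives under `kernels/` and `tasks/01_vector_add`):

```python
import torch
import triton
import triton.language as tl


@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # tl.program_id(0) plays the role of blockIdx.x in CUDA:
    # each program instance owns one BLOCK_SIZE-wide slice of the input.
    pid = tl.program_id(axis=0)
    # This blocked offset vector replaces the per-thread CUDA expression
    # `i = blockIdx.x * blockDim.x + threadIdx.x`.
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    # The mask is the Triton analogue of CUDA's `if (i < n)` bounds check.
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)  # tl.load hides the raw pointer math
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)


def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    BLOCK_SIZE = 1024
    # The grid tuple corresponds to the CUDA launch configuration's grid dimension.
    grid = (triton.cdiv(n, BLOCK_SIZE),)
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=BLOCK_SIZE)
    return out
```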
## Repository Layout

```text
docs/       concept notes, roadmap, profiling guidance
reference/  plain PyTorch reference implementations
kernels/    Triton and CUDA learner skeletons
tasks/      workbook specs, TODO skeletons, task-local tests and benches
tests/      repository-wide checks and correctness scaffolding
bench/      cross-implementation benchmark harnesses
tools/      environment checks, profiling helpers, comparison scripts
```

## Environment Assumptions

- Python 3.10+
- PyTorch with CUDA support
- Triton installed if you want to run Triton tasks
- CUDA toolkit installed if you want to build the native extension
- A recent NVIDIA driver and a Blackwell-capable software stack

Architecture targeting is configurable:

- `KERNEL_LAB_CUDA_ARCH=120` for Python extension loading helpers
- `-DCMAKE_CUDA_ARCHITECTURES=120` for direct CMake builds

If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.

## Install

```bash
uv sync
```

To run commands inside the uv-managed environment without activating it manually, use `uv run`, for example:

```bash
uv run pytest -q
uv run python tools/check_env.py
```

## Run Environment Checks

```bash
uv run python tools/check_env.py
uv run python tools/print_device_info.py
```

## Run Tests

The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.

```bash
uv run pytest -q
```

You can also use:

```bash
make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh
```

## Run Benchmarks

Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.

```bash
uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add
```

Or run the helper:

```bash
uv run ./tools/run_all_benchmarks.sh
```

## Build The CUDA Extension

Two paths are provided:

1. CMake-first native build:

   ```bash
   cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
   cmake --build build/cuda -j
   ```

2. Python-driven extension loading for lab experiments:

   ```bash
   uv run python tasks/06_pytorch_custom_op/extension_skeleton.py
   ```

The binding and CUDA source files build a minimal extension skeleton. You are expected to fill in the operator registration and kernel dispatch details.

## Profile Kernels

Start from one kernel, one shape, one implementation:

```bash
uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton
```

See `docs/profiling_guide.md` for warmup, synchronization, and the first metrics to inspect.

## How To Use The Workbook

- Read the `spec.md` for the current task.
- Run the reference implementation and tests first.
- Read the Triton and CUDA skeletons side by side.
- Fill in one TODO at a time.
- Re-run the correctness tests before looking at benchmark numbers.
- Only profile after the kernel is correct on small shapes.

This repo is designed to make the learning path visible. The TODOs are the point.
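As a sketch of what "correct on small shapes before benchmarking" looks like in practice, here is a hypothetical check against the PyTorch reference, assuming a `vector_add`-style `(x, y) -> out` signature. The helper name and shape are illustrative; the repo's real tests live under `tests/` and each task directory:

```python
import torch


def check_small_shape(impl, n=4096):
    # `impl` is a stand-in for your Triton or CUDA vector_add implementation.
    x = torch.randn(n, device="cuda")
    y = torch.randn(n, device="cuda")
    expected = x + y              # plain PyTorch reference
    actual = impl(x, y)
    torch.cuda.synchronize()      # surface asynchronous launch errors here, not later
    torch.testing.assert_close(actual, expected)
```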