Initial project scaffold
# kernel-lab

`kernel-lab` is a learning-first GPU kernel workbook for studying the same operator across four layers:

1. PyTorch reference code
2. Triton kernels
3. Native CUDA C++ kernels
4. PyTorch custom operator integration

The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.

## Why This Repo Exists

This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.

Each operator exists for a reason:

- `vector_add`: launch geometry, indexing, bounds checks
- `row_softmax`: reductions, numerical stability, bandwidth limits
- `tiled_matmul`: tiling, data reuse, memory hierarchy
- `online_softmax`: running max / running sum recurrence (see the sketch after this list)
- `flash_attention_fwd`: blockwise attention, masking, online normalization
- `pytorch_custom_op`: how kernels get surfaced as framework operators
- `profiling`: how to measure what actually happened

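The `online_softmax` recurrence is worth internalizing before the flash-attention task. Below is a minimal sketch in plain PyTorch that processes a 1-D tensor chunk by chunk; the chunk size and function name are illustrative, not the lab's API:

```python
import torch

def online_softmax_denominator(x: torch.Tensor, chunk: int = 4):
    """Single-pass running max / running sum over chunks of x.

    Maintains (m, s) so that s == sum(exp(x_seen - m)) after every chunk;
    whenever a new chunk raises the max, the old sum is rescaled.
    """
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    for block in x.split(chunk):
        m_new = torch.maximum(m, block.max())
        s = s * torch.exp(m - m_new) + torch.exp(block - m_new).sum()
        m = m_new
    return m, s  # softmax(x) == exp(x - m) / s
```

The same rescaling, applied blockwise to attention scores, is the heart of `flash_attention_fwd`.
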
## Learning Roadmap

Start with the environment sanity task, then implement kernels in this order:

1. `tasks/00_env_sanity`
2. `tasks/01_vector_add`
3. `tasks/02_row_softmax`
4. `tasks/03_tiled_matmul`
5. `tasks/04_online_softmax`
6. `tasks/05_flash_attention_fwd`
7. `tasks/06_pytorch_custom_op`
8. `tasks/07_profiling`

The detailed week-1 plan and implementation order live in `docs/roadmap.md`.

## Triton To CUDA Mapping

The core mental model is:

- Triton `program_id` maps to CUDA block-level work assignment
- Triton blocked tensor operations map to manual thread/block index arithmetic
- Triton masks map to explicit boundary checks in CUDA
- Triton load/store helpers abstract pointer math that CUDA exposes directly
- Triton hides synchronization details that CUDA requires you to reason about

See `docs/triton_vs_cuda.md` for a longer concept table.

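To make the mapping concrete, here is a minimal Triton `vector_add` sketch in the style of the upstream Triton tutorial, with comments naming the CUDA counterpart of each line. The interface is illustrative, not the lab skeleton's:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                             # CUDA: blockIdx.x
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)   # CUDA: blockIdx.x * blockDim.x + threadIdx.x
    mask = offsets < n_elements                             # CUDA: explicit `if (i < n)` bounds check
    x = tl.load(x_ptr + offsets, mask=mask)                 # CUDA: pointer arithmetic written by hand
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)  # one Triton program per 1024-element block
    vector_add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Line for line, the CUDA version is the classic `if (i < n) out[i] = x[i] + y[i];` pattern with the index computed from `blockIdx` and `threadIdx`.
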
## Repository Layout

```text
docs/       concept notes, roadmap, profiling guidance
reference/  plain PyTorch reference implementations
kernels/    Triton and CUDA learner skeletons
tasks/      workbook specs, TODO skeletons, task-local tests and benches
tests/      repository-wide checks and correctness scaffolding
bench/      cross-implementation benchmark harnesses
tools/      environment checks, profiling helpers, comparison scripts
```

## Environment Assumptions

- Python 3.10+
- PyTorch with CUDA support
- Triton installed if you want to run Triton tasks
- CUDA toolkit installed if you want to build the native extension
- A recent NVIDIA driver and a Blackwell-capable software stack

Architecture targeting is configurable:

- `KERNEL_LAB_CUDA_ARCH=120` for Python extension loading helpers
- `-DCMAKE_CUDA_ARCHITECTURES=120` for direct CMake builds

If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.

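As an illustration of how a Python-side loading helper can consume `KERNEL_LAB_CUDA_ARCH`, here is a sketch built on `torch.utils.cpp_extension.load`; the extension name and source paths are hypothetical, not this repo's actual files:

```python
import os
from torch.utils.cpp_extension import load

# KERNEL_LAB_CUDA_ARCH comes from the environment; 120 targets Blackwell.
arch = os.environ.get("KERNEL_LAB_CUDA_ARCH", "120")

# Hypothetical paths for illustration; the real loaders live in this repo.
ext = load(
    name="kernel_lab_ext",
    sources=["kernels/cuda/binding.cpp", "kernels/cuda/vector_add.cu"],
    extra_cuda_cflags=[f"-gencode=arch=compute_{arch},code=sm_{arch}"],
    verbose=True,
)
```

Keeping the architecture in one environment variable means a toolchain that does not yet know Blackwell can be worked around without touching kernel source.
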
## Install

```bash
uv sync
```

If you want commands to run inside the uv-managed environment without activating it manually, use `uv run`, for example:

```bash
uv run pytest -q
uv run python tools/check_env.py
```

## Run Environment Checks

```bash
uv run python tools/check_env.py
uv run python tools/print_device_info.py
```

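The exact checks are defined by `tools/check_env.py` itself; the sketch below shows the kind of probes such a script typically performs (the printed fields are illustrative):

```python
import torch

# Confirm a CUDA-capable PyTorch build and report the visible device.
print("torch:", torch.__version__, "| cuda runtime:", torch.version.cuda)
assert torch.cuda.is_available(), "CUDA device not visible to PyTorch"
major, minor = torch.cuda.get_device_capability(0)
print("device:", torch.cuda.get_device_name(0), f"| sm_{major}{minor}")

try:
    import triton
    print("triton:", triton.__version__)
except ImportError:
    print("triton: not installed (Triton tasks will skip)")
```
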
## Run Tests

The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.

```bash
uv run pytest -q
```

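The graceful skipping is plain pytest machinery. A sketch of the pattern, assuming nothing about this repo's actual test files:

```python
import pytest
import torch

# Module-level guard: skip every test here if Triton is not installed.
pytest.importorskip("triton")

# Per-test guard: skip (not fail) when no CUDA device is present.
@pytest.mark.skipif(not torch.cuda.is_available(), reason="requires a CUDA device")
def test_vector_add_matches_reference():
    ...  # compare the learner kernel against the PyTorch reference here
```
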
You can also use:

```bash
make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh
```

## Run Benchmarks

Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.

```bash
uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add
```

Or run the helper:

```bash
uv run ./tools/run_all_benchmarks.sh
```

## Build The CUDA Extension

Two paths are provided:

1. CMake-first native build:

```bash
cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
cmake --build build/cuda -j
```

2. Python-driven extension loading for lab experiments:

```bash
uv run python tasks/06_pytorch_custom_op/extension_skeleton.py
```

The binding and CUDA source files build a minimal extension skeleton. The learner is expected to fill in operator registration and kernel dispatch details.

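On the Python side, one common registration route is `torch.library.custom_op`; here is a sketch under that assumption (the op name is hypothetical, and the lab's skeleton may register from C++ instead):

```python
import torch

# "kernel_lab::vector_add" is a hypothetical op name for illustration.
# mutates_args=() declares the op functional (no in-place mutation).
@torch.library.custom_op("kernel_lab::vector_add", mutates_args=())
def vector_add(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return a + b  # stand-in body; the lab dispatches to its CUDA kernel here

# FakeTensor rule so the op composes with torch.compile / meta tracing.
@vector_add.register_fake
def _(a, b):
    return torch.empty_like(a)
```

Once registered, the op is callable as `torch.ops.kernel_lab.vector_add` and visible to `torch.compile`.
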
## Profile Kernels

Start from one kernel, one shape, one implementation:

```bash
uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton
```

See `docs/profiling_guide.md` for warmup, synchronization, and first metrics to inspect.

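The guide's first two concerns, warmup and synchronization, look like this in a minimal timing harness (the benchmarked function is a placeholder):

```python
import torch

def bench(fn, *args, warmup: int = 10, iters: int = 100) -> float:
    """Return mean milliseconds per call, measured with CUDA events."""
    for _ in range(warmup):          # warm up: JIT compilation, caches, clocks
        fn(*args)
    torch.cuda.synchronize()         # drain queued work before timing starts
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()         # wait for the end event to complete
    return start.elapsed_time(end) / iters

x = torch.randn(1 << 20, device="cuda")
print(f"{bench(torch.sigmoid, x):.4f} ms")
```
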
## How To Use The Workbook

- Read the `spec.md` for the current task.
- Run the reference implementation and tests first.
- Read the Triton and CUDA skeletons side by side.
- Fill in one TODO at a time.
- Re-run correctness tests before looking at benchmark numbers.
- Only profile after the kernel is correct on small shapes.

This repo is designed to make the learning path visible. The TODOs are the point.