# kernel-lab
`kernel-lab` is a learning-first GPU kernel workbook for studying each operator across four layers:
1. PyTorch reference code
2. Triton kernels
3. Native CUDA C++ kernels
4. PyTorch custom operator integration
The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.
## Why This Repo Exists
This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.
Each operator exists for a reason:
- `vector_add`: launch geometry, indexing, bounds checks
- `row_softmax`: reductions, numerical stability, bandwidth limits
- `tiled_matmul`: tiling, data reuse, memory hierarchy
- `online_softmax`: running max / running sum recurrence
- `flash_attention_fwd`: blockwise attention, masking, online normalization
- `pytorch_custom_op`: how kernels get surfaced as framework operators
- `profiling`: how to measure what actually happened
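The running max / running sum recurrence behind `online_softmax` can be sketched in plain Python (a CPU illustration of the idea, not the lab's Triton/CUDA code):

```python
import math

def online_softmax(row):
    """One-pass softmax: keep a running max and a running sum,
    rescaling the sum whenever the max is updated."""
    running_max = float("-inf")
    running_sum = 0.0
    for x in row:
        new_max = max(running_max, x)
        # Rescale the sum accumulated under the old max, then add this element.
        running_sum = running_sum * math.exp(running_max - new_max) + math.exp(x - new_max)
        running_max = new_max
    return [math.exp(x - running_max) / running_sum for x in row]
```

The same recurrence, applied blockwise, is what makes `flash_attention_fwd` possible without materializing the full score matrix.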
## Learning Roadmap
Start with the environment sanity task, then implement kernels in this order:
1. `tasks/00_env_sanity`
2. `tasks/01_vector_add`
3. `tasks/02_row_softmax`
4. `tasks/03_tiled_matmul`
5. `tasks/04_online_softmax`
6. `tasks/05_flash_attention_fwd`
7. `tasks/06_pytorch_custom_op`
8. `tasks/07_profiling`
The detailed week-1 plan and implementation order live in `docs/roadmap.md`.
## Triton To CUDA Mapping
The core mental model is:
- Triton `program_id` maps to CUDA block-level work assignment
- Triton blocked tensor operations map to manual thread/block index arithmetic
- Triton masks map to explicit boundary checks in CUDA
- Triton load/store helpers abstract pointer math that CUDA exposes directly
- Triton hides synchronization details that CUDA requires you to reason about
See `docs/triton_vs_cuda.md` for a longer concept table.
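The first three rows of that mapping can be modeled in plain Python (names here are illustrative, not the lab's API): each "program"/"block" owns one contiguous tile, and a mask guards the ragged final tile.

```python
def vector_add_grid(a, b, block_size):
    """CPU model of the launch geometry for a 1D elementwise kernel."""
    n = len(a)
    out = [0.0] * n
    num_blocks = (n + block_size - 1) // block_size  # ceil-div grid size
    for pid in range(num_blocks):       # Triton: tl.program_id(0) / CUDA: blockIdx.x
        for tid in range(block_size):   # Triton: tl.arange(0, BLOCK) / CUDA: threadIdx.x
            i = pid * block_size + tid  # global index arithmetic
            if i < n:                   # Triton mask / CUDA bounds check
                out[i] = a[i] + b[i]
    return out
```

On a GPU both loops run in parallel; the sequential version only exists to make the index arithmetic and the boundary condition visible.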
## Repository Layout
```text
docs/       concept notes, roadmap, profiling guidance
reference/  plain PyTorch reference implementations
kernels/    Triton and CUDA learner skeletons
tasks/      workbook specs, TODO skeletons, task-local tests and benches
tests/      repository-wide checks and correctness scaffolding
bench/      cross-implementation benchmark harnesses
tools/      environment checks, profiling helpers, comparison scripts
```
## Environment Assumptions
- Python 3.10+
- PyTorch with CUDA support
- Triton installed if you want to run Triton tasks
- CUDA toolkit installed if you want to build the native extension
- A recent NVIDIA driver and a Blackwell-capable software stack
Architecture targeting is configurable:
- `KERNEL_LAB_CUDA_ARCH=120` for Python extension loading helpers
- `-DCMAKE_CUDA_ARCHITECTURES=120` for direct CMake builds
If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.
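For example, to pin the architecture once for a shell session (a sketch; change only the number for a different GPU generation):

```shell
# sm_120 targets Blackwell-class consumer GPUs; adjust for your hardware.
export KERNEL_LAB_CUDA_ARCH=120
```

Both the Python extension helpers and the CMake flag shown below can then read the same value, keeping the target in one place.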
## Install
```bash
uv sync
```
If you want commands to run inside the uv-managed environment without activating it manually, use `uv run`, for example:
```bash
uv run pytest -q
uv run python tools/check_env.py
```
## Run Environment Checks
```bash
uv run python tools/check_env.py
uv run python tools/print_device_info.py
```
## Run Tests
The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.
```bash
uv run pytest -q
```
You can also use:
```bash
make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh
```
## Run Benchmarks
Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.
```bash
uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add
```
Or run the helper:
```bash
uv run ./tools/run_all_benchmarks.sh
```
## Build The CUDA Extension
Two paths are provided:
1. CMake-first native build:
```bash
cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
cmake --build build/cuda -j
```
2. Python-driven extension loading for lab experiments:
```bash
uv run python tasks/06_pytorch_custom_op/extension_skeleton.py
```
The binding and CUDA source files build a minimal extension skeleton. The learner is expected to fill in operator registration and kernel dispatch details.
## Profile Kernels
Start from one kernel, one shape, one implementation:
```bash
uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton
```
See `docs/profiling_guide.md` for warmup, synchronization, and first metrics to inspect.
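The measurement discipline that guide describes, warm up first, synchronize before reading the clock, report a robust statistic, can be sketched framework-free; on a real GPU run, `torch.cuda.synchronize` would replace the no-op `sync` below:

```python
import time
import statistics

def bench(fn, warmup=5, iters=20, sync=lambda: None):
    """Time fn: warm up to exclude compilation/caching effects, sync the
    device before each clock read, and report the median over many runs."""
    for _ in range(warmup):
        fn()
    sync()
    samples = []
    for _ in range(iters):
        t0 = time.perf_counter()
        fn()
        sync()  # on GPU: torch.cuda.synchronize()
        samples.append(time.perf_counter() - t0)
    return statistics.median(samples)
```

Without the sync, you time only the asynchronous kernel launch; without the warmup, the first-call JIT and cache misses pollute the numbers.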
## How To Use The Workbook
- Read the `spec.md` for the current task.
- Run the reference implementation and tests first.
- Read the Triton and CUDA skeleton side by side.
- Fill in one TODO at a time.
- Re-run correctness tests before looking at benchmark numbers.
- Only profile after the kernel is correct on small shapes.
This repo is designed to make the learning path visible. The TODOs are the point.