
kernel-lab

kernel-lab is a learning-first GPU kernel workbook for studying the same operator across four layers:

  1. PyTorch reference code
  2. Triton kernels
  3. Native CUDA C++ kernels
  4. PyTorch custom operator integration

The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.

Why This Repo Exists

This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.

Each operator exists for a reason:

  • vector_add: launch geometry, indexing, bounds checks
  • row_softmax: reductions, numerical stability, bandwidth limits
  • tiled_matmul: tiling, data reuse, memory hierarchy
  • online_softmax: running max / running sum recurrence (see the sketch after this list)
  • flash_attention_fwd: blockwise attention, masking, online normalization
  • pytorch_custom_op: how kernels get surfaced as framework operators
  • profiling: how to measure what actually happened
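
For a taste of what these exercises involve, here is the online_softmax recurrence written in plain PyTorch. This is a minimal sketch of the math, not the repo's reference implementation:

import torch

def online_softmax(row: torch.Tensor) -> torch.Tensor:
    # One pass over the row in chunks, carrying a running max m and a
    # running sum s of exp(x - m); s is rescaled whenever m grows.
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    for chunk in row.split(128):
        m_new = torch.maximum(m, chunk.max())
        s = s * torch.exp(m - m_new) + torch.exp(chunk - m_new).sum()
        m = m_new
    return torch.exp(row - m) / s

x = torch.randn(1000)
torch.testing.assert_close(online_softmax(x), torch.softmax(x, dim=0))

The same recurrence, carried per block instead of per chunk, is what lets flash_attention_fwd normalize online without materializing the full attention matrix.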

Learning Roadmap

Start with the environment sanity task, then implement kernels in this order:

  1. tasks/00_env_sanity
  2. tasks/01_vector_add
  3. tasks/02_row_softmax
  4. tasks/03_tiled_matmul
  5. tasks/04_online_softmax
  6. tasks/05_flash_attention_fwd
  7. tasks/06_pytorch_custom_op
  8. tasks/07_profiling

The detailed week-1 plan and implementation order live in docs/roadmap.md.

Triton To CUDA Mapping

The core mental model is:

  • Triton program_id maps to CUDA block-level work assignment
  • Triton blocked tensor operations map to manual thread/block index arithmetic
  • Triton masks map to explicit boundary checks in CUDA
  • Triton load/store helpers abstract pointer math that CUDA exposes directly
  • Triton hides synchronization details that CUDA requires you to reason about
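
As a concrete example of this mapping, here is a minimal Triton vector_add kernel with the rough CUDA equivalent of each step in comments. It is a sketch, not the skeleton shipped in kernels/:

import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # CUDA: blockIdx.x
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # CUDA: blockIdx.x * blockDim.x + threadIdx.x
    mask = offs < n                           # CUDA: if (i < n) { ... }
    x = tl.load(x_ptr + offs, mask=mask)      # CUDA: explicit pointer arithmetic on x_ptr
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK"]),)  # CUDA: grid dimension
vector_add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)

Note what is absent on the Triton side: no threadIdx, no __syncthreads, no shared-memory declarations. Those reappear the moment you port the same kernel to CUDA C++.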

See docs/triton_vs_cuda.md for a longer concept table.

Repository Layout

docs/         concept notes, roadmap, profiling guidance
reference/    plain PyTorch reference implementations
kernels/      Triton and CUDA learner skeletons
tasks/        workbook specs, TODO skeletons, task-local tests and benches
tests/        repository-wide checks and correctness scaffolding
bench/        cross-implementation benchmark harnesses
tools/        environment checks, profiling helpers, comparison scripts

Environment Assumptions

  • Python 3.10+
  • PyTorch with CUDA support
  • Triton installed if you want to run Triton tasks
  • CUDA toolkit installed if you want to build the native extension
  • A recent NVIDIA driver and a Blackwell-capable software stack

Architecture targeting is configurable:

  • KERNEL_LAB_CUDA_ARCH=120 for Python extension loading helpers
  • -DCMAKE_CUDA_ARCHITECTURES=120 for direct CMake builds
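
For example, to pin the architecture for a single Python-driven run:

KERNEL_LAB_CUDA_ARCH=120 uv run python tasks/06_pytorch_custom_op/extension_skeleton.py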

If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.

Install

uv sync

If you want commands to run inside the uv-managed environment without activating it manually, use uv run, for example:

uv run pytest -q
uv run python tools/check_env.py

Run Environment Checks

uv run python tools/check_env.py
uv run python tools/print_device_info.py
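
These scripts report what the rest of the lab assumes about your setup. As a rough sketch of that kind of check (illustrative, not the repo's actual script):

import torch

assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # e.g. (12, 0) on Blackwell-class parts
print("torch:", torch.__version__, "cuda:", torch.version.cuda)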

Run Tests

The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.
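
The skip behavior follows a standard pytest pattern, roughly as in this sketch (the helper below is a stand-in, not the repo's actual test code):

import pytest

def learner_vector_add(x, y):
    # Stand-in for a TODO skeleton; the real ones live under kernels/ and tasks/.
    raise NotImplementedError

def test_vector_add():
    try:
        learner_vector_add([1.0], [2.0])
    except NotImplementedError:
        pytest.skip("learner TODO not implemented yet")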

uv run pytest -q

You can also use:

make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh

Run Benchmarks

Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.

uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add

Or run the helper:

uv run ./tools/run_all_benchmarks.sh

Build The CUDA Extension

Two paths are provided:

  1. CMake-first native build:
cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
cmake --build build/cuda -j
  2. Python-driven extension loading for lab experiments:
uv run python tasks/06_pytorch_custom_op/extension_skeleton.py

The binding and CUDA source files compile into a minimal extension skeleton; the learner is expected to fill in operator registration and kernel dispatch details.
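
For orientation, the Python-driven path boils down to torch.utils.cpp_extension.load, roughly as in this sketch (source paths and the exposed operator name are illustrative, not the repo's exact layout):

import torch
from torch.utils.cpp_extension import load

# JIT-compiles the binding and kernel sources into an importable module.
ext = load(
    name="kernel_lab_ext",
    sources=["binding.cpp", "vector_add.cu"],  # illustrative paths
    extra_cuda_cflags=["-gencode=arch=compute_120,code=sm_120"],
    verbose=True,
)

x = torch.randn(1024, device="cuda")
y = torch.randn_like(x)
out = ext.vector_add(x, y)  # the exposed name depends on what binding.cpp registers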

Profile Kernels

Start from one kernel, one shape, one implementation:

uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton

See docs/profiling_guide.md for warmup, synchronization, and first metrics to inspect.
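
The short version of that guidance: warm up first, then time with CUDA events, because kernel launches are asynchronous and wall-clock timing around them is misleading without synchronization. A minimal sketch:

import torch

def time_cuda_ms(fn, warmup=10, iters=100):
    for _ in range(warmup):         # trigger compilation and lazy allocations
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()        # wait for the GPU before reading the timer
    return start.elapsed_time(end) / iters  # milliseconds per call

x = torch.randn(1 << 20, device="cuda")
print(time_cuda_ms(lambda: x + x), "ms")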

How To Use The Workbook

  • Read the spec.md for the current task.
  • Run the reference implementation and tests first.
  • Read the Triton and CUDA skeleton side by side.
  • Fill in one TODO at a time.
  • Re-run correctness tests before looking at benchmark numbers (see the sketch after this list).
  • Only profile after the kernel is correct on small shapes.
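
A correctness check is a one-liner against the reference, as in this sketch (my_row_softmax is a placeholder for whichever kernel you are testing; shapes and tolerances are illustrative):

import torch

def my_row_softmax(x):
    # Placeholder: swap in your Triton or CUDA kernel call here.
    return torch.softmax(x, dim=-1)

x = torch.randn(64, 128, device="cuda")
expected = torch.softmax(x, dim=-1)  # the plain PyTorch reference
torch.testing.assert_close(my_row_softmax(x), expected, rtol=1e-3, atol=1e-3)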

This repo is designed to make the learning path visible. The TODOs are the point.