
kernel-lab

kernel-lab is a learning-first GPU kernel workbook for studying the same operator across four layers:

  1. PyTorch reference code
  2. Triton kernels
  3. Native CUDA C++ kernels
  4. PyTorch custom operator integration

The repository is intentionally not a finished kernel library. The core Triton and CUDA implementations are left as TODO-driven lab exercises so you can study indexing, reductions, tiling, memory movement, correctness checks, and profiling in a controlled way.

Why This Repo Exists

This lab is aimed at a modern NVIDIA workflow with a Blackwell-class consumer GPU such as an RTX 5090. The exercises themselves are mostly architecture-generic, so the project name stays broad while the build and docs keep hardware targeting explicit.

Each operator exists for a reason:

  • vector_add: launch geometry, indexing, bounds checks
  • row_softmax: reductions, numerical stability, bandwidth limits
  • tiled_matmul: tiling, data reuse, memory hierarchy
  • online_softmax: running max / running sum recurrence (see the sketch after this list)
  • flash_attention_fwd: blockwise attention, masking, online normalization
  • pytorch_custom_op: how kernels get surfaced as framework operators
  • profiling: how to measure what actually happened
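
For a taste of what these exercises involve, here is the online_softmax recurrence written in plain PyTorch. This is a minimal sketch of the math, not the repo's reference implementation:

import torch

def online_softmax(row: torch.Tensor) -> torch.Tensor:
    # One pass over the row in chunks, carrying a running max m and a
    # running sum s of exp(x - m); s is rescaled whenever m grows.
    m = torch.tensor(float("-inf"))
    s = torch.tensor(0.0)
    for chunk in row.split(128):
        m_new = torch.maximum(m, chunk.max())
        s = s * torch.exp(m - m_new) + torch.exp(chunk - m_new).sum()
        m = m_new
    return torch.exp(row - m) / s

x = torch.randn(1000)
torch.testing.assert_close(online_softmax(x), torch.softmax(x, dim=0))

The same recurrence, carried per block instead of per chunk, is what lets flash_attention_fwd normalize online without materializing the full attention matrix.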

Learning Roadmap

Start with the environment sanity task, then implement kernels in this order:

  1. tasks/00_env_sanity
  2. tasks/01_vector_add
  3. tasks/02_row_softmax
  4. tasks/03_tiled_matmul
  5. tasks/04_online_softmax
  6. tasks/05_flash_attention_fwd
  7. tasks/06_pytorch_custom_op
  8. tasks/07_profiling

The detailed week-1 plan and implementation order live in docs/roadmap.md.

Triton To CUDA Mapping

The core mental model is:

  • Triton program_id maps to CUDA block-level work assignment
  • Triton blocked tensor operations map to manual thread/block index arithmetic
  • Triton masks map to explicit boundary checks in CUDA
  • Triton load/store helpers abstract pointer math that CUDA exposes directly
  • Triton hides synchronization details that CUDA requires you to reason about
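
As a concrete example of this mapping, here is a minimal Triton vector_add kernel with the rough CUDA equivalent of each step in comments. It is a sketch, not the skeleton shipped in kernels/:

import torch
import triton
import triton.language as tl

@triton.jit
def vector_add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)               # CUDA: blockIdx.x
    offs = pid * BLOCK + tl.arange(0, BLOCK)  # CUDA: blockIdx.x * blockDim.x + threadIdx.x
    mask = offs < n                           # CUDA: if (i < n) { ... }
    x = tl.load(x_ptr + offs, mask=mask)      # CUDA: explicit pointer arithmetic on x_ptr
    y = tl.load(y_ptr + offs, mask=mask)
    tl.store(out_ptr + offs, x + y, mask=mask)

x = torch.randn(10_000, device="cuda")
y = torch.randn_like(x)
out = torch.empty_like(x)
grid = lambda meta: (triton.cdiv(x.numel(), meta["BLOCK"]),)  # CUDA: grid dimension
vector_add_kernel[grid](x, y, out, x.numel(), BLOCK=1024)

Note what is absent on the Triton side: no threadIdx, no __syncthreads, no shared-memory declarations. Those reappear the moment you port the same kernel to CUDA C++.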

See docs/triton_vs_cuda.md for a longer concept table.

Repository Layout

docs/         concept notes, roadmap, profiling guidance
reference/    plain PyTorch reference implementations
kernels/      Triton and CUDA learner skeletons
tasks/        workbook specs, TODO skeletons, task-local tests and benches
tests/        repository-wide checks and correctness scaffolding
bench/        cross-implementation benchmark harnesses
tools/        environment checks, profiling helpers, comparison scripts

Environment Assumptions

  • Python 3.10+
  • PyTorch with CUDA support
  • Triton installed if you want to run Triton tasks
  • CUDA toolkit installed if you want to build the native extension
  • A recent NVIDIA driver and a Blackwell-capable software stack

Architecture targeting is configurable:

  • KERNEL_LAB_CUDA_ARCH=120 for Python extension loading helpers
  • -DCMAKE_CUDA_ARCHITECTURES=120 for direct CMake builds
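
For example, to pin the architecture for a single Python-driven run:

KERNEL_LAB_CUDA_ARCH=120 uv run python tasks/06_pytorch_custom_op/extension_skeleton.py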

If your toolkit, driver, or local environment does not yet expose Blackwell exactly as expected, keep the architecture explicit and adjust it instead of editing kernel source files.

Install

uv sync

If you want commands to run inside the uv-managed environment without activating it manually, use uv run, for example:

uv run pytest -q
uv run python tools/check_env.py

Run Environment Checks

uv run python tools/check_env.py
uv run python tools/print_device_info.py
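
These scripts report what the rest of the lab assumes about your setup. As a rough sketch of that kind of check (illustrative, not the repo's actual script):

import torch

assert torch.cuda.is_available(), "no CUDA device visible to PyTorch"
print("device:", torch.cuda.get_device_name(0))
print("compute capability:", torch.cuda.get_device_capability(0))  # e.g. (12, 0) on Blackwell-class parts
print("torch:", torch.__version__, "cuda:", torch.version.cuda)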

Run Tests

The default test suite validates references and scaffolding. Triton/CUDA task tests skip gracefully until you implement the learner TODOs.
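
The skip behavior follows a standard pytest pattern, roughly as in this sketch (the helper below is a stand-in, not the repo's actual test code):

import pytest

def learner_vector_add(x, y):
    # Stand-in for a TODO skeleton; the real ones live under kernels/ and tasks/.
    raise NotImplementedError

def test_vector_add():
    try:
        learner_vector_add([1.0], [2.0])
    except NotImplementedError:
        pytest.skip("learner TODO not implemented yet")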

uv run pytest -q

You can also use:

make sync
uv run pytest -q
uv run ./tools/run_all_tests.sh

Run Benchmarks

Benchmarks compare PyTorch, Triton, and CUDA when available. Incomplete implementations are reported and skipped.

uv run python bench/bench_vector_add.py --device cuda
uv run python bench/bench_softmax.py --device cuda
uv run python bench/bench_matmul.py --device cuda
uv run python bench/bench_attention.py --device cuda
uv run python bench/compare_impls.py --task vector_add

Or run the helper:

uv run ./tools/run_all_benchmarks.sh

Build The CUDA Extension

Two paths are provided:

  1. CMake-first native build:
cmake -S kernels/cuda -B build/cuda -DCMAKE_CUDA_ARCHITECTURES=${KERNEL_LAB_CUDA_ARCH:-120}
cmake --build build/cuda -j
  2. Python-driven extension loading for lab experiments:
uv run python tasks/06_pytorch_custom_op/extension_skeleton.py

The binding and CUDA source files compile into a minimal extension skeleton; the learner is expected to fill in operator registration and kernel dispatch details.
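
For orientation, the Python-driven path boils down to torch.utils.cpp_extension.load, roughly as in this sketch (source paths and the exposed operator name are illustrative, not the repo's exact layout):

import torch
from torch.utils.cpp_extension import load

# JIT-compiles the binding and kernel sources into an importable module.
ext = load(
    name="kernel_lab_ext",
    sources=["binding.cpp", "vector_add.cu"],  # illustrative paths
    extra_cuda_cflags=["-gencode=arch=compute_120,code=sm_120"],
    verbose=True,
)

x = torch.randn(1024, device="cuda")
y = torch.randn_like(x)
out = ext.vector_add(x, y)  # the exposed name depends on what binding.cpp registers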

Profile Kernels

Start from one kernel, one shape, one implementation:

uv run ./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode cuda
uv run ./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode triton

See docs/profiling_guide.md for warmup, synchronization, and first metrics to inspect.
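
The short version of that guidance: warm up first, then time with CUDA events, because kernel launches are asynchronous and wall-clock timing around them is misleading without synchronization. A minimal sketch:

import torch

def time_cuda_ms(fn, warmup=10, iters=100):
    for _ in range(warmup):         # trigger compilation and lazy allocations
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()        # wait for the GPU before reading the timer
    return start.elapsed_time(end) / iters  # milliseconds per call

x = torch.randn(1 << 20, device="cuda")
print(time_cuda_ms(lambda: x + x), "ms")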

How To Use The Workbook

  • Read the spec.md for the current task.
  • Run the reference implementation and tests first.
  • Read the Triton and CUDA skeleton side by side.
  • Fill in one TODO at a time.
  • Re-run correctness tests before looking at benchmark numbers (see the sketch after this list).
  • Only profile after the kernel is correct on small shapes.
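
A correctness check is a one-liner against the reference, as in this sketch (my_row_softmax is a placeholder for whichever kernel you are testing; shapes and tolerances are illustrative):

import torch

def my_row_softmax(x):
    # Placeholder: swap in your Triton or CUDA kernel call here.
    return torch.softmax(x, dim=-1)

x = torch.randn(64, 128, device="cuda")
expected = torch.softmax(x, dim=-1)  # the plain PyTorch reference
torch.testing.assert_close(my_row_softmax(x), expected, rtol=1e-3, atol=1e-3)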

This repo is designed to make the learning path visible. The TODOs are the point.