kernel-lab/docs/gpu_basics.md
2026-04-10 13:22:19 +00:00

GPU Basics

This lab treats GPU kernels as structured data-parallel programs: many units of work, each owning a well-defined slice of the data.

Core Ideas

  • GPU throughput comes from massive parallelism, not a single fast thread.
  • Launch geometry determines which logical elements each thread or program instance owns.
  • Global memory is large and slow relative to on-chip storage.
  • Kernel design is often about reducing memory traffic and increasing reuse.
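The launch-geometry idea can be made concrete with a small sketch: a CPU-side simulation of the standard CUDA 1D index computation (the sizes and helper name are ours for illustration; no GPU is required to run it).

```python
# Simulate how CUDA-style launch geometry assigns array elements to threads.
# This runs on the CPU and only mirrors the index arithmetic.

def owned_indices(n, block_dim, grid_dim):
    """Return {global_thread_id: element_index} for a 1D launch over n elements."""
    ownership = {}
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            gid = block_idx * block_dim + thread_idx  # global thread id
            if gid < n:                               # bounds guard, as in a real kernel
                ownership[gid] = gid
    return ownership

# 10 elements with 4-thread blocks -> ceil(10/4) = 3 blocks; 2 threads go idle.
n, block_dim = 10, 4
grid_dim = (n + block_dim - 1) // block_dim
print(grid_dim)                                     # 3
print(len(owned_indices(n, block_dim, grid_dim)))   # 10: every element is owned
```

The `ceil`-division for `grid_dim` plus the in-kernel bounds guard is the usual pattern when the data size is not a multiple of the block size.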

Terms To Keep Straight

  • thread: the smallest execution entity in CUDA
  • warp: a hardware scheduling group, 32 threads on current NVIDIA GPUs
  • block: a cooperating group of threads that share on-chip shared memory and can synchronize
  • grid: the full set of blocks in one kernel launch
  • program instance: Triton's unit of work, roughly analogous to a CUDA block
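A minimal sketch of how these levels nest, using the usual CUDA arithmetic (warp size 32 is assumed; the helper name is ours, not a CUDA API):

```python
WARP_SIZE = 32  # typical on NVIDIA hardware; query the device in real code

def locate(global_tid, block_dim):
    """Map a global thread id to (block, thread-in-block, warp-in-block, lane)."""
    block_idx = global_tid // block_dim
    thread_idx = global_tid % block_dim
    warp_idx = thread_idx // WARP_SIZE   # which warp inside the block
    lane = thread_idx % WARP_SIZE        # position within that warp
    return block_idx, thread_idx, warp_idx, lane

# Thread 300 in a launch with 128-thread blocks:
print(locate(300, 128))  # (2, 44, 1, 12)
```

Reading it back: thread 300 sits in block 2, is thread 44 of that block, and is lane 12 of the block's second warp.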

Mental Model For This Repo

Each task asks the same questions in both Triton and CUDA:

  • What data does one unit of work own?
  • How is that ownership computed from launch indices?
  • Which reads are coalesced or contiguous?
  • Which intermediate values must be reduced?
  • Which values should be reused on chip?
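As an illustration, here are those questions answered for a simple row-sum, simulated in plain Python (mapping one row per program instance is one common choice, not the only one):

```python
# Worked example: each program instance (or CUDA block) owns one matrix row
# and reduces it to a single sum. CPU simulation of the access pattern only.

def row_sum_kernel(matrix, pid):
    """One unit of work: pid owns row `pid` (ownership computed from the launch index)."""
    row = matrix[pid]   # contiguous read: a row's elements are adjacent in memory,
                        # so neighboring threads would load neighboring words
    acc = 0             # the intermediate value that must be reduced
    for x in row:
        acc += x        # the reduction; on a GPU this would combine partial sums
                        # via shared memory or warp shuffles
    return acc

matrix = [[1, 2, 3], [4, 5, 6]]
# "Grid" = one program per row:
sums = [row_sum_kernel(matrix, pid) for pid in range(len(matrix))]
print(sums)  # [6, 15]
```

Ownership, indexing, contiguity, and reduction are all visible in a few lines; only the on-chip reuse question has no CPU analogue here.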

Keep a notebook. Write down the answers before you code.