kernel-lab/docs/gpu_basics.md
2026-04-10 13:22:19 +00:00

GPU Basics

This lab treats GPU kernels as structured data-parallel programs: many units of work, each owning a well-defined slice of the data.

Core Ideas

  • GPU throughput comes from massive parallelism, not a single fast thread.
  • Launch geometry determines which logical elements each thread or program instance owns.
  • Global memory is large and slow relative to on-chip storage.
  • Kernel design is often about reducing memory traffic and increasing reuse.
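The launch-geometry idea can be made concrete with a small sketch: a CPU-side simulation of the standard CUDA 1D index computation (the sizes and helper name are ours for illustration; no GPU is required to run it).

```python
# Simulate how CUDA-style launch geometry assigns array elements to threads.
# This runs on the CPU and only mirrors the index arithmetic.

def owned_indices(n, block_dim, grid_dim):
    """Return {global_thread_id: element_index} for a 1D launch over n elements."""
    ownership = {}
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            gid = block_idx * block_dim + thread_idx  # global thread id
            if gid < n:                               # bounds guard, as in a real kernel
                ownership[gid] = gid
    return ownership

# 10 elements with 4-thread blocks -> ceil(10/4) = 3 blocks; 2 threads go idle.
n, block_dim = 10, 4
grid_dim = (n + block_dim - 1) // block_dim
print(grid_dim)                                     # 3
print(len(owned_indices(n, block_dim, grid_dim)))   # 10: every element is owned
```

The `ceil`-division for `grid_dim` plus the in-kernel bounds guard is the usual pattern when the data size is not a multiple of the block size.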

Terms To Keep Straight

  • thread: the smallest execution entity in CUDA
  • warp: a hardware scheduling group, 32 threads on current NVIDIA GPUs
  • block: a cooperating group of threads that share on-chip shared memory and can synchronize
  • grid: the full set of blocks in one kernel launch
  • program instance: Triton's unit of work, roughly analogous to a CUDA block
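A minimal sketch of how these levels nest, using the usual CUDA arithmetic (warp size 32 is assumed; the helper name is ours, not a CUDA API):

```python
WARP_SIZE = 32  # typical on NVIDIA hardware; query the device in real code

def locate(global_tid, block_dim):
    """Map a global thread id to (block, thread-in-block, warp-in-block, lane)."""
    block_idx = global_tid // block_dim
    thread_idx = global_tid % block_dim
    warp_idx = thread_idx // WARP_SIZE   # which warp inside the block
    lane = thread_idx % WARP_SIZE        # position within that warp
    return block_idx, thread_idx, warp_idx, lane

# Thread 300 in a launch with 128-thread blocks:
print(locate(300, 128))  # (2, 44, 1, 12)
```

Reading it back: thread 300 sits in block 2, is thread 44 of that block, and is lane 12 of the block's second warp.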

Mental Model For This Repo

Each task asks the same questions in both Triton and CUDA:

  • What data does one unit of work own?
  • How is that ownership computed from launch indices?
  • Which reads are coalesced or contiguous?
  • Which intermediate values must be reduced?
  • Which values should be reused on chip?
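As an illustration, here are those questions answered for a simple row-sum, simulated in plain Python (mapping one row per program instance is one common choice, not the only one):

```python
# Worked example: each program instance (or CUDA block) owns one matrix row
# and reduces it to a single sum. CPU simulation of the access pattern only.

def row_sum_kernel(matrix, pid):
    """One unit of work: pid owns row `pid` (ownership computed from the launch index)."""
    row = matrix[pid]   # contiguous read: a row's elements are adjacent in memory,
                        # so neighboring threads would load neighboring words
    acc = 0             # the intermediate value that must be reduced
    for x in row:
        acc += x        # the reduction; on a GPU this would combine partial sums
                        # via shared memory or warp shuffles
    return acc

matrix = [[1, 2, 3], [4, 5, 6]]
# "Grid" = one program per row:
sums = [row_sum_kernel(matrix, pid) for pid in range(len(matrix))]
print(sums)  # [6, 15]
```

Ownership, indexing, contiguity, and reduction are all visible in a few lines; only the on-chip reuse question has no CPU analogue here.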

Keep a notebook. Write down the answers before you code.