# GPU Basics

This lab treats GPU kernels as structured data-parallel programs.

## Core Ideas

- GPU throughput comes from massive parallelism, not from a single fast thread.
- Launch geometry determines which logical elements each thread or program instance owns.
- Global memory is large but slow relative to on-chip storage.
- Kernel design is largely about reducing global-memory traffic and increasing on-chip reuse.

## Terms To Keep Straight

- thread: the smallest execution entity in CUDA
- warp: a hardware scheduling group, usually 32 threads
- block: a cooperating group of threads with access to the same shared memory
- grid: the full launch, covering all blocks
- program instance: Triton's block-level work abstraction

## Mental Model For This Repo

Each task asks the same questions in both Triton and CUDA (the sketches below make them concrete):

- What data does one unit of work own?
- How is that ownership computed from launch indices?
- Which reads are coalesced or contiguous?
- Which intermediate values must be reduced?
- Which values should be reused on chip?

Keep a notebook. Write down the answers before you code.
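To make the launch-geometry idea concrete, here is a minimal CUDA sketch; it is not taken from this repo, and the kernel name `scale`, the block size of 256, and the problem size are illustrative. Each thread computes the one element it owns from `blockIdx`, `blockDim`, and `threadIdx`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread owns exactly one element: its global index is derived
// from the launch geometry (blockIdx, blockDim, threadIdx).
__global__ void scale(float* x, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's element
    if (i < n) {                 // guard the ragged last block
        x[i] = a * x[i];         // adjacent threads touch adjacent addresses: coalesced
    }
}

int main() {
    const int n = 1 << 20;       // illustrative problem size
    float* x;
    cudaMallocManaged(&x, n * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // ceil(n / threads) blocks
    scale<<<blocks, threads>>>(x, 2.0f, n);
    cudaDeviceSynchronize();

    printf("x[0] = %f\n", x[0]);  // expect 2.0
    cudaFree(x);
    return 0;
}
```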
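The reduction and reuse questions show up together in a block-wise sum. The sketch below (again illustrative; `block_sum` and the fixed block size of 256 are assumptions) stages each block's slice in shared memory, reduces it on chip, and writes a single partial sum back to global memory:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each block reduces its slice of the input in fast on-chip shared
// memory, then writes one partial sum back to global memory.
__global__ void block_sum(const float* x, float* partial, int n) {
    __shared__ float tile[256];                 // on-chip scratch, one slot per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? x[i] : 0.0f;  // one coalesced global read per thread
    __syncthreads();

    // Tree reduction entirely in shared memory: no extra global traffic.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = tile[0];  // one write per block
}

int main() {
    const int n = 1 << 20;
    int threads = 256;                          // must match the tile size above
    int blocks = (n + threads - 1) / threads;

    float *x, *partial;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&partial, blocks * sizeof(float));
    for (int i = 0; i < n; ++i) x[i] = 1.0f;

    block_sum<<<blocks, threads>>>(x, partial, n);
    cudaDeviceSynchronize();

    float total = 0.0f;                         // finish the reduction on the host
    for (int b = 0; b < blocks; ++b) total += partial[b];
    printf("sum = %f (expect %d)\n", total, n);

    cudaFree(x);
    cudaFree(partial);
    return 0;
}
```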
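One way to fill in the notebook for the `block_sum` sketch above: each block owns 256 consecutive elements; a thread's ownership is computed as `blockIdx.x * blockDim.x + threadIdx.x`; the initial load is coalesced because adjacent threads read adjacent addresses; the per-thread values are the intermediates that must be reduced; and the shared-memory `tile` is the value reused on chip.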