Initial project scaffold
docs/gpu_basics.md
# GPU Basics
This lab treats GPU kernels as structured data-parallel programs: each unit of work owns a well-defined slice of the data.
## Core Ideas
- GPU throughput comes from massive parallelism, not a single fast thread.
- Launch geometry determines which logical elements each thread or program instance owns (see the index sketch after this list).
- Global memory is large and slow relative to on-chip storage.
- Kernel design is often about reducing memory traffic and increasing reuse.
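
As a minimal sketch of the launch-geometry and ownership ideas above (the kernel, the block size of 256, and `n` are all illustrative, not from this repo's tasks):

```cuda
// Sketch: launch geometry decides which element each thread owns.
__global__ void scale(float *x, int n, float a) {
    // One element per thread, computed from the launch indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may overshoot n, so guard
        x[i] = a * x[i];
}

// Launch: ceil-divide so every element gets exactly one owner.
// scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
```
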
## Terms To Keep Straight
- thread: the smallest execution entity in CUDA
- warp: a hardware scheduling group, usually 32 threads
- block: a cooperating group of threads with shared memory access
- grid: the full launch of all blocks
- program instance: Triton's block-level work abstraction
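
A small sketch, with illustrative names, of how the CUDA levels are visible from inside a kernel; in Triton the thread and warp levels are managed by the compiler, and `tl.program_id(0)` plays the role of `blockIdx.x`:

```cuda
// Sketch: record each hierarchy level, one output slot per thread.
// Launch so that gridDim.x * blockDim.x equals the buffer length.
__global__ void whoami(int *lane, int *warp, int *global_id) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide thread id
    lane[g]      = threadIdx.x % warpSize;          // position within the warp
    warp[g]      = threadIdx.x / warpSize;          // warp index within the block
    global_id[g] = g;
}
```
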
## Mental Model For This Repo
Each task asks the same questions in both Triton and CUDA:
- What data does one unit of work own?
- How is that ownership computed from launch indices?
- Which reads are coalesced or contiguous?
- Which intermediate values must be reduced?
- Which values should be reused on chip?
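
To ground the questions, here is one hedged worked example, not a task from this repo: summing each row of a row-major matrix, assuming a power-of-two block size. One block owns one row (ownership from `blockIdx.x`), consecutive threads read consecutive columns (coalesced), per-thread partial sums must be reduced across the block, and those partials are the values kept on chip:

```cuda
// Sketch: out[r] = sum of row r. One block per row; blockDim.x a power of two.
__global__ void row_sum(const float *m, float *out, int cols) {
    extern __shared__ float partial[];                 // on-chip scratch, one slot per thread
    const float *row = m + (size_t)blockIdx.x * cols;  // ownership from the block index

    // Strided loop: consecutive threads touch consecutive addresses (coalesced).
    float acc = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        acc += row[c];

    // Tree-reduce the per-thread partial sums in shared memory.
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}

// Launch: row_sum<<<rows, 256, 256 * sizeof(float)>>>(d_m, d_out, cols);
```

In Triton, the same reduction would typically be a single `tl.sum` over a tile, with the compiler handling the on-chip staging.
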
Keep a notebook. Write down the answers before you code.