Initial project scaffold
docs/gpu_basics.md
# GPU Basics
This lab treats GPU kernels as structured data-parallel programs: each unit of work owns a well-defined slice of the data.
## Core Ideas
- GPU throughput comes from massive parallelism, not a single fast thread.
- Launch geometry determines which logical elements each thread or program instance owns (see the index sketch after this list).
- Global memory is large and slow relative to on-chip storage.
- Kernel design is often about reducing memory traffic and increasing reuse.
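
As a minimal sketch of the launch-geometry and ownership ideas above (the kernel, the block size of 256, and `n` are all illustrative, not from this repo's tasks):

```cuda
// Sketch: launch geometry decides which element each thread owns.
__global__ void scale(float *x, int n, float a) {
    // One element per thread, computed from the launch indices.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                  // the last block may overshoot n, so guard
        x[i] = a * x[i];
}

// Launch: ceil-divide so every element gets exactly one owner.
// scale<<<(n + 255) / 256, 256>>>(d_x, n, 2.0f);
```
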
## Terms To Keep Straight
- thread: the smallest execution entity in CUDA
- warp: a hardware scheduling group, usually 32 threads
- block: a cooperating group of threads with shared memory access
- grid: the full launch of all blocks
- program instance: Triton's block-level work abstraction
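
A small sketch, with illustrative names, of how the CUDA levels are visible from inside a kernel; in Triton the thread and warp levels are managed by the compiler, and `tl.program_id(0)` plays the role of `blockIdx.x`:

```cuda
// Sketch: record each hierarchy level, one output slot per thread.
// Launch so that gridDim.x * blockDim.x equals the buffer length.
__global__ void whoami(int *lane, int *warp, int *global_id) {
    int g = blockIdx.x * blockDim.x + threadIdx.x;  // grid-wide thread id
    lane[g]      = threadIdx.x % warpSize;          // position within the warp
    warp[g]      = threadIdx.x / warpSize;          // warp index within the block
    global_id[g] = g;
}
```
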
## Mental Model For This Repo
Each task asks the same questions in both Triton and CUDA:
- What data does one unit of work own?
- How is that ownership computed from launch indices?
- Which reads are coalesced or contiguous?
- Which intermediate values must be reduced?
- Which values should be reused on chip?
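
To ground the questions, here is one hedged worked example, not a task from this repo: summing each row of a row-major matrix, assuming a power-of-two block size. One block owns one row (ownership from `blockIdx.x`), consecutive threads read consecutive columns (coalesced), per-thread partial sums must be reduced across the block, and those partials are the values kept on chip:

```cuda
// Sketch: out[r] = sum of row r. One block per row; blockDim.x a power of two.
__global__ void row_sum(const float *m, float *out, int cols) {
    extern __shared__ float partial[];                 // on-chip scratch, one slot per thread
    const float *row = m + (size_t)blockIdx.x * cols;  // ownership from the block index

    // Strided loop: consecutive threads touch consecutive addresses (coalesced).
    float acc = 0.0f;
    for (int c = threadIdx.x; c < cols; c += blockDim.x)
        acc += row[c];

    // Tree-reduce the per-thread partial sums in shared memory.
    partial[threadIdx.x] = acc;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (threadIdx.x < s)
            partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        out[blockIdx.x] = partial[0];
}

// Launch: row_sum<<<rows, 256, 256 * sizeof(float)>>>(d_m, d_out, cols);
```

In Triton, the same reduction would typically be a single `tl.sum` over a tile, with the compiler handling the on-chip staging.
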
Keep a notebook. Write down the answers before you code.