Initial project scaffold

2026-04-10 13:15:06 +00:00
commit a4a6b1f1c8
94 changed files with 3964 additions and 0 deletions
--- a/tasks/07_profiling/init.py
+++ b/tasks/07_profiling/init.py
@@ -0,0 +1 @@
+"""Profiling task."""
--- a/tasks/07_profiling/profile_examples.md
+++ b/tasks/07_profiling/profile_examples.md
@@ -0,0 +1,23 @@
+# Profiling Examples
+
+## Nsight Compute
+
+```bash
+./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode triton
+./tools/profile_ncu.sh python bench/bench_softmax.py --device cuda --mode torch
+```
+
+## Nsight Systems
+
+```bash
+./tools/profile_nsys.sh python bench/bench_matmul.py --device cuda --mode triton
+./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode torch
+```
+
+## First Things To Inspect
+
+- median runtime from the benchmark harness
+- whether warmup was excluded
+- whether kernels overlap or serialize
+- whether memory throughput is near a practical ceiling
+- whether a kernel launch is tiny enough that launch overhead matters
--- a/tasks/07_profiling/spec.md
+++ b/tasks/07_profiling/spec.md
@@ -0,0 +1,40 @@
+# Task 07: Profiling
+
+## 1. Problem Statement
+
+Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.
+
+## 2. Expected Input/Output Shapes
+
+Use the same shapes as your benchmark harness so measurements stay comparable.
+
+## 3. Performance Intuition
+
+Profiling is how you turn guesses into evidence. Use it after correctness is established.
+
+## 4. Memory Access Discussion
+
+Profilers can tell you whether the kernel is limited by memory throughput, occupancy, or something else. Interpret those numbers in terms of the operator's access pattern.
+
+## 5. What Triton Is Abstracting
+
+Triton hides low-level details in code, but profilers still show the resulting kernels and hardware behavior.
+
+## 6. What CUDA Makes Explicit
+
+CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.
+
+## 7. Reflection Questions
+
+- Did you profile a single kernel or an entire script?
+- Did you warm up before timing?
+- Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?
+
+## 8. Implementation Checklist
+
+- Pick one benchmark and one implementation
+- Warm up first
+- Synchronize before and after timing
+- Run `ncu` and inspect a small set of metrics
+- Run `nsys` and inspect the timeline
+- Write down what you learned before changing the kernel