Initial project scaffold

2026-04-10 13:22:19 +00:00
commit 7fa69b1354
94 changed files with 3964 additions and 0 deletions

@@ -0,0 +1,40 @@
# Task 07: Profiling
## 1. Problem Statement
Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.
## 2. Expected Input/Output Shapes
Use the same shapes as your benchmark harness so measurements stay comparable.
## 3. Performance Intuition
Profiling is how you turn guesses into evidence. Use it only after correctness is established, because timings from a kernel that computes the wrong result tell you nothing.
## 4. Memory Access Discussion
Profilers can tell you whether the kernel is limited by memory throughput, compute throughput, low occupancy, or stalls. Interpret those numbers in terms of the operator's access pattern: how many bytes it has to move relative to the arithmetic it performs.
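
Before opening the profiler, it can help to estimate which limit to expect. The sketch below is a back-of-the-envelope roofline check, not part of the task scaffold: the peak numbers are hypothetical placeholders (substitute your device's datasheet values), and the two operators (an elementwise float32 add and a square float32 matmul with ideal data reuse) are assumed examples.

```python
def arithmetic_intensity(flops: float, bytes_moved: float) -> float:
    """FLOPs performed per byte of DRAM traffic."""
    return flops / bytes_moved


def classify(intensity: float, peak_flops: float, peak_bandwidth: float) -> str:
    """Compare a kernel's intensity with the machine balance point."""
    machine_balance = peak_flops / peak_bandwidth  # FLOPs per byte
    return "memory-bound" if intensity < machine_balance else "compute-bound"


# Hypothetical device peaks, chosen only for illustration.
PEAK_FLOPS = 20e12      # 20 TFLOP/s
PEAK_BANDWIDTH = 1e12   # 1 TB/s

# Elementwise float32 add over n elements:
# one FLOP per element, two 4-byte reads and one 4-byte write.
n = 1 << 24
add_ai = arithmetic_intensity(flops=n, bytes_moved=3 * 4 * n)

# Square float32 matmul of size m, assuming each matrix crosses DRAM once:
# 2*m^3 FLOPs, three m*m matrices of 4-byte elements.
m = 4096
matmul_ai = arithmetic_intensity(flops=2 * m**3, bytes_moved=3 * 4 * m * m)

print(f"add:    {add_ai:.3f} FLOP/B ->", classify(add_ai, PEAK_FLOPS, PEAK_BANDWIDTH))
print(f"matmul: {matmul_ai:.1f} FLOP/B ->", classify(matmul_ai, PEAK_FLOPS, PEAK_BANDWIDTH))
```

If the profiler disagrees with this estimate (for example, an elementwise kernel that reaches only a small fraction of peak bandwidth), that gap is usually the first thing worth investigating.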
## 5. What Triton Is Abstracting
Triton hides low-level details such as per-thread indexing, shared-memory staging, and intra-block synchronization, but profilers still show the compiled kernels and their hardware behavior.
## 6. What CUDA Makes Explicit
CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.
## 7. Reflection Questions
- Did you profile a single kernel or an entire script?
- Did you warm up before timing?
- Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?
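
The warm-up and timing questions above can be made concrete with a small harness. This is a minimal sketch under stated assumptions: `time_kernel` and its parameters are hypothetical names, and `sync` is whatever blocks until the device is idle. With PyTorch on CUDA you would pass `torch.cuda.synchronize`; the demo uses a CPU stand-in so it runs anywhere.

```python
import statistics
import time
from typing import Callable, List


def time_kernel(
    fn: Callable[[], object],
    sync: Callable[[], None] = lambda: None,
    warmup: int = 10,
    iters: int = 50,
) -> List[float]:
    """Return per-iteration wall-clock times in milliseconds."""
    # Warm-up runs absorb one-time costs (JIT compilation, caches, clocks).
    for _ in range(warmup):
        fn()
    sync()

    times_ms = []
    for _ in range(iters):
        sync()                        # nothing left over from earlier work
        start = time.perf_counter()
        fn()
        sync()                        # wait until the launched work has finished
        times_ms.append((time.perf_counter() - start) * 1e3)
    return times_ms


# CPU stand-in for a kernel launch; replace with your own callable.
samples = time_kernel(lambda: sum(range(10_000)), warmup=3, iters=20)
print(f"median {statistics.median(samples):.3f} ms over {len(samples)} runs")
```

Reporting the median rather than the mean keeps one slow outlier from skewing the result. Wall-clock timing like this includes launch overhead, so treat it as a comparison tool and use the profiler for per-kernel durations.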
## 8. Implementation Checklist
- Pick one benchmark and one implementation
- Warm up first
- Synchronize before starting and before stopping the timer (kernel launches are asynchronous)
- Run `ncu` and inspect a small set of metrics
- Run `nsys` and inspect the timeline
- Write down what you learned before changing the kernel
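
The `ncu` and `nsys` steps in the checklist might look like the following. This is a sketch, not the project's prescribed invocation: the script name `bench.py`, the kernel name `add_kernel`, and the report names are hypothetical, and the chosen sections are just one reasonable starting set. The commands are built as argument lists so they can be printed, reviewed, and then passed to `subprocess.run`.

```python
import shlex
from typing import List


def ncu_command(target: List[str], kernel_name: str, report: str) -> List[str]:
    """Profile a single launch of one kernel with a small set of sections."""
    return [
        "ncu",
        "--kernel-name", kernel_name,   # only kernels with this name
        "--launch-skip", "10",          # skip early launches (warm-up)
        "--launch-count", "1",          # then profile exactly one launch
        "--section", "SpeedOfLight",
        "--section", "MemoryWorkloadAnalysis",
        "--section", "Occupancy",
        "-o", report,
        *target,
    ]


def nsys_command(target: List[str], report: str) -> List[str]:
    """Capture a timeline of CUDA activity for the whole script."""
    return ["nsys", "profile", "--trace=cuda,nvtx", "-o", report, *target]


# Hypothetical benchmark script and kernel name; substitute your own.
target = ["python", "bench.py"]
ncu_cmd = ncu_command(target, kernel_name="add_kernel", report="add_ncu")
nsys_cmd = nsys_command(target, report="add_nsys")

print(shlex.join(ncu_cmd))
print(shlex.join(nsys_cmd))
```

Restricting `ncu` to one kernel and one launch keeps the report short enough to read in full, which matches the goal of interpreting a few metrics before tuning. The `nsys` timeline then shows where that kernel sits relative to the rest of the script.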