Task 07: Profiling
1. Problem Statement
Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.
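One way to keep a profiling run focused on a single kernel is to filter by kernel name and capture only one launch. The sketch below only builds the command line; it does not run anything. The script name `bench.py` and kernel name `add_kernel` are hypothetical placeholders, and the three section names are a suggested starting set, not a required one. Check `ncu --help` on your installation before relying on any flag.

```python
import shlex


def ncu_command(script, kernel_name, skip=5,
                sections=("SpeedOfLight", "MemoryWorkloadAnalysis", "Occupancy")):
    """Build an Nsight Compute command that profiles one launch of one kernel.

    script and kernel_name are placeholders supplied by the caller.
    skip is the number of matching launches to ignore first (warm-up launches).
    """
    cmd = [
        "ncu",
        "--kernel-name", kernel_name,   # only profile kernels with this name
        "--launch-skip", str(skip),     # skip the first few matching launches
        "--launch-count", "1",          # then capture exactly one launch
    ]
    for section in sections:
        cmd += ["--section", section]   # restrict output to a few sections
    cmd += ["python", script]
    return cmd


# Hypothetical script and kernel names, for illustration only.
print(shlex.join(ncu_command("bench.py", "add_kernel")))
```

Limiting the capture to one launch and a few sections keeps the report short enough to read in full before you change anything.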
2. Expected Input/Output Shapes
Use the same shapes as your benchmark harness so measurements stay comparable.
3. Performance Intuition
Profiling is how you turn guesses into evidence. Use it after correctness is established.
4. Memory Access Discussion
Profilers can tell you whether the kernel is limited by memory throughput, occupancy, or another factor such as compute throughput. Interpret those numbers in terms of the operator's access pattern: how many bytes it must read and write, and in what order.
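To connect a throughput number to the access pattern, it can help to work out by hand how many bytes the operator has to move and compare the implied bandwidth with the device's peak. The sketch below does this for a float32 vector add, which reads two inputs and writes one output. The elapsed time and peak bandwidth are made-up illustration values, not measurements; substitute your own timing and your GPU's datasheet figure.

```python
def achieved_bandwidth_gbs(bytes_moved, seconds):
    """Return effective memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / seconds / 1e9


n = 1 << 24                 # number of float32 elements per vector
bytes_per_element = 4
# Vector add: read x, read y, write out -> three arrays of n elements.
bytes_moved = 3 * n * bytes_per_element

elapsed_s = 250e-6          # ASSUMPTION: hypothetical measured kernel time
peak_gbs = 900.0            # ASSUMPTION: hypothetical device peak bandwidth

bw = achieved_bandwidth_gbs(bytes_moved, elapsed_s)
fraction_of_peak = bw / peak_gbs

print(f"achieved: {bw:.1f} GB/s ({fraction_of_peak:.0%} of assumed peak)")
```

If your hand estimate and the profiler's memory throughput disagree by a large factor, that gap is itself a clue: the kernel may be moving more bytes than the operator strictly requires.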
5. What Triton Is Abstracting
Triton hides low-level details such as per-thread indexing and shared-memory management in your source code, but profilers still show the compiled kernels and their hardware behavior.
6. What CUDA Makes Explicit
CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.
7. Reflection Questions
- Did you profile a single kernel or an entire script?
- Did you warm up before timing?
- Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?
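For the last question, a rough expectation helps before you open the profiler. The sketch below is a roofline-style estimate, which is a first-order model and not something the profiler computes for you in this form: it compares the kernel's arithmetic intensity (FLOPs per byte) with the machine balance (peak FLOP/s divided by peak bytes/s). The peak numbers are hypothetical, and the byte counts assume each array is moved exactly once.

```python
def classify(flops, bytes_moved, peak_flops_per_s, peak_bytes_per_s):
    """Return (label, arithmetic_intensity, machine_balance).

    A kernel whose intensity is below the machine balance is expected to be
    limited by memory bandwidth; otherwise by compute throughput.
    """
    intensity = flops / bytes_moved
    balance = peak_flops_per_s / peak_bytes_per_s
    label = "bandwidth-bound" if intensity < balance else "compute-bound"
    return label, intensity, balance


# ASSUMPTION: hypothetical device peaks, for illustration only.
PEAK_FLOPS = 19.5e12        # float32 FLOP/s
PEAK_BYTES = 1.555e12       # bytes/s

# float32 vector add: n additions, 3 arrays of n elements moved.
n = 1 << 24
vec_add = classify(n, 3 * n * 4, PEAK_FLOPS, PEAK_BYTES)

# float32 square matmul: 2*m^3 FLOPs, 3 matrices of m*m elements moved once.
m = 4096
matmul = classify(2 * m**3, 3 * m * m * 4, PEAK_FLOPS, PEAK_BYTES)

print("vector add:", vec_add[0])
print("matmul:    ", matmul[0])
```

Treat the result as a hypothesis to check against the profiler's throughput metrics, not as a conclusion.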
8. Implementation Checklist
- Pick one benchmark and one implementation
- Warm up first
- Synchronize before and after timing
- Run ncu and inspect a small set of metrics
- Run nsys and inspect the timeline
- Write down what you learned before changing the kernel
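The warm-up and synchronization items in the checklist can be sketched as a small timing helper. This is a minimal sketch, not the course's benchmark harness: `time_kernel` and its parameters are names invented here, and the default `sync` is a no-op so the example runs without a GPU. With PyTorch you would pass `torch.cuda.synchronize` as `sync`, since GPU launches return before the kernel finishes.

```python
import statistics
import time


def time_kernel(fn, sync=lambda: None, warmup=10, iters=50):
    """Return the median wall-clock time of fn() in seconds.

    fn    : zero-argument callable that launches the work to be timed.
    sync  : callable that blocks until queued device work is done.
            ASSUMPTION: for CUDA via PyTorch, pass torch.cuda.synchronize.
    """
    # Warm up first so one-time costs (compilation, caches) are excluded.
    for _ in range(warmup):
        fn()
    sync()

    samples = []
    for _ in range(iters):
        sync()                          # nothing pending before the timer starts
        start = time.perf_counter()
        fn()
        sync()                          # wait for the work to actually finish
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


# CPU stand-in workload so the sketch runs anywhere.
data = list(range(100_000))
median_s = time_kernel(lambda: sum(data), warmup=3, iters=20)
print(f"median: {median_s * 1e6:.1f} us")
```

The median is used here because it is less sensitive to occasional slow iterations than the mean; reporting the minimum or the full distribution are equally reasonable choices.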