# Task 07: Profiling

## 1. Problem Statement

Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.

## 2. Expected Input/Output Shapes

Use the same shapes as your benchmark harness so measurements stay comparable.

## 3. Performance Intuition

Profiling is how you turn guesses into evidence. Use it after correctness is established.

## 4. Memory Access Discussion

Profilers can tell you whether the kernel is limited by memory throughput, occupancy, or something else. Interpret those numbers in terms of the operator's access pattern.

## 5. What Triton Is Abstracting

Triton hides low-level details in code, but profilers still show the resulting kernels and hardware behavior.

## 6. What CUDA Makes Explicit

CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.

## 7. Reflection Questions

- Did you profile a single kernel or an entire script?
- Did you warm up before timing?
- Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?

## 8. Implementation Checklist

- Pick one benchmark and one implementation
- Warm up first
- Synchronize before and after timing
- Run `ncu` and inspect a small set of metrics
- Run `nsys` and inspect the timeline
- Write down what you learned before changing the kernel
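
The warm-up and synchronization steps in the checklist can be sketched as a small timing helper. This is a minimal sketch, not the harness this task assumes you already have: the names `time_kernel` and `achieved_gb_per_s` and the `sync` parameter are hypothetical. The default `sync` does nothing so the example runs on a CPU; on a GPU with PyTorch you would likely pass `torch.cuda.synchronize` so the clock only stops after the kernel has finished.

```python
import statistics
import time
from typing import Callable


def time_kernel(
    fn: Callable[[], object],
    sync: Callable[[], None] = lambda: None,  # assumption: pass a device sync on GPU
    warmup: int = 10,
    iters: int = 50,
) -> dict:
    """Time fn() after a warm-up phase; returns statistics in milliseconds."""
    # Warm up first so one-time costs (compilation, caches) are not measured.
    for _ in range(warmup):
        fn()
    sync()

    samples_ms = []
    for _ in range(iters):
        sync()  # make sure no earlier work is still in flight
        start = time.perf_counter()
        fn()
        sync()  # wait for the launched work to finish before stopping the clock
        samples_ms.append((time.perf_counter() - start) * 1e3)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "iters": iters,
    }


def achieved_gb_per_s(bytes_moved: int, elapsed_ms: float) -> float:
    """Convert bytes read plus written and a time in ms into GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9


if __name__ == "__main__":
    # CPU stand-in for a kernel launch, used only to show the call pattern.
    data = list(range(100_000))
    stats = time_kernel(lambda: sum(data))
    print(stats)
```

Comparing the achieved bandwidth from a helper like `achieved_gb_per_s` against the device's peak memory bandwidth gives a rough first hint about whether a kernel is bandwidth-bound, which you can then check against the metrics reported by `ncu`.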