Task 07: Profiling

1. Problem Statement

Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.
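
One way to keep a profiling run scoped to a single kernel is to build the Nsight Compute command line explicitly. The sketch below assembles an `ncu` argument list using the `--kernel-name`, `--launch-skip`, `--launch-count`, and `--section` options. The kernel name `add_kernel` and the script `bench_add.py` are hypothetical placeholders, and the three sections are only one reasonable starting set, not a required one.

```python
import shlex


def build_ncu_command(kernel_name, script, launch_skip=0,
                      sections=("SpeedOfLight", "MemoryWorkloadAnalysis", "Occupancy")):
    """Return an argv list that profiles one launch of one kernel with ncu."""
    cmd = [
        "ncu",
        "--kernel-name", kernel_name,       # only profile kernels with this name
        "--launch-skip", str(launch_skip),  # skip early (warm-up) launches
        "--launch-count", "1",              # then profile a single launch
    ]
    for section in sections:
        cmd += ["--section", section]       # collect a small set of metric sections
    cmd += ["python", script]               # the benchmark script to run under ncu
    return cmd


# Hypothetical kernel and script names, used only for illustration.
cmd = build_ncu_command("add_kernel", "bench_add.py", launch_skip=10)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` or pasted into a shell. Restricting the run to one launch keeps the report short enough to read in full.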

2. Expected Input/Output Shapes

Use the same shapes as your benchmark harness so measurements stay comparable.

3. Performance Intuition

Profiling is how you turn guesses into evidence. Use it after correctness is established.

4. Memory Access Discussion

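Profilers can tell you whether the kernel is limited by memory throughput, compute throughput, occupancy, or latency (for example launch overhead or stalled warps). Interpret those numbers in terms of the operator's access pattern: how many bytes it reads and writes, and how much arithmetic it does per byte.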
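
Before opening the profiler, it can help to estimate which limit the operator should hit. The sketch below is a simple roofline-style check for an elementwise float32 vector add. The peak figures and the elapsed time are made-up placeholders, not the specifications or measurements of any real GPU, so substitute your own.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def classify(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Compare the kernel's intensity with the machine's ridge point."""
    ridge = peak_flops / peak_bandwidth  # FLOP/byte where the two limits meet
    if arithmetic_intensity(flops, bytes_moved) < ridge:
        return "memory-bound"
    return "compute-bound"


def achieved_bandwidth(bytes_moved, seconds):
    """Effective bytes per second for one kernel launch."""
    return bytes_moved / seconds


# Vector add z = x + y over n float32 elements:
# one add per element, two 4-byte reads and one 4-byte write per element.
n = 1 << 24
flops = n
bytes_moved = 3 * n * 4

# ASSUMED hardware peaks and timing, for illustration only.
PEAK_FLOPS = 10e12   # FLOP/s
PEAK_BW = 500e9      # bytes/s
elapsed_s = 0.5e-3   # hypothetical measured kernel time

verdict = classify(flops, bytes_moved, PEAK_FLOPS, PEAK_BW)
bw = achieved_bandwidth(bytes_moved, elapsed_s)
print(verdict, f"{bw / PEAK_BW:.0%} of assumed peak bandwidth")
```

If the estimate says memory-bound, the memory throughput metrics are the first ones to read. A large gap between achieved and peak bandwidth then points at the access pattern rather than at the arithmetic.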

5. What Triton Is Abstracting

Triton hides low-level details such as per-thread indexing and shared-memory management in the source code, but profilers still show the compiled kernels and their hardware behavior.

6. What CUDA Makes Explicit

CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.

7. Reflection Questions

Did you profile a single kernel or an entire script?
Did you warm up before timing?
Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?

8. Implementation Checklist

Pick one benchmark and one implementation
Warm up first
Synchronize before and after timing
Run ncu (Nsight Compute) and inspect a small set of metrics
Run nsys (Nsight Systems) and inspect the timeline
Write down what you learned before changing the kernel
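
The warm-up and synchronization steps in the checklist can be sketched as a small timing helper. This is a minimal illustration, not a replacement for your benchmark harness: the `time_fn` name and its parameters are hypothetical, and the default `sync` is a no-op so the sketch runs on a CPU. For a GPU kernel launched through PyTorch, pass `torch.cuda.synchronize` as `sync`, because launches are asynchronous and the host clock would otherwise stop before the kernel finishes.

```python
import statistics
import time


def time_fn(fn, sync=lambda: None, warmup=10, iters=50):
    """Return the median wall-clock time of fn in milliseconds."""
    # Warm up first: early calls may include compilation, autotuning, or cache fills.
    for _ in range(warmup):
        fn()

    times_ms = []
    for _ in range(iters):
        sync()                        # drain pending work before starting the clock
        start = time.perf_counter()
        fn()
        sync()                        # wait for fn's work to finish before stopping
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms)


# CPU stand-in workload; on a GPU, fn would launch the kernel under test.
median_ms = time_fn(lambda: sum(range(10_000)), warmup=3, iters=20)
print(f"median: {median_ms:.3f} ms")
```

The median is used instead of the mean so that a few slow outliers do not skew the result. Host-side timing like this also includes launch overhead, so treat it as a cross-check on the per-kernel durations that ncu and nsys report.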
