# Task 07: Profiling

## 1. Problem Statement

Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.
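One way to keep a profiling run limited to a single kernel is to filter by kernel name and skip the warm-up launches. The sketch below only builds the command line as a list; the script name `bench_vector_add.py`, the kernel name `add_kernel`, and the report name are hypothetical placeholders. The flags shown are standard Nsight Compute CLI options, but confirm them against `ncu --help` for your installed version.

```python
import shlex


def ncu_command(script, kernel_name, skip=0, count=1, report="profile_one_kernel"):
    """Build an `ncu` argument list that profiles launches of one kernel only."""
    return [
        "ncu",
        "--kernel-name", kernel_name,  # restrict profiling to this kernel name
        "--launch-skip", str(skip),    # skip this many launches (e.g. warm-up)
        "--launch-count", str(count),  # stop after profiling this many launches
        "--export", report,            # write the report to this file name
        "python", script,
    ]


# Hypothetical script and kernel names, for illustration only.
cmd = ncu_command("bench_vector_add.py", "add_kernel", skip=10)
print(shlex.join(cmd))
```

Building the command as a list makes it easy to pass to `subprocess.run(cmd)` later, or to paste the printed string into a shell.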

## 2. Expected Input/Output Shapes

Use the same shapes as your benchmark harness so measurements stay comparable.

## 3. Performance Intuition

Profiling turns guesses into evidence. Use it only after correctness is established, because performance numbers from a kernel that computes the wrong result tell you nothing useful.

## 4. Memory Access Discussion

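Profilers can tell you whether the kernel is limited by memory throughput, compute throughput, occupancy, or something else such as launch overhead. Interpret those numbers in terms of the operator's access pattern: how many bytes it reads and writes, and how much arithmetic it does per byte.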
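A quick cross-check on a profiler's memory throughput number is to compute effective bandwidth yourself from the operator's access pattern. The sketch below does this for a float32 vector add; the element count, the elapsed time, and the peak bandwidth are all assumed values for illustration, so substitute your own measurement and your GPU's documented peak.

```python
def effective_bandwidth_gbs(bytes_moved, elapsed_ms):
    """Effective bandwidth in GB/s given bytes moved and elapsed milliseconds."""
    return bytes_moved / (elapsed_ms * 1e-3) / 1e9


# Hypothetical case: out = x + y on float32 vectors of n elements.
n = 1 << 24                 # 16,777,216 elements
bytes_moved = 3 * n * 4     # read x, read y, write out; 4 bytes per element
elapsed_ms = 0.5            # assumed measured kernel time, not a real result
peak_gbs = 900.0            # assumed peak DRAM bandwidth; look up your GPU's

achieved = effective_bandwidth_gbs(bytes_moved, elapsed_ms)
fraction = achieved / peak_gbs
print(f"{achieved:.1f} GB/s, {fraction:.0%} of assumed peak")
```

If the fraction is close to 1, the kernel is likely near the memory roof and further tuning of arithmetic will not help much. If it is far below 1 for a simple streaming operator, the access pattern or launch configuration is worth a closer look.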

## 5. What Triton Is Abstracting

Triton hides low-level details such as per-thread indexing and shared-memory management from the source code, but profilers still show the compiled kernels and their hardware behavior.

## 6. What CUDA Makes Explicit

CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.

## 7. Reflection Questions

- Did you profile a single kernel or an entire script?
- Did you warm up before timing?
- Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?
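For the last question, it helps to know which answer to expect before opening the profiler. A first-order roofline estimate compares the operator's arithmetic intensity (FLOPs per byte) with the machine's ridge point (peak FLOP/s divided by peak bytes/s). The sketch below is an estimate under assumed peak numbers and ideal data movement, not a substitute for profiler metrics; the function name `classify` and both peak values are hypothetical.

```python
def classify(flops, bytes_moved, peak_flops, peak_bytes_per_s):
    """Return (label, intensity, ridge) from a first-order roofline estimate."""
    intensity = flops / bytes_moved        # FLOPs performed per byte moved
    ridge = peak_flops / peak_bytes_per_s  # intensity where the two roofs meet
    label = "bandwidth-bound" if intensity < ridge else "compute-bound"
    return label, intensity, ridge


# Assumed hardware peaks, for illustration only.
PEAK_FLOPS = 100e12   # 100 TFLOP/s
PEAK_BW = 1e12        # 1 TB/s

# float32 vector add: 1 FLOP per element, 12 bytes per element.
n = 1 << 24
add_result = classify(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# float16 square matmul, ideal traffic: 2*N^3 FLOPs, three N*N matrices of 2 bytes.
N = 4096
mm_result = classify(2 * N**3, 3 * N * N * 2, PEAK_FLOPS, PEAK_BW)

print("vector add:", add_result[0])
print("matmul:    ", mm_result[0])
```

When the profiler disagrees with this estimate, for example a matmul that shows up as memory-limited, that mismatch is itself a useful first signal that data reuse is worse than the ideal case assumed here.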

## 8. Implementation Checklist

- Pick one benchmark and one implementation
- Warm up first
- Synchronize before and after timing
- Run `ncu` and inspect a small set of metrics
- Run `nsys` and inspect the timeline
- Write down what you learned before changing the kernel
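The warm-up and synchronization steps in the checklist can be sketched as a small timing helper. This is a minimal sketch, not the benchmark harness itself: `time_kernel` and its parameters are hypothetical names, and the default `sync` is a no-op so the example runs on any machine. On a GPU, assuming your harness uses PyTorch, you would pass `sync=torch.cuda.synchronize`, because kernel launches are asynchronous and a timer stopped without synchronizing measures only the launch.

```python
import statistics
import time


def time_kernel(fn, sync=lambda: None, warmup=10, iters=50):
    """Return the median wall-clock time per call of fn, in milliseconds."""
    # Warm up first so one-time costs (compilation, caches) are not timed.
    for _ in range(warmup):
        fn()
    sync()

    samples = []
    for _ in range(iters):
        sync()                          # make sure no earlier work is pending
        start = time.perf_counter()
        fn()
        sync()                          # wait for the work to actually finish
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


# CPU stand-in workload so the sketch runs anywhere.
ms = time_kernel(lambda: sum(range(10_000)))
print(f"median: {ms:.4f} ms")
```

The median is used instead of the mean so that a single slow iteration does not skew the result. Host-side timing like this is a sanity check; for per-kernel numbers, rely on `ncu` and `nsys`.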