Initial project scaffold
1
tasks/07_profiling/__init__.py
Normal file
@@ -0,0 +1 @@
"""Profiling task."""
23
tasks/07_profiling/profile_examples.md
Normal file
@@ -0,0 +1,23 @@
# Profiling Examples

## Nsight Compute

```bash
./tools/profile_ncu.sh python bench/bench_vector_add.py --device cuda --mode triton
./tools/profile_ncu.sh python bench/bench_softmax.py --device cuda --mode torch
```

## Nsight Systems

```bash
./tools/profile_nsys.sh python bench/bench_matmul.py --device cuda --mode triton
./tools/profile_nsys.sh python bench/bench_attention.py --device cuda --mode torch
```

## First Things To Inspect

- median runtime from the benchmark harness
- whether warmup was excluded
- whether kernels overlap or serialize
- whether memory throughput is near a practical ceiling
- whether a kernel launch is tiny enough that launch overhead matters
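The first, second, and last items above can be checked from raw per-iteration timings before opening a profiler. The following is a minimal sketch, not part of this repository's harness: `summarize_runs`, the `warmup` count, and the `launch_overhead_ms` value of 0.01 ms are all illustrative assumptions, and the real launch overhead depends on the GPU, driver, and framework.

```python
import statistics


def summarize_runs(times_ms, warmup=3, launch_overhead_ms=0.01):
    """Summarize per-iteration runtimes in milliseconds.

    The first `warmup` samples are dropped so that one-time costs
    (compilation, caching, allocation) do not skew the median.
    `launch_overhead_ms` is an assumed placeholder, not a measured value.
    """
    if warmup < 0 or warmup >= len(times_ms):
        raise ValueError("warmup must leave at least one sample")
    steady = times_ms[warmup:]
    median_ms = statistics.median(steady)
    return {
        "median_ms": median_ms,
        "samples": len(steady),
        # Heuristic: flag kernels whose runtime is within 10x of the
        # assumed launch overhead, where launch cost may matter.
        "launch_bound": median_ms < 10 * launch_overhead_ms,
    }


# Hypothetical samples: the first three iterations include warmup effects.
samples = [5.0, 1.2, 1.1, 0.50, 0.52, 0.49, 0.51, 0.50]
summary = summarize_runs(samples, warmup=3)
print(summary)
```

The 10x threshold is only a rough rule of thumb for deciding when to look at launch overhead; the timeline in Nsight Systems is the better place to confirm it.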
40
tasks/07_profiling/spec.md
Normal file
@@ -0,0 +1,40 @@
# Task 07: Profiling

## 1. Problem Statement

Profile one kernel at a time and learn to interpret the first few metrics before tuning anything.

## 2. Expected Input/Output Shapes

Use the same shapes as your benchmark harness so measurements stay comparable.

## 3. Performance Intuition

Profiling is how you turn guesses into evidence. Use it after correctness is established.

## 4. Memory Access Discussion

Profilers can tell you whether the kernel is limited by memory throughput, occupancy, or something else. Interpret those numbers in terms of the operator's access pattern.
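One way to tie a measured runtime back to the access pattern is to compute the effective bandwidth implied by the bytes the operator must move. The sketch below assumes a float32 vector add (two reads and one write per element); the helper names, the 0.5 ms runtime, and the 900 GB/s peak are hypothetical values chosen for illustration, not measurements from this project.

```python
def vector_add_bytes(n, itemsize=4):
    """Minimum bytes moved by c = a + b: read a, read b, write c."""
    return 3 * n * itemsize


def effective_bandwidth_gbs(bytes_moved, runtime_ms):
    """Effective bandwidth in GB/s (1 GB = 1e9 bytes)."""
    return bytes_moved / (runtime_ms * 1e-3) / 1e9


n = 1 << 24                     # number of float32 elements (assumed shape)
runtime_ms = 0.5                # hypothetical median runtime from a harness
assumed_peak_gbs = 900.0        # hypothetical peak; check your GPU's spec

bw = effective_bandwidth_gbs(vector_add_bytes(n), runtime_ms)
print(f"{bw:.1f} GB/s, {100 * bw / assumed_peak_gbs:.0f}% of assumed peak")
```

If the figure lands far below the ceiling the profiler reports, the access pattern (for example, uncoalesced or strided loads) is a reasonable first suspect.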

## 5. What Triton Is Abstracting

Triton hides low-level details in code, but profilers still show the resulting kernels and hardware behavior.

## 6. What CUDA Makes Explicit

CUDA kernels expose their launch shapes, synchronization behavior, and memory hierarchy choices more directly, which can make profiler results easier to map back to code.

## 7. Reflection Questions

- Did you profile a single kernel or an entire script?
- Did you warm up before timing?
- Which metric was the first signal that the kernel was bandwidth-bound or compute-bound?
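For the last question, a back-of-the-envelope roofline estimate gives an expectation to compare against what the profiler reports. This is a rough sketch under stated assumptions: the function names are made up for illustration, the peak figures (10 TFLOP/s, 900 GB/s) are hypothetical, and the byte counts assume each operand is moved exactly once, which real kernels rarely achieve.

```python
def arithmetic_intensity(flops, bytes_moved):
    """FLOPs performed per byte moved to or from memory."""
    return flops / bytes_moved


def likely_bound(flops, bytes_moved, peak_flops, peak_bytes_per_s):
    """Classify a kernel against the roofline ridge point."""
    ridge = peak_flops / peak_bytes_per_s  # FLOPs per byte at the ridge
    if arithmetic_intensity(flops, bytes_moved) > ridge:
        return "compute-bound"
    return "bandwidth-bound"


PEAK_FLOPS = 10e12   # hypothetical peak compute throughput
PEAK_BW = 900e9      # hypothetical peak memory bandwidth in bytes/s

# float32 vector add: 1 FLOP and 12 bytes per element.
n = 1 << 24
vec = likely_bound(n, 12 * n, PEAK_FLOPS, PEAK_BW)

# float32 square matmul: 2*N^3 FLOPs, three N x N matrices moved once.
N = 4096
mm = likely_bound(2 * N**3, 3 * N * N * 4, PEAK_FLOPS, PEAK_BW)

print("vector add:", vec)
print("matmul:", mm)
```

The estimate only sets an expectation; the measured throughput figures from the profiler are what actually answer the question.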

## 8. Implementation Checklist

- Pick one benchmark and one implementation
- Warm up first
- Synchronize before and after timing
- Run `ncu` and inspect a small set of metrics
- Run `nsys` and inspect the timeline
- Write down what you learned before changing the kernel
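The warmup and synchronization steps in the checklist can be sketched as a small timing loop. This is an illustrative sketch, not this repository's benchmark harness: `time_fn` and its parameters are hypothetical, and the `sync` hook defaults to a no-op so the example runs on a CPU. On a CUDA device with PyTorch, passing `torch.cuda.synchronize` as `sync` would be the assumed way to wait for queued kernels.

```python
import statistics
import time


def time_fn(fn, warmup=5, iters=20, sync=lambda: None):
    """Time `fn` with warmup and synchronization around each measurement.

    `sync` should block until all queued device work has finished;
    the default no-op is only appropriate for synchronous CPU code.
    """
    for _ in range(warmup):
        fn()                      # warmup runs are never timed
    times_ms = []
    for _ in range(iters):
        sync()                    # drain earlier work before starting
        start = time.perf_counter()
        fn()
        sync()                    # wait for this run to finish
        times_ms.append((time.perf_counter() - start) * 1e3)
    return statistics.median(times_ms), times_ms


# CPU stand-in workload; `calls` records how often it ran.
calls = []
median_ms, times_ms = time_fn(
    lambda: calls.append(sum(range(1000))), warmup=2, iters=5
)
print(f"median over {len(times_ms)} timed runs: {median_ms:.4f} ms")
```

Without the second `sync()`, an asynchronous launch would return immediately and the loop would measure launch time rather than kernel time.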