# Task 06: PyTorch Custom Op

## 1. Problem Statement

Expose a CUDA kernel as a PyTorch operator so Python code can call it and test it like any other operator.
## 2. Expected Input/Output Shapes

For the starter binding, use vector add:

- `x`: `[N]`
- `y`: `[N]`
- output: `[N]`

The same pattern can later be extended to the other operators.
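
As a quick sanity check of these shapes, the operator can be compared against eager PyTorch. The namespace `torch.ops.kernel_lab` comes from the checklist below; the operator name `vector_add` is an assumption.

```python
import torch

# Assumed operator name under the kernel_lab namespace.
N = 1 << 20
x = torch.randn(N, device="cuda")
y = torch.randn(N, device="cuda")

out = torch.ops.kernel_lab.vector_add(x, y)

assert out.shape == (N,)                # output: [N]
torch.testing.assert_close(out, x + y)  # matches eager vector add
```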
## 3. Performance Intuition

The binding layer is not usually where the kernel time goes, but it determines whether you can test, benchmark, and profile the CUDA implementation from Python.
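
For instance, once the operator is callable from Python, CUDA events are enough for a first timing pass. A minimal sketch, assuming the `vector_add` operator name used elsewhere in this task:

```python
import torch

x = torch.randn(1 << 24, device="cuda")
y = torch.randn_like(x)

for _ in range(10):  # warm-up launches (caches, clocks, allocator)
    torch.ops.kernel_lab.vector_add(x, y)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    torch.ops.kernel_lab.vector_add(x, y)
end.record()
torch.cuda.synchronize()  # wait until the timed launches finish

print(f"{start.elapsed_time(end) / 100:.3f} ms per call")
```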
## 4. Memory Access Discussion

The binding itself does not optimize memory traffic; it passes tensors and dispatches the kernel. Still, the binding must preserve shape, dtype, device, and contiguity assumptions.
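
A thin Python wrapper can guard those assumptions before dispatch (the C++ binding would do the same with `TORCH_CHECK`). A sketch, with the operator name assumed:

```python
import torch

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Validate the assumptions the CUDA kernel bakes in.
    if not (x.is_cuda and y.is_cuda):
        raise ValueError("expected CUDA tensors")
    if x.dtype != y.dtype:
        raise ValueError("dtypes must match")
    if x.shape != y.shape:
        raise ValueError("shapes must match")
    # The kernel indexes flat memory, so enforce contiguity explicitly.
    return torch.ops.kernel_lab.vector_add(x.contiguous(), y.contiguous())
```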
## 5. What Triton Is Abstracting

Triton often avoids a separate C++ binding layer because Python can launch the JIT kernel directly.
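
For comparison, a Triton vector add is defined and launched entirely from Python; there is no C++ file, registration macro, or separate build step. A standard sketch:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```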
## 6. What CUDA Makes Explicit

A CUDA kernel plus a PyTorch binding requires you to define function signatures, operator registration, and build integration explicitly.
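
The same pieces can be mirrored from Python with `torch.library`, which makes the explicit steps visible without reading C++. This standalone sketch is not the repo's actual binding: the schema and the placeholder implementation are assumptions, and in the real setup `binding.cpp` would register the operator via `TORCH_LIBRARY` instead.

```python
import torch

# Define the operator schema explicitly for the kernel_lab namespace.
lib = torch.library.Library("kernel_lab", "DEF")
lib.define("vector_add(Tensor x, Tensor y) -> Tensor")

def vector_add_cuda(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Placeholder: the real implementation would launch the custom kernel
    # rather than fall back to eager addition.
    return x + y

# Register the implementation for the CUDA dispatch key.
lib.impl("vector_add", vector_add_cuda, "CUDA")

x = torch.randn(8, device="cuda")
print(torch.ops.kernel_lab.vector_add(x, x))
```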
## 7. Reflection Questions

- What assumptions should the binding validate before calling a CUDA kernel?
- Why is operator registration useful for testing and benchmarking?
- What changes once you want autograd support?
## 8. Implementation Checklist

- Read `kernels/cuda/binding/binding.cpp`
- Build or load the extension from Python (a rough end-to-end sketch follows this list)
- Call the operator from `torch.ops.kernel_lab`
- Add correctness checks once the CUDA kernel is implemented
- Try `torch.library.opcheck` if your PyTorch build provides it
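
A rough end-to-end pass over this checklist might look like the sketch below. The `.cu` source path, the operator name, and the `opcheck` call are all assumptions to adjust to the actual repo.

```python
import torch
from torch.utils.cpp_extension import load

# Build and load the extension in-process; registration in binding.cpp
# runs as a side effect of loading.
load(
    name="kernel_lab",
    sources=[
        "kernels/cuda/binding/binding.cpp",
        "kernels/cuda/binding/vector_add.cu",  # hypothetical kernel file
    ],
    verbose=True,
)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn_like(x)

# Call through the registered namespace and check correctness.
out = torch.ops.kernel_lab.vector_add(x, y)
torch.testing.assert_close(out, x + y)

# Newer PyTorch builds ship torch.library.opcheck for schema/dispatch tests.
if hasattr(torch.library, "opcheck"):
    torch.library.opcheck(torch.ops.kernel_lab.vector_add.default, (x, y))
```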