# Task 06: PyTorch Custom Op
## 1. Problem Statement
Expose a CUDA kernel as a PyTorch operator so Python code can call it and test it like any other operator.
## 2. Expected Input/Output Shapes
For the starter binding, use vector add:
- `x`: `[N]`
- `y`: `[N]`
- output: `[N]`
The same pattern can later be extended to the other operators.
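The `[N] + [N] → [N]` contract can be pinned down by a tiny pure-Python reference, which is also handy later as a correctness oracle for the CUDA output. A minimal sketch; the function name is illustrative, not part of the repo:

```python
def vector_add_ref(x, y):
    # Reference semantics for the starter op: elementwise add of two
    # equal-length [N] vectors. Intended only for correctness checks,
    # not performance.
    if len(x) != len(y):
        raise ValueError("inputs must both have shape [N]")
    return [a + b for a, b in zip(x, y)]
```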
## 3. Performance Intuition
Little of the runtime is usually spent in the binding layer itself, but the binding determines whether you can test, benchmark, and profile the CUDA implementation from Python at all.
## 4. Memory Access Discussion
The binding itself does not optimize memory traffic; it passes tensors through and dispatches the kernel. Still, it must check that the incoming tensors satisfy the shape, dtype, device, and contiguity assumptions the kernel relies on.
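Those preconditions can be mirrored in Python to make them concrete. A hedged sketch, assuming PyTorch is available; the function name is illustrative and does not correspond to anything in `binding.cpp`:

```python
import torch

def check_vector_add_inputs(x: torch.Tensor, y: torch.Tensor) -> None:
    # Mirror, in Python, the preconditions the C++ binding should
    # enforce before dispatching the CUDA kernel.
    if x.shape != y.shape:
        raise ValueError(f"shape mismatch: {tuple(x.shape)} vs {tuple(y.shape)}")
    if x.dtype != y.dtype:
        raise TypeError(f"dtype mismatch: {x.dtype} vs {y.dtype}")
    if x.device != y.device:
        raise ValueError(f"device mismatch: {x.device} vs {y.device}")
    if not (x.is_contiguous() and y.is_contiguous()):
        raise ValueError("kernel assumes contiguous inputs")
```

In the real binding these checks would typically be `TORCH_CHECK` calls on the C++ side, so failures surface as Python exceptions with clear messages.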
## 5. What Triton Is Abstracting
Triton typically needs no separate C++ binding layer: the `@triton.jit` kernel is compiled and launched directly from Python.
## 6. What CUDA Makes Explicit
CUDA plus PyTorch binding requires you to define function signatures, operator registration, and build integration explicitly.
## 7. Reflection Questions
- What assumptions should the binding validate before calling a CUDA kernel?
- Why is operator registration useful for testing and benchmarking?
- What changes once you want autograd support?
## 8. Implementation Checklist
- Read `kernels/cuda/binding/binding.cpp`
- Build or load the extension from Python
- Call the operator from `torch.ops.kernel_lab`
- Add correctness checks once the CUDA kernel is implemented
- Try `torch.library.opcheck` if your PyTorch build provides it