Task 06: PyTorch Custom Op
1. Problem Statement
Expose a CUDA kernel as a PyTorch operator so Python code can call it and test it like any other operator.
2. Expected Input/Output Shapes
For the starter binding, use vector add:
- x: [N]
- y: [N]
- output: [N]
The same pattern can later be extended to the other operators.
3. Performance Intuition
The binding layer is not usually where the kernel time goes, but it determines whether you can test, benchmark, and profile the CUDA implementation from Python.
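Once the operator is callable from Python, measuring per-call cost is straightforward. Below is a minimal CUDA-event timing sketch; the helper name `bench` and the iteration counts are illustrative, and the baseline uses eager addition until the custom op is bound:

```python
import torch

def bench(fn, iters=1000, warmup=10):
    # Time with CUDA events; per-call binding/dispatch overhead is
    # included in the measured average.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
print(bench(lambda: x + y))  # baseline; swap in the custom op once bound
```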
4. Memory Access Discussion
The binding itself does not optimize memory traffic; it passes tensors and dispatches the kernel. Still, the binding must preserve shape, dtype, device, and contiguity assumptions.
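A minimal Python-side sketch of those checks, assuming the operator is registered as `torch.ops.kernel_lab.vector_add` (only the `kernel_lab` namespace appears in this task; the op name is an assumption):

```python
import torch

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Validate the assumptions the CUDA kernel relies on before dispatching.
    assert x.is_cuda and y.is_cuda, "kernel expects CUDA tensors"
    assert x.device == y.device, "inputs must be on the same device"
    assert x.dtype == y.dtype, "kernel does not handle mixed dtypes"
    assert x.shape == y.shape, "vector add requires matching shapes [N]"
    # The kernel indexes linearly, so force a dense layout.
    x, y = x.contiguous(), y.contiguous()
    return torch.ops.kernel_lab.vector_add(x, y)
```

In the C++ binding itself, the same assertions map onto TORCH_CHECK calls.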
5. What Triton Is Abstracting
Triton typically needs no separate C++ binding layer: Python launches the JIT-compiled kernel directly.
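For contrast, here is the standard Triton vector add (essentially the tutorial kernel); the block size of 1024 is an arbitrary illustrative choice:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n  # guard the tail when n is not a multiple of BLOCK
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)  # launched directly from Python
    return out
```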
6. What CUDA Makes Explicit
Binding a CUDA kernel into PyTorch requires you to spell out the function signature, operator registration, and build integration explicitly.
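A minimal sketch of those pieces, collapsed into one file via torch.utils.cpp_extension.load_inline; the real task builds kernels/cuda/binding/binding.cpp instead, and the CPU-side x + y here stands in for the eventual kernel launch:

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
#include <torch/extension.h>

torch::Tensor vector_add(torch::Tensor x, torch::Tensor y) {
  TORCH_CHECK(x.sizes() == y.sizes(), "vector_add: shape mismatch");
  TORCH_CHECK(x.dtype() == y.dtype(), "vector_add: dtype mismatch");
  return x + y;  // placeholder until the CUDA kernel is wired in
}

// Register the op so Python can reach it as torch.ops.kernel_lab.vector_add.
TORCH_LIBRARY(kernel_lab, m) {
  m.def("vector_add(Tensor x, Tensor y) -> Tensor", vector_add);
}
"""

# is_python_module=False just loads the shared library so the TORCH_LIBRARY
# static registration runs; no pybind wrapper is generated.
load_inline(
    name="kernel_lab_binding",
    cpp_sources=cpp_source,
    is_python_module=False,
)

out = torch.ops.kernel_lab.vector_add(torch.ones(4), torch.ones(4))
print(out)  # tensor([2., 2., 2., 2.])
```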
7. Reflection Questions
- What assumptions should the binding validate before calling a CUDA kernel?
- Why is operator registration useful for testing and benchmarking?
- What changes once you want autograd support?
8. Implementation Checklist
- Read `kernels/cuda/binding/binding.cpp`
- Build or load the extension from Python
- Call the operator from `torch.ops.kernel_lab`
- Add correctness checks once the CUDA kernel is implemented
- Try `torch.library.opcheck` if your PyTorch build provides it (sketched below)
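A sketch of the last two items, assuming the op is registered under `kernel_lab::vector_add` as above; `torch.library.opcheck` is only present in recent PyTorch builds:

```python
import torch
from torch.library import opcheck  # available in recent PyTorch builds

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")

# Correctness: compare against eager addition.
out = torch.ops.kernel_lab.vector_add(x, y)
torch.testing.assert_close(out, x + y)

# opcheck exercises the registration itself: schema, fake-tensor, and
# autograd behavior of the overload.
opcheck(torch.ops.kernel_lab.vector_add.default, (x, y))
```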