Task 06: PyTorch Custom Op
1. Problem Statement
Expose a CUDA kernel as a PyTorch operator so Python code can call it and test it like any other operator.
2. Expected Input/Output Shapes
For the starter binding, use vector add:
- x: [N]
- y: [N]
- output: [N]
The same pattern can later be extended to the other operators.
3. Performance Intuition
The binding layer is not usually where the kernel time goes, but it determines whether you can test, benchmark, and profile the CUDA implementation from Python.
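Once the operator is callable from Python, measuring per-call cost is straightforward. Below is a minimal CUDA-event timing sketch; the helper name `bench` and the iteration counts are illustrative, and the baseline uses eager addition until the custom op is bound:

```python
import torch

def bench(fn, iters=1000, warmup=10):
    # Time with CUDA events; per-call binding/dispatch overhead is
    # included in the measured average.
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # average ms per call

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")
print(bench(lambda: x + y))  # baseline; swap in the custom op once bound
```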
4. Memory Access Discussion
The binding itself does not optimize memory traffic; it passes tensors and dispatches the kernel. Still, the binding must preserve shape, dtype, device, and contiguity assumptions.
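A minimal Python-side sketch of those checks, assuming the operator is registered as `torch.ops.kernel_lab.vector_add` (only the `kernel_lab` namespace appears in this task; the op name is an assumption):

```python
import torch

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Validate the assumptions the CUDA kernel relies on before dispatching.
    assert x.is_cuda and y.is_cuda, "kernel expects CUDA tensors"
    assert x.device == y.device, "inputs must be on the same device"
    assert x.dtype == y.dtype, "kernel does not handle mixed dtypes"
    assert x.shape == y.shape, "vector add requires matching shapes [N]"
    # The kernel indexes linearly, so force a dense layout.
    x, y = x.contiguous(), y.contiguous()
    return torch.ops.kernel_lab.vector_add(x, y)
```

In the C++ binding itself, the same assertions map onto TORCH_CHECK calls.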
5. What Triton Is Abstracting
Triton typically needs no separate C++ binding layer: Python launches the JIT-compiled kernel directly.
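For contrast, here is the standard Triton vector add (essentially the tutorial kernel); the block size of 1024 is an arbitrary illustrative choice:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offsets < n  # guard the tail when n is not a multiple of BLOCK
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = out.numel()
    grid = (triton.cdiv(n, 1024),)
    add_kernel[grid](x, y, out, n, BLOCK=1024)  # launched directly from Python
    return out
```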
6. What CUDA Makes Explicit
Binding a CUDA kernel into PyTorch requires you to spell out the function signature, operator registration, and build integration explicitly.
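A minimal sketch of those pieces, collapsed into one file via torch.utils.cpp_extension.load_inline; the real task builds kernels/cuda/binding/binding.cpp instead, and the CPU-side x + y here stands in for the eventual kernel launch:

```python
import torch
from torch.utils.cpp_extension import load_inline

cpp_source = """
#include <torch/extension.h>

torch::Tensor vector_add(torch::Tensor x, torch::Tensor y) {
  TORCH_CHECK(x.sizes() == y.sizes(), "vector_add: shape mismatch");
  TORCH_CHECK(x.dtype() == y.dtype(), "vector_add: dtype mismatch");
  return x + y;  // placeholder until the CUDA kernel is wired in
}

// Register the op so Python can reach it as torch.ops.kernel_lab.vector_add.
TORCH_LIBRARY(kernel_lab, m) {
  m.def("vector_add(Tensor x, Tensor y) -> Tensor", vector_add);
}
"""

# is_python_module=False just loads the shared library so the TORCH_LIBRARY
# static registration runs; no pybind wrapper is generated.
load_inline(
    name="kernel_lab_binding",
    cpp_sources=cpp_source,
    is_python_module=False,
)

out = torch.ops.kernel_lab.vector_add(torch.ones(4), torch.ones(4))
print(out)  # tensor([2., 2., 2., 2.])
```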
7. Reflection Questions
- What assumptions should the binding validate before calling a CUDA kernel?
- Why is operator registration useful for testing and benchmarking?
- What changes once you want autograd support?
8. Implementation Checklist
- Read `kernels/cuda/binding/binding.cpp`
- Build or load the extension from Python
- Call the operator from `torch.ops.kernel_lab`
- Add correctness checks once the CUDA kernel is implemented
- Try `torch.library.opcheck` if your PyTorch build provides it (sketched below)
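A sketch of the last two items, assuming the op is registered under `kernel_lab::vector_add` as above; `torch.library.opcheck` is only present in recent PyTorch builds:

```python
import torch
from torch.library import opcheck  # available in recent PyTorch builds

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")

# Correctness: compare against eager addition.
out = torch.ops.kernel_lab.vector_add(x, y)
torch.testing.assert_close(out, x + y)

# opcheck exercises the registration itself: schema, fake-tensor, and
# autograd behavior of the overload.
opcheck(torch.ops.kernel_lab.vector_add.default, (x, y))
```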