# Task 06: PyTorch Custom Op

## 1. Problem Statement

Expose a CUDA kernel as a PyTorch operator so that Python code can call and test it like any other operator.

## 2. Expected Input/Output Shapes

For the starter binding, use vector add:

- `x`: `[N]`
- `y`: `[N]`
- output: `[N]`

The same pattern can later be extended to the other operators.

## 3. Performance Intuition

The binding layer is not usually where the kernel time goes, but it determines whether you can test, benchmark, and profile the CUDA implementation from Python.

## 4. Memory Access Discussion

The binding itself does not optimize memory traffic; it passes tensors through and dispatches the kernel. Still, the binding must preserve the kernel's shape, dtype, device, and contiguity assumptions (a test-side mirror of these checks is sketched after the checklist below).

## 5. What Triton Is Abstracting

Triton often avoids a separate C++ binding layer because Python can launch the JIT-compiled kernel directly (see the Triton sketch below).

## 6. What CUDA Makes Explicit

CUDA plus a PyTorch binding requires you to define function signatures, operator registration, and build integration explicitly.

## 7. Reflection Questions

- What assumptions should the binding validate before calling a CUDA kernel?
- Why is operator registration useful for testing and benchmarking?
- What changes once you want autograd support?

## 8. Implementation Checklist

- Read `kernels/cuda/binding/binding.cpp`
- Build or load the extension from Python
- Call the operator from `torch.ops.kernel_lab`
- Add correctness checks once the CUDA kernel is implemented
- Try `torch.library.opcheck` if your PyTorch build provides it
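A minimal sketch of the build-and-call flow from the checklist above. It assumes the binding registers the operator as `kernel_lab::vector_add` and that the kernel lives in a `vector_add.cu` next to `binding.cpp`; both names are assumptions, so match them to the actual files and registration in the repo:

```python
import torch
from torch.utils.cpp_extension import load

# JIT-compile and load the extension into this process. With
# is_python_module=False, loading the shared library runs its
# TORCH_LIBRARY registration, so the op shows up under torch.ops.
load(
    name="kernel_lab",
    sources=[
        "kernels/cuda/binding/binding.cpp",
        "kernels/cuda/binding/vector_add.cu",  # hypothetical kernel file
    ],
    is_python_module=False,
)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn(1 << 20, device="cuda")

out = torch.ops.kernel_lab.vector_add(x, y)  # assumed operator name

# Correctness check against the eager reference.
torch.testing.assert_close(out, x + y)
```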
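The assumptions from Section 4 can be made concrete. In the C++ binding each would typically be a `TORCH_CHECK`; here they are mirrored in Python as a test-side helper (the helper name and exact set of checks are ours, not the repo's):

```python
import torch

def check_vector_add_inputs(x: torch.Tensor, y: torch.Tensor) -> None:
    # Python mirror of the checks a binding would typically enforce
    # (in C++, each of these would be a TORCH_CHECK) before launching.
    assert x.dim() == 1 and y.dim() == 1, "expected 1-D tensors of shape [N]"
    assert x.shape == y.shape, f"shape mismatch: {x.shape} vs {y.shape}"
    assert x.dtype == y.dtype, f"dtype mismatch: {x.dtype} vs {y.dtype}"
    assert x.is_cuda and y.is_cuda, "expected CUDA tensors"
    assert x.device == y.device, "expected tensors on the same device"
    assert x.is_contiguous() and y.is_contiguous(), "expected contiguous inputs"
```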
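If your PyTorch build provides it (2.4 or newer), `torch.library.opcheck` exercises the registered op against PyTorch's operator-correctness utilities in one call; the op name is the same assumption as above:

```python
import torch

x = torch.randn(1024, device="cuda")
y = torch.randn(1024, device="cuda")

# Runs schema, fake-tensor, and autograd-registration checks (among
# others) against the op. Pass the .default overload, not the packet.
torch.library.opcheck(torch.ops.kernel_lab.vector_add.default, (x, y))
```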
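For contrast with Section 5, a minimal Triton vector add along the lines of the standard Triton tutorial: the `@triton.jit` kernel is launched straight from Python, with no C++ binding, registration, or build step:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```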