# Task 06: PyTorch Custom Op

## 1. Problem Statement

Expose a CUDA kernel as a PyTorch operator so Python code can call it and test it like any other operator.
## 2. Expected Input/Output Shapes

For the starter binding, use vector add:

- `x`: `[N]`
- `y`: `[N]`
- output: `[N]`

The same pattern can later be extended to the other operators.
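
As a quick sanity check of these shapes, the operator can be compared against eager PyTorch. The namespace `torch.ops.kernel_lab` comes from the checklist below; the operator name `vector_add` is an assumption.

```python
import torch

# Assumed operator name under the kernel_lab namespace.
N = 1 << 20
x = torch.randn(N, device="cuda")
y = torch.randn(N, device="cuda")

out = torch.ops.kernel_lab.vector_add(x, y)

assert out.shape == (N,)                # output: [N]
torch.testing.assert_close(out, x + y)  # matches eager vector add
```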
## 3. Performance Intuition

The binding layer is not usually where the kernel time goes, but it determines whether you can test, benchmark, and profile the CUDA implementation from Python.
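
For instance, once the operator is callable from Python, CUDA events are enough for a first timing pass. A minimal sketch, assuming the `vector_add` operator name used elsewhere in this task:

```python
import torch

x = torch.randn(1 << 24, device="cuda")
y = torch.randn_like(x)

for _ in range(10):  # warm-up launches (caches, clocks, allocator)
    torch.ops.kernel_lab.vector_add(x, y)

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
for _ in range(100):
    torch.ops.kernel_lab.vector_add(x, y)
end.record()
torch.cuda.synchronize()  # wait until the timed launches finish

print(f"{start.elapsed_time(end) / 100:.3f} ms per call")
```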
## 4. Memory Access Discussion

The binding itself does not optimize memory traffic; it passes tensors and dispatches the kernel. Still, the binding must preserve shape, dtype, device, and contiguity assumptions.
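
A thin Python wrapper can guard those assumptions before dispatch (the C++ binding would do the same with `TORCH_CHECK`). A sketch, with the operator name assumed:

```python
import torch

def vector_add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Validate the assumptions the CUDA kernel bakes in.
    if not (x.is_cuda and y.is_cuda):
        raise ValueError("expected CUDA tensors")
    if x.dtype != y.dtype:
        raise ValueError("dtypes must match")
    if x.shape != y.shape:
        raise ValueError("shapes must match")
    # The kernel indexes flat memory, so enforce contiguity explicitly.
    return torch.ops.kernel_lab.vector_add(x.contiguous(), y.contiguous())
```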
## 5. What Triton Is Abstracting

Triton often avoids a separate C++ binding layer because Python can launch the JIT kernel directly.
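
For comparison, a Triton vector add is defined and launched entirely from Python; there is no C++ file, registration macro, or separate build step. A standard sketch:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the ragged final block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)  # one program instance per 1024 elements
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```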
## 6. What CUDA Makes Explicit

A CUDA kernel plus a PyTorch binding requires you to define function signatures, operator registration, and build integration explicitly.
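
The same pieces can be mirrored from Python with `torch.library`, which makes the explicit steps visible without reading C++. This standalone sketch is not the repo's actual binding: the schema and the placeholder implementation are assumptions, and in the real setup `binding.cpp` would register the operator via `TORCH_LIBRARY` instead.

```python
import torch

# Define the operator schema explicitly for the kernel_lab namespace.
lib = torch.library.Library("kernel_lab", "DEF")
lib.define("vector_add(Tensor x, Tensor y) -> Tensor")

def vector_add_cuda(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # Placeholder: the real implementation would launch the custom kernel
    # rather than fall back to eager addition.
    return x + y

# Register the implementation for the CUDA dispatch key.
lib.impl("vector_add", vector_add_cuda, "CUDA")

x = torch.randn(8, device="cuda")
print(torch.ops.kernel_lab.vector_add(x, x))
```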
## 7. Reflection Questions

- What assumptions should the binding validate before calling a CUDA kernel?
- Why is operator registration useful for testing and benchmarking?
- What changes once you want autograd support?
## 8. Implementation Checklist

- Read `kernels/cuda/binding/binding.cpp`
- Build or load the extension from Python (a rough end-to-end sketch follows this list)
- Call the operator from `torch.ops.kernel_lab`
- Add correctness checks once the CUDA kernel is implemented
- Try `torch.library.opcheck` if your PyTorch build provides it
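
A rough end-to-end pass over this checklist might look like the sketch below. The `.cu` source path, the operator name, and the `opcheck` call are all assumptions to adjust to the actual repo.

```python
import torch
from torch.utils.cpp_extension import load

# Build and load the extension in-process; registration in binding.cpp
# runs as a side effect of loading.
load(
    name="kernel_lab",
    sources=[
        "kernels/cuda/binding/binding.cpp",
        "kernels/cuda/binding/vector_add.cu",  # hypothetical kernel file
    ],
    verbose=True,
)

x = torch.randn(1 << 20, device="cuda")
y = torch.randn_like(x)

# Call through the registered namespace and check correctness.
out = torch.ops.kernel_lab.vector_add(x, y)
torch.testing.assert_close(out, x + y)

# Newer PyTorch builds ship torch.library.opcheck for schema/dispatch tests.
if hasattr(torch.library, "opcheck"):
    torch.library.opcheck(torch.ops.kernel_lab.vector_add.default, (x, y))
```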