# Task 01: Vector Add

## 1. Problem Statement

Implement `out[i] = x[i] + y[i]` in both Triton and CUDA, then compare both against the PyTorch reference.

## 2. Expected Input/Output Shapes

- Input: two tensors with identical 1D or flattened shapes
- Output: one tensor with the same shape

## 3. Performance Intuition

Vector add is simple enough that launch overhead and memory bandwidth dominate quickly. It is a good place to learn indexing before the math becomes interesting.

## 4. Memory Access Discussion

This kernel should read `x[i]` and `y[i]` once and write `out[i]` once. The main thing to inspect is whether neighboring threads or lanes access neighboring elements, i.e. whether the accesses coalesce.

## 5. What Triton Is Abstracting

Triton lets you express one block of contiguous offsets with `program_id` and `tl.arange`, then apply a mask on the tail block so out-of-range offsets are neither loaded nor stored.

## 6. What CUDA Makes Explicit

CUDA makes you compute `global_idx` from the block and thread indices yourself and write the boundary check explicitly.

## 7. Reflection Questions

- What is the exact correspondence between `program_id` and `blockIdx.x` here?
- Why is a mask or bounds check required on the final block?
- How would ownership change if one thread handled multiple elements?

## 8. Implementation Checklist

- Confirm the reference implementation
- Fill in the Triton masked loads, add, and store
- Fill in the CUDA thread ownership and store
- Test small and non-multiple-of-block-size shapes
- Benchmark bandwidth on larger vectors
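The blocking pattern Section 5 describes can be sketched on the CPU without a GPU or Triton installed. Below is a pure-NumPy simulation of the launch: one "program" per block of contiguous offsets, with a mask on the tail block. The function and parameter names are illustrative, not Triton API.

```python
import numpy as np

def vector_add_blocked(x, y, block_size=128):
    """Simulate a Triton-style launch: one 'program' per block of offsets."""
    n = x.shape[0]
    out = np.empty_like(x)
    num_programs = (n + block_size - 1) // block_size  # ceil-div grid size
    for pid in range(num_programs):                    # plays the role of tl.program_id(0)
        offsets = pid * block_size + np.arange(block_size)  # like pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
        mask = offsets < n                             # tail block: some lanes fall past the end
        valid = offsets[mask]
        out[valid] = x[valid] + y[valid]               # masked load, add, masked store
    return out
```

Testing with a length that is not a multiple of `block_size` exercises the tail mask, which is the part most worth getting right before writing the real kernel.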
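The index computation Section 6 makes explicit can likewise be simulated on the CPU. In the sketch below the two loops stand in for the CUDA grid, with `block_idx`, `thread_idx`, and `block_dim` playing `blockIdx.x`, `threadIdx.x`, and `blockDim.x`; the names are illustrative.

```python
def vector_add_cuda_style(x, y, block_dim=256):
    """Simulate the CUDA thread-ownership pattern with explicit loops."""
    n = len(x)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim       # number of blocks, ceil-div
    for block_idx in range(grid_dim):                 # blockIdx.x
        for thread_idx in range(block_dim):           # threadIdx.x
            global_idx = block_idx * block_dim + thread_idx
            if global_idx < n:                        # the explicit boundary check
                out[global_idx] = x[global_idx] + y[global_idx]
    return out
```

Note that `global_idx = block_idx * block_dim + thread_idx` is exactly the Triton expression `pid * BLOCK_SIZE + offset` with the per-lane offset unrolled, which answers the first reflection question: `program_id` corresponds to `blockIdx.x`.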
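For the final checklist item, the usual figure of merit is effective bandwidth: vector add moves three float32 streams per element (two reads, one write), so the arithmetic is just bytes moved over time. The helper below is a sketch; the example numbers are illustrative, not measurements.

```python
def effective_bandwidth_gb_s(n_elements, seconds, bytes_per_element=4):
    """Effective bandwidth for vector add: 2 reads + 1 write per element."""
    total_bytes = 3 * n_elements * bytes_per_element
    return total_bytes / seconds / 1e9  # GB/s
```

Comparing this number against the device's peak memory bandwidth shows how close the kernel gets to the roofline; for vector add, a well-written kernel should land near it.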