Task 01: Vector Add
1. Problem Statement
Implement out[i] = x[i] + y[i] in both Triton and CUDA, then compare both against the PyTorch reference.
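Before writing either kernel, it helps to pin down the reference behavior. A minimal stand-in for the PyTorch reference, sketched here with NumPy so it runs anywhere (the function name `vector_add_reference` is illustrative, not part of the task scaffold):

```python
import numpy as np

def vector_add_reference(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Element-wise add; stands in for the PyTorch reference out = x + y."""
    assert x.shape == y.shape, "inputs must have identical shapes"
    return x + y
```

Both the Triton and CUDA results should match this to within floating-point tolerance.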
2. Expected Input/Output Shapes
- Input: two tensors with identical 1D or flattened shapes
- Output: one tensor with the same shape
3. Performance Intuition
Vector add is simple enough that launch overhead and memory bandwidth dominate quickly. It is a good place to learn indexing before the math becomes interesting.
4. Memory Access Discussion
This kernel should read x[i] and y[i] once and write out[i] once. The main thing to inspect is whether neighboring threads or lanes access neighboring elements.
5. What Triton Is Abstracting
Triton lets you express one block of contiguous offsets with program_id and tl.arange, then apply a mask on the tail.
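The block-of-offsets pattern can be emulated in plain NumPy, which makes the indexing easy to check before touching a GPU. In this sketch one loop iteration corresponds to one kernel instance; the names `BLOCK_SIZE` and `triton_style_add` are illustrative, and the real kernel would use tl.load/tl.store with the same mask:

```python
import numpy as np

def triton_style_add(x, y, BLOCK_SIZE=128):
    """NumPy emulation of the Triton block pattern."""
    n = x.shape[0]
    out = np.empty_like(x)
    num_programs = (n + BLOCK_SIZE - 1) // BLOCK_SIZE      # ceil-div, like the launch grid
    for pid in range(num_programs):                        # pid ~ tl.program_id(0)
        offsets = pid * BLOCK_SIZE + np.arange(BLOCK_SIZE) # ~ tl.arange(0, BLOCK_SIZE)
        mask = offsets < n                                 # guards the ragged tail block
        valid = offsets[mask]
        out[valid] = x[valid] + y[valid]                   # ~ masked load, add, masked store
    return out
```

Only the final program ever has a partially false mask; every earlier block is fully in bounds.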
6. What CUDA Makes Explicit
CUDA makes you compute global_idx from block and thread indices yourself and write the boundary check explicitly.
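The same ownership rule, emulated in Python to make the index arithmetic explicit. Each (block, thread) pair computes its global index and bounds-checks it, mirroring the canonical CUDA body `int i = blockIdx.x * blockDim.x + threadIdx.x; if (i < n) out[i] = x[i] + y[i];` (the function name `cuda_style_add` is illustrative):

```python
import numpy as np

def cuda_style_add(x, y, block_dim=128):
    """Python emulation of CUDA thread ownership with an explicit bounds check."""
    n = x.shape[0]
    out = np.empty_like(x)
    grid_dim = (n + block_dim - 1) // block_dim     # number of blocks launched
    for block_idx in range(grid_dim):               # ~ blockIdx.x
        for thread_idx in range(block_dim):         # ~ threadIdx.x
            i = block_idx * block_dim + thread_idx  # global index
            if i < n:                               # explicit boundary check
                out[i] = x[i] + y[i]
    return out
```

Note the structural match with the Triton version: the bounds check here plays exactly the role of the mask there.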
7. Reflection Questions
- What is the exact correspondence between program_id and blockIdx.x here?
- Why is a mask or bounds check required on the final block?
- How would the ownership change if one thread handled multiple elements?
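For the last question, one common answer is a grid-stride loop: launch fewer threads than elements and have each thread stride by the total thread count. A hedged Python emulation of that ownership pattern (small illustrative launch sizes, not tuned values):

```python
import numpy as np

def grid_stride_add(x, y, block_dim=4, grid_dim=2):
    """Emulates a CUDA grid-stride loop: the launch is smaller than the data,
    so each thread covers i, i + stride, i + 2*stride, ..."""
    n = x.shape[0]
    out = np.empty_like(x)
    stride = grid_dim * block_dim              # ~ gridDim.x * blockDim.x
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            i = block_idx * block_dim + thread_idx
            while i < n:                       # this thread owns every stride-th element
                out[i] = x[i] + y[i]
                i += stride
    return out
```

Ownership changes from "one thread, one element" to "one thread, one arithmetic progression of elements", and the bounds check moves into the loop condition.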
8. Implementation Checklist
- Confirm the reference implementation
- Fill in the Triton masked loads, add, and store
- Fill in the CUDA thread ownership and store
- Test small and non-multiple-of-block-size shapes
- Benchmark bandwidth on larger vectors
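For the benchmarking step, effective bandwidth for vector add counts three transfers per element (read x, read y, write out). A rough timing sketch, using the NumPy add as a stand-in for whichever kernel is under test (the function name and default size are illustrative; a real GPU measurement would need device synchronization around the timed region):

```python
import time
import numpy as np

def measure_bandwidth_gbs(n=1 << 24, dtype=np.float32):
    """Effective bandwidth estimate: 3 * n * itemsize bytes moved / elapsed seconds."""
    x = np.random.rand(n).astype(dtype)
    y = np.random.rand(n).astype(dtype)
    start = time.perf_counter()
    out = x + y                        # stand-in for the kernel under test
    elapsed = time.perf_counter() - start
    bytes_moved = 3 * n * x.itemsize   # two loads + one store per element
    return bytes_moved / elapsed / 1e9
```

Comparing this number against the device's peak memory bandwidth tells you how close the kernel is to being purely bandwidth-bound.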