Task 01: Vector Add

1. Problem Statement

Implement out[i] = x[i] + y[i] in both Triton and CUDA, then compare both against the PyTorch reference.
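A minimal reference, sketched in plain Python as a stand-in for the PyTorch reference x + y (the function name vector_add_ref is illustrative):

```python
def vector_add_ref(x, y):
    """Elementwise out[i] = x[i] + y[i]; mirrors the PyTorch reference x + y."""
    assert len(x) == len(y), "inputs must have identical shapes"
    return [xi + yi for xi, yi in zip(x, y)]

print(vector_add_ref([1.0, 2.0], [10.0, 20.0]))  # [11.0, 22.0]
```

Both the Triton and CUDA kernels should match this output exactly for float inputs.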

2. Expected Input/Output Shapes

  • Input: two tensors with identical 1D or flattened shapes
  • Output: one tensor with the same shape

3. Performance Intuition

Vector add performs so little arithmetic per element that kernel launch overhead dominates at small sizes and memory bandwidth dominates at large ones. It is a good place to learn indexing before the math becomes interesting.
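The bandwidth intuition can be made concrete: each element moves three values (two reads, one write), so float32 vector add streams 12 bytes per element. A sketch of the effective-bandwidth formula (the timing in the example is illustrative, not a measured number):

```python
def effective_bandwidth_gbs(n_elements, elapsed_s, bytes_per_elem=4):
    # vector add moves 3 values per element: read x, read y, write out
    total_bytes = 3 * n_elements * bytes_per_elem
    return total_bytes / elapsed_s / 1e9

# e.g. 2**26 float32 elements in 1 ms -> ~805 GB/s effective bandwidth
print(effective_bandwidth_gbs(2**26, 1e-3))
```

Comparing this number against the GPU's peak memory bandwidth tells you how close the kernel is to the roofline.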

4. Memory Access Discussion

This kernel should read x[i] and y[i] once and write out[i] once. The main thing to inspect is whether neighboring threads (CUDA) or lanes within a Triton block access neighboring elements, i.e. whether the accesses coalesce into wide memory transactions.
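You can check the access pattern on paper by mapping each (block, thread) pair to the element it touches. With the usual ownership formula, consecutive threads in a block land on consecutive elements, which is exactly what coalescing wants. A pure-Python sketch (block size 4 is arbitrary):

```python
BLOCK = 4  # arbitrary block size for illustration

def element_owned(block_idx, thread_idx, block_dim=BLOCK):
    # the standard one-element-per-thread ownership formula
    return block_idx * block_dim + thread_idx

# threads 0..3 of block 2 touch elements 8..11: contiguous, hence coalesced
print([element_owned(2, t) for t in range(BLOCK)])  # [8, 9, 10, 11]
```

If this mapping produced strided or scattered indices instead, each warp's loads would split into multiple memory transactions.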

5. What Triton Is Abstracting

Triton lets you express one block of contiguous offsets with tl.program_id and tl.arange, then apply a mask on the tail block so it never reads or writes past the end.
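A pure-Python simulation of that pattern (offsets = pid * BLOCK + arange(BLOCK), mask = offsets < n), mirroring what the real kernel does with tl.program_id, tl.arange, and masked tl.load / tl.store; this sketch only models the indexing, not the GPU execution:

```python
def triton_style_add(x, y, BLOCK=4):
    n = len(x)
    out = [0.0] * n
    num_programs = (n + BLOCK - 1) // BLOCK        # grid size, ceil-div
    for pid in range(num_programs):                # one "program" per block
        offsets = [pid * BLOCK + i for i in range(BLOCK)]  # tl.arange analogue
        mask = [off < n for off in offsets]                # tail mask
        for off, m in zip(offsets, mask):
            if m:  # masked load, add, masked store
                out[off] = x[off] + y[off]
    return out

print(triton_style_add([1, 2, 3, 4, 5], [1, 1, 1, 1, 1]))  # [2, 3, 4, 5, 6]
```

Note that with n = 5 and BLOCK = 4, the second program's offsets 4..7 are masked down to just 4, which is the tail behavior the mask exists for.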

6. What CUDA Makes Explicit

CUDA makes you compute global_idx from block and thread indices yourself and write the boundary check explicitly.
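The same ownership and boundary check, simulated in Python (the grid is sized with ceiling division so the last block may be partial; names like cuda_style_add are illustrative):

```python
def cuda_style_add(x, y, block_dim=128):
    n = len(x)
    out = [0.0] * n
    grid_dim = (n + block_dim - 1) // block_dim  # ceil-div: last block may be partial
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            global_idx = block_idx * block_dim + thread_idx
            if global_idx < n:  # the explicit boundary check the kernel must write
                out[global_idx] = x[global_idx] + y[global_idx]
    return out

print(cuda_style_add([1, 2, 3], [10, 20, 30]))  # [11, 22, 33]
```

Dropping the `global_idx < n` check is exactly the out-of-bounds bug the final, partial block would trigger.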

7. Reflection Questions

  • What is the exact correspondence between program_id and blockIdx.x here?
  • Why is a mask or bounds check required on the final block?
  • How would the ownership change if one thread handled multiple elements?
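For the last question, one common ownership change is the grid-stride loop: each thread handles elements global_idx, global_idx + stride, global_idx + 2 * stride, ..., where the stride is the total number of launched threads. A hedged sketch of the resulting ownership set:

```python
def grid_stride_owned(global_idx, total_threads, n):
    """Elements one thread owns under a grid-stride loop (illustrative)."""
    return list(range(global_idx, n, total_threads))

# with 4 total threads and n = 10, thread 1 owns elements 1, 5, 9
print(grid_stride_owned(1, 4, 10))  # [1, 5, 9]
```

Note that consecutive threads still touch consecutive elements on each pass, so the accesses stay coalesced even though each thread now owns several elements.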

8. Implementation Checklist

  • Confirm the reference implementation
  • Fill in the Triton masked loads, add, and store
  • Fill in the CUDA thread ownership and store
  • Test small and non-multiple-of-block-size shapes
  • Benchmark bandwidth on larger vectors
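The shape cases in the checklist can be exercised with a small harness; vector_add here is a trivial Python stand-in for the kernel under test (the real harness would call the Triton or CUDA implementation and compare against the PyTorch reference):

```python
def vector_add(x, y):
    # stand-in for the kernel under test; swap in the Triton/CUDA call here
    return [a + b for a, b in zip(x, y)]

def check(n):
    x = [float(i) for i in range(n)]
    y = [float(2 * i) for i in range(n)]
    ref = [a + b for a, b in zip(x, y)]  # reference result
    assert vector_add(x, y) == ref, f"mismatch at n={n}"

# include sizes that are not multiples of a typical block size (e.g. 128)
for n in (0, 1, 127, 128, 129, 1000):
    check(n)
print("all shape cases passed")
```

The sizes straddling 128 are the ones most likely to expose a missing tail mask or boundary check.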