// Workbook-local CUDA sketch for row softmax. // // Reflection prompt: // Softmax is usually bandwidth-bound because the math is cheap but the rows are read and written a lot. // Keep track of how many global-memory passes your implementation needs. // TODO(student): // 1. Assign one block or block tile to a row. // 2. Compute the row max. // 3. Compute the sum of exp(x - row_max). // 4. Normalize the row.