// Workbook-local CUDA sketch for tiled matmul.
//
// TODO(student):
// 1. Choose a block tile size, for example 16x16 or 32x32.
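// 2. Load one A tile and one B tile into shared memory (__shared__ arrays).
// 3. Synchronize with __syncthreads() so both tiles are fully loaded.
// 4. Accumulate partial products from the shared tiles into a per-thread register.
// 5. Synchronize again before loading the next tile, so no thread overwrites
//    shared memory that another thread is still reading.
// 6. After the last tile, store the final C element or tile to global memory.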
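
// ---------------------------------------------------------------------------
// A minimal sketch of the steps above, not a reference solution.
// Assumptions (none of these come from the workbook): row-major float
// matrices, C (MxN) = A (MxK) * B (KxN), one thread per C element, and the
// hypothetical names TILE, tiledMatmul, and the host driver in main().
// ---------------------------------------------------------------------------
#include <cmath>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // Step 1: 16x16 block tile (256 threads per block).

__global__ void tiledMatmul(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // row of C owned by this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // column of C owned by this thread
    float acc = 0.0f;

    int numTiles = (K + TILE - 1) / TILE;       // ceil(K / TILE)
    for (int t = 0; t < numTiles; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;

        // Step 2: each thread loads one element of each tile; out-of-range
        // elements are zero-padded so they do not affect the sum.
        As[threadIdx.y][threadIdx.x] =
            (row < M && aCol < K) ? A[row * K + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < K && col < N) ? B[bRow * N + col] : 0.0f;

        __syncthreads();  // Step 3: tiles are complete before anyone reads them.

        // Step 4: accumulate this tile's partial products.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];

        __syncthreads();  // Step 5: all reads finish before the next load.
    }

    // Step 6: store the result, guarding against partial edge tiles.
    if (row < M && col < N) C[row * N + col] = acc;
}

// Hypothetical host driver: runs the kernel on a small problem whose sizes
// are deliberately not multiples of TILE, then compares with a CPU loop.
int main() {
    const int M = 37, N = 45, K = 50;
    std::vector<float> hA(M * K), hB(K * N), hC(M * N);
    for (int i = 0; i < M * K; ++i) hA[i] = float(i % 7) - 3.0f;
    for (int i = 0; i < K * N; ++i) hB[i] = float(i % 5) - 2.0f;

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, hA.size() * sizeof(float));
    cudaMalloc(&dB, hB.size() * sizeof(float));
    cudaMalloc(&dC, hC.size() * sizeof(float));
    cudaMemcpy(dA, hA.data(), hA.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), hB.size() * sizeof(float), cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((N + TILE - 1) / TILE, (M + TILE - 1) / TILE);
    tiledMatmul<<<grid, block>>>(dA, dB, dC, M, N, K);

    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }
    cudaMemcpy(hC.data(), dC, hC.size() * sizeof(float), cudaMemcpyDeviceToHost);

    // CPU reference check.
    float maxErr = 0.0f;
    for (int i = 0; i < M; ++i)
        for (int j = 0; j < N; ++j) {
            float ref = 0.0f;
            for (int k = 0; k < K; ++k) ref += hA[i * K + k] * hB[k * N + j];
            maxErr = std::fmax(maxErr, std::fabs(ref - hC[i * N + j]));
        }
    std::printf("max abs error vs CPU reference: %g\n", maxErr);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return maxErr < 1e-3f ? 0 : 1;
}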