## Thread Hierarchy
Levels, smallest to largest: thread, block, (cluster, optional), grid.
> for a two-dimensional block of size `(Dx, Dy)`, the thread ID of a thread of index `(x, y)` is `(x + y * Dx)`; for a three-dimensional block of size `(Dx, Dy, Dz)`, the thread ID of a thread of index `(x, y, z)` is `(x + y * Dx + z * Dx * Dy)`
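
The formula quoted above can be checked on the host with a minimal sketch. `Dim3` and `linear_thread_id` are hypothetical names made up for this note (they only mimic the shape of CUDA's `dim3`/`threadIdx`), and the code is plain C++, not device code.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for CUDA's dim3 / threadIdx (illustration only).
struct Dim3 {
    unsigned x, y, z;
};

// Linear thread ID inside a block of size (Dx, Dy, Dz):
//   x + y * Dx + z * Dx * Dy
// For a 2D block pass dim.z = 1 and idx.z = 0, and the last term vanishes.
inline unsigned linear_thread_id(Dim3 idx, Dim3 dim) {
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}

// Small self-check with hand-computed values.
inline void check_linear_thread_id() {
    Dim3 block2d{4, 8, 1};
    Dim3 t2d{1, 2, 0};
    assert(linear_thread_id(t2d, block2d) == 9);    // 1 + 2 * 4

    Dim3 block3d{4, 8, 2};
    Dim3 t3d{1, 2, 3};
    assert(linear_thread_id(t3d, block3d) == 105);  // 1 + 2 * 4 + 3 * 4 * 8
}
```

Inside a kernel the same expression would be written with `threadIdx` and `blockDim` directly.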
## SIMT architecture (Single-Instruction, Multiple-Thread)
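A warp is a group of 32 threads.
Within a warp, one instruction is issued at a time. Threads whose instruction streams agree run in parallel; otherwise the divergent paths execute separately, one after another. So try to keep all threads in a warp on the same instruction stream. On recent architectures, however: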
> Starting with the NVIDIA Volta architecture, Independent Thread Scheduling allows full concurrency between threads, regardless of warp.
> threads can now diverge and reconverge at sub-warp granularity.
You can test this with the following code:
```cpp
#include <chrono>
#include <cstdlib>
#include <iostream>
#define T 1000000
__global__ void f(float *x) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    for (int t = 0; t < T; t++) {
        // Warp-uniform branch: every thread in a given warp takes the same path.
        if (i < (1 << 10)) {
            x[i] *= 1.1f;
        } else {
            x[i] += 1;
        }
        // Divergent branch: swap this in so even and odd lanes of each warp split.
        // if (i % 2 == 0) {
        //     x[i] *= 1.1f;
        // } else {
        //     x[i] += 1;
        // }
        if (x[i] > 1e6f) {
            x[i] = 1.0f;
        }
    }
}
int main(void) {
    int N = 1 << 20;
    float *x, *d_x;
    x = (float *)malloc(N * sizeof(float));
    cudaMalloc(&d_x, N * sizeof(float));
    for (int i = 0; i < N; i++) {
        x[i] = 1.0f;
    }
    cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice);
    auto start = std::chrono::steady_clock::now();
    f<<<N / 256, 256>>>(d_x);
    // Kernel launches are asynchronous: wait for the kernel before stopping the clock.
    cudaDeviceSynchronize();
    auto end = std::chrono::steady_clock::now();
    auto elapse = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::cout << "time used: " << elapse.count() << " ns" << '\n';
    cudaMemcpy(x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    free(x);
}
```
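
To see why the two branch conditions in the kernel differ, the sketch below counts, on the host, how many warps would contain threads on both sides of a condition. It is plain C++ rather than device code, `count_divergent_warps` is a hypothetical helper written for this note, and it assumes a 1D launch whose block size is a multiple of 32, so that consecutive global indices fall into the same warp.

```cpp
#include <cassert>
#include <functional>

constexpr int kWarpSize = 32;  // threads per warp

// Count warps in which `pred` is not uniform across the lanes.
// Assumption: 1D launch, block size a multiple of kWarpSize, so global
// indices [base, base + 32) belong to one warp.
inline int count_divergent_warps(int n_threads,
                                 const std::function<bool(int)>& pred) {
    int divergent = 0;
    for (int base = 0; base < n_threads; base += kWarpSize) {
        const bool first = pred(base);
        for (int lane = 1; lane < kWarpSize && base + lane < n_threads; ++lane) {
            if (pred(base + lane) != first) {
                ++divergent;  // lanes disagree, so this warp runs both paths
                break;
            }
        }
    }
    return divergent;
}

// Branch condition used in the kernel (uniform within every warp).
inline bool below_1024(int i) { return i < (1 << 10); }

// Commented-out alternative in the kernel (splits every warp).
inline bool is_even(int i) { return i % 2 == 0; }
```

With `N = 1 << 20` threads, `count_divergent_warps(N, below_1024)` gives 0, because the boundary at 1024 falls exactly between two warps, while `count_divergent_warps(N, is_even)` gives `N / 32`, because every warp contains both even and odd indices. How much of that difference shows up in the measured time depends on the GPU.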
## Summary