Do CUDA threads' instructions run on a SIMD vector ALU or on parallel scalar ALUs, then?


My mistake. The ALUs are SIMD f32 functional units.


So let me try to summarize the difference between CPU SIMD / ISPC and CUDA warps / threads:

CPU SIMD / ISPC:
- 1 core has a vector ALU (it can have more, but keep things simple).
- Vector instructions operate across all lanes (ISPC program instances) WITHIN 1 execution context (or do we call this 8 execution contexts for 8-wide SIMD?). So this is SIMD within 1 thread.

CUDA:
- 1 core has many scalar ALUs (or does it need vector ALUs?).
- If all threads in a warp execute the same instruction, 1 CUDA thread = 1 SIMD lane.
- 1 GPU can have up to NUM_WARPS * 32 execution contexts.
- Vector instructions operate across all lanes / threads in a warp. In fact, this is the only way a warp executes instructions, so there is necessarily SIMD divergence if the 32 threads' instructions are not the same.
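To make the divergence point concrete, here is a rough sketch of how I picture a warp executing a branch in lockstep. It is a plain Python emulation, not real CUDA and not a model of any particular hardware. The names `run_warp` and `WARP_SIZE` and the branch itself (`y = x*2` for even `x`, `y = x+1` otherwise) are made up for illustration. The assumption is that each side of the branch is issued once for the whole warp, with inactive lanes masked off.

```python
WARP_SIZE = 32  # assumed number of lanes (threads) per warp


def run_warp(xs):
    """Emulate one warp running `y = x*2 if x is even else x+1` in lockstep.

    Returns (ys, passes): the per-lane results and the number of masked
    vector instructions that had to be issued for the branch body.
    """
    assert len(xs) == WARP_SIZE
    # Every lane evaluates the branch condition on its own data.
    takes_then = [x % 2 == 0 for x in xs]
    ys = [None] * WARP_SIZE
    passes = 0

    # "then" side: issued once for the whole warp if any lane needs it.
    if any(takes_then):
        for lane in range(WARP_SIZE):
            if takes_then[lane]:  # lane is active in the mask
                ys[lane] = xs[lane] * 2
        passes += 1

    # "else" side: issued once if any lane took the other path.
    if not all(takes_then):
        for lane in range(WARP_SIZE):
            if not takes_then[lane]:  # lane is active in the mask
                ys[lane] = xs[lane] + 1
        passes += 1

    return ys, passes


# All lanes agree on the branch: one pass, every lane does useful work.
_, uniform_passes = run_warp([2 * i for i in range(WARP_SIZE)])

# Lanes disagree: both sides are issued, each with half the lanes idle.
_, divergent_passes = run_warp(list(range(WARP_SIZE)))
```

With all-even inputs the warp needs one pass. With mixed inputs it needs two, and in each pass half the lanes sit idle. That idle work is what I mean by SIMD divergence above.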

