How do context switches work in CUDA? It seems like we generally want to avoid them to maximize arithmetic intensity.
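For concreteness, here is a minimal sketch (a hypothetical kernel, not one from the lecture) of the memory-bound case I have in mind: each thread does very little math per byte loaded, so while one warp waits on global memory the SM's warp scheduler switches to another resident warp. That per-cycle warp switch is cheap because every resident warp's registers stay on chip.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical AXPY kernel: 2 FLOPs per ~12 bytes of memory traffic,
// so arithmetic intensity is low and the kernel is memory bound.
__global__ void axpy(float a, const float *x, float *y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

int main() {
    const int n = 1 << 20;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    // 256 threads per block = 8 warps; having many resident warps gives
    // the scheduler something to switch to while other warps wait on memory.
    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    axpy<<<blocks, threads>>>(3.0f, x, y, n);
    cudaDeviceSynchronize();

    printf("y[0] = %f\n", y[0]);  // expect 5.0
    cudaFree(x);
    cudaFree(y);
    return 0;
}
```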
I thought that warps within a thread block could be scheduled across different SMs, as in the previous fictitious example where only 4 warps could run at once but the thread block needed 8 warps.