Just as we can use cudaDeviceSynchronize() to synchronize the host with the device, we can use __syncthreads() inside a CUDA kernel to wait for all threads within a block to finish their operations.
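A common place this shows up is a shared-memory reduction, where every thread must finish writing before any thread reads. The sketch below is illustrative (the kernel name and BLOCK_SIZE are made up for this example, not taken from the post):

```cuda
#define BLOCK_SIZE 256

// Block-level sum reduction: __syncthreads() makes each round of
// shared-memory writes visible to the whole block before the next round.
__global__ void blockSum(const float *in, float *out, int n) {
    __shared__ float buf[BLOCK_SIZE];
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // all loads into shared memory must finish first

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();  // wait before the next step reads updated values
    }

    if (threadIdx.x == 0)
        out[blockIdx.x] = buf[0];  // one partial sum per block
}
```

Note that __syncthreads() must be reached by all threads in the block (or none), which is why it sits outside the `if (threadIdx.x < stride)` guard.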
amohamdy
I know that trying to get a function like __syncthreads() to work across different blocks would introduce deadlock situations, but I'm curious whether it would be helpful for some applications to have CUDA synchronization primitives that work across blocks, as long as either the CUDA compiler or the programmer can guarantee that at least one block is always free to make useful progress, or perhaps some primitives for priority scheduling in the block scheduler. I'm not sure if this would even be possible in CUDA.
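For what it's worth, CUDA 9+ does expose a grid-wide barrier through cooperative groups, under exactly the kind of guarantee described above: the kernel must be launched with cudaLaunchCooperativeKernel, and every block must be resident on the device simultaneously, which rules out the deadlock where some blocks wait on blocks that were never scheduled. A minimal sketch (the kernel name and two-phase computation are hypothetical, just to show the barrier):

```cuda
#include <cooperative_groups.h>
namespace cg = cooperative_groups;

// Hypothetical two-phase kernel: phase 2 in every block depends on
// phase-1 results produced by *other* blocks, so a grid-wide sync is needed.
__global__ void twoPhase(float *data, int n) {
    cg::grid_group grid = cg::this_grid();
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    if (i < n) data[i] *= 2.0f;               // phase 1, across all blocks

    grid.sync();                              // wait for every block, not just this one

    if (i < n) data[i] += data[(i + 1) % n];  // phase 2 reads phase-1 results
}

// Host side (sketch): the launch must go through the cooperative API,
// and the grid must be sized so all blocks fit on the device at once.
//   void *args[] = { &d_data, &n };
//   cudaLaunchCooperativeKernel((void *)twoPhase, gridDim, blockDim, args);
```

The residency requirement is enforced at launch time: cudaLaunchCooperativeKernel fails if the grid is too large for all blocks to be co-resident, which is how the runtime provides the "at least one block can always make progress" guarantee.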