In this case, is the maximal thread block size 4 * 32? Can we schedule multiple thread blocks on the same core simultaneously if, say, thread block size is 2*32 or less?
I think the max thread block size here is 16 * 32 = 512, as we have 16 warps on each sub-core. We should also be able to schedule multiple thread blocks on the same core simultaneously as long as the block size is at most half of this max, which is 8 * 32 = 256.
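For anyone who wants to check the arithmetic in this thread, here is a rough host-side sketch. It assumes 16 warp slots and 32 threads per warp (the figures quoted above), and the helper names warpsPerBlock and residentBlocks are made up for illustration. It only counts warp slots, so it ignores other limits a real GPU imposes, such as shared memory and registers per block.

```cuda
#include <cstdio>

// Assumption from this thread: a warp is 32 threads wide.
constexpr int kWarpSize = 32;

// Number of warps a block occupies (rounded up to whole warps).
constexpr int warpsPerBlock(int blockThreads) {
    return (blockThreads + kWarpSize - 1) / kWarpSize;
}

// Hypothetical helper: how many blocks can be resident at once
// if the only limit is the number of warp slots.
constexpr int residentBlocks(int warpSlots, int blockThreads) {
    return warpSlots / warpsPerBlock(blockThreads);
}

int main() {
    const int warpSlots = 16;  // assumption taken from the comment above
    const int sizes[] = {16 * 32, 8 * 32, 2 * 32};
    for (int s : sizes) {
        printf("block of %d threads -> %d resident block(s)\n",
               s, residentBlocks(warpSlots, s));
    }
    return 0;
}
```

Under these assumptions a 16 * 32 block fills every warp slot by itself, while blocks of 8 * 32 threads or fewer leave room for at least one more block.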
Would a thread block on a warp ever be moved to a warp on another sub-core?
The if statement (if threadIdx.x < 2 ...) confused me at first. I now understand that it loads the 129th and 130th elements of the block's shared support array, since we only have 128 threads per block and each thread loads one element in the main load.
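In case it helps others, here is a minimal sketch of the pattern being described. It is my own reconstruction and not necessarily the exact code on the slide: the names convolve, support, and THREADS_PER_BLK, the 3-wide averaging window, and the host setup are all assumptions. It also assumes the output length is a multiple of 128 and that the input has two extra elements at the end.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLK 128

// Each output element is the average of 3 neighboring inputs.
// Assumes input holds (number of outputs + 2) floats and the
// number of outputs is a multiple of THREADS_PER_BLK.
__global__ void convolve(const float* input, float* output) {
    // 128 outputs per block need 130 inputs.
    __shared__ float support[THREADS_PER_BLK + 2];
    int index = blockIdx.x * blockDim.x + threadIdx.x;

    // Every thread loads one element: support[0..127].
    support[threadIdx.x] = input[index];
    // Threads 0 and 1 also load the two leftover elements:
    // support[128] and support[129].
    if (threadIdx.x < 2) {
        support[THREADS_PER_BLK + threadIdx.x] = input[index + THREADS_PER_BLK];
    }
    // Wait until the whole support array is filled before reading it.
    __syncthreads();

    float result = 0.0f;
    for (int i = 0; i < 3; i++) {
        result += support[threadIdx.x + i];
    }
    output[index] = result / 3.0f;
}

int main() {
    const int N = 1024;  // number of outputs, a multiple of 128
    float *input, *output;
    cudaMallocManaged(&input, (N + 2) * sizeof(float));
    cudaMallocManaged(&output, N * sizeof(float));
    for (int i = 0; i < N + 2; i++) input[i] = (float)i;

    convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(input, output);
    cudaDeviceSynchronize();

    printf("output[0] = %f, output[%d] = %f\n", output[0], N - 1, output[N - 1]);
    cudaFree(input);
    cudaFree(output);
    return 0;
}
```

Without the extra two loads, the last two threads of each block would read support entries that no thread had written.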
Is it better to use shared memory or just redundant copies per thread when the data is small?