When we exceed 1MM elements, the number of elements in the base array would be greater than 32, so we would need more than one warp to execute? Is that what the line in the slide means?
shivalgo
Also, aren't there 64 thread blocks, so that we can process 64*32 = 2048 elements at a time?
pizza
@shivalgo, for your second question: I'm not fully sure, but I think a warp is a set of 32 threads within a thread block, and each thread block can contain many warps. So we can probably process a lot more than num_thread_blocks*32 elements at a time.
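To make the arithmetic concrete, here is a rough sketch in Python (plain arithmetic, not GPU code). The 64 blocks come from the question above and 32 is the warp size. The 128 threads per block is a made-up value for illustration, not something from the slide, and the helper names are hypothetical. Also note that the total is the number of threads launched; how many actually run at the same instant depends on the hardware.

```python
WARP_SIZE = 32  # threads per warp


def warps_per_block(threads_per_block: int) -> int:
    """Warps needed to hold one block's threads, rounded up."""
    return (threads_per_block + WARP_SIZE - 1) // WARP_SIZE


def threads_launched(num_blocks: int, threads_per_block: int) -> int:
    """Total threads across all blocks (one element per thread assumed)."""
    return num_blocks * threads_per_block


num_blocks = 64          # from the question above
threads_per_block = 128  # hypothetical value, chosen only for illustration

print(warps_per_block(threads_per_block))               # 4
print(threads_launched(num_blocks, WARP_SIZE))          # 2048 (one warp per block)
print(threads_launched(num_blocks, threads_per_block))  # 8192
```

With one warp per block you get the 2048 from the question; with four warps per block the same 64 blocks cover 8192 elements.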