Was about to make the same comment -- this seems quite complex as we are using implementation details to implement our CUDA function -- as gohan said, it seems this only works because we know the program will be executed in 32 SIMD fashion, so after each if check/line, all of the CUDA threads have done the appropriate partial sums such that the next step of the algorithm (in a span sense) can run
yup...it was a bit tough to figure out why memory corruption wont happen here as we use the same ptr variable for dependent steps. Since CUDA may choose any thread block and thread to run at random, we need the idx mod 32 thing. However, I am not able to understand one thing:won't thread with idx = 34, let's say, once it quickly finishes lane>=1 condition immediately start working on lane>=2 condition? This means, even before the relevant RHS operands are made available from lane>=1 line, lane>=2 starts executing, and therefore leads to incorrect answers as a read happened before a write from an earlier step completed. Would this situation not require a _syncthreads() between the if statements?
Is it the case that this function returns the scan on 32-element array only on SIMD? In other words, it seems like the if(lane >= 1) part needs to be completed by all ALUs together before if(lane >= 2) is executed.
Exclusive scan is useful when we have a flags array (ones and zeros / true false values) and we want to figure out which indices to put elements that have a certain property.
Please log in to leave a comment.
This code took me a while to understand that CUDA runs threads in lock steps for each warp so there is no need for barriers.