If we do assign each particle to a CUDA thread, and now have 16x less contention, how does this compare to the fifth solution in terms of speed and efficiency? I would assume it is slower, but how much slower?
crs
I don't know if its necessarily slower in every case. It probably depends on number of particles and amount of overlap.
If we do assign each particle to a CUDA thread, and now have 16x less contention, how does this compare to the fifth solution in terms of speed and efficiency? I would assume it is slower, but how much slower?