Ah, I understand. partial is 1/8th of the sum.
Why is x[i] a different value for each program instance? The caller provides an array into the second parameter, and then ISPC creates the program instances using the same values across them. I don't see how different values of x[i] could mess the sum up since they'd get the same value after the for loop.
Oh, since sum is uniform, the problem is that it would end up different in each instance because the x[i] values differ across instances, which breaks the uniform condition. In that case, I'm assuming it would be possible to have parallelism if sum were not uniform?
As they must be written somewhere in memory, is there a way we can look at each of the partial sums? Or is that information hidden from the user?
float partial is initialized for each program instance, and each instance has its own copy: they all start at 0.0f.
In each iteration, each instance adds its own x[i] to its own partial, which started at 0.0f.
Then reduce_add adds together all of the partial sums; it has access to all N instance-local float partial values.
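To make the per-instance copies concrete, here is a rough plain-C emulation of the pattern described above. It is only a sketch: GANG_SIZE of 8 is an assumption (the real gang width depends on the ISPC target), and reduce_add_emulated and sum_with_partials are hypothetical names, not ISPC functions. The array of lanes stands in for the varying partial, and the optional output parameter shows what each instance was holding just before the reduction.

```c
#include <assert.h>
#include <stddef.h>

#define GANG_SIZE 8 /* assumed gang width; the real value depends on the ISPC target */

/* Hypothetical stand-in for ISPC's reduce_add: adds up one value per lane. */
static float reduce_add_emulated(const float lanes[GANG_SIZE]) {
    float total = 0.0f;
    for (int lane = 0; lane < GANG_SIZE; lane++)
        total += lanes[lane];
    return total;
}

/* Emulates:  float partial = 0.0f;
 *            foreach (i = 0 ... n) partial += x[i];
 *            sum = reduce_add(partial);
 * If partials_out is non-NULL, the per-lane partials are copied into it. */
static float sum_with_partials(const float *x, int n, float partials_out[GANG_SIZE]) {
    assert(n >= 0);
    float partial[GANG_SIZE] = {0.0f}; /* one private copy per program instance */

    for (int base = 0; base < n; base += GANG_SIZE) {
        for (int lane = 0; lane < GANG_SIZE; lane++) {
            int i = base + lane;       /* the element this lane handles */
            if (i < n)                 /* lanes past the end stay inactive */
                partial[lane] += x[i];
        }
    }

    if (partials_out != NULL) {
        for (int lane = 0; lane < GANG_SIZE; lane++)
            partials_out[lane] = partial[lane];
    }
    return reduce_add_emulated(partial); /* single cross-lane reduction */
}
```

With the inputs 1 through 16, lane 0 in this emulation ends up holding 1 + 9 = 10, lane 7 holds 8 + 16 = 24, and the reduced total is 136. The element-to-lane mapping here is an illustrative choice for the sketch, not something the thread specifies.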
Since we remove the uniform and now each program instance has to track its own value for sum, will we see a significant decrease in the speedup?
Kayvon highlighted that reduce_add combines all the partials of all the gang instances into one overall sum.
@joshcho I think it would be possible to call sum += reduce_add(x[i]) directly in each loop iteration with no errors. It would just be really inefficient to compute the sum that way: every iteration would then pay for a reduction across program instances, with the added communication that implies, so much of the benefit of vectorization would be lost. More on this can be found in https://ispc.github.io/perfguide.html.
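A rough way to see the cost difference is to count cross-lane reductions in a plain-C emulation. This is a sketch under assumptions: LANES of 8 is an assumed gang width, and cross_lane_add, sum_reduce_every_iteration, and sum_reduce_once are hypothetical names used only for illustration. The counter measures how often the lanes have to be combined, not actual ISPC performance.

```c
#include <assert.h>

#define LANES 8 /* assumed gang width */

static int reduction_count = 0; /* number of cross-lane reductions performed */

/* Hypothetical stand-in for reduce_add that also counts how often it is called. */
static float cross_lane_add(const float v[LANES]) {
    reduction_count++;
    float total = 0.0f;
    for (int lane = 0; lane < LANES; lane++)
        total += v[lane];
    return total;
}

/* Emulates: foreach (i = 0 ... n) sum += reduce_add(x[i]); */
static float sum_reduce_every_iteration(const float *x, int n) {
    assert(n >= 0);
    float sum = 0.0f;
    for (int base = 0; base < n; base += LANES) {
        float v[LANES];
        for (int lane = 0; lane < LANES; lane++) {
            int i = base + lane;
            v[lane] = (i < n) ? x[i] : 0.0f; /* inactive lanes contribute 0 */
        }
        sum += cross_lane_add(v); /* one reduction per gang-wide step */
    }
    return sum;
}

/* Emulates: foreach (i = 0 ... n) partial += x[i]; sum = reduce_add(partial); */
static float sum_reduce_once(const float *x, int n) {
    assert(n >= 0);
    float partial[LANES] = {0.0f};
    for (int base = 0; base < n; base += LANES) {
        for (int lane = 0; lane < LANES; lane++) {
            int i = base + lane;
            if (i < n)
                partial[lane] += x[i]; /* lane-local add, no communication */
        }
    }
    return cross_lane_add(partial); /* a single reduction at the very end */
}
```

For 64 inputs, both functions return the same total, but the first performs 8 reductions in this emulation while the second performs 1. The gap grows linearly with the array length.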
What would be a problem with having sum = reduce_add(x[i]) in the foreach loop?