The locking/unlocking in the inner for loop will result in high synchronization overhead. It is more efficient to compute a partial sum per worker and then add them up after the for loops.
alexder
This implementation is basically serial. As @spendharkar said, you should be able to join the partial sums after the loops have completed.
On the same note, I'm curious how this joining operation is performed. Can we think of it as similar to a map-reduce operation?
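
To make the partial-sum idea concrete, here is a rough sketch of how I picture it. It is not the code from the slide: `parallel_sum`, `worker`, `partials`, and the interleaved assignment of elements to workers are all names and choices I made up for illustration. Each worker accumulates into a private variable, writes its result once, and the partial sums are combined only after every worker has been joined.

```python
import threading


def parallel_sum(values, num_workers=4):
    """Sum `values` with one partial sum per worker, combined after the join.

    Hypothetical helper for illustration; not the slide's implementation.
    """
    partials = [0] * num_workers  # one slot per worker, so no lock is needed

    def worker(worker_id):
        local = 0  # private accumulator: no synchronization inside the loop
        # Interleaved assignment: worker w handles indices w, w + num_workers, ...
        for i in range(worker_id, len(values), num_workers):
            local += values[i]
        partials[worker_id] = local  # a single write per worker

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # every partial sum is final once all joins return

    # The "reduce" step: combine the per-worker results serially.
    return sum(partials)
```

Seen this way, the comparison to map-reduce seems reasonable: the per-worker loop plays the role of the map (with a local combine), and the final `sum(partials)` is the reduce. With only a handful of workers a serial combine is cheap; with many workers the combine could itself be done as a tree. Note that CPython threads will not actually run this loop in parallel because of the global interpreter lock, so the sketch only shows the structure, not a speedup.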