Performance Optimization II: Locality, Communication, and Contention

Slide 19 of 90

spendharkar

Locking and unlocking inside the inner for loop will result in high synchronization overhead, because every iteration has to acquire the lock. It is more efficient to have each worker compute a private partial sum and then add the partial sums together after the for loops.
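A minimal sketch of the partial-sum idea, assuming the slide's loop accumulates array elements into one shared total under a lock (the slide code is not reproduced here, so the function name `parallel_sum`, the `double` element type, and the block partitioning are all hypothetical). Each thread sums its block into a local variable and takes the lock exactly once, so the number of lock acquisitions drops from one per element to one per worker.

```cpp
#include <algorithm>
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: sums `data` with `num_workers` threads.
// Each thread accumulates into a private variable and locks only once.
double parallel_sum(const std::vector<double>& data, std::size_t num_workers) {
    if (num_workers == 0) num_workers = 1;  // assumption: fall back to one worker

    double total = 0.0;
    std::mutex total_lock;
    std::vector<std::thread> workers;

    // Block partitioning: ceil(n / num_workers) elements per worker.
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&, w] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());

            double partial = 0.0;  // thread-private, no synchronization needed
            for (std::size_t i = begin; i < end; ++i) {
                partial += data[i];
            }

            // One lock acquisition per worker instead of one per element.
            std::lock_guard<std::mutex> guard(total_lock);
            total += partial;
        });
    }

    for (auto& t : workers) t.join();
    return total;
}
```

Because floating-point addition is not associative, the result can differ in the last bits from a serial sum, depending on the order in which workers finish.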

alexder

This implementation is basically serial. As @spendharkar notes, you should be able to join the partial sums after the loops have completed.

On the same note, I'm curious how this joining operation is performed. Can we think of it as similar to a map-reduce operation?
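One way to read the join as a map-reduce, sketched under assumptions (the name `map_reduce_sum` and the slot-per-worker layout are hypothetical, not taken from the slide): the "map" step has each worker turn its block into one partial sum stored in its own slot, and the "reduce" step combines the slots after all threads have joined. No lock is needed because no two threads write the same slot.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Hypothetical helper: map each block to a partial sum, then reduce.
double map_reduce_sum(const std::vector<double>& data, std::size_t num_workers) {
    if (num_workers == 0) num_workers = 1;  // assumption: fall back to one worker

    std::vector<double> partials(num_workers, 0.0);  // one slot per worker
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    // "Map": each worker writes only its own slot, so no lock is required.
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&data, &partials, chunk, w] {
            const std::size_t begin = std::min(w * chunk, data.size());
            const std::size_t end = std::min(begin + chunk, data.size());
            partials[w] = std::accumulate(
                data.begin() + static_cast<std::ptrdiff_t>(begin),
                data.begin() + static_cast<std::ptrdiff_t>(end), 0.0);
        });
    }
    for (auto& t : workers) t.join();

    // "Reduce": a serial pass over num_workers values after the join.
    return std::accumulate(partials.begin(), partials.end(), 0.0);
}
```

The reduce here is a serial loop, which is cheap when there are only a handful of workers; with many workers the same step could be done as a tree reduction instead.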
