Slide 29 of 66
fromscratch

To answer the question on the slide, I think this is akin to the "test and test-and-set" optimization for reducing contention and coherence traffic. The inner loop on the right-hand side can often keep re-reading the locally cached value of l, whereas spinning directly on atomicCAS would repeatedly request exclusive/write access to the line, invalidating it in every other cache and generating lots of unnecessary traffic.
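(Not the slide's actual code, just a CPU-side sketch of the same idea in C11 atomics: the names spinlock_t, lock, and unlock are made up here, and the lock word plays the role of l on the slide.)

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_int l; } spinlock_t;  /* l == 1 means "held" */

void lock(spinlock_t *s) {
    while (true) {
        /* Inner "test" loop: a plain atomic load can be served from the
         * local cache's shared copy of the line, so waiters generate no
         * invalidation traffic while the lock is held by someone else. */
        while (atomic_load_explicit(&s->l, memory_order_relaxed) != 0)
            ; /* spin locally */
        /* Only attempt the expensive read-modify-write (the CAS /
         * test-and-set, which needs exclusive access to the line) once
         * the lock looks free. */
        int expected = 0;
        if (atomic_compare_exchange_weak_explicit(&s->l, &expected, 1,
                memory_order_acquire, memory_order_relaxed))
            return;
    }
}

void unlock(spinlock_t *s) {
    atomic_store_explicit(&s->l, 0, memory_order_release);
}
```

Spinning on atomicCAS alone is like running only the CAS line in a loop: every waiter keeps bouncing the line between caches even when the lock isn't available.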

(Aside: this code is just illustrative, but I'm not sure the code as written is strictly valid C/C++. I think we'd need to tell the compiler that l can change concurrently, say with an atomic load or a memory fence. Otherwise it could technically optimize the inner loop into an infinite loop whenever l is initially one. Then again, I suppose it's just an illustrative example.)

abraoliv

Curious about @fromscratch's aside note! Clearly one of the main technologies that makes up CUDA is the compiler that targets NVIDIA hardware. Would this compiler optimize out instructions like this that operate on shared memory? And what is a memory fence?

gsamp

@abraoliv, this slide explains memory fences: http://35.227.169.186/cs149/fall21/lecture/consistency/slide_49
