Slide 85 of 116
Kecleon

I think it wouldn't. The code executes only a few arithmetic instructions per load/store of the inputs, so it is bound by memory access, and with a limited number of threads it's hard to hide that latency. Additionally, loading millions of elements all at once would flood the L1 and L2 caches. A better way to write this logic would be to break inputs A and B into smaller chunks.
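To make the chunking idea concrete, here is a minimal sketch. The slide's actual computation isn't reproduced here, so this assumes a hypothetical two-step operation (add A and B, then scale the result); the function name `add_then_scale_chunked` and the default chunk size of 4096 elements are my own choices, not anything from the lecture. The point is only that both steps run on one chunk before moving to the next, so the chunk can still be in cache for the second step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: out[i] = (a[i] + b[i]) * scale, computed chunk by chunk.
// The chunk size is an assumption; in practice it would be tuned to the cache.
std::vector<float> add_then_scale_chunked(const std::vector<float>& a,
                                          const std::vector<float>& b,
                                          float scale,
                                          std::size_t chunk = 4096) {
    assert(a.size() == b.size());
    const std::size_t n = a.size();
    std::vector<float> out(n);
    if (chunk == 0) chunk = 1;  // avoid an infinite loop on a zero chunk size

    for (std::size_t start = 0; start < n; start += chunk) {
        const std::size_t end = std::min(start + chunk, n);

        // Step 1: add the two inputs for this chunk only.
        for (std::size_t i = start; i < end; ++i) {
            out[i] = a[i] + b[i];
        }
        // Step 2: scale the same chunk while it is likely still cached.
        for (std::size_t i = start; i < end; ++i) {
            out[i] *= scale;
        }
    }
    return out;
}
```

Without chunking, step 1 would sweep over all of `out` before step 2 starts, so for millions of elements the early results would probably be evicted by the time step 2 needs them. For an operation this simple you could also just fuse the two steps into one loop; the chunked form matters more when the steps can't be fused easily.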
