Slide 24 of 63
jagriti

The slide in a nutshell: the image is processed in 256x32 chunks. The parallelism is over the outer loop: each thread takes a whole row of tiles and processes one 256x32 tile at a time, working on elements in 8-wide groups for SIMD utilization.

czh

Adding to @jagriti's comment: for each 256x32 chunk, perform two passes, using a chunk-sized temporary buffer to store the computations shared between output pixels.
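To make the two comments above concrete, here is a minimal sketch of the chunked two-pass pattern they describe. It assumes (since the slide itself isn't reproduced here) that the computation is a separable 3x3 box blur, the classic example for this optimization: pass 1 horizontally blurs the chunk's rows (plus a one-row halo) into a small temporary buffer, and pass 2 vertically blurs that buffer into the output, so each horizontal sum is computed once per chunk rather than once per output pixel. The function and constant names are hypothetical, not from the slide.

```c
#include <stdlib.h>

#define CHUNK_W 256  /* chunk sizes matching the slide's 256x32 tiles */
#define CHUNK_H 32

static int clampi(int v, int lo, int hi) {
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Two-pass 3x3 box blur, processed chunk by chunk.
 * The outer chunk-row loop is where thread-level parallelism would go;
 * the inner tx loops are what would be vectorized 8 elements at a time. */
void blur_chunked(const float *in, float *out, int w, int h) {
    /* temp holds CHUNK_H + 2 horizontally blurred rows (chunk + halo) */
    float *tmp = malloc((size_t)(CHUNK_H + 2) * (size_t)w * sizeof *tmp);
    for (int cy = 0; cy < h; cy += CHUNK_H) {
        for (int cx = 0; cx < w; cx += CHUNK_W) {
            int ch = (cy + CHUNK_H < h) ? CHUNK_H : h - cy;
            int cw = (cx + CHUNK_W < w) ? CHUNK_W : w - cx;
            /* pass 1: horizontal blur of rows cy-1 .. cy+ch (edge-clamped)
             * into the temporary buffer */
            for (int ty = 0; ty < ch + 2; ty++) {
                int y = clampi(cy + ty - 1, 0, h - 1);
                for (int tx = 0; tx < cw; tx++) {
                    int x  = cx + tx;
                    int xl = clampi(x - 1, 0, w - 1);
                    int xr = clampi(x + 1, 0, w - 1);
                    tmp[ty * w + tx] =
                        (in[y * w + xl] + in[y * w + x] + in[y * w + xr]) / 3.0f;
                }
            }
            /* pass 2: vertical blur of the temp rows into the output chunk */
            for (int ty = 0; ty < ch; ty++)
                for (int tx = 0; tx < cw; tx++)
                    out[(cy + ty) * w + (cx + tx)] =
                        (tmp[ty * w + tx] + tmp[(ty + 1) * w + tx]
                         + tmp[(ty + 2) * w + tx]) / 3.0f;
        }
    }
    free(tmp);
}
```

The chunk-sized temp buffer is the point of the two-pass structure: it stays small enough to be cache-resident while still letting horizontal work be shared across the 32 rows of a tile.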

leo

I was feeling exactly like the emojis when I tried to read through this slide. Thanks for the extra explanations @jagriti and @czh!

ghecko

I agree, those extra explanations really helped a lot!
