The slide in a nutshell: we produce the output in 256x32 chunks. The parallelism is over the outer for loop: each thread takes all the tiles in one row of tiles and processes them one 256x32 tile at a time (working on groups of 8 elements to fill the 8-wide SIMD lanes).
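Roughly, the loops look like this. A minimal sketch: only the 256x32 tiling, the thread-per-row-of-tiles split, and the 8-wide inner grouping come from the slide; the image size, the OpenMP pragma, and the process8 kernel are stand-ins of mine:

    // Assumed image size; only the 256x32 tile shape is from the slide.
    constexpr int WIDTH = 1024, HEIGHT = 512;
    constexpr int TILE_W = 256, TILE_H = 32;

    // Stand-in for the real 8-wide SIMD body: touches pixels [x, x+8) of row y.
    inline void process8(float* img, int x, int y) {
        for (int i = 0; i < 8; i++) img[y * WIDTH + x + i] *= 2.0f;
    }

    void render(float* img) {
        // Parallelism lives on the outer loop: each thread owns a row of tiles...
        #pragma omp parallel for
        for (int ty = 0; ty < HEIGHT / TILE_H; ty++) {
            // ...and walks across that row one 256x32 tile at a time.
            for (int tx = 0; tx < WIDTH / TILE_W; tx++) {
                for (int y = ty * TILE_H; y < (ty + 1) * TILE_H; y++)
                    for (int x = tx * TILE_W; x < (tx + 1) * TILE_W; x += 8)
                        process8(img, x, y);  // 8 pixels per SIMD-width group
            }
        }
    }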
czh
Adding to @jagriti’s comment: for each 256x32 chunk, we perform two passes, using a temporary buffer to store the shared intermediate computations.
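Something like this, assuming a separable (two-pass) 3x3 box blur as the computation, which is the usual example for this pattern; the kernel, image size, and names are my guesses, and only the chunk size and the pass-1-into-temp / pass-2-from-temp structure come from the slide:

    #include <algorithm>

    constexpr int W = 1024, H = 512;          // assumed image size
    constexpr int TILE_W = 256, TILE_H = 32;  // chunk size from the slide

    // Process one 256x32 chunk whose top-left corner is (x0, y0).
    void blur_chunk(const float* in, float* out, int x0, int y0) {
        float tmp[TILE_H + 2][TILE_W];  // +2 halo rows for the vertical pass

        // Pass 1: horizontal blur of the chunk's rows (plus one halo row
        // above and below), written into the chunk-sized temp buffer.
        for (int y = 0; y < TILE_H + 2; y++) {
            int sy = std::clamp(y0 + y - 1, 0, H - 1);
            for (int x = 0; x < TILE_W; x++) {
                int xm = std::max(x0 + x - 1, 0);
                int xp = std::min(x0 + x + 1, W - 1);
                tmp[y][x] = (in[sy*W + xm] + in[sy*W + x0 + x] + in[sy*W + xp]) / 3.f;
            }
        }

        // Pass 2: vertical blur over the temp buffer produces the final chunk.
        for (int y = 0; y < TILE_H; y++)
            for (int x = 0; x < TILE_W; x++)
                out[(y0+y)*W + x0 + x] = (tmp[y][x] + tmp[y+1][x] + tmp[y+2][x]) / 3.f;
    }

The point of the temp buffer is that each horizontally blurred row is computed once and then reused by the three output rows that read it, and at 256x34 floats (~34 KB) the buffer stays resident in cache.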
leo
I was feeling exactly like the emojis when I tried to read through this slide. Thanks for the extra explanations @jagriti and @czh!
ghecko
I agree, those extra explanations really helped a lot!