Slide 37 of 74

The two parameters in convolve<<<...>>> are not passed to the function as arguments. They are the launch configuration (number of blocks, then threads per block), and they set the built-in variables gridDim and blockDim.
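
A minimal sketch of that point, assuming a hypothetical kernel report_dims and host helper read_dims (neither is from the slide). The kernel's only argument is an output pointer, yet the two launch parameters are visible inside it as gridDim and blockDim. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: its only argument is `out`.
// The launch parameters do not appear in this parameter list.
__global__ void report_dims(unsigned int* out) {
    // A single thread in the grid records the built-in variables.
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        out[0] = gridDim.x;   // set by the first launch parameter
        out[1] = blockDim.x;  // set by the second launch parameter
    }
}

// Hypothetical host helper: launches the kernel and copies the two values back.
void read_dims(unsigned int blocks, unsigned int threads, unsigned int dims[2]) {
    unsigned int* d_out = nullptr;
    cudaMalloc(&d_out, 2 * sizeof(unsigned int));

    // <<<number of blocks, threads per block>>>
    report_dims<<<blocks, threads>>>(d_out);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(dims, d_out, 2 * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
}

int main() {
    unsigned int dims[2] = {0, 0};
    read_dims(8, 128, dims);
    printf("gridDim.x = %u, blockDim.x = %u\n", dims[0], dims[1]);
    return 0;
}
```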


Here, to compute the output, each thread in the block loads 3 input[] elements from global device memory, so every block issues 3*128 = 384 global loads, even though neighbouring threads read overlapping elements. Because global memory accesses have high latency, CUDA also provides shared per-block memory, which is discussed in the next slide.
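
For reference, here is a sketch of the kind of kernel being described, reconstructed from this comment rather than copied from the slide. The 3-element average, the output array name, the bounds guard, and the host helper run_convolve are assumptions for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLK 128

// Each thread reads input[index], input[index+1], input[index+2] directly
// from global memory, so one block issues 3 * 128 = 384 global loads even
// though it touches only 128 + 2 = 130 distinct elements.
__global__ void convolve(int N, const float* input, float* output) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    if (index >= N) return;  // guard added for safety (an assumption)

    float result = 0.0f;
    for (int i = 0; i < 3; i++)
        result += input[index + i];  // global memory load
    output[index] = result / 3.0f;
}

// Hypothetical host helper. Expects in.size() == N + 2, with N a
// multiple of THREADS_PER_BLK, and returns the N averaged outputs.
std::vector<float> run_convolve(const std::vector<float>& in) {
    const int N = static_cast<int>(in.size()) - 2;
    std::vector<float> out(N);

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, in.size() * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemcpy(d_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);

    // 1-D launch: N / THREADS_PER_BLK blocks of THREADS_PER_BLK threads.
    convolve<<<N / THREADS_PER_BLK, THREADS_PER_BLK>>>(N, d_in, d_out);

    cudaMemcpy(out.data(), d_out, N * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return out;
}

int main() {
    std::vector<float> in(1024 + 2, 1.0f);  // assumed problem size
    std::vector<float> out = run_convolve(in);
    printf("output[0] = %f\n", out[0]);
    return 0;
}
```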


I don't quite understand what you mean by utilizing the 1-D aspects of the thread grid size/thread index... I would love some clarification!


The difference between a device and host function was confusing for me at first, but made a lot more sense after starting A3 - a host function is ordinary CPU code, like the "main" code that executes on the calling thread, and a device function is code that runs on the GPU once per CUDA thread (loosely similar to a child thread that the host code spawns). It's actually a much clearer way of understanding what each of the threads does!
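
A small sketch of that split, using hypothetical names (square, square_all, squares_on_gpu) that are not from the lecture. Strictly speaking, the per-thread entry point launched from the host is marked __global__, while __device__ marks helpers that can only be called from GPU code. Error checking is omitted.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// __device__: callable only from GPU code; runs inside the calling CUDA thread.
__device__ int square(int x) { return x * x; }

// __global__: a kernel. Launched from the host, executed once per CUDA thread.
__global__ void square_all(int* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] = square(i);
}

// Host function: ordinary CPU code that allocates memory, launches the
// kernel with one block of n threads, and copies the results back into h.
void squares_on_gpu(int* h, int n) {
    int* d = nullptr;
    cudaMalloc(&d, n * sizeof(int));
    square_all<<<1, n>>>(d);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);
}

int main() {
    int h[8] = {0};
    squares_on_gpu(h, 8);
    for (int i = 0; i < 8; i++)
        printf("%d ", h[i]);
    printf("\n");
    return 0;
}
```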


Something that was confusing to me in lecture was the use of N/THREADS_PER_BLK and THREADS_PER_BLK here, when we had previously used dim3 parameters. To clarify,

dim3 is an integer vector type based on uint3 that is used to specify dimensions. When defining a variable of type dim3, any component left unspecified is initialized to 1.

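Passing plain integers such as N/THREADS_PER_BLK and THREADS_PER_BLK in the launch converts each one to a dim3 whose y and z components are 1, so we're only working with a 1-D version of the CUDA grid. @ccheng18, this might apply to your question as well.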
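
A quick host-side check of the dim3 default, as a sketch. The values of N and THREADS_PER_BLK here are assumed for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define THREADS_PER_BLK 128

int main() {
    const int N = 1024;  // assumed problem size

    // Only the x component is given; y and z default to 1.
    dim3 numBlocks(N / THREADS_PER_BLK);
    dim3 threadsPerBlock(THREADS_PER_BLK);

    printf("numBlocks       = (%u, %u, %u)\n",
           numBlocks.x, numBlocks.y, numBlocks.z);
    printf("threadsPerBlock = (%u, %u, %u)\n",
           threadsPerBlock.x, threadsPerBlock.y, threadsPerBlock.z);
    return 0;
}
```

So a launch written with two plain integers behaves like a launch with these two dim3 values: a 1-D grid of 1-D blocks.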

See more at
