Previous | Next --- Slide 32 of 116
Back to Lecture Thumbnails
victor

How does the intrinsic function work under the hood? What would happen if we tried it with a conditional? Does that just make it suboptimal, or are there also correctness concerns?

probreather101

Isn't gcc capable of automatic SIMD parallelization with the appropriate compiler flags? What are the flags required for this and how does the code need to be structured for this to automatically be done? I assume that gcc would struggle to process poorly structured code and would thus be unable to parallelize, but it seems like obvious parallel opportunities should be recognized and compiled automatically by gcc.

noelma

I'm still not comfortable with the concept of AVX intrinsics? Simply put, are they calls a programmer can make in their code to take advantage of SIMDs? Will we be expected to know how to use intrinsics or at least be able to read them in this class? What are the best resources available on this topic?

kristinayige

What would be the significance of vector width for a vector program? Is it related to the number of iterations that we would need to make?

qwerty

My understanding is that the vector width (vector_width) reduces the number of iterations N that we would need to make by a factor of N / vector_width, because vector_width elements can be executed in parallel. This is also apparent in Program 2 of the first programming assignment.

spendharkar

My understanding of AVX intrinsics (the programming assignment helps a lot with this) - we are breaking up our work into vector of size vector_width and giving this vector as input to intrinsic functions (special vector instructions). I think of intrinsics as functions that take vector input. For conditionals, we use masks which could reduce the effectiveness of using SIMD if there is lane divergence. The vector width could affect lane divergence.

timothy

I just wanted to note that the intrinsics guide includes throughput, in terms of cycles per instruction! I was impressed by the _mm256_sqrt_ps (sqrt), which can do vector sqrt in only 6 cycles on Icelake, rather putting our homework implementation to shame!

Please log in to leave a comment.