juliob

So is the idea that with these SIMD instructions, the processor is effectively adding / multiplying / operating on 8 numbers in a single instruction, as opposed to doing it in 8 separate instructions?
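
To make the "8 numbers per instruction" picture concrete, here is a minimal sketch in plain C that models the lanes with an inner loop instead of using real intrinsics, so it compiles anywhere. The names `add_scalar`, `add_vector_model`, and `LANES` are made up for illustration; on AVX hardware each outer iteration of the model would correspond to one vector add such as `_mm256_add_ps`.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8  /* a 256-bit register holds 8 32-bit floats */

/* Scalar version: one add per loop iteration. */
void add_scalar(const float *a, const float *b, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

/* Model of the 8-wide version: each outer iteration stands in for ONE
 * vector add, and the inner loop is the work the hardware does across
 * its lanes at once. Assumes n is a multiple of LANES.
 * Returns how many vector adds were issued. */
size_t add_vector_model(const float *a, const float *b, float *out, size_t n) {
    assert(n % LANES == 0);
    size_t vector_adds = 0;
    for (size_t i = 0; i < n; i += LANES) {
        for (size_t lane = 0; lane < LANES; lane++) {
            out[i + lane] = a[i + lane] + b[i + lane];
        }
        vector_adds++;
    }
    return vector_adds;
}
```

For 16 elements the scalar loop performs 16 adds while the model issues 2 vector adds, and both produce the same output.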

nassosterz

@juliob I believe that is the idea. It got me looking into when SIMD is actually applicable, and it turns out that it mainly helps when the input data can be vectorized. Moreover, to get the most out of SIMD it helps to know something about the input: as we saw with the if / else example in lecture, there can be moments where only one of the 8 computations actually matters. As we discussed, reorganizing the input (and padding it) helps maximize the benefits.

stao18

What happens if you write a program with SIMD instructions but the processor doesn't support SIMD? Would it know how to convert it back to int/float etc.?

tonycai

How would we handle conditional statements with AVX intrinsics? It seems like AVX intrinsics only support applying the exact same operations on a contiguous block of memory.
If we don't have good cache locality in data access, does that mean parallelism suffers tremendously (since only one ALU will do work at a time)?
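
One common way to handle a conditional in SIMD code is masking: every lane executes both branches, and a per-lane mask picks which result to keep. The sketch below models that in plain C rather than with real intrinsics; `branch_masked_model` and `LANES` are hypothetical names, and the per-lane loops stand in for a vector compare and a blend (for example `_mm256_cmp_ps` and `_mm256_blendv_ps` on AVX).

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8

/* Reference scalar code: if (x > 0) out = 2 * x; else out = -x; */
void branch_scalar(const float *x, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (x[i] > 0.0f) {
            out[i] = 2.0f * x[i];
        } else {
            out[i] = -x[i];
        }
    }
}

/* Model of masked execution on one 8-lane vector.
 * Returns the number of lanes whose mask bit was set. */
int branch_masked_model(const float x[LANES], float out[LANES]) {
    assert(x != NULL && out != NULL);
    int mask[LANES];
    float then_val[LANES];
    float else_val[LANES];
    int active = 0;

    for (int lane = 0; lane < LANES; lane++) {  /* vector compare */
        mask[lane] = x[lane] > 0.0f;
    }
    for (int lane = 0; lane < LANES; lane++) {  /* "then" branch, all lanes */
        then_val[lane] = 2.0f * x[lane];
    }
    for (int lane = 0; lane < LANES; lane++) {  /* "else" branch, all lanes */
        else_val[lane] = -x[lane];
    }
    for (int lane = 0; lane < LANES; lane++) {  /* blend by mask */
        out[lane] = mask[lane] ? then_val[lane] : else_val[lane];
        active += mask[lane];
    }
    return active;
}
```

Both branches cost a full vector operation no matter how many lanes need them, which is why a mask with only one lane set wastes most of the hardware.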

jennaruzekowicz

What is the ideal vector size here? In typical computers/systems, how often do we see the exact same instruction needing to be applied to a vector of items, and what is that typical length?

leo

Since this is essentially performing the same sequence of instructions on 8 numbers/data points at a time, what happens if the number of data points is not a multiple of 8? Since every lane follows the same code, wouldn't there be a segfault or something similar? I'm assuming it should be handled by the caller using intrinsics, but maybe it is dealt with at the hardware level?

sanjayen

@leo, I believe there are ways to handle this issue at the software level. We could pad the number of data points to be a multiple of 8 by adding dummy elements that we don't care about. I'm not sure about the hardware level, but we could potentially specify a size in our intrinsics calls that tells the ALUs to pad/bitmask as appropriate to deal with missing elements. In any case, I believe the course staff mentioned that this issue will come up on Assignment 1!
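
As a rough sketch of the software-level options, the plain C below shows two of them: rounding the buffer length up to a multiple of 8 for padding, and processing full groups of 8 followed by a scalar loop for the leftovers. The names `padded_length`, `scale_with_tail`, and `LANES` are made up for illustration, and the inner lane loop stands in for one vector multiply. A full 8-wide load that runs past the end of an array is an out-of-bounds access, which may or may not fault, so the code never touches elements at index `n` or beyond.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8

/* Option 1: round n up to the next multiple of LANES, e.g. to size a
 * padded buffer whose extra elements are dummies we never read back. */
size_t padded_length(size_t n) {
    return (n + LANES - 1) / LANES * LANES;
}

/* Option 2: multiply every element by factor, using full groups of
 * LANES first and a scalar loop for the remaining n % LANES elements. */
void scale_with_tail(float *data, size_t n, float factor) {
    assert(data != NULL || n == 0);
    size_t full = n - (n % LANES);

    for (size_t i = 0; i < full; i += LANES) {
        /* stands in for one 8-wide vector multiply */
        for (size_t lane = 0; lane < LANES; lane++) {
            data[i + lane] *= factor;
        }
    }
    for (size_t i = full; i < n; i++) {  /* scalar tail */
        data[i] *= factor;
    }
}
```

With 11 elements, the first 8 go through the vector path and the last 3 through the tail loop. AVX also offers masked loads and stores (`_mm256_maskload_ps`, `_mm256_maskstore_ps`) as a third option, but the mask still has to be built by the caller.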

subscalar

@stao18, if you attempt to run a program with SIMD instructions on a processor that does not understand SIMD instructions, the processor should let you (or really the OS) know with an exception. By default, your program will crash, although maybe on Unix-like systems you can intercept and handle the signal. Check out the "invalid opcode" exception on x86 and "SIGILL" for the signal.

huangda

If there is a single value that carries across the whole list, for example when each loop iteration multiplies by the product from the previous iteration, does the benefit of SIMD break down entirely? As in, does it become a completely sequential operation that uses only one lane of the SIMD unit?
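
One way to look at this, sketched in plain C under some assumptions: if only the final product is needed (a reduction), the multiplications can be regrouped into 8 independent running products, one per lane, that are combined at the end. The names `product_sequential`, `product_lanes_model`, and `LANES` are hypothetical, and the inner lane loop stands in for one vector multiply. This regrouping changes the order of floating-point operations, so results can differ in rounding in general, and it does not help if every intermediate running product is needed.

```c
#include <assert.h>
#include <stddef.h>

#define LANES 8

/* Sequential form: each iteration depends on the previous result. */
float product_sequential(const float *x, size_t n) {
    float acc = 1.0f;
    for (size_t i = 0; i < n; i++) {
        acc *= x[i];
    }
    return acc;
}

/* Regrouped form: lane k accumulates x[k], x[k + 8], x[k + 16], ...
 * The lanes do not depend on each other inside the main loop. */
float product_lanes_model(const float *x, size_t n) {
    assert(x != NULL || n == 0);
    float partial[LANES];
    size_t full = n - (n % LANES);

    for (size_t lane = 0; lane < LANES; lane++) {
        partial[lane] = 1.0f;
    }
    for (size_t i = 0; i < full; i += LANES) {
        /* stands in for one 8-wide vector multiply */
        for (size_t lane = 0; lane < LANES; lane++) {
            partial[lane] *= x[i + lane];
        }
    }

    float acc = 1.0f;
    for (size_t lane = 0; lane < LANES; lane++) {  /* combine the lanes */
        acc *= partial[lane];
    }
    for (size_t i = full; i < n; i++) {            /* scalar tail */
        acc *= x[i];
    }
    return acc;
}
```

Only the final combine step and the tail are sequential here; the main loop keeps all 8 lanes busy.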