More detail:
Simultaneous multi-threading involves executing instructions from two different threads in parallel on a core. In class, I mainly described and illustrated interleaved multi-threading, where each clock the core chooses one runnable thread and executes the next instruction in that thread's instruction stream using the core's execution resources.
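To make the interleaving concrete, here is a minimal sketch of that per-clock choice. The thread representation (plain instruction lists) and the round-robin selection policy are simplifications I'm inventing for illustration, not how real hardware is built:

```python
# Minimal sketch of interleaved multi-threading: each clock, the core
# picks one runnable thread and executes that thread's next instruction.
# (Illustrative only: threads are plain instruction lists, and the
# round-robin choice below is a made-up policy.)

def run_interleaved(threads, cycles):
    """threads: list of instruction lists; returns a (clock, thread, instr) trace."""
    pcs = [0] * len(threads)          # one program counter per execution context
    trace = []
    for clock in range(cycles):
        # choose one runnable thread (round-robin among unfinished threads)
        runnable = [i for i in range(len(threads)) if pcs[i] < len(threads[i])]
        if not runnable:
            break
        t = runnable[clock % len(runnable)]
        trace.append((clock, t, threads[t][pcs[t]]))
        pcs[t] += 1                   # advance that thread's instruction stream
    return trace

# Two two-instruction threads interleave one instruction per clock:
trace = run_interleaved([["a0", "a1"], ["b0", "b1"]], cycles=4)
```

Running this gives a strictly alternating trace: one instruction from one thread per clock, never two at once, which is exactly the distinction from simultaneous multi-threading below.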
To fast-forward a bit in the lecture, a more modern NVIDIA GPU, the V100 (see slide 62) is able to maintain state for up to 64 execution contexts (called "warps" in NVIDIA-speak) on its "SM" cores, and each clock it chooses up to four of those 64 threads to execute instructions from. Those four threads execute simultaneously on the core using four different sets of execution resources. So there is interleaved multi-threading in that the chip interleaves up to 64 execution contexts, and simultaneous multi-threading in that it chooses up to four of those contexts to run each clock.
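A toy model of that selection step might look like the following. The context count and issue width come from the V100 description above, but the round-robin policy is an invented illustration, not NVIDIA's actual scheduling logic:

```python
# Toy model of an SM that keeps state for many warp contexts and selects
# up to ISSUE_WIDTH of them to run each clock. The rotation policy below
# is an illustration only, not NVIDIA's real warp scheduler.

MAX_CONTEXTS = 64   # resident warp contexts per SM (V100-style)
ISSUE_WIDTH = 4     # warps selected to run simultaneously each clock

def issue_one_clock(runnable_warps, clock):
    """Pick up to ISSUE_WIDTH of the runnable warp contexts this clock."""
    assert len(runnable_warps) <= MAX_CONTEXTS
    if not runnable_warps:
        return []
    # rotate the starting point each clock so all warps make progress
    start = clock % len(runnable_warps)
    rotated = runnable_warps[start:] + runnable_warps[:start]
    return rotated[:ISSUE_WIDTH]
```

Note both mechanisms are visible here: up to 64 contexts are interleaved across clocks, while up to four of them execute simultaneously within a single clock.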
Intel CPUs typically have two hardware threads per core (in marketing speak, it's called Hyper-Threading). Recall that an Intel core has the capability to decode and execute a complex mixture of instructions per clock (see here). Intel's Hyper-Threading implementation works like this: each clock, the core looks at both threads and tries to find a set of instructions that can fill up these execution units. That mixture might be independent instructions from one thread, if sufficient ILP exists (traditional superscalar execution), or instructions drawn from both threads (bringing us into simultaneous multi-threading territory). This design makes sense if you consider the context: Intel had spent years building superscalar processors that could perform a number of different instructions per clock (within a single instruction stream). But as we discussed, it's not always possible for one instruction stream to have the right mixture of independent instructions to utilize all the available units in the core. (This is the case of insufficient ILP.) It's therefore a logical step to say: hey, to increase the CPU's chance of finding the right mix, let's modify our processor to have two threads available to choose instructions from instead of one!
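The instruction-selection idea above can be sketched as follows. This is a deliberate simplification I'm making up for illustration (in-order issue per thread, a register-dependency check only, no structural hazards or renaming), not Intel's actual issue logic:

```python
# Sketch of SMT issue: each clock, try to fill WIDTH execution slots with
# independent instructions drawn from one or both threads. An instruction
# is a (dest_reg, src_regs) tuple; the dependency check and in-order
# per-thread issue are simplifications for illustration.

WIDTH = 4  # execution slots to fill per clock

def select_issue_group(threads):
    """Greedily pick up to WIDTH instructions whose source registers don't
    depend on the destination of an instruction already picked this clock."""
    picked, written = [], set()
    pcs = [0] * len(threads)
    blocked = [False] * len(threads)

    def has_work():
        return any(not blocked[t] and pcs[t] < len(threads[t])
                   for t in range(len(threads)))

    while len(picked) < WIDTH and has_work():
        for t, instrs in enumerate(threads):
            if len(picked) == WIDTH or blocked[t] or pcs[t] >= len(instrs):
                continue
            dest, srcs = instrs[pcs[t]]
            if set(srcs) & written:
                blocked[t] = True          # dependent: this thread stalls this clock
            else:
                picked.append((t, dest, srcs))
                written.add(dest)
                pcs[t] += 1
    return picked

# Thread 0's second instruction depends on its first (writes r1, then reads r1),
# so the core fills the remaining slots from thread 1 instead:
thread_a = [("r1", ["r0"]), ("r2", ["r1"])]
thread_b = [("r5", ["r4"]), ("r6", ["r4"])]
group = select_issue_group([thread_a, thread_b])
```

With these inputs the group contains one instruction from thread 0 and two from thread 1: exactly the "draw from both threads when one runs out of ILP" behavior described above.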
And a bit more:
Of course, running two threads is not always better than one, since the threads might thrash each other's data in the core's L1 or L2 cache, resulting in more cache misses that ultimately cause far more stalls than Hyper-Threading could ever hope to fill. On the other hand, running two threads at once can also be beneficial in terms of cache behavior if the threads access similar data. One thread might access address X, bringing it into cache. Then, if X is accessed by the other thread for the first time, what would normally have been a cold miss on a single-threaded system turns out to be a cache hit!
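Both effects can be seen in a toy shared-cache model. The cache size, the address streams, and the strict one-access-per-thread interleaving are all arbitrary choices for illustration:

```python
# Toy direct-mapped cache shared by two interleaved threads, showing the
# two effects described above: conflict thrashing vs. constructive sharing.
# All sizes and address streams are arbitrary illustration choices.

NUM_SETS = 4  # tiny direct-mapped cache: (block address % NUM_SETS) picks the set

def count_misses(stream_a, stream_b):
    """Interleave two address streams one access per 'clock'; count misses."""
    cache = [None] * NUM_SETS
    misses = 0
    interleaved = [addr for pair in zip(stream_a, stream_b) for addr in pair]
    for addr in interleaved:
        s = addr % NUM_SETS
        if cache[s] != addr:
            misses += 1
            cache[s] = addr     # evict whatever block lived in this set
    return misses

# Thrashing: the threads touch different blocks that map to the same sets,
# so every access evicts the other thread's data -- every access misses.
thrash = count_misses([0, 1, 0, 1], [4, 5, 4, 5])

# Sharing: the threads touch the same addresses, so one thread's cold miss
# becomes the other thread's hit.
share = count_misses([0, 1, 0, 1], [0, 1, 0, 1])
```

In the thrashing run all eight accesses miss; in the sharing run only the two cold misses remain, with every other access hitting on data its sibling thread brought in.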
Finally, it might also be instructive for students to note that the motivation for adding multi-threading to an Intel CPU (Hyper-Threading) was different from the motivation for large-scale multi-threading in a GPU. GPUs support many hardware threads for the primary purpose of hiding memory latency. Intel's Hyper-Threading isn't really intended to hide all memory latency (a core has only two threads, and that's not enough to hide the long latencies of accesses out to memory). Instead, Hyper-Threading exists to make it easier for the core's scheduler to find enough independent instructions to fill the multiple ALUs in a modern superscalar Intel CPU.
Another good example of a multi-threaded processor, where support of multiple threads was intended to hide memory access latency, is the UltraSPARC T2 chip, which features eight threads per core. An academic paper about T2 is here.
A historical note: the UltraSPARC T1 and T2 chips were also referred to as "Niagara", and you may recognize the name of one of the primary architects of the original chip.
Thanks for this detailed info. I was wanting to read up a bit more about Intel's hyperthreading later, but this was more than enough for me for now.
This is super helpful. So just to confirm, two hyperthreaded threads on a single core are not running in parallel, correct? Just concurrently because they need to be interleaved?
^Or rather, is it partially parallel, since the two threads could or could not execute simultaneously depending on what the next instructions are in each?
Another point I'm curious about regarding multi-threading support at the HW level: how does the OS signal to the CPU that there are n different threads to run in parallel? Is it through UNIX-like semantics like pthreads? Or is it by scheduling multiple OS-level threads onto the same core and treating the HW multi-threading as a black box?
@potato. Your understanding is correct. A modern Intel chip that supports hyper-threading is just going to find instructions that can be executed simultaneously. It may draw instructions from one or both of the two available threads.
@sirah. Think about it this way. The OS is responsible for initializing the state of the hardware's execution contexts to the state of software threads on the machine. If the OS decides that a two-threaded processor should be running software threads A and B, it copies the register state of those threads onto the hardware's execution contexts (including the program counter register PC). The processor just starts executing the next instruction associated with its execution contexts, as given by the PC of each execution context.
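As a sketch of that hand-off (field names here are made up; real context state includes far more than a PC and a few general-purpose registers):

```python
# Sketch of an OS loading software-thread state onto a core's two hardware
# execution contexts. Class and field names are invented for illustration;
# real context state is much richer than a PC plus general registers.

class SoftwareThread:
    def __init__(self, pc, regs):
        self.pc = pc            # saved program counter
        self.regs = regs        # saved general-purpose register values

class HardwareContext:
    def __init__(self):
        self.pc = None          # empty until the OS schedules a thread here
        self.regs = None

def schedule_onto_core(contexts, software_threads):
    """Copy each chosen software thread's saved state into a hardware
    execution context; the core then just fetches at each context's PC."""
    for ctx, thread in zip(contexts, software_threads):
        ctx.pc = thread.pc
        ctx.regs = list(thread.regs)   # copy the register state

# A two-context (Hyper-Threading-style) core picks up threads A and B:
core = [HardwareContext(), HardwareContext()]
a = SoftwareThread(pc=0x400000, regs=[1, 2, 3])
b = SoftwareThread(pc=0x500000, regs=[4, 5, 6])
schedule_onto_core(core, [a, b])
```

After this, the hardware needs no further signal: each context simply executes the next instruction at its own PC, which is the "black box" behavior asked about above.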
I was reading up about Intel's Hyper-Threading, and I found a very interesting article on how resources are shared between the two threads, including the memory caches. Apparently, the Hyper-Threading feature has been the center of a lot of debate for Intel, because it can make the system vulnerable to security threats: a core that isn't supposed to have access to what another core (running a different application) is doing could potentially gain access to it. (Link to one of the articles I found: https://www.techradar.com/news/intel-cpus-can-be-exploited-unless-you-disable-hyper-threading-linux-dev-claims)
When I refer to a core - I mean virtual core (for clarity)
I was thinking about what the differences between simultaneous and temporal multi-threading are. The only one I came up with is that SMT incurs smaller context-switching penalties than TMT. Otherwise, SMT can just be thought of as fine-grained multi-threading, while TMT can be considered coarse-grained multi-threading. If there is anything significant here that I am missing, let's discuss.