I think it's a huge consideration, especially as you move towards more specialized chips. For example, the last few generations of NVIDIA GPUs have introduced new types of execution units, such as ray tracing cores and tensor cores, and I'm sure there is a lot of discussion about how much benefit these new cores bring to users, since implementing them costs silicon area that could go towards other things.
Example (1) I'm sure is used to some extent, and I think that in a lot of domains that involve highly parallel computation, throwing more cores at the problem seems to be a valid strategy (just look at the core counts per generation for datacenter CPUs and GPUs). Example (2) is definitely harder, and I'm not sure anyone has had a ton of success doing something like that, but if you're interested, there's a cool paper on how Google used deep reinforcement learning to optimize chip floorplanning for their next-gen TPUs here: https://www.nature.com/articles/s41586-021-03544-w.pdf. Another related research area you can look into is neural architecture search, which optimizes the architecture of a neural network (rather than hardware components) based on its performance.
A summary of the past few slides that helped me reach some clarity: memory latency is the amount of time it takes for a processor to receive data it has requested from memory. Memory bandwidth is the rate at which the memory system can provide data to a processor. Bandwidth is a measure of throughput (how much data per unit time?), while latency measures the time for any single request to be serviced.
As an example, to improve bandwidth one could use wider communication channels, but this would leave memory latency unchanged.
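To convince myself of the distinction, I wrote a small C microbenchmark sketch (the array size, the use of size_t elements, and the single-threaded measurement are all just illustrative choices, not anything from the slides). It contrasts a pointer chase, whose cost is dominated by memory latency, with a streaming sum, whose cost is governed by memory bandwidth:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1 << 24)  /* 16M elements (~128 MB): arbitrary, but well beyond cache sizes */

static double seconds(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    size_t *next = malloc(N * sizeof(size_t));
    if (!next) return 1;

    /* Sattolo's algorithm: build a random single-cycle permutation so each
       load's address depends on the previous load's result. These dependent
       misses cannot be overlapped, so time per access ~ memory latency. */
    for (size_t i = 0; i < N; i++) next[i] = i;
    for (size_t i = N - 1; i > 0; i--) {
        size_t j = (size_t)(random() % i);
        size_t tmp = next[i]; next[i] = next[j]; next[j] = tmp;
    }

    double t0 = seconds();
    size_t p = 0;
    for (size_t i = 0; i < N; i++) p = next[p];   /* serial chain of dependent loads */
    double chase = seconds() - t0;

    /* Streaming pass: the loads are independent, so the hardware can keep many
       in flight at once; total time ~ bytes moved / achievable memory bandwidth. */
    t0 = seconds();
    size_t sum = 0;
    for (size_t i = 0; i < N; i++) sum += next[i];
    double stream = seconds() - t0;

    printf("pointer chase: %6.1f ns per access  (latency-bound, p=%zu)\n",
           chase / N * 1e9, p);
    printf("streaming sum: %6.2f GB/s effective (bandwidth-bound, sum=%zu)\n",
           (double)N * sizeof(size_t) / stream / 1e9, sum);

    free(next);
    return 0;
}
```

Compiled with something like gcc -O2, the chase reports a time per access on the order of a DRAM latency, while the streaming pass reports an effective data rate. Neither number is derivable from the other, which I think is the whole point.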
@ghostcow. Correct. If the speed limit of a highway is fixed at S, then driving from point A to B at that speed limit is the best one can do in terms of latency. However, if the highway has many lanes, then many cars per second can get from A to B. The latency of completing the trip for any one car is the same, but the throughput of the highway in terms of cars per second goes up!
And just like with highways, it is generally much more feasible for computer and network designers to substantially increase the bandwidth of links (their "width") than it is to substantially reduce their latency.
I'm still a bit confused by "bandwidth-limited". In one of the written assignments, we had a bandwidth-limited set of work, but the amount of memory we accessed wasn't enough to hit the memory bandwidth (in other words, if the bandwidth was 4 GB/sec, we would only ever access < 1 GB).
If the computation is indeed bandwidth-limited, does this mean that at any memory size, we'd be trying to access memory faster than the memory system could supply it?
Question about the hardware design aspect of this: I'm assuming processor designers base many design considerations on target workloads (e.g., servers vs. scientific computing, etc.). To what extent do they do this?
What I mean is, for example: (1) you could say "for workload X, more thread-level parallelism helps, so let's add more cores" in a very general sense, or (2) you could set up an actual optimization problem that, given a fixed amount of resources, optimizes the # of cores, # of SIMD processors, cache sizes, etc. to minimize the time to run workloads X, Y, Z (or other performance-relevant metrics). I imagine that (2) would be difficult to do exactly, but something along those lines; see the toy sketch below for the shape of what I'm imagining.
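To illustrate what I mean by (2), here's a toy C sketch of an exhaustive search over a tiny design space under a fixed area budget. Every number in it (the area costs, the analytical runtime model, the budget) is invented purely for illustration; real design-space exploration would use detailed simulators and far richer models.

```c
#include <stdio.h>
#include <math.h>

#define AREA_BUDGET 100.0   /* arbitrary units of silicon area */

/* Invented cost model: runtime falls with more cores (up to the workload's
   parallelism), with wider SIMD (diminishing returns), and with larger
   caches via a crude miss-penalty term. */
static double runtime(int cores, int simd, int cache_mb) {
    double parallel_speedup = fmin(cores, 64);        /* workload scales to ~64 threads */
    double simd_speedup     = 1.0 + 0.5 * log2(simd); /* diminishing vector returns */
    double miss_penalty     = 1.0 + 4.0 / cache_mb;   /* fewer misses with a bigger cache */
    return 1000.0 * miss_penalty / (parallel_speedup * simd_speedup);
}

/* Invented area costs for each resource. */
static double area(int cores, int simd, int cache_mb) {
    return cores * (1.0 + 0.25 * simd) + 2.0 * cache_mb;
}

int main(void) {
    double best_time = 1e30;
    int best_c = 0, best_s = 0, best_m = 0;

    /* Enumerate a small discrete design space and keep the fastest
       configuration that fits under the area budget. */
    for (int cores = 1; cores <= 128; cores *= 2)
        for (int simd = 1; simd <= 16; simd *= 2)
            for (int cache_mb = 1; cache_mb <= 32; cache_mb *= 2) {
                if (area(cores, simd, cache_mb) > AREA_BUDGET) continue;
                double t = runtime(cores, simd, cache_mb);
                if (t < best_time) {
                    best_time = t;
                    best_c = cores; best_s = simd; best_m = cache_mb;
                }
            }

    printf("best config under budget: %d cores, %d-wide SIMD, %d MB cache "
           "(modeled runtime %.1f)\n", best_c, best_s, best_m, best_time);
    return 0;
}
```

The point is just the shape of the problem: a discrete design space, a resource constraint, and an objective derived from workload performance. I'm curious how close real design teams get to formulating it this explicitly.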