Slide 87 of 116
rtan21

With 5120 ALUs, each taking two 64-bit inputs every clock cycle, that's 5120 × 2 × 8 bytes ≈ 80 KB of data per clock! I did a quick Google search and it looks like the V100 has a base clock of 1245 MHz, which means roughly 80 KB × 1.245 × 10^9 ≈ 100 TB of data each second. That is a lot of data, and far more than the memory bandwidth can supply, so the ALUs would be throttled by memory bandwidth unless most operands come from registers or caches rather than DRAM.
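The per-clock and per-second figures can be checked with a quick Python sketch. The ALU count, the two 8-byte inputs per ALU per clock, and the 1245 MHz clock are the assumptions from this comment, and the 900 GB/s DRAM figure is the one quoted in the reply below; the variable and function names are illustrative only.

```python
# Back-of-the-envelope operand traffic for the V100 figures in this comment.
# Assumptions (from the comment, not from NVIDIA documentation): 5120 ALUs,
# each reading two 64-bit (8-byte) operands every clock, at a 1245 MHz clock.
NUM_ALUS = 5120
INPUTS_PER_ALU = 2
BYTES_PER_INPUT = 8            # 64 bits
CLOCK_HZ = 1245e6              # 1245 MHz base clock
DRAM_BYTES_PER_S = 900e9       # 900 GB/s, the V100 figure quoted in the reply


def operand_bytes_per_clock(alus: int, inputs: int, bytes_per_input: int) -> int:
    """Total operand bytes all ALUs would consume in a single clock cycle."""
    return alus * inputs * bytes_per_input


per_clock = operand_bytes_per_clock(NUM_ALUS, INPUTS_PER_ALU, BYTES_PER_INPUT)
per_second = per_clock * CLOCK_HZ            # bytes per second at full rate
ratio = per_second / DRAM_BYTES_PER_S        # demand relative to DRAM supply

print(f"{per_clock} bytes/clock ({per_clock / 1024:.0f} KiB)")
print(f"{per_second / 1e12:.1f} TB/s of operand demand")
print(f"about {ratio:.0f}x what DRAM can deliver")
```

It reports 81,920 bytes (80 KiB) per clock, about 102 TB/s of operand demand, roughly 113 times what a 900 GB/s memory system can deliver.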

bryu

@rtan21, the V100 shown here has a memory bandwidth of 900 GB/s, but the V100 is already more than three years old; the newer A100, based on the Ampere architecture, has a bandwidth of over 1500 GB/s. Whether a task is memory bound or compute bound is a key concept in GPU computing, and apparently something GPU programmers think about a lot. For example, taking the elementwise exponential of a vector is essentially always going to be memory bound: it does O(N) work on O(N) data, so only a constant amount of computation is performed per byte loaded. Multiplying large matrices, on the other hand, should become compute bound, because it does O(N^3) work on only O(N^2) data, so the computation per byte grows with N.
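One rough way to see this split is to compare arithmetic intensity (operations per byte moved to or from DRAM) against the machine's balance point, roofline-style. The Python sketch below rests on loudly simplified assumptions: FP32 data, exp counted as one operation per element, an n×n matmul counted as 2n^3 operations, each array touched in DRAM exactly once (for matmul this assumes ideal on-chip reuse such as tiling), and one operation per ALU per clock using the 5120-ALU, 1245 MHz, 900 GB/s figures from this thread. The function names are hypothetical.

```python
# Roofline-style estimate: arithmetic intensity (ops per DRAM byte) vs. the
# machine balance point. Simplifying assumptions, for illustration only:
#   - FP32 data (4 bytes per element)
#   - exp counts as 1 op per element; an n x n matmul counts as 2*n**3 ops
#   - every array is read from / written to DRAM exactly once (for matmul this
#     assumes ideal on-chip reuse, e.g. tiling)
#   - 1 op per ALU per clock, using the 5120-ALU / 1245 MHz / 900 GB/s figures
BYTES_PER_FLOAT = 4


def exp_intensity(n: int) -> float:
    """y = exp(x) over n elements: n ops, read n floats, write n floats."""
    ops = n
    traffic = 2 * n * BYTES_PER_FLOAT
    return ops / traffic


def matmul_intensity(n: int) -> float:
    """C = A @ B with n x n matrices: 2n^3 ops, read A and B, write C."""
    ops = 2 * n**3
    traffic = 3 * n * n * BYTES_PER_FLOAT
    return ops / traffic


# Ops/s the ALUs can issue divided by bytes/s DRAM can deliver.
machine_balance = 5120 * 1245e6 / 900e9


def classify(intensity: float, balance: float) -> str:
    """Below the balance point, DRAM cannot feed the ALUs fast enough."""
    return "compute bound" if intensity > balance else "memory bound"


for n in (1_000, 4_096):
    for name, f in (("exp", exp_intensity), ("matmul", matmul_intensity)):
        ai = f(n)
        print(f"{name:6s} n={n:5d}: {ai:8.3f} ops/byte -> {classify(ai, machine_balance)}")
print(f"machine balance: {machine_balance:.2f} ops/byte")
```

The exp intensity stays at a constant 0.125 ops/byte no matter how large n is, while the matmul intensity grows as n/6, so under these assumptions matmul passes the roughly 7 ops/byte balance point once n reaches about 43. Real kernels move more data than this ideal, which pushes the crossover to larger n.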
