finn

Note to anyone studying for the exam: some instructions in a sequentially written program MUST occur in different clock cycles, no matter how many processors your machine has. This means the extra processors of a multi-processor chip go unused without a multi-processing program structure.

apappu

To clarify the above -- this is because of the dependency structure (I believe). For instance, the final of the three additions (the bottommost layer above) relies on the intermediate partial sums computed in the second layer.
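A tiny sketch of that dependency structure in C (my own example, not taken from the slide):

```c
#include <stdio.h>

int main(void) {
    int a = 1, b = 2, c = 3, d = 4;

    /* These two additions are independent of each other, so a
       superscalar processor could issue them in the same cycle. */
    int s0 = a + b;
    int s1 = c + d;

    /* This addition reads both partial sums above, so it must wait
       for them and necessarily lands in a later cycle. */
    int total = s0 + s1;

    printf("%d\n", total);
    return 0;
}
```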

joshcho

What kind of work has been done on debugging parallel programs by constructing DAGs? Are there other representations that work better?

jennaruzekowicz

I think it is super cool, and even heartwarming, that much of computer science loops back on itself. From this image I can immediately see the relationship between traversing a tree level by level and actually designing a scheduler.

jennaruzekowicz

@finn Can you clarify what you mean by a multi-processing program? Does this simply mean the multi-processor chip is useless without forking new pids/using threads? A little rusty on this!

ismael

@finn and @jennaruzekowicz I am also a bit confused about what you mean by a multi-processing program structure. If you are referring to forking new processes or leveraging threads, I think the concept of superscalar execution (covered in the next slide) may conflict with the idea that a multi-processor chip becomes useless without a multi-processing program structure. The idea behind superscalar execution is that the processor automatically finds independent instructions in your program and runs them in parallel on multiple execution units. This essentially gives programmers the benefits of parallelizing computation without having to implement any parallelization logic in their programs. Apologies if I misunderstood what you meant by "multi-processing program structure".
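For concreteness, here is a made-up C example of the kind of independent work a superscalar core can overlap on its own, next to a chain that it cannot:

```c
/* Toy illustration: dependent chain vs. independent work. The hardware
   finds this ILP by itself; the source code uses no threads or processes. */
void chain(int *x, int n) {
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];        /* each add depends on the previous acc */
    x[0] = acc;
}

void independent(const int *x, int *y, int n) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + 1;    /* iterations are independent, so several
                               adds can be in flight in the same cycle */
}
```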

ismael

After watching lecture 2, I realized that in my previous comment I was mistaking an execution unit for an additional core.

czh

Here is some background reading on ILP that I found helpful: https://www.hpl.hp.com/techreports/92/HPL-92-132.pdf

finn

@jennaruzekowicz forking / different threads

xyz

It's so cool that ILP (it seems) boils down to path traversal in directed acyclic graphs! You could then produce a metric of how well a computation can be parallelized with ILP by finding the longest path from source to sink in its dependency graph. I wonder whether modern compilers do, or could, attempt to produce this graph at compile time to speed up execution. From the lecture it sounded as though these speedups only happen over a sliding window of instructions at execution time.
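Here is a rough sketch of the longest-path metric I mean (the toy DAG, numbering, and edge list are my own assumptions, modeling a small tree of adds, not anything from the slide):

```c
#include <stdio.h>

#define N 7   /* instructions, numbered in topological order */

int main(void) {
    /* edges[e] = {from, to}: "to" depends on "from". This models summing
       8 values with a tree of adds: 0..3 are the first-level adds,
       4 and 5 the second level, 6 the root. Edges are sorted by source. */
    int edges[][2] = {{0,4},{1,4},{2,5},{3,5},{4,6},{5,6}};
    int n_edges = sizeof edges / sizeof edges[0];

    /* depth[v] = length of the longest dependency chain ending at v */
    int depth[N];
    for (int v = 0; v < N; v++) depth[v] = 1;

    /* nodes are in topological order and edges are sorted by source,
       so a single relaxation pass over the edges is enough */
    for (int e = 0; e < n_edges; e++) {
        int u = edges[e][0], v = edges[e][1];
        if (depth[u] + 1 > depth[v]) depth[v] = depth[u] + 1;
    }

    int span = 0;
    for (int v = 0; v < N; v++) if (depth[v] > span) span = depth[v];
    printf("critical path length: %d\n", span);   /* prints 3 */
    return 0;
}
```

If I understand the lecture right, that longest path is the minimum number of steps the computation needs even with unlimited execution units, which is why it works as a bound on the achievable speedup.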

german.enik

What I'm confused about is how exactly these trees are constructed at such a low level. In more high-level programming courses we design graphs/trees as classes, but high-level programming builds on top of low-level programming. Here we are as low as the processor, so there is no OOP, etc. -- how exactly does the hardware implementing ILP keep track of everything?

tcr

Thinking about how compilers/hardware exploit instruction-level parallelism has also made me think about how I can statically expose parallelism in code (e.g., via loop unrolling) when I write it.

I'm also confused about how hardware and compilers efficiently figure out dependencies -- that feels like it would get really complicated really quickly.
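For the loop-unrolling point, something like this made-up before/after sketch is what I have in mind -- the unrolled body uses separate accumulators so the adds within an iteration are independent:

```c
/* Original: the single accumulator creates a serial dependence chain. */
float sum(const float *x, int n) {
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Unrolled by 4 with separate accumulators: the four adds per iteration
   are independent of each other, exposing more ILP. (Assumes n is a
   multiple of 4 to keep the sketch short; float reassociation can change
   the result slightly.) */
float sum_unrolled(const float *x, int n) {
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    for (int i = 0; i < n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    return (s0 + s1) + (s2 + s3);
}
```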
