Kecleon

stomic compare and swap is needed because otherwise there might be a data race condition when cond(i) returns true, parents[i] gets updated by another vertex, and update(s, d) attempts to update parents[d] again.

michzrrr

Pretty cool how we can have a concept like EDGEMAP which takes such specific inputs and has such a well-defined use case, yet since we are in the domain of graph processing, it can be used so generally for almost any graph processing tasks. Really shows how much domain specifics knowledge can improve things.

rahulahoop

In what sense would this code be more performant than something we might write on PA4? I see the domain specificity here, but I'm curious how the implementation of Ligra actually speeds up graph processing.

sanjayen

@rahulahoop I imagine that the Ligra implementation has very finely tuned implementations of all the major algorithms used in graph processing (things like traversals, partitions, etc.), above and beyond our PA4 implementations by having custom scheduling policies based on more complex heuristics. Then, the composability of these algorithms leads to more general performant code.

riglesias

@rahulahoop Another thing you could consider is that Ligra may have better knowledge of how to effectively use your computer's resources. The abstraction is defined such that Ligra is free to auto-vectorize your code, improving the throughput of your function!

ccheng18

So Ligra and other very specialized programming languages, like specialized hardware are considered more energy efficient? How is this implemented to be the case - would we not be able to get the same speed up by programming parallel-y in C or OMP?

riglesias

@ccheng18 There's nothing inherent to these libraries that make it "more efficient" than code that could be written by hand. The reason they usually outperform code written in the traditional way is that the library has domain-specific optimizations enabled. In other words, it is possible to get the same speedup in C by using your computer's resources efficiently (meaning you use SIMD in cases of low divergence, use the cache intelligently, etc...)

For example, we made a thread pool in Assignment 2 that was subject to some design considerations. If we wanted to make a thread-pool that we can use to parallelize graph algorithms, we could make different decisions that could potentially speed it up more.

riglesias

@ccheng In contrast, specialized hardware has a different "processing pipeline" than a general purpose CPU. One analogy you could use is with moving cargo. A CPU would be like putting a single package on your car and driving to/from the destination. A piece of hardware specializing in an algorithm like FFTs would be like a train: Very limited flexibility, but incredibly high throughput due to not needing to consider different use cases.

I would say that DSL's aren't quite the same as hardware accelerators and ASICs, though they both use domain-specific knowledge to improve the speed of your program.