What does it mean to use a cluster of machines to store graph in memory? If you use a cluster of machines, are you communicating the graph data between machines via network traffic? There was a diagram in a slide many lectures ago that stated bandwidth for network traffic and disk and I thought the conclusion we drew was that the bandwidths were similar. Why would we prefer using a cluster of machines to store a graph rather than storing it in disk?
@shreya_ravi, The spark lecture said that (http://35.227.169.186/cs149/fall21/lecture/spark/slide_6) cluster node's have roughly the same bandwidth from disk as the have across the network, and the bandwidth from memory is ~100x both.
So, it might be more advantageous to store a graph in memory across multiple machines, rather than a single one on disk, if we are compute bound. Even if the bandwidth is the same, having multiple CPUs processing the graph could make the cluster faster.
Depending on the problem setting, we might not need to communicate between nodes as often as we would read from disk. This might occur if our per-vertex computation is more local, e.g. some of the vertex centrality measures (https://en.wikipedia.org/wiki/Centrality) only rely on the neighborhood surrounding each vertex.
Please log in to leave a comment.
The idea of partitioning big graphs reminds me of SimRank (https://en.wikipedia.org/wiki/SimRank), where people work on approximate PageRank algorithms on big graphs.
The basic essence of SimRank is to trade off accuracy and speed; the approximate algorithm computes scores on smaller partitions of graph with a much faster speed; scores of node may not be as accurate as defined, but the bias is bound by a configurable threshold.