Slide 23 of 55
rahulahoop

How can we detect whether a node has failed? Barring something completely obvious like a rack catching fire, how can we check whether a node is sending us valid or junk values?

pranil

Many distributed computing systems, including Spark, use heartbeat mechanisms. Every node periodically sends a short ping message to the central node to signal that it is alive and well. I could see research going on into whether this is the best approach: because of network latency, heartbeat messages can occasionally arrive late or time out, and a healthy node could be wrongly declared dead.
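
To make the timeout idea concrete, here is a minimal Python sketch of a heartbeat monitor on the central node. This is not Spark's actual implementation or API; the class `HeartbeatMonitor`, its methods, the node names, and the 10-second timeout are all hypothetical and chosen only for illustration. The clock is injectable so the example can simulate time passing.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat from each node (hypothetical helper, not Spark's API)."""

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s  # assumed threshold before a node is suspected dead
        self.clock = clock          # injectable so time can be simulated
        self.last_seen = {}         # node id -> time of most recent heartbeat

    def record_heartbeat(self, node_id):
        """Called whenever a ping from node_id arrives at the central node."""
        self.last_seen[node_id] = self.clock()

    def suspected_dead(self):
        """Return the nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


# Simulated run with a fake clock (times in seconds).
fake_now = [0.0]
monitor = HeartbeatMonitor(timeout_s=10.0, clock=lambda: fake_now[0])

monitor.record_heartbeat("worker-1")
monitor.record_heartbeat("worker-2")

fake_now[0] = 8.0
monitor.record_heartbeat("worker-1")   # worker-1 pings on time

fake_now[0] = 15.0
print(monitor.suspected_dead())        # ['worker-2']

# worker-2 was only slow: its delayed ping arrives after it was suspected.
fake_now[0] = 16.0
monitor.record_heartbeat("worker-2")
print(monitor.suspected_dead())        # []
```

The last few lines show the false positive described above: worker-2 never failed, its ping was just late, yet at t = 15 it was already flagged. A longer timeout reduces such mistakes but delays detection of real failures.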

brianamb

A question I had in class: what is the cost of recalculating/redoing jobs on a machine when a failure is detected? I'm curious how much this could hurt overall performance and whether there is a way to better accommodate this cost.
