ccmp1

This is an implementation of the gossip-style fault detector in this paper

Each member maintains a list with for each known member its address and an integer which is going to be used for failure detection. We call the integer the heartbeat counter. Every Tgossip seconds, each member increments its own heartbeat counter, and selects one other member at random to send its list to. Upon receipt of such a gossip message, a member merges the list in the message with its own list, and adopts the maximum heartbeat counter for each member.

Each member occasionally broadcasts its list in order to be located initially and also to recover from network partitions (see Section 5). In the absence of a broadcast capability, the network could be equiped with a few gossip servers, that differ from other members only in that they are hosted at well known addresses, and placed so that they are unlikely to be unavailable due to network partitioning.

Each member also maintains, for each other member in the list, the last time that its corresponding heartbeat counter has increased. If the heartbeat counter has not increased for more than Tfail seconds, then the member is considered failed. Tfail is selected so that the probability that anybody makes an erroneous failure detection is less than some small threshold Pmistake.

After a member is considered faulty, it cannot immediately be forgotten about. The problem is that not all members will detect failures at the same time, and thus a member A may receive a gossip about another member B that A has previously detected as faulty. If A had forgotten about B, it would reinstall B in its membership list, since A would think that it was seeing B’s heartbeat for the first time. A would continue to gossip this information on to other members, and, in effect, the faulty member B never quite disappears from the membership.

Therefore, the failure detector does not remove a member from its membership list until after Tcleanup seconds (Tcleanup ≥ Tfail). Tcleanup is chosen so that the probability that a gossip is received about this member, after it has been detected as faulty, is less than some small threshold Pcleanup. We can make Pcleanup equal to Pfail by setting Tcleanup to 2 × Tfail.

Importantly, when a node deems an entry failed and a heartbeat from the corresponding server reaches the node, the entry should not be deemed failed anymore.

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
testcases		testcases
.gitignore		.gitignore
Application.cpp		Application.cpp
Application.h		Application.h
EmulNet.cpp		EmulNet.cpp
EmulNet.h		EmulNet.h
Grader.sh		Grader.sh
Log.cpp		Log.cpp
Log.h		Log.h
MP1Node.cpp		MP1Node.cpp
MP1Node.h		MP1Node.h
Makefile		Makefile
Member.cpp		Member.cpp
Member.h		Member.h
Params.cpp		Params.cpp
Params.h		Params.h
Queue.h		Queue.h
README.md		README.md
mp1_specifications.pdf		mp1_specifications.pdf
run.sh		run.sh
stdincludes.h		stdincludes.h
submit.py		submit.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

ccmp1

About

Releases

Packages

Languages

zkl94/ccmp1

Folders and files

Latest commit

History

Repository files navigation

ccmp1

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages