next up previous contents
Next: Solving consensus using S Up: Lecture 7 Previous: Unreliable Failure Detectors for Systems*

Properties of failure detectors

Each process will have its own failure detector, which it can query like an oracle. The failure detector will then return a list of the processes it suspects to be faulty. What makes this approach interesting is that the detectors can make mistakes. In fact, they can make arbitrarily many mistakes and still, if some minimal conditions are satisfied, they can be used to solve consensus.

The first property our failure detector should satisfy is some sort of completeness property, which roughly means that if a process is faulty it should also be suspected by the detector.

Definition 12478

(Completeness)

Notice that with weak completeness the information that a process is crashed is somewhere in the system, but it is unknown where. For instance, some detectors can say that p is down while others can say that p is up and this can go on forever.

The second kind of property we would like to have is accuracy , which means that, if a process is suspected, it should also be faulty. We shall distinguish four cases.

Definition 12481

(Accuracy)

In the perpetual case, the detectors are guaranteed to work correctly from the beginning, whereas in the eventual case they are guaranteed to do so only from some point on. Notice that this leaves still a lot of uncertainty. For instance, a detector can say that both p and q are correct while in fact one of them is down. Also, with eventual accuracy the moment after which the detectors work properly in not known. In general ``each local failure detector module can repeatedly add and remove correct processes from its list of suspects [and] some correct process may be erroneously suspected to have crashed by all other processes throughout the entire execution'' [11]. Still, if any combination of completeness and accuracy properties is satisfied, consensus can be attained.

The above properties can be combined in eight ways, yielding the detectors shown in table 7.1. It is left as an exercise to show that weak and strong completeness are actually equivalent in strength, which means that the eight different detectors collapse into four.

 
Table 7.1: The eight different combinations of detector properties. P stands for perfect; Q for quasi-perfect; S for strong; and W for weak. The sign \ensuremath{\diamondsuit} comes from temporal logic, and should be pronounced ``eventual''.
  4c|accuracy      
  2c|perpetual 2c|eventual    
  strong weak strong weak
strong completeness P S \ensuremath{\diamondsuit}P \ensuremath{\diamondsuit}S
weak completeness Q W \ensuremath{\diamondsuit}Q \ensuremath{\diamondsuit}W

In fact, the whole table collapse to just two entries. We shall prove the following. We will give an (n-1)-resilient protocol for consensus with the aid of the S-detector; since strong completeness and weak completeness are equivalent this will establish the fact that W is enough for (n-1)-consensus. and hence that P=W. Then we will show that it is impossible to reach consensus with n/2 faulty processors if we only have the \ensuremath{\diamondsuit}P-detector. Since S is the weakest detector with perpetual accuracy, and \ensuremath{\diamondsuit}P is the strongest detector with eventual accuracy, this highlights the big difference between these two accuracy properties. Next lecture, we will show that consensus can be reached even with the \ensuremath{\diamondsuit}W-detector, if the number of faulty processors is less than n/2. We remind the reader that we are considering only crash failures. Altogether this shows that in reality there are only two cases two consider: P=Q=S=W and $\ensuremath{\diamondsuit}P = \ensuremath{\diamondsuit}Q = \ensuremath{\diamondsuit}S = \ensuremath{\diamondsuit}P$.

We should also point out that $\ensuremath{\diamondsuit}W$ is the weakest possible failure detector for solving consensus in the sense that any failure detector that solves consensus must satisfy weak completeness and eventual weak accuracy . The difficult and interesting proof can be found in [12].

As a final comment we remark that the protocols we will be discussing do not require that the failure detector properties hold forever. It is enough that they hold ``for sufficiently long time''. In this sense, failure detectors are a neat abstraction of the time-out mechanisms which are used in practice. In fact, several results obtained for models assuming various degrees of mild synchronicity can be recast and proven in terms of failure detectors [11].


next up previous contents
Next: Solving consensus using S Up: Lecture 7 Previous: Unreliable Failure Detectors for Systems*
Viggo Kann
Sat Dec 20 23:41:16 MET 1997