Byzantium, 1453 AD.
The city of Constantinople, the last remnants of the hoary Roman Empire,
is under siege.
Powerful Ottoman battalions are camped around the city
on both sides of the Bosporus,
poised to launch the next, perhaps final, attack.
Sitting in their respective camps, the generals are meditating.
Because of the redoubtable fortifications, no battalion by itself can succeed;
the attack must be carried out by several of them together
or otherwise they would be thrusted back and incur heavy losses that
would infuriate the Grand Sultan. Worse, that would jeopardize the
prospects of a defeated general to become Vizier.
The generals can agree on a common plan of action
by communicating thanks to the
messenger service of the Ottoman Army which can deliver messages
within an hour, certifying
the identity of the sender and preserving the content of the message.
Some of the generals however, are
secretly conspiring against the others.
Their aim is to confuse their peers so that an insufficient number
of generals is deceived into attacking. The resulting defeat
will enhance their own status in the eyes of the Grand Sultan.
The generals start shuffling messages around, the ones trying
to agree on a time to launch the offensive, the others trying to split
their ranks...
Menlo Park, 1982 AD. The situation above describes a classical coordination problem in distributed computing known as byzantine agreement which was introduced in two seminal papers by Lamport, Pease and Shostak [23,30]. Broadly stated, a basic problem in distributed computing is this: Can a set of concurrent processes achieve coordination in spite of the faulty behaviour of some of them? The faults to be tolerated can be of various kinds. The most stringent requirement for a fault-tolerant protocol is to be resilient to so-called byzantine failures: a faulty process can behave in any arbitrary way, even conspire together with other faulty processes in an attempt to make the protocol work incorrectly. The identity of faulty processes is unknown, reflecting the fact that faults can (and do) happen unpredictably.
What kind of applications need this kind of malicious fault to be handled? Part of the answer can be found in the footnote of the original paper by Lamport, Shostack and Pease listing the sponsors: NASA, the Ballistic Missile Defense Systems Command and the Army Research Office. Besides the usual applications to killing and terrorizing people, protocols that withstand byzantine failures are useful in any situation where it is vital that a system performs correctly in spite of the malfunctioning of some its subparts, regardless of the type of malfunctioning. Examples include aircraft control systems and applications of computer technology to medicine. According to Nancy Lynch's book ``the agreement problem is a simplified version of a problem that originally arose in the development of on-board aircraft control systems. In this problem, a collection of processors, each with access to a separate altimeter, and some of which may be faulty, attempt to agree on the airplane's altitude. Byzantine agreement algorithms have also been incorporated into the hardware of fault-tolerant multiprocessor systems; there, they are used to help a small collection of processors to carry out identical computations, agreeing on the results at every step. This redundancy allows the processors to tolerate the (Byzantine) failure of one processor. Byzantine agreement algorithms are also useful in processor fault diagnosis, where they can permit a collection of processors to agree on which of their number have failed (and should therefore be replaced or ignored)'' [27].
The problem we want to study is whether coordination among processes is at all possible in spite of byzantine faults. We begin with a precise formulation of the problem.
We are given a set G of processes, called generals, which is partitioned into two sets L and T of, respectively, loyal generals and traitors. Henceforth, t denotes the number of traitors and n the overall number of processes. Each general Gi has an input bit xi- Gi's initial assessment- and must produce an output bit di called the decision value of Gi. The underlying communication mechanism is
Limited dithering: Eventually all loyal generals should come to a decision.This is less stringent than limited-bureaucracy because a process might commit to a decision value and yet keep participating to the protocol which might even run forever. We shall see later examples of this.