Distributed Error Confinement
Yossi Azar, Shay Kutten, and Boaz Patt-Shamir
Abstract
We initiate the study of error confinement in distributed
applications, where the goal is that only nodes that were directly
hit by a fault may deviate from their correct external behavior, and
only temporarily. The external behavior of all other nodes must
remain impeccable, even though their internal state may be affected.
Error confinement is impossible if an adversary
is allowed to inflict arbitrary transient faults on the system,
since the faults might completely wipe out input values. We
introduce a new fault tolerance measure we call \emph{agility},
which quantifies the strength of an algorithm that disseminate information,
against
state corrupting faults.
We study the basic problem of broadcast, and propose algorithms that
guarantee error confinement with optimal agility to within a
constant factor, even in asynchronous networks when the topology is
unknown. These algorithms can serve as building blocks in more
general reactive systems. Previous results in exploring locality in
reactive systems were not error confined, and relied on the
assumption (not used in current paper) that the errors hitting each
node are probabilistic, such that a faulty node itself, or its neighbor, can
detect the nod
e
faulty.
The main algorithm uses the novel {\em core bootstrapping}
technique, that seems inherent for voting in reactive networks; its
analysis leads to an interesting combinatorial problem. The
technique and the analysis may be of independent interest.