AN ALTERNATIVE PIECE OF SOFTWARE THAT CAN BE USED IN THE EVENT OF T...
3.
an alternative piece of software that can be used in the event of the failure of the
primary module.
We also need, of course, a mechanism for taking dumps of the system state and for
restoring the system state. The recovery block notation embodies all of these features.
Taking as an example a program that uses a method to sort some information, a fault
tolerant fragment of program looks like this:
ensure dataStillValid
>
by
superSort
else by
quickSort
slowButSureSort
else error
Here
supersort
is the primary component. When it has tried to sort the infor-
mation, the method
dataStillValid
tests to see whether a failure occurred. If there
was a fault, the state of the program is restored to what it was before the sort method
was executed. The alternative method
quickSort
is then executed. Should this now
fail, a third alternative is provided. If this fails, there is no other alternative available,
and the whole component has failed. This does not necessarily mean that the whole
program will fail, as there may be other recovery blocks programmed by the user of
this sort module.
What kinds of fault is this scheme designed to cope with? The recovery block mech-
anism is designed primarily to deal with unanticipated faults that arise from bugs
(design faults) in the software. When a piece of software is complete, it is to be expected
that there will be residual faults in it, but what cannot be anticipated is the whereabouts
of the bugs.
Recovery blocks will, however, also cope with hardware faults. For example, suppose
that a fault develops in the region of main memory containing the primary sort method.
The recovery block mechanism can then recover by switching over to an alternative
method. There are stories that the developers of the recovery block mechanism at
Newcastle University, England, used to invite visitors to remove memory boards from
a live computer and observe that the computer continued apparently unaffected.
We now examine some of the other aspects of recovery blocks.
The acceptance test
You might think that acceptance tests would be cumbersome methods, incurring high
overheads, but this need not be so. Consider for example a method to calculate a square
root. A method to check the outcome, simply by multiplying the answer by itself, is short
and fast. Often, however, an acceptance test cannot be completely foolproof – because
of the performance overhead. Take the example of the sort method. The acceptance test
could check that the information had been sorted, that is, is in sequence. However, this
does not guarantee that items have not been lost or created. An acceptance test, there-
fore, does not normally attempt to ensure the correctness of the software, but instead
carries out a check to see whether the results are acceptably good.
Note that if a fault like division by zero, a protection violation, an array subscript out
of range occurs while one of the sort methods is being executed, then these also con-
stitute the result of checks on the behavior of the software. (These are checks carried
out by the hardware or the run-time system.) Thus either software acceptance tests or
hardware checks can trigger fault tolerance.
The alternatives
The software components provided as backups must accomplish the same end as the
primary module. But they should achieve this by means of a different algorithm so that
the same problem doesn’t arise. Ideally the alternatives should be developed by differ-
ent programmers, so that they are not unwittingly sharing assumptions. The alterna-
tives should also be less complex than the primary, so that they will be less likely to fail.
For this reason they will probably be poorer in their performance (speed).
Another approach is to create alternatives that provide an increasingly degraded service.
This allows the system to exhibit what is termed graceful degradation. As an example of
graceful degradation, consider a steel rolling mill in which a computer controls a machine
that chops off the required lengths of steel. Normally the computer employs a sophisticat-
ed algorithm to make optimum use of the steel, while satisfying customers’ orders. Should
this algorithm fail, a simpler algorithm can be used that processes the orders strictly
sequentially. This means that the system will keep going, albeit less efficiently.
Implementation
The language constructs of the recovery block mechanism hide the preservation of vari-
ables. The programmer does not need to explicitly declare which variables should be
stored and when. The system must save values before any of the alternatives is executed,
and restore them should any of the alternatives fail. Although this may seem a formidable
task, only the values of variables that are changed need to be preserved, and the nota-
tion highlights which ones these are. Variables local to the alternatives need not be
stored, nor need parameters passed by value. Only global variables that are changed need
to be preserved. Nonetheless, storing data in this manner probably incurs too high an
overhead if it is carried out solely by software. Studies indicate that, suitably implement-
ed with hardware assistance, the speed overhead might be no more than about 15%.
No programming language has yet incorporated the recovery block notation. Even
so, the idea provides a framework which can be used, in conjunction with any pro-
gramming language, to structure fault tolerant software.