AN ALTERNATIVE PIECE OF SOFTWARE THAT CAN BE USED IN THE EVENT OF T...

3.

an alternative piece of software that can be used in the event of the failure of the

primary module.

We also need, of course, a mechanism for taking dumps of the system state and for

restoring the system state. The recovery block notation embodies all of these features.

Taking as an example a program that uses a method to sort some information, a fault

tolerant fragment of program looks like this:

ensure dataStillValid

>

by

superSort

else by

quickSort

slowButSureSort

else error

Here

supersort

is the primary component. When it has tried to sort the infor-

mation, the method

dataStillValid

tests to see whether a failure occurred. If there

was a fault, the state of the program is restored to what it was before the sort method

was executed. The alternative method

quickSort

is then executed. Should this now

fail, a third alternative is provided. If this fails, there is no other alternative available,

and the whole component has failed. This does not necessarily mean that the whole

program will fail, as there may be other recovery blocks programmed by the user of

this sort module.

What kinds of fault is this scheme designed to cope with? The recovery block mech-

anism is designed primarily to deal with unanticipated faults that arise from bugs

(design faults) in the software. When a piece of software is complete, it is to be expected

that there will be residual faults in it, but what cannot be anticipated is the whereabouts

of the bugs.

Recovery blocks will, however, also cope with hardware faults. For example, suppose

that a fault develops in the region of main memory containing the primary sort method.

The recovery block mechanism can then recover by switching over to an alternative

method. There are stories that the developers of the recovery block mechanism at

Newcastle University, England, used to invite visitors to remove memory boards from

a live computer and observe that the computer continued apparently unaffected.

We now examine some of the other aspects of recovery blocks.

The acceptance test

You might think that acceptance tests would be cumbersome methods, incurring high

overheads, but this need not be so. Consider for example a method to calculate a square

root. A method to check the outcome, simply by multiplying the answer by itself, is short

and fast. Often, however, an acceptance test cannot be completely foolproof – because

of the performance overhead. Take the example of the sort method. The acceptance test

could check that the information had been sorted, that is, is in sequence. However, this

does not guarantee that items have not been lost or created. An acceptance test, there-

fore, does not normally attempt to ensure the correctness of the software, but instead

carries out a check to see whether the results are acceptably good.

Note that if a fault like division by zero, a protection violation, an array subscript out

of range occurs while one of the sort methods is being executed, then these also con-

stitute the result of checks on the behavior of the software. (These are checks carried

out by the hardware or the run-time system.) Thus either software acceptance tests or

hardware checks can trigger fault tolerance.

The alternatives

The software components provided as backups must accomplish the same end as the

primary module. But they should achieve this by means of a different algorithm so that

the same problem doesn’t arise. Ideally the alternatives should be developed by differ-

ent programmers, so that they are not unwittingly sharing assumptions. The alterna-

tives should also be less complex than the primary, so that they will be less likely to fail.

For this reason they will probably be poorer in their performance (speed).

Another approach is to create alternatives that provide an increasingly degraded service.

This allows the system to exhibit what is termed graceful degradation. As an example of

graceful degradation, consider a steel rolling mill in which a computer controls a machine

that chops off the required lengths of steel. Normally the computer employs a sophisticat-

ed algorithm to make optimum use of the steel, while satisfying customers’ orders. Should

this algorithm fail, a simpler algorithm can be used that processes the orders strictly

sequentially. This means that the system will keep going, albeit less efficiently.

Implementation

The language constructs of the recovery block mechanism hide the preservation of vari-

ables. The programmer does not need to explicitly declare which variables should be

stored and when. The system must save values before any of the alternatives is executed,

and restore them should any of the alternatives fail. Although this may seem a formidable

task, only the values of variables that are changed need to be preserved, and the nota-

tion highlights which ones these are. Variables local to the alternatives need not be

stored, nor need parameters passed by value. Only global variables that are changed need

to be preserved. Nonetheless, storing data in this manner probably incurs too high an

overhead if it is carried out solely by software. Studies indicate that, suitably implement-

ed with hardware assistance, the speed overhead might be no more than about 15%.

No programming language has yet incorporated the recovery block notation. Even

so, the idea provides a framework which can be used, in conjunction with any pro-

gramming language, to structure fault tolerant software.