
17.9 Discussion

Fault tolerance in hardware has long been recognized, and accommodated. Electronic engineers have frequently incorporated redundancy, such as triple modular redundancy, within the design of circuits to provide for hardware failure. Fault tolerance in software has become more widely addressed in the design of computer systems as it has become recognized that it is almost impossible to produce correct software. Exception handling is now supported by all the mainstream software engineering languages: Ada, C++, Visual Basic, C# and Java. This means that designers can provide for failure in an organized manner, rather than in an ad hoc fashion. Particularly in safety-critical systems, either recovery blocks or n-version programming is used to cope with design faults and enhance reliability.
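The recovery block idea mentioned above can be sketched in a few lines. This is a minimal illustration rather than a production design: it is written in Python for brevity, and the names `recovery_block`, `faulty_sort`, `simple_sort` and `is_sorted_permutation` are hypothetical, chosen only for this example. Each alternate is given a fresh copy of the input, an exception counts as a failure, and the first result to pass the acceptance test is returned.

```python
from collections import Counter


class RecoveryBlockFailure(Exception):
    """Raised when every alternate fails or is rejected by the acceptance test."""


def recovery_block(alternates, acceptance_test, data):
    """Try each alternate in turn until one passes the acceptance test."""
    for alternate in alternates:
        working_copy = list(data)  # each alternate starts from the saved input
        try:
            result = alternate(working_copy)
        except Exception:
            continue  # an exception counts as failure; try the next alternate
        if acceptance_test(data, result):
            return result
    raise RecoveryBlockFailure("no alternate produced an acceptable result")


def faulty_sort(items):
    """Primary alternate with a deliberate design fault: it drops the last item."""
    return sorted(items)[:-1]


def simple_sort(items):
    """Secondary alternate: a slower but simpler insertion sort."""
    out = []
    for item in items:
        pos = 0
        while pos < len(out) and out[pos] <= item:
            pos += 1
        out.insert(pos, item)
    return out


def is_sorted_permutation(original, result):
    """Acceptance test: the result is ordered and holds exactly the same items."""
    ordered = all(a <= b for a, b in zip(result, result[1:]))
    return ordered and Counter(original) == Counter(result)


if __name__ == "__main__":
    print(recovery_block([faulty_sort, simple_sort], is_sorted_permutation, [3, 1, 2]))
```

The acceptance test is deliberately independent of how either alternate works, so a design fault in the primary is caught without knowing what that fault is. N-version programming differs in that all versions are run and their results are compared by a voter, rather than tried one after another.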

Fault tolerance does, of course, cost money. It requires extra design and programming effort, extra memory and extra processing time to check for and handle exceptions. Some applications need greater attention to fault tolerance than others, and safety-critical systems are the most likely to merit the extra effort. However, even software packages that have no safety requirements often need fault tolerance of some kind. For example, we now expect a word processor to perform periodic and automatic saving of the current document, so that recovery can be performed in the event of power failure or software crash. End users are increasingly demanding that software cleans up properly after failures, rather than leaving them with a mess that they cannot salvage. Thus it is likely that ever-increasing attention will be paid to improving the fault tolerance of software.
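As an illustration of the word processor example, the following sketch shows one common way to make an automatic save safe against a crash part-way through writing. It is an assumption about how such a feature might be built, not a description of any particular product; the function names `autosave` and `recover` are hypothetical. The text is written to a temporary file first and only then moved over the previous autosave, so a failure during the write leaves the last good copy untouched.

```python
import os
import tempfile
from pathlib import Path


def autosave(document_text, target):
    """Write the text to a temporary file, then swap it into place."""
    target = Path(target)
    fd, tmp_name = tempfile.mkstemp(dir=target.parent, prefix=target.name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(document_text)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.replace(tmp_name, target)  # the rename is atomic on POSIX systems
    except BaseException:
        # Clean up the partial temporary file rather than leave debris behind.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise


def recover(target):
    """Return the last autosaved text, or an empty document if none exists."""
    try:
        return Path(target).read_text(encoding="utf-8")
    except FileNotFoundError:
        return ""
```

In a real editor `autosave` would be called from a timer, and `recover` would run at start-up to offer the user the most recent saved copy.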

Summary

Faults in computer systems are caused by hardware failure, software bugs and user

error. Software fault tolerance is concerned with:

detecting faults

assessing damage

repairing the damage

continuing.

Of these, faults can be detected by both hardware and software.
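The four activities listed above can be seen together in a small software-only sketch. It is illustrative only: the function `apply_transfers` and the account data are hypothetical. A copy of the state is saved before each operation, a fault is detected either by an exception or by an explicit check, the damage is assessed by comparing against the saved copy, the damaged entries are restored, and processing continues with the next operation.

```python
def apply_transfers(balances, transfers):
    """Apply each transfer; return the indexes of transfers that were rolled back."""
    rejected = []
    for index, (source, dest, amount) in enumerate(transfers):
        checkpoint = dict(balances)  # saved state, used later for repair
        try:
            balances[source] -= amount
            balances[dest] += amount  # an unknown dest fails after source was debited
            if balances[source] < 0:  # 1. detect: explicit check on an invariant
                raise ValueError("overdrawn")
        except (KeyError, ValueError):
            # 2. assess: find which entries differ from the checkpoint
            damaged = [k for k in balances if balances[k] != checkpoint[k]]
            # 3. repair: restore only the damaged entries
            for k in damaged:
                balances[k] = checkpoint[k]
            rejected.append(index)
        # 4. continue with the next transfer
    return rejected


if __name__ == "__main__":
    accounts = {"a": 100, "b": 50}
    failed = apply_transfers(
        accounts,
        [("a", "b", 30), ("a", "c", 10), ("b", "a", 500), ("b", "a", 20)],
    )
    print(accounts, failed)
```

Copying the whole state before every operation is the simplest form of checkpoint; it is affordable here only because the state is tiny.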

One hardware mechanism for fault detection is protection mechanisms, which have

two roles: