17.9 ● Discussion
Fault tolerance in hardware has long been recognized – and accommodated. Electronic
engineers have frequently incorporated redundancy, such as triple modular redundancy,
within the design of circuits to provide for hardware failure. Fault tolerance in software
has become more widely addressed in the design of computer systems as it has become
recognized that it is almost impossible to produce correct software. Exception handling
is now supported by all the mainstream software engineering languages – Ada, C++,
Visual Basic, C# and Java. This means that designers can provide for failure in an
organized manner, rather than in an ad hoc fashion. Particularly in safety-critical systems,
either recovery blocks or n-version programming is used to cope with design faults and
enhance reliability.
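The two techniques named above can be sketched in a few lines. This is a minimal illustration in Python, not a production design: the function names (recovery_block, n_version_vote) and the integer square root example are assumptions made for the sketch. The alternates here are pure functions, so there is no program state to restore between attempts. The voting function also assumes that results are hashable values that can be compared for equality.

```python
import math
from collections import Counter


def recovery_block(alternates, acceptance_test, *args):
    """Try each alternate in turn; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # an exception counts as failure of this alternate
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def n_version_vote(versions, *args):
    """Run every version and return the result given by a strict majority."""
    results = []
    for version in versions:
        try:
            results.append(version(*args))
        except Exception:
            pass  # a version that crashes casts no vote
    if results:
        value, count = Counter(results).most_common(1)[0]
        if count > len(versions) // 2:
            return value
    raise RuntimeError("no majority among the versions")


# Three hypothetical, independently written versions of integer square root.
def isqrt_primary(n):
    return int(n ** 0.5) + 1  # deliberately faulty: off by one


def isqrt_secondary(n):
    return math.isqrt(n)


def isqrt_search(n):
    r = 0
    while (r + 1) * (r + 1) <= n:
        r += 1
    return r


def is_isqrt(result, n):
    """Acceptance test: result is the largest integer whose square is <= n."""
    return result * result <= n < (result + 1) * (result + 1)
```

With these definitions, recovery_block([isqrt_primary, isqrt_secondary], is_isqrt, 17) rejects the faulty answer 5 and returns 4 from the second alternate. n_version_vote([isqrt_primary, isqrt_secondary, isqrt_search], 17) also returns 4, because two of the three versions agree. The voting step is the software counterpart of the voter in triple modular redundancy.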
Fault tolerance does, of course, cost money. It requires extra design and programming
effort, extra memory and extra processing time to check for and handle exceptions.
Some applications need greater attention to fault tolerance than others, and
safety-critical systems are more likely to merit the extra attention of fault tolerance.
However, even software packages that have no safety requirements often need fault
tolerance of some kind. For example, we now expect a word processor to perform
periodic and automatic saving of the current document, so that recovery can be
performed in the event of power failure or software crash. End users are increasingly
demanding that the software cleans up properly after failures, rather than leaving them
with a mess that they cannot salvage. Thus it is likely that ever-increasing attention
will be paid to improving the fault tolerance of software.
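The periodic saving described above might be sketched as follows. This is an illustration under stated assumptions, not how any particular word processor works: the names save_atomically and AutoSaver, the 60 second default interval and the ".autosave-" prefix are all invented for the example. The copy is written to a temporary file first and then renamed over the target, so that a crash in the middle of a save leaves the previous copy intact.

```python
import os
import tempfile
import time


def save_atomically(path, text):
    """Write text to a temporary file, then rename it over path."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".autosave-")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # ask the OS to push the data to disk
        os.replace(tmp_path, path)  # the old copy survives until this point
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)  # clean up rather than leave a mess
        raise


class AutoSaver:
    """Saves the document whenever the chosen interval has elapsed."""

    def __init__(self, path, interval_seconds=60.0, clock=time.monotonic):
        self.path = path
        self.interval = interval_seconds
        self.clock = clock  # injectable so the timing can be tested
        self.last_saved = clock()

    def tick(self, text):
        """Call regularly, e.g. after each edit. Returns True if a save happened."""
        now = self.clock()
        if now - self.last_saved >= self.interval:
            save_atomically(self.path, text)
            self.last_saved = now
            return True
        return False
```

An editor would call tick with the current document text from its event loop. After a crash, the file at path holds the most recent complete save.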
Summary
Faults in computer systems are caused by hardware failure, software bugs and user
error. Software fault tolerance is concerned with:
■ detecting faults
■ assessing damage
■ repairing the damage
■ continuing.
Of these, faults can be detected by both hardware and software.
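The four activities listed above can be seen together in a small sketch. The stock-control setting, the function name apply_orders and the rule that quantities must not go negative are assumptions chosen purely for illustration. A recovery point is taken before each order, so that any damage can be undone.

```python
def apply_orders(stock, orders):
    """Apply each order to stock; undo and reject any order that causes a fault."""
    rejected = []
    for order in orders:
        checkpoint = dict(stock)  # recovery point taken before the order
        fault = False
        try:
            for item, quantity in order.items():
                stock[item] -= quantity  # raises KeyError for an unknown item
        except KeyError:
            fault = True  # detecting faults: an exception was raised
        if any(quantity < 0 for quantity in stock.values()):
            fault = True  # detecting faults: a consistency check failed
        if fault:
            # assessing damage: which entries differ from the recovery point?
            damaged = [item for item in stock if stock[item] != checkpoint[item]]
            # repairing the damage: restore only those entries
            for item in damaged:
                stock[item] = checkpoint[item]
            rejected.append(order)
        # continuing: carry on with the next order either way
    return rejected
```

For example, starting from stock {"bolt": 10, "nut": 5}, the orders {"bolt": 3}, {"nut": 2, "washer": 1}, {"bolt": 4, "nut": 9} and {"nut": 1} leave the stock at {"bolt": 7, "nut": 4}. The second and third orders are rejected, and their partial effects are undone.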
One hardware means of fault detection is protection mechanisms, which have
two roles: