With software becoming a key element in the control of aircraft, medical devices, automobiles, and industrial automation, the danger presented by programming glitches can't be ignored.
Peter Bell and Jon Garnsworthy
Cambridge Consultants Ltd.
Science Park, Cambridge, UK
In the good old days, when you stepped on the accelerator pedal of your car, a system of levers and linkages connected your foot directly to the carburetor. Today, more than likely, the accelerator pedal is merely connected to an encoder that sends a command to an engine-control computer. Software in the computer then puts out appropriate commands to a fuel-injection system.
There is even talk of replacing conventional hydraulic power steering with encoders and actuators so the steering wheel will have no direct connection to the front wheels. Prototype systems of this type have long been in development.
None of this is particularly startling to people who design systems for industrial automation. They have long become accustomed to channeling input commands through a computer before they activate actuators and servomechanisms.
No question, software-based control is here to stay. It continues to be a mainstay of industrial automation. But that brings the question of safety to the forefront.
Everybody knows that software often contains errors, and that computers do not always execute instructions correctly. So what practices should be in place to account for possible glitches?
The automotive industry often looks to the aircraft industry for pointers in designing safe x-by-wire hardware. The problem is that the automobile industry is forced to come up with systems at rock-bottom prices, while the aircraft industry has fewer cost constraints. So lessons learned with aircraft do not transfer readily to automobiles. Airplanes often use triply redundant systems, but how many cars could afford to have three fully independent traction control systems?
A better place to look for guidance is the medical-device industry, which has long been mass-producing equipment in which safety-critical software plays an important part. The medical device industry is arguably the first that has needed to develop complex safety-critical systems with high software content. Many of the factors governing safe design in medical devices apply equally to automotive systems and other areas.
Previously, the usual approach to medical-device safety was to apply the single-fault hypothesis, which says the product is safe enough if no single fault can lead to death or injury. For each product, two questions must then be considered:
1. Over what time period can one consider the single-fault hypothesis to hold?
2. How quickly must the presence of a single fault be detected?
The answer to the first question defines how often any self-test needs to be run to detect problems with a fault-detection mechanism. In a dialysis machine, for example, this is usually the length of a single therapy session, normally 30 min to 3 hr.
The answer to the second question defines the response time of any fault detection mechanism. For a dialysis machine, this is usually 1 sec.
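For illustration, the two single-fault-hypothesis parameters can be sketched as a simple design check. This is not from the article; the function name, parameters, and the detector latency figure are hypothetical, while the dialysis numbers come from the text above.

```python
def detection_plan_ok(hypothesis_hold_s, self_test_period_s,
                      required_response_s, detector_latency_s):
    """A plan is acceptable if the fault-detection mechanism is itself
    self-tested at least once per period over which the hypothesis is
    assumed to hold, and the detector reacts within the required time."""
    return (self_test_period_s <= hypothesis_hold_s
            and detector_latency_s <= required_response_s)

# Dialysis figures from the text: the hypothesis holds for one therapy
# session (taking the shortest, 30 min) and faults must be caught in ~1 sec.
session_s = 30 * 60
plan_ok = detection_plan_ok(hypothesis_hold_s=session_s,
                            self_test_period_s=session_s,  # test each session
                            required_response_s=1.0,
                            detector_latency_s=0.5)        # hypothetical
```

The point of the sketch is that both parameters must be fixed per product before any fault-detection hardware or software is designed.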
As a design approach, the single-fault hypothesis is appealing for the design of safety-critical software in motion-control applications. However, the same two issues still arise. Over what time period does the hypothesis hold? For example, is it key-on to key-off, typical journey time, or something else? And how quickly must a fault be detected?
THE RISK-BASED APPROACH
Just as in automation, medical products have become more complex and are being sold in higher volumes. Therefore, the single-fault hypothesis has become less attractive.
Also, product-based standards have become more difficult to write, and the so-called "type approval" approach has become untenable. This has led the medical-device industry to move to a risk-based approach, which tends to focus standardization on the development process rather than on the features and qualities of a product. The risk-based approach puts the focus on the faults, design elements, and activities that help improve product safety. The key principle behind a risk-based approach is that possible causes of death or injury should be identified, along with their likelihood of occurrence and level of consequence. Action is then taken to eliminate the cause, reduce the likelihood of occurrence, or reduce the level of consequence.
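The identify-score-mitigate cycle can be sketched as a toy risk matrix. The scales and acceptability threshold here are illustrative only, not taken from the article or any standard.

```python
def risk_score(likelihood, severity):
    """Both rated on a 1 (lowest) to 5 (highest) scale; higher = riskier."""
    return likelihood * severity

def acceptable(likelihood, severity, limit=6):
    """Hypothetical acceptability band for this sketch."""
    return risk_score(likelihood, severity) <= limit

# One hazard cause before and after mitigation: eliminating or reducing
# the cause lowers likelihood; interlocks or alarms lower the consequence.
before = acceptable(likelihood=4, severity=4)   # score 16, unacceptable
after = acceptable(likelihood=1, severity=4)    # score 4, acceptable
```

In a real development process each identified cause of death or injury gets such a score, and the design is iterated until every cause falls in the acceptable band or is eliminated.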
When a new safety-critical medical device is being designed, several issues need to be considered from the start. Engineers designing motion systems must often evaluate these same issues:
- Can the system be designed to be fail-safe, or must it be designed to be fault-tolerant?
- Can standard commercial off-the-shelf (COTS) components be used for key parts, and can the system be protected against failure of COTS components?
- Is there a need for built-in test equipment, or BITE?
- How can safety-critical issues associated with the user interface be addressed?
One of the first design tasks is to figure out whether a system should be failsafe or fault-tolerant.
Fail-safe: A fail-safe system has a safe state, and a failure in the system will return it to this state. For medical devices, this usually means the machine automatically goes to the safe state if the device loses power.
Fault-tolerant: A system must be designed to be fault-tolerant if it has no fail-safe state. For instance, a life-support system must continue functioning regardless of any fault. Fault-tolerant systems tend to be more complex to design and more costly.
Because of the high cost and complexity of fault-tolerant systems, medical-device manufacturers try to design systems with a fail-safe state.
Safety critical does not necessarily mean all components have to be high integrity. In the aerospace industry, there is a tendency to design all software in safety-critical systems from the ground up so that bugs cannot affect the overall performance of the system. In contrast, the medical-device industry cannot afford such an all-embracing approach. Instead, it relies more heavily on commercial off-the-shelf products for key components such as the user interface, the operating system, and communications networks. It does this by trading safety for availability. In practice, this means engineers treat a COTS component such as the operating system in much the same way as they might a mechanical component.
They will identify failure modes of the COTS component and then build measures into the system to detect the failure mode. For example, a failure mode might be that the scheduler in the OS locks out a process. The designers will counteract this by requiring the process to transmit a heartbeat to a watchdog process, possibly running on another processor. If the watchdog detects the fault, it will put the system into a safe state.
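The heartbeat-and-watchdog scheme can be sketched as follows. This is a hypothetical illustration in Python for brevity; a real system would typically use a hardware timer or a watchdog running on a separate processor. Time is passed in explicitly so the logic is deterministic.

```python
class Watchdog:
    """Drives the system to the safe state if a monitored process
    stops sending heartbeats within the allowed timeout."""

    def __init__(self, timeout_s, now_s):
        self.timeout_s = timeout_s
        self.last_beat_s = now_s
        self.in_safe_state = False

    def heartbeat(self, now_s):
        # Called by the monitored process to prove it is still scheduled.
        self.last_beat_s = now_s

    def check(self, now_s):
        # Called periodically; once entered, the safe state latches.
        if now_s - self.last_beat_s > self.timeout_s:
            self.in_safe_state = True   # e.g. de-energize actuators
        return self.in_safe_state

wd = Watchdog(timeout_s=1.0, now_s=0.0)
wd.heartbeat(0.5)
healthy = wd.check(1.0)   # last beat 0.5 sec ago: still healthy
wd.check(2.0)             # no beat for 1.5 sec: enter safe state
tripped = wd.in_safe_state
```

Latching the safe state is a deliberate design choice in the sketch: a process that resumes after a missed heartbeat is still suspect, so only an explicit reset should clear the condition.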
Most complex medical devices have a large amount of built-in test equipment. This stems from the use of multiprocessors for safety and from the way in which the single-fault hypothesis is interpreted.
The BITE tends to be of two forms. One type detects the occurrence of a single fault. This type usually operates continuously in a monitoring mode. For example, in a dialysis machine, there usually is one sensor to control fluid-delivery temperature and one to check that the temperature is in the correct range.
The other type checks that the first set of BITE is operating correctly. This type usually operates intermittently, normally as a set of defined self-tests. For example, in a dialysis machine, before therapy commences, the control system will set the temperature of the fluid to a series of defined values. At each of these values, the other sensor must confirm that the temperature has been reached.
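An intermittent self-test of this kind might be sketched as below. The function names, setpoints, and tolerance are all hypothetical; the idea of sweeping the control channel through defined values and confirming each with the independent protective sensor is from the text.

```python
def sensor_cross_check(set_temp, read_protective_temp, setpoints, tol_c=0.5):
    """Before therapy, drive the fluid temperature to each defined
    setpoint via the control channel and require the independent
    protective sensor to confirm it within tolerance."""
    for sp in setpoints:
        set_temp(sp)                         # command via control channel
        if abs(read_protective_temp() - sp) > tol_c:
            return False                     # protective sensor disagrees
    return True

# Simulated plant for illustration: the protective sensor simply tracks
# the commanded temperature, so the check passes.
_state = {"temp": 25.0}
ok = sensor_cross_check(
    set_temp=lambda t: _state.update(temp=t),
    read_protective_temp=lambda: _state["temp"],
    setpoints=[34.0, 37.0, 40.0],
)
```

A stuck or disconnected protective sensor would fail this check at the first setpoint, which is exactly the failure the self-test exists to expose before therapy starts.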
While BITE is familiar to most engineers, its implications for safety-critical applications reach much further. For example, how can a safety sensor that measures front-wheel position be tested to ensure it has not failed, perhaps toward the extremes of its travel? Running such a test may require the car to put its mechanical systems into a number of defined states before the driver can drive off.
DESIGNING USER INTERFACES
One of the key issues in medical devices is ensuring that safety-critical parameters have been set correctly. Initially, medical devices were designed with two user interfaces. The first was connected to the control system and the second to the protective system. The operator would enter critical parameters using the interface for the control system. The control system would then relay the critical parameters to the protective system, which would display them on its user interface. The operator then had to compare the two values.
Subsequently, this system was found to be unacceptable to many users, so approaches employing a single interface have been developed. These approaches either eliminate common points of failure (or at least reduce the likelihood of a failure going undetected by the user), or check that no failure has occurred.
With the first of these approaches, the device no longer has two physical user interfaces but aims to retain some logical separation. For example, when a critical parameter is displayed, two copies of that parameter will be displayed in different fonts on different parts of the display. One copy will be taken, for example, from the control system, the other from the protective system.
The second approach is to read the displayed values back from the display in some way (e.g. by reading the video memory and then applying optical character recognition) and then check that the values actually displayed are those intended. This approach is more complicated but can provide an interface that is easier to follow.
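The read-back idea reduces to a simple comparison once the rendered values have been recovered. In this sketch a dictionary stands in for the screen and the OCR step; all names are hypothetical.

```python
display = {}   # stands in for the rendered screen contents

def show(param, value):
    # Control system renders the value to the display.
    display[param] = value

def readback_ok(param, intended):
    # Protective side recovers what was actually shown (in a real device,
    # via video memory plus OCR) and compares it with the intended value.
    return display.get(param) == intended

show("pump_rate_ml_min", 300)
verified = readback_ok("pump_rate_ml_min", 300)
```

The safety argument rests on the read-back path being independent of the rendering path, so a fault that corrupts the displayed value is caught rather than silently shown to the operator.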
A key consideration is the architectural approach taken. In the medical-device industry, three approaches have become common:
- Dual processors with equal peers, the preferred route when designing fault-tolerant systems. However, this approach is costly.
- Dual processors with control and protective systems, the preferred approach when designing fail-safe systems. Often, the second processor is much smaller than the first, allowing cost savings.
- Single processor, the option usually chosen for items produced in high volumes and subject to severe cost constraints. However, there needs to be considerably more work to show that the system remains safe even with catastrophic failure of the processor.
Dual processor with two equal peers: "Two equal peers" is an approach to redundancy that performs the same process on two equivalent systems and then takes action only if both agree. (The concept can, of course, be used with more than two processors.) Such an approach is common in avionics systems because of reliability requirements and the lack of a safe state. However, the cost of duplicating full hardware and software is high. Therefore, this approach probably will be used only for high-value, safety-critical applications.
Dual processor with a protection system: A large number of therapeutic medical devices (i.e. those that actively do something to a patient, rather than passively measure a value) make use of two systems. One is the control system, the other a protective system.
The control system implements the required function and the protective system checks to ensure the control system is behaving "safely." If the protective system detects a problem, it can then place the medical device into the safe state.
This capability usually makes this type of architecture most applicable for systems with straightforward safe states. It can also be used in other applications, including measurement devices, when reasonableness checks can be made on the final output. It is important to note that the protective system will not necessarily ensure that the therapy is being performed correctly. In dialysis, for example, the protective system checks whether the fluids are within an acceptable pH and temperature range. It does not ensure that the fluids conform to the requirements of a particular patient's dialysis treatment. The protective system is usually much simpler than the control system: in complex medical devices the control system normally consists of programmable electronics, whereas the protective system is sometimes implemented by simpler electronic and electromechanical components.
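A protective-system reasonableness check of the dialysis kind might look like the sketch below. The numeric limits are purely hypothetical, not clinical values; the point is that the check confirms a broadly safe envelope, not the correctness of a particular patient's prescription.

```python
SAFE_TEMP_C = (33.0, 39.0)   # hypothetical safe envelope, not clinical data
SAFE_PH = (6.8, 7.6)         # hypothetical safe envelope, not clinical data

def fluid_safe(temp_c, ph):
    """Protective check: values inside the safe envelope. Deliberately
    knows nothing about the therapy prescription itself."""
    return (SAFE_TEMP_C[0] <= temp_c <= SAFE_TEMP_C[1]
            and SAFE_PH[0] <= ph <= SAFE_PH[1])
```

If this check fails, the protective system places the device into its safe state; whether the fluid also matches the patient's prescription remains the control system's job.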
Alternatively, on some low-cost designs, two control processors are used with the control split between them. The protective function associated with the first control processor is then allocated to the second control processor, and vice-versa.
Though in many ways the most straightforward of architectures, care is needed to ensure that the control and protective systems are properly independent. For example, how does the protective system know what mode the overall device is in and therefore what it should be monitoring? If it is simply told by the control system, then there is a chance for a single fault to lead to failure. Such an approach has already been used in automotive ABS systems.
Single processor: This approach is used normally when low cost is a major goal. A number of things must be done when a single processor is used for a safety-critical device.
A highly detailed analysis must be made to identify how failures might lead to hazards and what protection mechanisms are needed. Because there is no second system to detect failure, a periodic and exhaustive set of self-tests will be required. These self-tests must run frequently enough to detect a failure before it can cause a hazard. In addition, it is necessary to ensure that a critical failure cannot mask failure of the self-tests themselves.
The question arises as to how to protect against processor failure or bit corruption. One approach is for all critical data to be stored in normal and inverted form. Then algorithms manipulating this data are duplicated, with one working on the normal form and one on the inverted form. A voting circuit is then used when any critical data is output. Also, a test harness is built, allowing for all identified failure modes to be injected into the system to confirm that the defined protection mechanisms work correctly. However, it can be difficult to prove the system is safe.
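The normal-plus-inverted storage scheme can be sketched as follows. The 16-bit width, class, and method names are hypothetical; the technique of duplicating the algorithm over both forms and voting on output is from the text. The update exploits the two's-complement identity ~(x + d) == ~x - d, so the inverted copy is decremented where the normal copy is incremented.

```python
MASK = 0xFFFF   # hypothetical 16-bit critical value

class CriticalValue:
    """Hold a critical value in normal and bit-inverted form, apply the
    same update to each form, and vote before any output."""

    def __init__(self, value):
        self.normal = value & MASK
        self.inverted = ~value & MASK

    def add(self, delta):
        # Duplicated algorithm: the same update expressed on each form.
        self.normal = (self.normal + delta) & MASK
        self.inverted = (self.inverted - delta) & MASK

    def read(self):
        # Voting check: the copies must still be exact complements.
        if (self.normal ^ self.inverted) != MASK:
            raise RuntimeError("corruption detected: enter safe state")
        return self.normal

v = CriticalValue(100)
v.add(25)
value = v.read()        # 125; copies still complementary
v.inverted ^= 0x0004    # inject a single-bit fault for illustration
```

After the injected fault, the next read fails the vote and the system would be forced into its safe state; a test harness of the kind the text describes would inject exactly such faults to confirm the protection works.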
Cambridge Consultants Ltd, +44 (0) 1223 420024, www.CambridgeConsultants.com