It is often said that the probability of a serious accident at a nuclear power plant or weapon is astronomically small, on the order of a million or a billion to one. However, in light of the Y2K problem, it is urgent that we reexamine precisely how these probabilities of a catastrophe are calculated. We will find, in particular, that nuclear power plants and nuclear weapons are indeed vulnerable to the Y2K problem because of a hidden Achilles Heel, i.e. multiple mode and common mode failures. The Y2K problem, by creating a chain of multiple mode and common mode failures, nullifies all the complex computer programs used to calculate the probability of a catastrophic accident, which are exclusively based on single mode failure. Thus, even systems which are deemed fully "Y2K compliant" are still subject to multiple and common mode failure. All the reassurances of the military and nuclear plant operators that they have tested isolated "mission critical" systems mean nothing in light of multiple and common mode failures. As Deputy Defense Secretary John Hamre said, "The year 2000 problem is the electronic equivalent of El Nino." And like El Nino, Y2K can cause multiple and common mode failures which can wipe out Y2K compliant computers.
Single Event Failures
Historically, the method used to determine the chances of a catastrophic Class 9 nuclear accident has been the "single event tree analysis," culminating in the WASH 1400 or the Rasmussen Report. Over the past 50 years, this single event tree analysis has been used to calculate the probabilities of a wide range of technologies, including rocket failures for NASA, and in particular the probabilities of a shuttle accident or an accident involving the Cassini mission. Unfortunately, this "single mode mentality" is still with us when the Pentagon and the NRC assure us that certain isolated systems are Y2K compliant. Multiple mode and common mode failures, because they are highly non-linear and affect whole networks of computers, can bring down computer networks which are individually Y2K compliant.
Basically, the single event tree analysis is based on the reasonable idea that an accident begins with a single "initiating event" (such as a leak in a pipe or a failure in a computer). This event, in turn, creates a series of secondary events, such as a broken valve or a malfunction in a pump. Each secondary event, in turn, creates a series of tertiary events. As one can see, one has a cascading sequence of events, like a tree, branching from a single event. This event tree is supposed to represent the totality of all conceivable accident modes stemming from the initiating event.
Next, one assigns probabilities to each point along the branch of the tree. For example, a pump may fail after several hundred years of operation. So for each particular branch of the tree, we multiply the probability of the primary event, the secondary event, the tertiary event, and so on. This gives an overall probability for each branch of the tree.
Lastly, we sum over all possible branches of the tree. This gives us the overall probability for the accident, which is often in the one-in-a-million range.
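The multiply-then-sum procedure described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the branch descriptions and every probability below are hypothetical numbers chosen for the example, not figures from WASH 1400 or any real plant, and the function names are invented here.

```python
from math import prod

# Hypothetical event tree. Each entry is one branch (one accident path),
# listed as the probabilities of its primary, secondary, and tertiary events.
# All numbers are illustrative assumptions.
branches = {
    "pipe leak -> valve breaks -> pump fails": [1e-2, 1e-2, 1e-3],
    "pipe leak -> valve breaks -> backup pump fails": [1e-2, 1e-2, 5e-4],
    "pipe leak -> gauge misreads -> operator error": [1e-2, 1e-3, 1e-2],
}


def branch_probability(probs):
    """Multiply the probabilities along one branch (assumes independent events)."""
    return prod(probs)


def accident_probability(tree):
    """Sum the branch probabilities over every branch of the tree."""
    return sum(branch_probability(p) for p in tree.values())


total = accident_probability(branches)
print(f"overall accident probability: {total:.2e}")
```

Note that the small final number depends entirely on the assumption that the events along a branch are independent of one another, which is exactly the assumption that multiple mode and common mode failures violate.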
When applied to nuclear power and weapons accidents, the results are truly impressive. Literally thousands of pages of computer print-outs can be generated by the event tree analysis, yielding impressive figures, like one failure in 100 billion for certain weapons accidents.
Even though NASA, nuclear power plant operators, and the Pentagon make extensive use of event tree analysis, upon closer examination, the event tree analysis has been a sham, and does not explain many accidents like TMI or Chernobyl, which were multiple mode and common mode failures. Some problems with the single event tree analysis are as follows.
For example, at Chernobyl, a combination of several failures took place. First, the nuclear engineers manually disengaged the SCRAM system of control rods in this carbon moderated reactor. Second, there was a power surge or transient in the core. Power transients are common nuisances at nuclear power plants. But because of the carbon moderation and loss of the SCRAM system, the transient grew unchecked and became auto-catalytic, which caused the core to undergo a power excursion and explosion. It thus took a combination of two events, not one, to set off the accident. Neither incident was sufficient by itself to cause the accident, but together they produced a major tragedy, allowing, at the minimum, over 80 million curies, or 5% of the core's radioactivity, to be lofted into the air over Europe.
Similarly, at TMI, the accident was caused by multiple failures. First, the Pilot Operated Relief Valve (PORV) was stuck in the open position. Second, the control panel was designed to read "closed" when the valve was actually open. Third, the reactor had no water level indicator to tell the operator that the core was being uncovered. Again, it was a combination of multiple modes which caused the accident.
Another type of accident would be a common mode accident, a particular sub-variety of the multiple mode accident. The Browns Ferry accident in Alabama was an example of a common mode failure, caused by a fire which wiped out multiple systems simultaneously. Workers used a candle flame to detect leaks in the plant. The candle accidentally caused a fire in the insulation, which in turn knocked out the Emergency Core Cooling Pumps, causing the water in the core to drop to dangerous levels. It was eventually the local fire department, not the utility, which brought this catastrophic fire under control.
Other types of common mode failures would involve an earthquake, an airplane crash into a reactor, an electrical black out, a flood, a telecommunications failure, etc. Each of these accidents can trigger many simultaneous failures in multiple systems.
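A small calculation shows why a common mode event matters so much. The sketch below assumes two redundant pumps, each failing independently with probability p, plus a single common cause (a fire, say) with probability c that disables both at once. The values of p and c, and the function names, are hypothetical and chosen only for illustration.

```python
def both_fail_independent(p):
    """Probability that two redundant pumps both fail, assuming independence."""
    return p * p


def both_fail_with_common_cause(p, c):
    """Probability that both pumps fail when a common cause (probability c)
    disables both at once; otherwise they fail independently."""
    return c + (1 - c) * p * p


p = 1e-3  # assumed failure probability of one pump
c = 1e-4  # assumed probability of the common cause (e.g. a fire)

independent = both_fail_independent(p)
common = both_fail_with_common_cause(p, c)
print(f"independent failures only: {independent:.2e}")
print(f"with a common cause:       {common:.2e}")
print(f"ratio:                     {common / independent:.0f}x")
```

Under these assumed numbers the common cause dominates the result: the redundancy that looks so impressive in a single event analysis contributes almost nothing once one event can disable both pumps together.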
If all the major accidents in the past were due to multiple mode and common mode failure, then the logical question is: why don't engineers abandon the single event tree analysis? Why don't engineers incorporate the wealth of information that has been gleaned from TMI, Chernobyl, and Browns Ferry?
The answer is simple: no computer on the earth can properly model multiple mode and common mode failures. Instead of one tree, you now have many trees, with the branches interlocking. Instead of a simple tree, you now have a forest of possibilities. Even simple multiple mode and common mode failures would exhaust the capabilities of the largest computer on earth. Consider the Internet, where a computer half way around the world may trigger a computer failure in the U.S. because computers are linked in a highly non-linear way. It is nearly impossible to write down an accident tree for the Internet wiping out computer systems. Plus, how does one model human error and stupidity? One can carefully design the safest car ever built, with redundant air bags and seat belts, but then some idiot will run the car over a cliff.
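To give a feel for how quickly the "forest" grows, the sketch below counts how many distinct sets of simultaneous initiating failures there are among n components. The component count of 1,000 is a hypothetical round number, not a figure for any actual plant, and the count says nothing about how many of those combinations are physically plausible.

```python
from math import comb

n = 1000  # assumed number of components that could fail (illustrative)

# Number of distinct ways that exactly k components can fail together.
for k in (1, 2, 3, 4):
    print(f"{k} simultaneous failure(s): {comb(n, k):,} combinations")

# Every non-empty subset of components is a possible multiple mode starting point.
all_subsets = 2**n - 1
print(f"all combinations: a number with {len(str(all_subsets))} digits")
```

A single event analysis only has to consider the first line of this output; each additional simultaneous failure multiplies the number of trees that would have to be drawn.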
Y2K and Multiple Mode and Common Mode Failures
This, then, brings up the topic of Y2K, which is obviously of the multiple mode and common mode type. On Jan. 1, 2000, one can expect that several computer systems may simultaneously fail at a nuclear power or weapons facility, creating a multiple and common mode sequence for which computer programs are useless. Computers which are embedded into systems or hooked up to the Internet only compound this non-linear problem. However, the NRC and the Pentagon act as if Y2K were a single mode failure. They certify isolated, individual systems as being Y2K compliant. By themselves, these isolated systems may work perfectly fine on Jan. 1, 2000. But with multiple mode and common mode failures, the whole system may collapse.
For example, think of Chernobyl again. The accident depended crucially on the engineers manually disengaging the SCRAM safety system. This, in turn, set the stage for the power transient, which then caused the reactor to explode.
Similarly, it is unlikely that a single failure due to Y2K will cause a catastrophic accident. But a combination of key multiple computer systems failing would be sufficient to cause a major accident.
Assume, for example, that the ECCS emergency pumps at a nuclear power plant are disabled because of Y2K. Now assume that a simple leaking valve causes loss of water in a reactor (a LOCA). A LOCA by itself would be a Class 8 accident. But the addition of a disabled ECCS system could cause a Class 9 accident if the core is uncovered and begins to melt.
Or imagine that the SCRAM system is disabled because of Y2K. Then a power transient occurs. Power transients, in fact, are a common occurrence at power plants. Without a SCRAM system, the power transient may grow unchecked, as in Chernobyl. This could cause an explosion.
Or imagine that a few of the key gauges on the control panel are disabled because of Y2K. These gauges may, for example, indicate the water level in the core, or whether key valves, like the PORV, are closed. Now assume that a leak occurs, either through a small pipe break or a stuck valve. Then a Class 8 accident could easily spiral into a Class 9 accident.
Similarly, nuclear weapons can also be compromised by the Y2K, because the command and control of nuclear weapons is highly computerized. Again, a single mode failure is unlikely to set off a major accident involving a nuclear weapon. But a combination of failures can trigger unforeseen events.
At present, there are about 36,000 nuclear warheads on the planet, about 5,000 of which are placed on launch on warning. Unfortunately, Representative Stephen Horn (R-CA), Chairman of the House Subcommittee on Government Management, Information, and Technology, gave the Department of Defense a "D-" grade for their Y2K work. "I remain deeply concerned about the Department of Defense's D- grade," he said. "It goes without saying that there is zero tolerance for error when you are dealing with the defense of our country."
On Sept. 25, 1998, at a meeting at the Pentagon, DoD Deputy Secretary John Hamre declared that the Y2K problem was under control. However, Admiral Richard Mies, Commander-in-Chief of the U.S. Strategic Command (STRATCOM), told the meeting startling news: 11 STRATCOM nuclear systems would not be fixed on time. He added that 12 new systems currently under development would also not make the deadline. In other words, up to 23 STRATCOM nuclear systems might fall victim to the Y2K problem.
But even if isolated systems are made Y2K compliant, this may be useless since computers are networked with each other, and one system failing may cause a cascade of other systems failing. For example, the Internet was originally created to fight and monitor a nuclear war. However, if one computer system fails, it may easily bring down other systems, even if these other systems are Y2K compliant. Thus, Y2K may act like a virus infecting thousands of computers via the Internet.
What might happen?
For example, one wonders why a simple power failure can bring down power to the entire Northeast. This is because our power grid is set up to handle isolated, single mode failures. If a plant fails, the other plants take up the slack and provide additional power. But sometimes these surrounding plants are asked to provide more power than they are capable of delivering, and they fail. This causes simultaneous failures of several plants, which the system cannot handle.
These systems are not designed to handle several plants which black out simultaneously. Then you can have a chain reaction, a black hole, whereby many systems begin to get sucked in and fail simultaneously. Thus, a system which is relatively invulnerable to a single mode failure can be shut down by multiple failure. Similarly, key computers in our early warning system, or nodal points, could bring down the entire system if their problem infects other computers.
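The chain reaction described above can be illustrated with a toy load-sharing model. This is a deliberately crude sketch, not a model of any real grid: it assumes that the load of a failed plant is split equally among the surviving plants, and that a plant fails as soon as its load exceeds its capacity. The function name, loads, and capacities are all hypothetical.

```python
def cascade(loads, capacities, initial_failures):
    """Return the set of failed plant indices once the grid stops changing.

    Toy model: the load of failed plants is shared equally among survivors,
    and any plant pushed above its capacity fails in turn.
    """
    loads = list(loads)
    failed = set(initial_failures)

    # Load that the initially failed plants were carrying must go elsewhere.
    orphaned = sum(loads[i] for i in failed)
    for i in failed:
        loads[i] = 0.0

    while orphaned > 0:
        alive = [i for i in range(len(loads)) if i not in failed]
        if not alive:
            break  # nothing left to pick up the load: total blackout
        share = orphaned / len(alive)
        for i in alive:
            loads[i] += share
        orphaned = 0.0
        for i in alive:
            if loads[i] > capacities[i]:
                failed.add(i)
                orphaned += loads[i]
                loads[i] = 0.0
    return failed


# Five identical plants, each running at 80 units with a capacity of 100.
loads = [80.0] * 5
capacities = [100.0] * 5

print(cascade(loads, capacities, [0]))     # one plant fails
print(cascade(loads, capacities, [0, 1]))  # two plants fail at the same time
```

With these assumed numbers, the four remaining plants can just absorb the loss of a single plant, but the simultaneous loss of two plants overloads every survivor and the whole grid goes down.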
Another example is that electrical failures in a computer system monitoring enemy missiles could easily give the false impression that a nuclear attack is underway. A few years ago, Boris Yeltsin was given the task of deciding whether a missile headed to Russia was on a simple scientific mission, or was the beginning of a first strike. He was, in fact, given the nuclear button by his aides and asked to decide. The missile was on a scientific mission from a Scandinavian country to analyze the weather, and the Russian authorities were in fact notified of this test, but word never reached the Kremlin. One can imagine that a failure in the computer systems in control of nuclear weapons could also set off false alarms. Again, a combination of multiple events can set off a disaster.
The problem is that our nuclear missiles are placed on hair-trigger alert, and there has been enormous pressure, especially on the Russians, to adopt a "launch on warning" strategy, i.e. launching your missiles on the hint of an enemy attack, because to delay may mean you are vaporized in a few minutes. One must use their nuclear missiles while they still have them.
This "use them or lose them" position puts an enormous amount of emphasis on the command, control, communications, and intelligence systems of any country. Unfortunately, the C3I system of the U.S. is patch-work, a crazy-quilt of overlapping radar systems, computer networks, command centers, etc. There are vulnerable points in this highly non-linear grid which, if they fail, can bring down the entire system.
The Pentagon, of course, has given us reassurances. In Jan. of this year, Deputy Defense Secretary John Hamre announced that, as of Dec. 21, 1998, the Pentagon had certified 81% of "mission critical" systems. By March, he hoped that the Pentagon would be 93% compliant. The total bill, Hamre estimated, would come to $2.5 billion. The Trident submarine, for example, has already been declared certified by the Pentagon. But earlier, Pres. Clinton had asked that all government agencies reach the 100% mark by March 1999, a target that the Pentagon will miss.
But given the fact that the Pentagon has about 10,000 computer systems, about 2,300 of which are termed critical, and an unknown number of embedded chips, it is unlikely that their systems will be 100% compliant.
One problem is that certain agencies will simply lie. One agency, according to the New York Times and USA Today, was caught lying about the level of its compliance. This Pentagon agency was partly responsible for the command and control of nuclear weapons. Although the problem was ultimately caught and found to be relatively minor, it points up the problem that certain agencies will, consciously or unconsciously, give erroneous data for Y2K compliance.
Potential problems with our nuclear weapon systems are as follows:
What's the Track Record?
One would like to believe that the Y2K problem can be solved by throwing money at it. Unfortunately, the track record so far is not encouraging:
Some positive measures that can be taken
Some positive steps that can be taken are as follows: