Avoiding Catastrophic Failure
A surprising number of systems are vulnerable to “unexpected” failure modes. Their datacentre cooling fails during a heatwave. Their phone system overloads with unexpected results. The Baltimore bridge collapsed due to “grandfathering” rules. Here’s our take on predicting the “unpredictable” and avoiding disaster.
Why Failures Become Disasters

Failures happen, and sometimes they turn into disasters even when a design is to code. The reactors at Fukushima, Japan were built to code. However, the designers never considered that a tsunami could knock out the backup power supply. The bridge in Baltimore was built to the code that applied at the time of construction. It was designed to withstand an impact from the biggest ship in existence. Time passed, ships got much bigger, and the bridge was no longer safe.
There are many examples of failures where hindsight shows that a different design would have been prudent, or that a design that used to be adequate might no longer be up to the job. But how do you spot such weaknesses with foresight, before a failure has exposed them?
The answers come through ‘Failure Mode Analysis’, which starts by looking at the ways that a system can fail. Push any system beyond its design parameters and it will fail. This could be a power station, a bridge, a cell tower, a datacentre, a phone system, etc. The question to ponder is “what happens when it does fail, and will it turn into a disaster?” The follow-on question is “what should we do to avoid that?” There are two options:
- design to never fail
- design to soft fail
Designs That Don’t Fail
There are systems, or parts of systems, that should be designed as far as possible to not fail. That a failure in an aircraft should not lead to a crash is self-evident. You may not be designing aircraft but there may well be parts of your systems that, in case of a failure, could potentially cause injuries or worse. You might have:
- a radio tower on site – could it fail, leading to a collapse?
- a source of water that could get into your data centre:
  - water pipes
  - a roof leak
  - a backed-up storm drain
Designing for Soft Failure
If the consequences of a subsystem failure are not catastrophic, you might design for soft failure.
You might have a call centre designed to handle 200 simultaneous calls. If the system were hit with 500 calls, unexpected results could occur. Windstorm events have caused serious problems for 911 call centres. These centres were designed so that excess traffic received a “please hold” message, their planned soft failure mode. Unfortunately, when the hold queue was full, new callers received a busy signal. This is not a soft failure outcome.
During the recent WestJet industrial action, passengers were referred to the company web site to re-book or adjust their flights. However, the load was many times greater than the server design allowed and most users encountered unhelpful error messages.
Installing overload control software in your access network can be part of a soft failure design. Overload traffic would be diverted to an alternate service handler, such as a soft failure web page or a recorded announcement.
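As a rough illustration of the idea, here is a minimal Python sketch of overload control for the 200-call example above. The class and method names, and the hold queue size of 50, are our own assumptions for illustration and do not describe any particular product. The point it shows is that a caller who finds the hold queue full is diverted to an alternate handler rather than given a busy signal.

```python
from collections import Counter, deque
from enum import Enum


class Outcome(Enum):
    ANSWERED = "answered by an agent"
    ON_HOLD = "placed in the hold queue"
    DIVERTED = "diverted to a recorded announcement"


class OverloadController:
    """Hypothetical overload control: answer, then hold, then divert."""

    def __init__(self, agent_capacity, hold_capacity):
        self.agent_capacity = agent_capacity
        self.hold_capacity = hold_capacity
        self.active = 0            # calls currently with an agent
        self.hold_queue = deque()  # calls waiting for an agent

    def offer(self, call_id):
        """Decide what happens to a newly arriving call."""
        if self.active < self.agent_capacity:
            self.active += 1
            return Outcome.ANSWERED
        if len(self.hold_queue) < self.hold_capacity:
            self.hold_queue.append(call_id)
            return Outcome.ON_HOLD
        # Hold queue is full: divert instead of returning a busy signal.
        return Outcome.DIVERTED

    def release(self):
        """An agent finishes a call; the longest-waiting caller takes the slot."""
        if self.hold_queue:
            self.hold_queue.popleft()  # held call moves to the freed agent
        elif self.active > 0:
            self.active -= 1


# 500 calls arrive at a centre designed for 200 (hold size of 50 is assumed).
controller = OverloadController(agent_capacity=200, hold_capacity=50)
outcomes = Counter(controller.offer(f"call-{i}") for i in range(500))
for outcome, count in outcomes.items():
    print(f"{count} calls {outcome.value}")
```

With these assumed figures, 200 calls are answered, 50 are held, and the remaining 250 are diverted to the announcement. None of them receives a busy signal.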
We are reminded of a long-forgotten soft failure design for a help desk system where the maximum design capacity was 5 simultaneous calls. The soft fail design overflowed calls to the IT manager if wait times exceeded an acceptable threshold. Unfortunately, the feature was forgotten and the phone number was subsequently reassigned to a non-IT employee.
What To Do
Review your systems to understand what potential failure modes exist and what will happen when that failure occurs. You need a plan, whether that is a better plan A or a recovery plan B.
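One way to make such a review concrete is to keep a simple register of failure modes. The Python sketch below is only an assumed layout: the field names and the catastrophic/non-catastrophic flags are illustrative judgments based on the examples in this article, not a formal method. Each entry records what can fail, what happens when it does, and which of the two design options applies.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    trigger: str
    consequence: str
    catastrophic: bool  # illustrative judgment: could it cause injury or worse?

    def strategy(self):
        """Pick one of the two design options discussed in this article."""
        if self.catastrophic:
            return "design to never fail"
        return "design to soft fail"


# Example entries drawn from this article; the flags are assumptions.
register = [
    FailureMode("radio tower", "structural failure",
                "tower collapses on site", catastrophic=True),
    # Listed in this article under designs that should not fail.
    FailureMode("data centre", "backed-up storm drain",
                "water gets into the data centre", catastrophic=True),
    FailureMode("call centre", "500 calls against a 200-call design",
                "callers receive a busy signal", catastrophic=False),
]

for mode in register:
    print(f"{mode.component}: {mode.trigger} -> {mode.consequence} "
          f"[{mode.strategy()}]")
```

Even a short list like this forces the question of what happens when each failure occurs, which is where plan A or plan B starts.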
Learn from the mistakes of others. A good example is the 1969 collapse of a 385 m tall guyed mast on Emley Moor in Northern England. It collapsed due to an unanticipated combination of ice build-up and wind vortices. As a result, vortex damping was installed at similar masts in the UK, and no further failures of this type have occurred. A negative example is the Baltimore bridge case. The increased risks from bigger ships could have been anticipated and improved protection retrofitted. Several states are currently looking at bridges that may be at risk.
Upgrade your design. As designing systems to never fail is often too expensive, your chosen redesign becomes a compromise between resilience and cost, for which the Cost Of Failure is an important consideration. If the Cost Of Resilience is significantly more than the consequential Cost Of Failure, then economics would rule out adding resilience. In that case, designing in the best soft failure makes sense. Soft failure designs can be as simple as load shedding when a data centre gets too hot or requiring tugboat escorts for large ships.
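The comparison can be sketched in a few lines of Python. This is a simplified illustration under our own assumptions: the Cost Of Failure is estimated as annual probability times cost per failure times a planning horizon, the `margin` parameter is a hypothetical way to express "significantly more", and all figures are made up.

```python
def expected_failure_cost(annual_probability, cost_per_failure, years):
    """Assumed estimate: expected Cost Of Failure over a planning horizon."""
    return annual_probability * cost_per_failure * years


def choose_strategy(cost_of_resilience, cost_of_failure, margin=1.0):
    """Compare Cost Of Resilience against Cost Of Failure.

    margin is a hypothetical factor; values above 1.0 require resilience
    to cost "significantly more" before it is ruled out.
    """
    if cost_of_resilience > margin * cost_of_failure:
        return "design in soft failure"
    return "add resilience"


# Made-up figures: a 2% annual chance of a $5M failure over 10 years.
cost_of_failure = expected_failure_cost(0.02, 5_000_000, 10)

print(choose_strategy(250_000, cost_of_failure))    # cheap mitigation
print(choose_strategy(4_000_000, cost_of_failure))  # expensive mitigation
```

With these numbers the cheap mitigation is worth adding, while the expensive one points toward a soft failure design instead. A real assessment would also weigh consequences that are hard to price, such as injuries.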
In a related article, Kristin looks at examples of some failure risks that you may currently face, and shows how you might mitigate them.
If you’d like to discuss potential failure modes of your systems, or to comment on this article, please email me at peter.
This article was published in the July 2024 edition of The TMC Advisor - ISSN 2369-663X Volume:11 Issue:5
©2024 TMC Consulting