No such thing as Root Cause

I've never really been comfortable with the concept of Root Cause. It's time to be explicit: there is no such thing.

I remember one root cause analysis workshop I conducted where it became very clear to me that there wasn't a single root cause that could be identified as having led to the incident that we were looking at. That's when I began to use the term "Primary Cause[s]" to indicate that there was no single root cause. I blogged about this back in 2010.

Another contributor to my discomfort with the term root cause was Richard Cook, who wrote a brilliant paper on complex systems which I also blogged about (in 2009!!). That paper lays out the argument that complex systems are permanently broken and that their operators are constantly acting to keep them functional despite the broken components. For there to be a catastrophic failure, multiple faults need to line up to create the perfect storm which causes the outage. Therefore there will always be multiple causes of any system failure. This is clear to anybody who watches Air Crash Investigation on the television.

But I was surprised to discover recently that I've never actually blogged about how there's no such thing as Root Cause. Today I'm correcting that. It's not a new assertion: others such as Ian Clayton have said so in the past. But it is still often regarded as heresy to suggest this, and methodologies such as "the 5 Whys" imply that you can eventually dig deep enough to find the root cause of the problem.

We need to move away from this kind of thinking. Assigning a root cause is in fact an arbitrary choice between primary causes, and we often make that choice on political grounds. For example, it is expedient to blame an external supplier rather than to accept blame internally. So let's throw away the phrase "root cause" and talk about primary causes.

Sure, it makes sense to prioritise the order in which we address the primary causes, and perhaps the one that we address first is the one that we would in the past have called the root cause. But let's be clear about what we're doing and accept that we are addressing one of many causes.

As an aside:
All primary causes are human.
Geeks love to blame things, but a broken thing is not a preventable cause.
If the component failed, why did it fail? Why wasn't it replaced regularly? Why wasn't maintenance done? Why wasn't there redundancy? Why weren't warning signs picked up? Etc.
You can always get to a human cause.
This doesn't mean we are on a witch-hunt.
It means what we have to fix is invariably training, procedures, responsibilities, or behaviours.

Let's be clear that addressing any one cause is not going to prevent future outages: there are many causes floating around in a complex system, and they will combine in future in some different permutation to create some different problem for us. An improvement program needs to be constantly capturing all these causes and addressing as many of them as possible, not just some single cause that was arbitrarily fingered by a review. The fact that one cause seemed to stand out from the others this time will mean little in future incidents, where the mix will certainly be different.

There is no such thing as a root cause. There is no one cause more important than others. The world finds new and exciting ways to combine causes to catch us out. We must chase them all.
