Defining terms in Root Cause Analysis - let's be clear what we mean

We have a great discussion going regarding Root Cause. As usual so much comes down to precise definition of terms. What does Root Cause mean?

I use a model of Direct Cause, Contributing Cause, and Root Cause. I'm not sure where it came from originally but it shows up often enough.

Some software tools are pretty good at finding Direct Cause because that is the easy bit and that is often defined in technology terms.

Contributing Cause is the old "it takes two mistakes to make an accident[incident]". We would have weathered the Direct Cause if not for the Contributing Cause.

If we fix only the Direct Cause, it is going to happen again until we fix the Root Cause, i.e. Root Cause = Problem.

The IT Skeptic loves to generalise and over-simplify. In my simple little world, the Root Cause is the one when you keep asking "why?" and there are no more (useful) underlying reasons. See the scenario in the previous post.
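For the code-minded, the "keep asking why" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not a formal RCA method: the `Cause` class, the `useful` flag, the `find_root_cause` function and the example chain are all hypothetical, loosely modelled on a missing-patch firmware upgrade.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Cause:
    """One answer in a chain of "why?" questions (hypothetical model)."""
    description: str
    useful: bool = True              # is this answer still useful to the people asking?
    why: Optional["Cause"] = None    # the answer we get if we ask "why?" once more


def find_root_cause(direct: Cause) -> Cause:
    """Walk down the chain until the next answer is missing or no longer useful."""
    current = direct
    while current.why is not None and current.why.useful:
        current = current.why
    return current


# Illustrative chain only; every description here is an assumption.
chain = Cause(
    "Firmware upgrade failed: a patch was missing from the download",
    why=Cause(
        "Nobody checked the download against the patch list",
        why=Cause(
            "The upgrade procedure has no pre-check step",
            why=Cause("Because he didn't think", useful=False),
        ),
    ),
)

root = find_root_cause(chain)
print(root.description)
```

The first link in the chain is the Direct Cause; the last useful one is the Root Cause. In this made-up chain that is the missing pre-check step, which is a process problem rather than a technology one.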

I stick to my earlier assertion in that post that Root Cause is more often a process problem than a technology problem. The only way to find those is to have human(s) look at it, which means you won't do it for every incident. In general you'll only know Direct Cause for an Incident (if anything). You do RCA for Major Incidents and for Problems.

In order to keep us all entertained, ITIL v3 appears to use Root Cause only in the glossary [correct, readers?]. Likewise SFA appears only in the glossary [CORRECTION: ITILv3 discusses Service Failure Analysis in detail on pages 108-110 of Service Design. (pp397-399 of the PDF version)] (SFA no longer means what I used to use it to mean ("Sweet Fanny Adams" is the mild version), but now means Service Failure Analysis. It has a new meaning in Real ITSM too). In the books they refer to Kepner & Tregoe's "true cause" (V2 SS p 119, V3 SO p62).

To me
root cause
= true cause
= problem
usually = process error

Comments

not usually process error, but usually "people" error

Hi Skep,

Just up to the line saying = problem, I agree with your definitions. But the last line is a little too much focused on process. There are a few other "usuals" that I come across:
often = architecture error
often = requirements error
often = management error (wrong style of management, or just blatant ignorance ;-) )
often = process error
often = set of individual errors

almost always = people error

So basically it boils down to some people (IT architect, management of customer divisions, IT management, process manager, anybody in the process) doing something which should have been handled differently. The really hard part is getting these people (and I mean all these people) to understand that finding such a root cause is a good thing and that we all should learn from it. The higher up the hierarchy, the harder it is to get this in place. The better the company's management is, the better the whole company reacts to errors made.

Boiling ITIL down to the most perfect set of word definitions is not the part that will help us here. We all have to spread the word, the meaning and the true spirit of responsibility in work (now I start to sound like a preacher ;-).

process error = system error

If the Skep means that process error = system error, then the Skep is supported by Deming, see:
http://management.curiouscatblog.net/2006/05/03/find-the-root-cause-inst...

I was also searching for when the terms direct, contributing and root cause were first used in literature. I didn't get to the root of that answer but instead found a great research report on the topic. See:
http://www.hse.gov.uk/research/crr_pdf/2001/crr01325.pdf
It states that root cause is defined as: the most basic cause that can be reasonably identified and that management has control to fix. (My only issue would be that it should incorporate multiple causation.)
The research paper should just be labelled IT Safety and incorporated into ITIL v3!

ITIL and Best Practice

This thread has reminded me what a generally poor job ITIL (not just v3) does sometimes when dealing with subjects where best practice is already well established outside of ITSM. Financial management in v2 was frankly embarrassing.

ITIL's Definition of Root Cause

Took a quick look in the Index of the ITIL pubs...

Service Operation: Root Cause - The underlying or original cause of an Incident or Problem.

Service Operation: Root Cause Analysis - [An activity that identifies the Root Cause of an Incident or Problem. RCA typically concentrates on IT infrastructure failures. See also Service Failure Analysis. ]

Service Design: Root Cause - (same as Service Operation, although the Index says page 306 it is actually on page 308)

Service Failure Analysis - An activity that identifies underlying causes of one or more service interruptions. SFA identifies opportunities to improve the IT Service Provider's Processes and tools, and not just the IT Infrastructure. SFA is a time-constrained, project-like activity, rather than an ongoing process of analysis.

This is where an elaboration of the expanded Incident Life Cycle can be found on pgs 105-110.

Service Strategy: Root Cause - (no reference to Root Cause, however there IS a reference to Service Analytics on pg 184 with some interesting references to instrumentation, monitoring and Event Mgt; [While data from element instrumentation is absolutely vital, it is insufficient for monitoring services.])

Service Analytics - A technique used in the assessment of the business impact of Incidents. Service Analytics models the dependencies between Configuration Items, and the dependencies of IT Services on Configuration Items

Service Transition: Root Cause - (Same as Service Operation; in the Glossary on pg 243 but not in the Index)

So if I interpret this correctly, the ITIL guidance is saying that Root Cause and Root Cause Analysis are focused on Incidents and Problems associated with the IT infrastructure.

Service Failure Analysis and Service Analytics are activities that go beyond the technical infrastructure and look at downstream (process, etc.) aspects of the service and business impact.

Comments?

John M. Worthington
MyServiceMonitor, LLC

The root of root cause analysis has been removed

Hi John - great digging....

Sorry I am 'late to the plate' on this but... the root cause of all this mumbo jumbo in ITIL and our profession at large is that none (few) of us know how to define a problem and perhaps worse, we have not looked at other industries that do - like healthcare, federal emergency agencies and the like.

Root cause analysis is an element within the cause analysis activity. It MUST be preceded by control barrier analysis (Google that one) and the problemeer should be discovering types of symptoms and their likely or contributing causes - similar to a doctor (hopefully).

Enter the ITIL certificant with a box of colored pills to be taken in a certain order. How did your red and blue pill story go Skep? Anyway, the color and the order has magical significance. The patient must then wait an undetermined gestation period for the potion to work, some believe to give the consultant time enough to flee.

This is about knowing how to define a problem and its impact. ITIL doesn't know. Problems, when well defined, garner support and propel an improvement program. My hypothesis is therefore that ITIL cannot lead or be the only weapon.

So if we all agree that ITIL is but a contributor - and in many cases it has not got a clue.... what accounts for it being so misrepresented or misunderstood as being prescriptive - who the hell is out there either teaching folks it's the answer, or actively saying that? I think they should be awarded a special certificate....

Wicked Problems

You are correct about problem definition...check out PMI's eReads & Reference site (a nice deliverable itSMF might consider) called Dialog Mapping -- Building Shared Understanding of Wicked Problems.

Thought I'd mentioned this book on this site sometime back, but it does provide an interesting approach to clearly defining the Problem. I blabbered on about it in a post on Is Monitoring Automation a Wicked Problem? in June...

Of course the 'root' will change based on how you define the Problem. 'Root' from ITIL's perspective is pretty much focused on preventing seat-of-the-pants management of technology from what I can tell.

to wit...
[We were somewhere around half way to virtualization for most of our services when the real fires began. I remember saying something like, "I feel a bit lightheaded; maybe you should take the console . . ." And suddenly there was a terrible ringing of Alarms all around us and the Console was full of what looked like huge fireballs, all swooping and screeching and diving around the operations bridge, which was in a serious red-line state on a savage journey to hell. And a voice was screaming: "Holy Jesus! What are all these goddamn Events?"]

While it's true that the ROOT in many cases could be identified as a People or Process problem, when your hair's on fire putting together a fishbone diagram may not be at the top of your agenda. You either want some red/blue pills (to make the pain stop) or something else to put the fire out (and prevent another one).

So the ROOT could be poor Event Quality, ineffective, silo-based monitoring and cultures that reinforce the status quo. Just depends on how far ya wanna dig....

John M. Worthington
MyServiceMonitor, LLC

Who are the demented prophets?

It is a good question. Where does the "Yea brothers and sisters! ITIL is the answer, the one true way!!" start from? Who are the demented prophets going from town to town stirring up the masses? Suggestions welcome...

a hole in ITIL V3

Since the ITIL guidance only defines RCA and SFA in the glossary and studiously avoids discussing them in the books at all, I don't think they have much to say to us :-D I'd say this is a hole in ITIL V3.

SFA in ITILv3

ITILv3 discusses Service Failure Analysis in detail on pages 108-110 of Service Design. (pp397-399 of the PDF version)

You are correct that Root Cause Analysis isn't well documented. Root-cause analysis and RCA are only referenced about 34 times in the entire library and those are almost all in glossaries. That does appear to be a hole. Perhaps they intended to point to an external source but everything in the books points to Problem Management in Service Operation with nothing really there.

my goof

It is too! my goof.

Having worked out that I can buy a new set of books every eighteen months for the price of an online subscription, I haven't had the capability to search the books until I recently discovered the Google way. But I should have spotted that in the index - must have been a "senior moment". Thanks.

There is a tshirt in that, :-)

ITIL v3, Failure is not an option.

Maybe it's about viewpoint

Root cause is just a matter of perspective. I prefer to think root cause = direct cause. If you are trying to define a process framework, the Root Cause probably means a failure of process. In the same way Six Sigma defines Root Cause as a non-conformance of process.

I, however, work in an ecosystem which is larger than ITIL, and for that reason my RC definition is technology related. Failure of an ITIL process is more often a failure to prevent the problem than an active contributor in causing it. Of course this is not always the case: operational processes can often be direct causes, but most of the ITIL processes are preventative in nature.

It's important to remember that the initial incident is often resolved without considering the root cause. We are "working around" the root cause: we are not changing the testing process or the release process, we are just fixing the technology process. The work-around or temporary solution is a core component of all the vendor based support problems.

$0.02
Brad Vaughan
http://blogs.sun.com/buraddo

Anyone with kids will know ...

One day your child learns the word "why". They quickly learn that no matter what answer you give them, they can ask "why" again and again and again...

That's a good point. RCA

That's a good point. RCA needs to know when to stop.

I think the answer is "when the answers are no longer useful to the group asking them"

I think the root is nearly always a person, so when the answers start being "because he didn't think" we can stop

Direct Cause, Contributing Cause, and Root Cause

Even though I can't quote the source, I'd venture to say that the Direct Cause, Contributing Cause, and Root Cause model comes up so often as to be "generally accepted practice".

I think the distinctions are important and useful. As you say Brad, the incident process is primarily concerned with Direct Cause - just get the bloody thing working again, and it is usually technical in nature. But if we lose the distinction from Root Cause we lose valuable information. That RC is going to go on biting us again and again, which is what PM is supposed to prevent.

In my example scenario, the missing patch will be included in the download by the time we try the next firmware upgrade, (Direct Cause fixed), but we have no guarantee there won't be ANOTHER patch missing by then.

My example

So your example is an operational process. So there will never be a technology solution, it must be a better operational process.

I would give two examples of technology-related fixes:

1. Failed HDD - an incident of disk failure is identified, the incident is solved by replacing the disk, the root cause is derived from the telemetry on the disk, which shows the cause as excessive operating temperature, and the solution is to cool the array better (if not, it is likely you would have more failures in the same array enclosure).

2. Software bug - an intermittent incident of software failure is identified, through diagnosis the problem is found to be a conflict between two pieces of software under load, and the workaround is to segment the process (move to different domains or boxes etc.). The root cause is identified as a problem with specific lines of code. The final solution is delivered as a patch. Without the solution the problem may repeat itself with other pieces of software, or as the same failure mode for other customers.

$0.02

Brad Vaughan
blogs.sun.com/buraddo

an additional comment

I should have added that these two cases do illustrate the point of "viewpoint". Both of these technical root cause cases also have preventative process issues:

1. Proactive management of the environmentals would prevent the disk failure.
2. Better integrated testing could have identified the process incompatibility.

$0.02

Brad Vaughan
http://blogs.sun.com/buraddo

violent agreement

so we are in violent agreement again :-D

Not really

I am more with ITIL defn. on this one..

Within the whole IT ecosystem, the root cause step is more about technology and operational process. The ITIL processes work more in the solutioning phase. For the most part they are a non-direct cause, a preventative measure.

But I agree you can define root cause at the process level if process is your world.

So it's more a case of each to their own than violent agreement.

Brad Vaughan
http://blogs.sun.com/buraddo

The most misunderstood process

It has always struck me that a lot of people don't get the ITIL view of problem management, including a large number of red badge holders. Certainly from very early on the implicit ITIL view has been a lot more radical than people have realised. I'm sure there is something about people being unwilling to let go of an existing way of thinking. No one is saying operational and technical issues don't need to be addressed, but what I think we are trying to say is that those technical and operational issues will emerge elsewhere unless you take action at a different level of intervention.

I get that

Don't worry we get that.. What we are debating is what is "Root Cause"

What I am saying is that the definition should stay at the technical operational level, and the high level process should factor in as part of the solution and not part of the cause.

You should never just fix the disk, fix the environment monitoring, or apply the patch; you also need to implement improvements to the service management processes that will prevent this from occurring again.

From John W's comment, it sounds like the ITIL definitions are fairly consistent: "Root Cause" being more infrastructure aligned and "Service Failure Analysis" being more involved at the higher level, even if they fail to address them in detail.

$0.02

Brad Vaughan
http://blogs.sun.com/buraddo
