Root Cause Analysis can't be done by machines

Vendors are making a fuss about the Root Cause Analysis (RCA) features in their tools. People Process Things once again: who says Root Cause is in the technology?

If I weren't a skeptic I'd say how spooky it is the way this blog pre-empts articles I've written.
Red said

most tools assume incorrectly that multiple causation is not required in managing problems

and James said

I've also always tried to push RCA back beyond the technical fault towards the kind of generic issues that impact a lot of apparently unrelated incidents, for instance lack of effective pre-production testing

I have an article coming up on ITSMWatch that says

The lowest level event message is often not the Root Cause, so drill down data is only a symptom ... Root Cause is often a procedural error [i.e. human not machine] and no software can detect it.

When geeks invent tools to fix technical problems it's great. When they try to invent tools to fix people and process problems it's not so great.

Root Cause Analysis requires a bunch of people in a room walking through what happened and building fishbone diagrams. The root cause is not necessarily technical. My belief is that it is almost never technical.

The SAN crashed
Why?
The firmware update failed
Why?
We were missing a patch
Why? Did we check?
Yes we checked but the patch wasn't on the vendor's public support system
Why not?
They hadn't rated it critical
Why not?
Human error
Did we contact the vendor to check required patches?
No, we just looked for criticals on the system
Will we check with them for all required patches before the next upgrade attempt?
Yes
And we want a letter from the vendor saying they've fixed whatever process failed to recognise it as critical
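
To labour the point, here is that drill-down as a minimal Python sketch (hypothetical names, not any tool's API). Note which end of the chain the tool sees and which end the people in the room find:

# Minimal sketch of the why-chain above (hypothetical names, not any tool's API).
why_chain = [
    ("The SAN crashed", "technical"),
    ("The firmware update failed", "technical"),
    ("A required patch was missing", "technical"),
    ("The patch wasn't on the vendor's public support system", "process"),
    ("The vendor hadn't rated it critical", "process"),
    ("Nobody contacted the vendor to confirm required patches", "process"),
]

tool_view = why_chain[0]    # what drill-down monitoring reports
root_cause = why_chain[-1]  # what the people in the room conclude

print(f"Tool says: {tool_view[0]} ({tool_view[1]})")
print(f"RCA says:  {root_cause[0]} ({root_cause[1]})")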

So the cool tools tell me Root Cause was the SAN. Crap. Root cause was a negligent vendor and a negligent engineer doing the upgrade.

Vendors make this song and dance about the RCA feature in their products, but it is only a gizmo. It provides one useful input to an RCA discussion, nothing more. There is no automating RCA.

Comments

It's not about fires

As Skep says, RCA is not about fire fighting, incidents and so on. It's about fire prevention.

It is still true, however, that a large volume of incidents never gets passed through to the problem management process for RCA. It's just not practical to have 10 people sitting in a room to analyse that volume, particularly for incidents that do not "on the surface" have a large business impact.

Along with handling minor incidents, technology has a huge role in RCA and problem management for pattern analysis. It's very difficult for people to identify patterns, particularly when the patterns span tiers of the IT system.
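
As a back-of-envelope illustration of the kind of cross-tier pattern analysis I mean (made-up incident records, not any product's data model):

from collections import Counter

# Made-up incident records: three tickets in different tiers, one underlying pattern.
incidents = [
    {"id": "INC-101", "tier": "application", "symptom": "order form timeout",   "change_ref": "CHG-554"},
    {"id": "INC-102", "tier": "database",    "symptom": "slow query",           "change_ref": "CHG-554"},
    {"id": "INC-103", "tier": "network",     "symptom": "intermittent latency", "change_ref": "CHG-554"},
    {"id": "INC-104", "tier": "desktop",     "symptom": "printer offline",      "change_ref": None},
]

# Count recurring references across tiers; a human reading tickets one by one rarely spots this.
pattern = Counter(i["change_ref"] for i in incidents if i["change_ref"])
for change, count in pattern.items():
    if count >= 3:
        print(f"{count} incidents across tiers trace back to {change}: candidate for a problem record")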

It cannot be all technology, but it cannot be all people either.

Brad Vaughan
http://blogs.sun.com/buraddo

Air Crash

Do you remember the days when we used to endlessly discuss process v. function, when they were the only two terms that ITIL used, before we had capabilities and so on? Let's go back to that very simplistic model for a second. My view is that all incidents touch the problem management PROCESS - because someone (who does not have to be in the PM FUNCTION) makes a call at 1st, 2nd or 3rd level support that the incident either relates to an existing problem or has been caused by a new one, and that it is (or isn't) worth investing time in problem management. Post event, all incident records are at the disposal of the PM Function to analyse, and if it is a major incident the PM Function will get involved by default. So no, not all incidents get to the PM Function, and that's fine because we are making a VFM judgment. On the other hand, all incidents do get assessed by the PM process.

One of my favourite analogies, when I was a lecturer, was that PM is akin to air crash investigators. They have no interest at all in putting out fires, but only in stopping another crash happening - and they pass their recommendations on not to the fire fighters (except when relevant) but to the designers and operators of aircraft.

fire fighting & fire prevention...

One thing vendors may want to observe about this thread is how the term root-cause is used. When a tool vendor says 'root-cause' they are typically focused on isolating and putting out a fire, which is different from finding the more fundamental downstream root causes (like lack of training, testing, etc.).

Like fire fighting and fire prevention, the two are not unrelated. Think of finding the cause of the fire to be an electrical short, which in turn is determined to be the result of a lack of awareness on the part of the homeowner, who keeps overloading circuits.

If you are going to make downstream investments in process improvement, it's nice to be sure your conclusions are correct. Event Management touches many life cycle stages and processes, and monitoring even more, and can be essential to effective downstream root-cause analysis.

John M. Worthington
MyServiceMonitor, LLC

putting the fire out vs root cause

While I don't disagree with much of what you've said, when your house (or your butt) is on fire it doesn't do you much good to start a dialog on human error and negligent vendors.

Improper patching, human errors, inadequate testing, process failure, negligent vendors and the associated discussions can lead to finger-pointing and blame games. You need to have this debate, but you'd better PUT THE FIRE OUT FIRST.

Who says the root cause cannot be based on technology? It is just as viable an option as process or testing or anything else. In fact, an inability to quickly put fires out (and detect potential fires) can be the root cause of an inability to address People and Process issues and perform 'root cause'.

People, Process AND Technology. Ya need 'em all.

John M. Worthington
MyServiceMonitor, LLC

firefighters have no idea of the root cause

Disagree. Very often firefighters have no idea of the root cause while putting the fire out; they just put it out. RCA comes later, in the investigation. Likewise in IT, RCA is more a function of Problem Management than Incident Management, IMHO (ITIL says both). And as part of PM, too right I'm going to nail the culprit's ass to the wall. Nicely of course, but I want to know the problem is fixed.

Actually

Speaking as an IT Manager for the fire service, I can tell you that a HUGE amount of time is spent on RCA within our organisation. The firefighters at the scene use their experience to gauge the cause of the fire if they can (KEDB, you might say), but if it is unknown or deemed 'suspicious' then the Fire Investigation team, complete with arson dogs, will investigate further. However it goes far beyond that, with incidents being examined to see what went well, what went badly and so on.

It's funny how focussed on Service Management you can get when getting it wrong could kill you!

The key with the Fire Service is that prevention is better than cure, and most of the fire fighters' free time is now spent either in training or doing this preventative work.

As for a service desk not being busy all the time, it would appear that you haven't seen a public sector one, or at least not mine. Imagine the ideal 1st level setup, then remove all resource and the ability to get any ... and you have a pretty close picture.

First Amongst Equals

True that PPT applies, but one of the key tenets of ITSM is, I think, that if you are always fire fighting you are doing something wrong. At a very simplistic level, if the root cause is technological, your process is what should cut in to remove that technology from your infrastructure.

IT Safety

Wow, I go to the gathering for a few pots of draught accompanied by some pipes and drums, and this blog experiences the most action it has had for the whole week!
I feel that the most important goal of IT is a return to service when an outage has occurred (IT fire fighting), but that a return to service should not come at the expense of inheriting an aggregated loss. Does a quick return to service have an associated long-term impact on profitability, as opposed to a delayed return to service which addresses the underlying causes and turns the business around?
Now when a major incident occurs, it is not possible to make these kinds of decisions by the seat of your pants. What is required is that some ground work already exists, hence my idea that an IT Safety function should exist: one that fights the fire when it happens and prepares for it during periods of calm.
The inherent complexities of IT imply that major incidents will always be a reality and cannot be engineered away. This view of IT Safety, which is in effect continuous fire fighting, is the correct approach in my opinion.

firemen

From the any-day-now book:

it should be noted that firemen spend an awful lot of time polishing the fire engine, rolling and unrolling hoses, and playing cards. Any manager who expects Service Desk and Level One Support people to be always busy (“fully utilised”) does not understand what they do. Real Level One Support has plenty of spare capacity.

...but I think you put it better. Firemen drill. They drive round checking out big buildings before they burn. They plan, they practice.

They plan, they practice?

Although it would serve them well, I have never witnessed a service desk that plans and practices. They'd rather crash and burn the customers in real time. That is just my experience, unless the whole world has conspired against me, making me deal with only the lousy ones.

The act of actually putting out fires might not involve RCA, but that does not mean that firemen do not understand or require RCA. Just putting out the fire has the potential to make things worse. I'll try and use a real example as blogged about here: http://thedailywtf.com/Articles/Designed-For-Reliability.aspx

Now the example can be applied generically to any redundant system, and I admit I also made the same mistake. However, no maintenance guide or checklist addresses the issue. It is far too easy to say "human error" and nail the sorry techie's ass. What I learnt to do was paste Post-it notes, with "Operational, do not touch!" written on them, on everything that was working before I attempted a repair. I had a good smile when I noticed that Kevin Bacon's character in the film Apollo 13 did the same thing to prevent himself from inadvertently jettisoning his colleagues.

human error

I was referring to an idealised model where service desk staff MIGHT behave like firemen :-D

See my latest post:
If the Direct Cause is human error, be nice to the guy.
If the Contributing Cause is human error, warn them.
If the Root Cause is human error, you have to make sure it won't happen again

How 'bout Event Management?

OK, I'll buy that. Given the v3 guidance, what about the root cause being gaps in Event Management? Could the silo-ed nature of monitoring inhibit effective Event Management?

I believe in many cases it is. I tried to put a link to a White Paper I wrote on this subject, but I suspect it got screened out; I'll try again Here.

Believe me, I am not trying to sell snake oil here. In fact, I know of many clients who made some initial progress by solving the monitoring issues but never took the additional steps needed to solve more fundamental root causes (such as leveraging improvements in Event Management to make gains in other service life cycle stages and processes).

Closing gaps in Event Management can contribute in a very significant way to achieving a service-oriented paradigm shift.

John M. Worthington
MyServiceMonitor, LLC

Noise

John,

There is a good point in here, which has been worrying me in the context of a recent assignment. You can have a very competent low-level operations bridge doing a lot of detailed monitoring of events which still fails to pick up two types of scenario. One is the technically trivial event that at the user end is causing chronic impact, and the other is the event that in itself is minor but to a techie guru leaps out as being a symptom of something more major in the background that is going to become a major incident if it isn't dealt with. How do we create a system that ensures the low-level monitoring systems push the right messages up the line to those who know how to interpret them? Again, I know a lot of thought has gone into this in the avionics industry, with a big emphasis on the "need to know".

more noise and then back to work!

OK, some more noise... I am not the techie here, but will give it a shot... I do like the dialog though! First I'd say that staffing an Operations Bridge with silo-based monitoring tools and no correlation intelligence may be a waste of time and money. The sheer number and complexity of events is too much... however, under other circumstances...

Scenario 1: [the technically trivial event that at the user end is causing chronic impact]

This is why monitoring the end user experience is so important, and why correlation of events must be in the context of the end-to-end experience (often a transaction). If it is causing havoc at the user end it is not trivial (unless it's not important to the user, in which case you should ask why we are monitoring it at all). More importantly, you need to correlate WHY the end user experience is poor (i.e. response time, etc.): what seemingly trivial event (i.e., exceeded threshold, etc.) is the source ("root cause") of the anomaly?

Sometimes 'low level' monitoring equals 'infrastructure' (not inclusive of the applications). As applications do not operate in isolation, you cannot exclude them from monitoring. Leaving the application out of the equation is still a form of silo-based monitoring; the monitor MUST manage the service: every layer of every component.
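
As a rough sketch of the kind of correlation I mean, with invented event and transaction records (not any monitoring product's API):

from datetime import datetime, timedelta

# Invented records: component events and an end-user transaction that breached its SLA.
events = [
    {"time": datetime(2008, 8, 1, 9, 2), "component": "db01",  "detail": "cache hit ratio below threshold"},
    {"time": datetime(2008, 8, 1, 9, 3), "component": "web03", "detail": "log rotation completed"},
]
transactions = [
    {"time": datetime(2008, 8, 1, 9, 4), "name": "place_order", "response_ms": 9200, "sla_ms": 2000},
]

def suspects(txn, events, window=timedelta(minutes=5)):
    """Component events shortly before a slow transaction are candidate technical causes."""
    if txn["response_ms"] <= txn["sla_ms"]:
        return []  # the user experience is fine, so the 'trivial' events stay trivial
    return [e for e in events if timedelta(0) <= txn["time"] - e["time"] <= window]

for txn in transactions:
    for e in suspects(txn, events):
        print(f'{txn["name"]} breached its SLA; candidate cause: {e["component"]} - {e["detail"]}')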

Scenario 2: [the event that in itself is minor but to a techie guru leaps out as being a symptom of something more major in the background that is going to become a major incident if it isn't dealt with]

This goes toward 'what do we need to be monitoring?' I have seen clients use intelligent monitoring in the QA area, in order to get some idea of what happens to the service under load and other failure scenarios. Of course the QA lab needs to have this capability and you can't cover every base, but it helps. It may also help staff understand the real-time nature of dependencies... I don't know how in the world some folks do demand and capacity management (i.e., in n-tier, virtual environments) in a silo-based monitoring environment... it seems almost impossible to me.

[How do we create a system that ensures the low-level monitoring systems push the right messages up the line to those who know how to interpret them?]

The ability to easily create personalized views of the service infrastructure can help in this regard. For example, if you want the DB domain expert to focus on DBs, they can have a view of just that. Any events that are determined to be caused by the DB will be shown as such in the DB view...
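
A toy sketch of that kind of filtered view over one shared event list (invented fields, nobody's real schema):

# One shared list of events; each domain view is just a filter over the same 'source of truth'.
events = [
    {"id": 1, "domain": "database", "service": "online ordering", "detail": "deadlock detected"},
    {"id": 2, "domain": "network",  "service": "online ordering", "detail": "packet loss on core switch"},
    {"id": 3, "domain": "database", "service": "billing",         "detail": "tablespace 90% full"},
]

def domain_view(events, domain):
    return [e for e in events if e["domain"] == domain]

def service_view(events, service):
    return [e for e in events if e["service"] == service]

print(domain_view(events, "database"))          # what the DB expert sees
print(service_view(events, "online ordering"))  # the shared, service-level view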

However I am a big believer that the various silos (domains) should be able to share the same view of the service. It is an important element of establishing organizational collaboration between domains, which is often absent. Once you have an agreed 'source of truth' about service impacts, THEN a Wiki or knowledge base might make more sense to foster discussion about the non-technical aspects of root cause.

If I missed your point I am sorry...back to work Monday. I may not have time to post but I'll be reading and will catch up when I can....say hi to Ivor for me!

John M. Worthington
MyServiceMonitor, LLC

Working on the IT Skeptic

I may have to edit your post John. I can't have you implying that work is not the place for the IT Skeptic website. It's called "professional development". Some of my biggest traffic peaks are weekday mornings in Western Europe and the USA.

please don't make me Blog Alone

K...no harm intended.... Your site is of value to me and I did not mean to imply otherwise.

To some extent, this blog in particular is a Road Less Traveled...unless we are challenged by the Skeptics among us, real personal/professional growth won't happen.

In fact, our European friends may understand better than we in the US do the importance of social capital to effectiveness and efficiency; something we are often short of. Greater participation would be a welcome (and valued) change.

I only meant that for me the party was over (as my vacation had come to and end).

Sorry Skep.

John M. Worthington
MyServiceMonitor, LLC

Once again: it's not black or white

Disclaimer: I work for a vendor, and actually work in a related area

The one thing that most vendors realize, but that sometimes gets lost in marketing messages, is that autonomic computing and RCA can never be fully implemented. The issue is "context": IT systems exist within an ecosystem that includes other systems that are not measured and people who are not predictable. But that should not stop vendors from pursuing the area.

Live or die by the 80/20 rule!!!

There are a lot of incidents in the IT ecosystem that are of a simple nature and do not require significant analysis, whether they are matches against known errors or very simplistic failures. Why not try to automate that process and take some cost out of the management system?
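
As a simplistic sketch of the sort of automation I mean (a made-up known error database, not any vendor's tool):

# Made-up known error database; simple incidents get a workaround, the rest go to a human.
known_errors = {
    "printer spooler hung": "restart the spooler service (workaround KE-0012)",
    "password expired":     "reset via the self-service portal (KE-0007)",
}

def triage(incident_description):
    text = incident_description.lower()
    for symptom, workaround in known_errors.items():
        if symptom in text:
            return ("auto-resolve", workaround)
    return ("escalate to problem management", None)

print(triage("User reports printer spooler hung on PRN-14"))
print(triage("SAN crashed during firmware upgrade"))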

I should also say that people-driven processes are equally flawed. Sit two different sets of 10 people in a room with some data (assuming the data is comprehensive enough), get them to do a fishbone, and you will probably end up with different conclusions.

Complex systems generally yield solutions which are inherently flawed, but attempting to solve the problem is better than ignoring it.

Just beware that marketing cannot deal in reality. If you use a non-interactive process like marketing, you need an impactful, utopian message. Reality only kicks in during the process of evaluation or sales.

Brad Vaughan
http://blogs.sun.com/buraddo

Problem people

Brad,

I've always thought that there are two good types of problem manager. The first is wholly non-techie but knows how to handle techies and asks those killer naive questions, and the other is the off-the-wall type, perhaps summed up best in David Firth's brilliant "The Corporate Fool". Techies tend to make poor problem managers but are an essential tool in the process. The same applies to technology itself. A healthy knowledge of statistics doesn't go amiss either.

A few years ago Ivor Evans and I had one of our creative sessions in a Dublin pub and concluded that every problem report should include a compulsory multiple-choice selection of causes:

a) No process in place
b) Process in place but not followed
c) Process followed but process wrong
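
Purely for illustration, a tiny sketch of how that compulsory field might sit on a problem record (invented names, nobody's actual schema):

from enum import Enum

# Invented names; the point is simply that the cause classification is a compulsory field.
class ProcessCause(Enum):
    NO_PROCESS = "No process in place"
    NOT_FOLLOWED = "Process in place but not followed"
    PROCESS_WRONG = "Process followed but process wrong"

problem_report = {
    "summary": "SAN crashed during firmware upgrade",
    "process_cause": ProcessCause.NO_PROCESS,  # compulsory, no free-text escape hatch
}
print(problem_report["process_cause"].value)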

James

How about this one?

http://www.theregister.co.uk/2008/07/29/hot_netware_server/

Well it amused me. Of course now the server would be monitored remotely, and would be in a controlled environment, so the detective work wouldn't be needed.
