Should every major incident produce a problem record?

Dear Wiz

The auditor pinged us for not creating a problem record every time we have a major incident. I've been through ITIL V3 and ISO20000 and I can't find anything that says we should. What do you think?

P***ed Off

Dear P***ed Off

I can understand your frustration. It is the same frustration experienced by retarded children trying to come to grips with their own cognitive limitations.

You simple folk in the field really must learn that ITIL is a subtle and immensely complex body of knowledge. It takes years of experience and study to even begin to develop the higher intelligence required to apply it to everyday life. And this is an intelligence you cannot possibly attain working for years or even decades in a steady job at the IT production coalface. That sort of environment only stifles your development.

To reach our levels of awareness you must learn by teaching theory, gaining many certifications, travelling, wearing a suit, using wireless in airports, and attending many conferences. Until you have your own blog and people pay you to recommend things you cannot possibly begin to understand.

So please leave it to the auditors and consultants OK? You will only upset yourself flailing about in your ignorance.

To help you out with your specific question, ITIL may not say so in as many words but to those of us with higher insight into its nuances the message is clear. If you fail to recognize the value of documentation, even after the fact, you are doomed to repeat your mistakes all over again. That's why there is so much emphasis on DIKW and the knowledgebase approach with the SKMS. Are you beginning to get a glimmer of understanding? Because if not then going further will be an exercise in futility and I have better things to do.

I shall give you the benefit of a great deal of doubt and explain a little more: the consistency of the application of best practice and processes/roles is precisely what truly helps orgs to dig out of major incidents and problems in an effective and efficient manner. ITIL clearly says (SO 4.2.4.3) "A problem is the underlying cause of one or more incidents and remains a separate entity always!" One day you will understand that when ITIL ends a sentence in an exclamation mark that means it is something really important we call a "best practice", OK? ITIL goes on to say "some major incidents may not need to be handled in this way ... provided the impact remains low!" Also a best practice because of the "!" see? But you cannot begin to comprehend the depth of understanding required to come up with major incidents whose impact is low, now can you? Name one. I bet you can't.

Finally SO 4.2.4.3 says "If the cause of the incident needs to be investigated at the same time" which is pretty often "then the Problem Manager would be involved as well". Now ITIL doesn't say anything else about problems in major incidents, but for those of us truly attuned to its brilliant insights it doesn't need to say any more. To us it is as clear as words on a page that problem records will be created in all major incidents. Of course that is what this passage means, and if you study hard and travel hard maybe one day you will see it too.

That auditor was a wise man, probably even an ITIL Expert.

Good luck seeking enlightenment!
The ITIL Wizard

P.S. there are lots of good comments below and our host the IT Skeptic has discussed this further here

Comments

ITIL Audit?

Was this an ISO 20K audit? I am not aware of any ITIL audit. Since ITIL provides guidance only, that leaves room for you to define processes any damn way you choose, and an auditor should be auditing:

say what you do
do what you say
prove it

If it's an ISO 20K audit then it should be clear; the code of practice 'shalls' will tell you ....

John M. Worthington
MyServiceMonitor, LLC

It's all about impact, impact and impact

Deja vu? "What's the difference between an incident and a problem" is so 'itSMF Conference bar talk'. It's simple - IMPACT! They are separate records of what could be the same event. An incident record is the poor man's response - implying we are not sure if the effort of finding out why something happened is worth it - so let's just restore service - keep an audit trail - and move on.

Problem - one or more stakeholders (including service provider) makes us give a damn - so we ALSO log a problem record and log it into the investigative team. We run the money clock to be sure they don't overextend us - but give them time to come up with reasons (causes) and options to mitigate or eliminate... the problem angle is all 'return on investment' based....

IMPACT, IMPACT, IMPACT drives 'ask me if I care' decision making and throws the problem switch... very simple indeed...

whether a problem record is required

I'll admit the point of the post is obscure but you are missing it I think :) The issue is not the difference between incident and problem. It is whether the Incident Resolution procedure is "allowed" to include resolving the underlying root causes, i.e. is allowed to wander into Problem Resolution territory, and in particular whether a problem record is required (not whether it is a nice idea - whether it is REQUIRED) in EVERY Major Incident in order to document what was done to resolve the underlying root causes, as compared to recording that in the incident record.

At a practical level, there are advantages to creating a problem record, so as to have all underlying faults documented in one place, and so as to have the problem team look further to make sure the alligator was killed. On the other hand it is an overhead. So it is probably a good idea but that's not the issue here.

At a religious level, the holy ITIL book certainly implies to me that a problem is something that cannot or was not dealt with by the Incident Resolution procedure, or put another way if something was dealt with by Incident Resolution then the implication is that it never got to be a distinct problem entity, even if we might involve the problem people in dealing with it.

So I don't think it is at all clear that it is "wrong" to record everything that happened in the Major Incident record - ITIL is hopelessly vague on this. The way I have done it in the past is to produce a Major Incident Review template that includes root cause. I see that as equally valid. The problem team should be part of every Major Incident Review - if there is still a (suspicion of) problem there they can create the problem record. I'm not entirely convinced about creating problem records for their own sake when a Major Incident has been cleanly resolved, which was the advice of the auditor.

I may have missed that point but...

Skep
Accepted I may have missed the functional aspect. That's for governance to determine on a case by case (customer/service/service model) basis - go to your copy of USMBOK for more on some rules. Anyone could open a problem record - why not - as long as they are trained or guided in how to start populating the problem definition and impact statements - two areas ITIL V3 offers no help on at all. (Strike 1). Regard problem as a suggestion box - as for major incident - specific governance should apply (Strike 2).... and be established as we do not want unnecessary thrashing of resources whilst someone's barn is burning down!

As for root cause - complete fallacy. It's very rare indeed to find one. Experience and guidance in non-IT (ITIL) world explains that causes come in gaggles (or whatever the word is for more than one cause!). They have types, of which root is one. Each cause needs to be weighed after careful documentation and a countermeasure formed.... USMBOK terms this a 'solution set'. ITIL misses this as well (Strike 3).

So forget ITIL v3 - it's useless for problem management - let's talk about problem management as it is commonly used outside of IT and ITIL... in healthcare for example.. and this will all be so much easier... As I have blogged before - ITIL V3 problem management sucks. It's poorly defined, misleading, incomplete, and without this part continual service improvement is in trouble......

Do not have two processes doing the same thing

There is no sense in having two processes doing the same thing. IM must solve incidents, whatever it takes. PM is a different process with different goals. All PM is then proactive and the goal is to minimize or eliminate the risk of the thing recurring. This concept is from Jan van Bon and I think it makes a lot of sense. Unfortunately you would lose points in an ITIL exam with this answer.

Aale

Audit

Where to begin? In this reply I will try and stick to the audit issues.

Generic ITIL audits are of variable value. Most people designing and undertaking them have little understanding of or training in professional audit techniques, which lessens their value. In recent years there has been a trend to link audits to maturity models, something Rob has commented on elsewhere recently.

In any case there is no definitive guidance on how to audit ITIL, never mind what questions to ask and evidence to consider, which means the auditor must make up the shortfall.

From an audit perspective it would be legitimate for an auditor to point out that based on their experience and best/good practice it is at least advisable to raise a problem for every distinct major incident, even if a management decision is then made not to undertake any further problem management activity.

ISO 20000 is slightly different.

I believe I'm correct in saying that strictly speaking they only audit based on part 1 of the standard.

There was a formal BSI 15000 audit workbook which I wasn't that impressed with from an audit perspective. I vaguely recall this was re-issued as an ISO/IEC 20000 self assessment workbook, rather than as an official auditors' checklist, but I don't know what the official status of it is - IIRC it isn't an ISO document.

ISO 20000 auditors do have leeway in assessing organisations. They will, correctly, take into account the size of an organisation, and obviously can only audit based on the agreed scope statement.

Part 1 does not say that all major incidents should lead to a problem record being raised. I think it would be reasonable again for an ISO 20000 auditor to ask about those cases where a problem record was not raised, and assure themselves that controls were in place to ensure this was the result of a documented management decision, not an oversight.
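The control described above can be sketched as a simple check. This is only an illustration, not anything taken from ISO 20000: the `MajorIncident` class, its field names and the sample IDs are all hypothetical. It flags major incidents that have neither a linked problem record nor a recorded decision not to raise one.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class MajorIncident:
    # All field names are hypothetical, chosen for this sketch only.
    incident_id: str
    problem_id: Optional[str] = None           # linked problem record, if one was raised
    no_problem_decision: Optional[str] = None  # who decided not to raise one, and why


def unexplained_gaps(incidents: List[MajorIncident]) -> List[str]:
    """Return IDs of major incidents with no problem record and no documented decision."""
    return [
        mi.incident_id
        for mi in incidents
        if mi.problem_id is None and not mi.no_problem_decision
    ]


sample = [
    MajorIncident("MI-1", problem_id="PR-7"),
    MajorIncident("MI-2", no_problem_decision="Ops manager: cause known and fixed during resolution"),
    MajorIncident("MI-3"),  # neither a problem record nor a decision: an oversight
]
print(unexplained_gaps(sample))
```

Under this reading, a missing problem record is not a finding in itself; only the third incident would prompt a question from the auditor.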

James Finister
Wolston Limited
www.wolston.net
www.coreITSM.com
http://coreitsm.blogspot.com/

I still say that a good logical data model would illuminate this

I have a number of colleagues from the data management world who I believe would be troubled by the inconsistencies. Here is where I think the conversation would go:

"OK, we have agreed that there is an entity named Incident, and there is an entity named Problem. They clearly have some sort of relationship, if not a functional dependency."

"But after reading your documentation and interviewing a number of apparently distinguished subject matter experts in ITIL, it remains unclear to me whether

1) they are subtypes of some common abstract type;
2) what their cardinality is, and
3) what their optionality is."

"Good luck with that Service Desk application...."
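To make the cardinality and optionality questions concrete, here is a minimal sketch of one possible answer. It is an assumption for illustration, not a model that ITIL defines: all class, field and function names are hypothetical, and the subtype question (1) is left open. It treats a problem as related to many incidents and an incident as related to at most one problem, and shows that the auditor's position amounts to making the link mandatory for major incidents.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Incident:
    incident_id: str
    is_major: bool = False
    problem_id: Optional[str] = None  # optionality: an incident may have no problem (0..1)


@dataclass
class Problem:
    problem_id: str
    incident_ids: List[str] = field(default_factory=list)  # cardinality: one problem, many incidents


def link(incident: Incident, problem: Problem) -> None:
    """Record the relationship on both sides."""
    incident.problem_id = problem.problem_id
    problem.incident_ids.append(incident.incident_id)


def violates_mandatory_rule(incident: Incident) -> bool:
    """True if a major incident lacks a problem under the stricter 'mandatory' reading."""
    return incident.is_major and incident.problem_id is None


disk_full = Incident("INC-1", is_major=True)
slow_login = Incident("INC-2")
outage = Incident("INC-3", is_major=True)
capacity = Problem("PRB-1")
link(disk_full, capacity)

flagged = [i.incident_id for i in (disk_full, slow_login, outage) if violates_mandatory_rule(i)]
print(flagged)
```

Whether `violates_mandatory_rule` belongs in the model at all is exactly the point in dispute; a different site could just as reasonably choose a many-to-many relationship.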

Charles T. Betz
http://www.erp4it.com

in the interests of the ITIL industry not the ITIL users

Having "grown up" in this industry worshipping the one true Codd, I have always been intrigued by ITIL's refusal to define a data model at any level of detail beyond a VSAM-inspired "record". No Entity-Relationship, no normalised data model... To me, this has to come from the influence of the vendors: they need to be able to say they "support ITIL" without redeveloping product to comply with an ITIL data model. The lack of data guidance is in the interests of the ITIL industry not the ITIL users. If it had been done twenty years ago they'd all be there by now.

We see exactly the same thing with CMDBf, the federation standard. It defines the syntax of data exchange, because you can externalise just about any data model these days regardless of the internals of your product, but they refuse to define the semantics - the behaviour and interpretation - because once again that would require code changes more extensive than just slapping a data translator and XML-API on the outside. [Please excuse my layman's abuse of terminology - pedantic corrections welcomed].

So you've done it again

So you've done it again!
Skep, this post started really nicely, by explaining how weird the consultants are.
"How sweet are thy words to my taste! yes, sweeter than honey to my mouth!"
But then all of a sudden, it's again the fault of these ugly software vendors. What a disappointment.

Philippe,
Member of the Royal Society for the Prevention of Cruelty to Software Vendors

Prove me wrong

Prove me wrong: come up with a better explanation for a mystifying gap in ITIL

Actually there are mentions

Actually there are mentions of data schemes in the books. See for example Service Operation, page 50, 4.2.5.3. Even in some detail: see page 66, 4.4.7.2, Note in the second column.

But it's really risky to give a more detailed list of tables, and fields. You may contradict yourself or even start endless discussions with techies on subjects you don't really control. No one wants to start discussing with techies, they are sometimes right.

When you're an ITIL consultant, it's so much easier (and cost efficient) to spend time chatting with upper management than trying to build something that really works. Leave that to the lower class.

Philippe, member of RSPCSV and Che Guevara of ITIL

PS: this could probably be called a "troll" based on the fact that most people who comment on this blog are ITIL consultants. Don't hesitate to cut if you find my comments inappropriate.

inappropriate

I find them inappropriate but not as inappropriate as mine

Does this stand up to scrutiny?

Surely it would be in the interests of some of the vendors to promote their own data model, and it would be easier for them to "prove compliance" if such a model existed? I thought HP were actually undertaking this as part of v3, and I presume Ashley's role as a mentor in the update, ensuring the diagrams are consistent, reflects that ongoing interest on the part of HP.

Given a choice between conspiracy and incompetency I tend to go for the latter. Isn't it just the case that bringing some rigor to ITIL would reveal just how much of it is still very soft? That isn't to say there is anything wrong with it being soft, but it would be nice to know which bits are soft and which are capable of being subjected to detailed specification.

James Finister

A problem record means the alligator is still out there somewhere

My definitive (non-facetious) answer to this:

I think the auditor was making the point that it is good practice to always check for root cause of a major incident. An alligator mauled you. Did you kill it or did you just drive it off?

But I think the auditor made the point the wrong way.

How you record evidence of having done root cause analysis is up to you: ITIL and ISO20000 have nothing to say on it.

But you SHOULD do root cause analysis as part of the response to any Major Incident or as part of the later wash-up and review. That analysis is "officially" labelled as problem management activity and is what is meant when ITIL and ISO20000 say you need to "do problem" with a major Incident. If you don't methodically do RCA then I suspect that is the point the auditor was really making.

A problem record is a good way to record your RCA but not the only way, not the "official" way (in fact ITIL clearly implies you only create a problem record for recurring or unsolved incidents: SO 4.2.5 "ongoing or recurring problem"), and not in my personal observation the generally accepted way: I've seen it recorded more in the incident record and especially in a formal post-incident review. The Incident Review template I use has analysis of direct, contributing and root cause. Problem staff are involved in that review and they would create a problem record if the root cause was undetermined or unresolved, i.e. the alligator is still out there somewhere.
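The review rule described here can be sketched in a few lines. The `IncidentReview` class and its field names are hypothetical stand-ins for the template mentioned above, not its actual layout; the only logic taken from the text is that a problem record is raised when the root cause is undetermined or unresolved.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class IncidentReview:
    # Hypothetical fields mirroring the direct / contributing / root cause analysis.
    incident_id: str
    direct_cause: str
    contributing_causes: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None   # None means the root cause is undetermined
    root_cause_resolved: bool = False


def needs_problem_record(review: IncidentReview) -> bool:
    """The alligator is still out there if the root cause is unknown or not yet fixed."""
    return review.root_cause is None or not review.root_cause_resolved


killed = IncidentReview(
    "MI-10",
    direct_cause="Database ran out of disk",
    contributing_causes=["No capacity alerting"],
    root_cause="Log rotation disabled by a change",
    root_cause_resolved=True,
)
driven_off = IncidentReview("MI-11", direct_cause="Router rebooted itself")

print(needs_problem_record(killed), needs_problem_record(driven_off))
```

In this sketch the first review stays in the incident record alone, while the second is handed to the problem team.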

Audits, Root Causes and Driving Tests

Skep
It's my experience that auditors check for existence of compliance or similar - with some black and white 'shall'. Now how that compliance is demonstrated varies wildly and is wholly dependent upon the individual auditor, and what they use as a reference. Cause analysis is the activity here - and it includes root cause analysis... as well as change, task and control barrier analysis. In fact the latter three are typically performed PRIOR to RCA. Using whatever methods work for you - from fishbone diagrams, through fault trees, to crystal balls, a list of causes is developed, ranked, perhaps favorites tagged (all legitimate problem manager work), and then countermeasures proposed... Not new. Not invented here.

As for major incident - as I may have said earlier (sorry that was 9 hrs into a flight!) - what represents a major incident must be defined - typically based upon impact to a named stakeholder. The auditor will/should start there. All major incidents should have a complete record - indicating what action if any was taken, by whom and why. It's quite 'legal' from an auditor's perspective to take no action as long as someone is tagged with that decision.

IMHO auditors have no personal stake in the result. They are inspecting for compliance - as I started. This requires some level of detailed reference as a comparison point. So far ISO20K and ITIL V2 and ITIL V3 lack that detail.... deferring it to the auditor. In any audit - PLEASE - ask the auditor what they will use as their detailed reference. It's quite healthy to ready the organization based upon that - it's not cheating - rather like knowing the 'highway code' before taking a driving test....

What do the auditors know?

In my experience auditors vary greatly in their capability. It worries me that many ITIL assessments are carried out by people who in practice have limited experience of the full range of ITIL processes, capabilities, functions or whatever we are calling things today. It is certainly clear to me that in-depth knowledge of those disciplines that ITIL borrows from is often lacking. So expecting to find an ITIL auditor with the ability to make proper judgments about problem management is being very optimistic. So they lack the internal reference point anyway, and as Ian says, that means they are relying on what the books say. The catch with this is that if you get audited by someone who knows their stuff you are likely to get a lower score than if audited by someone reliant on book knowledge (see http://en.wikipedia.org/wiki/Dunning–Kruger_effect yet again).

Of course you might argue that the auditors could adopt a black box approach, ignoring the mechanisms of problem management and just evaluating whether there was evidence of effective problem management in place - for instance the number of successful changes that have arisen out of problem management activity. The problem with this is that problem management capability and the processes around it have to be relatively mature before it becomes effective, so often there would be nothing visible to audit.

As for ISO 20000 audits - I'm obviously a massive fan of ISO 20000, I wouldn't dedicate so much pro bono activity to it if I wasn't - but the audit is primarily about the paperwork, not the effectiveness.

James Finister
Wolston Limited
www.wolston.net
www.coreITSM.com
http://coreitsm.blogspot.com/

the battle of the blogs

Well this is fun: the battle of the blogs. Being compared to Monty Python is a good thing right? Astute commentators who pointed out absurdities in orthodoxy.

Readers who are members of LinkedIn group "ITIL v2 / v3 Service Management (ITSM) and ISO 20000" may like to follow the whole thread for yourselves and draw your own conclusions. Then maybe we'll have a poll eh?

Suspect anybody who calls himself an ITIL evangelist or guru

Juan calls himself an evangelist and he seems to have a clear model in his head but it is not based on ITIL or ISO 20000. The people who call themselves gurus or evangelists seem to be unable to discuss anything that would weaken their belief in their own infallibility. I would say poor Juan Jimenez ran into a wall of knowledge and could not take it. Diarmid Gibson had a good point and Juan terminated discussion after that. This is the simple fact that Diarmid stated:

"There is nothing in ISO20000 that stipulates the need for a problem record related to each Major Incident. In fact, the standard does not define Major Incident beyond that it is something that requires extraordinary management arrangements."

So ISO 20000 does not support Juan, and neither does common sense. I don't want to repeat the several good arguments of the long discussion but I did not notice this point mentioned: In many cases organizations take risks. The cost of a major incident can be less than the expected value of the risk (probability X impact). Then the risk is realized and the incident happens. The root cause is known immediately but it is still a major incident because it has a lot of impact BUT it was a calculated risk. No need for problem analysis.
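The "probability X impact" arithmetic can be shown with a small worked example. The figures are invented, and the comparison against a mitigation cost is an assumption about what "calculated risk" means here, since the comment does not say what the expected value is weighed against.

```python
def expected_loss(probability: float, impact: float) -> float:
    """Expected value of the risk: probability of the event times its impact."""
    return probability * impact


def is_calculated_risk(probability: float, impact: float, mitigation_cost: float) -> bool:
    """Assumed reading: accepting the risk is rational if preventing it costs more than the expected loss."""
    return mitigation_cost > expected_loss(probability, impact)


# Invented figures: a 25% yearly chance of an outage costing 80,000,
# versus 50,000 a year to eliminate the cause.
yearly_probability = 0.25
outage_impact = 80_000
mitigation_cost = 50_000

loss = expected_loss(yearly_probability, outage_impact)
print(loss, is_calculated_risk(yearly_probability, outage_impact, mitigation_cost))
```

With these numbers the expected loss is 20,000 a year, so the organization accepts the risk; when the outage eventually happens it is still a major incident, but its cause was known in advance.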

Aale

specialist priesthood

Let's not bring the individuals into it. There are two issues I see here:

- consultants, especially auditors, who see fit to impose interpretations (however sensible) above and beyond what is "standard". Advice is good, telling someone they are wrong when there is zero authority behind it is not. The absence of a problem record is not "wrong" anywhere but in the consultant's mind. ITIL does not require a specialist priesthood to interpret or infer it. Consultants add value when they advise, not when they lay down laws.

- the standards and frameworks SHOULD define some of this stuff in more detail. The ambiguities and oversights in ITIL are excused on the grounds of "adopt and adapt" - that every site is different. And yet the experts seem able to get utterly dogmatic about points that are not in ITIL or ISO20000. If they are that clearcut they should be in there. As it happens I think this particular point is NOT that clearcut, but there are plenty that are.

Priesthood and biblish behavior causes wars

Religions have, besides comforting many, many people, caused the deaths of very many people. I do not think we should approach ITIL like a religion.

I always compare ITIL with a good cookbook, built on lots of experience, but... if you want to bake your bread otherwise, no one will deny you the right to do so. It may just end up having a different taste. ITIL is no norm, ISO is a norm, and so if an auditor checks against the norm, which states "you shall have so and so", and he discovers you do not have it, then you end up with a non-conformity. It's as binary as that.

The ISO 20000 norm (ISO/IEC 20000-1) states in the paragraph about Incident management: "Major incidents shall be classified and managed according to a process". In the paragraph about Service reporting it is mentioned that: "Service reporting shall include: ........d) performance reporting following major events, e.g. major incidents and changes;..." There is no other occurrence of the term "Major Incident" in the norm.
This part one of the ISO norm is the actual norm and includes the mention: "shall". The auditor shall check compliance to these mentions.

Part 2 of ISO 20000 (ISO/IEC 20000-2) is the code of practice. This part is the best practice guidance along with the norm and uses the term "should".
Here it states (as part of the incident process description):
8.2.2 Major incidents
There should be a clear definition of what constitutes a major incident and who is empowered to invoke changes to the normal operation of the incident/problem process.
All major incidents should have a clearly defined responsible manager at all times.
Nomination as manager of a major incident should give the individual authority levels that are adequate to the role of coordinating and controlling all aspects for the resolution. This should include the responsibility for effective escalation and communication across all areas involved in resolution, and to the customers affected by the major incident.
NOTE This level of authority can be temporary, and apply only during that major incident.
The process for a major incident should include a review which will inform a plan for improving the service.

Hereafter the term "Major Incident" is not mentioned again.

So......it is not part of the norm and thus the auditor can and may never raise this as a non-conformity. End of story.

By the way: it strikes me that many good things come from the British empire, to name a few: Monty Python, Tommy Cooper, Aston Martin, The Beatles, Prince2, real marmalade, ISO 20000, Scotch whisky and ITIL.

best of British

...the BBC, Winston Churchill, the Goon Show, rugby, soccer and cricket, Rolling Stones, miniskirts, my ancestors, the Westminster system, nuclear fission, Shakespeare, fox terriers, Top Gear, Led Zeppelin, Bertrand Russell, Oxford English Dictionary, the Lord of the Rings, the industrial revolution...

And Let's Not Forget ...

... her majesty the Queen - Gawd Bless 'Er!

It was her personal publishing company that brought us ITIL after all!

I rest my case

I've posted a comment on Juan's blog. He answered. Reading is an art. See for yourself.

Both camps had it wrong?

It would seem that in this extended and heated LinkedIn argument maybe everyone had it wrong. Crudely there were two camps: (1) when a Major Incident is resolved, the root cause and its resolution should be documented in a problem record; (2) it is OK to document it all in the incident record (by implication because that is what we use all along). ITIL describes how Major Incident response should involve both Incident and Problem teams from the start. So there should be two records from the start too.

See We should create the problem record right up front in an incident

Incident/Problem records

Having worked in a data center for more than 10 years, I find it interesting that folks don't have separate findings for the incident and the problem. Incident focus is to restore service and document how that was done. Problem looks to why the service break occurred, associated solutions and their risks. Our best practice is to have a problem record/investigation following every major incident, documenting the potential causes, identifying the risks if no action is taken (or the cost outweighs the risk) and making decisions to move forward. As the data center is working at a 99.997% availability, that practice works well and should be encouraged. We also have problem tickets when there isn't a major incident in an effort to avoid one occurring (proactive/reactive management). Documentation sometimes hurts but it is a necessary evil and can help moving forward and down the road.

Our processes were worked out well before we were involved in ITIL and ISO 20000 (data center has been 20K certified since 2006). It made sense to us to investigate and document for ourselves and our clients.

Mature environment

Dear Visitor

It looks like you are working in a mature and well organized environment. But I would not be surprised if you did not have many end users calling you directly; usually there are desktop support and application support in between them and mainframes. What works well for you may not work for other types of support organizations. My argument is that it is difficult to set hard and fast rules in this area that fit all cases.

Aale
