How ITIL gets Incident vs Problem wrong

In ITIL, we don't separate Incidents from Problems properly. This causes a muddy and confused definition of both. Join me as I try one more time to make this clear.

In my post on Standard+Case I introduced this idea:

If you seek to break a rock, many types of stone have a crystal structure which creates fracture planes: hit the rock in the right place and it drops into two; hit it in the wrong direction and it just chips.
When you attempt to categorise stuff into two or more types, if division comes easily and is clear to everyone then you have found a categorisation which reveals something about the underlying nature of the information. If it is hard to do and the results are messy and debatable, then you are trying to force an unnatural taxonomy onto the data.

(That's a really important post BTW. If you read no other post on this blog, read that one.)

I have been thinking about that concept in the context of my Sisyphean quest to get the ITIL definition of Incident fixed. This may take a fair bit of explanation. If you are interested enough, stick with me. The rest of you go read about Kanban or BYOD or JobsToBeDone or something else fizzy and exciting.


According to ITIL, an Incident is an unplanned interruption to a service, or the failure of a component of a service that hasn't yet impacted service.
And yet according to ITIL the purpose of Incident Management is to restore normal service as quickly as possible and minimise business impact.
So if a component has failed and the current impact on service is zero, why does incident management give a flying fox about fixing it?

I think ITIL has the definition (but not description) of Incident Management exactly right and the definition of Incidents awfully wrong. That "component" crap shouldn't be in there.

But wait, there is more.
According to ITIL, a Problem is a cause of one or more Incidents.
And according to ITIL, Problem Management proactively prevents Incidents from happening and minimises the impact of incidents (...by removing causes of incidents, though the definition doesn't explicitly say that).

If a Problem isn't a problem until it causes at least one incident, what is proactive about that?

Clearly, in order for proactive Problem Management to even exist, the definition of a Problem should be the cause or potential cause of zero or more Incidents.

A failed service component is a Problem that will potentially cause Incidents. How it ever got dumped in the Incident definition is beyond me. My suspicion is that in many organisations Incident Management behaves with urgency and Problem Management doesn't, and that was why "failed component" got called an incident. It's a dumb reason but I can't think of a more plausible one.

What ITIL also gets wrong is the description of the Incident Management process: Incident resolution (SO 4.2.5.8) can include the fixing of a fault, or the fault becomes a Problem to "prevent...recurrence" (SO 4.2.6.4). WTF? We create a problem sometimes but not others? Sometimes we do root cause analysis as part of Incident Management and sometimes as part of Problem Management? Sometimes the fault is the responsibility of the Incident Manager and team, other times the Problem Manager? When we do statistical reporting on causes, we need to somehow extract them from (multiple) incident records as well as problem records? A Problem Manager looking at their portfolio of problems won't see many of the current issues? This rock is chipping, not splitting.

So the formal ITIL definition of Incident Management is fine: restore the service, whatever it takes, e.g. workarounds. The description of Incident Management fails to honour this. It seems to me the implicit ITIL definition of Incident Management is:

    restore the service and find the underlying fault and fix it and in vaguely defined circumstances create a Problem for the cause instead of fixing it.

And the implicit ITIL definition of Problem Management is:

    remove the causes of incidents except in vaguely defined circumstances where Incident Management is going to remove them

Coming back to my initial concept of cleaving a rock, let's revisit that mess with some nice crisp definitions of our own:

  • An Incident is a user reporting an unplanned interruption to a service.
  • An interruption to a service is a reduction in quality below the agreed service levels.
  • The purpose of Incident Management is to restore service to the user.
  • A Problem is a cause or potential cause of Incidents.
  • The cause of an interruption is a Problem. Every time.
  • The purpose of Problem Management is to remove Problems.
  • When Incident Management identifies a cause or suspected cause of an Incident, it immediately creates a Problem record, after which Incident Management continues to focus on restoring service to the users, including helping to prioritise the related Problem.
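These definitions can be sketched as a small data model. This is only an illustration under my own assumptions: the class names, field names and the in-memory `Records` store are hypothetical and do not come from ITIL or from any tool. The point it tries to show is that a Problem can exist with zero Incidents, and that restoring a user never closes the cause.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Problem:
    """A cause or potential cause of zero or more Incidents."""
    problem_id: int
    description: str
    incident_ids: List[int] = field(default_factory=list)
    removed: bool = False  # set by Problem Management when the cause is gone


@dataclass
class Incident:
    """A user reporting an unplanned interruption to a service."""
    incident_id: int
    user: str
    service: str
    problem_id: Optional[int] = None  # the cause, once identified or suspected
    restored: bool = False  # set by Incident Management, e.g. via a workaround


class Records:
    """Hypothetical in-memory store for both record types."""

    def __init__(self) -> None:
        self.incidents: Dict[int, Incident] = {}
        self.problems: Dict[int, Problem] = {}

    def report_incident(self, user: str, service: str) -> Incident:
        incident = Incident(len(self.incidents) + 1, user, service)
        self.incidents[incident.incident_id] = incident
        return incident

    def raise_problem(self, description: str) -> Problem:
        # A Problem may exist before it has caused any Incident.
        problem = Problem(len(self.problems) + 1, description)
        self.problems[problem.problem_id] = problem
        return problem

    def link(self, incident: Incident, problem: Problem) -> None:
        # Incidents are related to each other only through a shared Problem.
        incident.problem_id = problem.problem_id
        problem.incident_ids.append(incident.incident_id)

    def restore_service(self, incident: Incident) -> None:
        # Restoring the user does not remove the cause.
        incident.restored = True
```

In this sketch a failed disk in a redundant array would be recorded with `raise_problem` alone, with no Incident attached until a user is actually affected.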

This splits the rock nicely.

  • A user-focused team works on restoring the user to service.
  • If there are workarounds, the Incident team are happy to get the user working again. If not, the Incident team will drive the priority of removing the problem.
  • A technical-focused team works on helping the Incident team with diagnosis of an interruption, and doing all the work to determine and remove the cause of the interruption.
  • All problems are recorded as Problems so our problem portfolio and problem stats are finally useful.
  • If incidents are related, there must be a suspected cause to relate them, so incidents can be related to a mutual problem, not related to the cumbersome concept of a master or parent incident.
  • The conflicting priorities of restoring service and identifying cause are not mixed within one process, team and accountability.
  • Incident management can be measured on restoration of service.
  • Problem management can be measured on elimination of cause.
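To make the last few points concrete, here is a rough sketch of how the two sets of measures could be computed separately once every cause lives in a Problem record. The sample records, identifiers and function names are all made up for illustration; they are assumptions, not data from any real service desk.

```python
from collections import Counter

# Hypothetical incident records: (incident_id, problem_id, minutes_to_restore)
incidents = [
    (1, "P1", 12),
    (2, "P1", 8),
    (3, "P2", 45),
    (4, "P1", 10),
]

# Hypothetical problem records: problem_id -> has the cause been removed?
# P3 is a potential cause that has not produced any incident yet.
problems = {"P1": False, "P2": True, "P3": False}


def mean_minutes_to_restore(incidents):
    """Incident Management measure: how quickly users get service back."""
    return sum(minutes for _, _, minutes in incidents) / len(incidents)


def incidents_per_problem(incidents, problems):
    """Cause statistics come from Problem records alone, zero counts included."""
    counts = Counter({problem_id: 0 for problem_id in problems})
    counts.update(problem_id for _, problem_id, _ in incidents)
    return counts


def fraction_removed(problems):
    """Problem Management measure: share of known causes eliminated."""
    return sum(1 for removed in problems.values() if removed) / len(problems)
```

Related incidents fall out of the grouping by `problem_id`, so no master or parent incident is needed to tie them together.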

An analogy for my view of problem vs incident: the problem team are the Fire Service. The fire service has firemen, inspectors, and technical advisors to government. They don't have counsellors, emergency housing, or doctors. That's looking after the people: that's incident management. See also my "Cherry Valley" article on this subject.

[From comments below:
That's not what ITIL says, but it is what I think it should say.

An incident is that a user perceives they are not getting their agreed service levels.
Incident management is about getting them to again feel they are getting their agreed service levels.
Incident management is an outward facing process - a subset of request fulfilment - dedicated to providing maximum service to users.
Incident management is the responsibility of front-office outward-facing customer-service (actually user-service) teams with their own tools around service desk, CRM, SLM etc

Period. Nothing else. One process, one purpose, one accountability, one set of goals and metrics: incident management. Don't confuse an incident with an interruption. Different entity.

There are inward-facing back-office technical teams whose role is to look after the components of the service. As part of that their job is to remove causes of service failure and therefore to restore the underlying service. Different people, different purpose, different goals and metrics, different tools. Ergo different process. Problem Management.

Split the rock.
]

I don't get how people don't get that. It is so clearly better. It is the right plane of cleavage between Incidents and Problems.

I'm proud of the article I did on incident management for ITSM Review as one of my best, and the follow-on article on Problem management. These use railroading as an illustration of the concepts.

    Interesting note: a problem is a cause of incidents. An incident is a user reporting an interruption. So poor user training is a problem because it causes incidents which are false reports of interruptions. Just a thought.

Then there is the right plane of cleavage between Incidents and Requests, which I have discussed elsewhere: there isn't one. We shouldn't try to separate them because an Incident is a category of Request. But that's a whole other debate...

Comments

Incident vs. Problem vs. Risk

Hi,

I know this has been beaten up, over and over again, and I don't know if this helps but some of the enterprises I've worked for or helped view Incidents, Problems, and Risks in the following manner...

  1. Risk(s) = Potential Incident(s) (or negative outcome)
  2. Incident(s) = Disruption (or negative outcome) caused by one or more Problem(s)
  3. Problem(s) = One or more things that require correction in order to prevent or avoid Risks from occurring, at all, or Incidents from happening, again.

I hope this helps.

My Best,

FG
--
Frank Guerino
The International Foundation for Information Technology (IF4IT)

Real life and book

Rob, your split is closer to real life than ITIL. IMHO there are two confusions in the ITIL view of Problem Management.

In real soft words: An ITSM practice needs "someone" who spends time trying to avoid future Incidents by a close look after the Incidents are fixed.

The approaches in real life are very different and always far away from the ITIL view. For example, in desktop computing a lot of organizations do not care about 10 or 50 Incidents which may have the same root cause. They re-install the machine, full stop. (BTW: I do not like this)

So here ITIL could be a better practice by additionally listing some typical (mature) industry approaches (BTW: not just in Problem Management).

Second confusion: "pro-active". I assume this comes from mechanical parts, where you can measure slackness and say "this shaft will break down in 48 h". IT systems are deterministic machines. They break down suddenly. Ex post, IT experts can tell you why it broke down after 47 h.

Best Regards
Werner

PANIC (Problems And NOT Incidents, Clearly)

Rob, I make you right that the definition of both incidents and problems needs attention. For problems, I have for a long time been using - word for word - the definition you've written in your post.

I think the incident one you've proposed is more problematic. Whilst Incident Management is by definition reactive, that doesn't - nor should it - mean that we wait for a user to spot the service interruption and report it before we act on it. Surely we hope that Event Management will spot the event that causes an Incident and give us warning way sooner than it will take users to. (Would we want our ISP to wait for us to call, or notice the incident and fix it whilst we sleep?).

I do agree that ITIL has handled the 'failed component' clumsily; I think it's because there's no actual problem investigation involved - the cause of the potential incident is known, it's therefore already a known error and very probably the remedial action is known too and doesn't need a business case approval before taking it.

Even if you talked about it in risk terms I don't think it gets less muddy. Incidents and problems might be analogous to issues and threat risks. The thing about a risk is that it's yet to happen; you could rightly argue that service hasn't been interrupted if it's a disk in a redundant array set, but there most definitely has been a fail event.

And incidentally, I imagine that many organisations will have their Event Management systems raising incidents for any exception event that occurs. This is probably at the root of ITIL's wrestling with the definition. How do you automate the separation of exception events into 'incident' and 'problem' without overcooking the set-up and management costs?

Personally I think the answer to your question lies in problem priorities. It's something I don't think we handle anywhere nearly as well as we do incident priorities. Problem priority is about [potential] impact and urgency [~ likelihood]. If a component fails and the service risk level significantly increases then that's because the impact and likelihood combination has just changed. Action should be taken accordingly.

There's really not a problem with having component failure events being classified as high priority problems with remedial action comparable to a high priority incident; and if it suits your organisation, is there really a problem with such events being classified as incidents (particularly if they can be separated in MI)?

So my definitions would simply be:

  • Incident: An unplanned interruption to an IT service or reduction in the quality of an IT service.
  • Problem: The cause or potential cause of an incident.

I have a fundamental problem with your statement "The conflicting priorities of restoring service and identifying cause are not mixed within one process, team and accountability." It's nice in theory, and I definitely agree with two-thirds of it - but not all organisations can support incidents and problems being separated to different teams for investigation and resolution. The processes may have different goals, but diagnostic and resolution activity in practice has much in common between both - and of course it will be the same mix of technical domains involved. Ultimately I think getting the priorities right will handle most of the conflict, with 'getting the process right' lapping up the rest.

    Rich

    interruption != downtime

    Makes sense, in general. But:

    "An Incident is a user reporting an unplanned interruption to a service."

    Sometimes the user realizes there is an interruption, and sometimes they don't. I don't think incidents can only be reported by users; they can also be found by monitoring - e.g. when something has happened and the service is slower than usual (therefore directly impacting user experience) we shouldn't rely on just the users to report this. And the more automation there is in place, the more Event, Incident and Change Management become one process (Find&Fix). This is divided into Manual (needs someone to investigate/repair) and Automated (automated fixes). The latter, in turn, can be divided in two: Reactive (service degradation beyond tolerance) and Proactive (service degradation within tolerance [no user impact or tolerable user impact]).

    When redesigning/redefining ITIL processes, we should not build the initiative on the model where stuff breaking is considered an exception.

    falling into the same trap

    Nope. You are falling into the same trap ITIL did when it bunged failed components in there.

    Incident management should be about restoring service to users. The process should be built around that, and the people should be skilled at that. Fast restore, finding workarounds, being nice to unhappy people.

    If you want to fix a broken service, that's a problem. Use the right people and tools to fix it.

    That was the whole point of my post.

    "Incident management should

    "Incident management should be about restoring service to users" - yes. Restoring involves restoring a failed service and a degraded service. In the first case users understand something is wrong (a 503 page, etc.), in the second case they might experience degradation but won't necessarily report it. Impact is there nevertheless. Many incidents should be taken care of automatically (restoring the service in 30 sec rather than 30 min). The rest do need nice people talking to unhappy users, but that is only one part of the picture, not the whole picture.

    the whole point of my post

    The point of my post is that you are taking too broad a view of "incident". I say it should be focused on users. One process one purpose. Sure it is part of a wider picture of restoring a broken service. We have problem management for that. Problem mgmt's purpose is to remove the causes of failure. The whole point of my post is to say don't double up on that.

    only user contact?

    Are you saying that Incident Management applies *only* in cases where the users themselves contact the Service Desk (by whichever means) about the issues they have encountered with the service, and everything that does not originate from the users is a part of something else, e.g. Problem Management?

    Yes

    Yes.

    That's not what ITIL says, but it is what I think it should say.

    An incident is that a user perceives they are not getting their agreed service levels.
    Incident management is about getting them to again feel they are getting their agreed service levels.
    Incident management is an outward facing process - a subset of request fulfilment - dedicated to providing maximum service to users.
    Incident management is the responsibility of front-office outward-facing customer-service (actually user-service) teams with their own tools around service desk, CRM, SLM etc

    Period. Nothing else. One process, one purpose, one accountability, one set of goals and metrics: incident management. Don't confuse an incident with an interruption. Different entity.

    There are inward-facing back-office technical teams whose role is to look after the components of the service. As part of that their job is to remove causes of service failure and therefore to restore the underlying service. Different people, different purpose, different goals and metrics, different tools. Ergo different process. Problem Management.

    Split the rock.

    a view too narrow

    "An incident is that a user perceives they are not getting their agreed service levels." Do these users always contact Service Desk to let the service provider know? Nope.

    Monitoring can let the service provider know that something has happened and the user is impacted (slower service or, as I mentioned, occasional 503 pages). This needs to be dealt with.

    Synthetic clients can test services 24/7 and can spot issues impacting n% of users. This needs to be dealt with.

    Now who is responsible for taking care of these types of issues - impacting the user, but not reported by the user? What kind of communication with users is required to let them know of the issues and the resolution?

    It seems like you are trying to redefine ITSM concepts while for some reason staying within the limits of ITIL (as Aale also commented). The ITSM world is much wider than that. I know that you know that :) Sometimes the reason why you cannot finish a puzzle is that you're missing too many pieces, not because it is too difficult.

    going in circles

    This is going in circles.

    Detecting a broken service is an Event. The Event Mgmt process raises a Problem record for that service (or Aale can call it a Fault if he feels better about that) and the Problem Mgmt techs get on it.

    The service desk would be notified of the Interruption and would handle comms to users. That's why we have a service desk function in ITIL distinct from the processes. Comms isn't a process, it is an activity of the SD function.

    In some cases Incident Management might contact users proactively to push them workarounds.

    Since I've rejected the ITIL definitions of just about everything and introduced the Interruption entity I can hardly be accused of trying to "stay within ITIL".

    So here, I think, I disagree.

    So here, I think, I disagree. Event Management detects a broken service in two cases - when there is no user impact yet and when there is already user impact. When there is no user impact yet, then maybe indeed it would be better to have an Event->Problem workflow. If there already is user impact, then the workflow would be Event->Incident->Problem. Some of the Incident Management would be done by Service Desk (comms, searching the KEDB, etc.) and some of it would be done automatically (automated repairs). Not all incidents that impact users need to have comms attached, usually because the impact lasts for a very short time, while the resolution can be either manual or automated.

    In many cases service desk doesn't even need to be notified, because the incident, detected by event management, was resolved using automated repair. Is that automated repair problem management? I don't think so, because the only difference compared to your idea of incident management is the fact that the solution was automated, not manual.

    As I said

    Don't you see how hopeless it is to use ITIL terminology but give it a different meaning?

    Don't use the word incident

    One of the main sources of confusion here is the word incident, which has several meanings. I don't think one can have an intelligent discussion of ITSM with ITIL terms ;)

    Problem with the problem

    ...is that in many cases it is not a problem;)

    Let me explain. The cause of the incident may be crystal clear. A component has failed. The SD can provide a workaround and fix the customer but it cannot fix the failed component. There is nothing problematic about the failed component, it is a calculated risk. There is a team which routinely fixes failed components. This is different from the situation where you have unexpected interruptions but do not know what is causing them.

    I would kick problem management out and talk about Risk management. That would include problem, Availability and CSI. Unknown causes represent a potential major risk.

    This is not so complicated but you are absolutely right that ITIL has mixed up the concepts. And each ITIL version/edition has different incident/problem concepts.

    Aale

    More mixed up still

    Risk management is just one of the processes/capabilities that I think ITIL doesn't map on to problem management in a way that is useful. I stress the term useful because this isn't an academic debate - failing to understand and implement effective problem management costs companies money and reputation, and creates negative customer satisfaction.

    At the simplest level many organisations don't distinguish between 3rd level support, the workflow of managing problems and the activity of actually identifying and removing the root causes. The result is often a mechanistic approach that never achieves the desired outcomes, and often this approach is reinforced by carelessly framed KPIs and targets.

    It is perhaps telling that in very few organisational designs that I've done in recent years have I made provision for a dedicated problem manager. Instead I prefer to split the process between CSI, the service desk, support teams and service managers.

    I absolutely agree

    I absolutely agree with you. I have struggled with incident/problem since doing my ITIL foundation a year ago. Funny how I had my own clear ideas, which match yours, based on years of helpdesk experience before it all got muddied by awareness of ITIL.

    Bad Interpretation of ITIL = Bad Support

    I think this is the whole point. People are getting so hung up on the ITIL gravy train that they have forgotten how to actually provide support. I think the simplest way to assimilate ITIL to the real world is to think of IM in the same vein as 1st line (which could or could not include helpdesk, desktop, server, network, etc.), and PM resembles 2nd line and beyond (could or could not include desktop, server, networks, etc.)
    Organisations are trying too hard to assimilate their support functions to ITIL and making a real hash of it, resulting in a ballooning of process and ultimately ineffective support to the user base.
    Just remember what it's all about - KISS

    as simple as possible

    I'm with Aale on this

    "Make things as simple as possible... and no simpler"

    Exactly. KISS(Keep It Simple

    Exactly. KISS (Keep It Simple, Stupid)

    No

    Yes, ITIL is a mess but not that way.

    The original idea of having several lines in support is good but they are working in the same process. Also the idea of having someone looking for hidden causes and trying to prevent future failures is good.

    If you call 2nd line PM, you lose both of these benefits.

    Aale

    Me three

    This is how I describe it to students. I say 'this is what ITIL says, but, of course, it has to be nought or more incidents, not one or more, to make sense'.

    Have you put all this on the ITIL issue log?

    Proactive

    Good one Rob. It takes me back 11 years to when I did my ITIL v2 Foundation training conducted by a Dutch gentleman. We spent nearly 30 min on WTF is proactive in ITIL-speak. The conclusion was that nothing was proactive, but a reaction to something else.

    The nearest I could get...

    ....to proactive problem management were two examples, both weak. The first was the use of predictive techniques and forensic investigation during the course of routine maintenance and the second was keeping an eye on user groups and press reports for things that had caught other people out.

    My other thought in my teaching days was that to be proactive there had to be an element of PM influencing a wider community than operations, for example feeding lessons learned back into project teams.

    Great minds

    Hmm, have you seen the description of my next Brighttalk session? I don't agree with where you've split the rock, but I do agree ITIL just chips at it.
