Event Management best practices

Submitted by skeptic on Wed, 2011-06-01 03:03

While we were all disappearing up our own philosophical orifices debating what is a customer and what is a service, reality came calling. In my mailbox I got this plea from the coalface for some Event Management best practices:

It was just earlier this year I learned to spell ITIL but I still don't
know how to pronounce it. In the past week or so I discovered your site --
good stuff! A batch analyst at a large-ish company, discovering ITIL was
profound: finally there was a map of how to handle service operations. We
are doing so much right, and so much wrong. Now I am at the phase of ITIL
understanding where I am starting to see its flaws in real-world
application.

My team monitors some 2,400 batch jobs. On any given day some 1,600 are
scheduled. Each day sees some 25,000 job invocations (some jobs rerun many
times throughout the day). My team is not involved in development or
scheduling. Each job in some way is tied to an SLA or a customer or an
internal process or, sometimes, a pilot with no impact. Usually we don't
know what the job does, why it is being run, which customer/SLA it ties to
or who at the company knows about it. As a team we have acquired a vast
amount of "tribal knowledge" to understand a lot of it.

Almost every job, and most job "streams" have alarms around them, e.g.
"SEV2 job x has not started by 08:00" or "SEV1 stream Y may not complete by
15:00" or "SEV3 job Z has failed" or "SEV2 outbound file XYZ was expected
by 17:30." Almost all alarms result in an automated e-mail to the Help Desk
which results in an incident record (via Unicenter Service Desk) to my
group with the appropriate severity. My team intervenes and "resolves" the
"incident" if possible, and notifies appropriate groups. We manage approx
350 incidents per week.

We have recently had the epiphany that most of what we are dealing with are
Alerts, not Incidents. For the life of me, I cannot find a good model for
managing Alerts, apart from what we see in Service Operations where Event
Management is discussed.

Does anyone in the real world do Event Management? Last year I was
convinced my team was far behind the curve on managing operations, but I am
starting to convince myself that we are nearer to the cutting edge in Event
Management.

Do you have any suggestions on where to look for reference Event
Management processes?

How about it folks? We're all full of opinions. Rubber meets the road now. Can anyone point us to useful best practice for this guy to extend the few pages in ITIL Service Operation?

Comments

Submitted by piotrchec on Thu, 2011-06-16 23:16.

Event Management Good Practices

The more I thought about it, the bigger the comment became. So finally, I wrote a post on my blog, which talks about how Event Management and Incident Management can live together in peace. You can read it here.

Submitted by aroos on Fri, 2011-06-17 05:37.

Sorry but no

You write: "Before ITIL v3 they were not necessarily called events". The truth is that event management existed long before ITIL, as it had become a science in the 1990s. I have Risto Vaarandi's thesis (Tallinn University of Technology, 2005): Tools and Techniques for Event Log Analysis. In the introduction he explains event correlation and refers to a book from 1995.

Aale

Submitted by Visitor (not verified) on Sat, 2011-06-25 12:39.

Even earlier in my experience

In Digital Equipment Corporation (DEC) in the mid-80s it was common practice to write command language scripts (in DCL) to handle all types of events that were routinely generated by DEC's hardware and software. A simple technique was to write a DCL script to handle specific events and "fix" the issue automatically - "self healing" we call it today. We regularly handled a network link going down by having the event fire off a script that simply turned the network components off and back on again. This required just 4 lines of Network Control Program (NCP) commands in a command file.

The big issues with Event Management are how to filter events and how to decide what rules are needed for handling the filtered events. The ITIL Service Operation book only explains that such things are required - standard ITIL approach of course!

Submitted by skeptic on Sat, 2011-06-25 23:30.

Self-healing explained without spam

Dear British visitor,
There are plenty of explanations of self-healing systems that don't require HP marketing materials, thank you anyway. I've removed your comment as a violation of the terms of this site.

Submitted by skeptic on Sat, 2011-06-25 23:24.

not so fast with that scripting

100% agree about "The big issues with Event Management are how to filter events and how to decide what rules are needed for handling the filtered events."

Not so agreeing about "A simple technique was to write a DCL script". The planet is awash with scripts some tech knocked up: undocumented, undesigned, unsupported, unrecoverable... These are nothing but job insurance for the tech writing them. I'd fire staff who wrote any script that wasn't subject to the same disciplines as any other piece of production software, and there's nothing simple about production software: requirements, design, compliance, testing, release...

I wrote recently about how, to be a candidate for automation, something needs to be high-volume, repeatable and stable. System alerts actually qualify on all three counts, so long as there are enough of them to provide an ROI over just having someone watch the console when you cost them as properly constructed software.

And don't give me that stuff about scripted responses reducing risk of human error. I'd say the risk of a badly scripted response was at least as serious.

There will be an 80/20 rule. Maybe 20% of them will be worth automating and that will deal with 80% of the traffic, but that still leaves you needing to monitor the console for the balance. Someone will always need to monitor the console - it's one of those immutable laws. Project it on the wall of the ops team area. Give the job to juniors - it's a great way to learn the environment.
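For anyone wanting to test the 80/20 claim against their own console traffic, here is a minimal Python sketch. It assumes you can export one alert type per alert; the function name `automation_candidates` and the sample alert types are hypothetical, and it says nothing about whether a given alert is stable or repeatable enough to automate.

```python
from collections import Counter

def automation_candidates(alert_types, coverage=0.8):
    """Return the most frequent alert types that together cover `coverage` of traffic."""
    counts = Counter(alert_types)
    total = sum(counts.values())
    chosen, covered = [], 0
    for alert_type, n in counts.most_common():
        if covered >= coverage * total:
            break
        chosen.append(alert_type)
        covered += n
    share = covered / total if total else 0.0
    return chosen, share

# Hypothetical week of console traffic: 100 alerts across four types.
sample = (["job_late"] * 60 + ["file_missing"] * 25
          + ["job_failed"] * 10 + ["disk_warn"] * 5)
types, share = automation_candidates(sample)
print(types, share)
```

Whatever falls outside the returned list is the balance that someone still has to watch on the console.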

Submitted by John Worthingto... on Fri, 2011-06-10 00:57.

Event Management - back to the future

Wish I could have weighed in on this sooner. While I'm not an old mainframe dinosaur, I have long wanted ITIL to have much more to say about this (process or not). In fact, I remember when most tools focused on managing events rather than workflows, but I guess that's beside the point....

Most of us can DETECT events, in fact we're often drowning in them. Fewer of us can effectively MAKE SENSE of events; the 'process' relies heavily on automation (at least for infrastructure-based events, which is what ITIL tends to focus on). When it comes to event management, correlation is still king (wish I could link to that old blog post by the same name).

Once we've made sense of things, TAKING THE APPROPRIATE CONTROL ACTION is a matter of implementation and policy. Some events that are categorized as ALERTS will require the opening of an Incident. For others, we may decide to assign them to IT staff for operational tuning and proactive action. Some may even warrant the creation of automated actions, once approved via Change management.

With the emergence of virtual service infrastructures, I think it's really Event Management's moment of truth (which I white papered about some time ago)...

Service Monitoring Intelligence is a critical element of ITSM success, more today than ever. This takes us to the heart (or soul) of operational monitoring & control, and Event Management.

Back to the future.

John M. Worthington
Third Sky, Inc.

Submitted by Paul Robinson (not verified) on Thu, 2011-06-02 05:46.

Event management is very product (tool) centric. Are you after recommendations of tools that will perform event correlation, automatically logging incidents/problems/changes and somehow notifying operations staff of other events that are classified as alerts, but may not necessarily constitute an incident? I'd love to hear some "real world" examples of how this has been implemented too.

What I've seen in the real world so far is essentially thresholds set up so that only specific events (exceptions in this case) will automatically raise an incident, with all other events (such as information/alerts) staying where they were initially recorded (in log files) rather than being input into the ITSM tool (CA Unicenter Service Desk in your case). You're then reliant on your operations guys actively monitoring these log files/dashboards/whatevers and hopefully picking up on when an alert is logged that hasn't been automatically classified as an exception.
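To make that threshold arrangement concrete, here is a minimal Python sketch. The `Event` class, the `route` function and the threshold value are all hypothetical; the only assumption taken from the original question is that SEV1 is more severe than SEV3.

```python
from dataclasses import dataclass

# Assumed severity scale: lower number = more severe (SEV1 is worst).
INCIDENT_THRESHOLD = 2  # hypothetical: SEV1 and SEV2 raise incidents automatically

@dataclass
class Event:
    source: str
    severity: int
    message: str

def route(event, threshold=INCIDENT_THRESHOLD):
    """Return 'incident' for events at or above the severity threshold, else 'log'."""
    return "incident" if event.severity <= threshold else "log"

events = [
    Event("job x", 2, "has not started by 08:00"),
    Event("job Z", 3, "has failed"),
]
routed = {e.source: route(e) for e in events}
print(routed)
```

Everything routed to "log" is exactly the pile that depends on someone actively watching it.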

What tends to happen in these cases is that the monitoring of alerts takes a back seat to all of the more important work being undertaken by your operations staff, so things slip through the cracks.

Certainly not the most effective way of implementing the process, but as with all things ITIL it ultimately comes down to how much time and money you're willing to invest in the process/products/people/partnerships and how much return or value you'll receive for your efforts and expense.

Submitted by JamesFinister on Thu, 2011-06-02 10:00.

Back to front

I'm not sure that event management is new, or tool centric. Back in my days of visiting the living dead in data centers, in the dark ages, a lot of time was spent sat at the console handling relatively low-level error messages and interrupts. None of which went anywhere near the help desk - because we didn't have one.

The catch was that the operators were often too detached from the user base to understand the impact of what they were seeing come out of the teleprinter and what it meant to a typical "user" such as me. For instance a relatively minor event at the start of the night shift could mean my crime statistics run, due to run in the middle of the shift, wouldn't get processed for another 24 hours.

In my IT audit days I spent a lot of time plowing through log files to see if things could have been responded to better. My general observation from those days was that most trouble was caused by:

- A routine error occurring under unusual circumstances
- The domino effect with one event triggering many others, obscuring the event that needed to be prioritised
- The error code no one had ever seen before
- Error codes that were so generic the computer could just have typed up "I really have no idea what just happened"
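The domino effect in particular lends itself to a simple first pass. The Python sketch below groups events that arrive close together and surfaces the first one in each burst as the probable trigger. It is only an illustration: the function names and the 60-second window are hypothetical, and time proximity is a crude stand-in for the rule-based or topology-based correlation that real tools attempt.

```python
def group_bursts(events, window=60):
    """Group (timestamp_seconds, message) events into bursts.

    An event joins the current burst if it arrives within `window` seconds
    of the previous event; otherwise it starts a new burst.
    """
    bursts = []
    for ts, msg in sorted(events):
        if bursts and ts - bursts[-1][-1][0] <= window:
            bursts[-1].append((ts, msg))
        else:
            bursts.append([(ts, msg)])
    return bursts

def probable_triggers(events, window=60):
    """Return the first message of each burst and how many events followed it."""
    return [(burst[0][1], len(burst) - 1) for burst in group_bursts(events, window)]

# Hypothetical log extract: one fault followed by two knock-on failures,
# then an unrelated event ten minutes later.
log = [(0, "disk controller error"), (5, "job A failed"),
       (20, "job B failed"), (600, "link down")]
triggers = probable_triggers(log)
print(triggers)
```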

Then ITIL came along. And the great thing about v1 was it attempted to put the help desk at centre stage. Sadly in doing so it seems to have got things back to front. So for years "Incident" was the de facto generic starting point for any definition of a workflow based service management activity, whereas in reality not only is an incident a specific instance of a generic event life cycle, but it is also only a representation of part of a larger life cycle. As an aside I'm sure that this has contributed to those who get confused about incidents "turning into problems".

So how should we handle events?

Well as I've hinted, first of all we need a defined workflow with some criteria for deciding which route the management of events should take. That means establishing those ITIL basics of impact, urgency and priority and having pre-defined checklists for events that are either routine or rare but high potential impact. There is much for IT to learn here from the aviation and nuclear industries, not all of it good. Obviously these procedures etc. should not be developed in isolation from the rest of IT.
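As an illustration of what a pre-defined impact/urgency/priority rule might look like, here is a minimal Python sketch. The three-level scale and the additive rule are one common convention rather than anything prescribed here, and the names `LEVELS` and `priority` are hypothetical.

```python
# Hypothetical three-level scale for both impact and urgency.
LEVELS = ("high", "medium", "low")

def priority(impact, urgency):
    """Derive a priority from 1 (highest) to 5 (lowest) from impact and urgency."""
    if impact not in LEVELS or urgency not in LEVELS:
        raise ValueError("impact and urgency must each be one of %r" % (LEVELS,))
    return LEVELS.index(impact) + LEVELS.index(urgency) + 1

print(priority("high", "high"))    # most urgent combination
print(priority("medium", "low"))
```

The point of keeping it this crude is that an operator can apply it at 3am without a meeting; the hard part, deciding the impact and urgency in the first place, is not something the code can do for you.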

How should events then interact with the service desk and the service management tool? Remember they aren't the same thing. Just because an event raises a record in the ITSM tool doesn't mean it has to be treated as an incident. So why would you want it recorded in the tool? Reasons I can think of include:

- It might become an incident at a later point in time, if so you already have some information in the tool to assist in managing the incident and an audit trail for problem management
- Dealing with the event involves resources who might also be dealing with incidents and requests, providing a single work queue for them

Personally I'm very much in favour of the definition of incidents including events that might cause an outage or other disruption to warranty or utility, but if you go down that route it requires a greater degree of judgment. Again my personal view is better safe than sorry - better to flag it as a possible incident and then downgrade it than be caught out.

I'm aware that some of the glib comments I've made hide an immense amount of complexity, especially in two areas: Assessing the impact, priority and urgency, and combining event management into a work queue that also has to accommodate incidents, requests and changes. I've also probably forgotten to mention a lot of common sense things, like projecting the colour coded events onto a massive command screen so that everyone can see them, and not letting past events fall out of view if they are still relevant.

At the risk of being accused of academic philosophizing by Rob I would recommend a number of books:

Understanding Systems Failures by Bignell & Fortune
Simple Heuristics that make us Smart by Gigerenzer & Todd
The Checklist Manifesto by Atul Gawande

James Finister
www.tcs.com
http://coreitsm.blogspot.com/

Submitted by Springs Marty (not verified) on Thu, 2011-06-02 21:04.

When is it resolved? And much more!

The more I think about this, the thornier it gets.

Suppose an alert arrives which says that the syslog rotate exited with a non-zero status on a server. Assume that there is no useful knowledge about the server available (this is the real world, after all).

Is this an incident? If so, what is the priority? How urgent is it? How many users are affected? Which CI failed?

Whatever the analyst does, at what point is the incident resolved? Put another way, when has the analyst "restored service"?

Dare the analyst attempt to "restore service" (whatever that means) without attempting some sort of root cause analysis?

In the end, was there any impact? Would there have been impact if the issue wasn't fixed? Was a service in jeopardy? Does it even matter if there was or wasn't impact? Should the analyst try to determine current and potential future impact before fixing the issue?

Submitted by JamesFinister on Thu, 2011-06-02 21:29.

The ITIL mantra of "It all depends"

The answers to these questions will vary from shop to shop.

Clearly the more you know about your systems and services, either in advance or in real time, the easier some decisions are to make. And to answer some of these questions you have to do what ITSM best practice has long advocated - you have to look outside of your silo, and place reliance on others. That's not a criticism; it doesn't matter how good you are at doing your part of the job if another part of the chain is broken. According to the textbook a lot of the decisions you've mentioned aren't yours to make. Your responsibility is to bring things to the attention of the right people to make those decisions, such as the service desk manager or the problem manager... but then I get the impression you are having to do your optimization in a real world situation where those people might not exist, or if they do are not skilled up to make those kinds of decisions.

My starting point is always not to sweat the small stuff, and get the routine and trivial things under control so you can focus on the difficult and unusual cases. So don't have an overly complex prioritization system and try to apply simple rules of thumb. Maybe you have a set of servers that you treat as high priority by default, and others that are low priority. OK, that's probably too simplistic.

One question you haven't asked that I would add to your list is "When is what the analyst is doing actually a change that needs to go through the change management procedure?"

Submitted by Springs Marty (not verified) on Thu, 2011-06-02 22:14.

You are right, of course. In the interest of simplicity the example I threw out has but one actor (analyst) and one trivial issue to illustrate something we could all understand.

Part of my role at my company is to make improvements to our processes. Having discovered ITIL, I have been trying to figure out where my team maps to the ITIL framework, without much success.

Thanks for the feedback.

Submitted by JamesFinister on Thu, 2011-06-02 22:36.

Remember the end state

ITIL is good as far as it goes. I've been involved with it for around 20 years, which is scary, but it isn't the be all and end all of running an IT shop. It is better in some areas than others, it could probably all be improved, but it stands, along with COBIT and ISO 20000, as a good basic starting point. What it isn't is an infallible and complete answer to every issue facing an IT shop, though some would like to think it is. If you can't link to the ITIL framework it doesn't necessarily mean what your team is doing is wrong, and just because ITIL says something doesn't mean it is right.

It seems to me that you are asking all the right questions. What might bring you some benefit is to start formalizing some of those questions and having some sort of disciplined approach to evaluating the different options. That might well mean looking outside of ITIL and using six sigma, lean, Theory of Constraints etc etc. At a more basic level I've always found Kipling's wise men a useful analytical tool.

"I keep six honest serving men, They taught me all I knew,
Their names are What and Why and When
And How and Where and Who!"

Have you come across G2G3? http://www.g2g3.com/ You might find some of their simulations useful, and I've been very impressed with what I've seen of their Vyper tool.

Submitted by Springs Marty (not verified) on Thu, 2011-06-02 23:48.

Where ITIL breaks

> it stands, along with COBIT and ISO 20000 as a good basic starting point.

Agreed. Before I declare what's broken in my team, it made sense to compare it to a reference of how our group should be performing. I am getting close to believing that ITIL has a lot of things, but not a template for what we do. I don't want to reinvent the wheel: surely someone else has mapped this out.

This voyage has shed light on some interesting aspects of ITIL, one of which is that ITIL is service-centric. At first it seems obvious to be service-centric and it seems to make sense, but ITIL leaves almost no room for anything which can't be tied to a service. In other words, if you can't tie an aspect of ITSM to a service, then you are (almost) immobile. Of course, given enough time, research and resources, anything can be chained back to a service. But anything which is urgent and cannot quickly be tied to a service is almost orphaned by ITIL.

Take the example of the log rotate error above. Given enough time and research, one can tie that failure to some number of services. Unfortunately, that research almost never makes economic sense, leaving few ITIL-friendly options.

Ironically, it is in this scenario that a CMDB would thrive, providing dependency visibility while reducing risk and response time, yet I have not seen a CMDB discussed for this use, a use where almost nothing else would be effective.

Submitted by JamesFinister on Fri, 2011-06-03 05:23.

The Usefulness of Philosophical Orifices

I find it kind of amusing that this thread started with Rob making fun of the debate we were having about whether DBA is a service or not, and here we are again faced with the importance of knowing whether the service model breaks down at some point in the value network. It wasn't always that way.

If you look back at ITIL v1 you would find that parts of it, such as capacity management, were written by people who genuinely bridged the service world and the technical world. I believe that link has been progressively weakened, to the detriment of ITIL. Aale Roos, amongst others, has suggested that the time is ripe to face up to the divide of the IT world into those at the supplier/customer/user interface and those actually managing IT.

And yes, the scenario you describe is a textbook example of where a CMDB would add value, in fact that is just what the first CMDBs I came across in my mainframe days were there for.

Submitted by skeptic on Fri, 2011-06-03 14:29.

calling all you mainframe dinosaurs

It seems excessive to populate and maintain CMDB relationships between services and thousands of batch jobs just so Marty's team can be part of the service management community and practices :)

The fact is these guys are so deep in the guts of operations that they are indeed remote from the services. They need to just deal with that. Creating cute little services of their own isn't the answer either.

As Aale says we did this stuff in the 80s and 90s. Surely one of you old dinosaurs recalls a source of decent event management practices to help Marty out?

Submitted by aroos on Fri, 2011-06-03 11:10.

It'll be fixed in version 4

One of the last projects I was involved in while still working in a Data Center was operation automation. The idea was that the tool was able to react automatically to events and alert the operators when necessary. This happened in the late 1980s.

The major problem of ITIL is that it does not have a clear and logical model for the different types of things that happen in a complex service environment. The service request - incident - problem - error - change model of V1-2 was OK for a start. Adding event did not help and neither did rewriting the definition of the term incident. It is really a waste of time to argue about ITIL terms because they are not logical, and it is silly to say that ITIL brings a common language when in practice people waste a lot of time arguing about which ITIL category some real-life activity falls into.

This is how I see it. A service consumer can make three types of contacts. These are feedback, request and problem. Consumer problems can be broken into three main categories:
- Fault, when we need to repair something. This does not need to be easy, it may require problem solving and root cause analysis but it can also be some simple activity.
- Support, when we need to show how to use it
- Potential fault when the system starts working after some activity or workaround. These may or may not indicate a fault in the system.

Event management is not a process, it is just an activity of monitoring & controlling. Events also need to be broken into subcategories. These could be:
- Information
- Request for some activity
- Potential fault
- Fault

We should not mix consumer problem handling (a.k.a. Incident Management) with monitoring & control activity. The place where these streams may meet is in fault reparation.

Notice that in this model there is no problem management process. Consumer problem management solves all consumer problems but does not repair faults. All teams must do continuous service improvement to prevent future faults and consumer problems.

I have been told that the ITIL terms were left vague intentionally, as a group of experts could never agree on a common term. I think this points to the core problem: there is no common best or good practice until a clear majority can agree to a single and logical set of terms.

Waiting for ITIL V4.

Aale

[Editor's note: I can't resist

Submitted by Springs Marty (not verified) on Wed, 2011-06-01 15:57.

Don't even need best practice

At this point, best practice suggestions aren't necessary; advice on good practice would be helpful.

Best is better than good, right? Wait, which one is more proven to be effective?

Submitted by skeptic on Wed, 2011-06-01 18:01.

the difference between best and good practice

Don't go there.

Hard to say whether the ITIL books or the ITSM community are most confused about the difference between best and good practice.

Now you've done it. Here comes the debate (again)...

Submitted by Springs Marty (not verified) on Wed, 2011-06-01 18:49.

Back to Event Management

Regarding good vs. best, I apologize. I know it brings out the passion in people. However counter-intuitive the terms are, I have internalized it as "good practice has been shown to be effective in the general case" and "best practice has been shown to be effective in some cases."

Back to events. The core volumes skirt around the edges of alerts and events in many places, but don't meet them head on. At the risk of sparking yet another debate, my team is rethinking some terms which we thought were obvious. If we agree that an incident is a service interruption, then what do we call an issue (for lack of a better term right now) which portends a service interruption, but is not, in itself, a service interruption? What if that issue clears up all by itself? What if the issue does not clear up by itself but needs intervention to prevent a service interruption? Do we wait until the interruption takes place before we open an incident record and action it? Where is the boundary between an event and an incident, and what other intermediate boundaries exist?

My group is quickly coming to the conclusion that there are two levels of activity around these issues BEFORE they become interruptions (i.e. incidents), and therefore they are two different things.
1. Acknowledge and monitor
2. Intervene

We are calling item one above, "Alerts" or, as one team member calls them, "alertcidents." We track them as incidents, even though they are not. We acknowledge them, ensure someone on the team owns each one and ensure that they are monitored. Since these may clear, these items do not need intervention, but they do need monitoring. A long-running job, for whatever reason, may throw an alert but may still finish soon enough to meet SLA.

Once it becomes clear that the alert will need intervention in order to meet SLA, we escalate it to a faux incident, aka "fakecident" and continue to track it in Unicenter Service Desk as an incident. At this point a team member owns the process and all communication to affected parties, as well as engaging the appropriate teams to attempt to resolve the issue prior to SLA breach.

Note that according to ITIL, none of the above really is an incident yet. Nonetheless, a lot of work has gone into tracking and managing issues.

About one in 3,000 alertcidents results in an SLA breach or other outage worthy of a true incident. At this point our Incident Management group, already painfully aware of the looming issue, will get involved.
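For what it's worth, the life cycle described above can be written down as a small state machine. This Python sketch is only one reading of it: the state names, the `ALLOWED` table and the `advance` function are hypothetical, and it assumes an alert always passes through intervention before becoming a true incident.

```python
from enum import Enum

class State(Enum):
    ALERT = "acknowledge and monitor"   # the "alertcident"
    INTERVENTION = "intervene"          # the "fakecident"
    INCIDENT = "true incident"          # SLA breach or other outage
    CLEARED = "cleared"

# Assumed transitions: an alert may clear by itself or need intervention;
# an intervention either prevents the breach or ends in a true incident.
ALLOWED = {
    State.ALERT: {State.INTERVENTION, State.CLEARED},
    State.INTERVENTION: {State.INCIDENT, State.CLEARED},
    State.INCIDENT: {State.CLEARED},
    State.CLEARED: set(),
}

def advance(current, target):
    """Move to `target` if the transition is allowed, else raise ValueError."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot move from {current.name} to {target.name}")
    return target

state = advance(State.ALERT, State.INTERVENTION)
print(state.name)
```

Writing it out this way at least makes it obvious that only one of the four states is an incident in the strict sense.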

The above process is confusing enough, but it gets worse.

If we have an alert going off too often (we have many "long running" alerts which go off each day like clockwork), but they never turn into an incident, then who owns fixing those? Our Problem Management group takes the position that they exist to prevent incidents, and recurring false alarms are not incidents. Fair enough. Our Change Management group won't authorize a change without a problem record, project charter, or a sufficient impact statement. Dozens of daily false alarms meet none of those criteria. Also fair.

So many questions.

If we should not be tracking alerts as incidents, then how should we be tracking them?

If we don't tune alerts through problems, then how should we track alert trends?

Once we find a trend, who tunes alerts?
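On the trend question, the raw material is probably already in the ticket history. A minimal Python sketch, assuming each closed record can be exported as an alert name plus a flag for whether intervention was needed (the function name, field layout and `min_count` cut-off are all hypothetical):

```python
from collections import defaultdict

def noisy_alerts(history, min_count=5):
    """Return alerts that fired at least `min_count` times and never needed intervention.

    `history` is an iterable of (alert_name, needed_intervention) pairs.
    """
    fired = defaultdict(int)
    intervened = defaultdict(int)
    for name, needed in history:
        fired[name] += 1
        if needed:
            intervened[name] += 1
    return sorted(
        name for name, n in fired.items()
        if n >= min_count and intervened[name] == 0
    )

# Hypothetical export of one week's closed records.
history = (
    [("long_running_A", False)] * 7
    + [("stream_Y_late", False)] * 4 + [("stream_Y_late", True)]
    + [("job_Z_failed", False)] * 2
)
candidates = noisy_alerts(history)
print(candidates)
```

This only produces the list of tuning candidates; it does not answer who owns the tuning.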

Any insight is appreciated.

Submitted by Wraith on Fri, 2011-06-24 04:09.

The primary mechanism by which an event is transformed into an incident is the magic phrase "user impact". Essentially, if I have a failure in an HA system and the users are unaffected, then it's an event. However, if that HA system is also load-balanced, then I have the potential for degradation of service and thus, we have an incident.

The determination of which events constitute incidents requires domain-specific analysis - in short, an expert system which knows about infrastructure and services which depend upon it. This is usually encapsulated in a human being, although automated expert systems are preferred.

ITIL's phrase for determining an incident contains ambiguity which must be puzzled out: "Any event which is not part of the standard operation of a service and which causes, or may cause, an interruption to or a reduction in, the quality of that service."

This does not imply that an event which is part of a pattern of events which may lead to an eventual incident is itself an incident. Far from it. What it states is that an event which - left unchecked - would cause an outage or degradation is an incident.

If you're in a power boat and your backup motor fails, that's an event - and you'll run at elevated risk because of it. However if you spring a minor leak, that's an incident. It would be nonsensical to treat your backup failure as an incident because there is simply no service to restore. Your backup is a risk mitigation measure and its absence changes your risk profile. That's all. Incident Management does not manage risk, it manages restoration of interrupted or degraded service.

I'd suggest that if only one in every 3,000 events actually creates an incident and you have people investigating every single event, then you've got a lot of nothing being done for very little return. That's not cost-effective, especially in today's financial climate.

"So", I hear you cry, "what do we do with events? And who does care about the warnings which events may imply?"

There are three processes which care. Availability Management, Capacity Management and Problem Management.

Availability Management should be intimately integrated with Event Management to accumulate information on the reliability, maintainability and resilience of the various configuration items and systems. For example when a single disk fails in a disk array, this will usually be preceded by various events warning of the impending failure. Availability Management may even specifically choose to mirror the contents of the disk prior to failure and proactively replace it. (Although most decent SANs will already have data recovery strategies in place anyway.) Point is, these are not incidents. There is no diagnostic phase attempting to determine root cause, there is no outage to service and there is no urgency to respond to customer impact.

Capacity Management should be using Event trends to match up against its own projections.

Problem Management should be using Event trends to determine possible correlations with incidents. The nonsense about using past incidents to prevent future incidents should be soundly ignored by everyone.

Events are data, nothing more. What's critically important is that you have the two things required to make use of them: sufficient historical recording of events to allow short and long-term analysis, and the expertise necessary to interpret events and determine their impact or potential impact upon the infrastructure. Interpreting them as incidents strikes me as wasteful and jumping the gun. Ignoring them altogether verges on irresponsible.

Submitted by skeptic on Wed, 2011-06-01 20:08.

fault and incident

ITIL V3 says something that could potentially in future cause a service interruption is indeed an incident - check the book again.

I disagree. We had a nice clean definition, the one you are using: an incident is an interruption, or more precisely the incident is one user's reported experience of the interruption. One interruption, many incidents.

So I think there is a distinct entity, an interruption, which we report for SLM.

Then I think there is another distinct entity, a fault. Something is wrong or out of bounds that hasn't caused a user to notice yet and hence hasn't caused an incident. Investigation of most faults leads to a problem, in a parallel way that investigation of some incidents leads to a problem. I think your PM people are talking crap: a problem is anything that can be rectified, that causes an incident or might cause an incident.

ITIL V3 muddies fault and incident together as incident.
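As a rough sketch of the distinction argued above, the four entities could be kept as separate record types. This is only one reading of the comment, not an ITIL data model; the class names, fields and sample values are all made up for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    user: str     # who reported it
    report: str   # what that user experienced


@dataclass
class Interruption:
    service: str
    # One interruption, many incidents.
    incidents: List[Incident] = field(default_factory=list)


@dataclass
class Fault:
    ci: str            # configuration item that is wrong or out of bounds
    description: str   # no user has noticed anything yet


@dataclass
class Problem:
    description: str
    # A problem can be raised from faults, from incidents, or from both.
    faults: List[Fault] = field(default_factory=list)
    incidents: List[Incident] = field(default_factory=list)


outage = Interruption("payroll batch")
outage.incidents.append(Incident("alice", "payslips missing"))
outage.incidents.append(Incident("bob", "report not delivered"))

degraded_mirror = Fault("disk-07", "one disk of a mirror set has failed")
preventive = Problem("replace failing disk", faults=[degraded_mirror])
```

In this sketch `preventive` is a problem with a fault behind it and no incidents at all, which is exactly the case that gets lost when fault and incident are treated as the same thing.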

Submitted by Springs Marty (not verified) on Wed, 2011-06-01 20:56.

Definitions

First of all, thank you for having this dialogue.

> ITIL V3 says something that could potentially in future cause a service interruption
> is indeed an incident - check the book again.

The book is a little vague here. The definition of incident includes, "Failure of a Configuration Item that has not yet affected Service is also an Incident. For example Failure of one disk from a mirror set."

I could argue either side of this. Is an alert a "Failure of a Configuration Item"? Is slow processing a failure?

The book goes on to define Incident Management to include, "The primary Objective of Incident Management is to return the IT Service to Customers as quickly as possible."

The definition of Incident Management presupposes that an outage has already taken place.

4.2.1 goes on to say, "The primary goal of the Incident Management process is to restore normal service operation as quickly as possible and minimize the adverse impact on business operations, thus ensuring that the best possible levels of service quality and availability are maintained."

Once again, service is already down before an incident, but 4.2.2 says, "Incident Management includes any event which disrupts, or which could disrupt, a service."

On the other hand, 4.2.5.4 tells us to prioritize based on impact, presumably known because impact is underway. My team is constantly dealing with hypothetical obscure impact that may or may not take place, to some dataset or system which may or may not affect customers we may or may not be able to list, having an effect we may or may not be able to quantify. If we insist on determining true impact to prioritize an incident before actioning it, we won't action anything in time to prevent an outage. By the way, a CMDB would do wonders here.

Back on point, I can find nowhere in the book where it describes managing incidents in an outage-prevention context, but that's what my team does.

> Then I think there is another distinct entity, a fault. Something is wrong or out of bounds
> that hasn't caused a user to notice yet and hence hasn't caused an incident.

That makes sense.

> a problem is anything that can be rectified, that causes an incident or might cause an incident.

I could argue either side of that as well. The first sentence defining Problem reads, "A cause of one or more Incidents" and the definition of Problem Management goes on to say, "The primary objectives of Problem Management are to prevent Incidents from happening, and to minimize the Impact of Incidents that cannot be prevented."

Again, no Incident => no Problem.

While my gut reaction is to disagree with the above, I cannot find where ITIL condones what we both agree is the "way it should be."

Thoughts?

Submitted by Derek (not verified) on Sat, 2011-06-04 01:36.

"The definition of Incident

"The definition of Incident Management presupposes that an outage has already taken place." I disagree. In your example of a failure of one disk in a mirrored set which does not cause an observable outage, I would still consider it an incident because the integrity of the service is impacted (degraded service). "Normal" service operation is to have the disks mirrored. The lack of mirroring is a disruption to my service, but not an outage.

"no Incident => no Problem". I disagree. If "The primary objectives of Problem Management are to prevent Incidents from happening, and to minimize the Impact of Incidents that cannot be prevented." then I consider it implicit. If you are managing problems in an ideal world, you do have problems, but no incidents.