Riddle me this: matching ITIL theory to the real world

Submitted by skeptic on Fri, 2011-10-21 19:53

Calling all you ITIL theorists, philosophers, pontificators and pundits. Marty is back: our follower from the real world, trying to make sense of ITIL on its home grounds, the operations of big iron batch computing. Marty asks: what happens after a service is restored? What does ITIL call the function of undoing the damage done while a service was unavailable? I have a view - of course - but I'm going to stay quiet - for a while - and hear what everyone else thinks. So have at it.
I love hearing from Marty (welcome back!) because he has a real job, and in that job he has to make all this waffly theory work. That perspective results in interesting insights. Here's Marty's question:

I ask because I am trying to explain to some coworkers the mapping of a real world event to ITIL terms. We have a twice-daily batch job which processes applications for credit. By law, the applications must be decisioned, notifications sent and proper documentation done for each application within a certain number of days after application has been made.

Let's map this to ITIL. The "service" is the processing of applications, correct?

Twice in the last week the job has failed to process several applications from some batches. We discovered the failures because compliance management reports showed a mismatch between inbound and processed applications.

The "incident" would be that the "service" was offline for several applications, correct? Therefore, we have two incidents.

The "problem" is that the "service" goes offline. Let's ignore that the failures generated no alerts.

According to ITIL, each incident has been resolved, because the "service" was available immediately after the failure to process the batch of applications.

Clearly, we cannot just pat ourselves on the back and move on; we must process the missing applications. What is that activity called?

Answers on a stamped self-addressed comment below please.

Comments

Submitted by Visitor (not verified) on Wed, 2012-03-21 17:01.

An incident is not

An incident is not considered closed until the effect of the service outage has been taken care of. When service is back we can mark the incident as "Service Restored". When the root cause is fixed, along with the issue of alarms not being raised, then the related problem record can be closed.

Submitted by skeptic on Wed, 2012-03-21 18:52.

So sayeth the Holy Book.

So sayeth the Holy Book. Thank-you brother for that reading.
If only the real world were so crisp.
E.g. an incident is also closed when the user neither remembers nor cares any more because it has been so long since we last called them...

Submitted by Hap (not verified) on Tue, 2012-03-20 14:22.

Matching ITIL Theory to the Real World

Late to the game, but here's my 2 cents worth. If the referenced incident is categorized as a combination of the (software, credit application), then I'd argue that the incident for the misprocessed applications is not closed. The service may be available for other credit applications - but not for the ones that failed. So, until a workaround is found for the credit applications in question, the incident remains open - and the process is incident resolution. Since the incident is serious and may be recurring, a problem ticket needs to be opened and a root cause for the failure found as well. Thinking about the application processing software, the failure is one of state transition processing and may require a larger state awareness if the failure is context dependent.

Submitted by aprill on Mon, 2011-10-24 22:17.

What Would Aprill Do

It's all very clear to me. I would have incidents for the failed batch/es that need to be re-run and a Parent Incident or Problem (sorry, haven't done ITIL yet to have those theoretical definitions in place), which those child incidents are linked to. The Parent drives the investigation into what caused the batches to fail, and while the Child incidents can be 'resolved' by having their batches re-run, they can't be closed until the Parent is resolved and closed.
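For anyone who wants to see that parent/child rule written down, here is a minimal sketch in Python. It is only an illustration of the rule described above, not how any particular ticketing tool works; the `Ticket` class, its status names and its methods are all made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Ticket:
    """Hypothetical ticket with three statuses: open, resolved, closed."""
    title: str
    status: str = "open"
    parent: Optional[Ticket] = None
    children: list = field(default_factory=list)

    def add_child(self, child: Ticket) -> None:
        # Link a child incident to this parent incident or problem.
        child.parent = self
        self.children.append(child)

    def resolve(self) -> None:
        # For a child, "resolved" means its batch has been re-run.
        self.status = "resolved"

    def close(self) -> None:
        # A ticket must be resolved before it can be closed.
        if self.status != "resolved":
            raise ValueError(f"{self.title!r} is not resolved yet")
        # A child cannot be closed until its parent is closed.
        if self.parent is not None and self.parent.status != "closed":
            raise ValueError(f"parent of {self.title!r} is still {self.parent.status}")
        self.status = "closed"
```

Re-running a batch lets you call `resolve()` on the child, but `close()` keeps raising until the parent investigation has itself been resolved and closed.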

Submitted by slogger on Sun, 2011-10-23 19:38.

First comment on the skeptic site so please be gentle...

I found this "challenge" quite interesting and I hope it will be the catalyst to make me contribute a bit more. I have split my reply into 3 parts. 1) Quasi ITIL vs text book, 2) Discussion around the incident, 3) the initial question - what is the process following service restoration

So part one, is a "quasi ITIL" implementation ok or should we stick to a purist implementation? Personally I have always respected the "framework" element of ITIL and tried to implement it based on the service needs of the organisation. Of course it would be difficult to re-write a whole process (ultimately a change is a change) but how an organisation goes about using the framework I believe is open to the individual interpretation. The service strategy element of V3 (or 2011?) probably gives more licence to veer away as you could probably justify most things using the 4 P's! But joking aside whether you wrap it up as ITIL or IT Service Management ultimately you are trying to get to a given point. For ourselves it was decreasing service downtime. We took the Incident and problem disciplines and created a hybrid which really focused on "major incident management". Not a major topic in V2 but suited our needs and allowed us to consistently reduce our downtime over 5 consecutive years.

Part two I found quite interesting because for me the incident was not the "batch process" but was actually the inaccuracy of the report? The failure of the batch process was the root cause but the failing service was the report with the user logging an incident because the info did not match. This could have been as a result of the MIS system failing or a change in the formatting of the report (but ultimately in this case you diagnosed the problem as your batch process). We have had big debates at work around similar incidents where the service demonstrating the impact is different to the one which actually caused it....

Part three? What is it called?? For me it is the proactive part of Problem Management, clean and simple. It may branch into change management (what has changed), service level management (what were the service expectations), security management (as mentioned earlier, were there any security implications as a result of this) and even things like Continuity (two failures in the same week sounds like creeping death and I would want to be considering my plan b ?). Output wise I would be expecting a Major Incident Report (not an ITIL term...) with any actions and acknowledgement of known errors noted and maybe a failed change report?

phew...... thanks for inviting me to comment...

Submitted by garyroos (not verified) on Mon, 2011-10-24 02:51.

Some really interesting

Some really interesting discussion here, who thought something this simple would be so complicated :-)

I'll give my perspective on this FWIW, but in essence I think the definitions are not that important - there are some flaws in the chain and I'll touch on those later.

We should not get overly entwined with the applications which were "lost", they are ultimately a business process responsibility/failure. I think the acceptance/response to an application is a business process, the processing of the batch of applications is the ICT-based service that this business process uses (that'll get the Skeptic's juices flowing) to help them complete their response. So we cannot take each individual person's lost application as an ITIL incident - it is the failure of the batch (collectively) that is the incident. Each time a batch fails to complete within its operating parameters it is an incident. So in your scenario - yes - you have two incidents.

But I don't think that your interpretation of Problem in this scenario is correct. The Incident is the disruption to service, the Problem is the cause of this disruption (not the disruption itself). The Incidents may have resolved themselves technically - but until you know exactly why these two batches failed to complete as expected from an ITSM/ITIL perspective, they are not resolved or closed. (You may make a determination that these were likely due to solar flares and the incident is not worth looking into - but you have two and that should cause you to sit up and take notice).

And so the long way to the answer to your question - the processing of the missing applications should be some form of business or service request, and may need to be fulfilled through a change request even. I say this for a couple of reasons 1.) Nowhere in your description have you alluded to having business guidance on what to do with failed transactions, 2.) They may already have been in contact with the customer in some form, they may have some legal implications, etc. So you should be notifying the business that these transactions failed.

But that is all semantics. The processes should be fixed - some governance from the business side that checks their inputs vs outputs will tell them if something is amiss; and some form of batch error reporting to notify the service delivery folks. Then ultimately, when the business process for dealing with these types of failures is defined, the re-processing of failed transactions (and the rules around these) can be automated and made to be part of the service.
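The inputs-versus-outputs check suggested above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `find_unprocessed` and the application IDs are invented, and a real check would read the IDs from wherever the inbound and processed applications are actually recorded.

```python
def find_unprocessed(inbound_ids, processed_ids):
    """Return inbound application IDs that never show up as processed."""
    return sorted(set(inbound_ids) - set(processed_ids))


# Hypothetical IDs for one batch run.
inbound = ["A-1001", "A-1002", "A-1003", "A-1004"]
processed = ["A-1001", "A-1003"]

missing = find_unprocessed(inbound, processed)
if missing:
    # In practice this would raise an alert or open a ticket, not just print.
    print(f"ALERT: {len(missing)} application(s) not processed: {missing}")
```

Run after every batch, a check like this would flag the mismatch straight away instead of leaving it to the compliance report, and the `missing` list is exactly the input a re-processing step would need.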

Submitted by technicalbridge (not verified) on Mon, 2011-10-24 09:03.

In my world (Major Incident

In my world (Major Incident Mgt) we would:

- Raise an incident called "Credit check applications have failed to be processed by [service]"
- Resolve the incident when it's confirmed that the service is available (ie immediately)
- Close the incident once the datafix - in this case processing the missed applications - has been completed

I would argue that the end-to-end Service here is "ensuring that the applications are processed", not just "service x processes the applications". Therefore to "restore service" you need to have processed the impacted applications.

This approach may not be quite as ruthless as ITIL wants us to be, but it's served us well in building a good relationship with the business. It also helps us measure the pain caused by incidents much closer because we don't lose visibility of the datafix activities by pushing them back to the business and closing the incident.
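The raise / resolve / close sequence described in this comment can be sketched as a small state model in Python. It is a rough illustration only; the `Incident` class, the `Status` names and the `record_datafix` method are assumptions made for the example, not any real tool's API.

```python
from enum import Enum


class Status(Enum):
    OPEN = "open"
    RESOLVED = "resolved"   # the service is confirmed available again
    CLOSED = "closed"       # the datafix has been completed as well


class Incident:
    """Hypothetical incident that tracks which applications still need a datafix."""

    def __init__(self, title, affected_applications):
        self.title = title
        self.status = Status.OPEN
        self.pending_datafix = set(affected_applications)

    def resolve(self):
        # Can happen immediately, as soon as the service is confirmed available.
        self.status = Status.RESOLVED

    def record_datafix(self, application_id):
        # Mark one missed application as processed.
        self.pending_datafix.discard(application_id)

    def close(self):
        if self.status is not Status.RESOLVED:
            raise RuntimeError("resolve the incident before closing it")
        if self.pending_datafix:
            raise RuntimeError(f"datafix outstanding for: {sorted(self.pending_datafix)}")
        self.status = Status.CLOSED
```

Keeping the outstanding applications on the incident itself is what preserves the visibility mentioned above: the gap between `resolve()` and a successful `close()` is the time spent on the datafix.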

Submitted by technicalbridge (not verified) on Mon, 2011-10-24 09:09.

er

whoops, that was a reply to the main article not the comment above mine! Sorry!

Submitted by aroos on Sat, 2011-10-22 07:25.

ITIL was made for batch jobs

This is a scenario where ITIL actually works because it is straight from the time of punched cards and steam powered computing.

Batch jobs are not "online" so the service was not resolved by the next batch. You need to whip out your quill and write two incident cards, one for each failed job. Then pass the cards to the Incident Manager who will mark them closed after the missing applications have been successfully processed.

Aale

Submitted by Todd Walton (not verified) on Mon, 2011-10-24 12:23.

Incidents not resolved

The incidents were not resolved. If the applications failed to process, then they remain unprocessed. You still have to find those applications and either work them or write them off or whatever is appropriate.

Submitted by Peter Suba (not verified) on Mon, 2011-10-24 22:46.

Resolution and Recovery

Todd, I agree with you and have to disagree with those who say that the incident is now resolved.

I partially agree with Ian in that the ITIL V3 books (both the 2007 edition and the 2010 edition) are not crystal clear on the difference between resolution and recovery, but there IS a difference and I distinctly remember many years ago discussing this in my V2 Managers training. I have been using this understanding ever since and it works fine.
Resolution is when you have the service generally up and running, recovery is when you have got the service back up and running without any residual errors or difference in configuration compared to the normal operation.

Without repeating myself too much, mapping the actual situation to the theoretical stage in the Incident lifecycle will depend on circumstances not detailed in the case description, but it can be determined. In both cases the incident should still be running.

Once again there is a fundamental matter here that is a common question in live operations, although mostly it is surfaced in context of the Service Desks: there is a "big" incident and there are many "small" incidents that make it up. Compare it to the situation when a mail service goes down for a whole company and many users are calling the Service Desk. Is it many incidents or just one? It is a very similar situation here - batch processing fails, is it one incident or as many as there are applications waiting to be processed in the batch? So in practice there are many ways to handle this (typically in the SD situation, depending on implementation, you can define this as a "Major" incident and link many of the single incidents to this major incident). You then close the "big" ("Major" - although some will object to using this terminology here, as would my purist self, which is why I only use it in brackets) incident, e.g. the batch processing now works, but you have to check if you can close the "small", linked incidents: in the mail server failure case you may contact the users (perhaps via targeted mass mailing if there are too many) to check if their issue is fully resolved, and in the batch processing case you check if all applications have now been successfully processed.

My personal preference to handle this would be to have 2 incidents open (one for each batch failure) but they would not be closed until it is verified that all failed applications are fully processed - the status would be dependent upon the toolset's setup that I am using and the related Work Instructions - for sure the incident would not yet be closed.

Once again - you may have the Service up and running, but that does not mean you can't have an incident to the service.

My current client operates one of (if not "the") largest SAP landscapes in the world and there are frequent cases related to batches, interfaces etc - in fact the vast majority of the incidents are not about the whole system/service being down, but certain parts of the system not doing what they are supposed to in certain specific cases. These are all valid incidents in their own right, and them being open does not even mean that the service is unavailable (!) - an incident does not necessarily equal an outage of the Service (it may just be a "...reduction in the quality of...").

So whilst there are plenty of cases when you may be better off working outside of ITIL, I think in this particular case there is a perfectly ITIL-compliant way to look at it...

Submitted by ianclayton on Tue, 2011-10-25 02:27.

USMBOK 3R

Peter - just a quick note here - the USMBOK has a 3R concept - 'recover, resolve, restore', representing the three stages of getting stuff fixed. Recover the last known working configuration, resolve the issue (apply fix), restore service (normally gradually). Old 1970s 'best practices'!!!

Skep - you gave me flashbacks there - punched cards - I used to have to 'interpret' them as part of my duties way back in 1974 - those who know what a card punch looks like will know what I mean... if you would, leave paper tape out of this - I still have nightmares chasing it across the room after it shot through the reader at 250mph.... can you believe punch cards were an upgrade and improvement over paper tape... (yes!)

Submitted by Marty S (not verified) on Sun, 2011-10-23 01:37.

Aale, thanks again for the

Aale, thanks again for the reply. In rereading your post, I am concerned that the word "batch" may hook some people. It turns out my company is not entirely in the steam age, so let me change this up just a little to make it more contemporary.

I ask because I am trying to explain to some coworkers the mapping of a real world event to ITIL terms. We have software which processes applications for credit from several web sites. By law, the applications must be decisioned, notifications sent and proper documentation done for each application within a certain number of days after application has been made.

Let's map this to ITIL. The "service" is the processing of applications, correct?

Twice in the last week the software has failed to process several applications. We discovered the failures because compliance management reports showed a mismatch between requested and processed applications.

The "incident" would be that the "service" was offline for several applications, correct? Therefore, we have two incidents.

The "problem" is that the "service" goes offline. Let's ignore that the failures generated no alerts.

According to ITIL, each incident has been resolved, because the "service" was available immediately after the failure to process the online applications.

Clearly, we cannot just pat ourselves on the back and move on; we must process the missing applications. What is that activity called?

Submitted by Peter Suba (not verified) on Sun, 2011-10-23 10:02.

Incidents in batch processing

Marty,

I think your definition of an Incident is OK, but your conclusion you draw from your own service definition is not.

To illustrate where you went wrong, I turn to Six Sigma only because it is very strict in leading you to define things such as "defects" and "opportunities for defects" and that helps in this case.

Your Service is the processing of applications.

Your Opportunity for defect (Six Sigma definition) is every application you need to process.
Your Defect (again, Six Sigma definition) is every time AN application does not get processed (within the allocated time dictated by compliance).
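To make the opportunity/defect framing concrete, here is a small Python sketch that counts defects against a deadline. Everything specific in it is an assumption for illustration: the dates, the 10-day deadline (the original question only says "a certain number of days") and the function names are made up. The defects-per-million figure assumes one opportunity per application.

```python
from datetime import date, timedelta


def count_defects(applications, deadline_days, today):
    """Count applications not processed within the deadline.

    applications: list of (received_date, processed_date or None) pairs.
    Returns (defects, opportunities).
    """
    limit = timedelta(days=deadline_days)
    defects = 0
    for received, processed in applications:
        # An unprocessed application only becomes a defect once it is overdue.
        finished = processed if processed is not None else today
        if finished - received > limit:
            defects += 1
    return defects, len(applications)


def dpmo(defects, opportunities):
    """Defects per million opportunities, one opportunity per application."""
    if opportunities == 0:
        return 0.0
    return defects / opportunities * 1_000_000


# Hypothetical sample: one on time, one never processed, one processed late.
sample = [
    (date(2011, 10, 1), date(2011, 10, 3)),
    (date(2011, 10, 1), None),
    (date(2011, 10, 1), date(2011, 10, 20)),
]
defects, opportunities = count_defects(sample, deadline_days=10, today=date(2011, 10, 21))
```

Counting per application rather than per batch is the point of the framing: a batch run can "succeed" overall and still leave defects behind.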

Coming back to ITIL, your incident is then any occurrence that might prevent you from processing an application (not an occurrence that might prevent you from processing ALL applications!)
Clearly, something has prevented the processing of those applications you mention. It is not clear whether anything was done in terms of resolution (either manually or automatically), so you are in one of two possible places in the Incident lifecycle with regards to the incident related to those applications that are still not processed:

Option 1.) Nothing was done, the applications failed as they did and have not progressed (e.g. if they were not processed due to the batch processor not being able to read an illegal character in the application and "failed" that application). In other words if another application of the same type comes in now, it also would not be processed. In this case the incident is currently diagnosed but not yet resolved. You also have a Problem (whether formally registered or not is another question), that should be resolved (in the example I gave this would be fixing a bug in the batch processing code) - you potentially do not have to wait for that to fix the incident (for example you may have a workaround in the example by correcting the illegal character and re-submitting the application to the batch).

Option 2.) Something was done so the application processing now continues as expected (for example, the application may have not been processed due to a network timeout in the batch processor that needed to look up something, but that network issue has since been resolved, or a mapping table did not yet include a lookup figure needed for that specific application but that has since replicated. In both cases the batch could have placed the application into a "failed" log waiting for further action from the operator). In this case, your incident is currently in a resolved state, BUT the Incident lifecycle also has another step before closure and that is "recovery". That is precisely the step that still needs to be performed in this case even if the root cause is no longer there - so for example re-submitting the application to the batch may be the "recovery" step.

In both cases, you have, in ITIL terms, an incident running. Whether you may (or should) have an incident or more incidents in your Service Management tool, is another discussion but you were asking about the terminology which hopefully this answers.

Submitted by Visitor (not verified) on Sun, 2011-10-23 15:44.

My thought is the incidents

My thought is the incidents are resolved for the service since the service is restored. However, the remediation of the incidents would be done as an outcome of problem management. Problem management would identify the root cause and open a change to fix the root cause and at the same time potentially generate a service request to the group who can make sure the applications are processed.

Consulting answer: ITIL is a framework, so bottom line is it depends on how your organization implements ITIL.

Submitted by Visitor (not verified) on Sun, 2011-10-23 08:56.

Process the missing

Not hoying to the expert tag, just personal opinion

The incident has still not been resolved - if the service is to process applications and they have not yet been processed then service is still in failure (even if the technology component isn't) - just ask the customer waiting for the result of their application!

Are they still missing? Because this might also be a security incident? Is there a possibility that the information puts the applicant at risk?

In addition, has the processing of the applications breached their SLA? If so, may I recommend you visit CSI - see "TIPU"! Section - oh yeah, it's not in that book yet...we're talking reality! Visit TIPU on this very same website!

Hope this helps!

Submitted by Marty S (not verified) on Sun, 2011-10-23 01:35.

Getting rid of the tricky word

Let's map this to ITIL. The "service" is the processing of applications, correct?

The "incident" would be that the "service" was offline for several applications, correct? Therefore, we have two incidents.

The "problem" is that the "service" goes offline. Let's ignore that the failures generated no alerts.

According to ITIL, each incident has been resolved, because the "service" was available immediately after the failure to process the online applications.

Clearly, we cannot just pat ourselves on the back and move on; we must process the missing applications. What is that activity called?

Submitted by Visitor (not verified) on Thu, 2012-07-19 20:41.

I just stumbled on this

I just stumbled on this website. This piqued my curiosity...and I know I'm a year late on this reply but I couldn't resist:

ITIL says an incident is "An unexpected event or failure that degrades or threatens to degrade the agreed quality of service". In a nutshell - something isn't working the way it is supposed to.

What do you call it? Since the service is functioning but inconsistently, you would have one incident - 'report identifies a mismatch' or if you know that this means some applications weren't processed, call it 'applications missed in batch'.

Reprocessing the applications is called a WORKAROUND.

Once you have reprocessed the missing applications successfully, you can close the incident ticket. Incident resolved.

Bridge to problem management.

The problem is 'some applications aren't being processed in batch'. Now you dig into an RCA to determine why. Once you identify the 'why' it is called a KNOWN ERROR (not sure the value of naming a bug except that you may choose - due to financial or time constraints - not to permanently fix this incident). If you decide a permanent fix is warranted,

Bridge to change/release management.

ITIL start to finish.

Submitted by Marty S (not verified) on Sat, 2011-10-22 22:00.

Rob, thanks for posting the

Rob, thanks for posting the question.

Aale, you are correct and in fact this is what we do. That means that for certain items, we are now opening incident records for something other than "An unplanned interruption to an IT service or reduction in the quality of an IT service."

This leads to us having our own definition of an incident, a definition we have not bothered to write down because (as you hint at in your reply) it is "obvious." My group has morphed into calling activities and failures incidents when they do not meet the ITIL definition. This becomes an issue when we interface with other groups who happen to follow the ITIL definition, or worse yet, have made up their own definitions, just like my group did.

In short, we are using a quasi-ITIL, not pure ITIL. This can be a very slippery slope, because we have set the precedent to allow everything in ITIL to be silently redefined within any group at the company.

It seems that our options are to:
1. Stay with pure ITIL
2. Make up our own stuff and call it ITIL even though it is not
3. Ditch the framework completely and start over

Of the options, only option one seems palatable. Unfortunately, as you demonstrated, fitting some operations into ITIL can be daunting.

Suggestions?

Thanks,
Marty

Submitted by benr. (not verified) on Mon, 2011-10-24 03:33.

It's a Problem

My understanding of ITIL is that each interruption or failure is an "incident", and those are closed. The higher level activity to snuff out the cause is problem management, and that's open.

In my ticket system I would log incident tickets for each event which constitutes a service interruption and link those to a single "master" problem ticket on which we are researching the underlying cause.

I would likely also open a "problem" ticket for the fact that "events" (alerts) were not generated.

Submitted by Peter Suba (not verified) on Mon, 2011-10-24 22:58.

linkages

Benr,

See my comment below on why I disagree that (all) the incidents should be closed. Depending on the process structuring/tooling work instructions you may have SOME incidents closed, but definitely not all of them (you may have "master" tickets for each batch failure and "child" incident tickets for each application, in which case it is fine to have the "master" tickets closed but not the children. Alternatively, you may have just the 2 tickets for each batch failure but then they could not be closed. Yet alternatively, you may have a ticket for each application that failed in the batch, or even one for each "type" of application - all of these configurations can have valid reasons behind them).

You would be wrong to assume given what we know that there is a single Problem ticket linking the two failures. It is possible, but without Root Cause analysis you just can't say. There may be multiple Problems here (or indeed there may be a single Problem). So there should in my mind be at least 2 Problem tickets open at this point unless it is already proven that both batch failures were caused by the same thing. As there is a compliance driver behind this IT service (outside-in!), it is critical that both failures are investigated so it won't happen in the future.

I also agree with you that due to the criticality of this processing there should be some monitoring in place so you don't wait for the compliance report to highlight the failure of IT - however strictly speaking that is not a "Problem" (this did not cause the incident) - depending on how the framework is implemented in the company it may be a Continuous Improvement initiative, an Event Management input to process, a Service Request item or just a post-incident review action to correct.

Submitted by ianclayton on Mon, 2011-10-24 01:44.

The USMBOK says... consider service recovery program

Well I'm scared to comment here... but damn it here goes..

What to do after restoring a service and undoing damage - well the business, and the USMBOK (+companion guides) call it 'service recovery'. It's an optional support program that has nothing to do with recovering systems and infrastructure and everything to do with recovering the customer satisfaction and loyalty - I'll be speaking on the 'service recovery paradox' at HDI 2012 next April.

Seems ITIL missed this one. Another reason why I think it's inside-out.

So before you engage problem management, which in my universe would engage whether you want them to or not based upon impact, you would optionally invoke a service recovery program, or 'next contact' procedure. (That's another concept not found in ITIL. A service encounter has a number of moments of truth including a 'greet' and 'thank you' contact. After the 'thank you' there is an optional 'next'. Similar to when the car salesman calls unannounced a week or so after you drive the new car off the lot to ask if all is ok...)

Depending on the relationship you have and want with the affected parties - you invest resources to repair their emotions and satisfaction levels. You know - like being awarded extra free miles by an airline when they lose your bags. Paying to ferry the bags to your location wouldn't count - that's the least they should do...

Anyone out there trying to stay within ITIL's bumper bars will fail their customer. Remember, ITIL is a CONTRIBUTION towards a service management way of thinking. There is much to be learned from what the service experts do - the real fountain of service management knowledge lives outside of IT.... Trying to make things work within the confines of ITIL wastes your time, and frankly that of others trying to help you...

A clue here - it was written by folks who have never actually had to design and manage a customer's service experience... Just read what ITIL 2011 has to say on 'customer satisfaction' and how it is managed and measured....