SLAs that promise a resolution time are like firemen promising to put a fire out

Some SLAs assign a key metric to how long IT is going to take to resolve incidents. Really. This is like firemen promising to put a fire out in ten minutes. Worse still, if an SLA makes this mistake it almost always has it the wrong way round as well.

The IT Skeptic's new book is available now. One point it makes is that

Some organisations give high importance to how long IT is going to take to resolve incidents, and they write this into SLAs as a key metric. Usually high priority incidents are to be resolved quickly while lower priority incidents can take progressively longer.
This is akin to firemen promising to extinguish three-alarm fires within ten minutes, while a backyard grassfire may take until tomorrow. It is absurd on three levels: extinguishing the fire takes as long as it takes, bigger fires take longer, and that little yard fire won’t be so small tomorrow.

Comments

The focus should be on the workaround

Finding a resolution to an incident is of secondary importance. The important target is the workaround. I have often seen teams focus on finding resolutions, because that is what the contract says, and ignore the customer, who just wants to start working again!
I really think something from ITIL like the expanded incident lifecycle is a good tool, but doesn't even that ignore the workaround?
There are key times a service provider needs to commit to:
* Detection times based on suitable monitoring
* Logistics for delivering resources or replacements
* Speed of restores
* System restarts
The contentious time is the time to diagnose! That is open-ended; the rest are not.
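A minimal sketch of that split (all names and numbers here are hypothetical, purely to make the point that the committable times are bounded while diagnosis is not):

```python
# Times a provider can realistically commit to (in minutes), because they are
# bounded by monitoring, logistics and technology rather than by investigation.
# All names and numbers are hypothetical.
committed_minutes = {
    "detection": 5,            # based on suitable monitoring
    "resource_delivery": 240,  # logistics for delivering resources or replacements
    "restore": 60,             # speed of restores
    "restart": 15,             # system restarts
}

# Diagnosis is the open-ended step: no honest fixed number can be attached to it.
diagnosis_minutes = None
```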

Hooray! Time to Diagnose

I agree that setting an expectation for Incident Resolution does neither the customer nor IT any good --- a very bad application of setting expectations...

Time to Diagnose --- this is a subject that should come up right at Service Design (but rarely does).

Forcing a dialog about Time to Diagnose will put the screws to the customer to ante up what's required to properly manage the service, and squeeze IT to think about end-to-end requirements rather than the usual silo game of "it's not me".

In the mad rush to n-tier, virtualized and service-oriented infrastructures we should be paying attention to how this impacts Event Management and monitoring as much as CMDBs and changes.

Time to Diagnose might help focus some attention in this area.

John M. Worthington
MyServiceMonitor, LLC

Hi, managers need to

Hi,

managers need to validate how effective the delivery of IT is and they use SLA (max time to resolve) metrics to try and achieve this.

Until this mindset is changed, managed service providers will be expected to provide services to these types of SLAs.

SLA is not about the metrics!

I am not aware of a max-time-to-resolve metric? ITIL suggests three metrics in its literature:
* MTTR - Mean Time To Repair (measured from the time the incident is detected to the time recovery is completed)
* MTBF - Mean Time Between Failures (measured from completed recovery to detection of the next incident)
* MTBSI - Mean Time Between System Incidents (measured from detection of one incident to detection of the next)
Like all statistics these are meaningless unless aggregated over an extended time period with a few hundred thousand samples. (Not usually practical!)
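As a rough illustration of how those three measures relate (a minimal sketch with made-up incident timestamps, not an ITIL-prescribed calculation):

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (time detected, time recovery completed)
incidents = [
    (datetime(2009, 3, 1, 9, 0),  datetime(2009, 3, 1, 11, 30)),
    (datetime(2009, 3, 4, 14, 0), datetime(2009, 3, 4, 14, 45)),
    (datetime(2009, 3, 9, 8, 15), datetime(2009, 3, 9, 16, 0)),
]

def hours(td: timedelta) -> float:
    return td.total_seconds() / 3600

# MTTR: mean of (recovery completed - detected) for each incident
mttr = sum(hours(rec - det) for det, rec in incidents) / len(incidents)

# MTBF: mean of (next detection - previous completed recovery)
mtbf = sum(hours(incidents[i + 1][0] - incidents[i][1])
           for i in range(len(incidents) - 1)) / (len(incidents) - 1)

# MTBSI: mean of (next detection - previous detection)
mtbsi = sum(hours(incidents[i + 1][0] - incidents[i][0])
            for i in range(len(incidents) - 1)) / (len(incidents) - 1)

print(f"MTTR = {mttr:.1f} h, MTBF = {mtbf:.1f} h, MTBSI = {mtbsi:.1f} h")
```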
An SLA is not a stipulation of the maximum of these metrics. It is an agreement with the customer that describes and details the delivery of a service and sets expectations about the available support.
My opinion is that SLAs are usually flawed because people create them first without any supporting collateral. If a set of SOPs (Standard Operating Procedures) does not exist for a service then it is a futile exercise to even attempt an SLA, as you'll have no clue about the context and engagement.
COBIT actually provides a means to validate how effective the delivery of IT is. (Can't believe I actually complimented the ugly sister!)

user v customer

There is a distinction to be made here between the agreement of service with a customer, when the aggregated measures are important, and the statement of service made to a user, when the experience of individual events is important. I don't care that the average call answer time is well within 20 seconds when I've been hanging on the line for five minutes.

My experience is that most SLAs ARE about maximums/worst-case scenarios, because we fall into the trap of thinking that defining the worst case influences the average service. We think that if we define a five-minute answer target there will be a normal distribution with a six-sigma failure rate.
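To make the gap between the aggregate figure and the individual experience concrete, here is a minimal sketch with purely illustrative, simulated call answer times:

```python
import random
random.seed(1)

# Simulated call answer times in seconds: most calls are picked up quickly,
# but a small fraction get stuck in the queue for minutes (a skewed,
# decidedly non-normal distribution).
answer_times = ([random.expovariate(1 / 10) for _ in range(9_800)]   # typical calls
                + [random.uniform(120, 600) for _ in range(200)])    # stuck calls

mean = sum(answer_times) / len(answer_times)
p99 = sorted(answer_times)[int(0.99 * len(answer_times))]

print(f"mean answer time: {mean:.0f} s")   # comfortably inside a 20-second average
print(f"99th percentile : {p99:.0f} s")    # yet 1 in 100 callers waits minutes
```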

I agree that metrics are often useless

I agree that metrics are often useless; take your scenario of waiting for five minutes as an example. No metric provides the answer as to why you were waiting for five minutes, or how we would solve the issue of you not having to wait that long. It does not really help to tell everyone to answer their calls faster. Yet that is the standard perception, as the assumption is that the agents are lazy and require a kick in the pants.
What if there was a technical fault with the hunt group, or the ticketing app had gone titsup? Metrics are an easy way to blame humans.

Traditional SLA approach

The problem with this traditional SLA approach is it is based on a zero-defect mindset; a philosophy based on mass-production. These SLAs are focused on resource management: production speed, efficiency and cost-cutting. E.g., Incident max-time to resolve or service desk measurements based on answering a number of calls on time. They measure outputs rather than outcomes.

Notice this reinforces the "silo mentality." Worse, these measurements are rarely within the control of the people who operate within the system. Rather than focusing on effectiveness or customer outcomes, staff are incentivised only to make sure that the work gets done within the specified time. They keep their heads down and focus on an SLA spec. While the specification may be achieved, customers are often frustrated by shoddy or incomplete work, generating further units of work - an effect known as demand amplification.

When an organization uses resource measurements as performance indicators for staff, it creates a dynamic where ineffectiveness is institutionalized.

The problem is with how the SLA is defined.

Businesses, Customers and IT Management mandate SLAs/targets so that expectations can be set for all stakeholders including suppliers, and I think it is a good idea.

Generally I believe that most incidents are easy to fix and targets can be achieved for these incidents.

Now when it comes to major incidents (or exceptions), finding the root cause and a permanent solution may take many days or even weeks. But there is always a workaround or temporary fix that can be implemented within reasonable targets.

So, the problem is not with the SLA or the targets. Usually, the failure is because the SLAs are not defined properly with the various conditions and situations.

Fires

The first thing a fire department does is evaluate the fire to determine if they need help, and if they do they call in departments from other towns...it's all a matter of how much coverage they need to get it under control (so it doesn't start another fire) and how much water they need to put it out.

While I see some rationale

While I see some rationale in what you say, I would tend to disagree generally. Especially when there are multiple fires burning at the same time and you only have one fire squad, a decision has to be taken about what to do first and what to do next. A burnt-down yard house is not the same as a burnt-down factory. And about times to put the fire out - to me it is just a different angle on securing availability (a valid SLA key metric even by your standards ;-). Not entirely, of course - resolving incidents within specified timeframes doesn't guarantee availability conformance, but it can help a lot (especially when there are more of those incidents than you ever expected).

And one more thing: being able to define and then deliver resolution of incidents (especially those reported by end-users) helps to define expectations and then not breach those expectations ("expectation management" ;-), which I believe is quite important too.

Targets are useful

I have been on both sides of the fence (service provider and customer), and while I agree that these targets and measures of recovery and resolution time can be misleading when looked at for an individual incident, they have two key uses for me:
1 - resource management - if you have a target recovery time, then it focuses the service provider's efforts on ensuring the appropriate resources are applied (and having the priority related targets ensures that the resources are allocated to the one with the highest impact and urgency)
2 - service provision - without any targets for recovery/resolution, it is difficult to design the service delivery structure ... this drives the size and location of teams required and the spare part/hot swap stock size and locations (from experience supporting locations across the UK) - sketched below
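A minimal sketch of how a recovery target can drive stock-location decisions (all sites, depots and travel times are hypothetical):

```python
# Hypothetical UK sites and travel times (hours) from two candidate depots.
travel_hours = {
    "London":    {"Midlands depot": 2.5, "Scotland depot": 8.0},
    "Leeds":     {"Midlands depot": 2.0, "Scotland depot": 4.5},
    "Edinburgh": {"Midlands depot": 6.0, "Scotland depot": 1.5},
}
swap_hours = 1.0        # time to fit the spare once on site
recovery_target = 4.0   # contractual recovery time in hours

# A site is covered by a depot if travel plus hardware swap fits the target.
for site, depots in travel_hours.items():
    covered = [d for d, t in depots.items() if t + swap_hours <= recovery_target]
    print(site, "covered by:", covered or "nothing - need another depot or a hot spare on site")
```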

Does anyone have a view on the best way to include both Incident resolution and Problem resolution targets in an SLA covering application and infrastructure delivery? This is the one that I am trying to get to grips with at the moment ..

Incident and Problem

Neither suggestion is ideal but the only ways I can see this are:

- linking problem resolution to issuance of the same workaround for an incident. Basically the priority of problem management (and therefore the resolution time) gets escalated as the frequency of repetition of the problem increases
- the second is linking it to availability management so that a similar escalation occurs based on the impact of the problem on availability.

A combination of the two is a possibility.
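A minimal sketch of the first option, with hypothetical priority levels and thresholds, just to make the escalation rule concrete:

```python
# Hypothetical escalation rule: the more often the same workaround has to be
# re-issued against incidents, the higher the priority (and the shorter the
# resolution target) of the underlying problem record.
ESCALATION = [   # (incidents seen, problem priority, resolution target in days)
    (1,  4, 30),
    (5,  3, 15),
    (10, 2, 5),
    (20, 1, 2),
]

def problem_priority(linked_incident_count: int) -> tuple[int, int]:
    """Return (priority, target_days) for a problem, given how many
    incidents have reused its workaround."""
    priority, target = 4, 30
    for threshold, prio, days in ESCALATION:
        if linked_incident_count >= threshold:
            priority, target = prio, days
    return priority, target

print(problem_priority(3))    # (4, 30) - occasional recurrence, low priority
print(problem_priority(12))   # (2, 5)  - recurring often, escalate
```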

Brad Vaughan

using indirect KPIs is always a dangerous distorter of behaviour

Using indirect KPIs is always a dangerous distorter of behaviour. If you want the SLAs to ensure the appropriate resources are applied, and to drive the size and location of teams required and the spare part/hot swap stock size and locations, then write the SLAs so they define the appropriate resources to be applied by priority of incident for that service, and define the size and location of teams required and the spare part/hot swap stock size and locations by priority of service. Don't make the behavioural causal chain any longer than it need be - you'll get all sorts of unintended consequences.

Moved the rest of my response up to here

Every incident should be addressed as soon as possible

Both Radovan and James imply that incident resolution is a single thread: "one fire squad", "get some low impact incidents out of the way quickly so that teams can focus on the high priority incident". It isn't. Even in the worst major incident, not everyone is involved, nor should they be. Some Service Desk and level 1 people will still be quietly chugging away, working on lower priority incidents.

SLAs that say "priority 1 incidents: four hours to resolve" are crap. If we can, we'll resolve it in ten seconds. If we can't, it may take four days to get the vendors in and rebuild the system.

Likewise SLAs that say "priority 3: three working days" are just an excuse for slacking off.

Every incident should be addressed as soon as possible and closed as soon as possible. It is how much we chuck at it that varies.

"User requests" still come

"User requests" still come to my mind as a sort of Incidents where clearly defined and measurable objectives can help to do things like:

  • manage expectations of end-users
  • measure (semi-objectively) the load on teams/workgroups (the ratio between the number of incidents and their solution times in individual categories/urgencies/impacts/...) and thus help improve the overall service

This of course applies to the broad understanding of Incident as in ITIL v2, as discussed in other threads here.

And of course not all incidents are dealt with in a single thread. But then each "fire squad" can have more than one incident assigned and needs to prioritize.

"Every incident should be addressed as soon as possible" equals to "as good as it gets" or "best effort" regarding service quality if it is not substituted/enhanced by some other mean of representing the quality of service. Being able to (or at least attempting to) guarantee the maximal time to get the services up and running again has its value. Providing 99% availability of service is nice but adding to it that every Incident (outage) will be resolved up to _whatever_amount_of_time_units_ gives some added value. It may force the division of this 1% of unavailability to even smaller and more predictable chunks of _whatever_amount_of_time_units_. Even it is a minor detail it can help better manage the customer's expectations which is never a bad thing.

FIFO, LIFO etc

Skep,

First of all I quite agree that targets for priority 1 incidents are useless. In the case of other priorities it is useful to have some measure that cuts off the long tail.

Where I disagree is the simplistic "addressed as soon as possible and closed as soon as possible" - I would agree if you added "taking into account the most efficient way of reducing the overall resource requirement" or something a little more elegant. This is particularly true when considering how to feed work into an individual team, where sometimes slowing down the rate of input can benefit overall traffic flow.

Priority

Despite some ITIL terminology having been around for a long time, there is still a tendency for many of us to think primarily about how we use the terms in our own organisation, not in terms of best practice.

The distinction between priority and urgency was introduced to deal with this area of confusion. Priority is about what it says it is - deciding which incident/event/change/request takes priority - meaning it has first call on resources and is on management's radar. That doesn't mean it gets fixed first. It might be better to get some low impact incidents out of the way quickly so that teams can focus on the high priority incident. Perhaps we don't decouple the terms quite as much as we should. Sometimes, and this will be common to those of us with a financial services or critical infrastructure background, you can have multiple big fires burning at the same time - all of which are major incidents but you still need to prioritise. The prioritisation might be state-specific - for instance, if you have a major security breach it is high priority until you know you have contained the breach. There will be other work to be done, but that isn't at the head of the queue.

Personally I dislike having targets for problem resolution more than I dislike it for incident resolution.

In best practice you tend to get overlapping SLAs, and the actual target to fix is based on a combination of, amongst other factors: impact, location, time/date/type of failure, and ease of fix.

Something I dislike, that you hint at, is an SLA that on one page has a section on targets to fix individual incidents and overall incident fix targets, and on another page has a section on availability targets that is independent of the incident targets.
