ITIL Implementation Roadmap (Problem Management) – Part 6
Problem Management Screws Up Our Metrics!
How many times have you seen the glint in an executive’s eye when he or she proudly proclaims a first call incident resolution rate of 80% or higher at the Service Desk?
This single metric is often held high as the sacred metric of efficiency and effectiveness for an IT support organization. However, have you ever stopped to think that perhaps the exact opposite may be true?
Consider that the best way to achieve a high first call resolution is to repeatedly see the same issues called into the Service Desk over and over again. Once this predictable pattern is documented, the repeat incidents are quickly identified and the workarounds are applied within minutes. As my good friend George Spalding is fond of saying, “Show me an IT organization with an extremely high first call resolution metric and I will show you a Service Desk which is covering up for a dysfunctional IT Shop.” It is true that the Service Desk is doing a fine job of dispatching Incidents without bothering 2nd level support, but this is primarily because no one is fixing anything permanently. If the organization were to actually identify repeat incidents and remove them from the environment (Problem Management), then the first call resolution metric would actually go down, and the only ones who would be happy about it would be the customers, since they would no longer have to call the Service Desk.
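To make the arithmetic concrete, here is a minimal sketch in Python. The call volumes and the function name `first_call_resolution` are illustrative assumptions, not figures from any real Service Desk. It shows how removing repeat incidents lowers the metric even though customers are better off.

```python
def first_call_resolution(resolved_first_call: int, total_calls: int) -> float:
    """Return first call resolution as a fraction of all calls."""
    if total_calls == 0:
        raise ValueError("total_calls must be greater than zero")
    return resolved_first_call / total_calls


# Hypothetical month: 1,000 calls, 800 of them closed on first contact.
# Assume 600 of those 800 are repeat incidents with a documented workaround.
TOTAL_CALLS = 1000
RESOLVED_FIRST_CALL = 800
REPEAT_INCIDENTS = 600

before = first_call_resolution(RESOLVED_FIRST_CALL, TOTAL_CALLS)

# Problem Management removes the root causes, so the repeat calls never arrive.
after = first_call_resolution(
    RESOLVED_FIRST_CALL - REPEAT_INCIDENTS,
    TOTAL_CALLS - REPEAT_INCIDENTS,
)

print(f"before: {before:.0%}, after: {after:.0%}")  # before: 80%, after: 50%
```

With these assumed numbers the metric falls from 80% to 50%, while the number of calls customers have to make drops by 60%.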
We in IT live in a culture where repeat Incidents are an accepted norm! It cannot be a coincidence that Problem Management is consistently ranked as one of the least mature IT processes assessed in over a hundred ITIL process assessments conducted by Pink in the last five years.
Insanity: doing the same thing over and over again and expecting different results. ~Albert Einstein
The key reason that ITIL splits Incident and Problem Management into two separate processes is that they have very different objectives.
- Incident Management: Restore service (fix the user)
- Problem Management: Identify the error and remove it (fix the technology)
Problem Management is typically started in the second wave of process projects because it is a back office IT process and because it depends on the output of a mature Incident process. Ideally, Incident Management produces consistent record classification and data for trending in support of problem identification. Once Incident Management can reliably produce this level of output, the data can be trusted to support the implementation of Problem Management. Organizations approach implementation of this process in a typical pattern.
- One of the first activities organizations implement that is traditionally associated with the Problem Management process is the “Major Incident Review”, often referred to as a postmortem. The premise of this activity is to review high impact incidents (the really embarrassing ones) to determine root cause and implement measures to avoid a recurrence. This activity is often implemented under the management of the incident restoration process and is often led by a service desk lead or manager. This can be considered reactive Problem Management and is further explored in the blog post ITIL Problem Management vs Root Cause Analysis.
- The next level of maturity is the realization that Problem Management is a distinct process that requires its own process models, policies and resources and is supported by incident reporting and trending activities. While at this point the process is implemented at a level of maturity that has significant benefits, the majority of activity is still focused on reactive problem identification and elimination.
- The third level of Problem Management implementation typically includes the proactive identification of issues for the explicit purpose of incident avoidance. An example of this is where the patch management process is understood to be part of the Problem Management process. When a vendor signals that a security vulnerability or deficiency has been found in their product, a Known Error record is opened for the purpose of impact analysis and assessment before the incident occurs in the organization. If the Known Error is deemed to be applicable, then the Release and Change Management processes are engaged to validate, test, approve, and deploy the patch into the production environment. Obvious process dependencies include Release, Change and Configuration Management.
One of my co-workers at Pink, Jack Probst, summarizes the challenge nicely in the following statement: “Perhaps Problem Management is such a challenge due to the fact that we have lost sight of the forest by focusing on the daily grind of managing trees! Or perhaps a more accurate statement is that we don’t have a forest or trees problem at all. What we have is a bark problem – we are far too close to the technology issues to even envision that we have a problem.”
Troy’s thoughts. What are yours?
“An S.E.P.,” he said, “is something that we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem. That’s what S.E.P. means. Somebody Else’s Problem. The brain just edits it out; it’s like a blind spot. If you look at it directly you won’t see it unless you know precisely what it is. Your only hope is to catch it by surprise out of the corner of your eye.” ~Douglas Adams
Great post, Troy. Here’s a question for you: do incident records have to be closed to create a problem record? In most cases incident management can figure out a workaround to an issue, but they don’t always have the resources or skill set to find a resolution. Should the problem management team get a jump on the problem so that a known error record can be issued to help the incident management team?
Posted on 12/27 at 05:49 PM -
Hello Aleks
You will very often find that both an Incident ticket and a Problem ticket are open at the same time. This is certainly the case where you have a significant Incident with a high Impact and Urgency. The incident team will create one or more tickets and begin working on restoring service. Since it is what I like to call an incident you never want to live through again, you also open a problem record to begin the process of root cause analysis and problem elimination.
The Incident ticket will be closed once the service has been restored by whatever means possible, but the problem record stays open until the problem has been deemed eliminated.
You will find that the Incident and Problem records will share the same classification, but the SLAs will differ.
e.g., Priority 1 Incidents are resolved within 4 hours 80% of the time / Priority 1 Problems need to reach root cause status within 5 business days.
Good Question
Troy DuMoulin
Posted by Troy DuMoulin on 12/28 at 09:06 PM -
I really enjoy your blogs, Troy! Thanks!
Back to Aleks’s question/scenario though…
Would providing those “means” of restoring service, in the case of an unknown and difficult (non-obvious) error, be a task for Problem or Incident Management?
To Aleks’s point, given the lack of skills in Incident Management, would Problem Management be directly involved in restoring service as soon as possible, and thus in helping Incident Management fulfil its main objective?
Thank you!
Gleb
Posted on 05/21 at 10:33 AM -
Hello Gleb
When considering the boundaries between Incident and Problem Management, it is important to keep the process objectives in focus.
The process of Incident Management is focused on service restoration. This can be as simple as replacing a mouse or as complex as restoring service related to an enterprise mainframe. The process of Incident Management needs to be able to scale for each scenario.
Problem Management, on the other hand, is focused on discovering the root cause of the incident, whether it is complex or simple, and taking steps to ensure that it does not recur.
It is possible that service was restored by a workaround, such as rebooting the server, and that during the root cause activity an improved workaround or even a permanent fix is discovered by Problem Management. Communicating this information back to the Incident process for future recurrences is an activity of Problem Management.
The difference is not based on the complexity of the activity but on the objective of the action. Incident Management is about fixing the user(s) by whatever means is possible / permissible. Problem Management is about fixing the technology and takes a back seat in priority and sequence.
Troy
Posted by Troy DuMoulin on 05/23 at 01:11 PM -
Hello Troy,
Wonderful post. I have a few doubts which I think you would be able to clear up. Can the same people in the incident management team also be assigned the problem management responsibilities? I have come across incident management teams who perform root cause analysis on high/critical priority incidents, and if a change in code is required to fix the problem, a change request is raised with another department. This root cause analysis is performed by the same incident management team that provides a workaround to temporarily fix the user.
Posted on 03/17 at 06:18 AM -
Hello Binoy
Good question:
While all things are possible they are not always wise!
Speaking of a time when I managed a Service Desk here are my thoughts on this.
When it comes to 2nd level roles it is very possible for the same people to wear two different hats based on what they are being asked to do.
- Restore Service
- Eliminate the Problem
However, I believe that the Service Desk should focus on Service Restoration or Service Request.
It is not so much that it is a conflict of interest for them to own Problem Management; it is more that Problem always takes a back seat to Incident Management and suffers from it.
When I was a SD manager I would try to rotate my people off the desk so that they could get a break from the phones but also so that they could begin looking at developing the top 10 list of Incidents based on our data.
While this was a good intent it rarely worked since every time the phones got busy the person was placed back on the phones to the detriment of Problem Management.
For this reason I suggest Problem Management should be owned by a group that is not your first line of defense on Incident Restoration.
My Thoughts
Troy
Posted by Troy DuMoulin on 03/26 at 12:19 PM -
Hello Troy:
Nice article. Quick question: in the case of a Problem, does the Problem take the same severity level as the Incident, or do we assign an appropriate severity level for the Problem? An appropriate severity level would be based on the number and frequency of the incidents, in order to get the biggest bang in our metric and hopefully a positive SLA implication. Could you share your thoughts on this?
Posted on 11/17 at 07:51 PM -
Hello Manju
Assuming that your Priority/ Severity Model is based on business or mission risk then the assumption is that a Severity 1 Incident should also be a Severity 1 Problem. The basis for establishing your priority levels (Severity + Urgency) should be shared by multiple processes.
For more info on this please take a look at the following article.
The Practicality of Prioritization. http://bit.ly/yRnl4
However, what I would suggest is that the Problem initially inherit the Severity of the Incident, and that you then move it up or down based on further information or on the fact that there may be more than one Incident associated with your problem record.
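As a rough illustration of that suggestion, here is a minimal Python sketch. The `Problem` class, the helper names, and the escalation rule (raise the Problem one level once five Incidents are linked) are all hypothetical assumptions chosen for the example; your own priority model should drive the real rule.

```python
from dataclasses import dataclass, field


@dataclass
class Problem:
    severity: int  # 1 = most severe
    incident_severities: list[int] = field(default_factory=list)


def open_problem(incident_severity: int) -> Problem:
    """The Problem initially inherits the Severity of the Incident."""
    return Problem(severity=incident_severity,
                   incident_severities=[incident_severity])


def link_incident(problem: Problem, incident_severity: int,
                  escalate_at: int = 5) -> None:
    """Link another Incident and adjust the Problem severity."""
    problem.incident_severities.append(incident_severity)
    # Never rate the Problem below its most severe linked Incident.
    problem.severity = min(problem.severity, incident_severity)
    # Assumed rule: raise by one level once enough Incidents are linked.
    if len(problem.incident_severities) == escalate_at and problem.severity > 1:
        problem.severity -= 1


problem = open_problem(3)       # starts as Severity 3, like its Incident
for _ in range(4):
    link_incident(problem, 3)   # four more Severity 3 Incidents are linked
print(problem.severity)         # 2
```

Moving a Problem down after investigation would simply be a manual change to `severity` by whoever owns the record.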
Best Regards
Troy
Posted by Troy DuMoulin on 11/19 at 12:29 PM -
Hi, we are struggling with defining owners for Incident & Problem. You briefly discussed that the Incident owner should not own Problem, but can the Problem owner own Incident Management? The Service Desk would be held accountable for hitting Incident KPIs and, at the end of the day, the Incident/Problem team for proper closure of Incident tickets and trending for Problem Management. How do most large organizations align Incident and Problem Management - do they have the same owner? Thanks for any input on this difficult process owner decision. We are hearing that the Service Desk should not own Incident because it’s a conflict of interest. How that is a conflict of interest is beyond me….
Posted on 04/07 at 08:56 AM -
Hello Mark
To answer your question it is possible for Incident and Problem Management to be owned by the same overall process owner / sponsor with certain precautions put in place.
First, there are two types of conflict of interest: 1) the objectives of the processes conflict, placing the process owner in a difficult position, and 2) the high transactional volume of one process can override the goals and execution of the other.
In my experience if the same person is given the ownership and execution of both Problem and Incident the majority of time will be spent in service restoration versus taking a step back and looking for the repeating patterns that would indicate you have a systemic issue with your service assets.
At one point in my career I managed a help desk that was responsible for both processes. As part of this role I decided it would be a wise thing to have one of the agents take a turn off the phones to run the top 10 incident report and look for trends (a problem mgmt. activity). It was a good idea in principle, but it did not work out in practice, since we always had a crisis or a major backlog in Incidents that would force me to go back on my plan and bring the agent back online with the Incident queue. In this way Incident always trumped the goals of Problem.
I have seen the same thing occur when an organization gives the Crisis Management role to the Problem Management team. They end up spending so much of their time managing Major Incidents and the subsequent RCA meetings that they never get time to spend on the proactive part of Problem Management, leaving the trending of Sev 2-4s out of the picture.
The only way that I have seen this work with any effect is to give a single person the ownership of both processes, but that person in turn needs two separate and dedicated teams, each focused on its respective process. It will be tempting to pull the Problem Team into fire fighting mode, but you need to resist this if Problem Management is ever to move beyond just the big stuff.
My thoughts.
Troy
Posted by Troy DuMoulin on 04/07 at 09:44 AM -
I’m struggling to get my head around how to measure a proactive problem management process, and that process’s outputs, across multiple service partners. How do you measure incidents & problems you’ve prevented? How do you measure a reduction in impact or cost to your customers resulting from incidents which don’t occur?
Posted on 05/11 at 10:49 PM -
Hello Eric
One consideration for you to think about is that Proactive Problem Management is also about Patch Management. For example when a supplier like Microsoft puts out a vulnerability or security alert about a Known Error in an application or the OS this is in essence a notice to Problem Management to assess the potential incident and determine if they will choose to act or not based on the Risk Assessment.
You could have your suppliers provide you with a measure that indicates how many of these alerts they have captured and proactively executed an update or patch on.
I remember a personal experience a few years back when a company I was working for got notice of the ISS vulnerability that was targeted by the Blaster Virus. In retrospect if they had captured that alert as a known error and managed against it they would have perhaps avoided a whole world of hurt and business loss.
My thoughts.
Troy
Posted by Troy DuMoulin on 05/14 at 11:39 AM -
How about reporting on the problem records that are NOT linked to an incident record? Surely that’s an indicator (metric) of Proactive Problem Management. Either that or we have a dysfunctional Incident Management process - where incidents are not being recorded.
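A minimal Python sketch of that report might look like the following. The record layout (a dict with an `incident_ids` list) and the function name `proactive_share` are assumptions made for illustration; a real tool would query its own problem records.

```python
def proactive_share(problems: list[dict]) -> float:
    """Return the fraction of problem records with no linked incident."""
    if not problems:
        return 0.0
    unlinked = sum(1 for p in problems if not p["incident_ids"])
    return unlinked / len(problems)


# Hypothetical problem records; PRB-3 was raised from a vendor alert.
problems = [
    {"id": "PRB-1", "incident_ids": ["INC-10", "INC-14"]},
    {"id": "PRB-2", "incident_ids": ["INC-22"]},
    {"id": "PRB-3", "incident_ids": []},
    {"id": "PRB-4", "incident_ids": ["INC-31"]},
]

print(f"{proactive_share(problems):.0%}")  # 25%
```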
Posted on 05/14 at 12:00 PM -
Hi,
Is there a methodology to find the repeat incidents? For example, based on symptoms, how can an analyst know whether the 800th incident is a repeat of the 5th incident?
Any advice would be of great help.
Posted on 05/26 at 01:25 PM -
Hello Nambi
Normally what we suggest is that you use the technology and service classification structures and set up some type of pattern tolerance.
For example.
If you see so many incidents of a certain Category/Type or Service within a fixed time frame you raise a problem record.
The key is that you will want to set different tolerances for different technologies / services based on their relative business criticality.
For example: You may make a statement that says if we see more than 25 Severity 1 Mainframe Incidents within a month, raise a problem ticket,
or if we see 300 Desktop Image issues within a quarter, raise a problem record,
or if there are more than 15 Sev 2 or higher Email Service Incidents within a 2 week period, raise a problem record.
It is then for Problem Management to determine if there is actually a correlation or not and to either continue the Problem Correlation or close the record.
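The tolerance idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Incident` and `Tolerance` classes, the `breached` function, the day counts used for a month and a quarter, and the sample data are all hypothetical, and a real service management tool would supply its own classification fields.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Incident:
    service: str
    severity: int  # 1 = most severe
    opened: date


@dataclass(frozen=True)
class Tolerance:
    service: str
    max_severity: int  # count incidents at this severity or more severe
    limit: int         # raise a problem record when the count exceeds this
    window_days: int


def breached(incidents, tolerances, today):
    """Return the tolerances whose incident count exceeds the limit."""
    hits = []
    for t in tolerances:
        start = today - timedelta(days=t.window_days)
        count = sum(
            1 for i in incidents
            if i.service == t.service
            and i.severity <= t.max_severity
            and start < i.opened <= today
        )
        if count > t.limit:
            hits.append(t)
    return hits


# The three example statements above, with assumed window lengths.
TOLERANCES = [
    Tolerance("Mainframe", 1, 25, 30),       # more than 25 Sev 1 in a month
    Tolerance("Desktop Image", 4, 299, 90),  # 300 or more in a quarter
    Tolerance("Email", 2, 15, 14),           # more than 15 Sev 2+ in 2 weeks
]

# Hypothetical data: 16 Sev 2 Email incidents over the last week.
today = date(2024, 6, 1)
incidents = [Incident("Email", 2, today - timedelta(days=d % 7))
             for d in range(16)]

for t in breached(incidents, TOLERANCES, today):
    print(f"Raise a problem record for {t.service}")
```

The check only flags a candidate. Deciding whether the incidents are actually correlated remains a judgment call for Problem Management.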
My thoughts.
Troy
Posted by Troy DuMoulin on 05/27 at 11:15 AM -
Hi Troy, thanks for your quick & effective response. This is very helpful and practical for implementation.
Posted on 05/28 at 12:32 PM


