ITIL Implementation Roadmap (Problem Management) – Part 6
Problem Management Screws Up Our Metrics!
How many times have you seen the glint in an executive’s eye when he or she proudly proclaims a first call incident resolution rate of 80 percent or higher at the Service Desk?
This single metric is often held high as the sacred metric of efficiency and effectiveness for an IT support organization. However, have you ever stopped to think that perhaps the exact opposite may be true?
Consider that the best way to achieve a high first call resolution is to see the same issues called into the Service Desk over and over again. Once this predictable pattern is documented, the repeat incidents are quickly identified and the workarounds are applied within minutes. As my good friend George Spalding is fond of saying: “Show me an IT organization with an extremely high first call resolution metric and I will show you a Service Desk which is covering up for a dysfunctional IT Shop.” It is true that the Service Desk is doing a fine job of dispatching Incidents without bothering 2nd level support, but this is primarily because no one is fixing anything permanently. If the organization were to actually identify repeat incidents and remove them from the environment (Problem Management), then the first call resolution metric would actually go down, and the only ones who would be happy about it would be the customers, since they would no longer have to call the Service Desk.
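To make the arithmetic concrete, here is a minimal sketch of how removing repeat incidents can lower the first call resolution rate even as the customer experience improves. All call volumes and variable names below are hypothetical, chosen only so the starting point matches the 80 percent figure mentioned above.

```python
def first_call_resolution(resolved_on_first_call: int, total_calls: int) -> float:
    """Fraction of Service Desk calls resolved on first contact."""
    if total_calls == 0:
        return 0.0
    return resolved_on_first_call / total_calls


# Hypothetical monthly call volumes, invented purely for illustration.
repeat_calls = 700          # recurring issues with a documented workaround
new_calls = 300             # issues the desk has not seen before
new_resolved_at_desk = 100  # new issues still closed on the first call

# Before Problem Management: every repeat call is closed with a workaround.
fcr_before = first_call_resolution(
    repeat_calls + new_resolved_at_desk, repeat_calls + new_calls
)

# After Problem Management removes the root causes behind 600 repeat calls.
repeat_calls_left = repeat_calls - 600
fcr_after = first_call_resolution(
    repeat_calls_left + new_resolved_at_desk, repeat_calls_left + new_calls
)

print(f"Before: {fcr_before:.0%} of {repeat_calls + new_calls} calls")
print(f"After:  {fcr_after:.0%} of {repeat_calls_left + new_calls} calls")
```

With these assumed numbers the metric falls from 80 percent to 50 percent, yet the total call volume drops from 1,000 to 400. The metric looks worse while the customers are clearly better off.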
We in IT live in a culture where repeat Incidents are an accepted norm! It cannot be a coincidence that Problem Management is consistently ranked as one of the least mature IT processes assessed in over a hundred ITIL process assessments conducted by Pink in the last five years.
Insanity: doing the same thing over and over again and expecting different results. ~Albert Einstein
The key reason that ITIL splits Incident and Problem Management into two separate processes is that they have very different objectives.
- Incident Management: Restore service (fix the user)
- Problem Management: Identify the error and remove it (fix the technology)
Problem Management is typically started in the second wave of process projects due to the fact that it is a back office IT process and that it also depends on the output of a mature Incident process. Ideally Incident Management produces consistent record classification and data for trending in support of problem identification. Once Incident Management can reliably produce this level of output, the data can be trusted to support the implementation of Problem Management. Organizations approach implementation of this process in a typical pattern.
- One of the first activities organizations implement that is traditionally associated with the Problem Management process is the “Major Incident Review” process, often referred to as a postmortem activity. The premise of this activity is to review high impact incidents (the really embarrassing ones) to determine root cause and implement measures to avoid a recurrence. This activity is often implemented under the management of the incident restoration process and is often led by a service desk lead or manager. This can be considered reactive Problem Management and is further explored in the blog post ITIL Problem Management vs Root Cause Analysis.
- The next level of maturity is the realization that Problem Management is a distinct process that requires its own process models, policies and resources and is supported by incident reporting and trending activities. While at this point the process is implemented at a level of maturity that has significant benefits, the majority of activity is still focused on reactive problem identification and elimination.
- The third level of Problem Management implementation typically includes the proactive identification of issues for the explicit purpose of incident avoidance. An example of this is where the patch management process is understood to be part of the Problem Management process. When a vendor signals that a security vulnerability or deficiency has been found in their product, a Known Error record is opened for the purpose of impact analysis and assessment before the incident occurs in the organization. If the Known Error is deemed to be applicable, then the Release and Change Management processes are engaged to validate, test, approve, and deploy the patch into the production environment. Obvious process dependencies include Release, Change and Configuration Management.
Jack Probst, one of my co-workers at Pink, summarizes the challenge nicely in the following statement: “Perhaps Problem Management is such a challenge because we have lost sight of the forest by focusing on the daily grind of managing trees! Or perhaps a more accurate statement is that we don’t have a forest or trees problem at all. What we have is a bark problem – we are far too close to the technology issues to even envision that we have a problem.”
Troy’s thoughts. What are yours?
“A S.E.P.,” he said, “is something that we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem. That’s what S.E.P. means. Somebody Else’s Problem. The brain just edits it out; it’s like a blind spot. If you look at it directly you won’t see it unless you know precisely what it is. Your only hope is to catch it by surprise out of the corner of your eye.” ~Douglas Adams
You will very often find that both an Incident and a Problem ticket are open at the same time. This is certainly the case where you have a significant Incident with a high Impact and Urgency. The incident team will create one or more tickets and begin working on restoring service. Since it is what I like to call an incident you never want to live through again, you also open a Problem record to begin the process of root cause analysis and problem elimination.
The Incident ticket will be closed once the service has been restored by whatever means possible, but the Problem record stays open until the problem has been deemed eliminated.
You will find that both the Incident and Problem records will share the same classification, but the SLAs will differ.
e.g., Priority 1 Incidents are resolved within 4 hours 80% of the time / Priority 1 Problems need to reach root cause status within 5 business days.
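As a rough sketch of how these two different targets could be checked against ticket data, consider the following. The function names are hypothetical, the business-day count ignores public holidays, and the targets are simply the example values quoted above rather than recommended figures.

```python
from datetime import date, timedelta


def incident_sla_met(restore_hours, target_hours=4.0, target_share=0.80):
    """True if at least target_share of incidents were restored within target_hours."""
    if not restore_hours:
        return True  # nothing to measure, so nothing was breached
    within = sum(1 for hours in restore_hours if hours <= target_hours)
    return within / len(restore_hours) >= target_share


def business_days_between(start: date, end: date) -> int:
    """Count weekdays after start up to and including end (holidays ignored)."""
    days = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days += 1
    return days


def problem_sla_met(opened: date, root_cause_found: date, target_days=5) -> bool:
    """True if root cause was reached within target_days business days."""
    return business_days_between(opened, root_cause_found) <= target_days


# Hypothetical data: five Priority 1 incidents and one Priority 1 problem.
print(incident_sla_met([1.0, 2.0, 3.0, 3.5, 6.0]))
print(problem_sla_met(date(2024, 1, 1), date(2024, 1, 8)))
```

The point of keeping two separate checks is that the Incident clock stops at service restoration, while the Problem clock keeps running until root cause is established.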
Troy DuMoulin
Posted by Troy DuMoulin on 12/28 at 09:06 PM
I really enjoy your blogs, Troy! Thanks!
Back to Aleks’s question/scenario though…
Would providing those “means” of restoring service, in the case of an unknown and difficult (non-obvious) error, be a task for Problem or Incident Management?
To Aleks’s point, given a lack of skills in Incident Management, would Problem Management be directly involved in restoring service as soon as possible, and thus be directly involved in helping Incident Management fulfil its main objective?
When considering the boundaries between the processes of Incident and Problem Management, it is important to keep the process objectives in focus.
The process of Incident Management is focused on service restoration. This can be as simple as replacing a mouse or as complex as restoring service related to an enterprise mainframe. The process of Incident Management needs to be able to scale for each scenario.
Problem Management, on the other hand, is focused on discovering the root cause of the incident, whether it is complex or simple, and taking steps to ensure that it does not recur.
It is possible that the incident was restored by a workaround, such as rebooting the server, and that during the root cause activity an improved workaround or even a permanent fix is discovered by the process of Problem Management. Communicating this information back to the Incident process for future recurrences is an activity of Problem Management.
The difference is not based on the complexity of the activity but on the objective of the action. Incident Management is about fixing the user(s) by whatever means is possible / permissible. Problem Management is about fixing the technology and takes a back seat in priority and sequence.
Troy
Posted by Troy DuMoulin on 05/23 at 01:11 PM
While all things are possible they are not always wise!
Speaking from a time when I managed a Service Desk, here are my thoughts on this.
When it comes to 2nd level roles it is very possible for the same people to wear two different hats based on what they are being asked to do.
- Restore the Service (Incident Management)
- Eliminate the Problem (Problem Management)
However, I believe that the Service Desk should focus on Service Restoration or Service Request.
It is not so much that it is a conflict of interest for them to own Problem Management; it is more that Problem Management always takes a back seat to Incident Management and suffers for it.
When I was a SD manager I would try to rotate my people off the desk so that they could get a break from the phones, but also so that they could begin developing the top 10 list of Incidents based on our data.
While the intent was good, it rarely worked, since every time the phones got busy the person was placed back on the phones, to the detriment of Problem Management.
For this reason I suggest Problem Management should be owned by a group that is not your first line of defense on Incident Restoration.
Troy
Posted by Troy DuMoulin on 03/26 at 12:19 PM
Assuming that your Priority/Severity Model is based on business or mission risk, a Severity 1 Incident should also be a Severity 1 Problem. The basis for establishing your priority levels (Severity + Urgency) should be shared by multiple processes.
For more info on this please take a look at the following article.
The Practicality of Prioritization. http://bit.ly/yRnl4
However, what I would suggest is that the Problem initially inherits the Severity of the Incident and that you then move it up or down based on further information, or on the fact that more than one Incident may be associated with your Problem record.
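One way to picture this inherit-then-adjust rule is the small sketch below. The class and field names are hypothetical and do not reflect any particular tool's data model. The adjustment is left as a deliberate, recorded decision rather than an automatic rule, since no formula for it is prescribed here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Incident:
    incident_id: str
    severity: int  # 1 = highest business risk (assumed convention)


@dataclass
class Problem:
    problem_id: str
    severity: int
    incident_ids: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)

    @classmethod
    def from_incident(cls, problem_id: str, incident: Incident) -> "Problem":
        """Open a Problem that starts with the Severity of its first Incident."""
        return cls(problem_id, incident.severity, [incident.incident_id])

    def link(self, incident: Incident) -> None:
        """Associate a further Incident with this Problem record."""
        self.incident_ids.append(incident.incident_id)

    def reassess(self, new_severity: int, reason: str) -> None:
        """Move the Severity up or down and keep a note of why."""
        self.history.append(f"{self.severity} -> {new_severity}: {reason}")
        self.severity = new_severity


# Hypothetical usage: a Severity 2 Incident spawns a Severity 2 Problem,
# which is later raised once a second Incident is linked to it.
problem = Problem.from_incident("PRB-1", Incident("INC-1", 2))
problem.link(Incident("INC-2", 2))
problem.reassess(1, "second related Incident within the same week")
```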
Troy
Posted by Troy DuMoulin on 11/19 at 12:29 PM
To answer your question it is possible for Incident and Problem Management to be owned by the same overall process owner / sponsor with certain precautions put in place.
First, there are two types of conflict of interest: 1) the objectives of the processes conflict, placing the process owner in a difficult position, and 2) the high transactional volume of one process can override the goals and execution of the other.
In my experience, if the same person is given the ownership and execution of both Problem and Incident, the majority of time will be spent on service restoration rather than on taking a step back and looking for the repeating patterns that would indicate you have a systemic issue with your service assets.
At one point in my career I managed a help desk that was responsible for both processes. As part of this role I decided it would be a wise thing to have one of the agents take a turn off the phones to run the top 10 incident report and look for trends (a Problem Management activity). It was a good idea in principle but it did not work out in practice, since we always had a crisis or a major backlog in Incidents that would force me to go back on my plan and bring the agent back online with the Incident queue. In this way Incident always trumped the goals of Problem.
I have seen the same thing occur when an organization gives the Crisis Management role to the Problem Management team. They end up spending so much of their time managing Major Incidents and the subsequent RCA meetings that they never get time to spend on the proactive part of Problem Management, leaving the trending of Sev 2-4s out of the picture.
The only way that I have seen this work with any effect is to give a single person the ownership of both processes, but with two separate and dedicated teams, each focused on its respective process. It will be tempting to pull the Problem Team into fire fighting mode, but you need to resist this if Problem Management is ever to move beyond just the big stuff.
Troy
Posted by Troy DuMoulin on 04/07 at 09:44 AM
One consideration for you to think about is that Proactive Problem Management is also about Patch Management. For example when a supplier like Microsoft puts out a vulnerability or security alert about a Known Error in an application or the OS this is in essence a notice to Problem Management to assess the potential incident and determine if they will choose to act or not based on the Risk Assessment.
You could have your suppliers provide you with a measure that indicates how many of these alerts they have captured and proactively acted on with an update or patch.
I remember a personal experience a few years back when a company I was working for got notice of the ISS vulnerability that was targeted by the Blaster Virus. In retrospect if they had captured that alert as a known error and managed against it they would have perhaps avoided a whole world of hurt and business loss.
Troy
Posted by Troy DuMoulin on 05/14 at 11:39 AM
Normally what we suggest is that you use the technology and service classification structures and set up some type of pattern tolerance.
If you see so many incidents of a certain Category/Type or Service within a fixed time frame you raise a problem record.
The key is that you will want to set different tolerances for different technologies / services based on their relative business criticality.
For example, you may make a statement that says:
- If we see more than 25 Severity 1 Mainframe Incidents within a month, raise a problem ticket.
- If we see 300 Desktop Image issues within a quarter, raise a problem record.
- If there are more than 15 Sev 2 or higher Email Service Incidents within a 2 week period, raise a problem record.
It is then for Problem Management to determine if there is actually a correlation or not and to either continue the Problem Correlation or close the record.
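The tolerance rules above could be sketched roughly as follows. The record layout, the names `Tolerance` and `breached_tolerances`, and the conversion of "month" and "quarter" to 30 and 90 days are all assumptions made for illustration. A lower severity number is assumed to mean higher severity, and "more than 25" is expressed as a minimum count of 26.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    service: str
    severity: int      # 1 = highest (assumed convention)
    opened: datetime


@dataclass(frozen=True)
class Tolerance:
    service: str
    max_severity: int  # count incidents at this severity or higher
    window: timedelta  # fixed time frame to look back over
    min_count: int     # raise a problem record at this many incidents


# Thresholds taken from the examples in the text; day counts are approximations.
TOLERANCES = [
    Tolerance("Mainframe", 1, timedelta(days=30), 26),      # more than 25 Sev 1
    Tolerance("Desktop Image", 4, timedelta(days=90), 300),  # 300, any severity
    Tolerance("Email", 2, timedelta(days=14), 16),           # more than 15 Sev 2+
]


def breached_tolerances(incidents, tolerances, now):
    """Return the services whose incident count has reached its tolerance."""
    breached = []
    for rule in tolerances:
        count = sum(
            1
            for inc in incidents
            if inc.service == rule.service
            and inc.severity <= rule.max_severity
            and now - rule.window <= inc.opened <= now
        )
        if count >= rule.min_count:
            breached.append(rule.service)
    return breached
```

Each service returned here would only be a candidate for a problem record. As noted above, it remains for Problem Management to decide whether the incidents are actually correlated.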
Troy
Posted by Troy DuMoulin on 05/27 at 11:15 AM
Love this quote:
“Show me an IT organization with an extremely high first call resolution metric and I will show you a Service Desk which is covering up for a dysfunctional IT Shop.”
I have not heard it before, but how right you are. A great read. Thank you for sharing your thoughts.
Craig
Posted by Craig Kelly on 06/15 at 06:14 PM
We’re implementing ITIL Service Catalog, Request Fulfillment, Configuration Management, Incident Management, Change Management and Service Level Management (for the first phase) along with ISO27001:2013. I have a question here. We’re trying to integrate ITIL processes with ISO27001 requirements and we’re using ManageEngine ServiceDesk Plus. In ISMS we need to have RCA for major incidents and, as you have mentioned, the first level of Problem Management maturity is having review and RCA for major incidents. The question is this: in ManageEngine ServiceDesk, RCAs cannot be recorded for incidents unless they are associated with a problem. So should we associate problems (use the Problem Management module) and skip the Problem Management process for now?
In my view your plan to use the Problem Management Module for RCA even though you are not tackling that process in your first phase is a wise one. Regardless of the fact that many organizations will do an RCA as part of a Major Incident Review the outcomes of the RCA process are best aligned and reported as Problem Management.
By following your intended approach, the RCA data will be available to you when you are actually ready to start Problem Management.
Troy
Posted by Troy DuMoulin on 03/09 at 12:56 PM
Hello Sudha, the best practice is to open a Problem Record and keep it open in the state of “Problem” vs “Known Error” since as you state you do not have a workaround established for the repeat issue. You will continue to associate Incident tickets to the open Problem Record and it will gain history over time increasing the business case to develop a Root Cause and Mitigation strategy.
Yes you will close the Incident record providing that service has been restored by some alternative means. What you don’t want to do is close the Problem record. This stays open until such time as you can remove the reoccurring incidents.
Troy
Posted by Troy DuMoulin on 03/31 at 11:07 AM
Hi, we have had solution delivery teams identifying deficiencies and implementing systems with deficiencies. The problem is that the deficiency list is not acted upon. Would the best practice for this be to open a problem ticket?
Yes, in principle you should open up a Problem Ticket when you see a pattern that needs to be corrected. This should lead to a Root Cause evaluation of where deficiencies are typically coming from and why. For example perhaps there is a challenge with the Requirement Definition part of your development process.
Also, in the case where you have a known deficiency (Known Error) being implemented into production due to acceptable risk, it is important to log it as a Known Error and publish its existence to the Production Support team. Again, in an ideal situation you would also publish the expected workaround for this “known deficiency” if possible.
The one challenge you may face is that while it is easy enough to record a problem, the real challenge is to get action around establishing the root cause and removing it completely. Otherwise your problem ticket inventory continues to grow with no improvement.
I address this challenge in the following PR Radio episode.
Practitioner Radio - The Problem With Problem Management
http://blogs.pinkelephant.com/index.php?/troy/practitioner_radio_-_the_problem_with_problem_management/
Posted by Troy DuMoulin on 12/03 at 03:41 PM
Typically Problem Records are opened when a pattern of Incidents appears to indicate a trend or when there is an Incident with a major business impact. In the first case you would have the repeated Server Logs to share and can probably count on a repeat at some point. In the 2nd case your data for analysis would be wider than the server logs in question.
If these server issues have not occurred frequently then they should probably not be in a Problem Record, but should rather be dealt with through a Major Incident Review process conducted after the service restoration but before the Incident status is set to closed.
If there has been no action on these records for over a year it serves no purpose to keep them open and I would recommend gaining agreement with your customer to close them and to handle them under Incident Management.
Troy
Posted by Troy DuMoulin on 10/11 at 02:43 PM