Pink Elephant
The IT Service Management Experts

Troy's Blog

The Hitch Hiker's Guide to the ITIL Galaxy and Beyond
Don't Panic

Author

Troy DuMoulin, VP, Research, Innovation & Product Development

Troy is a leading ITIL® and IT governance authority with a solid and rich background in Executive IT Management consulting. Troy holds the ITIL Service Manager and Expert certifications and has extensive experience in leading IT Service Management (ITSM) programs with a regional and global scope.

He is a frequent speaker at IT Management events and is a contributing author to multiple ITSM and Lean IT books, papers and official ITIL publications including ITIL’s Planning To Implement IT Service Management and Continual Service Improvement.

 

The Guide

"This blog is dedicated to making sense out of the shifting landscape of IT Management. Just when we thought we had a good handle on managing technology, the job we thought we knew is being threatened by strange acronyms like ITIL, CMMI, COBIT, etc. Suddenly the rules have changed and we are not sure why. The goal of this blog is to offer an element of sanity and logic to what can appear to be chaos."


Hitch Hiker's Guide to the Galaxy

"In many of the more relaxed civilizations on the Outer Eastern Rim of the Galaxy, the Hitch Hiker’s Guide has already supplanted the great Encyclopaedia Galactica as the standard repository of all knowledge and wisdom, for though it has many omissions and contains much that is apocryphal, or at least wildly inaccurate, it scores over the older, more pedestrian work in two important respects.

First, it is slightly cheaper: and secondly it has the words DON’T PANIC inscribed in large friendly letters on its cover."
~Douglas Adams

ITIL Implementation Roadmap (Problem Management) – Part 6

Problem Management Screws Up Our Metrics!

How many times have you seen the glint in an executive's eye when he or she proudly proclaims a first call incident resolution rate of 80% or higher at the Service Desk?
This single metric is often held high as the sacred measure of efficiency and effectiveness for an IT support organization. However, have you ever stopped to think that perhaps the exact opposite may be true?

Consider that the best way to achieve a high first call resolution rate is to see the same issues called into the Service Desk over and over again. Once this predictable pattern is documented, the repeat incidents are quickly identified and the workarounds are applied within minutes. As my good friend George Spalding is fond of saying: “Show me an IT organization with an extremely high first call resolution metric and I will show you a Service Desk which is covering up for a dysfunctional IT Shop.” It is true that the Service Desk is doing a fine job of dispatching Incidents without bothering 2nd level support, but this is primarily because no one is fixing anything permanently. If the organization were to actually identify repeat incidents and remove them from the environment (Problem Management), then the first call resolution metric would go down, and the only ones who would be happy about it would be the customers, since they would no longer have to call the Service Desk.
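To make the arithmetic behind this claim concrete, here is a minimal sketch in Python. The call volumes are invented purely for illustration (they are not Pink Elephant data), and the assumption that novel issues are harder to resolve at first contact is mine.

```python
def first_call_resolution(resolved_first_call: int, total_calls: int) -> float:
    """Share of calls resolved at first contact, as a fraction between 0 and 1."""
    if total_calls == 0:
        return 0.0
    return resolved_first_call / total_calls


# Hypothetical month BEFORE Problem Management: 1,000 calls, of which 700 are
# repeats of known issues that the Service Desk closes with a documented workaround.
repeat_calls = 700
new_calls = 300
new_calls_resolved_first_call = 100  # assumed: novel issues are harder to close at first contact

before = first_call_resolution(
    repeat_calls + new_calls_resolved_first_call,
    repeat_calls + new_calls,
)

# Same month AFTER the root causes behind the repeats have been removed:
# the 700 repeat calls never happen, so only the novel calls remain.
after = first_call_resolution(new_calls_resolved_first_call, new_calls)

print(f"Before: {repeat_calls + new_calls} calls, FCR {before:.0%}")
print(f"After:  {new_calls} calls, FCR {after:.0%}")
```

With these made-up numbers the metric falls sharply even though customers place 700 fewer calls, which is why first call resolution is best read alongside total call volume rather than on its own.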

We in IT live in a culture where repeat Incidents are an accepted norm! It cannot be a coincidence that Problem Management is consistently ranked as one of the least mature IT processes assessed in over a hundred ITIL process assessments conducted by Pink in the last five years.

Insanity: doing the same thing over and over again and expecting different results. ~commonly attributed to Albert Einstein

The key reason that ITIL splits Incident and Problem Management into two separate processes is that they have very different objectives.

  • Incident Management: Restore service (fix the user)
  • Problem Management: Identify the error and remove it (fix the technology)

Problem Management is typically started in the second wave of process projects because it is a back office IT process and because it depends on the output of a mature Incident process. Ideally, Incident Management produces consistent record classification and data for trending in support of problem identification.  Once Incident Management can reliably produce this level of output, the data can be trusted to support the implementation of Problem Management. Organizations approach implementation of this process in a typical pattern.

  1. One of the first activities traditionally associated with Problem Management that organizations implement is the “Major Incident Review”, often referred to as a postmortem.  The premise of this activity is to review high impact incidents (the really embarrassing ones) to determine root cause and implement measures to avoid a recurrence.  This activity is often implemented under the management of the incident restoration process and is often led by a service desk lead or manager. This can be considered reactive Problem Management and is further explored in the blog post ITIL Problem Management vs Root Cause Analysis.

  2. The next level of maturity is the realization that Problem Management is a distinct process that requires its own process models, policies and resources and is supported by incident reporting and trending activities.  While at this point the process is implemented at a level of maturity that has significant benefits, the majority of activity is still focused on reactive problem identification and elimination.

  3. The third level of Problem Management implementation typically includes the identification of proactive issues for the explicit purpose of incident avoidance.  An example of this is where the patch management process is understood to be part of the Problem Management process.  When a vendor signals that a security vulnerability or deficiency has been found in their product, a Known Error record is opened for the purpose of impact analysis and assessment before the incident occurs in the organization.  If the Known Error is deemed to be applicable, then the Release and Change Management processes are engaged to validate, test, approve, and deploy the patch into the production environment.  Obvious process dependencies include Release, Change and Configuration Management.
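The proactive flow described in the third level can be sketched roughly as follows. This is only an illustration in Python, not a description of any particular tool: the `VendorAdvisory` and `KnownError` record shapes, the advisory IDs, and the product names are all hypothetical, and the set of installed products stands in for a real Configuration Management lookup.

```python
from dataclasses import dataclass


@dataclass
class VendorAdvisory:
    advisory_id: str  # hypothetical identifier format
    product: str
    severity: int     # 1 = most severe


@dataclass
class KnownError:
    advisory_id: str
    product: str
    severity: int
    applicable: bool
    next_step: str


def assess(advisory: VendorAdvisory, installed_products: set[str]) -> KnownError:
    """Open a Known Error for a vendor advisory and record whether it applies."""
    applicable = advisory.product in installed_products
    if applicable:
        # Applicable Known Errors are handed to Release and Change Management.
        next_step = "raise change request for patch"
    else:
        next_step = "close as not applicable"
    return KnownError(
        advisory.advisory_id, advisory.product, advisory.severity, applicable, next_step
    )


# Stand-in for a Configuration Management query (assumption for this sketch).
installed = {"mail-server", "desktop-os"}

advisories = [
    VendorAdvisory("ADV-001", "desktop-os", 1),
    VendorAdvisory("ADV-002", "database-x", 2),
]

known_errors = [assess(a, installed) for a in advisories]
for ke in known_errors:
    print(f"{ke.advisory_id} ({ke.product}): {ke.next_step}")
```

The point of the sketch is the ordering: the Known Error record exists, and is assessed against what is actually deployed, before any incident has been reported.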

Jack Probst, one of my co-workers at Pink, summarizes the challenge nicely in the following statement: “Perhaps Problem Management is such a challenge due to the fact that we have lost sight of the forest by focusing on the daily grind of managing trees! Or perhaps a more accurate statement is that we don’t have a forest or trees problem at all. What we have is a bark problem – we are far too close to the technology issues to even envision that we have a problem.”


Troy’s thoughts. What are yours?


“An S.E.P.,” he said, “is something that we can’t see, or don’t see, or our brain doesn’t let us see, because we think that it’s somebody else’s problem. That’s what S.E.P. means. Somebody Else’s Problem. The brain just edits it out; it’s like a blind spot. If you look at it directly you won’t see it unless you know precisely what it is. Your only hope is to catch it by surprise out of the corner of your eye.” ~Douglas Adams

Posted by Troy DuMoulin on 01/23 at 02:18 AM
  1. Great post, Troy. Here’s a question for you: do incident records have to be closed to create a problem record? In most cases incident management can figure out a workaround for an issue, but they don’t always have the resources or skill set to find a resolution. Should the problem management team get a jump on the problem so that a known error record can be issued to help the incident management team?

    Posted by .(JavaScript must be enabled to view this email address)  on  12/27  at  05:49 PM
  2. Hello Aleks

    You will very often find that both an Incident and a Problem ticket are open at the same time. This is certainly the case where you have a significant Incident with a high Impact and Urgency. The incident team will create a ticket (or tickets) and begin working on restoring service. Since it is what I like to call an incident you never want to live through again, you also open a problem record to begin the process of root cause analysis and problem elimination.

    The Incident ticket will be closed once the service has been restored by whatever means possible, but the problem record stays open until the problem has been deemed eliminated.

    You will find that both the Incident and Problem record will share the same classification but the SLAs will differ.

    e.g. Priority 1 Incidents are resolved within 4 hours 80% of the time; Priority 1 Problems need to reach root cause status within 5 business days.

    Good Question

    Troy DuMoulin

    Posted by Troy DuMoulin  on  12/28  at  09:06 PM
  3. I really enjoy your blogs, Troy! Thanks!

    Back to Aleks’s question/scenario though…
    Would providing those “means” of restoring service, in the case of an unknown and difficult (non-obvious) error, be a task for Problem or Incident Management?
    To Aleks’s point, given the lack of skills in Incident Management, would Problem Management be directly involved in restoring service as soon as possible, thus directly helping Incident Management to fulfil its main objective?

    Thank you!
    Gleb

    Posted by .(JavaScript must be enabled to view this email address)  on  05/21  at  10:33 AM
  4. Hello Gleb

    When considering the boundaries of the process of Incident versus problem management it is important to keep the process objectives in focus.

    The process of Incident Management is focused on service restoration. This can be as simple as replacing a mouse or as complex as restoring service related to an enterprise mainframe. The process of incident management needs to be able to scale for each scenario.

    Problem management on the other hand is focused on discovering the root cause of the incident whether it is complex or simple and taking steps to ensure that it does not re-occur.

    It is possible that the incident was resolved by a workaround, such as rebooting the server, and that during the root cause activity an improved workaround or even a permanent fix is discovered by the process of problem management.  Communicating this information back to the Incident process for future recurrences is an activity of problem management.

    The difference is not based on the complexity of the activity but the objective of the action. Incident Management is about fixing the user(s) by whatever means are possible / permissible. Problem Management is about fixing the technology and takes a back seat in priority and sequence.

    Troy

    Posted by Troy DuMoulin  on  05/23  at  01:11 PM
  5. Hello Troy,

    Wonderful post. I have a few doubts which I think you would be able to clear up. Can the same people in the incident management team also be assigned problem management responsibilities? I have come across incident management teams who perform root cause analysis on high/critical priority incidents, and if a change in code is required to fix the problem, a change request is raised with another department. This root cause analysis is performed by the same incident management team that provides a workaround to temporarily fix the user.

    Posted by .(JavaScript must be enabled to view this email address)  on  03/17  at  06:18 AM
  6. Hello Binoy

    Good question:

    While all things are possible they are not always wise!

    Speaking of a time when I managed a Service Desk here are my thoughts on this.

    When it comes to 2nd level roles it is very possible for the same people to wear two different hats based on what they are being asked to do.

    Restore Service
    Eliminate the Problem

    However, I believe that the Service Desk should focus on Service Restoration or Service Request.

    It is not so much that it is a conflict of interest for them to own Problem Management; it is more that Problem Management always takes a back seat to Incident Management and suffers for it.

    When I was a SD manager I would try to rotate my people off the desk so that they could get a break from the phones but also so that they could begin looking at developing the top 10 list of Incidents based on our data.

    While this was a good intent it rarely worked since every time the phones got busy the person was placed back on the phones to the detriment of Problem Management.

    For this reason I suggest Problem Management should be owned by a group that is not your first line of defense on Incident Restoration.

    My Thoughts

    Troy

    Posted by Troy DuMoulin  on  03/26  at  12:19 PM
  7. Hello Troy:

    Nice article. Quick question: does a Problem take the same severity level as the Incident, or should we assign a separate, appropriate severity level to the Problem? An appropriate severity level would be based on the number and frequency of the incidents, in order to get the biggest bang in our metrics and hopefully a positive SLA implication. Could you share your thoughts on this?

    Posted by .(JavaScript must be enabled to view this email address)  on  11/17  at  07:51 PM
  8. Hello Manju

    Assuming that your Priority/ Severity Model is based on business or mission risk then the assumption is that a Severity 1 Incident should also be a Severity 1 Problem. The basis for establishing your priority levels (Severity + Urgency) should be shared by multiple processes.

    For more info on this please take a look at the following article.

    The Practicality of Prioritization. http://bit.ly/yRnl4

    However, what I would suggest is that the Problem initially inherit the Severity of the Incident and then move it up or down based on further information or the fact that there may be more than one Incident associated with your problem record.

    Best Regards

    Troy

    Posted by Troy DuMoulin  on  11/19  at  12:29 PM
  9. Hi, we are struggling with defining owners for Incident & Problem.  You briefly discussed that the Incident owner should not own Problem, but can the Problem owner own Incident Management?  The Service Desk would be held accountable for hitting Incident KPIs, and at the end of the day the Incident/Problem team would handle proper closure of Incident tickets and trending for Problem Management.  How do most large organizations align Incident and Problem Management? Do they have the same owner?  Thanks for any input on this difficult process owner decision.  We are hearing that the Service Desk should not own Incident because it’s a conflict of interest. How that is a conflict of interest is beyond me….

    Posted by .(JavaScript must be enabled to view this email address)  on  04/07  at  08:56 AM
  10. Hello Mark

    To answer your question it is possible for Incident and Problem Management to be owned by the same overall process owner / sponsor with certain precautions put in place.

    First, there are two types of conflict of interest: 1) the objectives of the processes conflict, placing the process owner in a difficult position, and 2) the high transactional volume of one process can override the goals and execution of the other.

    In my experience if the same person is given the ownership and execution of both Problem and Incident the majority of time will be spent in service restoration versus taking a step back and looking for the repeating patterns that would indicate you have a systemic issue with your service assets.

    At one point in my career I managed a help desk that was responsible for both processes. As part of this role I decided it would be a wise thing to have one of the agents take a turn off the phones to run the top 10 incident report and look for trends (a problem mgmt. activity). It was a good idea in principle but it did not work out in practice, since we always had a crisis or a major backlog in Incidents that would force me to go back on my plan and bring the agent back online with the Incident queue. In this way Incident always trumped the goals of Problem.

    I have seen the same thing occur when an organization gives the Crisis Management role to the Problem Management team. They end up spending so much of their time managing Major Incidents and the subsequent RCA meetings that they never get to the proactive part of Problem Management, leaving the trending of Sev 2-4s out of the picture.

    The only way that I have seen this work with any effect is that you give a single person the ownership of both processes, but they in turn need two separate and dedicated teams, each focused on its respective process. It will be tempting to pull the Problem Team into fire fighting mode, but you need to resist this if Problem Management is ever to move beyond just the big stuff.

    My thoughts.

    Troy

    Posted by Troy DuMoulin  on  04/07  at  09:44 AM
  11. I’m struggling to get my head around how to measure a proactive problem management process, and the process’s outputs, across multiple service partners.  How do you measure incidents & problems you’ve prevented?  How do you measure a reduction in impact or cost to your customers resulting from incidents which don’t occur?

    Posted by .(JavaScript must be enabled to view this email address)  on  05/11  at  10:49 PM
  12. Hello Eric

    One consideration for you to think about is that Proactive Problem Management is also about Patch Management. For example when a supplier like Microsoft puts out a vulnerability or security alert about a Known Error in an application or the OS this is in essence a notice to Problem Management to assess the potential incident and determine if they will choose to act or not based on the Risk Assessment.

    You could have your suppliers provide you with a measure that indicates how many of these alerts they have captured and proactively applied an update or patch for.

    I remember a personal experience a few years back when a company I was working for got notice of the Windows RPC vulnerability that was targeted by the Blaster worm. In retrospect, if they had captured that alert as a known error and managed against it, they would perhaps have avoided a whole world of hurt and business loss.

    My thoughts.

    Troy

    Posted by Troy DuMoulin  on  05/14  at  11:39 AM
  13. How about reporting on the problem records that are NOT linked to an incident record? Surely that’s an indicator (metric) of Proactive Problem Management. Either that or we have a dysfunctional Incident Management process - where incidents are not being recorded.

    Posted by .(JavaScript must be enabled to view this email address)  on  05/14  at  12:00 PM
  14. Hi,

    Is there a methodology to find the repeat incidents? For example, based on symptom, how can an analyst know whether the 800th incident is a repeat of the 5th incident? Any advice would be of great help.

    Posted by .(JavaScript must be enabled to view this email address)  on  05/26  at  01:25 PM
  15. Hello Nambi

    Normally what we suggest is that you use the technology and service classification structures and set up some type of pattern tolerance.

    For example.

    If you see so many incidents of a certain Category/Type or Service within a fixed time frame you raise a problem record.

    The key is that you will want to set different tolerances for different technologies / services based on their relative business criticality.

    For example: You may make a statement that says if we see more than 25 Severity 1 Mainframe Incidents within a month, raise a problem ticket.

    or if we see 300 Desktop Image issues within a quarter, raise a problem record

    or if there are more than 15 Sev 2 or higher Email Service Incidents within a 2 week period, raise a problem record.

    It is then for Problem Management to determine if there is actually a correlation or not and to either continue the Problem Correlation or close the record.

    My thoughts.

    Troy

    Posted by Troy DuMoulin  on  05/27  at  11:15 AM
  16. Hi Troy, thanks for your quick & effective response. This is very helpful and practical for implementation.

    Posted by .(JavaScript must be enabled to view this email address)  on  05/28  at  12:32 PM
  17. Love this quote:
    “Show me an IT organization with an extremely high first call resolution metric and I will show you a Service Desk which is covering up for a dysfunctional IT Shop.”
    I have not heard it before, but how right you are. A great read, thank you for sharing your thoughts.

    Craig

    Posted by Craig Kelly  on  06/15  at  06:14 PM
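The pattern tolerances suggested in the reply to Nambi above (comment 15) lend themselves to a small rule check. What follows is a minimal sketch in Python under stated assumptions, not a feature of any specific ITSM tool: the `Incident` and `ToleranceRule` shapes are hypothetical, a month is approximated as 30 days and a quarter as 90, "Sev 2 or higher" is read as severity 1 or 2, and every rule fires when the count exceeds its threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    service: str
    severity: int        # 1 = most severe
    opened_at: datetime


@dataclass
class ToleranceRule:
    service: str
    max_severity: int    # count incidents whose severity number is <= this value
    threshold: int       # raise a Problem record when the count exceeds this
    window: timedelta


def breached_rules(incidents, rules, now):
    """Return the rules whose incident count within their window exceeds the threshold."""
    breached = []
    for rule in rules:
        window_start = now - rule.window
        count = sum(
            1
            for incident in incidents
            if incident.service == rule.service
            and incident.severity <= rule.max_severity
            and window_start <= incident.opened_at <= now
        )
        if count > rule.threshold:
            breached.append(rule)
    return breached


# Tolerances modelled on the three examples in comment 15 (windows approximated).
rules = [
    ToleranceRule("Mainframe", max_severity=1, threshold=25, window=timedelta(days=30)),
    ToleranceRule("Desktop Image", max_severity=4, threshold=300, window=timedelta(days=90)),
    ToleranceRule("Email", max_severity=2, threshold=15, window=timedelta(days=14)),
]

# Invented sample data: 16 Email Sev 2 incidents and 5 Mainframe Sev 1 incidents.
now = datetime(2024, 1, 31, 12, 0)  # arbitrary fixed date for the example
incidents = [Incident("Email", 2, now - timedelta(hours=h)) for h in range(16)]
incidents += [Incident("Mainframe", 1, now - timedelta(days=d)) for d in range(5)]

for rule in breached_rules(incidents, rules, now):
    print(f"Raise a Problem record for review: {rule.service}")
```

A breached rule only opens the record. As the reply notes, it is then up to Problem Management to decide whether the incidents are actually correlated or whether the record should be closed.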