Wednesday, July 18, 2007
Service Improvement Tips – Problem Management
For several decades Pink has been conducting process assessments based on ITIL, and over that period we have seen consistent trends in which processes are more mature than others. As you would assume, the maturity of Configuration Management is universally depressing (a discussion for another day). One result that might surprise you, however, is that Problem Management has consistently come out as one of the least mature processes across hundreds of first-time assessments. While companies do focus on the Service Desk and the process of Incident Management, very few IT organizations focus on the process that is designed to remove errors and instability from the service environment. Perhaps this is due in part to an IT culture that rewards firefighting skills and quick resolutions over back-office analytics and proactive activity.
Historically we have been much more interested in first-call resolution rates than in problem avoidance or in deploying solid production assurance processes like Release Management. I have explored this topic in detail in the following blog post: Problem Management screws up our metrics!
Recently, however, I have observed a renewed interest in problem identification and Incident reduction, thanks to the guidance provided in ITIL. The following tips address common pitfalls that I have observed in numerous organizations on the way to implementing this important process.
- Root Cause Analysis (RCA) is not Problem Management:
At an early stage of Problem Management implementation, many organizations develop a reactive process to analyze and report on major business-impacting Incidents. These Incidents are significant in that they caused a highly noticeable business impact and most probably introduced significant cost or risk to the business. At its best, this process is executed as soon as the service has been restored: a high-priority investigation takes place and a detailed report is generated describing the business impact and the contributing factors (People, Process and Technology) that led to the outage. The report should not only identify cause and effect but also identify specific, concrete actions that will avoid a similar occurrence in the future. The RCA process then continues to ensure that the actions are carried out. While this is a commendable and useful activity, it is only one element, and arguably not the most important one, of the Problem Management process. Many, if not most, organizations are satisfied with leaving their process at this level of maturity. They never dive deeper into trending and removing repeat Incidents of lesser perceived impact, and never move on to proactive Problem Management. My strong advice is that to be truly effective you have to mature this process from focusing only on the large business-impacting issues to also looking for trends and removing repeat Incidents. This topic is further discussed in the following post: Problem Management vs Root Cause Analysis
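To make "looking for trends and removing repeat Incidents" a little more concrete, here is a minimal sketch of how repeat Incidents might be surfaced from exported Incident records. Everything in it is an assumption for illustration: the `Incident` fields, the `problem_candidates` function and the `min_repeats` threshold are hypothetical and do not come from ITIL or from any particular Service Desk tool.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    """Hypothetical Incident record; field names are illustrative only."""
    incident_id: str
    service: str         # affected service
    category: str        # symptom or classification code
    major: bool = False  # True if already handled by the major Incident review


def problem_candidates(incidents, min_repeats=3):
    """Return ((service, category), count) pairs that recur at least
    min_repeats times among non-major Incidents, most frequent first."""
    counts = Counter(
        (incident.service, incident.category)
        for incident in incidents
        if not incident.major  # major Incidents already get an RCA
    )
    return [(key, n) for key, n in counts.most_common() if n >= min_repeats]


if __name__ == "__main__":
    sample = [
        Incident("I1", "email", "mailbox-full"),
        Incident("I2", "email", "mailbox-full"),
        Incident("I3", "email", "mailbox-full"),
        Incident("I4", "vpn", "timeout"),
        Incident("I5", "vpn", "timeout"),
        Incident("I6", "erp", "outage", major=True),
    ]
    for (service, category), n in problem_candidates(sample, min_repeats=2):
        print(f"{service}/{category}: {n} repeat Incidents")
```

The point of the sketch is only that the lesser-impact repeats are counted separately from the major Incidents, so they do not disappear behind the few outages that already receive a full review.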
- Central Ownership with Distributed Coordination:
The next two tips have to do with the roles of Problem Management. In and of itself this is not a hard process to understand or implement relative to the other processes described by ITIL. That said, a common issue I have seen is that even after the process has been defined and a central Problem Management function has been established, many companies still struggle to make the process effective. In my experience this is largely because a distributed role, which I will call a Problem Coordinator, must be identified and resourced in each IT domain for the process to work. When you establish a central governance function and process without these distributed counterparts, you identify problems and produce interesting trending reports that show where the pain is, but very little is actually done with this information. In essence you build an inventory of identified problems without a real focus on removing them from the environment. Remember that what we are talking about here are the lesser-impacting but numerous repeat issues identified through reporting and trending. The large business-impacting Problems are already getting the attention they need, as described in the first point. This problem is addressed by establishing a distributed Problem Coordinator role for each domain and then ensuring that the manager's and the group's KPIs reflect the importance of the process. (What gets measured gets done!) When this role is established, the central governance roles focus on reporting, trending, problem identification and prioritization, while the distributed role is responsible for resource allocation, investigation and root cause determination, as well as identifying permanent fixes.
The challenge for a central function without these distributed coordinators is that it does not own the resources required to execute the process, so it is forever going around hat in hand, pleading for resources to move the process forward.
- Leave the Problem Management Resources out of the Major Incidents:
Another common challenge I see around Problem Management is that the people who have been resourced against this process are often also required to be the champions of the major Incident or crisis management processes. Many companies, in accordance with best practice, have a specialized approach to coordinating the activities around resolving a major Incident. It often seems logical that the same person who is responsible for the post mortem or major Incident review should also chair or captain the major Incident restoration process. My advice, based on experience, is to resist this assumption and make sure that these two processes are handled and resourced separately. There are two primary drivers for this advice. The first is that, time after time, where I have seen this double duty applied, the individual spends most of their time in firefighting mode and has precious little time left over to step back and look at the big picture. The second is that the skills required for facilitating a high-stress major Incident process are not necessarily the same as those required for the detailed, tenacious and analytical approach that Problem Management demands. In summary, keep the central Problem Management resources out of firefighting mode and let them do the incredibly important work of systematically identifying and removing service delivery issues related to people, process and technology faults.
Troy’s Thoughts. What are Yours?
“The major problem - one of the major problems, for there are several – one of the many major problems with governing people is that of whom you get to do it; or rather of who manages to get people to let them do it to them. To summarize: It is a well known fact, that those people who most want to rule people are, ipso facto, those least suited to do it. To summarize the summary: Anyone who is capable of getting themselves made President should on no account be allowed to do the job. To summarize the summary of the summary: people are a problem.” ~Douglas Adams