Tuesday, February 25, 2014

Strategic Direction of Service Assurance Tools

Future Strategy of Service Assurance Tools

Over the last 24 years, the strategic direction for Service Assurance of large IT infrastructures, specifically the area referred to as proactive monitoring, has had to change radically to meet the growing demands of IT Operations. In simple terms, it has had to monitor larger, ever more complex and dynamically changing infrastructures with fewer and fewer people.

In many ways this accelerating change has been a good thing: it has fueled the demand for innovative technology improvements such as de-duplication (as used in Netcool back in 1993). In turn, these innovations have given large organizations operating efficiencies, in terms of the number of people needed per unit of IT infrastructure, that small and medium-sized companies find hard to match. However, I believe that in the last 5 to 6 years the strategic direction of proactive monitoring in many large IT infrastructures has taken a serious turn in the wrong direction.

I believe the result of this strategic wrong turn has been the gradual degradation of “effective” real-time proactive monitoring. Unfortunately, for many (both companies and vendors) abandoning this strategic dead end would involve a massive loss of face. I will explain!

Many large organizations split their first line resources into two broad areas:

1) Reactive monitoring: dealing with trouble tickets that have been generated for customers, often performing simple diagnostics and first resolution of the fault. If the team can't fix the problem quickly, or if the resolution requires detailed skills in a particular area such as DBA, they will pass it on to the next level of support. Typical tools used would be help desk systems such as Remedy, Clarify, HP Service Manager etc.

2) Proactive monitoring: looking at failure events generated directly from devices in the failing infrastructure and then resolving or escalating the problem to the next level of support. Typical proactive tools would be Tivoli Netcool, EMC SMARTS etc.

This model has often resulted in a shifting balance of people resources between these two competing drivers. In essence, the effectiveness (or otherwise) of the proactive monitoring tools, driven by the desire to keep the end customer happy by fixing failures before they impact them, has to be weighed against the limited number of people available. See my blog on Realtime Proactive Monitoring

As always, the business demanded that IT Operations should do more and more with less and less. At the same time, the task was getting almost exponentially harder due to IT infrastructure developments such as Virtualisation and the integration of partial Cloud based solutions.

Several years ago many strategy groups in large organizations believed they had the perfect solution to this increasingly limited resource balancing act. That solution was to deploy their existing proactive monitoring solutions as a "Black Box" solution; others refer to this approach as "Dark Siding".

The concept was that the existing proactive monitoring systems would not be looked at by operational people. Rather they would get the vendors to develop their solutions to automatically raise trouble tickets (aka Incidents) directly into the Help Desk system. Job done!
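To make the idea concrete, here is a minimal sketch of what such an automated hand-off might look like. It is not how any particular vendor implements it: the `Event` fields, the 1 to 5 severity scale, the `min_severity` threshold and the `INC` ticket numbering are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Event:
    node: str
    summary: str
    severity: int  # assumed scale: 1 (info) to 5 (critical)


def raise_incidents(events, min_severity=4):
    """Turn every event at or above min_severity into an incident record.

    No operator looks at the events; the rule alone decides what is raised.
    """
    serious = [e for e in events if e.severity >= min_severity]
    return [
        {
            "id": f"INC{number:05d}",  # hypothetical ticket numbering
            "node": event.node,
            "summary": event.summary,
        }
        for number, event in enumerate(serious, start=1)
    ]


# Hypothetical events arriving from the monitored infrastructure
events = [
    Event("router1", "link down", 5),
    Event("db02", "disk 80% full", 3),
    Event("web07", "no response to ping", 5),
]

incidents = raise_incidents(events)
for incident in incidents:
    print(incident["id"], incident["node"], incident["summary"])
```

In this toy run two incidents are raised, one for router1 and one for web07. If web07 happened to be unreachable only because router1 was down, an operator watching the event list might have treated them as a single fault; the simple rule has no way of knowing that.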

The model looked very good, and promised to help resolve many of the limited people resource issues. In fact many, including myself, bought into the idea. It was not until a few years later that a few others and I started to realize that the concept had significant flaws.

Without going into too much technical detail (I will do that later in a separate blog) the basic problems are divided into three main areas:

1) Going Live with the "Black Box" approach:

Getting high-quality, accurate and relevant automated incidents generated from proactive monitoring systems, such as Netcool etc, into the help desk or incident management system was much harder than expected. Traditionally these proactive tools were good at letting the users see the bigger picture; however, they always required a final human element to correlate the underlying failures accurately.

The vendors had many ideas that they claimed would remove the need for this final human element: methods such as auto-discovery, network sniffing, impact policies, topology models, CMDBs, application discovery etc. Unfortunately, when these solutions were tried in large complex environments they never managed to deliver the results needed. As always, the tool vendors had the usual escape clauses; "well we did say you need an accurate CMDB, so it's not our fault" is one of many.

2) Ongoing maintenance of the "Black Box" solution:

Even if the problems above are resolved by throwing huge amounts of people and money at the problem, manually creating accurate rules, topology models, CMDBs etc, the accuracy and relevancy of the incidents generated by the black box method degrades rapidly with time. By its very nature, it does not take long for the IT infrastructure you are trying to monitor to change to a point where the modeling solution becomes significantly inaccurate.

Obviously, if you keep a permanent army of people updating the rules, policies, CMDBs etc you can theoretically compensate for this drift; however, you are then losing all the people savings you thought you were going to make in the first place.
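The drift described above can be illustrated with a small sketch. It assumes the model is nothing more than a lookup from host to service, which is far simpler than a real CMDB or topology model; the host names, service names and the `attribute_events` helper are all hypothetical.

```python
def attribute_events(event_hosts, cmdb):
    """Split event sources into those the model maps to a service and those it cannot."""
    mapped, unmapped = [], []
    for host in event_hosts:
        service = cmdb.get(host)
        if service is None:
            unmapped.append(host)
        else:
            mapped.append((host, service))
    return mapped, unmapped


# Model captured at go-live (hypothetical hosts and services)
cmdb = {
    "web01": "online-banking",
    "web02": "online-banking",
    "db01": "payments",
}

# Event sources some months later: web02 has been decommissioned,
# vm-114 and vm-115 were added and nobody updated the model
event_hosts = ["web01", "vm-114", "db01", "vm-115"]

mapped, unmapped = attribute_events(event_hosts, cmdb)
coverage = len(mapped) / len(event_hosts)
print(f"coverage: {coverage:.0%}, unmapped: {unmapped}")
```

In this made-up case only half of the event sources can still be tied to a service. Any automated incident raised for the other half would carry no service context unless somebody keeps the model up to date.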

3) Correlation and Service Visualization

Most people did not realize that by "Black Boxing" the proactive monitoring systems you are losing your ability to visualize service failures.

GUIs for proactive monitoring systems have slowly evolved over the last 10 years to provide service-oriented visualization. Although the visualization and correlation provided by proactive tools are not perfect, by "black boxing" the proactive monitoring system all this work has been thrown away.

Compared to proactive tools, help desk systems are still in their infancy with their automatic correlation and service visualization offerings. In fact, many of these tools don't have any worth mentioning.

What to do Next:

As mentioned in my introduction I believe that abandoning this "black box" strategic dead end would involve loss of face for many service assurance architects. Millions have been spent and years have been wasted, so few if any are willing to admit this is a real problem.

Architects would rather blame the proactive monitoring vendors and technology, which to be fair is at least partly justified. Unfortunately, if this strategy is not changed, I believe that proactive monitoring of large complex IT infrastructures will wither and fade, putting far more reliance on reactive monitoring. That in turn will be a significant detriment to the customers of IT Operations.

I believe that vendors of Service Assurance Tools need to step up to the mark and develop an innovative alternative to this "Black Box" Strategy before "Proactive Monitoring" and the vendors themselves become a footnote in history. Unfortunately I have seen nothing so far to indicate such innovation is on the horizon.

I would be very interested to hear whether you have seen this "Black Box" approach in your environment and whether you are seeing the same issues, especially if you have a large complex IT infrastructure.
