Archive

Archive for April, 2012

TLAs…what are they?

OK, it was inevitable. We just got asked “what’s a TLA?”

In short, it stands for Two Letter Acronym or Three Letter Acronym.

In long, it stands for IBM, HP, BMC, CA, EMC, and any other megalith company in the IT service management arena.

In verbose, it means those companies who are sucking the life out of our IT budgets.

Stop The Press!! The Future of Incident Avoidance Is Here. NOT!!!

How often do you go into a meeting with high levels of endorphins driving a feeling of intense satisfaction, because you know that what you are about to be told will be groundbreaking, a revolution, and you’re going to want to spread the new gospel because it just sounds so right?

Not often huh?

Same here. But, that being said, that’s how my team and I entered a briefing from a major TLA recently.

Only to be totally and utterly disappointed by their extreme lack of vision, their clear misunderstanding of the true issues affecting the operations of Incident Management today, and shocked at how blind they were to our perspective.

We’ve all had vendors telling us about their ability to Predict that an Incident is going to occur a given time before that incident occurs, right?

Netuitive, Integrien (before they were swallowed by EMC/VMware), and BMC with ProactiveNet have been selling us this spiel for years.

The notion goes (roughly sketched in code after the list):
1. I analyse your historic performance trends across a large number of attributes.
2. I use big data analytics techniques (multivariate correlation seems to get bandied around a lot) to correlate coincident adverse performance spikes or “Anomalies”, and the trends leading to them, to create a model of the good and the adverse behavior of the infrastructure.
3. I apply that “Model” to real-time performance trend data and, when I see conditions that look like they will lead to an Anomaly pattern, I generate a Predictive Event telling me that I’m about to have an Incident.
4. The Predictive Event should give me time to take remedial action prior to the forecasted incident, ensuring that the incident does not manifest.
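
For the curious, here’s a back-of-the-envelope sketch of that notion in Python. Everything in it is made up by me for illustration: the metric names, the thresholds, the idea of a simple per-metric baseline. The real products use far fancier multivariate analytics, but the shape of the pitch is roughly this.

# A deliberately crude sketch of the vendors' pitch, not any vendor's actual algorithm.
# Metric names, thresholds and sample data are all hypothetical.
from statistics import mean, stdev

def build_baseline(history):
    # Steps 1-2: learn "normal" behavior per metric from historic samples.
    # history is e.g. {"cpu": [...], "net": [...]}; returns per-metric (mean, stdev).
    return {m: (mean(vals), stdev(vals)) for m, vals in history.items()}

def anomaly_scores(sample, baseline):
    # How far, in standard deviations, each live metric sits from its baseline.
    return {m: abs(sample[m] - mu) / (sd or 1.0) for m, (mu, sd) in baseline.items()}

def predictive_events(stream, baseline, z_threshold=3.0, coincident=2):
    # Steps 3-4: watch live samples; when several metrics deviate together,
    # raise a "Predictive Event" (the claimed early warning).
    for t, sample in enumerate(stream):
        hot = [m for m, z in anomaly_scores(sample, baseline).items() if z >= z_threshold]
        if len(hot) >= coincident:
            yield t, hot

# Last week's "normal" data, then a live feed with a developing spike.
history = {"cpu": [40, 42, 38, 41, 39, 43, 40, 44, 41, 39, 42, 40],
           "net": [300, 310, 295, 305, 298, 312, 301, 307, 299, 303, 308, 300]}
baseline = build_baseline(history)
live = [{"cpu": 41, "net": 302}, {"cpu": 78, "net": 610}, {"cpu": 90, "net": 700}]
for t, metrics in predictive_events(live, baseline):
    print(f"t={t}: predictive event on {metrics}")  # "you are about to have an Incident"

Note what it does and does not tell you: that cpu and net look odd together, nothing about why, and nothing about whether anyone downstream is actually impacted.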

Wonderful.

But as those of us who have spent money on this kind of software, and then the time, and more time, and more time (…) to train, re-train, guide, edit, assist, etc., the model know, it doesn’t work.

What is it really telling us? IF (and the IF is BIG) it actually produces a non-phantom early warning, all it’s really saying is that we have a performance or capacity issue on the way (most of these vendors claim “at least 20 minutes” of early warning – yeah right!). Even with an early warning of a performance or capacity issue, our processes don’t enable us to react in that time.

But that’s not the key issue. More importantly, that early warning does not indicate any cause of the performance or capacity issue, only that there might be one. We’re still in the dark as to where to look or what to do.

So what does it really do? It simply tells us, IF it works, that our customers are going to suffer a possible usability issue. Hah! So now, when they call, we can say, “yes, we know, awful, isn’t it!”.

But I keep using the “Big IF”.

Does it work?

Well, here’s the rub. When does a historic performance trend actually offer any indication as to the likely future performance trend? We’re in the financial services community and I can tell you, our systems and behavior are changing all the time. So we cannot reliably say that the behavior we saw last week will be the same this week.

Not only that, but the fact that correlated network capacity and CPU peaks were coincident with an Incident last week does not mean they will be this week. Ben Bernanke gave a speech last week and, at the moment, we’re not able to factor that in!!!

The fact is, these tools do not have any understanding of real fault conditions because they do not work with fault data.

The algorithms these tools use (ask Netuitive) have their roots in Seismic Survey data analysis: the first-generation use case for Big Data analytics and where it all started.

The idea goes that when you set off an explosive charge at the top of a rock formation, each of the layers will return a different ‘performance signal’. Correlating certain performance signals together allows you to look for certain patterns (what I referred to as an Anomaly above). In the case of seismic analysis, a certain pattern will correspond to what we already know is the pattern for a hydrocarbon deposit…so we drill there to see if it’s true.
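
For illustration only (toy numbers I’ve invented, not anyone’s real seismic code), the essence is sliding a fixed, known template over fresh survey data and scoring how well it lines up:

# Toy pattern matching: slide a fixed "known" template along a signal and
# score the alignment at each offset. Invented numbers, purely illustrative.
def correlate_at(signal, template, offset):
    return sum(s * t for s, t in zip(signal[offset:offset + len(template)], template))

def best_match(signal, template):
    scores = [correlate_at(signal, template, i)
              for i in range(len(signal) - len(template) + 1)]
    return max(range(len(scores)), key=scores.__getitem__)

template = [1, 3, 1]                 # the "known" hydrocarbon signature, fixed forever
signal = [0, 1, 0, 2, 1, 3, 1, 0]    # a fresh survey trace
print(best_match(signal, template))  # -> 4: the pattern turns up at offset 4

The template never changes, which is exactly why the trick works for rock and, as I argue below, not for us.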

Now, I may be simple and not have a degree. After all, I came from punch-card entry. But rock formations do not change over millions of years, and the pattern of the hydrocarbon deposit therefore will not have changed in that time, so looking to correlate that pattern is really a numbers and coverage game – how much land area can I test?

However, our IT infrastructures can change monthly (or in some of our cases, weekly), and the load characteristics can change daily when you have a business like ours.

My point? There is never a consistent “Truth” Anomaly pattern to look for. The Anomaly patterns must be constantly changing. How can their model ever be useful, accurate or right?

So, vendor TLA came in. They didn’t talk Seismic, they talked “Facial Recognition”. They talked about how faces change, men grow beards, their hair recedes. Women (few grow moustaches) change their hair, make-up, etc. Their algorithms still work.

We were all taken in. Then some bright spark in the room said “but actually, although the adornment of a face changes, the key features and positions of the eyes, nose, ears, and mouth do not change, and that’s how forensic scientists are able to scaffold skeletons to recreate life-like, realistic faces, and that’s how facial recognition software works. The algorithms are the same as seismic pattern detection algorithms.”

Then another bright spark asked “what about fault events?” Well, that just about threw them.

We walked out of the room massively disappointed. After all, we are wall to wall with their stuff. Disappointed that they have no new thinking. They’ve simply followed, ten years late, where Netuitive, BMC ProactiveNet and Integrien have already trodden, and we are not taken in.

The audacity of it all though is that none of them seem to get the real problem of Incident Management. WE WANT TO KNOW WHEN OUR CUSTOMERS ARE IMPACTED SO THAT WE CAN TALK TO THEM AND MAKE THEM FEEL GOOD ABOUT US!!!

Give us working BSM please. Stop trying to give us prediction when we really don’t have the processes to react in time and the prediction doesn’t show us what the cause is!

Oh it is so good to vent.