Flying Cows: On Event Log Analysis and Causal Relationships

Background:
For the last few months I have been doing some research on event log analysis. The whole thing started with a simple idea: say we have an EventID, e.g. 15001 from MSExchangeTransport. Can we drill down and trace what caused it? You have the event data at point X in time, so you should be able to search back through a tree of events.

I was expecting answers along the lines of event chains: EventID 15001 + MSExchangeTransport -> EventID 1111 + SourceA [TimeStamp] -> EventID 2222 + SourceB [TimeStamp].
Well, the situation is a bit more complicated than that.

  1. EventIDs are tied to the event source. Different sources can use the same EventIDs in the Microsoft EventID namespace, and application developers can write to event logs with their own custom EventIDs.
  2. Going through event logs is like finding a needle in a haystack. There are filters for source and EventID, but they don't help much. Also, event logs are localized. You would need log analysis software which collects logs across your server stack and network layers to model any type of relationship.
    Just to give a sense of how many providers can write to your event log, try this:
    Type this at a command prompt: wevtutil el (this enumerates the logs; wevtutil ep enumerates the providers, i.e. the publishers, themselves).
    Now write a script to run this across your server farm, and do a Compare-Object to get a list of all providers that can write to Windows event logs (see the sketch after this list).
  3. Not all indicators of a system under pressure are collected in event logs, even on systems with cranked-up diagnostic logging. You may need to look at Perfmon data, Windows log files, syslog data, and events from your networking stack (Cisco NetFlow / OpenFlow). If you have a virtualized environment, you add more layers of events: storage stack data, hypervisor data, etc.
  4. EventIDs by themselves carry no intelligence about their own severity. You may need an external reference dataset, or a codebook, which provides guidance on whether a particular event is severe or can be ignored. Think http://support.microsoft.com/KB/ID.
  5. EventIDs carry temporal data: every EventID has a timestamp attached to it. This can be used to plot events on a time series, as the Splunk search app does.
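
Here is a minimal PowerShell sketch of the farm-wide provider inventory from item 2. It is a sketch only: the server names are placeholders, it assumes PowerShell remoting is enabled, and it uses wevtutil ep (the publisher-enumeration verb) since we want providers rather than logs.

# Collect the provider (publisher) list from two servers and diff them.
# 'SERVER01'/'SERVER02' are placeholder names.
$ref = Invoke-Command -ComputerName 'SERVER01' -ScriptBlock { wevtutil ep }
$dif = Invoke-Command -ComputerName 'SERVER02' -ScriptBlock { wevtutil ep }

# Providers present on one server but not the other
Compare-Object -ReferenceObject $ref -DifferenceObject $dif

# Union of all providers seen so far, for the farm-wide list
($ref + $dif) | Sort-Object -Unique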

CLASSIFICATION:
So how would you know if your infrastructure is in trouble?
Typically, you receive an alert for an event you are watching [e.g. an EventID, "disk space", "network layer alerts", "storage alerts"], and it's entirely up to the system admin's technical ability to figure out the root cause. Finding a single root cause may not help when mutually exclusive events generated at different layers of the stack combine to create a storm. You may also have concurrent root causes.

I want to classify the different categories of events we are trying to analyze:

  1. CAT1: Outage events.
  2. CAT2: General errors and warnings related to a server role or an application in the core stack.
  3. CAT3: Other application events.
  4. CAT4: Events triggered by end-user action.
  5. CAT5: Rare one-off events which are critical but may not trigger an alert.

[Core-Stack]

  1. Core OS
  2. Core Infra @{AD, DNS, DHCP, IPAM, Server-Role @{TS,Hyper-V,ADRMS,ADFS}}
  3. Tier-1 Applications:
    1. Exchange @{DB, Transport}
    2. IIS @{}
    3. SQL @{}
    4. SharePoint @{}
    5. Network Layer @{}
    6. Storage Layer @{}
    7. Virtualization layer @{}
    8. Security Layer @{}

Let me explain the categories a little bit.
1. CAT1: Outage events.
By definition, you can't plan for or do analytics on outage events. The best you can hope for here is to bake redundancy into your infrastructure: dual power supplies, dual WAN links, Disaster Recovery/BCP, an alternate data center, and WAN failover come to mind. The idea is that if your hardware or the supporting infrastructure is under pressure, you move the workload to a stable pool.
CAT1 events are by their very nature unpredictable and random, and are usually caused by failures outside the measurable category.
(Ross Smith IV gave an example during his Failure Domain presentation where a tree-cutting service chopped a dual WAN link to a data center.)
2. CAT2: Core stack.
This is the space where system administrators spend most of their time. Core-stack plus application events cover close to 90% of the logs generated, by volume. Event data in this category may lend itself to pattern analysis, and I am going to discuss some of the options down the line.
3. CAT3: Application-specific – desktop, web-based, or apps.
Application-specific events from a non-Microsoft vendor, or from an internally developed application.
4. CAT4: User stack.
You can investigate the log data of clients connecting to your core infrastructure and try to find patterns and causality.
E.g., email stuck in an Outbox affects the Exchange subsystem; changing Outlook views does too.
A user watching streaming video in Internet Explorer during the Olympics affects the VDI infrastructure.
5. CAT5: Rare chronic events.
Rare one-off events which may or may not be critical, and which do not trigger an alert.

Existing research:
I am going to discuss some of the existing research on CAT2–4 and CAT5. My initial thought going into this was, "Surely I am not reinventing the wheel here. Someone else must have faced similar problems and done some research." Well, they have.

CAT2/CAT3/CAT4 Research:
a) Multivariate analysis techniques: UT Austin and AT&T Labs Research published a paper on a diagnosis tool (Giza) they developed to diagnose events in AT&T's IPTV infrastructure. Giza was used to trace events across the network stack from the SHO to the DSLAM to the set-top box. Giza uses a hierarchical heavy-hitter detection algorithm to identify spatial locations, then applies statistical event correlation to identify event series that are strongly correlated, and finally applies statistical lag correlation to discover causal dependencies (a toy sketch of lag correlation follows this list). According to the authors, Giza scores better than WISE on the AT&T test data. Giza also has the advantage of traversing the stack, collecting logs in different formats across devices and using them to model causality.
b) Coding approach: Implemented in the SMARTS Event Management System (SEMS) (sold to EMC in 2005). Tags event data as Problem (P) or Symptom (S) and uses a causality graph model. Paper
c) End-to-end tracing: Uses tree-augmented naive Bayes to determine the resource-usage metrics most correlated with an anomalous period. [System slow -> high working-set data -> IE eating up memory due to an incorrect plugin, native memory leaks]
d) Microsoft Research NetMedic: NetMedic diagnoses problems by capturing dependencies between components and analyzing the joint behavior of those components in the past to estimate the likelihood of them impacting one another in the present, ranking them by likelihood. [This used Microsoft stack test data, Perfmon data, etc.]
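
To make the lag-correlation idea in (a) concrete, here is a toy PowerShell sketch (my illustration of the general technique, not Giza's code). It slides one per-minute event-count series against another and computes the Pearson correlation at each lag; the lag with the highest correlation hints at the causal delay.

# Pearson correlation between two equal-length series
function Get-Pearson {
    param([double[]]$X, [double[]]$Y)
    $mx = ($X | Measure-Object -Average).Average
    $my = ($Y | Measure-Object -Average).Average
    $num = 0.0; $dx = 0.0; $dy = 0.0
    for ($i = 0; $i -lt $X.Length; $i++) {
        $num += ($X[$i] - $mx) * ($Y[$i] - $my)
        $dx  += ($X[$i] - $mx) * ($X[$i] - $mx)
        $dy  += ($Y[$i] - $my) * ($Y[$i] - $my)
    }
    return $num / [math]::Sqrt($dx * $dy)
}

# Hypothetical event counts per minute for two sources; $b is $a shifted by one
$a = [double[]](1,0,3,5,2,0,1,4,6,2)
$b = [double[]](0,1,0,3,5,2,0,1,4,6)

# Correlation peaks at lag 1, suggesting source A leads source B by one step
foreach ($lag in 0..3) {
    $r = Get-Pearson $a[0..($a.Length - 1 - $lag)] $b[$lag..($b.Length - 1)]
    "lag $lag -> r = $([math]::Round($r, 2))"
}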

CAT5 Research:
CMU and AT&T Labs Research published an excellent paper on this topic. They call these events "chronics" – recurring, below-the-radar event data. They analyzed CDR (Call Detail Record) data across AT&T's VoIP infrastructure to detect below-the-radar events that were not captured by any triggers. They use a Bayesian distributed learning algorithm and then filter the resulting dataset using KL divergence. This is a novel approach: had they used just a Bayesian or learning algorithm, the resulting dataset would have been dominated by high-scoring events that reinforce the results, and any future event would be scored on historical data – exactly what you don't want when you are looking for oddball events. Instead, they recursively removed all CDR events with high scores using the KL divergence algorithm, until they were left with a handful of odd events. The full paper can be accessed here.
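
For flavor, here is a bare-bones KL divergence in PowerShell. This is my reading of the filtering idea, not the paper's implementation; P is the event-type distribution in the window under suspicion, Q the long-run baseline.

# D_KL(P||Q) = sum_i P[i] * ln(P[i]/Q[i]); zero entries are skipped here as a
# shortcut (strictly, P[i] > 0 with Q[i] = 0 makes the divergence infinite).
function Get-KLDivergence {
    param([double[]]$P, [double[]]$Q)
    $kl = 0.0
    for ($i = 0; $i -lt $P.Length; $i++) {
        if ($P[$i] -gt 0 -and $Q[$i] -gt 0) {
            $kl += $P[$i] * [math]::Log($P[$i] / $Q[$i])
        }
    }
    return $kl
}

# Identical distributions score 0; the further from the baseline, the higher
# the score. The chronics loop would repeatedly drop already-explained,
# high-scoring events and re-score, until only oddballs remain.
Get-KLDivergence -P 0.7, 0.2, 0.1 -Q 0.1, 0.2, 0.7   # noticeably > 0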

Other Challenges:
No commercial solution exists today that can find causal relationships across the stack. Splunk does an awesome job of collecting, parsing, tokenizing, and presenting data in a searchable form, but this type of analysis may not lend itself to a search-based approach. You can check whether (x) occurred before (y), do stats on that, and establish some sort of correlation, but there are issues with that approach.

  1. Correlation:
    1. You may not have the necessary volume of event data to establish a correlation. Remember, you cannot control how often a specific event is generated; you can only do analytics on the ones that are logged. Two occurrences of EventID 100, for example, are not enough to calculate anything.
    2. Correlation does not necessarily establish causality, but adding temporal data into the mix can help you identify the culprit.
    3. Different event logs present data in different formats. There is no single universal logging standard used by every vendor, from OS to hardware, applications, networks, and power.
  2. Heterogeneous datasets.
    1. Most algorithmic approaches address a homogeneous dataset: only call data, only network traces, only IP data. We are trying to walk up and down the stack, dealing with different log formats.
  3. Context.
    1. An IP address in Windows AD DS has a different context than an IP address in a BGP context or a switch context. Event log searches cannot distinguish between them.

CAUSALITY:
So that brings us to the next question: how do we establish causality from event data? Well, you can use one of the algorithms above to model a relationship, and then prove causality using instrumentation.
By instrumentation I mean writing scripts that reproduce the error, and watching for those errors to show up in your logging interface. You should be able to increase or decrease event generation by dialing your parameters up or down. The concept is similar to writing unit tests to detect failure: if your test scripts can't detect failure, you have a pass.

Thanks to PowerShell and its deep inspection capabilities via WMI and .NET, you may be able to reproduce the problem by writing a script for it.
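
Here is a rough sketch of what that instrumentation could look like. The source name and EventId are made-up values for illustration, registering the source requires elevation, and the check at the end is the unit-test analogy in script form.

# Register a synthetic event source once (elevation required).
if (-not [System.Diagnostics.EventLog]::SourceExists('CausalityTest')) {
    New-EventLog -LogName Application -Source 'CausalityTest'
}

# Dial event generation up or down by changing the range.
1..5 | ForEach-Object {
    Write-EventLog -LogName Application -Source 'CausalityTest' `
        -EventId 9001 -EntryType Warning -Message "Synthetic failure $_"
}

# The unit-test analogy: did the monitoring pipeline see what we injected?
$hits = @(Get-WinEvent -ErrorAction SilentlyContinue -FilterHashtable @{
    LogName = 'Application'; ProviderName = 'CausalityTest'; Id = 9001
    StartTime = (Get-Date).AddMinutes(-1)
})
if ($hits.Count -lt 5) { Write-Warning 'The pipeline missed injected events.' }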

End {}
From a data center analytics point of view, we need analytics software which can model temporal dependencies in CAT2–5 and provide a consistent alert mechanism for system administrators.

My original question was, "Can I see a storm brewing in my network infrastructure?" – and can I use that to get some sense of predictability?
I had just finished watching Twister some time ago, hence the flying cow reference. In the movie, Helen Hunt's and Bill Paxton's characters chase tornadoes with a tornado-modeling device called DOROTHY. So, if you are into modeling event data across stacks and you see a flying cow nearby – well, good luck 🙂

I hope someone finds this useful.

Get-EventLog via Protocol or via a Database?

What are the pros and cons of a protocol-backed vs. a database-backed event monitor?

Domain: PowerShell, IIS, event log monitor / notification system.

Motivation:
1. I don't want to receive event notifications via email. I am concerned with event delivery only, not event capture.
2. There are many ways of capturing events on PoshCode / the MSFT Script Repository.

Definitions:
Protocol-based event monitor – uses OData/Atom (or anything else) to poll event logs from a system X and display them anywhere else.

Database-backed event monitor – uses this flow: Event (ETW) -> DB -> UI (event-to-UI in milliseconds, 1 second max).
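
As a point of reference, here is a minimal pull-style sketch of the protocol flavor. It uses plain Get-WinEvent polling rather than OData/Atom, and 'SYSTEMX' is a placeholder; the point is the shape of the flow (poll, display, advance the watermark), not the transport.

# Poll a remote system every 10 seconds and display anything new.
$since = (Get-Date).AddMinutes(-5)
while ($true) {
    $events = Get-WinEvent -ComputerName 'SYSTEMX' -ErrorAction SilentlyContinue `
        -FilterHashtable @{ LogName = 'System'; StartTime = $since }
    if ($events) {
        $events | Select-Object TimeCreated, Id, ProviderName, Message |
            Format-Table -AutoSize
        # Advance the watermark so we don't re-display old events.
        $since = ($events | Measure-Object -Property TimeCreated -Maximum).Maximum
    }
    Start-Sleep -Seconds 10   # the poll interval is your latency/load dial
}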

Protocol:
PROS

1) You can query only what you want.
CONS
1) Slow / sluggish?
2) You need to convert events to a feed, then write a WCF service (or publish an application in IIS) to get started. [Maybe there is a better way, but I have only tested the IIS way so far.]
3) Susceptible to the fallacies of distributed computing.

Database:
PROS
1) If you choose your tools well, you can achieve a near-millisecond round trip from ETW to DB to UI. IIS doesn't figure in this.

CONS
1) You are forcing objects into columns and splitting them up, thereby losing the objects. But since you are capturing the whole event message (whatever is in the XML), does losing the objects matter?

Anything else?

Thanks, Zero Water.

Usually when you read a product or service review on a blog, it's about how XYZ sucks. Well, I had a pleasant experience with a company and wanted to blog about it.

Product purchased: Zero Water 23-cup dispenser, for $35 on Amazon.

We bought the Zero Water 23-cup water dispenser last year. We loved it from day one. It came with a tester which checks for dissolved solids in the water and gives you a score. We tested bottled water, Poland Spring, and tap water, and got the following scores.

  • Bottled water – 20–50
  • Poland Spring – 50
  • Other bottled water – 90–200 (some bottled water scores worse than tap water?!)
  • Tap water – 150–272
  • Zero Water – 0–3

If the tester shows a score greater than 6, it's time to replace the filter. The basic unit ships with one filter, and you can buy filters in bulk and save some money. The water from the Zero Water dispenser is tasty. (Yes, I said it.)

However, in May, the tap on the dispenser broke while I was holding it to pour some water. The spring came out and there was no way for me to put it back. I was going to buy another unit, but I thought I should at least give them a call first.

So I called the 800 number, spoke to a rep, and described the issue. She was apologetic and said they would send a replacement unit. I was astounded that they were so accommodating. I asked if I was expected to pay, since I wasn't sure about the warranty, but they told me I wouldn't be charged. I gave them my street address and the new unit arrived in a week.

Two weeks ago, the tap broke again, this time without any user intervention. So I called them again and explained the problem. The rep confessed that the fault lay with them: there was an engineering problem with the tap design.

I thought: did they just accept responsibility for a faulty tap design? I have heard apologies, but this goes a step further. They had my street address on file, and I didn't have to give it a second time. She said they would expedite the replacement unit if possible. So by this point, they had sent me two free replacements for a product that broke through no apparent fault of mine.

What happened next should be a lesson in customer service.

I was driving back from work early this week and got a call from an unfamiliar 800 number. It was someone from Zero Water calling to let me know that the specific product I had asked for was back-ordered.

How many times have you received a callback from a company about something you bought at retail? Probably zero. These guys were calling me back to let me know the product was back-ordered; I was really impressed.

Then the rep said, "I am sorry for your troubles. Can I ship you a replacement unit for free? We will expedite this, so that you have something at home for clean drinking water while you wait for your product to arrive."

I was blown away by their customer service. I cannot recall any experience I have had with any other company that matches this.

As an enterprise customer, maybe I have – but in retail, these guys are the kings as far as customer service is concerned. BTW, the clean, awesome-tasting water helps too. 🙂

The replacement pitcher arrived today. Attached price list: $0.00. Not a refurb, but a brand-new sealed box. The replacement unit was an 8-cup pitcher.

So Zero Water, you have won yourself a lifetime customer.

Every element of the company – operations, customer service, and product – came together to produce an awesome user experience.

Disclaimer: I don't work for Zero Water, nor am I affiliated with it in any way.

Advice for someone graduating from high school: Focus on the Oyster.

I had a conversation today with a friend who is graduating from high school. Among other things, he had some questions about choosing computing as a career option.

Some background is necessary. He is really good at high-school-level math and science, and one of the options he is considering is a career in engineering. He is not sure which branch – computing / EE / mechanical / chemical / civil – and wanted to know what I think. He also had a few questions about computing specifically:
a) Every year there are new languages, packages, and applications released. Do I really need to study and "keep up"? (His emphasis, not mine.)
b) Most people I know in computing work long hours. Do I need to work 12–14 hours a day and sacrifice my personal life?
c) Lots of people I know graduate in history, philosophy, etc., and then move over to computing. Should I try that?

First things first:
I am really honored that he sought my advice on such an important decision. Needless to say, decisions like this are not to be taken lightly. I am going to attempt to answer the specific questions raised here, and then give the reasons why I decided to go into computing.

Specific answers:
a) Every year there are new languages, packages, and applications released. Do I really need to study and "keep up"? (His emphasis, not mine.)
ANSWER: That depends on how you look at it. Some will use this as an example of constant innovation (which is good?) in computing, but I will try to reason differently.
My logic is more utilitarian than anything to do with innovation. Twenty years ago, if you were trying to create any distributed application in C/C++, you had to write a huge amount of scaffolding, not to mention keep track of garbage cleanup, database connections, persistence, etc. Modern languages do the "plumbing" and "scaffolding" for you so that you can focus on your business logic.
Also, people realized that you cannot have one programming language do everything. Some languages are better suited to certain tasks than others, whether through abilities in the language itself or the availability of libraries. We don't have one way of doing things; we have perspectives, and we use the appropriate tools (languages, databases, and frameworks) to actualize those perspectives.

At the end of the day, you don't learn every day because you have to, or because you will fall behind. You learn every day because every day brings joy and wonder as you collect these pebbles on the seashore of knowledge.

b) Most people I know in computing work long hours. Do I need to work 12–14 hours a day and sacrifice my personal life?
ANSWER: I don't think people work long hours in computing anymore. You can finish your job quickly as long as you plan and schedule things, are realistic about your assumptions, and factor in the external factors that affect you. The idea that you can go 30 hours straight and come up with some whiz-bang thing at the end of it is not scalable. There is a good chance you will end up hurting the project rather than helping it.

c) Lots of people I know graduate in history, philosophy, etc., and then move over to computing. Should I try that?
ANSWER: I am not sure how to answer this. I do know people who graduated in something other than computing and moved into it for various reasons. As long as you do it for the love of the game, that's good. If you are switching to computing because there are more jobs and money in this field than elsewhere – well, as long as you can keep your sanity and don't feel overwhelmed, I guess that's good too. I have a hard time judging the reasoning behind my own actions, let alone evaluating others'.

My reasons for getting into computing:
I was always uncomfortable with the decision-making process of "some" people. I believed too much was left to individual discretion. There were instances when I believed I had finished a task to completion, but some saw it differently. I had an issue with the way they applied discretion, and I wanted a single objective mechanism: do a certain task and be evaluated objectively with a Yes (good) or No (not good). I am intentionally keeping this simplistic to avoid going off on a tangent; I understand that most decisions cannot be evaluated as a yes/no. I was a cocky, weird kid with a mind of my own about almost everything, and if I followed the herd, I had my own qualms about following without knowing, and about the consequences of my own actions.

And then, sometime during college, I found this post on Phrack written by ++The Mentor++.
It was an eye-opener of sorts. Especially this part:

I made a discovery today.  I found a computer.  Wait a second, this is cool.  It does what I want it to.  If it makes a mistake, it’s because I screwed it up.  Not because it doesn’t like me…
Or feels threatened by me…
Or thinks I’m a smart ass…
Or doesn’t like teaching and shouldn’t be here…
Damn kid.  All he does is play games.  They’re all alike.

And then it happened… a door opened to a world… rushing through the phone line like heroin through an addict’s veins, an electronic pulse is sent out, a refuge from the day-to-day incompetencies is sought… a board is found.
“This is it… this is where I belong…”
I know everyone here… even if I’ve never met them, never talked to them, may never hear from them again… I know you all…

Damn kid.  Tying up the phone line again.  They’re all alike…

I had dabbled in programming since school and did a bunch of weird stuff. I didn't know what I was getting into, but that paragraph resonated with me more than anything I had seen, heard, or read up to that point in my life.

I was out of control after that.
I read ESR's www.Catb.org/~esr, HOWTOs, Phrack, and Alt2600, and burnt all my pocket money doing it. I remember wget-ting Jakob Nielsen's whole site on usability and going through it overnight. (I am still not sure why I did that, but I loved his line of reasoning.) I don't know if I learned anything tangible – a language, a program, or a database – but the philosophy has stayed with me ever since. (Some people may call that tangible 🙂)

I tried getting into a computer science course, and when I couldn't, I took up physics, thinking it was the easiest thing to graduate in – and that I could still do my computer stuff in my spare time. (There were other reasons to take up physics; Richard Feynman's biography by James Gleick played a far more important role in that decision.)

I studied everything I could lay my hands on.
Funny story: I had a BSD/System V manual, but no Unix terminal or computer to practice on. In those days I would print out the man pages on a dot-matrix printer at the diploma school I was attending, and write my code by hand on the back of the pages. That practice enforced a certain discipline: I had a two-hour window to type, compile, and test my code on a SCO Unix machine, and I didn't want to waste it typing gibberish and wondering what was going on. I miss that. I think I was better at programming in those days, without IDEs, IntelliSense, debuggers, and object reference guides. It was just: write by hand, inspect the logic, check the loops, then type and compile.
School-level stuff, but I loved it.

My key argument in favor of computing:
The things you can do with computers are limitless. Apart from the oft-repeated point that everything runs on computers, I have an example of my own.

We eat out often, and we pack a lot of stuff "to go" and keep it in the fridge. Two weeks later, while cleaning the fridge, I find a really sorry-looking pie which I was supposed to eat – well, two weeks ago. I would like a program or device that alerts me when things in the fridge are about to go bad, so that I can eat them in time instead of throwing them away.

My friend had a use case too: the same program or device could keep track of all the expired pills in the medicine cabinet. What a great idea!

So we have two use cases, but no program or device that can accomplish them. This is my prime example of why we need innovation in computing. I am not sure if this is feasible, or how I would do it. Should I take a picture of the fridge and do object matching based on shapes? That would mean keeping a large database of the shapes of consumer objects; there might be a pattern-recognition library you could run against Amazon's product database to build that "shape" database.
Or maybe you could have a tablet-type device that stays in the fridge and takes barcode scans (won't work for cooked food – and do we even have a tablet that can function in a fridge? What about the power supply?).
Or maybe you could look at the Arduino marketplace for similar offerings, in combination with searching for alternatives in your favorite search engine.
Maybe someone in the refrigeration industry has already thought of this and is beta-testing it somewhere.

The possibilities are limitless, and there is huge scope for creating something that has not existed before. How cool is that?

My most important reason:
The joy you get from watching someone use something you created – and if they love it and give you feedback, what else do you want in life?
I cannot design a bridge or a building and get instant feedback on how it's doing. I cannot do things with chemicals without a large lab. The same goes for other branches of engineering: tools are required, and I may not have them at my disposal when I want them.
But I can whip up some code, run it, and get results – all within 15 minutes, if I am good. (Emphasis on "if I am good".) If I am not, the truth machine (the debugger) tells me what's wrong.

Why choosing a career that you love is important:
It takes a tremendous amount of energy to get out of bed, get ready, do the commute, and get to your desk. If you don't love what you do, your mind will soon start rejecting your job, and you will see the result in your own actions. You really need to find the stuff that makes you tick, that makes you happy – for lack of a better word, stuff that's cool and awesome, that makes work feel like play.

It's not just a BlackBerry ad: love what you do.

The other day I wrote a PowerShell script and watched the calls being made from PowerShell get logged in the debugger.

Man, it was awesome!!
It was a crappy script, but it was my script.
Every time someone passed a wrong argument, I felt like telling them: DUDE!!! Let me help you.

I didn't do any programming after I graduated from college. I did a bunch of other stuff, and then went into system administration.

I am starting fresh with PowerShell, and I feel all that excitement coming back.

The world is my Oyster.

PS: Thanks for reading :-). It’s rather long.

How to get the Don Jones PowerShell v3 book real cheap

If you have been toying with the idea of buying the Ebook+MEAP+printed-book combo for $49.99, let me bring some additional information to the table so that you can make up your mind.

From the Manning store here, the price for the book is:

MEAP + Ebook only – $39.99
MEAP + Print book (includes Ebook) when available – $49.99

I received an email alert from Manning today which goes like this:
37% off orders under $50 – use code a2637
42% off orders over $50 – use code a2642
50% off orders over $100 – use code a2650
Offer applies to your entire purchase – eBook, pBook, or MEAP. Expires May 2. Only at manning.com.

I got my $50 combo for $31.49 ($49.99 × 0.63, using the 37%-off code, since the order is under $50). That's cheaper than the MEAP+Ebook price alone.

Do you really want to sit this one out?

An Ode to Richard Thaler (or, How to Make a Book Stand for 300-Page Books)

After much grief to my back, I decided to finally buy a book stand.
I did some research on Amazon and selected this one by Best Book Stand & Holder.

But it's 2:30 AM, and I need something now so that I can finish this chapter.

Imagination to the rescue.

I scanned the room and saw an Amazon box, pulled out a pair of scissors, and made this.

This box book-holder fits small books easily. I tested it with Richard Thaler's Winner's Curse.

I think the test case was appropriate for the occasion. After 15 minutes of research on Amazon, you find the perfect book stand for $30 (plus $10 shipping) which weighs 4 lbs.

Five minutes later, you grow so impatient that you decide to make your own.