Flying Cows: On Event Log analysis and Causal Relationships

Background:
For the last few months I have been doing some research on event log analysis. The whole thing started with a simple idea: say we have an EventID, e.g. 15001 from MsExchangeTransport. Can we drill down and trace what caused it? You have the event data at a point in time X, so you should be able to search back through a tree of events.

I was expecting answers along the lines of event chains: EventID 15001 + MsExchangeTransport -> EventID 1111 + SourceA [TimeStamp] -> EventID 2222 + SourceB [TimeStamp].
Well, the situation is a bit more complicated than that.

  1. Event IDs are tied to the event source. Different sources can have the same Event IDs in the Microsoft EventID namespace, and application developers can write to event logs with their own custom Event IDs.
  2. Going through event logs is like finding a needle in a haystack. There are filters for source and Event ID, but they don't help much. Also, event logs are localized. You would need log analysis software that collects logs across your server stack and network layers to model some type of relationship.
    Just to give a sense of how many providers can write to your event log, type this at a command prompt: wevtutil el
    Now write a script to run that across your server farm and use Compare-Object to get a list of all the providers that can write to your Windows event logs (a rough sketch follows this list).
  3. All the indicators for a system under pressure may not be captured in event logs, even on systems with cranked-up diagnostic logging. You may need to look at Perfmon data, Windows log files, syslog data and events from your networking stack (Cisco NetFlow / OpenFlow). If you have a virtualized environment, you add more layers of events: storage stack data, hypervisor data, etc.
  4. Event IDs by themselves carry no intelligence about their own severity. You may need an external reference data-set, or a codebook, which provides guidance on whether a particular event is severe or can be ignored. Think http://support.microsoft.com/KB/ID
  5. Event IDs carry temporal data, meaning every logged event has a time-stamp attached to it. This can be used to plot events on a time series, as the Splunk search app does.
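Here is a rough sketch of the provider comparison from point 2 above. The server names are placeholders and PowerShell remoting is assumed; note that wevtutil el enumerates registered log names, while wevtutil ep lists the publishers if you want providers specifically.

```powershell
# Rough sketch: enumerate the event logs/providers registered on a set of servers
# and diff them. Server names are placeholders; PSRemoting is assumed.
$servers = 'SERVER01', 'SERVER02'

$inventory = foreach ($server in $servers) {
    Invoke-Command -ComputerName $server -ScriptBlock { wevtutil el } |
        ForEach-Object { [pscustomobject]@{ Server = $server; LogName = $_ } }
}

# Log names (or providers, if you swap in 'wevtutil ep') present on one box but not the other
Compare-Object `
    -ReferenceObject  ($inventory | Where-Object Server -eq 'SERVER01').LogName `
    -DifferenceObject ($inventory | Where-Object Server -eq 'SERVER02').LogName
```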

CLASSIFICATION:
So how would you know if your infrastructure is in trouble?
Typically, you receive an alert for an event you are watching [e.g. an EventID, "disk space", "network layer alerts", "storage alerts"], and it's entirely up to the system administrator's technical ability to figure out the root cause. Finding a root cause may not help in cases where mutually exclusive events are generated at different layers of the stack and combine into a storm. Also, you may have concurrent root causes.

I want to classify the different categories of events we are trying to analyze:

  1. CAT1: Outage events.
  2. CAT2: General errors and warnings related to a server-role, or an application in the core-stack.
  3. CAT3: Other application events
  4. CAT4: Events triggered by end-user action.
  5. CAT5: Rare one-off events which are critical, but may not trigger an alert.

[Core-Stack]

  1. Core OS
  2. Core Infra @{AD, DNS, DHCP, IPAM, Server-Role @{TS,Hyper-V,ADRMS,ADFS}}
  3. Tier-1 Applications:
    1. Exchange @{DB, Transport}
    2. IIS @{}
    3. SQL @{}
    4. Sharepoint @{}
    5. Network Layer @{}
    6. Storage Layer @{}
    7. Virtualization layer @{}
    8. Security Layer @{}

Let me explain the categories a little bit.
1. Cat1: Outage events.
By definition, you can't plan for or do analytics on outage events. The best you can hope for in this situation is to bake redundancy into your infrastructure: dual power supplies, dual WAN links, disaster recovery/BCP, an alternate data center and WAN failover come to mind. The idea is that if your hardware or the supporting infrastructure is under pressure, you move the workload to a stable pool.
CAT1 events by their very nature are unpredictable and random, and are usually caused by failures outside the measurable category.
(Ross Smith IV gave an example during his Failure Domain presentation where a tree-cutting service chopped off the dual WAN links to a data center.)
2. Cat2: Core stack.
This is the space where system administrators spend most of their time. Core-stack and application events also cover close to 90% of the logs generated, by volume. Event data in this category may lend itself to pattern analysis, and I am going to discuss some of the options further down.
3. Cat3: Application specific (desktop, web-based or apps)
Application-specific events from a non-Microsoft vendor, or from an internally developed application.
4. CAT4: User stack
You can investigate log data from the clients connecting to your core infrastructure and try to find patterns and causality.
E.g. email stuck in the Outbox affects the Exchange subsystem; changing Outlook views does too.
A user watching streaming video in Internet Explorer during the Olympics affects the VDI infrastructure.
5. CAT5: Rare chronic events
Rare, one-off events which may or may not be critical and do not trigger an alert.

Existing research:
I am going to discuss some of the existing research in CAT2-4 and CAT5. My initial thought going into this was, "Surely I am not reinventing the wheel here. Someone else must have faced similar problems and done some research." Well, they did.

CAT2/CAT3/CAT4 Research:
a) Multivariate analysis techniques: UT Austin and AT&T Labs Research published a paper on a diagnosis tool (Giza) they developed to diagnose events in AT&T's IPTV infrastructure. Giza was used to trace events across the network stack, from the SHO to the DSLAM to the set-top box. It uses a hierarchical heavy-hitter detection algorithm to identify spatial locations, then applies statistical event correlation to identify event series that are strongly correlated, and finally applies statistical lag correlation to discover causal dependencies (a small sketch of the lag-correlation idea follows this list). According to the authors, Giza scores better than WISE on the AT&T test data. Giza also has the advantage of traversing the stack, collecting logs in different formats across devices, and using that to model causality.
b) Coding approach: Implemented in the SMARTS Event Management System (SEMS) (sold to EMC in 2005). It tags event data as (P) problem or (S) symptom and uses a causality graph model. Paper
c) End-to-end tracing: Uses tree-augmented naive Bayes to determine which resource-usage metrics are most correlated with an anomalous period. [System slow -> high working-set data -> IE eating up memory due to a faulty plugin, native memory leaks]
d) Microsoft Research NetMedic: NetMedic diagnoses problems by capturing dependencies between components, analyzing the joint behavior of those components in the past to estimate the likelihood of them impacting one another in the present, and ranking them by that likelihood. [This used Microsoft stack test data, Perfmon data, etc.]
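To make the lag-correlation step in (a) concrete, here is a small sketch of the general idea (my own illustration in PowerShell, not code from the Giza paper): take two per-interval event-count series, shift one by increasing lags, and compute a Pearson correlation at each lag. A strong peak at a positive lag hints that the first series tends to lead the second in time.

```powershell
# Sketch of lag correlation between two event-count series (e.g. events per minute).
# The series at the bottom are made-up numbers for illustration.
function Get-PearsonCorrelation {
    param([double[]]$X, [double[]]$Y)
    $n = [Math]::Min($X.Count, $Y.Count)
    $meanX = ($X[0..($n-1)] | Measure-Object -Average).Average
    $meanY = ($Y[0..($n-1)] | Measure-Object -Average).Average
    $num = 0.0; $dx2 = 0.0; $dy2 = 0.0
    for ($i = 0; $i -lt $n; $i++) {
        $dx = $X[$i] - $meanX; $dy = $Y[$i] - $meanY
        $num += $dx * $dy; $dx2 += $dx * $dx; $dy2 += $dy * $dy
    }
    if ($dx2 -eq 0 -or $dy2 -eq 0) { return 0 }
    return $num / [Math]::Sqrt($dx2 * $dy2)
}

function Get-LagCorrelation {
    param([double[]]$A, [double[]]$B, [int]$MaxLag = 10)
    # Pairs A at time t with B at time t+lag; a correlation peak at a positive lag
    # suggests changes in A tend to precede changes in B.
    foreach ($lag in 0..$MaxLag) {
        $a = $A[0..($A.Count - 1 - $lag)]
        $b = $B[$lag..($B.Count - 1)]
        [pscustomobject]@{ Lag = $lag; Correlation = (Get-PearsonCorrelation -X $a -Y $b) }
    }
}

# Hypothetical usage: per-minute counts of disk warnings vs. application errors
$diskWarnings = 0,0,3,8,9,2,0,1,0,0
$appErrors    = 0,0,0,2,7,9,3,1,0,0
Get-LagCorrelation -A $diskWarnings -B $appErrors -MaxLag 4 | Sort-Object Correlation -Descending
```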

CAT5 Research:
CMU and AT&T Labs Research published an excellent paper on this topic. They call these events chronics: recurring, below-the-radar event data. They analyzed CDR (Call Detail Record) data across AT&T's VoIP infrastructure to detect below-the-radar events which were not captured by any triggers. They use a Bayesian distributed learning algorithm and then filter the resulting dataset using KL divergence. This is a novel approach. Had they used just a Bayesian or learning algorithm, the resulting data-set would contain events with high scores which reinforce the results, and any future events would be scored against historical data, which you don't want when you are trying to find oddball events. Instead, they recursively removed the CDR events with high scores using the KL divergence measure, until they were left with a handful of odd events. The full paper can be accessed here.
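As a rough illustration of just the KL-divergence filtering idea (not the paper's actual algorithm), the sketch below compares the distribution of events by provider in a suspect window against a historical baseline; a large divergence suggests the window contains unusual activity. The count tables are hypothetical inputs you would build from your own logs.

```powershell
# Minimal KL-divergence sketch over event counts keyed by provider (or EventID).
# The input hashtables are hypothetical; build them from your own log data.
function Get-KLDivergence {
    param([hashtable]$Observed, [hashtable]$Baseline)

    $totalP  = ($Observed.Values | Measure-Object -Sum).Sum
    $totalQ  = ($Baseline.Values | Measure-Object -Sum).Sum
    $epsilon = 1e-9   # stand-in probability for providers missing from the baseline
    $kl = 0.0

    foreach ($key in $Observed.Keys) {
        $p = $Observed[$key] / $totalP
        $q = if ($Baseline.ContainsKey($key)) { $Baseline[$key] / $totalQ } else { $epsilon }
        $kl += $p * [Math]::Log($p / $q)
    }
    return $kl
}

# Hypothetical usage: counts per provider during a suspect hour vs. a 30-day baseline
$suspectHour = @{ MSExchangeTransport = 40;   Schannel = 5;   OddballProvider = 12 }
$baseline    = @{ MSExchangeTransport = 4000; Schannel = 600 }
Get-KLDivergence -Observed $suspectHour -Baseline $baseline
```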

Other Challenges:
No commercial solution exists as of this writing that can find causal relationships across the stack. Splunk does an awesome job of collecting, parsing, tokenizing and presenting the data in a searchable form, but this type of analysis may not lend itself to a search-based approach. You can check whether (x) occurred before (y), do stats on that and establish some sort of correlation, but there are some issues with that approach.

  1. Correlation:
    1. You may not have the necessary volume of event data to establish a correlation. Remember, you cannot control how often a specific event is generated; you can only do analytics on the ones that are logged. Two occurrences of EventID 100, for example, are not enough to calculate anything.
    2. Correlation does not necessarily establish causality, but adding temporal data into the mix can help you identify the culprit.
    3. Different event logs present data in different formats. There is no common, universal logging standard used by every vendor, from OS to hardware, applications, networking and power.
  2. Heterogeneous data-set.
    1. Most algorithmic approaches address a homogeneous data-set: only call data, only network traces, only IP data. We are trying to walk up and down the stack, dealing with different log formats.
  3. Context.
    1. An IP address in Windows AD DS has a different context than an IP address in a BGP or switch context. Event log searches cannot distinguish between these contexts.

CAUSALITY:
So, that brings us to the next question: how do we establish causality from event data? You can use one of the algorithms above to model a relationship, and then prove causality using instrumentation.
By instrumentation I mean writing scripts which reproduce the error and watching for those errors to show up in your logging interface. You should be able to increase or decrease event generation by dialing your parameters up or down. The concept is similar to writing unit tests to detect failure: if your test scripts can't detect the failure, you have a pass.

Thanks to PowerShell and its deep inspection capabilities via WMI and .NET, you may be able to reproduce the problem by writing a script for it.
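Here is a minimal sketch of that instrumentation idea, using only the built-in event log cmdlets: register a throwaway test source, write a burst of synthetic events, then query for them the same way your monitoring would. The source name and event ID are made up for the test, not real product events.

```powershell
# Hypothetical instrumentation sketch: generate a known event on demand, then
# confirm it shows up in the log you are watching. Requires elevation to register
# the (made-up) test source.
$source  = 'EventCausalityTest'
$eventId = 999

if (-not [System.Diagnostics.EventLog]::SourceExists($source)) {
    New-EventLog -LogName Application -Source $source
}

# Dial event generation up or down by changing the range below.
1..5 | ForEach-Object {
    Write-EventLog -LogName Application -Source $source -EventId $eventId `
        -EntryType Warning -Message "Synthetic event $_ for causality testing"
}

# Watch for the events to surface, the same way your alerting would.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = $source; Id = $eventId } -MaxEvents 10 |
    Select-Object TimeCreated, Id, Message
```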

End {}
From a data-center analytics point of view, we need analytics software which can model temporal dependencies in CAT2-5 and provide a consistent alert mechanism for system administrators.

My original question was, "Can I see a storm brewing in my network infrastructure?", and whether I could use that to get some sense of predictability.
I had just finished watching Twister some time ago, hence the flying-cow reference. In the movie, Helen Hunt's and Bill Paxton's characters chase tornadoes with a tornado-modeling device called DOROTHY. So, if you are into modeling event data across stacks and you see a flying cow nearby, well, good luck 🙂

I hope someone finds this useful.


Bryan Cantrill at QCon

Every Windows system admin should watch this video: Debugging Production Systems.

The video is all about failures, transient failures and post-mortem debugging. It's amazing how he covers the history, the NTSB, disaster porn, and makes it all come together in a debugger for Node.js. It gives you a perspective on how to approach debugging production systems.

The good news is that Windows already has good debugging tools and symbol files, so you can trace an issue down to the module/function level.

I also found this blog on replay debugging using VMware Workstation.

Golden Rule: There are no one-off errors in production.

If it happened once, there's a good chance it can happen again. It might be a good idea to get a full dump or a VM snapshot and start WinDbg.

PS: You should check out Bryan Cantrill’s answer in Google Groups here.