EvilMonkey-Part 1: Trim Working Set of a running process.

I have never had so much fun with 5 lines of Powershell code.

Remember those pesky situations when a process runs off and chews up its working set? PowerShell to the rescue. This small five-line PowerShell script will trim the working-set data. I have used the (-1, -1) values to trim as much as possible.

What happens when you trim working-set data? From MSDN:

The working set of a process is the set of memory pages in the virtual address space of the process that are currently resident in physical memory. These pages are available for an application to use without triggering a page fault. The minimum and maximum working set sizes affect the virtual memory paging behavior of a process.
The working set of the specified process can be emptied by specifying the value (SIZE_T)-1 for both the minimum and maximum working set sizes. This removes as many pages as possible from the working set.
The operating system allocates working set sizes on a first-come, first-served basis. For example, if an application successfully sets 40 megabytes as its minimum working set size on a 64-megabyte system, and a second application requests a 40-megabyte working set size, the operating system denies the second application’s request.

I have primarily used this script to trim the working-set values of a browser. I have tested it with Firefox, Internet Explorer (iexplore) and Chrome. The script works with no tab crashes. You can keep a Process Explorer / Task Manager / Process Hacker window open and watch the trimmed PID values.

Trimming the working set doesn’t mean the values won’t climb back to their previous numbers; I have seen some IE windows return to similar figures. However, this script is really useful for fixing a stuck-browser or frozen-screen situation.

Linus said, “Talk is cheap. Show me the code.” So here it is. All 5 lines of goodness.
http://poshcode.org/3653
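
In case the link is hard to reach, here is a minimal sketch of the same idea (my sketch, not the original poshcode script): it P/Invokes SetProcessWorkingSetSize and passes (-1, -1); $processName is just an example target.

# Minimal sketch, not the original script: empty the working set of every matching
# process by calling SetProcessWorkingSetSize with (SIZE_T)-1 for both min and max.
$signature = @'
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool SetProcessWorkingSetSize(IntPtr hProcess, IntPtr dwMinimumWorkingSetSize, IntPtr dwMaximumWorkingSetSize);
'@
$kernel32 = Add-Type -MemberDefinition $signature -Name 'Kernel32Trim' -Namespace 'Win32' -PassThru

$processName = 'firefox'   # example target; change as needed
Get-Process -Name $processName -ErrorAction SilentlyContinue | ForEach-Object {
    # -1 for both sizes asks the OS to remove as many pages as possible
    [void]$kernel32::SetProcessWorkingSetSize($_.Handle, [IntPtr](-1), [IntPtr](-1))
    "Trimmed working set of PID $($_.Id) ($($_.ProcessName))"
}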

Warning Notice:
***********READ CAREFULLY***********
!!!! Do not use in production environment before thoroughly testing and understanding the script. !!!!
Do not use this to trim the working set of SQL databases; there will be data loss.
Do not use this to trim the working set of a Microsoft Exchange database process (store.exe).
I haven’t tried this on the IIS worker process (w3wp.exe).
More NEGATIVE effects of trimming working-set are listed here in detail.


Flying Cows: On Event Log analysis and Causal Relationships

[Image: flying cow]
Background:
For the last few months I have been doing some research on event log analysis. The whole thing started with this simple idea: let’s say we have an EventID, e.g. 15001 from MsExchangeTransport. Can we drill down and trace what caused it? You have the present event data at point X in time, so you should be able to search back through a tree of events.

I was expecting answers along the lines of an event chain – EventID: 15001 + MsExchangeTransport -> EventID: 1111 + SourceA [TimeStamp] -> EventID: 2222 + SourceB [TimeStamp].
Well, the situation is a bit more complicated than that.

  1. Event IDs are tied to the event source. Different sources can have the same event IDs in the Microsoft EventID namespace, and application developers can write to event logs with their own custom event IDs.
  2. Going through event logs is like finding a needle in a haystack. There are filters for source and event ID, but they don’t help much. Also, event logs are localized. You would need log-analysis software that collects logs across your server stack and network layers to model some type of relationship.
    Just to give a sense of how many providers can write to your event log, try this:
    Type this at a command prompt – wevtutil el
    Now write a script to run this across your server farm and use Compare-Object to get a list of all providers that can write to Windows event logs (see the sketch after this list).
  3. All indicators for a system under pressure may not be captured in event logs, even on systems with cranked-up diagnostic logging. You may need to look at Perfmon data, Windows log files, syslog data and events from your networking stack (Cisco NetFlow / OpenFlow). If you have a virtualized environment, you add more layers of events: storage-stack data, hypervisor data, etc.
  4. Event IDs by themselves carry no notion of their own severity. You may need an external reference data set or a codebook that tells you whether a particular event is severe or can be ignored. Think http://support.microsoft.com/KB/ID
  5. Event IDs carry temporal data, meaning every event has a time stamp attached to it. This can be used for plotting events on a time series, as the Splunk search app does.
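
Here is the sketch referred to in point 2 above – a rough example under assumed names (SERVER01 and SERVER02 are placeholders):

# Rough sketch (SERVER01/SERVER02 are placeholders): pull the event-log list from
# two servers and diff them. Note: 'wevtutil el' enumerates logs; use 'wevtutil ep'
# if you want the registered publishers (providers) instead.
$listA = wevtutil el /r:SERVER01
$listB = wevtutil el /r:SERVER02
# Entries present on one server but not the other
Compare-Object -ReferenceObject $listA -DifferenceObject $listB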

CLASSIFICATION:
So how would you know if your infrastructure is in trouble?
Typically, you would receive an alert for an event you are watching [e.g. an EventID, “disk space”, “network layer alerts”, “storage alerts”], and it’s entirely up to the system admin’s technical ability to figure out the root cause. Finding a root cause may not help in certain cases, when mutually exclusive events are generated at different layers of the stack and create a storm. Also, you may have concurrent root causes.

I want to classify the different categories of events we are trying to analyze:

  1. CAT1: Outage events.
  2. CAT2: General errors and warnings related to a server-role, or an application in the core-stack.
  3. CAT3: Other application events.
  4. CAT4: Events triggered by end-user action.
  5. CAT5: Rare one-off events which are critical but may not trigger an alert.

[Core-Stack]

  1. Core OS
  2. Core Infra @{AD, DNS, DHCP, IPAM, Server-Role @{TS,Hyper-V,ADRMS,ADFS}}
  3. Tier-1 Applications:
    1. Exchange @{DB, Transport}
    2. IIS @{}
    3. SQL @{}
    4. Sharepoint @{}
    5. Network Layer @{}
    6. Storage Layer @{}
    7. Virtualization layer @{}
    8. Security Layer @{}

Let me explain the categories a little bit.
1. Cat1: Outage events.
By definition, you can’t plan for or do analytics on outage events. The best you can hope for in this situation is to bake redundancy into your infrastructure – dual power supplies, dual WAN, Disaster Recovery/BCP, an alternate data center and WAN failovers come to mind. The idea is: if your hardware or the supporting CAT1 infrastructure is under pressure, move it to a stable pool.
CAT1 events by their very nature are unpredictable and random, and are usually caused by failures outside the measurable category.
(Ross Smith IV gave an example during his Failure Domain presentation where a tree-cutting service chopped off the dual WAN link to a data center.)
2. Cat2: Core stack.
This is the space where system administrators spend most of their time. Core-stack plus application events also cover close to 90% of the logs generated, by volume. Event data in this category may lend itself to pattern analysis, and I am going to discuss some of the options down the line.
3. Cat3: Application specific – desktop, web-based or apps.
Application-specific events from a non-Microsoft vendor, or from an internally developed application.
4. CAT4: User stack.
You can investigate the client log data connecting to your core infrastructure and try to find patterns and causality.
E.g., email stuck in the Outbox affects the Exchange subsystem; changing Outlook views does too.
A user watching streaming video in Internet Explorer during the Olympics affects the VDI infrastructure.
5. CAT5: Rare chronic events.
Rare one-off events which may or may not be critical, and do not trigger an alert.

Categories explained:
I am going to discuss some of the existing research in CAT [2-4] and CAT5. My initial thought going into this was, “Surely I am not reinventing the wheel here. Someone else must have faced similar problems and they must have done some research.” Well they did.

CAT2/CAT3/CAT4 Research:
a) Multi-variate analysis techniques: UT Austin and AT&T Labs Research published a paper on a diagnosis tool (GIZA) they developed to diagnose events in the AT&T IPTV infrastructure. Giza was used to trace events across the network stack from the SHO to the DSLAM to the set-top box. Giza used a hierarchical heavy-hitter detection algorithm to identify spatial locations, then applied statistical event correlation to identify event series that are strongly correlated, and finally applied statistical lag-correlation techniques to discover causal dependencies. According to the authors, Giza scores better than WISE on the AT&T test data. Giza also has the advantage of traversing the stack, collecting logs in different formats across devices and using them to model causality.
b) Coding approach: Implemented in SMARTS Event Management Systems (SEMS), sold to EMC in 2005. It tags event data as (P) Problem or (S) Symptom and uses a causality graph model. Paper
c) End-to-end tracing: Uses tree-augmented naïve Bayes to determine the resource-usage metrics most correlated with an anomalous period. [System slow -> high working-set data -> IE eating up memory due to an incorrect plugin, native memory leaks]
d) Microsoft Research NetMedic: NetMedic diagnoses problems by capturing dependencies between components and analyzing the joint behavior of these components in the past to estimate the likelihood of them impacting one another in the present, ranking them by likelihood. [This used Microsoft stack test data, Perfmon data, etc.]

CAT5 Research:
CMU and AT&T Labs Research published an excellent paper on this topic. They call these events chronics – the recurring, below-the-radar event data. They analyzed CDR (Call Detail Record) data across the AT&T VoIP infrastructure to detect below-the-radar events which were not captured by any triggers. They use a Bayesian distributed learning algorithm and then filter the resulting dataset using KL divergence. This is a novel approach: had they used just a Bayesian or learning algorithm, the resulting data set would contain events with high scores that reinforce the results, and any future events would be scored on historical data – which you don’t want when you are trying to find oddball events. They recursively removed all CDR events with high scores using the KL-divergence algorithm until they were left with a handful of odd events. The full paper can be accessed here.
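
For reference (my notation, not the paper’s), the KL divergence between two distributions P and Q is D_KL(P || Q) = Σ_i P(i) · log( P(i) / Q(i) ); it is zero only when P and Q agree, which is why recursively removing the high-scoring events leaves only the odd ones behind.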

Other Challenges:
No commercial solution exists as of this writing that can find causal relationships across the stack. Splunk does an awesome job of collecting, parsing, tokenizing and presenting the data in a searchable form, but this type of analysis may not lend itself to a search-based approach. You can check whether (x) occurred before (y), do stats on that and establish some sort of correlation (a toy sketch of this follows the list below), but there are some issues with that approach.

  1. Correlation:
    1. You may not have the necessary volume of event data to establish a correlation. Remember, you cannot control how a specific event is generated; you can only do analytics on the ones that are logged. For example, two occurrences of EventID 100 are not enough to calculate this.
    2. Correlation does not necessarily establish causality, but adding temporal data into the mix can help you in identifying the culprit.
    3. Different event logs present data in different formats. There is no common, universal logging standard used by every vendor – from OS to hardware, applications to networks to power.
  2. Heterogeneous data-set.
    1. Most algorithmic approaches address a homogeneous data set – only call data, only network traces, only IP data. We are trying to walk up and down the stack dealing with different log formats.
  3. Context.
    1. An IP address in Windows AD DS has a different context compared to an IP address in a BGP context or a switch context. Event log searches cannot distinguish the context.
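
As promised above, here is the toy, search-style correlation sketch. The event IDs (1111, 2222), the Application log and the five-minute window are all made-up placeholders, and it runs into every issue in the list above:

# Toy sketch with placeholder IDs: count how often event 2222 was preceded by
# event 1111 within a 5-minute window. Correlation only, not causality.
$window  = 5   # minutes
$xEvents = Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 1111 } -ErrorAction SilentlyContinue
$yEvents = Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 2222 } -ErrorAction SilentlyContinue
$hits = 0
foreach ($y in $yEvents) {
    $precededByX = $xEvents | Where-Object {
        $_.TimeCreated -lt $y.TimeCreated -and $_.TimeCreated -gt $y.TimeCreated.AddMinutes(-$window)
    }
    if ($precededByX) { $hits++ }
}
"{0} of {1} occurrences of 2222 had a 1111 in the preceding {2} minutes" -f $hits, @($yEvents).Count, $window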

CAUSALITY:
So, that brings us to the next question: how do we establish causality from event data? Well, you can use one of the algorithms to model a relationship, and then prove causality using instrumentation.
By instrumentation I mean you write scripts which reproduce that error and you watch for those errors to show up in your logging interface. You should be able to increase or decrease the event generation by dialing your parameters up or down. The concept is similar to writing unit tests to detect failure: if your test scripts can’t detect failure, then you have a pass.

Thanks to PowerShell and its deep inspection capabilities using WMI and .NET, you may be able to reproduce the problem by writing a script for it.
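
A minimal sketch of that instrumentation idea, with assumed names throughout: 'CausalityTest' is a made-up event source, and $eventId and $rate are the dials you would tune to your own failure signature.

# Minimal sketch (assumed names): write synthetic events so you can verify they
# surface in your logging/collection layer. Requires elevation to register the source.
$log     = 'Application'
$source  = 'CausalityTest'   # made-up test source
$eventId = 15001             # the event ID you are trying to trace
$rate    = 10                # dial the volume up or down

if (-not [System.Diagnostics.EventLog]::SourceExists($source)) {
    New-EventLog -LogName $log -Source $source
}
1..$rate | ForEach-Object {
    Write-EventLog -LogName $log -Source $source -EventId $eventId -EntryType Warning `
        -Message "Synthetic event $_ of $rate for causality testing"
}
# Confirm the synthetic events are visible
Get-EventLog -LogName $log -Source $source -Newest $rate | Select-Object TimeGenerated, EventID, Message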

End {}
From a data-center analytics point of view, we need analytics software which can model temporal dependencies in CAT 2-5 and provide a consistent alerting mechanism for system administrators.

My original question was, “Can I see a storm brewing in my network infrastructure?” and use that to get some sense of predictability.
I had just finished watching Twister some time ago, hence the flying-cow reference. In the movie, Helen Hunt’s and Bill Paxton’s characters chase tornadoes with a tornado-modeling device called DOROTHY. So, if you are into modeling event data across stacks and you see a flying cow nearby, well – good luck 🙂

I hope someone finds this useful.