An engineer, a priest and a lawyer are scheduled for execution …

Three people are scheduled for execution: a priest, an attorney, and an engineer.

First, the priest steps up to the gallows. The executioner pulls the lever to drop the hatch, but nothing happens. The priest claims divine intervention and demands his release, so he is set free.

Next, the attorney takes a stand at the gallows. The executioner pulls the lever, but again nothing happens. The attorney claims another attempt would be double jeopardy and demands release, so he is set free.

Finally, the engineer steps up to the gallows, and begins a careful examination of the scaffold. Before the executioner can pull the lever, he looks up and declares, “Aha, here’s your problem.”

Via: Coding Horror

Powershell Polya and Chess Grandmasters

I wanted to talk about 2 things today.

  • What we can learn from Polya’s How to Solve It and apply it to solving Powershell problems. This is not guidance, but a list of things that have worked for me in the past, with some reflections on them.
  • What we can learn from chess grandmasters to become better Powershell developers.

POINT-1:: Prof. Polya and How to Solve it.
Prof. George Polya, in his seminal book How to Solve It, wrote about these techniques for aspiring mathematicians. The book was also meant to be a guide for teachers of mathematics. I think some of these patterns can be used by Powershell developers, or developers in general.

– I am reproducing the text here from the Wikipedia article and this Utah.Edu page.
Polya’s text is in italics.
– My annotations are in PS+.
All emphasis is mine and underlined.
– I am only mentioning the relevant parts and skipping some of the other points Polya makes.

Four principles (for becoming a Jedi Powershell master)
Understand: First, you have to understand the problem.
Plan: After understanding, then make a plan.
Execute: Carry out the plan.
Reflect and Learn: Look back on your work. How could it be better?

Understand:
– What are you asked to find or show?
[PS+: Think output. Think PsObject. Think pipeline preservation. Think not killing a puppy.
Write out the .EXAMPLE section of your Get-Help.
For example: Get-WorkingSet -computers, or get-qadcomputer | Get-WorkingSet]
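
A minimal sketch of what writing the .EXAMPLE block first might look like, using comment-based help on the hypothetical Get-WorkingSet function above (the parameter and property names are mine, not a prescribed design):

function Get-WorkingSet {
    <#
    .SYNOPSIS
        Returns the working set of processes on one or more computers.
    .EXAMPLE
        Get-WorkingSet -ComputerName SERVER01
    .EXAMPLE
        Get-QADComputer | Get-WorkingSet
    #>
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline = $true, ValueFromPipelineByPropertyName = $true)]
        [string[]]$ComputerName = $env:COMPUTERNAME
    )
    process {
        foreach ($c in $ComputerName) {
            # Emit objects, not text, so the pipeline (and the puppy) survives.
            Get-Process -ComputerName $c |
                Select-Object @{n='ComputerName';e={$c}}, Name, Id, WorkingSet
        }
    }
}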

– Can you restate the problem in your own words?

– Can you think of a picture or a diagram that might help you understand the problem?
[PS+: David Agans in his debugging book also talks about writing a block-diagram or a flow-chart to flesh out the problem. Very helpful for tricky PS problems.
I wouldn’t recommend a diagram if you are going to do a one-liner.]

– Is there enough information to enable you to find a solution?
[PS+: Do you need another module? Do you know which blog, or whose blog post, to read? Do you know what to search for on MSDN or StackOverflow?
Install Powertab
get-help about_Regular_Expressions
get-command *service*
get-process | get-member -membertype Method]

– Do you understand all the words used in stating the problem?
[PS+: Talk to your friend]

– Do you need to ask a question to get the answer?
[PS+: Twitter, IRC, StackOverflow, and MS Forums. You should get to that after making an honest attempt at solving the problem.]

Strategy/Plan: (Hardest Step.)
**The skill at choosing an appropriate strategy is best learned by solving many problems. You will find choosing a strategy increasingly easy.**
Find the connection between the data and the unknown.
You may be obliged to consider auxiliary problems if an immediate connection cannot be found.
You should obtain eventually a plan of the solution.

– Have you seen it before?
[PS+: Codeplex, Poshcode, Github, StackOverflow, Script Center?]

– Or have you seen the same problem in a slightly different form?
[PS+: On StackOverflow, on a blog, or under the #Powershell tag]

– Do you know a related problem?

– Look at the unknown! And try to think of a familiar problem having the same or a similar unknown.
[PS+: I have been trying out some P/Invoke stuff and I found this point particularly useful. P/Invoke is about taking native Win32 API functions and making them usable from managed code (C#) or Powershell. Most P/Invoke material is written in C. My C is less than n00b level, but after reading block after block of solved C code, I could sense how I would attack the problem.]
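
A minimal P/Invoke sketch from Powershell, assuming all you need is a single Win32 function (MessageBox from user32.dll is used here purely as an illustration). Add-Type compiles the C# signature on the fly and hands back a type you can call:

$signature = @"
[DllImport("user32.dll", CharSet = CharSet.Unicode)]
public static extern int MessageBox(IntPtr hWnd, string text, string caption, uint type);
"@
$user32 = Add-Type -MemberDefinition $signature -Name 'User32' -Namespace 'Win32' -PassThru
$user32::MessageBox([IntPtr]::Zero, 'Hello from P/Invoke', 'Powershell', 0)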

– Here is a problem related to yours and solved before.
Could you use it? Could you use its result? Could you use its method?
[PS+: This again has been a really helpful tip. I usually end up reading a lot of code on Poshcode and then modifying or extending some of it. At times I have ended up using a function for a different purpose than what was originally intended.
You may need a Powershell buddy or a Powershell mentor to get you started thinking this way.]

– Should you introduce some auxiliary element in order to make its use possible?
[PS+: Use modules. Check Poshcode. Check gists and snippets. Code reusability using Powershell modules is not a catch-phrase. You want to solve some problems once and be done with them, so that you don’t have to redo them from scratch later. Better still, someone may have done this for you already. You also need to maintain your modules, depending on the use-cases where they have been deployed.
PsGet is an important step in that direction, IMHO.

Also helpful:
SearchCo.De
Ohloh Code Search
Google
(Define irony: Bing doesn’t support filetype:ps, but Bing does support filetype:pdf.)]
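
A minimal sketch of the module idea, assuming a module named MyToolkit (the name, path and sample function are illustrative):

# Create a module folder under your user module path.
$modulePath = Join-Path ([Environment]::GetFolderPath('MyDocuments')) 'WindowsPowerShell\Modules\MyToolkit'
New-Item -ItemType Directory -Path $modulePath -Force | Out-Null

# Drop your reusable functions into MyToolkit.psm1.
@'
function Get-Uptime {
    (Get-Date) - (Get-CimInstance Win32_OperatingSystem).LastBootUpTime
}
'@ | Set-Content (Join-Path $modulePath 'MyToolkit.psm1')

# Later, from any script or session:
Import-Module MyToolkit
Get-Uptime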

– Could you restate the problem? Could you restate it still differently?
[PS+: Avoid thinking in terms of cmdlets, modules and .Net classes at the outset. Try to be as generic as possible and split the problem into its essential components – in English. This can be your SPEC document for your Powershell module.]

– If you cannot solve the proposed problem try to solve first some related problem. Could you imagine a more accessible related problem? A more general problem? A more special problem? An analogous problem?
[PS+: I usually work my way up from the specific to the general. I try to solve a specific problem with a specific computer name / PID and see if I can generalize it after solving that one case.]
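
A small sketch of that specific-to-general progression (server names are placeholders): solve it once for a hard-coded computer, then wrap the same logic in a function that takes pipeline input.

# Step 1: the specific case.
Get-WmiObject -Class Win32_OperatingSystem -ComputerName 'SERVER01' |
    Select-Object CSName, FreePhysicalMemory

# Step 2: the generalized version.
function Get-FreeMemory {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [string[]]$ComputerName = $env:COMPUTERNAME
    )
    process {
        foreach ($c in $ComputerName) {
            Get-WmiObject -Class Win32_OperatingSystem -ComputerName $c |
                Select-Object CSName, FreePhysicalMemory
        }
    }
}
'SERVER01', 'SERVER02' | Get-FreeMemory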

– Could you solve a part of the problem? Keep only a part of the condition, drop the other part?
[PS+: This should be easy. Don’t forget to understand the part you have not solved, and how it relates to the part you did solve.]

– Could you change the unknown or data, or both if necessary, so that the new unknown and the new data are nearer to each other?

– Did you use all the data? Did you use the whole condition? Have you taken into account all essential notions involved in the problem?
[PS+: Did you miss anything? Did you kill the puppy?]

Carry out the plan:
This step is usually easier than devising the plan. In general, all you need is care and patience, given that you have the necessary skills.

– Carrying out your plan of the solution, check each step. Can you see clearly that the step is correct?
[PS+: PowerGui and Powershell_ISE provide step-into and step-out options in the IDE. You can also create breakpoints and toggle them.
You can check John Robbins’s article on debugging.]
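
You can also do this from the plain console with the breakpoint cmdlets; a quick sketch (the script name is hypothetical):

Set-PSBreakpoint -Script .\Get-WorkingSet.ps1 -Line 25                     # break at a line
Set-PSBreakpoint -Script .\Get-WorkingSet.ps1 -Variable result -Mode Write # break when $result is written

.\Get-WorkingSet.ps1      # execution drops into the debugger at the breakpoint
# In the debugger: 's' steps into, 'v' steps over, 'o' steps out, 'c' continues.
Get-PSBreakpoint | Remove-PSBreakpoint    # clean up when done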

– Persist with the plan that you have chosen.

– If it continues not to work discard it and choose another. Don’t be misled; this is how mathematics is done, even by professionals.
[PS+: Don’t get hung up on a function or a .Net method if it doesn’t work. XYZ person may say so, but you are the only one coding it – not them. Move on.]

Fourth principle: Review/extend
Polya mentions that much can be gained by taking the time to reflect and look back at what you have done, what worked and what didn’t. Doing this will enable you to predict what strategy to use to solve future problems, if these relate to the original problem.
[PS+: If there is one takeaway from Polya’s book, it’s this and this only.
REFLECT.
Polya argues that after you have solved a problem – or, in our case, written a function or a script – much can be learned by reflecting on your solution and extending it to other classes of problems.
Maybe you should group these scripts into a module? Maybe they are better apart, since they don’t serve a single purpose. Maybe it’s just a snippet for reuse later.

If you do not spend the time understanding the use-cases of your own script, you are losing a valuable learning opportunity.

I must admit that I rarely do this. The temptation to upload it on PoshCode and Tweet about it gets the best of me. But, if you do spend time reflecting on your own script and your own module, you will improve your own solution and extend it to other cases.]

POINT-2:: Annotating your and others’ code like chess GMs

What do chess grandmasters do when they are not playing in competitions? They practice with other GMs, try to beat Rybka (yes), and they review and annotate their own moves. They also review and annotate games played by other masters or their competitors.

I think annotating your code and reflecting on it is an important step in your personal development. You can also write really verbose in-line #comments and release them to the public. I’d go for annotating over verbose comments for one simple reason – when I annotate code, I use references which make sense only to me and in a particular context. All that verbiage may be useless if you don’t have the reference or the context.

Github provides an easy way to annotate code. (I don’t work for Github.)
Github has a free open-source plan, and also a paid private individual plan with 5 repos which starts at $7/mo.
I think that’s a better option than setting up Mercurial on your own computer and administering and backing up everything, only to end up with a 50% solution. Since you will be reusing your investment in knowledge and the same tools (PoshGit), you can add that to your skill-set.
If you are starting out for the first-time, sooner or later you will end-up using code hosted on GitHub. It might be worthwhile investing some time learning about it.

I like Codeplex and PoshCode for cross-posting. But if I have to use a version control system which I will use day in day out and store my knowledge and annotations, I’d rather use Github.

Thanks for reading.

PS: I couldn’t incorporate these bullets above, so they made it to the PS section.

  1. Don’t forget to go to user-group meetings.
  2. Don’t forget to watch videos on TechEd, Channel9 and YouTube.
  3. Don’t forget to incorporate some form of unit-testing. I know Powershell doesn’t have a full-blown unit testing framework like nUnit, but you can review code repos on Codeplex and see how they do it. (A minimal sketch follows this list.)
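
Even without a framework, a tiny assertion helper gets you started; a minimal sketch (the function under test is made up):

function Assert-Equal {
    param($Expected, $Actual, [string]$Name)
    if ($Expected -eq $Actual) { "PASS: $Name" }
    else { "FAIL: $Name (expected '$Expected', got '$Actual')" }
}

function Get-Double { param([int]$x) 2 * $x }   # the function under test

Assert-Equal -Expected 10 -Actual (Get-Double 5) -Name 'doubles 5'
Assert-Equal -Expected 0  -Actual (Get-Double 0) -Name 'doubles 0'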

What I am working on [WeekEnding 12/16/2012]

For Now:

  • P/Invoke calls for Win32.NativeMethods and read code in PowerSploit framework
  • Windows Internals Part-6 Chapter 2. Wed Meeting.
  • Create a random generator for small range values [0-100]. Modify Get-Random function using non-terminating, non-recurring digits of PI. Random number testing using – http://www.fourmilab.ch/random/
  • Test P/Invoke code using API Monitor tool

For Later:

  • Configure Keith Dahlby’s (@dahlbyk) PoshGit
  • Hack Mike Chaily’s (@chaily) PsGet.
  • Pester – Powershell BDD Testing framework –> really interesting.
  • Powershell code reading on Github, searchco.de, ohloh and inter-tubes. Search Google filetype:ps1 >Search Text<
  • Print select pages from multiple PDF files using iTextSharp. [It works for .TXT]
  • Check mWinApi as a possible replacement for P/Invoke calls – http://mwinapi.sourceforge.net/
  • zwQuerySystemInformation vs NtQuerySystemInformation [ANS:] You cannot call zwXXX functions from user-mode. They are reserved for Kernel Mode. Article here

Interesting:

Thanks Matt (for a Powershell wrapper to NTQuerySystemInformation)

Thanks Matt! (@mattifestation) for this wrapper code.
And yes, this is a big deal.

Let me explain why.
For the last 2 weeks, I have been working on Get-Handles code which spits out the handles, base addresses and handle types. I even wrote a blog post and didn’t publish it, something like Mitt Romney’s transition website :). (I am usually too excited to blog after finishing the code and usually skip going into the details.)
But for some reason, I was missing a parameter, or a particular flag, or something in my P/Invoke calls. The last iteration from today spat out a Process Access Violation error. It’s frustrating, to say the least.

Matt’s code is a big deal.
Right now we have a way to access Windows native functions from Powershell and do stuff. It’s a nice little wrapper, probably a first step toward replacing the Sysinternals set of tools.

SysInternals rant.
Every Admin/Dev/DevOps guy in the Windows world uses Sysinternals tools. They are small, and they expose the inner workings of your applications.
But:
– There is no source code. There used to be at one point in time. (Fortunately there is an open-source project by WJ32 with source code – ProcessHacker.)
– You cannot package any of the Sysinternals tools with your code. That’s a license violation.
– Sysinternals tools work great on the console, but they suck if you are trying to get these values remotely. The amount of plumbing required to do this remotely is just not worth it.

There is a lot of handles.c code floating around on the intertubes, and a bunch of C# code too. But this is probably the first time someone has done it in Powershell.

Now that you can call NTSTATUS and NTQSI from Powershell, I am hoping someone will write a Powershell wrapper and produce these:
– PsMutex, PsSemaphores – to list out Mutants and Semaphores.
– List all processes which are not closing their handles ()
– List out the contents of the TEB, EPRCB.
– Maybe someday someone will write code to extract DPC/ISRs

I call this class of problems “Alerting Code”, meaning that by executing Get-Handles remotely you are setting up an alert for your monitors to investigate further. This goes deeper than just investigating PerfMon spikes and checking event logs; it gives a view of things at the OS level – not the fluffy managed-code level accessible via Powershell using [System.Whatever..]

Alerting Code doesn’t necessitate starting a WinDbg session.
Typically, you would want to set up alerts from your alerting code, and depending on the results, you may initiate a remote WinDbg / OllyDbg session to the target machine. You can automate this and intercept viruses and badly coded runaway programs, and hopefully someday solve the classic Windows problem of “My Computer is So Slowwwww”.
In case you cannot initiate a WinDbg session, these tiny little Powershell wrappers come in handy.

I can do all of that now without Powershell, but I am resorted to using Sysinternals tools and I have to physically be on the console of the target machine to do it.

If I have a bunch of Powershell modules which set up alerts, that’s the first loop in systems automation right there.
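
A hedged sketch of that alerting loop (Get-Handles is the hypothetical collector; the server names, threshold and output share are placeholders, and plain Get-Process stands in for Matt's wrapper):

$servers = 'SERVER01', 'SERVER02'
$report = Invoke-Command -ComputerName $servers -ScriptBlock {
    # Stand-in for Get-Handles: flag processes holding an unusual number of handles.
    Get-Process | Where-Object { $_.HandleCount -gt 5000 } |
        Select-Object @{n='ComputerName';e={$env:COMPUTERNAME}}, Name, Id, HandleCount
}
if ($report) {
    # Hand the alert off to whatever your monitors watch; a flat file on a share is the crudest option.
    $report | Out-File "\\monitor\alerts\handles-$(Get-Date -Format yyyyMMdd-HHmm).txt"
}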

I understand that most of the values returned may not have the ntdll!Ke– function names. You would need the symbol table for that.
Nicholas Dorier did a symbol-table lookup in nHook.
WJ32 did something smarter in his famed ProcessHacker; he wrote a lookup table for symbols :)
So there is scope for improvement, and I hope others join in.

Overall I am really happy that Matt wrote this and that it works flawlessly.

PS: If you think this post makes sense, please join us for the Windows Internals Study Group and code with us :). We do a Google Hangout every Wednesday 8:30pm – 11:30pm (EST). All artifacts are open sourced.

Musings of a Powershell Developer.

I have been off this blog for a while.
I was trying to learn some new stuff and figure out some Powershell code.

I am still not sure what my original intention was for starting this blog. I always thought that whatever needs to be said about Powershell has already been said, multiplied by a factor of 3. However, my views on the topic are changing, and that merits a post.

I made some conscious decisions of late to change the way I learn things about Powershell.

I started learning Powershell in April 2012. There was a free all-day pass at the MSFT NYC office for Techstravaganza, and I thought of taking a day off and learning stuff. Plus, Ross Smith and Ed Wilson were speaking.
Tome’s presentation on Powershell V3, and the subsequent interactions, left me deeply impressed.
I had been on and off Powershell for about a year before that. I used Powershell primarily for administering Exchange servers. But Tome showed me how you can do more than just get-process, get-service and restart-service, and I was excited.
I think excitement is key to learning. If you are excited about something, you will spend more time, talk to more people, interact with masters of the craft and pick their brains on “why” they are doing a certain thing a particular way, not just “how did you solve this problem”.

Determinism and Versatility.
Powershell is so versatile, and over the years huge script repos have been built at Poshcode and the MSFT Script Center which are really useful. Usually you end up going to the repos when you have a specific problem to solve. You download a script, change it to fit your particular task, and get it done.

Based upon my interactions with masters of their craft (HT to Tome (@toenuff) and Doug Finke (@dfinke)), and my own explorations with Powershell, I have come to realize that Powershell is a game-changer. Powershell not only provides a systems language to manage and control a vast array of servers and workstations in a deterministic fashion, it also exposes a rich programming environment because it exposes the .Net environment. In the last few weeks, I also learned that you can use Powershell to interact with native Win32 APIs. Powershell also has a rich set of functions for interacting with web services and databases, and you can do all this in less than 10 lines of code.
Did I mention you can control systems in a deterministic fashion?

Think about it for a second.

You have one language with which you can interact with all the entities of computing in a Windows or a non-Windows environment. And you can do that with a few lines of code. The brevity of Powershell is important to me, as I can see the whole data manipulation on one screen. I realized this while going through some C code – I had a printout of the header file and then multiple windows open for different .c files, and I had to toggle. Toggling between windows is context-switching for me; unlike C, I can visualize the whole thing in one window. Plus there is the instant gratification of seeing your results. There is also a -WhatIf switch, so that you don’t accidentally break things. Powershell developers have been gentle with the n00bs.
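
Two quick illustrations of the brevity and the safety net (the feed URL is just a placeholder):

# Talk to a web service in a couple of lines.
$feed = Invoke-RestMethod -Uri 'http://blogs.msdn.com/b/powershell/rss.aspx'
$feed | Select-Object -First 5 title, pubDate

# -WhatIf reports what the cmdlet would do without actually doing it.
Get-Service -Name Spooler | Stop-Service -WhatIf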

My Experience
I started off with Don Jones‘s Lunches series, and then moved on to PSiA-V2, and The Cookbook by Lee Holmes. My initial focus can primarily be summed-up with “getting some cool-scripts off my table“. If you see some of my initial code on the subject, I was more interested in doing the cool-stuff, like making phone-calls, calling a web-service or chopping off Workingset for a slow-system.
And the scripts worked, which gave me my initial positive reinforcement which is necessary for my own learning.
I know PowerShell is cool, but I needed to do some cool stuff of my own to invest a significant amount of my time in learning it.

After a few months, I could start visualizing the pipeline.
I could see streams of data coming off the left side, being split up, filtered, manipulated, and run through conditional loops to give you the desired output. Kinda like atoms smashing in a particle accelerator. (Maybe I am making it sound sexier than it is..). I didn’t start debugging until I read John Robbins’s article a few weeks ago.
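
For example, one pipeline read left to right – emit, filter, reshape, sort, render:

Get-Process |
    Where-Object { $_.WorkingSet -gt 100MB } |                              # filter the stream
    Select-Object Name, Id, @{n='WS(MB)';e={[int]($_.WorkingSet / 1MB)}} |  # reshape each object
    Sort-Object 'WS(MB)' -Descending |                                      # order the results
    Format-Table -AutoSize                                                  # render at the end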

I also listened to the podcasts. My initial impression was of 2 dudes hanging out with some beer, chilling and discussing Powershell. But as I listened to more episodes (which are free, by the way), I started to get a sense of how masters of their craft think. The interviews are awesome, and Jon Walz and Hal Rottenberg have made it a point to interview all sorts of Powershell practitioners. After listening to a few episodes, the breadth of Powershell started to sink in.

Community !! Community !! Community !! Community !! Community !! Community !!
Powershell has an excellent community and I wanted to stress that.
In every other blog post and presentation, Jeffrey Snover, Lee Holmes and Bruce Payette mention this. My initial thought was – “This must be some marketing spin from Microsoft to retain the interest of developers and admins”.
But I only truly understood the importance of the community after going through the discussion boards of other languages like Ruby.

In Powershell community:
-> You are not treated as a n00b, even if you are a n00b. (For all practical purposes, I consider myself a n00b, based on the maxim “It takes 10,000 hours of practice to master a craft”. I have devoted hardly 6% of that time to Powershell.)
-> If you have a n00b question, the community will try to answer it to the best of their ability, and give you code samples to help you.
-> If you attend one of the user groups, the speakers always seek out questions and try to answer them to the best of their ability.
-> If you ask someone on twitter, they respond back with code.
-> The MVPs and non-MVP masters of Powershell are extremely tolerant and open, and it really helps.
Case in point: if you didn’t attend the $1500 Experts Conference in Germany, Dmitry posted all the videos from that conference. There is no pay-wall or snickering-wall.
It really helps you to explore freely and that I think is the key to all learning.

Videos: Teaching a new dog new tricks.
After a few months of practicing Powershell, you may get to a stage where you want to skip the preliminaries of a Powershell screencast or video. I would advise against doing that. There are so many practitioners of the language with different backgrounds, and Powershell is so versatile, that the speakers end up picking up “habits” which are fascinating to watch and learn.

A case in point is Jim Christopher’s video series on Pluralsight.
I was skipping through the first few chapters till I saw how Jim used invoke-command to launch applications in his Push-Project function.
Five lines of code. It basically opens up your Visual Studio environment for you.
What was fascinating was the idea that you can launch multiple things from the command line and have your desktop ready for you with one click. I am not sure why I didn’t think of it.

I am sure I can modify this script to pull my emails, get daily headlines from Google News, download attachments from email into a Focus folder, and check tweets for #Powershell before I even sit down at my desk at 8:30 in the morning.
I can run this script at 8am and have the whole thing Out-File’d to a .TXT file so that I don’t have to look for things and get distracted.
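
A rough sketch of that morning-briefing idea – not Jim’s actual Push-Project code; the application paths, feed URL and output file are placeholders:

function Start-Morning {
    Start-Process outlook.exe                              # mail client
    Start-Process 'C:\Program Files (x86)\Microsoft Visual Studio 11.0\Common7\IDE\devenv.exe'
    # Pull the headlines into one Focus file instead of browsing for them.
    Invoke-RestMethod 'http://news.google.com/news?output=rss' |
        Select-Object -First 10 -ExpandProperty title |
        Out-File "$HOME\Desktop\Focus.txt"
}
Start-Morning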

That to me, is a case of Teaching a new dog new tricks.
These habits of Powershellers can teach you something so elementary that you never bothered to think it through in the first place.

Shortcomings and More Todo’s: It takes a village.
There is still a lot of work to be done in Powershell.
I am not talking about security, execution policy, or making .Net functions run faster. That’s Microsoft’s domain and some extremely capable people are working on it. If you have some good ideas, you can submit them on Connect.
I think the community itself needs to come-up with some entities.

First on the list, Powershell Functions Repo like Nuget or CPAN.
(The idea is borrowed from this guy.)
There is a lot of Powershell code by different practitioners in multiple repos. There are probably 200 different workstation-inventory functions. That can be a stumbling block for someone who needs reliable packages which can be deployed from the command line in a distributed fashion.
If I were starting a project right now, I would think in terms of modules and components. I need a reliable repo where I can take Jim’s Mongo functions and the multi-threaded code from Oisin.
There are lots of scripts and script repos floating around, but we need reusable modules, and we need to start thinking in terms of modules to build a solution. Developing a module repo won’t be easy, as it requires significant ownership and investment from top-flight developers who can maintain the code across different releases of Powershell and .Net, across different OSes. It takes a village..

We also need to distill the idioms, patterns and practices of the language, so that new practitioners make fewer missteps.
The Powershell idioms repo as of now > “Every time you use write-host in your script, God kills a puppy.”

I would also like to see some sort of a Powershell wiki, where we arrange scripts and blog posts along a hierarchical list of tags and topics.
For Example – Databases > Mongo > Jim Christopher Mongo Module.

Conclusion:
If you are still reading it, I thank you for your patience 🙂

I wanted to write a post about how I started off with Powershell, how I progressed, and some of the significant milestones in my journey. I have started reading and watching videos by Don Knuth, and Abelson and Sussman. After watching a few of them I realized that their message is universal across all languages and helps you become better at your craft. For that matter, you can learn from any dedicated practitioner of a craft. (Recommended movie – Jiro Dreams of Sushi.)
Given Powershell’s versatility, I imagine someday someone will want to go and analyze the algorithms and the essence of computing, not just the task at hand. Plus, Knuth is really cool and fun to watch any day.

If you want to be better at your craft, you need to listen to the masters. I am not talking UML and Kanban, I am talking about understanding the way they think. I read somewhere that you can solve multiple computing problems with just a graph-coloring algorithm. I am at a stage in my journey where I don’t even know what a graph-coloring algorithm is, let alone how to use it. But I hope to get there someday.

If you manage to distill the thoughts of the Masters and make them your own – You are one cool Grasshopper!!

System administration is not system administration anymore, and that’s a good thing.
You have a much richer environment with which to interact with different systems, engineer them, and duplicate them in a deterministic fashion across a wide array of things locally, remotely and in the cloud.

Hurricane Sandy Wind+Flooding Check realtime.


Raritan River watershed: empties near Staten Island in Raritan Bay on the Atlantic Ocean. (Photo credit: Wikipedia)

If you live in New Jersey and you are a data guy going crazy with different news & data points covering Hurricane Sandy, here is my quick checklist.

What do you need to track:
a) Hurricane Sandy will cause flooding and high winds, so you need to watch out for rising water levels.
b) Best place to monitor current wind speed: your favorite weather channel, www.weather.com (search by Zip code).
c) Best place to monitor whether current water levels are below flood stage: NOAA.
d) High winds might cause trees to fall, causing power outages.

Data-Points to Track:

  1. Water Levels at your local river. http://water.weather.gov/ahps2/hydrograph.php?wfo=phi&gage=rrtn4. Try googling for this. raritan river site:water.weather.gov (Replace this with your own river / ref. points. Check the flood / storm surge maps below)
  2. NJ Flooding County-wise Storm Surge Maps. http://www.state.nj.us/njoem/plan/evacuation-routes.html
  3. Google All in One Tracker Page (Summarizes all Gov. Data Points) http://google.org/crisismap/sandy-2012
  4. Cams: http://www.eastcoastcams.com/newjerseysurfcams.htm
  5. Outage Maps: PSEG Outage Map and First Energy Outage Map
  6. 511: www.511nj.org/

Stay safe.

EvilMonkey-Part 1: Trim Working Set of a running process.

I have never had so much fun with 5 lines of Powershell code.

Remember those pesky situations when you have processes running off and chewing up working-set values? Powershell to the rescue. This small 5-line Powershell script will trim your working-set data. I have used the (-1,-1) values to trim as much as possible.

What happens when you trim working-set data ? From MSDN:

The working set of a process is the set of memory pages in the virtual address space of the process that are currently resident in physical memory. These pages are available for an application to use without triggering a page fault. The minimum and maximum working set sizes affect the virtual memory paging behavior of a process.
The working set of the specified process can be emptied by specifying the value (SIZE_T)–1 for both the minimum and maximum working set sizes. This removes as many pages as possible from the working set.
The operating system allocates working set sizes on a first-come, first-served basis. For example, if an application successfully sets 40 megabytes as its minimum working set size on a 64-megabyte system, and a second application requests a 40-megabyte working set size, the operating system denies the second application’s request.

I have primarily used this script to trim the WorkingSet values of a browser. I have tested it with Firefox, iexplore and Chrome. The script works with no tab crashing. You can keep a Process Explorer / Task Manager / Process Hacker window open and observe the trimmed PID values.

Trimming the WorkingSet doesn’t mean that the WorkingSet values won’t go back to their previous numbers. I have seen some IE windows go back to similar numbers. However, this script is really useful for fixing a stuck browser or a frozen-screen situation.

Linus said, “Talk is cheap. Show me the code.” So here it is. All 5 lines of goodness.
http://poshcode.org/3653
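
In case that link ever goes stale, here is a sketch of the same idea – not necessarily the original script – using SetProcessWorkingSetSize with (-1, -1):

$sig = @"
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool SetProcessWorkingSetSize(IntPtr hProcess, IntPtr dwMinimumWorkingSetSize, IntPtr dwMaximumWorkingSetSize);
"@
$k32 = Add-Type -MemberDefinition $sig -Name 'Kernel32Trim' -Namespace 'Win32' -PassThru
Get-Process -Name iexplore -ErrorAction SilentlyContinue | ForEach-Object {
    # (-1, -1) asks Windows to remove as many pages as possible from the working set.
    [void]$k32::SetProcessWorkingSetSize($_.Handle, [IntPtr](-1), [IntPtr](-1))
}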

Warning Notice:
***********READ CAREFULLY***********
!!!! Do not use in a production environment before thoroughly testing and understanding the script. !!!!
Do not use this to trim the working set of SQL databases; there will be data loss.
Do not use this to trim the working set of a Microsoft Exchange database process (store.exe).
I haven’t tried this on the IIS worker process (w3wp.exe).
More NEGATIVE effects of trimming working-set are listed here in detail.

Flying Cows: On Event Log analysis and Causal Relationships

Background:
For the last few months I have been doing some research on event log analysis. The whole thing started with this simple idea: let’s say we have an EventID, e.g. 15001 from MsExchangeTransport – can we drill down and trace what caused it? You have the event data at point X in time, so you should be able to search back through a tree of events.

I was expecting answers along the lines of event chains – EventID:15001+MsExchangeTransport -> EventID:1111+SourceA [TimeStamp] -> EventID:2222+SourceB [TimeStamp].
Well, the situation is a bit more complicated than that.

  1. Event IDs are tied to the event source. Different sources can have the same EventIDs in the Microsoft EventID namespace. Application developers can write to event logs with their own custom Event IDs.
  2. Going through event logs is like finding a needle in a haystack. There are filters for source and EventID, but they don’t help much. Also, event logs are localized. You would need log-analysis software which collects logs across your server stack and network layers to model some type of relationship.
    Just to get a sense of how many providers can write to your event log, try this:
    Type this at your command prompt – wevtutil el
    Now write a script to do this across your server farm, and do a Compare-Object to get a list of all providers who can write to Windows event logs (see the sketch after this list).
  3. All indicators for a system under pressure may not be collected entirely in event logs, even on systems with cranked-up diagnostic logging. You may need to look at Perfmon data, Windows log files, syslog data and events from your networking stack (Cisco Netflow / OpenFlow). If you have a virtualized environment, you add more layers of events – storage stack data, virtualization hypervisor data, etc.
  4. EventIDs by themselves do not carry the intelligence to understand their own severity. You may need an external reference data-set or a codebook which provides guidance on whether a particular event is severe or can be ignored. Think http://support.microsoft.com/KB/ID
  5. EventIDs have temporal data, meaning every EventID has a time-stamp attached to it. This can be used for plotting events on a time-series, like the Splunk search app.
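
A sketch of the provider-inventory idea from point 2 (the server name is a placeholder): run the same wevtutil command on each machine and diff the lists.

$local  = wevtutil el
$remote = Invoke-Command -ComputerName 'SERVER01' -ScriptBlock { wevtutil el }
Compare-Object -ReferenceObject $local -DifferenceObject $remote |
    Sort-Object InputObject | Format-Table -AutoSize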

CLASSIFICATION:
So how would you know if your infrastructure is in trouble?
Typically, you would receive an alert for an event you are watching [e.g. an EventID, “disk space”, “network layer alerts”, “storage alerts”], and it’s entirely up to the system admin’s technical ability to figure out the root cause. Finding a root cause may not help in certain cases, when mutually exclusive events are generated at different layers of the stack to create a storm. Also, you may have concurrent root causes.

I want to classify the different categories of events we are trying to analyze:

  1. CAT1: Outage events.
  2. CAT2: General errors and warnings related to a server-role, or an application in the core-stack.
  3. CAT3: Other application events
  4. CAT4: Events triggered by end-user action.
  5. CAT5: Rare one-off events which are critical, but may not trigger an alert.

[Core-Stack]

  1. Core OS
  2. Core Infra @{AD, DNS, DHCP, IPAM, Server-Role @{TS,Hyper-V,ADRMS,ADFS}}
  3. Tier-1 Applications:
    1. Exchange @{DB, Transport}
    2. IIS @{}
    3. SQL @{}
    4. Sharepoint @{}
    5. Network Layer @{}
    6. Storage Layer @{}
    7. Virtualization layer @{}
    8. Security Layer @{}

Let me explain the categories a little bit.
1. Cat1: Outage events.
By definition, you can’t plan or do analytics on outage events. The best you can hope for in this situation is to bake redundancy into your infrastructure – dual power supplies, dual WAN, disaster recovery/BCP, an alternate data center and WAN failover come to mind. The idea is, if your hardware or the supporting CAT1 infrastructure is under pressure, move it to a stable pool.
CAT1 events are by their very nature unpredictable and random, and are usually caused by failures outside the measurable category.
(Ross Smith IV gave an example during his Failure Domain presentation where a tree-cutting service chopped off the dual WAN links to a data center.)
2. Cat2: Core stack.
This is the space where system administrators spend most of their time. Core stack + application events also cover close to 90% of the logs generated by volume. Event data in this category may lend itself to pattern analysis, and I am going to discuss some of the options down the line.
3. Cat3: Application specific – desktop, web-based or apps.
Application-specific events from a non-Microsoft vendor, or from an internally developed application.
4. CAT4: User stack.
You can investigate the client log data connecting to your core infrastructure and try to find patterns and causality.
E.g.: Email stuck in the Outbox affects the Exchange subsystem; changing Outlook views does too.
A user watching streaming videos in Internet Explorer during the Olympics affects the VDI infrastructure.
5. CAT5: Rare chronic events.
Rare one-off events which may or may not be critical, and do not trigger an alert.

Categories explained:
I am going to discuss some of the existing research in CAT [2-4] and CAT5. My initial thought going into this was, “Surely I am not reinventing the wheel here. Someone else must have faced similar problems and they must have done some research.” Well they did.

CAT2/CAT3/CAT4 Research:
a) Multi-variate analysis techniques: UT Austin and AT&T Labs Research published a paper on a diagnosis tool (GIZA) they developed to diagnose events in AT&T’s IPTV infrastructure. Giza was used to trace events across the network stack from the SHO, to the DSLAM, to the set-top box. Giza used a hierarchical heavy-hitter detection algorithm to identify spatial locations, and then applied statistical event correlation to identify event series that are strongly correlated. Then they applied statistical lag-correlation techniques to discover causal dependencies. According to the authors, Giza scores better than WISE on the AT&T test data. Giza also has the advantage of traversing the whole stack: it collects logs in different formats, across devices, and uses that to model causality.
b) Coding approach: Implemented in SMARTS Event Management Systems (SEMS) (sold to EMC in 2005). Tags event data as (P) Problem and (S) Symptom and uses a causality graph model. Paper
c) End-to-end tracing: Uses tree-augmented naive Bayes to determine the resource-usage metrics that are most correlated with an anomalous period. [System slow -> high WorkingSet data -> IE eating up memory due to an incorrect plugin, native memory leaks]
d) Microsoft Research NetMedic: NetMedic diagnoses problems by capturing dependencies between components, and analyzing the joint behavior of these components in the past to estimate the likelihood of them impacting one another in the present, then ranking them by likelihood of occurrence. [This used Microsoft stack test data, Perfmon data, etc.]

CAT5 Research:
CMU and AT&T Labs Research published an excellent paper on this topic. They call these events chronics – the recurring, below-the-radar event data. They analyzed CDR (Call Detail Record) data across AT&T’s VOIP infrastructure to detect below-the-radar events which were not captured by any triggers. They use a distributed Bayesian learning algorithm and then filter the resulting dataset using KL divergence. This is a novel approach. Had they used just a Bayesian algorithm, or a learning algorithm, the resulting data-set would have events with high scores which would reinforce the results. Any future events would be scored based on historical data, which you don’t want when you are trying to find oddball events. They recursively removed all CDR events with high scores using the KL divergence algorithm, until they had a handful of odd events. The full paper can be accessed here.

Other Challenges:
No commercial solution exists as of this date which can find causal relationships across the stack. Splunk does an awesome job of collecting, parsing, tokenizing and presenting the data in a searchable form. But this type of analysis may not lend itself to a search-based approach. You can find whether (x) occurred before (y), do stats on that and establish some sort of correlation, but there are some issues with that approach.

  1. Correlation:
    1. You may not have the necessary volume of event data to establish a correlation. Remember, you cannot control how a specific event is generated; you can only do analytics on the ones that are logged. For example, two occurrences of EventID 100 cannot be used to calculate this.
    2. Correlation does not necessarily establish causality, but adding temporal data into the mix can help you in identifying the culprit.
    3. Different event logs present data in different formats. There is no one common universal logging standard, which is used by every vendor from OS to Hardware or Applications, Networks to Power.
  2. Heterogeneous data-set.
    1. Most algorithmic approaches address a homogeneous data-set – only call data, only network traces, only IP data. We are trying to walk up and down the stack, dealing with different log formats.
  3. Context.
    1. An IP address in Windows ADDS has a different context compared to an IP address in a BGP context or a switch context. Event log searches cannot distinguish the context.

CAUSALITY:
So, that brings us to the next question: how do we establish causality from event data? Well, you can use one of the algorithms to model a relationship, and then prove causality using instrumentation.
By instrumentation I mean you write scripts which reproduce that error, and you watch for those errors to show up in your logging interface. You should have the ability to increase or decrease the event generation by dialing your parameters up or down. The concept is similar to writing unit tests to detect failure. If your test scripts can’t detect failure, then you have a pass.

Thanks to Powershell and its deep inspection capabilities using WMI and .Net, you may be able to reproduce the problem by writing a script for it.
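
A small sketch of that instrumentation loop (the source name and event ID are arbitrary test values; registering a source needs an elevated session):

# Write a synthetic event you control...
New-EventLog -LogName Application -Source 'InstrumentationTest' -ErrorAction SilentlyContinue
Write-EventLog -LogName Application -Source 'InstrumentationTest' -EventId 9999 `
    -EntryType Warning -Message 'Synthetic event for causality testing'

# ...then verify that your collection/search layer actually sees it within the expected latency.
Get-WinEvent -FilterHashtable @{ LogName = 'Application'; Id = 9999 } -MaxEvents 5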

End {}
From a data center analytics point of view, we need analytics software which can model temporal dependencies in CAT 2-5 and provide a consistent alert mechanism for system administrators.

My original question was, “Can I see a storm brewing in my network infrastructure?”, and whether I can use that to get some sense of predictability.
I had just finished watching Twister some time ago, hence the flying cow reference. In the movie, Helen Hunt’s and Bill Paxton’s characters chase tornadoes with a tornado-modeling device called DOROTHY. So, if you are into modeling event data across stacks and you see a flying cow nearby – well, good luck 🙂

I hope someone finds this useful.

Get-EventLog via Protocol or via a Database ?

What are the pros and cons of a protocol-backed vs a database-backed event monitor?

Domain: Powershell, IIS, Event Log Monitor / notification system.

Motivation:
1. I don’t want to receive event notifications via email – I am concerned with event delivery only, not event capture.
2. There are many ways of capturing events on poshcode / the MSFT Script Repo.

Definition:
Protocol-based event monitor – Use ODATA/ATOM or anything else to poll event logs from a system X, and display them anywhere else.

Database-based event monitor – Uses this flow > Event (ETW) -> DB -> UI (event-to-UI in milliseconds, 1 second max)

Protocol:
PROS

1) You can only query what you want.
CONS
1) Slow / sluggish?
2) You need to convert events to a feed, then write a WCF service (or publish an application in IIS) to get started. [Maybe there is a better way, but I have only tested the IIS way so far.]
3) Susceptible to the fallacies of distributed computing.

Database:
PROS
1) If you choose your tools well, you can achieve a near-millisecond round trip from ETW to DB to UI. IIS doesn’t figure in this.

CONS
1) You are forcing objects into columns and splitting them up, thereby losing the objects. But you are capturing the whole event message (whatever is in the XML), so does it matter if you lose the objects?
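
A quick illustration of the “whole event message is in the XML” point:

$record = Get-WinEvent -LogName System -MaxEvents 1
$xml    = [xml]$record.ToXml()     # the complete event, ready to store in a single column
$xml.Event.System.EventID          # ...and still queryable when you need a field back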

Anything else ?

Thanks Zero Water.

Usually when you read a product or service review on a blog, it’s about how XYZ sucks. Well, I had a pleasant experience with a company and wanted to blog about it.

Product purchased: Zero Water 23-cup dispenser, for $35 on Amazon.

We bought the Zero Water 23-cup water dispenser last year. We loved it from day one. It came with a tester which checks for solid deposits in water and gives you a score. We tested this with bottled water, Poland Spring and tap water and found the following scores.

  • Bottled water – 20-50
  • Poland Spring – 50
  • Other bottled water – 90-200 (some bottled water results are worse than tap water?!)
  • Tap water – 150-272
  • Zero Water – 0-3

If the tester indicates a score greater than 6, it’s time to replace the filters. The basic unit ships with 1 filter, and you can buy filters in bulk and save some money. The water from the Zero Water dispenser is tasty. (Yes, I said it.)

However, in May, the tap on the dispenser broke when I was holding it to pour out some water. The spring came out and there was no way for me to put it back. I was going to buy another unit but I thought to myself that I should at least give them a call.

So I called the 800 number and spoke to a rep and described the issue. She was apologetic and said that they will send a replacement unit. I was astounded that they were so accommodating. I asked them if I was expected to pay for it since I wasn’t sure about the warranty, but they told me that I wouldn’t be charged for it. I gave them my street address and the new unit arrived in a week.

Two weeks ago, the tap broke again, and this time without any user intervention. So I called them again and explained the problem. The rep confessed that the fault lay with them. There was an engineering problem with the tap design.

I thought – Did they just accept responsibility for a faulty tap design? I have heard apologies, but this one goes a step further. They had my street address on file and I didn’t have to give it a second time. She said they would expedite the replacement unit if possible. So till then, they had sent me 2 free replacements for a product which broke for apparently no fault of mine.

What happened next should be a lesson in customer service.

I was driving back from work early this week and got a call from a funny 800-number. This person was calling back from Zero Water and letting me know that the specific product which I asked for is back-ordered.

How many times have you received a call back from a company for something you bought at retail? Probably zero. These guys were calling me back to let me know that the product was back-ordered; I was really impressed.

Then the rep said,  “I am sorry for your troubles. Can I ship you a replacement unit for free ? We will expedite this, so that you have something at home to drink clean water, while you wait for your product to arrive.”

I was blown away by their customer service. I cannot recall any experience I have had with any other company that matches this.

As an enterprise customer, maybe I have – but in retail, these guys are the kings as far as customer service is concerned. BTW, the clean-awesome-tasting-water bit helps too. 🙂

The replacement pitcher arrived today. Attached price list: $0.00. Not a refurb, but a brand-new sealed box. The replacement unit was an 8-cup pitcher.

So Zero Water, you have won yourself a lifetime customer.

Every element in the company – operations + customer service + products came together to result in an awesome user experience.

Disclaimer: I don’t work for Zero Water, nor am I affiliated with it in any way.