Sunday, 3 March 2013

MECE for Performance Tuning

One of my favourite lines is a Denny Cranism from Boston Legal: “Pull a rabbit out of your hat. That's the secret of both trial law and life.” This happens in the season one episode Head Cases. Alan Shore is fighting a case about racial discrimination. Heeding the sage counsel of the senior partner, he pulls a rabbit out of his hat -- in the form of Reverend Al Sharpton, who plays himself and shows up for the defendant.

I recently faced a situation that we had to overcome so as not to lose our client to a competitor's product. I know no politician or parliamentarian, but I did pull out a rabbit -- in the form of an Excel sheet.

The situation itself is a very technical one. At least the solutioning part. It is one of those dreaded calls that you want to avoid -- your implementation team reports that the application ‘has performance problems’ or ‘is very slow’ or ‘freezes and server has to be restarted’. So you instruct the performance engineering group to do the standard set of things: heap size, perm gen size, check for memory leaks and so on.

But none of it helps. The client keeps calling management to say that they have an enormous problem, that they have to restart the server very frequently, and that they are mighty displeased. This keeps everyone on tenterhooks, fearing that the project may shut down and we may lose the client.

A similar situation happened a couple of times in the past. In both instances, MECE stood me in good stead. MECE stands for ‘Mutually Exclusive, Collectively Exhaustive’. The first time I heard about it was from Kal Gangavarapu, the then COO with Four Soft Limited, who used to say that McKinsey made billions just on mee see.

Let me dwell on it a bit more. It is the framework that McKinsey consultants use as part of their problem-solving process. Ethan M. Rasiel and Paul N. Friga explain it in The McKinsey Mind:
The McKinsey problem-solving process begins with the use of structured frameworks to generate fact-based hypotheses followed by data gathering and analysis to prove or disprove the hypothesis.

Although McKinsey & Company often uses the term fact-based to describe it, the McKinsey problem-solving process begins not with facts but with structure. Structure can refer to particular problem-solving frameworks or more generally to defining the boundaries of a problem and then breaking it down into its component elements.

For McKinsey-ites, structure is less a tool and more a way of life... Being MECE in the context of problem solving means separating your problem into distinct, nonoverlapping issues while making sure that no issues relevant to your problem have been overlooked.

… so let’s turn now to defining and simplifying the problem. In the generic approach to framing the problem, McKinsey-ites put this concept into practice by breaking the problem before them into its component elements. Why? In most cases, a complex problem can be reduced to a group of smaller, simpler problems that can be solved individually.

The most common tool McKinsey-ites use to break problems apart is the logic tree, a hierarchical listing of all the components of a problem, starting at the “20,000-foot view” and moving progressively downward.

You generate your initial hypothesis by drawing conclusions based on the limited facts that you know about the problem at hand without doing a lot of additional research.

Your next step is to figure out which analyses you have to perform and which questions you have to ask in order to prove or disprove your hypothesis. One way to lay out these questions is an issue tree. The issue tree, a species of logic tree in which each branch of the tree is an issue or question, bridges the gap between structure and hypothesis.

Where a logic tree is simply a hierarchical grouping of elements, an issue tree is the series of questions or issues that must be addressed to prove or disprove a hypothesis. Issue trees bridge the gap between structure and hypothesis. Every issue generated by a framework will likely be reducible to subissues, and these in turn may break down further. By creating an issue tree, you lay out the issues and subissues in a visual progression. This allows you to determine what questions to ask in order to form your hypothesis and serves as a road map for your analysis. It also allows you very rapidly to eliminate dead ends in your analysis, since the answer to any issue immediately eliminates all the branches falsified by that answer.

[Issue tree and logic tree]
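To make the "eliminate dead ends" idea concrete, here is a minimal Python sketch of an issue tree. It is only an illustration: the `Issue` class, the `open_questions` helper and the example questions are my own assumptions, loosely based on the areas discussed in this post, and are not taken from the McKinsey books.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Issue:
    """One node of an issue tree: a question plus its sub-questions."""
    question: str
    children: list["Issue"] = field(default_factory=list)
    # None = not yet investigated, True = confirmed, False = ruled out
    answer: Optional[bool] = None


def open_questions(node: Issue) -> list[str]:
    """Return the leaf questions that still need investigation.

    A node answered False removes its whole subtree (a dead end).
    """
    if node.answer is False:
        return []
    if not node.children:
        return [node.question] if node.answer is None else []
    result = []
    for child in node.children:
        result.extend(open_questions(child))
    return result


def build_example_tree() -> Issue:
    # Hypothetical branches, for illustration only.
    return Issue("Why is the application slow?", [
        Issue("Is the network the bottleneck?", [
            Issue("Is latency high for the affected users?"),
            Issue("Are pages too large?"),
        ]),
        Issue("Is the database the bottleneck?", [
            Issue("Are stored procedures slow?"),
            Issue("Are connections leaking?"),
        ]),
    ])


tree = build_example_tree()
tree.children[0].answer = False  # e.g. network tests showed no effect
remaining = open_questions(tree)
print(remaining)
```

Ruling out the network branch drops both of its sub-questions in one step, which is the property the quoted passage describes.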

I first used MECE back in 2009. Since then I have resorted to it three more times.
The first case was when a reported performance degradation problem seemed hopeless and was going nowhere. We came to know of problems with the quote management application that we had built for a logistics giant (D**) for their Asian region. They complained of slow response times. I called a team meeting and began by writing a precise problem statement: when users in Australia and New Zealand enter a quote, the application freezes between the second and third screens, but on some days the same users get normal response times on the same screens.

We then went on with a structured approach, starting with a Fiddler report on page sizes, then moving on to profiler reports, connection leakage areas, database parameters, compression filters, network latency times and so forth. The customer’s IT team also ran network tests for us. For example, they jacked up the latency to high values using live video feeds, file copies, mapped drives etc., but still the user response time did not change. The saving grace was that, thanks to the many actions we took, after the first week there was some kind of normalcy for the users.

However, in the very next week, the users reported slowness again. By this time, the application loggers had started showing high execution times, something that had gone unnoticed by all of us. We narrowed it down to a couple of stored procedures and immediately called for a code review. The team reported that a high-cost select query over a static subset of data was being executed repeatedly inside a loop with dynamic parameters. The solution: move the static subset into a cursor outside the loop, and work on the cursor with the dynamic parameters.
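The fix is easier to see in code. What follows is a rough Python analogy using the standard `sqlite3` module, not the actual stored procedure: the `rates` table, its columns and the two functions are hypothetical names invented for this sketch.

```python
import sqlite3

# Hypothetical data standing in for the real tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rates (region TEXT, weight_band INTEGER, price REAL)")
conn.executemany(
    "INSERT INTO rates VALUES (?, ?, ?)",
    [("APAC", 1, 10.0), ("APAC", 2, 18.0), ("EU", 1, 12.0), ("EU", 2, 20.0)],
)

# The "static subset": its result does not depend on the loop variable.
STATIC_SUBSET_SQL = "SELECT weight_band, price FROM rates WHERE region = 'APAC'"


def price_slow(bands):
    """Re-runs the static query once per dynamic parameter."""
    queries = 0
    prices = []
    for band in bands:
        rows = conn.execute(STATIC_SUBSET_SQL).fetchall()  # same rows every time
        queries += 1
        prices.append(next(p for b, p in rows if b == band))
    return prices, queries


def price_fast(bands):
    """Fetches the static subset once, then applies the dynamic parameters."""
    rows = dict(conn.execute(STATIC_SUBSET_SQL).fetchall())  # hoisted out of the loop
    return [rows[band] for band in bands], 1


print(price_slow([1, 2, 1]))
print(price_fast([1, 2, 1]))
```

Both functions return the same prices, but the second one runs the expensive query once instead of once per iteration.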

There was a difference from the McKinsey approach in what I did. I didn’t use the two trees, the issue tree and the logic tree. Rather, I used two MS Excel sheets. The first sheet was problem.xls with the columns: S.No, Issue, Remarks. Entries in the Remarks column were colour coded to indicate valid / rejected. The second sheet was action-plan.xls with the columns: S.No, Activity, Start Date, End Date, Responsibility, Status, Remarks.

This was the first time I employed MECE for performance engineering and we could crack the problem. It helped me build the confidence that it was a reliable methodology.

The second time was on the warehouse solution that we implemented for a Chennai-based customer, F****c. Once again we went through a whirlwind cycle of issues / possible causes / data collection / validation-rejection. This time around, instead of two Excel sheets, I used only one with the columns: Issues, Questions, Action Plan. Under the Issues column, I made four sections: Client environment, Application / App server, Data server, Network. We finally narrowed down the problem to database queries.
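A sheet like this can also be checked mechanically for the two halves of MECE. The sketch below is only an assumption about how one might do it in Python: the four section names come from the sheet described above, while the sample rows and the `mece_report` function are hypothetical. It catches obvious overlaps and empty sections; whether the sections truly cover everything remains a judgment call.

```python
from collections import Counter

# The four sections used under the Issues column.
SECTIONS = ["Client environment", "Application / App server", "Data server", "Network"]

# Hypothetical rows: (issue, section).
rows = [
    ("Large page size", "Network"),
    ("Slow stored procedure", "Data server"),
    ("Connection leak", "Application / App server"),
    ("Connection leak", "Data server"),  # same issue filed twice: an overlap
]


def mece_report(rows, sections):
    """Return (overlapping issues, sections with no issue yet, unknown sections)."""
    counts = Counter(issue for issue, _ in rows)
    # Mutually exclusive: no issue should be listed more than once.
    overlapping = sorted(issue for issue, n in counts.items() if n > 1)
    # Collectively exhaustive: every section should have been examined.
    used = {section for _, section in rows}
    uncovered = [s for s in sections if s not in used]
    # Rows filed under a section that is not part of the agreed breakdown.
    unknown = sorted({s for _, s in rows if s not in sections})
    return overlapping, uncovered, unknown


overlapping, uncovered, unknown = mece_report(rows, SECTIONS)
print(overlapping, uncovered, unknown)
```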

The third time was in 2011, when a U.S. client reported slowness on some screens of our flagship product. Though I employed MECE, we quickly found that all we had to do was look at our loops, heavy objects and queries. Though the problem got resolved, it was rather monotonous, regular stuff and did not have the magic of cracking a puzzle.

The fourth time was last month. It was with VisiLog, our order management and visibility (track-n-trace) solution, implemented for G***** W*****, a customer in Europe. It was a seemingly futile quest where nothing seemed to provide a breakthrough. However, we believed in the structured approach and in fact-based hypothesis validation/rejection.

This time too I used only one Excel workbook, with two sheets (tabs). The first one was for Problem & Observations. Here I wrote a clear statement of the problem and updated the observations every day. During daily meetings, people were assigned tasks to document observations, with or without running tests. Monitoring the server parameters, for instance, does not need tests; we just have to document the observations. The second sheet (tab) in the xls was MECE. The format was:


This is inspired by the figure Content Map given on page 123 of The McKinsey Engagement by Paul N. Friga.

The root cause of the slowness problem was found to be an exponential increase of the connection pool by the application server when the database server was out of reach. This outage happened during the nightly maintenance window. We notified the infrastructure team and the app server vendor, who provided the solutions or workarounds.

I joked with my associates at the beginning of my engagement that when nothing else works, MECE will work (it has mahathyam). And it worked again. That’s MECE magic.

Problems related to performance engineering, or fixing slowness issues, or however we might describe them, have their root causes and solutions in a wide set of areas. These range from the client-side environment to the app server, application, database server, database, and procedural code. Applying MECE, which originated in a non-technical business domain, to an intensely technical engagement is a long shot. If I tell someone that I look for approaches in business books like the McKinsey books, they wouldn’t believe me or encourage me. However, I persisted with MECE.

I’d like to think of this as my out-of-the-box idea, or a rabbit, as Denny Crane called it. Needless to say, Alan Shore won the case.
