We've seen it time and time again: a ticket comes into the help desk, a customer complaining about a slow application or poor voice quality during a call. We start digging into the problem, maybe pull some logs, grab some performance data from the NMS. Everything we find is inconclusive, and when we check back with the client the symptoms have passed. The problem is gone, and another ticket is in the queue, so we move on, no wiser as to what caused the issue, knowing that it will reappear.
The process of getting to the core, or root, of a fault or problem is called Root Cause Analysis (RCA). Root Cause Analysis is not a single, rigid process, but rather a collection of steps, often organized by type of problem or industry, which can be tailored to a particular problem. When it comes to identifying the root cause of an IT-related event, a combination of process-based and failure-based analysis can be employed. By applying an RCA process, and remediating issues to prevent their recurrence, reactive engineering teams can be transformed into proactive ones that solve problems before they occur or escalate.
In this article I will outline a general process for troubleshooting network-related events, meaning those issues that directly affect the performance of a computer network or application and degrade the user experience. While I will use Opmantek's Solutions in the examples, these steps can be applied to any collection of NMS tools.
Introduction to Root Cause Analysis
Describing the RCA process is like peeling back an onion: every step of the process itself comprises several steps. The three main steps of the RCA process are listed below. The first two are often completed in tandem, either by an individual or by a team in a post-mortem incident review meeting.
- Correctly and completely identify the event or problem
- Establish a timeline from normal operation through to the event
- Separate root causes from causal factors
Identifying the Event or Problem
Completely and accurately identifying the problem or event is perhaps the easiest part of RCA when it comes to networking issues.
That's right, I said easiest.
It's easy because all you have to do is ask yourself Why? When you have an answer to that question, ask yourself why that thing occurred. Keep asking Why until you can't ask it anymore; usually that's somewhere around four or five times. This process is often referred to as the 5 Whys.
Many engineers advocate using an Ishikawa, or fishbone, diagram to help organize the answers you collect to the 5 Whys. I like this approach myself, and often use a whiteboard and sticky notes while working the problem. If you prefer a software diagramming tool that's fine; just use whatever is comfortable for you.
Example: The Power of Why
Here's a real-world example Opmantek's Professional Services team encountered while conducting onsite training in system operation. An internal user called into the client's help desk and reported poor audio quality during a GoToMeeting with a potential customer.
- Why? A user reported poor voice quality during a GoToMeeting (first why)
- Why? The router interface serving the switch to the user's desktop is experiencing high ifOutUtil (second why)
- Why? Server backups are running during business hours (third why)
- Why? The cron job running the backup scripts is set to run at 9 pm local time (fourth why)
- Why? The server running the cron job is configured for UTC (fifth why)
The team started with the initial problem as reported and asked themselves: why is this happening? From there, they quickly came up with several spot checks and pulled performance data from the switch the user's desktop was connected to, and from the upstream router for that switch; this identified a bandwidth bottleneck at the router and gave them their second Why.
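As an aside, the utilization figure behind that second Why is simple arithmetic: sample the interface's output octet counter twice and compare the delta against what the link could have carried in that interval. The sketch below is a minimal illustration; the counter values and the 100 Mbit/s link speed are assumptions, not data from the actual incident, and NMIS derives this figure for you during normal polling.

```python
# Minimal sketch: deriving output utilization from two ifOutOctets samples.
# The sample values and the 100 Mbit/s link speed are hypothetical.

def out_utilization(octets_t1, octets_t2, interval_seconds, if_speed_bps):
    """Percent output utilization between two ifOutOctets samples."""
    bits_sent = (octets_t2 - octets_t1) * 8        # bytes sent in the interval, as bits
    capacity = if_speed_bps * interval_seconds     # bits the link could have carried
    return 100.0 * bits_sent / capacity

# Two samples taken five minutes (300 s) apart on a 100 Mbit/s interface.
util = out_utilization(1_500_000_000, 4_800_000_000, 300, 100_000_000)
print(f"ifOutUtil ~ {util:.1f}%")   # ~88%, enough to starve real-time voice traffic
```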
Once the bandwidth bottleneck was identified, the engineers used our solutions to identify where the traffic through the router interface was originating. This pointed to the backup server, and a quick check of running tasks identified the backup job, giving them the third Why.
System backups were handled by a cron job scheduled for 9 pm. A comment in the cron job suggested this was meant to be 9 pm in the local timezone (EST) of the server's physical location. This gave the team the fourth Why.
A check of the server's date and time showed the server was configured for UTC, which gave the team the fifth Why.
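The mismatch is easy to demonstrate. The snippet below is purely illustrative (the date and the 21:00 schedule are assumptions): a cron entry written for 9 pm Eastern, running on a server whose clock is UTC, actually fires in the middle of the Eastern business day.

```python
# Illustrative sketch of the timezone mismatch; the date and schedule are assumptions.
# The cron entry reads 21:00, intended as 9 pm Eastern, but the server clock is UTC.
from datetime import datetime
from zoneinfo import ZoneInfo

fires_at = datetime(2024, 3, 5, 21, 0, tzinfo=ZoneInfo("UTC"))   # when cron actually fires
local = fires_at.astimezone(ZoneInfo("America/New_York"))        # what time that is at the office
print(local.strftime("%H:%M %Z"))   # 16:00 EST, so the backup starts mid-afternoon, not at night
```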
Not every problem analysis will be this simple or straightforward. By organizing your Why questions, and their answers, into a fishbone diagram you will identify causes (and causal factors) leading to a problem definition and root cause. In short, keep asking Why until you can't ask it any further; this is usually where your problem starts.
Establish a Timeline
When establishing a timeline it's important to investigate both the micro (this event's occurrence) and the macro (has this event occurred in the past?). Thinking back to grade-school mathematics, I draw a timeline, a horizontal line, across my whiteboard. In the center I place a dot: this is T0 (time zero), when the event occurred.
To the right of T0 I add tick marks for when additional details were discovered, when the user reported the issue, and when we collected performance or NetFlow information. I also add marks for when other symptoms occurred or were reported, and for any additional NMS-raised events.
To the left of T0, I place everything we learned from asking Why: when did the backups start, and when should they have started? I also review my NMS for events leading up to the performance issue; was interface utilization slowly on the rise, or did it jump dramatically?
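If a whiteboard is not handy, the same micro timeline can be roughed out in a few lines of code. The events and timestamps below are invented for illustration; the point is simply to order everything relative to T0 so the lead-up is visible at a glance.

```python
# Rough sketch of a micro timeline around T0; the event data is invented for illustration.
from datetime import datetime

t0 = datetime(2024, 3, 5, 16, 5)   # T0: the user reports poor voice quality
events = [
    (datetime(2024, 3, 5, 16, 0), "cron starts the backup job (21:00 UTC)"),
    (datetime(2024, 3, 5, 16, 2), "router ifOutUtil climbs above 85%"),
    (datetime(2024, 3, 5, 16, 5), "user reports poor GoToMeeting audio"),
    (datetime(2024, 3, 5, 16, 20), "NetFlow data pulled for the router interface"),
]
for when, what in sorted(events):
    offset_min = (when - t0).total_seconds() / 60
    print(f"T{offset_min:+4.0f} min  {what}")
```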
Once I have mapped the micro timeline for this one occurrence, I begin to look back through my data. This is where having a good depth of time-related performance information comes in handy. Opmantek's Network Management Information System (NMIS) can store detailed polling data indefinitely, which allows fast visual analysis of time-based recurring events.
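When digging through that historical data, one quick, tool-agnostic way to spot a time-of-day pattern is to bucket the high-utilization samples by hour. The sketch below is generic and not an NMIS API; the sample records are invented, and in practice you would feed it whatever polling data your NMS already stores.

```python
# Generic sketch: bucket high-utilization samples by hour of day to expose a recurring pattern.
# The sample records are invented; real data would come from your NMS's stored polling history.
from collections import Counter
from datetime import datetime

samples = [  # (timestamp, ifOutUtil %)
    (datetime(2024, 3, 4, 16, 0), 91.2),
    (datetime(2024, 3, 4, 16, 5), 88.7),
    (datetime(2024, 3, 5, 11, 30), 42.0),
    (datetime(2024, 3, 5, 16, 0), 90.4),
    (datetime(2024, 3, 6, 16, 5), 89.9),
]

busy_hours = Counter(ts.hour for ts, util in samples if util > 80.0)
for hour, count in busy_hours.most_common():
    print(f"{hour:02d}:00 local - {count} high-utilization samples")
# Samples clustered at the same hour every day point to a scheduled job rather than random load.
```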