01 February 2019

Surviving Dangerous (event) Storms

With the recent onslaught of weather events hitting North America, I would be remiss not to bring up the events that we do have the ability to control. Network events can storm a control center and often leave support teams inundated with tickets. Every engineer has, at some point, felt control slip away and ended up simply reacting to whatever comes next.

These times should reinforce the need to turn your operational environment from a reactive atmosphere to a more controlled, proactive environment. One of the easiest ways to start this process is by implementing opEvents and leveraging the built-in event deduplication, correlation and synthetic event features.

To see how these features would operate in practice, consider a scenario based on the weather mentioned above. Suppose you monitor two data centers located in the United States, one in Chicago and one in Miami. You are quite lucky, because you are based in Australia, where it is summer and very warm in comparison.

During the night, there have been major issues in the Chicago data center and a number of devices have been experiencing problems. With this in mind, it would be desirable to receive a single alert notifying us that the Chicago site is experiencing a problem, rather than many (anywhere from 10 to 500+) alerts from individual nodes. This would cut down on the noise and automate part of the troubleshooting process, enabling your team to focus on a common symptom in order to remedy the problem.

All the tools to set up this process are standard features that ship with opEvents. The process relies on three key features that will increase the intelligence and automation of your event management: event deduplication, event correlation and synthetic event generation.

Event Deduplication

The principle behind event deduplication is that if two events that are considered identical occur within a given window, you should only be notified once. opEvents handles this out of the box, and it is really valuable when similar events keep recurring, such as when a node is flapping. This type of deduplication is essential for dealing with event storms; it is therefore always active and non-adjustable.
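
To make the idea concrete, here is a minimal Python sketch of window-based deduplication. It is an illustration only and not how opEvents implements the feature: the Event class, the deduplicate function, the 60-second window and the choice of (node, event name) as the identity key are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are assumptions for this sketch."""
    node: str
    name: str
    time: float  # seconds since some reference point


def deduplicate(events, window=60.0):
    """Return the events that would trigger a notification.

    Two events are treated as identical when they share the same node and
    event name. An event is suppressed if an identical event was already
    kept less than or exactly `window` seconds earlier.
    """
    last_kept = {}  # (node, name) -> time of the last event we kept
    kept = []
    for ev in sorted(events, key=lambda e: e.time):
        key = (ev.node, ev.name)
        previous = last_kept.get(key)
        if previous is None or ev.time - previous > window:
            kept.append(ev)
            last_kept[key] = ev.time
    return kept


# A flapping switch in Chicago raises the same event three times.
events = [
    Event("chi-sw1", "Node Down", 0),
    Event("chi-sw1", "Node Down", 10),   # duplicate inside the window
    Event("chi-sw2", "Node Down", 5),    # different node, so not a duplicate
    Event("chi-sw1", "Node Down", 100),  # outside the window, kept again
]
notifications = deduplicate(events)
```

In this sketch the second "Node Down" from chi-sw1 is suppressed, while the event from chi-sw2 and the later recurrence both get through.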

Event Correlation

Just as event deduplication reduces the number of events you see, the built-in event correlation rules also help reduce mass event notifications. The correlation engine can group events together based on a number of different factors, such as event type, location, name or customer. When a combination of events occurs, a synthetic event will be generated.

Synthetic Event Generation

A synthetic event is an event that the system intelligently creates from a combination of multiple other events. Once synthetic events are being created, you will log into a system that provides you with knowledge and not just data.
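
The correlation and synthetic event steps can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the opEvents rule engine: the Event class, the correlate function, the five-minute window and the threshold of three distinct nodes per location are all hypothetical choices made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are assumptions for this sketch."""
    node: str
    location: str
    name: str
    time: float  # seconds since some reference point


def correlate(events, window=300.0, min_nodes=3):
    """Emit one synthetic event per location where at least `min_nodes`
    distinct nodes reported events within `window` seconds of each other."""
    by_location = defaultdict(list)
    for ev in events:
        by_location[ev.location].append(ev)

    synthetic = []
    for location, group in sorted(by_location.items()):
        group.sort(key=lambda e: e.time)
        start = 0
        for end in range(len(group)):
            # Shrink the window from the left until it spans <= `window` seconds.
            while group[end].time - group[start].time > window:
                start += 1
            nodes = {e.node for e in group[start:end + 1]}
            if len(nodes) >= min_nodes:
                synthetic.append(Event(
                    node=location,
                    location=location,
                    name=f"{location} Site Issues",
                    time=group[end].time,
                ))
                break  # one synthetic event per location is enough here
    return synthetic


events = [
    Event("chi-sw1", "Chicago", "Node Down", 0),
    Event("chi-sw2", "Chicago", "Node Down", 30),
    Event("mia-sw1", "Miami", "Interface Down", 45),
    Event("chi-rtr1", "Chicago", "Node Down", 90),
]
site_alerts = correlate(events)
```

With this input, the three Chicago nodes cross the threshold and produce a single "Chicago Site Issues" event, while the lone Miami event does not trigger anything.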

Imagine, after your morning coffee, logging into your event management software and seeing one event, ‘Chicago Site Issues’ or similar, compared to logging in and seeing hundreds of notifications and flashing lights. Not only is this a lot easier on the stress levels, but you will also be able to solve the issue at hand more quickly thanks to the focused wisdom that has been generated.

Next Steps: