How to Feed Your Network Monitoring Solution

Introduction

A common challenge I hear from prospective customers is their concern with the number of resources needed for the daily upkeep of a network monitoring solution. Resources are at a premium, and making sure devices are added, updated, and retired from the monitoring platform is commonly a low-priority task, often relegated to inexperienced engineers if not forgotten altogether.

Opmantek was recently selected by NextLink Internet, a Wireless ISP located in Hudson Oaks, Texas, to provide solutions around fault and performance monitoring, event and configuration management, and NetFlow analysis. Like many other clients, a key requirement of Ross Tanner, NextLink's Director of Network Operations, was automating the general upkeep of devices, or as Ross put it "the daily feeding and watering of the solution".

 

Operational Process Automation

Definition

Operational Process Automation (OPA) is all about using digital tools to automate repetitive processes. Sometimes fully autonomous automation can be achieved, but more often complex workflows can make use of partial automation with human intervention at key decision points.

 

Automating the Feeding and Watering

The key to maintaining the list of devices to be monitored is keeping track of new, existing, and retired devices. Opmantek’s suite of network monitoring tools includes Open-AudIT, an agentless device discovery and auditing platform. While Open-AudIT contains a built-in connection to Opmantek’s NMIS fault and performance platform, the connection required significant manual intervention which could not scale easily to the scope needed by NextLink.

As part of system implementation, Opmantek conducted onsite interviews with NextLink's engineering teams, from internal architects to field managers, to understand their concerns and requirements. As a result, it was quickly determined that Open-AudIT's existing link to NMIS needed to be automated in a way that was easy to set up and maintain, even by novice engineers.

As NextLink was deploying a 2-tiered monitoring architecture, composed of a series of pollers connecting directly with devices and reporting back to one or more primary servers, the solution would need to scale horizontally as well as vertically. While NextLink intended to start with a single server dedicated to device discovery and auditing, the solution would also need to be flexible enough to support multiple Open-AudIT servers.

These conversations resulted in the design and development of opIntegration to intelligently link Open-AudIT with NMIS.

 

"At Nextlink we care for our customers, we want them to succeed as much as we do, when it comes to partnering with vendors that is a large deciding factor for us. With Opmantek it was never a question… we could not have asked for a better team to work with on going to the next level of monitoring and automation."

Ross Tanner, Director of Network Operations, NextLink Internet

 

Use Cases

The first step in developing an automation system is to identify the most common use cases, and if time permits, as many edge cases as possible. For this implementation, Opmantek’s engineering team storyboarded the following as a version 1 release:

 

New Device Added to Network

A list of devices is periodically pulled from Open-AudIT via the Open-AudIT API and added to the NMIS server. By maintaining a list of integrated devices, opIntegration knows whether a device is new or whether an update is being provided for an existing integrated device.
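
As a rough illustration of this pull, the sketch below uses curl against the Open-AudIT API. The endpoint paths, credentials, and output file are assumptions for a typical install, not NextLink's actual configuration or the shipped opIntegration code.

#!/bin/bash
# Hypothetical sketch: authenticate to Open-AudIT and pull the device list as JSON.
# SERVER, OA_USER, OA_PASS and the endpoint paths are assumptions; adjust for your install.
SERVER="http://openaudit.example.com/open-audit/index.php"
OA_USER="admin"
OA_PASS="password"

# Log on and store the session cookie
curl -s -c /tmp/oa_cookie -d "username=${OA_USER}&password=${OA_PASS}" "${SERVER}/logon"

# Retrieve the device list in JSON for a downstream integration step to process
curl -s -b /tmp/oa_cookie "${SERVER}/devices?format=json" -o /tmp/openaudit_devices.json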

Existing Device Replaced (same IP and/or device Name)

It is not uncommon, especially in WISPs like NextLink, to regularly swap out in-field equipment due to failure or simply as part of a planned upgrade. Depending on your configuration of Open-AudIT, these replacement devices can be categorized either as a new device (the usual case) or as an overwrite of an existing device entry (considered an edge case). As a result, opIntegration will either add the new device as previously described or update an existing device entry in NMIS with the new device type.

 

Device Retired/Removed from Network

Queries for devices not seen for several audit cycles are already included with Open-AudIT. Once a device has exceeded a given period (not seen for y audit cycles or x number of days) then a custom query would be used by opIntegration to retrieve that list, and set those devices to inactive in NMIS, effectively retiring them without deleting their historical data. Permanently removing the device from NMIS would remain a manual, user-initiated step.
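
To make the retirement query concrete, here is a minimal sketch of the kind of "not seen recently" query that could drive this step, run via the mysql client. The database name, credentials, table, and last_seen column are assumptions about a typical Open-AudIT schema, not the exact query NextLink used.

#!/bin/bash
# Hypothetical sketch: list devices Open-AudIT has not seen in the last 30 days.
# Database name, credentials, and column names are assumptions; adjust for your install.
mysql -u openaudit -p'openaudit_password' openaudit -e "
  SELECT system.id, system.name, system.ip, system.last_seen
  FROM system
  WHERE system.last_seen < DATE_SUB(NOW(), INTERVAL 30 DAY);"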

 

Add Device(s) Manually to NMIS

In addition to creating an automation path, it was imperative that the solution allow and account for users manually adding devices to NMIS either through the GUI or some import process.

 

Building the Feeding System

Creating the Proof of Concept

The initial Proof of Concept (POC) leveraged Open-AudIT's powerful API to retrieve a list of devices for each poller. This list was created using custom queries built in Open-AudIT. By using custom queries, users could very granularly control the list of devices being sent to each NMIS poller. Once each poller had its list of devices, opIntegration would then utilize NMIS' Node Administration function to manage adding, updating, and retiring devices from NMIS. A series of configuration files on each NMIS poller would control the Open-AudIT query to be executed and manage specifics like NMIS Group assignment and other parameters. A simple cron job would call opIntegration on whatever cadence the client desired.
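
As a rough sketch of that last step, a cron entry on each poller might look like the following; the script path, flags, and schedule are illustrative assumptions rather than the shipped opIntegration defaults.

# Hypothetical /etc/cron.d entry on an NMIS poller: run the integration nightly at 02:00.
# The script path is a placeholder; consult the opIntegration documentation for the
# actual command shipped with your release.
0 2 * * * root /usr/local/omk/bin/opintegration.pl > /var/log/opintegration.log 2>&1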

User Acceptance Testing (UAT) went well, with only minor changes to the initial code base, primarily in the areas of debugging and visual presentation. After operating successfully on-premises with NextLink for 90 days, the solution passed Opmantek's internal tests and validations and was determined stable enough for inclusion in shipping code.

 

Next Steps in Automation

opIntegration will be included natively starting with the pending release of Open-AudIT 3.1. While the POC version was driven from the command line, Open-AudIT 3.1 will include a fully detailed GUI under Manage -> Integrations, to make configuration straightforward for all users. While the GUI will be designed to configure a single server (i.e. Open-AudIT and NMIS installed on the same server), the Integration can be used to set up configuration with remote NMIS platforms by copying the resulting configuration and script files to the remote NMIS server. Integrations can also be scheduled like any other Task in Open-AudIT, providing a simple GUI to create a detailed schedule.

How it looks

From the Open-AudIT GUI, navigate to Manage -> Integrations -> List Integrations.

[Image: Integrations 1]
This will provide you with a list of all integrations that have been created.
[Image: Integrations 2]
Clicking the blue details icon will give you a summary of the integration. If you have not run it before, the green execute button will launch the process for you.
[Image: Integrations 3]
By clicking the devices tab you will see exactly which devices are included in the integration.
[Image: Integrations 4]

Conclusion

Operational Process Automation is a large concept, often traversing multiple processes and stages. However, by prioritizing problem points, identifying manpower-intensive steps, and focusing automation efforts on those items, you can achieve significant improvements in performance, reliability, and satisfaction. With the new Integration routines built into Open-AudIT Professional and Enterprise, users can easily automate the feeding and watering of NMIS for live performance and fault monitoring.

"With the integration of these two powerful systems, it has given us the automation that we have dreamed of in Operations. No longer are there missing gaps in monitoring or inventory, nor do you have to worry about the device model being incorrect as the system does it for you."

Ross Tanner, Director of Network Operations, NextLink Internet


How to Purchase Open-AudIT Professional

Getting Open-AudIT Professional has never been easier.

 

The Discovery and Audit software used across 95% of the globe can be yours with a few easy steps.

This guide assumes that you have Open-AudIT installed and running; if you aren't at that stage yet, these links will help you:

Once you have Open-AudIT installed, you can navigate to the 'Upgrade Licenses' menu item and click on 'Buy More Licenses'.

This will bring up the feature and price list for Open-AudIT. Click on the node count that suits your needs.

Currently, only Professional licenses can be purchased online; if you wish to purchase an Enterprise license, you can request a quote from our team.

The next screen will confirm your selection and you can proceed to the checkout.

Fill in all the details that you would like associated with the account; the email address will be used to create an account that is required to access licenses and support.
Once the payment has been processed, our team will email you a confirmation and a license key for the software. To add this, navigate back to the 'Upgrade Licenses' menu item, this time clicking 'Restore My Licenses'. N.B. The license will be automatically added to your account if you have an Opmantek User account – register here!

Click on the 'Enter License Key' button and that will show a text box for you to paste the license key into and add it to your profile.

After that, you will have full access to Open-AudIT Professional.


A Primer in Root Cause Analysis

We've seen it time and time again: a ticket comes into the help desk, a customer is complaining about a slow application or poor voice quality during a call. We start digging into the problem, maybe pull some logs, grab some performance data from the NMS. Everything we find is inconclusive, and when we check back with the client the symptoms have passed. The problem is gone, and another ticket is in the queue, so we move on – no wiser as to what caused the issue – knowing that it will reappear.

The process of getting to the core, or root of a fault or problem is called Root Cause Analysis (RCA). Root Cause Analysis is not a single, stringent process, but rather a collection of steps, often organized specifically by type of problem or industry, which can be customized for a particular problem. When it comes to identifying the root cause of an IT-related event a combination of process-based and failure-based analysis can be employed. By applying an RCA process, and remediating issues to prevent their future occurrence, reactive engineering teams can be transformed into proactive ones that solve problems before they occur or escalate.

In this article I will attempt to outline a general process for troubleshooting network-related events, meaning those issues which directly impact the performance of a computer network or application resulting in a negative impact on user experience. While I will use Opmantek’s Solutions in the examples, these steps can be applied to any collection of NMS tools.

 

Introduction to Root Cause Analysis

Describing the RCA process is like peeling back an onion. Every step of the process is itself composed of several steps. The three main steps of the RCA process are included below. The first two steps are often completed in tandem, either by an individual or by a team in a post-mortem incident review meeting.

  1. Correctly and completely identify the event or problem,
  2. Establish a timeline from normal operation through to the event,
  3. Separate root causes from causal factors

 

Identifying the Event or Problem

Completely and accurately identifying the problem or event is perhaps the easiest part of RCA when it comes to networking issues.

That’s right, I saidĀ easiest.

It's easy because all you have to do is ask yourself Why? When you have an answer to the question Why, ask yourself why that thing occurred. Keep asking yourself Why until you can't ask it anymore – usually that's somewhere from 4-5 times. This process is often referred to as the 5 Whys.

Many engineers advocate utilizing an Ishikawa, or fishbone diagram to help organize the answers you collect to the 5 Whys. I like this myself, and often utilize a whiteboard and sticky notes while working the problem. If you prefer using a software diagramming tool that’s fine, just use what is comfortable for you.

 

Example – The Power of Why

Here’s a real-world example Opmantek’s Professional Services team encountered while conducting onsite training in system operation. An internal user called into the client’s help desk and reported poor audio quality during a GoToMeeting with a potential customer.

  1. Why? – A user reported poor voice quality during a GoToMeeting (first why)
  2. Why? – The router interface serving the switch to the user's desktop is experiencing high ifOutUtil (second why)
  3. Why? – Server backups are running during business hours (third why)
  4. Why? – The cron job running the backup scripts is set to run at 9 pm local timezone (fourth why)
  5. Why? – The server running the cron job is configured in UTC (fifth why)

 

The team started with the initial problem as reported and asked themselves Why is this happening. From there, they quickly came up with several spot checks and pulled performance data from the switch the user's desktop was connected to, and the upstream router for that switch; this identified a bandwidth bottleneck at the router and gave them their second Why.

Once the bandwidth bottleneck was identified, the engineers used our solutions to identify where the traffic through the router interface was originating from. This gave them the backup server, and a quick check of running tasks identified the backup job – and there the third Why was identified.

System backups were handled by a cron job, which was scheduled for 9 pm. A comment in the cron job suggested this was meant to be 9 pm local timezone (EST) to the server's physical location. This gave the team the fourth Why.

A check of the server's date and time indicated the server was configured for UTC, which gave them the fifth Why.

Not every problem analysis will be this simple, or straightforward. By organizing your Why questions, and their answers, into a fishbone diagram you will identify causes (and causal factors) leading to a problem definition and root cause. In short, keep asking Why until you can't ask it any further – this is usually where your problem starts.

 

Establish a Timeline

When establishing a timeline it's important to investigate both the micro (this event's occurrence) and the macro (has this event occurred in the past). Thinking back to grade school mathematics, I draw a timeline, a horizontal line, across my whiteboard. In the center I place a dot – this is T0 (time zero), when the event occurred.

To the right of T0 I add tick marks for when additional details were discovered, when the user reported the issue, and when we collected performance or NetFlow information. I also add in marks for when other symptoms occurred or were reported, and for any additional NMS-raised events.

To the left of T0, I place everything we learned from asking Why – when did the backups start, when should they have started? I also review my NMS for events leading up to the performance issue; was interface utilization slowly on the rise, or did it jump dramatically?

Once I have mapped the micro timeline for this one occurrence I begin to look back through my data. This is where having a good depth of time-related performance information comes in handy. Opmantek’s Network Management Information System (NMIS) can store detailed polling data indefinitely which allows fast visual analysis for time-based recurring events.

[Image: Timeline]

Example – The Power of Time

As the engineers worked through their Why questions and built a fishbone diagram, they also created a timeline of the event.

They started by defining T0 as when the event was reported, but as they collected data they adjusted this to when the impact on the network actually started.

To the right of T0, they added in when the user reported the problem, when they started the event analysis, when performance data was collected from the switch and router, and the NetFlow information from the NetFlow collector. They also added marks for when other users reported performance impacts, and when NMIS raised events for rising ifOutUtil on both the router and backup server interfaces.

To the left of T0, they added when the backups started as well as when they should have started. They reviewed NMIS and found the initial minor, major, and warning events for rising ifOutUtil on the router interface.

Once the timeline was complete the engineering team went on to look for past occurrences of this event. By widening the scale on the router's interface graphs the engineers could instantly see this interface had been reporting high ifOutUtil at the same time every weekday for several weeks. This cyclic behavior suggested it was a timed process and not a one-time user-related event.

 

Root Causes vs. Causal Factors

As you build out and answer your Why questions you will inevitably discover several possible endpoints, each a potential root cause. However, some of these will simply be issues caused by the root cause – causal factors – and not the root cause itself.

Identifying these causal factors for what they are, and not misrepresenting them as the root cause, is critical to effecting long-term improvements in network performance.

 

Example – Distracted by a Causal Factor

The engineering team quickly realized that all servers had, at one time in the past, been configured for local timezones and had only recently been standardized to UTC. While a process had been in place to identify schedules, like this cron job, and update them for the change to UTC, this one had been missed.

Some members of the team wanted to stop here and fix the cron schedule. However, the wider group asked: Why was a cron job for a critical process, on a critical server, missed in the update process?

Pulling the list of processes and files from the update team's records showed this file HAD been identified and updated, and testing had been completed and verified. This brought about the next question: Why was the updated cron job changed, and by whom or what process?

While you can address causal factors they are often just a temporary fix or workaround for the larger issue. Sometimes this is all that can be done at the time, but if you don’t identify and completely address the root cause any temporary solutions you apply to causal factors will break down over time.

 

Example – Finding the Root Cause

Upon digging further, the engineers discovered that the cron job had been properly updated, but an older archived version of the cron job had been copied onto the server via a DevOps process. A Tiger Team was put together to research the DevOps archive and determine the extent of the problem. The Tiger Team reported back to the engineering group the next day; other outdated schedule files were found and also corrected. The engineering team worked with the DevOps engineers to put a process in place to keep the DevOps file archive updated.

 

Closing Out the Event

At this point, you've completed the Root Cause Analysis and identified the core issue causing the reported performance degradation. Since you've addressed the root cause, this issue should not recur. However, you can't stop here – there are two follow-up steps that are critical to your future success as an engineering organization –

 

  1. Document the issue
    I like to use a centralized wiki, like Atlassian’s Confluence, to capture my organization’s knowledge in a central repository available for the entire engineering team. Here I will document the entire event, what was reported by the user, the performance information we captured, the RCA process and the end result – how and what we did to prevent this from happening again. Using tools like Opmantek’s opEvents, I can then relate this wiki entry to the server, router, interfaces, and ifOutUtil event so if it reoccurs future engineers will have this reference available to them.

 

  2. Follow-Up
    The root cause has been identified, a remediation put in place, and a process developed to preclude it from happening again. However, this doesn't mean we're done with the troubleshooting. There are times when what we assume is the root cause is, in fact, just a causal factor. If that is the case, this problem will reassert itself, as the solution was only a workaround for the true problem. The answer is to put a process in place, usually through a working group or engineering team meeting, to discuss user-impacting events and determine if they relate to either open or remediated RCA processes.

 

What I've presented here is a simplified root cause analysis process. There are additional steps that can be taken depending on the type of equipment or process involved, but if you start here you can always grow the process as your team absorbs the methodology.

 

Best,


Automated Remediation and Getting a Good Night's Sleep

A major PaaS provider offering a non-stop compute platform needed to automate their way around recurring issues to continue guaranteeing their millisecond data loss and recovery SLAs, giving them time to diagnose and remove the underlying problems.

I assisted one of the engineers from a major PaaS provider the other week so he could get back to having a good night's sleep.

The company he works for offers a non-stop IBM PowerPC compute platform, and he needed to automate activities his engineers were doing so the platform could be relied on to continue guaranteeing the millisecond data loss and recovery SLAs it was designed for. They knew they needed time to diagnose and remove the underlying problems, and so needed a fast and reliable way to fix the issues as they occurred in the meantime.

This blog describes what was done in this specialist environment, but it provides a great example of applying remediation automation in any environment. This service provider happens to offer an IBM PowerPC Platform as a Service to banks, utilities, telcos and the like, making use of two clusters in two data centres; cross-site replication provides High Availability and zero data loss failover. The engineers use NMIS, opEvents and opConfig for managing the whole environment. NMIS is used to collect statistics, state and events from the IBM PowerPC cluster instances; NMIS also collects data from the member servers, the Fibre Channel and Ethernet switching, and the SAN. Having NMIS meant visibility across the entire environment and all the way down to OS health, particularly the underlying daemons, services etc. on the PowerPC cluster, in this case making use of NMIS's Service management and plugin capabilities to monitor the IBM systems.

The team were making use of NMIS's server management functions to collect state and performance data from several Nagios plugins for the PowerPC servers. NMIS and opEvents were successfully alerting the team to the fact that SVC replication was failing occasionally by sending notifications via SMS and email to the right teams. The team were responding to these by following a process to restart the SVC service on the machines – of course, this was usually in the middle of the night! They needed a way to automate this remediation task quickly, so here is what they did in about 20 minutes, without spending money.

First they read the docs on Automating Event Actions in opEvents.

Next they looked at the SVC replication events in opEvents by looking at the details tab for one of the previous events. It was decided they only wanted this triggered if the alert was "Service Down" rather than "Service Degraded", and they only wanted it to happen if the service was down on the Primary site, not if it was down on the Secondary site. In the details tab of the event they noted the following event attributes:
event => Service Down
element => svc_lun_inservice
host => primary_cluster.example.com

Next they tested a shell script they had used for restarting the svc service remotely – this was simply three remote ssh commands they had been issuing by hand. They placed it on the NMIS server in:
/usr/local/nmis8/bin/restart_svc.sh
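
The original script is not reproduced in the post; the sketch below is a hypothetical stand-in showing the shape of such a script – three ssh commands to stop, start, and verify the service – where the remote user, hostname default, and service commands are assumptions, not the provider's actual procedure.

#!/bin/bash
# Hypothetical sketch of restart_svc.sh: restart the SVC replication service on a remote host.
# The remote user and the stop/start/status commands are placeholders; substitute your own.
TARGET="${1:-primary_cluster.example.com}"   # opEvents passes node.host as the first argument

echo "Stopping svc on ${TARGET}"
ssh admin@"${TARGET}" "sudo service svc stop"

echo "Starting svc on ${TARGET}"
ssh admin@"${TARGET}" "sudo service svc start"

echo "Verifying svc status on ${TARGET}"
ssh admin@"${TARGET}" "sudo service svc status"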

The final piece of the puzzle was to call the script when the event details matched.

So they edited the EventActions.nmis file:
/usr/local/omk/conf/EventActions.nmis

The following was added:

1. A script action – added to the 'script' section of EventActions.nmis (copying one of the example scripts as a starting point) they created the following:
'script' => {
  'restart_svc' => {
    arguments => 'node.host',              # this script takes the node's IP or hostname
    exec => '/usr/local/nmis8/bin/restart_svc.sh',
    output => 'save',                      # save a copy of the script's stdout, which echoes what it is doing
    stderr => 'save',
    exitcode => 'save',
    max_tries => 2,
  },
},

2. A matching rule – added near the top of EventActions.nmis in the appropriate location (again, the easiest approach was copying and editing one of the existing entries).
'20' => {
  IF => ' event.event eq "Service Down" and event.element eq "svc_lun_inservice" and node.host eq "primary_cluster.example.com" ',
  THEN => 'script.restart_svc()',   # note this name matches the name in the script section above
  BREAK => 'false'
},

Finally they needed a way to test it without actually doing anything or breaking anything. So they first edited the restart script slightly so that it only echoed the key commands it would issue on the server. They then found a copy of the Service Down event in /usr/local/nmis/logs/event.log, copied it to the clipboard, changed the datestamp (epoch) to now, and appended it back to the log:
echo {the edited event line} >> event.log
and watched opEvents for the event to appear.

They then looked at the actions section of the event in the GUI and were able to see the script had fired, and could see everything the script had done, from its output to its exit code to its stderr messages.

Finally they changed the script so it actually ran the commands and went home.

That night the svc replication failed – they didn't get emailed or SMSed about it this time though, as the system repaired itself immediately and before the 5 minute escalation time had passed. Job done.

In the meantime, after a month of further diagnostics with the network carrier, the data-centre provider, the SAN vendor and a team of others, they found the underlying issue for the LUN replication: a timeout problem related to cross-site Fibre Channel and the additional backup loads happening at night. A couple of timers changed and all good.

 

Get the E-book here


Why Managed Service Providers Should Consider a Self-hosted RMM Solution Over Software As a Service

With a growing dependence on the internet, many small and medium-sized organisations are opting out of managing their own networks and handing the reins over to Managed Service Providers. Managed services is a fast-growing industry that is expected to reach $193.34 billion with 12.5% CAGR by 2019, and Opmantek has long been recognised as a leading self-hosted RMM software provider to some of the biggest players in the industry.

In recent times, there has been a shift in the market to Software as a Service (SaaS) purchasing, and many vendors now offer cloud-based solutions as a 'simple' software play for MSPs. However, we have found that our customers require flexibility, and cloud is a one-size-fits-most service that is not capable of supporting all network devices. Every day we are getting more and more enquiries from MSPs looking to regain control of their network by taking back ownership of their network management system.

Here are the top reasons our customers prefer on-premise or private cloud hosted Opmantek software over SaaS.

100% visibility and control

SaaS-based remote management systems are often a good option for micro MSPs because the ease of deployment and monthly subscription payments for each monitored instance make budgeting easy – until your client base starts to grow. The devices under management become more obscure as the size of the networks you are managing increases. That's when you start to lose network visibility, the event, health and performance data starts to lose quality, and additional licensing costs associated with more complex network architectures begin to emerge.

Opmantek Software can be deployed in the cloud or on-premise, but because you retain ownership of the database and have access to the source code at the core of NMIS, you maintain architectural, device and cost control.

 

Flexible integration

It is unlikely that any business will be implementing network management software for the first time when they come to search for a solution. It is highly likely that they will already have a number of products performing different functions within their network environment, and they may not wish to, or be able to, replace them all at once.

Opmantek supports a huge range of integration options through REST APIs (HTTP(S)), batch operations, and information provided in JSON files or CSV forms.

ServiceNow API integration is tried and tested in both directions, for incident feeds as well as CMDB asset feeds.

Opmantek offers end-to-end solutions; you can implement them in stages (or only implement some of the features) and be assured that whatever other platforms you are working with, you will have the ability to integrate and feed data between your systems and maintain complete visibility in your Opmantek dashboards.

Tailored Business Requirements and Ownership of data

One size fits all rarely works for large network environments, but it is this cookie-cutter system approach that is standard with SaaS RMM software providers.

"Configuration over customization" has long been a core value of our development team – rather than creating new versions of software for individual customers, we co-create software features with our customers and make them available and configurable for the hundreds of thousands of organisations that have used Opmantek's software over the years. Almost every conceivable business requirement is catered for using configurations. Fault configuration, for example, can be adjusted in hugely flexible ways with our modelling policy engine, enabling one to make use of any metadata about a device to selectively apply fault and performance management. opEvents enables event management, correlation, escalation, network remediation automation, and network fault analysis automation to be configured in an infinitely flexible manner to suit customer differences and your NOC.

Unlimited Scalability

With the number of connected devices within every organization increasing exponentially over recent years, the ability to grow and scale to meet the needs of a business, without eroding profits for managed service providers, is becoming more and more critical. As Managed Service Businesses grow, many SaaS providers force their users to compromise on the richness of data and troubleshooting capabilities by requesting paid upgrades to store more historical data.

At Opmantek we provide unparalleled scalability and ease of scaling.

We are highly efficient in our polling, so you gain the most from each poller, with thousands of devices per server instance even with full-featured polling. We also enable you to use as many pollers as you wish using opHA; we haven't seen a limit to how many devices or elements one can manage on a single system. MongoDB provides cost-effective storage of as much data as you choose to store, making trending information and machine learning capabilities far more effective.

We are regularly approached to replace other major vendors' products because they are unable to scale, are enormously costly to scale, or lose functionality as they do.

Want to learn more about how Opmantek's RMM solutions can increase your network visibility, deliver unmatched automation and save money for your Managed Service organization? Click here to request a personalised demo from one of our engineers today!


Enhancing Event Management Using Live Real World Data

Overview

When dealing with thousands of events coming into your NOC from many locations and different customers, operators rely on useful information to help them make sense of the events pouring through the NOC.

Using opEvents, it is relatively easy to bring just about any data source into your event feed so that the Operations team has improved context for what is happening and ultimately what might be the root cause of the network outage they are currently investigating.

Using Twitter Feeds for Event Management

If you look into Twitter, you will find many Government and other organisations using Twitter to issue alerts and make announcements. A little bit of Googling and I found some excellent Twitter feeds for severe weather, general weather and earthquake tweets. By monitoring these in opEvents, you end up with tweets visualized in your overall Event Management view.

[Image: opEvents MGMT View]

Useful Twitter Feeds

Severe Weather

Weather Tweet

Earthquake Tweet

Listening to Twitter Feeds

There are several ways to listen to Twitter feeds. The quickest one for me was to use Node-RED, something I use for Home Automation and IoT-like applications. Configuring Node-RED with the feed data above and then creating an opEvents JSON event was very straightforward.

[Image: Node-RED Configuration View]

The code included in the node "Make Event" is below. It creates a JSON document with the correct payload, which is a compatible opEvents JSON event (JSON events are a really great way to deal with events), then writes it to a file:

// Only process English-language tweets
if ( msg.lang === "en" ) {
    // capture the original tweet text and topic before overwriting the payload
    var details = msg.payload;
    var event = msg.topic;
    var timenow = Date.now();
    // unique filename per event, used by a downstream file node to write it out
    msg.filename = "/data/json-events/event-" + timenow + ".json";
    // build the opEvents-compatible JSON event
    msg.payload = {
        node: "twitter",
        event: event,
        element: "Sentiment: " + msg.sentiment.score,
        details: details,
        sentiment_score: msg.sentiment.score
    };
    return msg;
}

Getting Twitter Events into opEvents

Now we have a well-formed JSON document with the necessary fields, opEvents will consume that once told which directory to look into.

I added the following to opCommon.nmis in the opevents_logs section and restarted the opEvents daemon, opeventsd.

'nmis_json_dir' => [
  '/data/json-events'
],
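
If you want to check the plumbing before wiring up Node-RED, you can drop a hand-crafted event file into that directory and watch opEvents pick it up. The sketch below reuses the same fields the function node above emits; the specific field values are just an illustration.

# Write a hypothetical test event into the watched directory; opeventsd should consume it.
cat > /data/json-events/event-test-$(date +%s).json <<'EOF'
{
  "node": "twitter",
  "event": "Severe weather warning issued",
  "element": "Sentiment: -3",
  "details": "Test tweet text goes here",
  "sentiment_score": -3
}
EOF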

The result can be seen in opEvents when you drill into the "twitter" node (you could, of course, call this node anything you like, e.g. "weather" or "earthquake").

[Image: opEvents Twitter Feed]

Clicking on one of the weather events with a high sentiment score (more on that in a second), you can see more details about this event and what impact it might have. Unfortunately we have a Tropical Cyclone in North Queensland at the moment; hopefully, no one will be injured.

[Image: opEvents Event View]

Enriching the Tweet with a Sentiment Score

The sentiment score is a heuristic which calculates how positive or negative some text is, i.e., what the sentiment of that text is. The text analysis looks for keywords and computes a score; in opEvents, we use this score to set the priority of the event so that we can better see the more critical weather events, because the sentiment of those tweets will be negative.

In opEvents' EventActions.nmis I included some event policy to set the event priority based on the sentiment score, which was an event property carried across from Node-RED. This carries through the rest of opEvents automagically.

'15' => {
  IF => 'event.sentiment_score =~ /\d+/',
  THEN => {
    '5' => {
      IF => 'event.sentiment_score > 0',
      THEN => 'priority(2)',
      BREAK => 'false'
    },
    '10' => {
      IF => 'event.sentiment_score == -1',
      THEN => 'priority(3)',
      BREAK => 'false'
    },
    '20' => {
      IF => 'event.sentiment_score == -2',
      THEN => 'priority(4)',
      BREAK => 'false'
    },
    '30' => {
      IF => 'event.sentiment_score == -3',
      THEN => 'priority(5)',
      BREAK => 'false'
    },
    '40' => {
      IF => 'event.sentiment_score < -3',
      THEN => 'priority(8)',
      BREAK => 'false'
    },
  },
  BREAK => 'false'
},

Because opEvents uses several techniques to make integration easy, I was able to get the tweets into the system in less than one hour (originally I was monitoring tweets about the Tour de France), then I spent a little more time looking for interesting weather tweets and refining how the events were viewed (another hour or so).

Summing Up

If you would like an event management system which can easily integrate any type of data from virtually any source into your workflow, then opEvents could be the right solution for you. As a bonus, you can watch the popularity of worldwide sporting events like the Tour de France.

Monitoring Tour de France Tweets with opEvents

[Image: opEvents Tour de France View]