How to Purchase Open-AudIT Professional

Getting Open-AudIT Professional has never been easier.

 

The Discovery and Audit software used across 95% of the globe can be yours with a few easy steps.

This guide assumes that you have Open-AudIT installed and running. If you aren’t at that stage yet, these links will help you:

Once you have Open-AudIT installed, you can navigate to the ‘Upgrade Licenses’ menu item and click on ‘Buy More Licenses’.

This will bring up the feature and price list for Open-AudIT. Click on the node count that suits your needs.

Currently, only the Professional license can be purchased online. If you wish to purchase an Enterprise license, you can request a quote from our team.

The next screen will confirm your selection and you can proceed to the checkout.

Fill in all the details you would like associated with the account. The email address will be used to create an account, which is required to access your licenses and support.
Once the payment has been processed, our team will email you a confirmation and a license key for the software. To add the key, navigate back to the ‘Upgrade Licenses’ menu item, this time clicking ‘Restore My Licenses’. N.B. The license will be added to your account automatically if you have an Opmantek User account – register here!

Click on the ‘Enter License Key’ button; a text box will appear where you can paste in your license key and add it to your profile.

After that, you will have full access to Open-AudIT Professional.


A Primer in Root Cause Analysis

We’ve seen it time and time again: a ticket comes into the help desk, and a customer is complaining about a slow application or poor voice quality during a call. We start digging into the problem, maybe pull some logs, grab some performance data from the NMS. Everything we find is inconclusive, and when we check back with the client the symptoms have passed. The problem is gone, and another ticket is in the queue, so we move on – no wiser as to what caused the issue – knowing that it will reappear.

The process of getting to the core, or root of a fault or problem is called Root Cause Analysis (RCA). Root Cause Analysis is not a single, stringent process, but rather a collection of steps, often organized specifically by type of problem or industry, which can be customized for a particular problem. When it comes to identifying the root cause of an IT-related event a combination of process-based and failure-based analysis can be employed. By applying an RCA process, and remediating issues to prevent their future occurrence, reactive engineering teams can be transformed into proactive ones that solve problems before they occur or escalate.

In this article I will attempt to outline a general process for troubleshooting network-related events, meaning those issues which directly impact the performance of a computer network or application resulting in a negative impact on user experience. While I will use Opmantek’s Solutions in the examples, these steps can be applied to any collection of NMS tools.

 

Introduction to Root Cause Analysis

Describing the RCA process is like peeling back an onion. Every step of the process is itself comprised of several steps. The three main steps of the RCA process are included below. The first two steps are often completed in tandem, either by an individual or by a team in a post-mortem incident review meeting.

  1. Correctly and completely identify the event or problem,
  2. Establish a timeline from normal operation through to the event,
  3. Separate root causes from causal factors

 

Identifying the Event or Problem

Completely and accurately identifying the problem or event is perhaps the easiest part of RCA when it comes to networking issues.

That’s right, I said easiest.

It’s easy because all you have to do is ask yourself Why? When you have an answer to the question Why, ask yourself why that thing occurred. Keep asking yourself Why until you can’t ask it anymore – usually that’s somewhere around four to five times. This process is often referred to as the 5 Whys.

Many engineers advocate utilizing an Ishikawa, or fishbone diagram to help organize the answers you collect to the 5 Whys. I like this myself, and often utilize a whiteboard and sticky notes while working the problem. If you prefer using a software diagramming tool that’s fine, just use what is comfortable for you.

 

Example – The Power of Why

Here’s a real-world example Opmantek’s Professional Services team encountered while conducting onsite training in system operation. An internal user called into the client’s help desk and reported poor audio quality during a GoToMeeting with a potential customer.

  1. Why? – A user reported poor voice quality during a GoToMeeting (first why)
  2. Why? – The router interface servicing the switch to the user’s desktop was experiencing high ifOutUtil (second why)
  3. Why? – Server backups were running during business hours (third why)
  4. Why? – The cron job running the backup scripts was set to run at 9 pm in the local timezone (fourth why)
  5. Why? – The server running the cron job was configured in UTC (fifth why)

 

The team started with the initial problem as reported and asked themselves why this was happening. From there, they quickly came up with several spot checks and pulled performance data from the switch the user’s desktop was connected to, and the upstream router for that switch; this identified a bandwidth bottleneck at the router and gave them their second Why.
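
As a rough illustration of the kind of spot check involved, output utilization (ifOutUtil) on an interface can be estimated from two SNMP ifOutOctets samples and the interface speed. The sketch below is only a minimal example of that calculation, assuming you already have the two counter readings to hand (an NMS such as NMIS collects and graphs this for you); the function name and sample values are hypothetical.

def if_out_util(octets_t1, octets_t2, seconds, if_speed_bps):
    """Estimate output utilization (%) from two ifOutOctets samples taken `seconds` apart.

    if_speed_bps is the interface speed in bits per second (e.g. 100_000_000 for 100 Mbit).
    Counter wrap handling is omitted for brevity.
    """
    bits_sent = (octets_t2 - octets_t1) * 8
    return 100.0 * bits_sent / (seconds * if_speed_bps)

# Two samples taken 300 seconds apart on a 100 Mbit interface:
print(round(if_out_util(1_000_000, 3_500_000_000, 300, 100_000_000), 1))  # ~93.3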

Once the bandwidth bottleneck was identified, the engineers used our solutions to identify where the traffic through the router interface was originating from. This gave them the backup server, and a quick check of running tasks identified the backup job – and there the third Why was identified.

System backups were handled by a cron job, which was scheduled for 9 pm. A comment in the cron job suggested this was meant to be 9 pm in the timezone local to the server’s physical location (EST). This gave the team the fourth Why.

A check of the server’s date and time indicated the server was configured for UTC, which gave the team the fifth Why.

Not every problem analysis will be this simple, or straightforward. By organizing your Why questions, and their answers, into a fishbone diagram you will identify causes (and causal factors) leading to a problem definition and root cause. In short, keep asking Why until you can’t ask it any further – this is usually where your problem starts.

 

Establish a Timeline

When establishing a timeline it’s important to investigate both the micro (this event’s occurrence) and the macro (has this event occurred in the past).  Thinking back to grade school mathematics, I draw a timeline, a horizontal line, across my whiteboard. In the center I place a dot – this is T0 (time zero) when the event occurred.

To the right of T0 I add tick marks for when additional details were discovered, when the user reported the issue, and when we collected performance or NetFlow information. I also add in marks for when other symptoms occurred or were reported, and for any additional NMS raised events.

To the left of the T0, I place everything we learned from asking Why – when did the backups start, when should they have started? I also review my NMS for events leading up to the performance issue; was interface utilization slowly on the rise, or did it jump dramatically?

Once I have mapped the micro timeline for this one occurrence I begin to look back through my data. This is where having a good depth of time-related performance information comes in handy. Opmantek’s Network Management Information System (NMIS) can store detailed polling data indefinitely which allows fast visual analysis for time-based recurring events.

[Figure: Timeline]

Example – The Power of Time

As the engineers worked through their Why questions and built a fishbone diagram, they also created a timeline of the event.

They started by defining T0 as when the event was reported but, as they collected data, adjusted this to when the impact on the network actually started.

To the right of T0, they added in when the user reported the problem, when they started the event analysis, when performance data was collected from the switch and router, and when the NetFlow information was pulled from the NetFlow collector. They also added marks for when other users reported performance impacts, and when NMIS raised events for rising ifOutUtil on both the router and backup server interfaces.

To the left of T0, they added when the backups started as well as when they should have started. They reviewed NMIS and found the initial minor, major, and warning events for rising ifOutUtil on the router interface.

Once the timeline was complete, the engineering team went on to look for past occurrences of this event. By widening the scale on the router’s interface graphs, the engineers could instantly see this interface had been reporting high ifOutUtil at the same time every weekday for several weeks. This cyclic behavior suggested it was a timed process and not a one-time, user-related event.
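
If you have the historical samples to hand, the same check can be done programmatically as well as visually. The sketch below is a hypothetical illustration: it simply groups utilization samples by hour of day and reports the hours whose average exceeds a threshold, which for this event would have pointed straight at the 9 pm backup window.

from collections import defaultdict
from statistics import mean

def recurring_hours(samples, threshold=80.0):
    """samples: a list of (timestamp, utilization %) tuples, where timestamp is a datetime.
    Returns the hours of day whose average utilization exceeds the threshold."""
    by_hour = defaultdict(list)
    for ts, util in samples:
        by_hour[ts.hour].append(util)
    return sorted(hour for hour, values in by_hour.items() if mean(values) > threshold)

# e.g. recurring_hours(samples) -> [21]  # would point straight at the 9 pm backup window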

 

Root Causes vs. Causal Factors

As you build out and answer your Why questions you will inevitably discover several possible endpoints, each a potential root cause. However, some of these will simply be issues caused by the root cause – causal factors – and not the root cause itself.

Identifying these causal factors for what they are, and not misrepresenting them as the root cause, is critical to effecting long-term improvements in network performance.

 

Example – Distracted by a Causal Factor

The engineering team quickly realized that all servers had, at one time in the past, been configured for local timezones and had only recently been standardized to UTC. While a process had been in place to identify schedules, like this cron job, and update them for the change to UTC, this one had been missed.

Some members of the team wanted to stop here and fix the cron schedule. However, the wider group asked: Why was a cron job for a critical process, on a critical server, missed in the update process?

Pulling the list of processes and files from the update team’s records showed this file HAD been identified and updated, and testing had been completed and verified. This brought about the next question: Why was the updated cron job changed, and by whom or by what process?

While you can address causal factors, doing so is often just a temporary fix or workaround for the larger issue. Sometimes this is all that can be done at the time, but if you don’t identify and completely address the root cause, any temporary solutions you apply to causal factors will break down over time.

 

Example – Finding the Root Cause

Upon digging further, the engineers discovered that the cron job had been properly updated, but an older archived version of the cron job had been copied onto the server via a DevOps process. A Tiger Team was put together to research the DevOps archive and determine the extent of the problem. The Tiger Team reported back to the engineering group the next day; other outdated schedule files were found and also corrected. The engineering team worked with the DevOps engineers to put a process in place to keep the DevOps file archive updated.

 

Closing Out the Event

At this point, you’ve completed the Root Cause Analysis and identified the core issue causing the reported performance degradation. Since you’ve addressed the root cause, this issue should not recur. However, you can’t stop here – there are two follow-up steps that are critical to your future success as an engineering organization:

 

  1. Document the issue
    I like to use a centralized wiki, like Atlassian’s Confluence, to capture my organization’s knowledge in a central repository available for the entire engineering team. Here I will document the entire event, what was reported by the user, the performance information we captured, the RCA process and the end result – how and what we did to prevent this from happening again. Using tools like Opmantek’s opEvents, I can then relate this wiki entry to the server, router, interfaces, and ifOutUtil event so if it reoccurs future engineers will have this reference available to them.

 

  2. Follow-Up
    The root cause has been identified, a remediation put in place, and a process developed to preclude it from happening again. However, this doesn’t mean we’re done with the troubleshooting. There are times when what we assume is the root cause is, in fact, just a causal factor. If that is the case, this problem will reassert itself, as the solution was only a workaround for the true problem. The answer is to put a process in place, usually through a working group or engineering team meeting, to discuss user-impacting events and determine whether they relate to either open or remediated RCA processes.

 

What I’ve presented here is a simplified root cause analysis process. There are additional steps that can be taken depending on the type of equipment or process involved, but if you start here you can always grow the approach as your team absorbs the process and methodology.

 

Best,


Automated Remediation and Getting a Good Night’s Sleep

A major PaaS provider offering a non-stop compute platform needed to automate their way around recurring issues to continue guaranteeing their millisecond data loss and recovery SLAs, giving them time to diagnose and remove the underlying problems.

I assisted one of the engineers from a major PaaS provider the other week so he could get back to having a good night’s sleep.

The company he works for offers a non-stop IBM PowerPC compute platform, and he needed to automate activities his engineers were doing so the platform could be relied on to continue guaranteeing the millisecond data loss and recovery SLAs it was designed for. They knew they needed time to diagnose and remove the underlying problems, and so needed a fast and reliable way to fix the issues as they occurred in the meantime.

This blog describes what was done in this specialist environment, but it provides a great example of applying remediation automation in any environment. This service provider offers an IBM PowerPC Platform as a Service to banks, utilities, telcos and the like, making use of two clusters in two data centres; cross-site replication provides High Availability and zero-data-loss failover. The engineers use NMIS, opEvents and opConfig for managing the whole environment. NMIS is used to collect statistics, state and events from the IBM PowerPC cluster instances, and it also collects data from the member servers, the Fibre Channel and Ethernet switching, and the SAN. Having NMIS meant visibility across the entire environment and all the way down to OS health, particularly the underlying daemons and services on the PowerPC cluster, in this case making use of NMIS’s Service management and plugin capabilities to monitor the IBM systems.

The team was making use of NMIS’s server management functions to collect state and performance data from several Nagios plugins for the PowerPC servers. NMIS and opEvents were successfully alerting the team that SVC replication was failing occasionally, sending notifications via SMS and email to the right teams. The team was responding by following a process to restart the SVC service on the machines – usually, of course, in the middle of the night! They needed a way to automate this remediation task quickly, so here is what they did: about 20 minutes of work, and without spending any money.

First, they read the docs on Automating Event Actions in opEvents.

Next, they looked at the SVC replication events in opEvents, opening the details tab for one of the previous events. They decided they only wanted the action triggered if the alert was “Service Down” rather than “Service Degraded”, and only if the service was down on the Primary site, not the Secondary site. In the details tab of the event they noted the following event attributes:
event => Service Down
element => svc_lun_inservice
host => primary_cluster.example.com

Next, they tested a shell script they had used for restarting the SVC service remotely; it was simply the three remote ssh commands they had been issuing by hand. They placed it on the NMIS server at:
/usr/local/nmis8/bin/restart_svc.sh

The final piece of the puzzle was to call the script when the event details matched.

This meant editing the EventActions.nmis file:
/usr/local/omk/conf/EventActions.nmis

The following was added:

1. A script action – added to the ‘script’ section of EventActions.nmis. Copying one of the example scripts as a starting point, they created the following:
'script' => {
  'restart_svc' => {
    arguments => 'node.host',                        # the script takes the node's IP or hostname
    exec => '/usr/local/nmis8/bin/restart_svc.sh',
    output => 'save',                                # save a copy of the script's stdout, which echoes what it is doing
    stderr => 'save',
    exitcode => 'save',
    max_tries => 2,
  },
},

2. A matching rule – added near the top of EventActions.nmis in the appropriate location (again, the easiest approach was copying and editing one of the existing entries).
'20' => {
  IF => ' event.event eq "Service Down" and event.element eq "svc_lun_inservice" and node.host eq "primary_cluster.example.com" ',
  THEN => 'script.restart_svc()',   # note this name matches the name in the script section above
  BREAK => 'false'
},

Finally, they needed a way to test it without actually doing anything or breaking anything. So they first edited the restart script slightly so that it only echoed the key commands it would issue on the server. They then found a copy of the Service Down event in /usr/local/nmis/logs/event.log, copied it, changed the datestamp (epoch) to now, and appended it back to the log:
"echo {the editied event line} >> event.log"
and watched opEvents for the event to appear.

They then looked at the actions section of the event in the GUI and could see that the script had fired, along with everything the script had done, from its output to its exit code to its stderr messages.

Finally, they changed the script so it actually ran the commands, and went home.

That night the SVC replication failed – but this time they didn’t get emailed or SMSed, as the system repaired itself immediately, before the 5-minute escalation time had passed. Job done.

In the meantime, after a month of further diagnostics with the network carrier, the data-centre provider, the SAN vendor and a team of others, they found the underlying issue with the LUN replication: a timeout problem related to cross-site Fibre Channel and the additional backup loads happening at night. A couple of timers were changed and all was good.

 

Get the E-book here


Why Managed Service Providers Should Consider a Self-hosted RMM Solution Over Software As a Service

With a growing dependence on the internet, many small and medium-sized organisations are opting out of managing their own networks and handing the reins over to Managed Service Providers. Managed services is a fast-growing industry that is expected to reach $193.34 billion with 12.5% CAGR by 2019, and Opmantek has long been recognised as a leading self-hosted RMM software provider to some of the biggest players in the industry.

In recent times, there has been a shift in the market to Software as a Service (SaaS) purchasing, and many vendors now offer cloud-based solutions as a ‘simple’ software play for MSPs. However, we have found that our customers require flexibility, and cloud is a one-size-fits-most service that is not capable of supporting all network devices. Every day we are getting more and more enquiries from MSPs looking to regain control of their network by taking back ownership of their network management system.

Here are the top reasons our customers prefer on-premise or private cloud hosted Opmantek software over SaaS.

100% visibility and control

SaaS-based remote management systems are often a good option for micro MSPs because of the ease of deployment and the monthly subscription payments for each monitored instance, which make budgeting easy – until your client base starts to grow. The devices under management become more obscure as the size of the networks you are managing increases. That’s when you start to lose network visibility: the event, health and performance data starts to lose quality, and additional licensing costs associated with more complex network architectures begin to emerge.

Opmantek Software can be deployed in the cloud or on-premise but because you retain ownership of the database and have access to the source code at the core of NMIS, you maintain architectural, device and cost control.

 

Flexible integration

It is unlikely that any business will be implementing network management software for the first time when they come to search for a solution.  It is highly likely that they will already have a number of products performing different functions within their network environment and they may not wish to, or be able to, replace them all at once.

Opmantek supports a huge range of integration options: REST APIs over HTTP(S), batch operations, and information provided in JSON or CSV format.

ServiceNow API integration is tried and tested in both directions, for incident feeds as well as CMDB asset feeds.

Opmantek offers end-to-end solutions; you can implement them in stages (or only implement some of the features) and be assured that, whatever other platforms you are working with, you will have the ability to integrate and feed data between your systems and maintain complete visibility in your Opmantek dashboards.

Tailored Business Requirements and Ownership of data

One size fits all rarely works for large network environments, but it is this cookie cutter system approach that is standard with SaaS RMM software providers.

“Configuration over customization” has long been a core value of our development team – rather than creating new versions of software for individual customers, we co-create software features with our customers and make them available and configurable for the hundreds of thousands of organisations that have used Opmantek’s software over the years. Almost every conceivable business requirement is catered for using configuration. Fault configuration, for example, can be adjusted in hugely flexible ways with our modelling policy engine, enabling you to make use of any metadata about a device to selectively apply fault and performance management. opEvents enables event management, correlation, escalation, network remediation automation and network fault analysis automation to be configured in an extremely flexible manner to suit customer differences and your NOC.

Unlimited Scalability

With the number of connected devices within every organization increasing exponentially over recent years, the ability to grow and scale to meet the needs of a business, without eroding profits for managed service providers, is becoming more and more critical. As Managed Service businesses grow, many SaaS providers force their users to compromise on the richness of data and troubleshooting capabilities by requiring paid upgrades to store more historical data.

At Opmantek we provide unparalleled scalability and ease of scaling.

Our polling is highly efficient, so you gain the most from each poller, with thousands of devices per server instance even with full-featured polling. We also enable you to use as many pollers as you wish using opHA; we haven’t seen a limit to how many devices or elements one can manage on a single system. MongoDB provides cost-effective storage of as much data as you choose to keep, making trending information and machine learning capabilities far more effective.

We are regularly approached to replace other major vendors’ products because they are unable to scale, are enormously costly to scale, or lose functionality as they do.

Want to learn more about how Opmantek’s RMM solutions can increase your network visibility, deliver unmatched automation and save money for your Managed Service organization?  Click here to request a personalised demo from one of our engineers today!


Enhancing Event Management Using Live Real World Data

Overview

When dealing with thousands of events coming into your NOC from many locations and different customers, operators rely on getting useful information that will help them make sense of the events pouring through the NOC.

Using opEvents, it is relatively easy to bring just about any data source into your event feed so that the Operations team has improved context for what is happening and ultimately what might be the root cause of the network outage they are currently investigating.

Using Twitter Feeds for Event Management

If you look at Twitter, you will find many government and other organisations using it to issue alerts and make announcements. A little bit of Googling turned up some excellent Twitter feeds for severe weather, general weather and earthquake tweets. By monitoring these in opEvents, you end up with the tweets visualized in your overall Event Management view.

[Figure: opEvents Management View]

Useful Twitter Feeds

Severe Weather

Weather Tweet

Earthquake Tweet

Listening to Twitter Feeds

There are several ways to listen to Twitter feeds. The quickest one for me was to use Node-RED, something I use for Home Automation and IoT-like applications. Configuring Node-RED with the feed data above and then creating an opEvents JSON event was very straightforward.

[Figure: Node-RED Configuration View]

The code included in the node “Make Event” is below. It creates a JSON document with the correct payload – a compatible opEvents JSON event (which is a really great way to deal with events) – then writes it to a file:

if ( msg.lang === "en" ) {
    // keep the tweet text and topic before the payload is overwritten
    var details = msg.payload;
    var event = msg.topic;
    var timenow = Date.now();
    // each event is written to its own file for opEvents to pick up
    msg.filename = "/data/json-events/event-" + timenow + ".json";
    // replace the payload with an opEvents-compatible JSON event
    msg.payload = {
        node: "twitter",
        event: event,
        element: "Sentiment: " + msg.sentiment.score,
        details: details,
        sentiment_score: msg.sentiment.score
    };
    return msg;
}
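
For reference, a file written into /data/json-events by this flow would look roughly like the sketch below; the values shown are hypothetical and depend entirely on the tweet and your flow.

{
    "node": "twitter",
    "event": "severe_weather_feed",
    "element": "Sentiment: -3",
    "details": "Severe weather warning issued for damaging winds and heavy rainfall ...",
    "sentiment_score": -3
}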

Getting Twitter Events into opEvents

Now we have a well-formed JSON document with the necessary fields, opEvents will consume that once told which directory to look into.

I added the following to the opCommon.nmis in the section opevents_logs and restarted the opEvents daemon, opeventsd.

'nmis_json_dir' => [
'/data/json-events'
],

The result can be seen well in opEvents when you drill into the “twitter” node (you could, of course, call this node anything you like, e.g. “weather” or “earthquake”).

[Figure: opEvents Twitter Feed]

Clicking on one of the weather events with a high sentiment score (more on that in a second), you can see more details about this event and what impact it might have.  Unfortunately we have a Tropical Cyclone in North Queensland at the moment; hopefully, no one will be injured.

[Figure: opEvents Event View]

Enriching the Tweet with a Sentiment Score

The sentiment score is a heuristic that calculates how positive or negative a piece of text is, i.e., its sentiment. The text analysis looks for keywords and computes a score; in opEvents, we then use this score to set the priority of the event so that we can better see the more critical weather events, because the sentiment of those tweets will be negative.
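
To make the idea concrete, here is a toy, purely illustrative keyword scorer; the real scoring in this setup comes from Node-RED's sentiment node and its word list, so the words and weights below are only assumptions for the sketch.

# Toy keyword-based sentiment scorer, for illustration only.
WORD_SCORES = {
    "destructive": -3, "dangerous": -2, "warning": -1,
    "safe": 1, "improving": 2,
}

def sentiment_score(text):
    """Sum the scores of any known keywords found in the text."""
    return sum(score for word, score in WORD_SCORES.items() if word in text.lower())

print(sentiment_score("Severe weather WARNING: destructive winds expected"))  # -4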

In the opEvents EventActions.nmis I included some event policy to set the event priority based on the sentiment score, which was an event property carried across from Node-RED. This carries through the rest of opEvents automagically.

'15' => {
IF => 'event.sentiment_score =~ /\d+/',
THEN => {
'5' => {
IF => 'event.sentiment_score > 0',
THEN => 'priority(2)',
BREAK => 'false'
},
'10' => {
IF => 'event.sentiment_score == -1',
THEN => 'priority(3)',
BREAK => 'false'
},
'20' => {
IF => 'event.sentiment_score == -2',
THEN => 'priority(4)',
BREAK => 'false'
},
'30' => {
IF => 'event.sentiment_score == -3',
THEN => 'priority(5)',
BREAK => 'false'
},
'40' => {
IF => 'event.sentiment_score < -3',
THEN => 'priority(8)',
BREAK => 'false'
},
},
BREAK => 'false'
},

Because opEvents uses several techniques to make integration easy, I was able to get the tweets into the system in less than one hour (originally I was monitoring tweets about the Tour de France); I then spent a little more time looking for interesting weather tweets and refining how the events were viewed (another hour or so).

Summing Up

If you would like an event management system which can easily integrate data of any type, from virtually any source, into your workflow, then opEvents could be the right solution for you. As a bonus, you can watch the popularity of worldwide sporting events like the Tour de France.

Monitoring Tour de France Tweets with opEvents

[Figure: opEvents Tour de France View]

Marriott data breach: wake-up call for companies storing customer data

How using the Gartner cyber security CARTA model can help secure customer data

The disclosure by hotel chain Marriott that the personal details of up to 500 million guests may have been compromised is a cyber security wake-up call for companies that store customer details – including in the cloud.

The potential theft of millions of passport details – reported on Friday, 30 November – could prove expensive. According to US magazine Fortune, Marriott will offer to reimburse customers the cost if fraud has been committed and customers need new passports.

For companies that store customers’ financial and personal details, the breach highlights two key issues that need to be addressed in corporate cyber security policies.

First, cyber prevention requires vigilance. The Marriott breach was detected more than two years after it first occurred. This is a sobering thought for chief information officers. Just because your systems and people have not detected a breach, that doesn’t guarantee that a breach hasn’t occurred.

The second issue is agility. Cyber security is a continuous arms race between cyber security professionals and attackers. The cloud is now extending that arms race into new dimensions. To stay secure, companies have to be fast-paced and stay pro-active. This involves a change in mindset.

Proactive mindset the key to cyber prevention

But what practical steps should your company take to avoid a similar breach? Most important: don’t wait for a cyber security alert – look into new ways of detecting any breaches that may already have occurred.

And don’t rest easy. If you are a major corporate, it is safest to assume you are constantly being attacked—and that some attacks will succeed.

Four-step process to mitigate risk

To mitigate and manage similar cyber security risks, we recommend a cyber response process built around four key steps:

  1. Prevention. Review firewalls and update controls to comply with the latest threat assessments. This includes a rigorous assessment of cloud-systems security.
  2. Detection. Understand the control tools that identify attacks, and continually review them as you move more functions and data into the cloud.
  3. Remediation. Work out now how you will respond if you discover a breach. This includes a customer-communications strategy.
  4. Restoration. Figure out how you can restore a secure environment quickly if you discover that your data – or your customers’ data – has been compromised.

This four-step process is built on a methodology put together by Gartner, called the ‘Continuous Adaptive Risk and Trust Assessment’ (CARTA). Gartner provides a great 60-minute introduction to this approach, accessible with registration.

To stay secure, though, the key will always be vigilance. As companies move more functions and databases into the cloud, malware designers will refine their attacks. A continuous re-assessment of cyber prevention tactics will prove the most effective strategy in this ongoing cyber arms race. Talk to Roger and his team of experts today on +61 2 9409 7000 to find out more about protecting your business.
