Automated Remediation and Getting a Good Night's Sleep

A major PaaS provider offering a non-stop compute platform needed to automate its way around recurring issues so it could continue guaranteeing its millisecond data-loss and recovery SLAs, buying the team time to diagnose and remove the underlying problems.

I assisted one of the engineers from a major PaaS provider the other week so he could get back to having a good night's sleep.

The company he works for offers a non-stop IBM PowerPC compute platform, and he needed to automate activities his engineers were doing so the platform could be relied on to continue guaranteeing the millisecond data-loss and recovery SLAs it was designed for. They knew they needed time to diagnose and remove the underlying problems, so in the meantime they needed a fast and reliable way to fix the issues as they occurred.

This blog describes what was done in this specialist environment, but it provides a great example of applying remediation automation in any environment. This service provider happens to offer an IBM PowerPC Platform as a Service to banks, utilities, telcos and the like, using two clusters in two data centres; cross-site replication provides high availability and zero-data-loss failover. The engineers use NMIS, opEvents and opConfig to manage the whole environment. NMIS collects statistics, state and events from the IBM PowerPC cluster instances, and also collects data from the member servers, the Fibre Channel and Ethernet switching, and the SAN. Having NMIS meant visibility across the entire environment, all the way down to OS health, particularly the underlying daemons and services on the PowerPC cluster. In this case the team made use of NMIS's service management and plugin capabilities to monitor the IBM systems.

The team were using NMIS's service management functions to collect state and performance data from several Nagios plugins on the PowerPC servers. NMIS and opEvents were successfully alerting the right teams, via SMS and email, to the fact that SVC replication was failing occasionally. The team were responding by following a process to restart the SVC service on the machines, usually, of course, in the middle of the night! They needed a way to automate this remediation task quickly; here is what they did, completing the work in about 20 minutes and without spending any money.

First, they read the docs on Automating Event Actions in opEvents.

Next, they looked at the SVC replication events in opEvents, examining the details tab for one of the previous events. It was decided the action should only be triggered if the alert was “Service Down” rather than “Service Degraded”, and only if the service was down on the primary site, not the secondary site. In the details tab of the event they noted the following event attributes:
event => Service Down
element => svc_lun_inservice
host => primary_cluster.example.com

Next, they tested a shell script they had been using to restart the svc service remotely; it was simply the three remote ssh commands they had been issuing by hand. They placed it on the NMIS server at:
/usr/local/nmis8/bin/restart_svc.sh
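
The actual three commands weren't published; purely as an illustration, a script of roughly this shape would do the job. This is a minimal sketch only: the ssh user and the AIX-style stopsrc/startsrc/lssrc service commands are assumptions, not the provider's real procedure.

#!/bin/sh
# restart_svc.sh - hypothetical sketch of the remediation script.
# opEvents passes the node's IP or hostname as the first argument.
HOST="$1"

# Stand-ins for the three ssh commands the engineers had been issuing
# by hand; echo each step so opEvents can save the script's output.
echo "stopping svc on $HOST"
ssh admin@"$HOST" "stopsrc -s svc"

echo "starting svc on $HOST"
ssh admin@"$HOST" "startsrc -s svc"

echo "checking svc status on $HOST"
ssh admin@"$HOST" "lssrc -s svc"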

The final piece of the puzzle was to call the script when the event details matched.

So they edited the EventActions.nmis file:
/usr/local/omk/conf/EventActions.nmis

The following was added:

1. A script action – added to the ‘script’ section of EventActions.nmis (copying one of the example scripts as a template), they created the following:
'script' => {
  'restart_svc' => {
    arguments => 'node.host',                     # this script takes the node's IP or hostname
    exec => '/usr/local/nmis8/bin/restart_svc.sh',
    output => 'save',                             # save a copy of the script's stdout, which echoes what it is doing
    stderr => 'save',
    exitcode => 'save',
    max_tries => 2,
  },
},

2. A matching rule – added near the top of EventActions.nmis in the appropriate location (again, the easiest way was to copy and edit one of the existing entries):
'20' => {
  IF => 'event.event eq "Service Down" and event.element eq "svc_lun_inservice" and node.host eq "primary_cluster.example.com"',
  THEN => 'script.restart_svc()',   # note this name matches the name in the script section above
  BREAK => 'false'
},

Finally, they needed a way to test it without actually doing anything or breaking anything. First they edited the restart script slightly so it only echoed the key commands it would issue on the server. They then found a copy of the Service Down event in /usr/local/nmis8/logs/event.log, copied it, changed the datestamp (epoch) to now, appended it back to the log with

echo {the edited event line} >> event.log

and watched opEvents for the event to appear.
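
In shell terms, the re-injection looks something like the sketch below, assuming (as is typical for NMIS event.log lines) that the epoch datestamp is the leading field; check your log format first:

LOG=/usr/local/nmis8/logs/event.log
# take the most recent "Service Down" line as a template
LINE=$(grep 'Service Down' "$LOG" | tail -1)
# rewrite the leading epoch datestamp to now and append the copy back
NOW=$(date +%s)
echo "$LINE" | sed "s/^[0-9]*/$NOW/" >> "$LOG"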

They then looked at the actions section of the event in the GUI and could see the script had fired, along with everything the script had done, from its output to its exit code to its stderr messages.

Finally, they changed the script back so it actually ran the commands, and went home.

That night the SVC replication failed – but they weren't emailed or SMSed this time, as the system repaired itself immediately, before the 5-minute escalation time had passed. Job done.

In the meantime, after a month of further diagnostics with the network carrier, the data-centre provider, the SAN vendor and a team of others, they found the underlying issue with the LUN replication: a timeout problem related to cross-site Fibre Channel and the additional backup loads happening at night. A couple of timers were changed and all was good.

 


Why Managed Service Providers Should Consider a Self-hosted RMM Solution Over Software as a Service

With a growing dependence on the internet, many small and medium-sized organisations are opting out of managing their own networks and handing the reins over to Managed Service Providers. Managed services is a fast-growing industry, expected to reach $193.34 billion by 2019 with a 12.5% CAGR, and Opmantek has long been recognised as a leading self-hosted RMM software provider to some of the biggest players in the industry.

In recent times there has been a shift in the market towards Software as a Service (SaaS) purchasing, and many vendors now offer cloud-based solutions as a 'simple' software play for MSPs. However, we have found that our customers require flexibility, and cloud is a one-size-fits-most service that is not capable of supporting all network devices. Every day we receive more and more enquiries from MSPs looking to regain control of their network by taking back ownership of their network management system.

Here are the top reasons our customers prefer on-premise or private cloud hosted Opmantek software over SaaS.

100% visibility and control

SaaS-based remote management systems are often a good option for micro MSPs because the ease of deployment and the monthly subscription payment for each monitored instance make budgeting easy, until your client base starts to grow. The devices under management become more obscure as the networks you manage increase in size. That's when you start to lose network visibility, the event, health and performance data starts to lose quality, and additional licensing costs associated with more complex network architectures begin to emerge.

Opmantek software can be deployed in the cloud or on-premise, but because you retain ownership of the database and have access to the source code at the core of NMIS, you maintain architectural, device and cost control.

 

Flexible integration

It is unlikely that any business will be implementing network management software for the first time when they come to search for a solution. It is highly likely that they already have a number of products performing different functions within their network environment, and they may not wish to, or be able to, replace them all at once.

Opmantek supports a huge range of integration options: REST APIs (HTTP(S)), batch operations, and information exchanged as JSON files or CSV.

ServiceNow API integration is tried and tested in both directions, for incident feeds as well as CMDB asset feeds.

Opmantek offers end-to-end solutions; you can implement them in stages (or only implement some of the features) and be assured that, whatever other platforms you are working with, you will be able to integrate and feed data between your systems and maintain complete visibility in your Opmantek dashboards.

Tailored Business Requirements and Ownership of data

One size fits all rarely works for large network environments, but this cookie-cutter approach is standard with SaaS RMM software providers.

“Configuration over customization” has long been a core value of our development team – rather than creating new versions of software for individual customers, we co-create software features with our customers and make them available and configurable for the hundreds of thousands of organisations that have used Opmantek's software over the years. Almost every conceivable business requirement is catered for using configuration. Fault configuration, for example, can be adjusted in hugely flexible ways with our modelling policy engine, enabling you to use any metadata about a device to selectively apply fault and performance management. opEvents enables event management, correlation, escalation, network remediation automation and network fault analysis automation to be configured in an infinitely flexible manner to suit customer differences and your NOC.

Unlimited Scalability

With the number of connected devices in every organisation increasing exponentially over recent years, the ability to grow and scale to meet the needs of a business, without eroding profits, is becoming more and more critical for managed service providers. As managed service businesses grow, many SaaS providers force their users to compromise on the richness of data and troubleshooting capabilities by requiring paid upgrades to store more historical data.

At Opmantek we provide unparalleled scalability and ease of scaling.

We are highly efficient in our polling, so you gain the most from each poller, with thousands of devices per server instance even with full-featured polling. We also enable you to use as many pollers as you wish using opHA; we haven't seen a limit to how many devices or elements can be managed on a single system. MongoDB provides cost-effective storage of as much data as you choose to keep, making trending information and machine-learning capabilities far more effective.

We are regularly approached to replace other major vendors' products because they are unable to scale, are enormously costly to scale, or lose functionality as they do.

Want to learn more about how Opmantek’s RMM solutions can increase your network visibility, deliver unmatched automation and save money for your Managed Service organization?  Click here to request a personalised demo from one of our engineers today!


Enhancing Event Management Using Live Real World Data

Overview

When dealing with thousands of events coming into your NOC from many locations and different customers, operators rely on getting useful information that helps them make sense of the events pouring through.

Using opEvents, it is relatively easy to bring just about any data source into your event feed so that the Operations team has improved context for what is happening and ultimately what might be the root cause of the network outage they are currently investigating.

Using Twitter Feeds for Event Management

If you look at Twitter, you will find many government and other organisations using it to issue alerts and make announcements. A little bit of Googling and I found some excellent Twitter feeds for severe weather, general weather and earthquake tweets. By monitoring these in opEvents, you end up with tweets visualised in your overall event management view.

[Image: opEvents management view]

Useful Twitter Feeds

  • Severe Weather
  • Weather Tweet
  • Earthquake Tweet

Listening to Twitter Feeds

There are several ways to listen to Twitter feeds. The quickest for me was to use Node-RED, something I already use for home automation and IoT-style applications. Configuring Node-RED with the feed data above and then creating an opEvents JSON event was very straightforward.

[Image: Node-RED configuration view]

The code included in the "Make Event" node is below. It builds a JSON document with the correct payload, a compatible opEvents JSON event (which is a really great way to deal with events), and sets the filename a downstream file node writes it to:

// Only turn English-language tweets into events.
if (msg.lang === "en") {
    var details = msg.payload;      // the tweet text
    var event = msg.topic;          // the feed/topic the tweet came from
    var timenow = Date.now();

    // Unique file name inside the directory opEvents watches.
    msg.filename = "/data/json-events/event-" + timenow + ".json";

    // Replace the payload with an opEvents-compatible JSON event.
    msg.payload = {
        node: "twitter",
        event: event,
        element: "Sentiment: " + msg.sentiment.score,
        details: details,
        sentiment_score: msg.sentiment.score
    };
    return msg;
}
// Anything non-English is dropped (no message returned).

Getting Twitter Events into opEvents

Now that we have a well-formed JSON document with the necessary fields, opEvents will consume it once told which directory to look in.

I added the following to opCommon.nmis, in the opevents_logs section, and restarted the opEvents daemon, opeventsd:

'nmis_json_dir' => [
'/data/json-events'
],
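
Before wiring up Node-RED, you can sanity-check this plumbing by dropping a hand-made event file into the watched directory. A minimal sketch; every field value here is made up for the test, mirroring the fields the Node-RED function produces:

cat > /data/json-events/event-test.json <<'EOF'
{
  "node": "twitter",
  "event": "Severe Weather Alert",
  "element": "Sentiment: -3",
  "details": "hand-made test event",
  "sentiment_score": -3
}
EOF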

The result can be seen in opEvents when you drill into the "twitter" node (you could, of course, call this node anything you like, e.g. "weather" or "earthquake").

[Image: opEvents Twitter feed]

Clicking on one of the weather events with a high sentiment score (more on that in a second), you can see more details about the event and what impact it might have. Unfortunately we have a tropical cyclone in North Queensland at the moment; hopefully no one will be injured.

[Image: opEvents event view]

Enriching the Tweet with a Sentiment Score

The sentiment score is a heuristic that calculates how positive or negative a piece of text is, i.e. what the sentiment of that text is. The analysis looks for keywords and computes a score; in opEvents we then use this score to set the priority of the event, so the more critical weather events stand out, as the sentiment of those tweets will be negative.

In the opEvents EventActions.nmis I included some event policy to set the event priority based on the sentiment score, an event property carried across from Node-RED. This carries through the rest of opEvents automagically.

'15' => {
  IF => 'event.sentiment_score =~ /\d+/',
  THEN => {
    '5' => {
      IF => 'event.sentiment_score > 0',
      THEN => 'priority(2)',
      BREAK => 'false'
    },
    '10' => {
      IF => 'event.sentiment_score == -1',
      THEN => 'priority(3)',
      BREAK => 'false'
    },
    '20' => {
      IF => 'event.sentiment_score == -2',
      THEN => 'priority(4)',
      BREAK => 'false'
    },
    '30' => {
      IF => 'event.sentiment_score == -3',
      THEN => 'priority(5)',
      BREAK => 'false'
    },
    '40' => {
      IF => 'event.sentiment_score < -3',
      THEN => 'priority(8)',
      BREAK => 'false'
    },
  },
  BREAK => 'false'
},

Because opEvents uses several techniques to make integration easy, I was able to get the tweets into the system in less than an hour (originally I was monitoring tweets about the Tour de France); then I spent a little more time finding interesting weather tweets and refining how the events were viewed (another hour or so).

Summing Up

If you would like an event management system that can easily integrate any type of data from virtually any source into your workflow, then opEvents could be the right solution for you. As a bonus, you can watch the popularity of worldwide sporting events like the Tour de France.

Monitoring Tour de France Tweets with opEvents

[Image: opEvents Tour de France view]

Marriott data breach: wake-up call for companies storing customer data

How using the Gartner cyber security CARTA model can help secure customer data

The disclosure by hotel chain Marriott that the personal details of up to 500 million guests may have been compromised is a cyber-security wake-up call for companies that store customer details, including in the cloud.

The potential theft of millions of passport details, reported on Friday, 30 November, could prove expensive. According to US magazine Fortune, Marriott will offer to reimburse customers the cost of new passports if fraud has been committed.

For companies that store customers’ financial and personal details, the breach highlights two key issues that need to be addressed in corporate cyber security policies.

First, cyber prevention requires vigilance. The Marriott breach was detected more than two years after it first occurred, a sobering thought for chief information officers: just because your systems and people have not detected a breach doesn't guarantee that a breach hasn't occurred.

The second issue is agility. Cyber security is a continuous arms race between security professionals and attackers, and the cloud is now extending that arms race into new dimensions. To stay secure, companies have to be fast-paced and proactive. This involves a change in mindset.

Proactive mindset the key to cyber prevention

But what practical steps should your company take to avoid a similar breach? Most important: don't wait for a cyber-security alert; look into new ways of detecting any breaches that may already have occurred.

And don’t rest easy. If you are a major corporate, it is safest to assume you are constantly being attacked—and that some attacks will succeed.

Four-step process to mitigate risk

To mitigate and manage similar cyber security risks, we recommend a cyber response process built around four key steps:

  1. Prevention. Review firewalls and update controls to comply with the latest threat assessments. This includes a rigorous assessment of cloud-systems security.
  2. Detection. Understand the control tools that identify attacks, and continually review them as you move more functions and data into the cloud.
  3. Remediation. Work out now how you will respond if you discover a breach. This includes a customer-communications strategy.
  4. Restoration. Figure out how you can restore a secure environment quickly if you discover that your data, or your customers' data, has been compromised.

This four-step process is built on a methodology put together by Gartner called the 'Continuous Adaptive Risk and Trust Assessment' (CARTA). Gartner provides a great 60-minute introduction to this approach, accessible with registration.

To stay secure, though, the key will always be vigilance. As companies move more functions and databases into the cloud, malware designers will refine their attacks. A continuous reassessment of cyber-prevention tactics will prove the most effective strategy in this ongoing cyber arms race. Talk to Roger and his team of experts today on +61 2 9409 7000 to find out more about protecting your business.


Keeping cyber-attackers out of your supply chain

Globalisation, new technologies and digital business models are transforming the supply chain. Many businesses rely on organisations and individuals in different regions or countries to own the processes, materials or expertise used to provide a product or service.

However, malicious individuals or groups are increasingly aware that any supply chain is only as strong as its weakest link. If just one participant in a supply chain is lax about security, all businesses and individuals involved may be at risk.

Malicious parties may exploit weaknesses to steal valuable intellectual property, disrupt the creation or delivery of products and services, or threaten businesses or individuals for financial gain.

The United States National Institute of Standards and Technology (NIST) highlighted the importance of a cyber-secure supply chain in its Cybersecurity Framework. The latest version of the Framework – which provides voluntary guidance for organisations to better manage and reduce cyber-security risks – incorporates additional descriptions about how to manage supply chain cybersecurity.

Furthermore, a recent KPMG report points out “organisations that understand and manage the breadth of their interconnected supply chains and their points of vulnerability and weaknesses are better placed to prevent and manage issues.”

So what measures can businesses take to reduce cyber-security risks to their supply chains? Here are some steps that business owners and managers may consider taking:

  • Provide security expertise and resources to all participants in their supply chain.
  • Review participants’ processes for addressing technology vulnerabilities that attackers may exploit.
  • Check participants’ processes and technologies for dealing with infections by malicious software (malware).
  • Determine whether background checks are conducted on all workers involved in the business’s supply chain.
  • Review processes used to ensure all components used in the supply chain are legitimate and free of malware or vulnerabilities.

By implementing these and other measures through a comprehensive supply chain cyber security plan – that is itself part of an integrated approach to cyber security and physical security – businesses can minimise the risk of infiltration and compromise by attackers. If you would like to learn more, please contact us at info@firstwave.com.au.


Using Compliance Management as a Task Sheet

It is crucial for a network to be configured properly, not only to comply with the relevant legislation but also to guarantee delivery of the highest quality of service. Checks are needed across the IT infrastructure to evaluate whether devices comply with the rule sets that have been implemented. This was becoming increasingly difficult given the scale of IT infrastructure now required to maintain strict SLAs; however, those checks can now move from being a manual task to an automated process.

opConfig has a very powerful built-in compliance engine that can be used to audit networks and ensure that all devices comply with a given set of policies. The product ships with the CISCO-NSA best practices as its default compliance policy set, but adding your own custom policies is an easy process. The Community page has all the resources you will need to create your own policies or edit existing ones (find them here).

The focus of this summary, however, is what to do with the information provided once these policies are in place. There are two key ways to process this information and bring your network back into compliance, depending on how many devices need to be fixed.


The first is generally used when you have inherited a compliance problem, for example through mergers and acquisitions, where a large number of devices are non-compliant. The best process in this case may be to push new configurations to all devices. This can take longer than fixing single items, but it comes with the knowledge that every device will be configured to the same baseline. Configuration pushes are explained on our Community page, which also describes an excellent example (located here).

This leads to the most common occurrence, where the audit system has noticed small changes on a device. The compliance report can be automated to run every morning before a team's scheduled start time, generating a report of the devices that are out of compliance. Many network engineers will use this as a task sheet for the day or the morning, with the report on one monitor and the required CLI on another. As the tasks are completed, the environment becomes more compliant and service levels rise.
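
As a sketch of the scheduling, a crontab entry along these lines would have the report ready before the shift starts. The report_compliance.sh script and the mail delivery are placeholders, standing in for whatever command generates and distributes your compliance report:

# hypothetical crontab entry: build and mail the opConfig compliance report
# each weekday at 06:00, before the team's scheduled start time
0 6 * * 1-5 /usr/local/omk/bin/report_compliance.sh | mail -s "opConfig compliance report" noc@example.com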

[Image: opConfig compliance task sheet]

The report output can be seen in the image above, which exemplifies what the opConfig compliance engine will look for. The hit/miss category refers to the policies being tested: if there is a configuration item that can be checked against the policy, the result is a hit; if nothing is available, it is a miss (note that a hit or a miss does not imply a fault exists, only that the test itself could be run). The second column covers exceptions and approvals: an exception will require a configuration change on the device, while OK denotes that the device is behaving as the policy requires. If you would like more information on these topics, visit the community forums at the links above or contact us using the links below.
