An IT Professional's Guide to Network Process Automation

This guide is designed for IT managers looking to implement network process automation in their organization.

Key points:

  • Focusing on good operational practices.
  • Choosing the right tasks.
  • Handling common problems through automation.
  • Mapping the automation process.
  • Saving time.
  • Checklist.

The guide discusses the best approach to change management and team buy-in, provides a methodological framework to use when considering the automation of a manual task in a network environment, and lays out the steps for identifying an effective test case for your organization.

Uncategorized

How to Scale Network Monitoring Effectively

This guide is designed for IT managers looking to scale network monitoring in their organization.

Key points:

  • The pros and cons of scaling by adding staff.
  • The pros and cons of scaling by changing processes.
  • Risks of scaling your network.
  • Mean time between failures (MTBF)
  • Mean time to resolution (MTTR)

The guide analyzes the best approach to increasing your monitoring capabilities and your capacity to generate revenue, while ensuring that the costs of doing business do not inflate at the same rate.


Intelligence Redefined, NMIS 9

NMIS consolidates multiple tools into a single system, ready for network engineers to use. Scalable, flexible, open, and easy to implement and maintain: these are the qualities that characterize NMIS as the network management system underpinning the operations of more than 100,000 organizations worldwide, making it one of the most widely used open source network management systems today. Its advantages are detailed below:

  • NMIS monitors the status and performance of an organization's IT environment, helps identify and rectify faults, and provides valuable information for IT departments to plan infrastructure changes and investment.
  • It has been deployed globally on networks ranging from as few as 5 devices to hundreds of thousands, with more than 10,000 vendor device models available.
  • Increase your efficiency with automation through the powerful NMIS 9 engine. Ideal for MSPs, NMIS 9 from Opmantek will solve any scaling problem through a flexible architecture and integration with existing tools.

Taking advantage of the features of the fastest NMIS ever, with major improvements ready to boost the management of your network on a single platform, brings your business benefits such as:

  • Complete architectural flexibility, with more nodes per server than ever
  • Consolidate all your other tools and automate their operation
  • Big Data incorporation, with MongoDB replacing the NMIS 8 file system database
  • Greater support for centralized data storage, which means more data availability
  • Delivered as a pre-configured, ready-to-use solution for rapid deployment
  • Fully compatible with commercial modules to extend the platform
  • Full commercial support available

If you are interested in this new version of NMIS, do not hesitate to contact us!


Make the Most of Your Mobile Phone for Remote Work

Desktop computers and laptops have long been the backbone of remote work; however, today's smartphones put a wide range of functions in the palm of our hands.

We used to use them only to stay in touch with colleagues, check our email, and perhaps reply to a few short messages; today they have become more relevant as a work tool of great value and independence.

For example, we can hold meetings from anywhere and interact with our team members through video calls, overcoming one of the most frequent challenges of this period of mandatory lockdown. This is precisely where smartphones perform best, since no desk or office chair is needed to join the meeting, and they let us be present on multiple channels, checking email, answering WhatsApp messages, and holding video conferences, all in one place.

We can also edit documents on our smartphones and quickly correct data in a file.

Another very important factor is the value that mobile networks offer, since they let us stay connected practically everywhere and always aware of what is happening in our company.

If you are a telecommunications service provider, do not hesitate to explore our management tools to avoid service failures.

Companies have been forced to reinvent themselves to work remotely, and thanks to this, a large part of countries' economies has been able to stay active even in the most critical phases of lockdown. So when we return to the new normal, we will surely not go back to the old way of working; instead we will adopt a hybrid model that includes both remote and in-person work, which we can make more efficient with these tools.


7 Steps to Network Management Automation & Engineer Sleep Insurance

Quietly, somewhere in an office downtown, bearings designed to last for 25k hours have been running non-stop for over forty-three thousand. The fan was cheaply made by machine from components sourced over several years across a dozen providers. It sat boxed for weeks before it was installed in the router chassis, which itself was boxed up. Two months at sea, packed tight in a shipping container, then more months bounced around and shuffled from truck to warehouse, and back to a parcel delivery. Finally, the device was configured, boxed and shipped to its final installation point. Stuffed into a too-tight closet with no air circulation, this mission-critical router has been running non-stop for the past five years. It’s a miracle, really, that it worked this long.

Fan speed was the first thing to be affected by the bearing failure.

Building friction on the fan’s impeller shaft caused the amperage draw to increase to compensate and maintain rotational speed. When the amperage draw maxed out, rotations per minute (RPM) dropped. With the slower fan speed came less airflow, and with less airflow the chassis temperature increased.

Complex devices, like routers, require low operating temperatures. The cooler it is, the easier it is for electrons to move. As the chassis temperature increased the router experienced issues processing the data packets traversing the interfaces. At first it was an error here or there, then routine traffic routing ran into problems and the router began discarding packets. From there things got much worse.

It’s late Saturday evening and your weekend has been restful so far. A night out with your significant other, a movie and dinner. It’s late now and you’re ready for bed when your phone chirps. The text message is short:

Device: Main Router

Event: Chassis high temperature with high discard output packets

Action Taken: Rerouted traffic by increasing OSPF cost

Action Required: Fan speed low, amperage high. Engineer investigate for repair/replacement.

A fan went bad, what’s next?

The system has responded as you would have: it rerouted traffic off the affected interface, preventing a possible impact on system operation. After adding a note to your calendar to investigate the router first thing Monday morning, you turn in for a good night’s sleep.

Our Senior Engineer in Asia-PAC, Nick Day, likes to refer to Opmantek’s solutions as “engineer sleep insurance”. Coming from a background in managed service providers, I can appreciate the situation. Equipment always breaks during your vacation, often when the on-call engineer is as far away as possible, and with little useful information from the NMS. This was a prime scenario we used when building out our Operational Process Automation (OPA) solution.

Building a Solution

The solution combines opTrend, which identifies operational parameters outside of trended norms, with opEvents, which correlates events and automates remediation. With the addition of opConfig, configuration changes to network devices can also be automated. Operational Process Automation (OPA) builds on this statistical analysis and rules-based heuristics to automate troubleshooting and remediation of network events. This in turn reduces the negative impact on user experience.

Magicians never reveal their secrets…but we’ll make an exception.

Now let’s see how this was accomplished using the above example. At its roots, opTrend is a statistical analysis engine. opTrend collects performance data from NMIS, Opmantek’s fault and performance system, and determines what is normal operation. Looking back over several weeks, usually twenty-six, opTrend determines what is normal for each parameter it processes. It does this hour by hour, considering each day of the week individually. So, Monday morning 9-10am has its own calculation, which is separate from 3-4pm Saturday afternoon. By looking across several weeks, opTrend can normalize things like holidays and vacation time.

Once a mean for each parameter is determined, opTrend calculates the standard deviation for the parameter and creates a window of three standard deviations above and below the mean. Any activity above or below this window triggers an opTrend event into NMIS. These events can be in addition to, or in place of, those generated by NMIS’s Thresholding and Alert system.
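To make the windowing idea concrete, here is a minimal Python sketch of an hour-of-week baseline with a three-standard-deviation window. It is not opTrend’s actual implementation: the function names, the use of the population standard deviation, and the sample temperature data are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean, pstdev


def build_baseline(samples, sigmas=3.0):
    """Group (timestamp, value) samples by (weekday, hour) and return
    {(weekday, hour): (low, high)}, where the window is mean +/- sigmas * std."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[(ts.weekday(), ts.hour)].append(value)
    windows = {}
    for slot, values in buckets.items():
        mu = mean(values)
        sd = pstdev(values)  # assumption: population standard deviation
        windows[slot] = (mu - sigmas * sd, mu + sigmas * sd)
    return windows


def is_anomalous(windows, ts, value):
    """True when value falls outside the window for that hour of the week."""
    low, high = windows[(ts.weekday(), ts.hour)]
    return value < low or value > high


# Hypothetical data: 26 weeks of Saturday 22:00 chassis temperatures near 40 C.
start = datetime(2024, 1, 6, 22, 0)  # a Saturday
history = [(start + timedelta(weeks=w), 40.0 + (w % 3) - 1) for w in range(26)]
windows = build_baseline(history)

# A new Saturday 22:00 reading is compared only against other Saturday 22:00 readings.
next_saturday = start + timedelta(weeks=26)
overheated = is_anomalous(windows, next_saturday, 55.0)
```

Keying the baseline on (weekday, hour) is what keeps a busy Monday morning from making a quiet Saturday evening look abnormal, or the reverse.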

In the example above, opTrend would have seen the chassis temperature exceed the normal window of operation. Had fan speed and/or amperage also been processed by opTrend (they are not by default, but this can be configured if desired), these would have been reported as a low fan speed and a high amperage.

This event from opTrend would have been sent to NMIS, then shared with opEvents for processing. A set of rules, or Event Actions, looks for events that could be caused by high temperature, which are often related to interface packet errors or discards. With wireless devices (WiFi and RF), high temperature may also affect signal strength and connection speed. A similar result could be achieved using a Correlation Rule, which would group multiple events across a window of time into a new parent event. Both methods are relevant and have their own pros and cons.

opEvents now uses the high temperature / high discards event to start a troubleshooting routine. This may include directing opConfig to connect to the device via SSH and execute CLI commands to collect additional troubleshooting information. The results of these commands can have their own operational life: being evaluated for error conditions, firing off new events, and themselves starting Event Actions.

Let’s review the process flow:

  1. NMIS collects performance data from the device, including fan speed, temperature and interface performance metrics.
  2. opTrend processes the collected performance data from NMIS and determines what is normal/abnormal behavior for each parameter.
  3. Events are generated by opTrend in NMIS, which are then shared with opEvents.
  4. opEvents receives events from opTrend identifying out-of-normal temperature and interface output discards. These events are then correlated into a single synthetic event, given a higher priority, and evaluated for action.
  5. An Event Action rule matches for a performance impacting event on a Core device running a known OS. This calls opConfig to initiate Hourly and Daily configuration backups, then execute a configuration change to increase the OSPF cost on the interface forcing traffic to be rerouted off this interface.
  6. opEvents also opens a helpdesk ticket via a RESTful API, then texts the on-call technician with the actions taken, and recommended follow-on activities.
  7. Once traffic across the interface drops the discards error will clear, generating an Up-Notification text to the on-call technician.
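
The correlation in step 4 can be sketched in a few lines of Python. This is only an illustration of the idea of grouping related events on one node within a time window into a higher-priority parent event; it is not the opEvents rule syntax, and the class names, event names, priorities, and window length are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    node: str
    name: str
    time: float        # seconds since an arbitrary epoch
    priority: int = 3


@dataclass
class ParentEvent:
    node: str
    name: str
    priority: int
    children: list = field(default_factory=list)


def correlate(events, required, window_s=300,
              parent_name="Synthetic event", priority=8):
    """Return one ParentEvent per node on which every event name in
    `required` was seen within `window_s` seconds of the others."""
    by_node = {}
    for ev in sorted(events, key=lambda e: e.time):
        if ev.name in required:
            by_node.setdefault(ev.node, []).append(ev)

    parents = []
    for node, evs in by_node.items():
        latest = {}  # most recent occurrence of each required event name
        for ev in evs:
            latest[ev.name] = ev
            if set(latest) == set(required):
                times = [e.time for e in latest.values()]
                if max(times) - min(times) <= window_s:
                    parents.append(
                        ParentEvent(node, parent_name, priority, list(latest.values())))
                    break
    return parents


# Hypothetical events: only main-router shows both symptoms close together.
required = {"High chassis temperature", "High output discards"}
events = [
    Event("main-router", "High chassis temperature", 0),
    Event("branch-sw1", "High chassis temperature", 60),
    Event("main-router", "High output discards", 120),
]
parents = correlate(events, required)
```

The parent event carries its child events with it, so a downstream action (a ticket, a text message, a configuration change) can report everything that contributed to the decision.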

 

This is an example of what we would consider a medium-complexity automation. It comprises several Opmantek solutions, each configured (mostly automatically) to work together. These solutions share and process fault and performance information, correlate the resulting events, and apply a single set of event actions to gather additional information and configure around the event. When applying solution automations, we advocate a crawl-walk-run methodology: start by collecting troubleshooting information (crawl), then automate simple single-step remediations (walk), then slowly deploy multi-path remediations with control points (run).

Contact Us & Start Automating Your Network Management

Contact our team of experts here if you would like to know how this solution was developed, or how Operational Process Automation can be leveraged to save on man-hours and reduce Mean Time to Resolve (MTTR).


How to Manage Capacity, Before it Becomes a Problem.

Capacity Management is the proactive management of any measurable finite resource.

This blog gives you a simple-to-follow outline of how to manage capacity properly, so that if you ever have to resolve capacity issues, you are ahead of the curve and ready to implement remediation.

Capacity management has been considered by many as difficult to achieve. But all worthwhile achievements take discipline to execute and accomplish. So, with careful consideration, monitoring and planning you can ensure that it becomes manageable and deliverable.

Don’t forget that as part of any new deployment or upgrade, and as budget allows, additional demand should be incorporated into the design, with additional capacity ready to service the new capacity peaks. The new peak load is accounted for and new baselines are created.

Analysis Paralysis

The overall concept is that you don’t create reports just to create reports. People might read them once and never again. But because the reports are automated, they will continue to be sent and will remain unopened, filtered or archived. This is not the result you want.

The behaviour you want to drive is for people to use your reports. So, create reports that drive actions. For example, node health reports can provide checklists to drive daily troubleshooting, flag maintenance check-ups, and schedule upkeep or repair of devices. Use daily event reports to help the engineering team understand the normal background noise and static across your network, or to drive a cleanup. Then, of course, there are weekly and monthly reports. For example, a WAN/interface report to support bandwidth and equipment investment might only need to be produced monthly, but a report on a resource whose consumption is growing faster should be produced weekly.

Detecting capacity issues through threshold management.

The problem with capacity issues is that they can present themselves in so many different ways, with the result that something isn’t working the way it was, or should be. As I discussed in my blog on bandwidth congestion, a user will report that “some application” doesn’t work like it did yesterday, or a capacity threshold alarm will have escalated. If you want to learn about root cause analysis, check out Mark’s webinar video.

Using Opmantek Products to manage capacity

Add your devices to NMIS (and while you’re at it, ensure that you have a naming convention to follow, have all your SNMP done and your network documented)

  1. IP, Name and Community String
  2. Assign roles to devices (use the built-in Core, Distribution, Access)

Preparing Visibility

  1. Set up regular reports using opReports
    1. If you manage a network choose the network reports
    2. If you manage servers use the capacity report
    3. If you manage servers and networks, set up both the network reports and the capacity report
    4. Set up the scheduling: have them emailed once a week, in time for your planning and performance review session.
  2. Set up capacity dashboards using TopN views in opCharts
    1. Add TopN and Network Maps to your view (good practise)
    2. Create charts for your most important resources

 

Simple Alarming and Notifications

  1. Enable notifications for critical resource capacity issues. Of the available levels (Normal, Warning, Minor, Major, Critical, Fatal), start with Critical and Fatal only.

Add more later as you gain insight.

  2. Set up email notifications so that Threshold events are sent to the right teams, based on the Role (Core, Distribution, Access) or Type (Server, Router, Switch) of the device.

Trending – for predictive capacity planning

  1. Enable opTrend to find anomalies in usage (events) and resources which are continuously trending outside of normal (Billboard)
    1. Notify on critical opTrend threshold events.
    2. Review opTrend Top of The Pops Billboard at your regular capacity review meetings.

Simple steps when managing capacity issues as incidents.

While not ideal, issues/incidents seen at the helpdesk could potentially originate from a change that took place on the network or in the environment. In the real world, even the best change management implementation, or an outage, may cause a capacity issue somewhere and trigger an alarm.

Ask. What has changed? Has something in the environment changed?

A capacity threshold breach is typically an indicator of:

  1. A new service being added
  2. A new demand
  3. A network change
  4. Some other change
  5. A finite asset reaching a predetermined capacity

Approaches to Baselining for Monitoring and Support:

Look at all your resources, then review and categorise your resource types, e.g. Internet connections, site links, etc. For each category, settle on baseline usage levels as percentages (Fatal, Critical, Major, etc.), which will be your starting baseline. It is critical to know your baseline, as all your threshold alarms will be triggered at the levels you set, and you want Notifications of Threshold Alarms only for the more serious alarms. You don’t want to “cry wolf.”

Consider grouping your resources, for example: Core, Application, DMZ, Edge, Branch, Internet Links, General WAN etc.

And within each group, consider the following resources you want to monitor:

CPU, Memory, Bandwidth Utilisation

Start by using general thresholds for each based on the peak demands you have seen.

These are your proactive warnings that will send an alarm to your management platform. You may want to set some escalation rules for the resource for example:

85% – 95% → Major → Alarm Notification (business hours) → to the capacity team

>95% → Critical → Alarm Notification (24×7) → helpdesk/NOC
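
As a rough illustration, the two escalation rules above can be expressed as a small Python function. The function name and the tuple layout are assumptions made for this sketch, and treating exactly 95% as Major is one reading of the rules; this is not how NMIS or opEvents stores escalation policy.

```python
def classify_utilisation(percent):
    """Map a utilisation percentage to (level, notify_hours, recipient),
    or None when neither escalation rule applies."""
    if percent > 95:
        return ("Critical", "24x7", "helpdesk/NOC")
    if percent >= 85:
        return ("Major", "business hours", "capacity team")
    return None


# Hypothetical readings for a WAN link at different times.
for reading in (50, 90, 97):
    print(reading, classify_utilisation(reading))
```

Checking the higher threshold first means each reading matches exactly one rule, so an alarm is never sent to both teams at once.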

Using the trend analysis provided by opTrend, you can identify highly anomalous usage (for example, usage that is low when it should normally be high at that time of day) or proactively look at resources consistently trending up or down versus their normal levels. This lets you start reviewing the resource ahead of time for appropriate modification (upgrade, downgrade, offloading work, etc.). As the network continues to grow and support new services, the baseline will change over time (a sliding baseline), so capacity issues may “creep up” on you, because alarm thresholds may never be breached and therefore no alert is sent. It is important to look at the baseline’s “rate of change” over time as well to determine capacity needs (e.g. a 10% change over a one-week timeframe). When planning to increase capacity, be sure to allow for procurement and provisioning time.
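
The baseline rate-of-change check can be sketched as follows. This is a minimal Python illustration, assuming you already have one baseline value per week for a resource (for example, average utilisation as a percentage) and that no baseline is zero. The function names and sample figures are hypothetical; the 10% limit comes from the example above.

```python
def baseline_rate_of_change(weekly_baselines):
    """Return the percentage change between consecutive weekly baselines."""
    changes = []
    for previous, current in zip(weekly_baselines, weekly_baselines[1:]):
        # Assumes previous is non-zero.
        changes.append((current - previous) / previous * 100.0)
    return changes


def weeks_exceeding(weekly_baselines, limit_percent=10.0):
    """Return the list indexes of the weeks whose baseline rose by at
    least limit_percent compared with the week before."""
    changes = baseline_rate_of_change(weekly_baselines)
    return [i + 1 for i, change in enumerate(changes) if change >= limit_percent]


# Hypothetical weekly baselines for one WAN link, in percent utilisation.
baselines = [50.0, 52.0, 58.0, 60.0]
changes = baseline_rate_of_change(baselines)
flagged = weeks_exceeding(baselines)
```

In this sample, no week comes near an 85% alarm threshold, yet the jump from 52% to 58% is flagged. That is the kind of creep that a fixed threshold alone would miss.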

I mentioned the sliding baseline and tracking the baseline’s rate of change so that capacity issues don’t “creep up” on you.
