15 May 2020

How to Manage Capacity Before It Becomes a Problem

Capacity Management is the proactive management of any measurable finite resource.

This blog gives you a simple, easy-to-follow outline of how to properly manage capacity, so that if you ever have to resolve capacity issues, you are ahead of the curve and ready to implement remediation.

Many consider capacity management difficult to achieve. But all worthwhile achievements take discipline to execute, and with careful consideration, monitoring and planning you can make it manageable and deliverable.

Don’t forget that as part of any new deployment or upgrade, and as budget allows, additional demand should be incorporated into the design, with additional capacity ready to service the new capacity peaks. Account for the new peak load and create new baselines.

Analysis Paralysis

The overall concept is that you don’t create reports just to create reports. People might read them once and never again, but because they’re automated they will keep being sent, only to sit unopened, filtered or archived. This is not the result you want.

The behaviour you want to drive is for people to use your reports, so create reports that drive actions. For example, node health reports can provide checklists that drive daily troubleshooting, flag maintenance check-ups, and prompt upkeep or repair of devices. Use daily event reports to help the engineering team understand the normal background noise and static across your network, or to drive a clean-up. Then there are weekly or monthly reports: a WAN/interface report to support bandwidth and equipment investment might only need to be produced monthly, but a report on a faster-growing capacity resource should be produced weekly.

Detecting capacity issues through threshold management

The problem with capacity issues is that they can present themselves in many different ways; the symptom is simply that something isn’t working the way it was, or should be. Just as I discussed in my blog on bandwidth congestion, a user will report that “some application” doesn’t work like it did yesterday, or a capacity threshold alarm has escalated. If you want to learn about root cause analysis, check out Mark’s video here –> MARKS WEBINAR.

Using Opmantek Products to manage capacity

Add your devices to NMIS (and while you’re at it, ensure that you follow a naming convention, have SNMP configured on all devices, and have your network documented)

  1. IP, Name and Community String
  2. Assign roles to devices (use the built-in Core, Distribution, Access)

Preparing Visibility

  1. Set up regular reports using opReports
    1. If you manage a network, choose the network reports
    2. If you manage servers, use the capacity report
    3. If you manage servers and networks, do steps a and b
    4. Set up the scheduling – have them emailed once a week, in time for your planning and performance review session.
  2. Set up capacity dashboards using TopN views in opCharts
    1. Add TopN and Network Maps to your view (good practice)
    2. Create charts for your most important resources
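To make the idea concrete, here is a minimal sketch of what a “TopN” view computes: rank resources by utilisation and keep the busiest N. The interface names and figures are invented for illustration; in practice opCharts builds these views from live NMIS data.

```python
# Minimal sketch of a TopN view: rank resources by utilisation and
# keep the busiest N. Sample data is made up for illustration.
from heapq import nlargest

def top_n(utilisation, n=3):
    """Return the n busiest resources as (name, percent) pairs, busiest first."""
    return nlargest(n, utilisation.items(), key=lambda kv: kv[1])

interfaces = {
    "core-rtr1:Gi0/1": 91.5,
    "edge-sw2:Gi1/24": 47.0,
    "wan-rtr1:Se0/0": 88.2,
    "branch-rtr3:Fa0/0": 12.4,
}

print(top_n(interfaces))  # busiest interfaces first
```

A dashboard that always shows the current top consumers is what lets you spot a capacity problem at a glance instead of hunting through every chart.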


Simple Alarming and Notifications

  1. Enable notifications for critical resource capacity issues – start with Critical and Fatal only, out of the list Normal/Warning/Minor/Major/Critical/Fatal. Add more later as you gain insight.
  2. Set up email notifications to the right teams, based on device Role (Core, Distribution, Access) or Type (Server, Router, Switch), for Threshold events.
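The routing idea above can be sketched as a simple lookup table. The team addresses and the policy table here are hypothetical, invented purely for illustration; in practice you would configure this in your NMIS escalation policy rather than write it yourself.

```python
# Hypothetical sketch of routing threshold events to teams by device
# role or type. Addresses and the table are invented for illustration.

ROUTING = {
    ("role", "core"):         "core-net-team@example.com",
    ("role", "distribution"): "dist-net-team@example.com",
    ("role", "access"):       "access-net-team@example.com",
    ("type", "server"):       "sysadmin-team@example.com",
}

def route_event(device):
    """Pick a recipient for a device's threshold event: role first, then type."""
    for key in (("role", device.get("role")), ("type", device.get("type"))):
        if key in ROUTING:
            return ROUTING[key]
    return "noc@example.com"  # fallback: unclassified devices go to the NOC
```

Routing by role or type means a new device inherits sensible notification behaviour the moment it is classified, instead of needing per-device rules.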

Trending – for predictive capacity planning

  1. Enable opTrend to find anomalies in usage (events) and resources that are continuously trending outside of normal (Billboard)
    1. Notify on critical opTrend threshold events.
    2. Review opTrend Top of The Pops Billboard at your regular capacity review meetings.
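To show the kind of question trend analysis answers, here is a minimal z-score sketch of anomaly detection: flag a reading that sits far outside a resource’s historical range. This is only an illustration of the idea; opTrend’s own statistical modelling is more sophisticated than a single mean-and-deviation check.

```python
# Minimal sketch of trend anomaly detection: flag a reading that sits
# far outside the resource's historical mean. Illustration only.
from statistics import mean, stdev

def is_anomalous(history, reading, z_limit=3.0):
    """True if reading is more than z_limit standard deviations from the mean."""
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return reading != mu
    return abs(reading - mu) / sigma > z_limit

cpu_history = [40, 42, 39, 41, 43, 40, 38, 41]  # normal daytime CPU load (%)
print(is_anomalous(cpu_history, 85))  # a reading well above normal
```

Note this catches anomalies in both directions: traffic that is unusually low when it should be high is just as much a warning sign as a spike.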

Simple steps when managing capacity issues as incidents

While not ideal, issues or incidents seen at the helpdesk can originate from a change that took place on the network or in the environment. In the real world, even the best-managed change, or an outage, may cause a capacity issue somewhere and trigger an alarm.

Ask: what has changed? Has something in the environment changed?

Typically a capacity threshold breach is an indicator of:

  1. A new service added
  2. A new demand
  3. A network change
  4. Some other change
  5. A finite asset reaching a predetermined capacity

Approaches to Baselining for Monitoring and Support:

Look at all your resources, then review and categorise your resource types, e.g. Internet Connections, Site Links etc. For each category, settle on baseline usage levels as percentages (Fatal, Critical, Major etc.), which will be your starting baseline. It is critical to know your baseline: all your threshold alarms will be triggered at the levels you set, and Notifications of Threshold Alarms should only be sent for the more serious alarms. You don’t want to “cry wolf.”

Consider grouping your resources, for example: Core, Application, DMZ, Edge, Branch, Internet Links, General WAN etc.

And within each group, consider the following resources you want to monitor:

CPU, Memory, Bandwidth Utilisation

Start by using general thresholds for each based on the peak demands you have seen.

These are your proactive warnings that will send an alarm to your management platform. You may want to set some escalation rules for the resource, for example:

85% – 95% → Major → Alarm Notification (business hours) → to the capacity team

>95% → Critical → Alarm Notification (24×7) → helpdesk/NOC
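The escalation rules above can be sketched as a small classification function. The bands mirror the example (85–95% Major, above 95% Critical) and the notification targets are taken from it; adjust both to your own baselines.

```python
# Sketch of the escalation rules above: map a utilisation reading to a
# severity and a notification target. Bands mirror the worked example.

def classify(utilisation_pct):
    """Return (severity, target) for a utilisation percentage."""
    if utilisation_pct > 95:
        return ("Critical", "helpdesk/NOC, 24x7")
    if utilisation_pct >= 85:
        return ("Major", "capacity team, business hours")
    return ("Normal", None)  # below threshold: no notification

print(classify(91.5))
```

Keeping the bands in one place like this (whether in code or in your monitoring tool’s configuration) makes it easy to review and retune them at your capacity meetings.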

Using the trend analysis provided by opTrend, you can identify anomalous usage (for example, it’s low when it should normally be high at that time of day) or proactively look at resources consistently trending up or down versus their normal levels. That lets you start reviewing the resource ahead of time for appropriate modification (upgrade, downgrade, offloading work etc.). As the network continues to grow and support new services, the baseline will change over time (a sliding baseline), so capacity issues may “creep up” on you, because alarm thresholds may not be breached consistently enough to send an alert. It is important to look at the baseline’s rate of change over time as well to determine capacity needs (e.g. 10% change over a one-week timeframe). When planning to increase capacity, be sure to allow for procurement and provisioning time.
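Tracking the sliding baseline can be sketched as comparing this week’s baseline (here, the mean of daily peak utilisation) with last week’s, and flagging growth above a limit such as the 10%-per-week figure from the example. The sample figures are invented for illustration.

```python
# Sketch of tracking the "sliding baseline" rate of change: compare
# this week's baseline to last week's and flag growth above a limit.
from statistics import mean

def baseline_growth(last_week, this_week):
    """Percentage change in the weekly baseline (mean of daily peaks)."""
    prev, curr = mean(last_week), mean(this_week)
    return (curr - prev) / prev * 100

def needs_review(last_week, this_week, limit_pct=10.0):
    """True if the baseline grew faster than limit_pct in one week."""
    return baseline_growth(last_week, this_week) > limit_pct

last = [60, 62, 61, 63, 60]   # daily peak utilisation (%), last week
this = [68, 70, 69, 71, 72]   # this week: noticeably higher
print(needs_review(last, this))
```

A resource can trend upward for weeks without ever crossing an alarm threshold, which is exactly why rate-of-change checks like this catch problems that threshold alarms miss.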

Keep tracking that sliding baseline and its rate of change, and capacity issues won’t “creep up” on you.