Ask a business or technology leader how much downtime their company can tolerate, and they’ll answer “none” almost every single time. But what does “none” actually mean? More importantly, how much will “none” cost?
The fact is, hardware is GOING to fail… and the ways in which a piece of hardware or a system can fail are almost endless. Whether it’s a software bug, hardware failure, power outage, environmental issue, a failure in the cable plant or a natural disaster, all of these conspire to disrupt your IT systems and impact your business. Many of these threats can be mitigated, some can’t, and others may be too costly to remediate. What’s important is that we understand and address what can be mitigated, and have a plan of action for what cannot.
The Business Implications
Before we start planning for high availability, we need to ensure that there is an actual business problem to solve. The first step in developing a strategy for high availability is to understand what impact an outage of various systems would have on our ability to do business.
Consider a small real estate company with lots of remote offices around the country, supporting mostly virtual workers and providing meeting space. The only thing they REALLY use the network for is background checks, credit checks, and email. Facebook, YouTube, Snapchat and the others don’t count.
So what happens if the network goes down? Today, the workers can just use their smartphones to access their apps, make phone calls, and check email until normal operations are restored. Inconvenient? Yes. Business-impacting problem? [Shrugs] Not really. Being able to restore operations by the next business day should be more than adequate for this scenario.
On the other end of the spectrum, let’s consider a global call center where five hundred agents receive calls 24/7. If either the IP telephony or the applications that the call agents use are not available, the business stops and so does the revenue. Now we have clearly identified a business problem to solve. The potential lost revenue will serve as a guide to how much we can spend to mitigate a potential outage. This is critical, as technical teams can often get lost in the technology. It’s important that the remedy is not worse than the affliction.
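To make the "remedy versus affliction" comparison concrete, here is a minimal sketch of the arithmetic. Every figure in it (revenue per hour, downtime hours, mitigation cost) is a hypothetical placeholder rather than data from a real call center, and the function names are made up for illustration.

```python
def annual_outage_cost(revenue_per_hour: float, downtime_hours_per_year: float) -> float:
    """Revenue lost per year if the business stops whenever the system is down."""
    return revenue_per_hour * downtime_hours_per_year


def mitigation_is_justified(
    revenue_per_hour: float,
    downtime_before_hours: float,
    downtime_after_hours: float,
    annual_mitigation_cost: float,
) -> bool:
    """True when the revenue protected each year exceeds what the mitigation costs."""
    protected = annual_outage_cost(revenue_per_hour, downtime_before_hours) - annual_outage_cost(
        revenue_per_hour, downtime_after_hours
    )
    return protected > annual_mitigation_cost


# Hypothetical inputs, for illustration only.
REVENUE_PER_HOUR = 20_000.0       # revenue that stops during an outage
DOWNTIME_BEFORE_HOURS = 10.0      # expected yearly downtime today
DOWNTIME_AFTER_HOURS = 1.0        # expected yearly downtime after the redesign
ANNUAL_MITIGATION_COST = 50_000.0

print(
    mitigation_is_justified(
        REVENUE_PER_HOUR,
        DOWNTIME_BEFORE_HOURS,
        DOWNTIME_AFTER_HOURS,
        ANNUAL_MITIGATION_COST,
    )
)
```

A real business case would also weigh things like reputation and contractual penalties, but even this rough comparison keeps the spending tied to the business problem.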
So the question of “How much availability do I need?” is kind of like asking how long a piece of string is – it COMPLETELY depends on the impact the outage would have on your company.
How To Measure
Now let’s talk about how availability is measured. This table shows a widely accepted methodology for measuring downtime. Most have probably heard of five 9’s as a standard for measuring availability. In order to claim five 9’s, all of the outages for a system added up can total no more than 5.26 minutes in a year – assuming a 24 x 7 operation. If we can reduce the combined downtime for a system to no more than 31.5 seconds annually, we would have moved to six 9’s.
Note that most companies would be thrilled just to achieve three 9’s, or a total downtime of 8.76 hours for a system, annually. When I mention system, it’s shorthand for business system. An obvious business system would be IP Telephony, and there are many supporting systems that combine to ensure that the phone on your desk works. Therefore, our measurements must count downtime in any component that affects part or all of that system.
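The downtime figures quoted above are simple arithmetic, and it can help to see where they come from. The sketch below assumes a 24 x 7 operation and a 365-day year; the function name is just an illustrative choice.

```python
# Yearly downtime budget for a given number of 9's,
# assuming a 24 x 7 operation and a 365-day year.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_budget_seconds(nines: int) -> float:
    """Maximum total downtime per year, in seconds, for N 9's of availability."""
    unavailability = 10 ** -nines  # e.g. three 9's -> 0.001
    return SECONDS_PER_YEAR * unavailability


for n in range(2, 7):
    secs = downtime_budget_seconds(n)
    print(f"{n} nines: {secs / 3600:.2f} hours = {secs / 60:.2f} minutes = {secs:.1f} seconds")
```

Running it reproduces the 8.76 hours for three 9’s, 5.26 minutes for five 9’s and 31.5 seconds for six 9’s. Each additional 9 cuts the allowance by a factor of ten.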
Some companies do not count scheduled maintenance as downtime. Technically, this is cheating unless you are taking advantage of a business shutdown – such as over a holiday or holiday weekend. Certainly not one of the perks of being in IT! If your business truly runs 24 x 7, we need to ensure that the design allows for in-service maintenance and replacement of critical elements. For example, stacking switches are not going to support this but a chassis will.
Did you notice that 100% uptime didn’t show up in the table? That’s because a transition from a primary system to a backup system takes time. Assuming redundant hardware, and depending on the architecture, design and protocols in use, a failover from a primary system to a secondary system can take anywhere from milliseconds to minutes, or it may never complete at all.
Since there is no such thing as a system that will never fail, and we know it will take some period of time to restore operations, 100% is not possible.
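One common way to put numbers on this is the steady-state availability formula, MTBF / (MTBF + MTTR), where MTBF is the mean time between failures and MTTR is the mean time to restore service. The sketch below uses hypothetical figures (one failure a year and three made-up restore times) purely to show how the restore time drives the result.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical: on average one failure per year (8,760 hours between failures).
MTBF_HOURS = 8760.0

# Hypothetical restore times, from a manual fix down to a fast protocol failover.
for label, mttr_hours in [
    ("4-hour manual restore", 4.0),
    ("5-minute failover", 5 / 60),
    ("50-millisecond failover", 0.050 / 3600),
]:
    print(f"{label}: {availability(MTBF_HOURS, mttr_hours):.7%}")
```

The result only reaches exactly 100% when the restore time is zero, which is why faster failover earns more 9’s but never gets all the way there.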
In order to achieve more 9’s, we will need to leverage redundancy and resiliency in our architecture, design and protocols. Sure, this sounds obvious, but most people focus on redundant servers, routers, firewalls, and switches and miss the layers that support these devices. Consider the following:
- Are there two UPS devices, with two separate power feeds that take two separate paths?
- Do we have batteries to buffer a power outage?
- What about an extended power outage: Do we have a generator that can supply power for an extended period?
- Can it be refueled?
Power outages tend to be one of the most common outages, and they quickly reveal any misses in the solution design by bringing down our very expensive redundant hardware. Yeah, we also call this a very bad day.
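A rough way to see why a single power feed undoes redundant hardware is to combine availabilities: components that must all be up multiply together, while redundant components only fail when every copy fails. The sketch below assumes failures are independent, and the 0.999 values are hypothetical numbers chosen for illustration, not measurements.

```python
from math import prod


def series(*parts: float) -> float:
    """Every part must be up: availabilities multiply."""
    return prod(parts)


def parallel(*parts: float) -> float:
    """At least one part must be up: unavailabilities multiply."""
    return 1 - prod(1 - a for a in parts)


# Hypothetical availabilities, for illustration only.
SWITCH = 0.999      # a single switch
POWER_FEED = 0.999  # a single power feed with no backup

redundant_switches = parallel(SWITCH, SWITCH)
single_power = series(redundant_switches, POWER_FEED)
dual_power = series(redundant_switches, parallel(POWER_FEED, POWER_FEED))

print(f"redundant switches alone:        {redundant_switches:.6f}")
print(f"redundant switches, one feed:    {single_power:.6f}")
print(f"redundant switches, two feeds:   {dual_power:.6f}")
```

With these inputs the redundant pair on its own looks like six 9’s, but hanging it off one power feed drags the whole system back to just under three 9’s. The independence assumption matters too: two devices that share a room, a feed or a cable path can fail together.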
Let’s dig a little deeper. Are the redundant devices kept in physically separate rooms, buildings or even geographic areas? After a cement slug and water from a hole drilled in the ceiling took out two Catalyst 6500 switches in a wiring closet, it’s sure on my mind. We lost a quarter of the building for a day and it was a mad scramble. I live in Florida, so planning for a hurricane often means ensuring we have redundant systems in another geographic area that can provide business continuity if a hurricane or other disaster hits.
And finally, what about the cable plant: Do we have separate physical paths both within and between buildings? Why? Because construction, water, fire, squirrels, backhoes and cable trenchers are all intent on taking your cable plant out. I know, at this point, you are probably wondering how I sleep at night.
We are really only scratching the surface. But this should give you an idea of how critical it is to consider all of the layers that make up a business system. It’s awfully embarrassing to see very expensive redundant IT hardware rendered useless by a power failure, and it happens all the time. Remember, it’s critical to ensure that any risk mitigation we engage in is directly tied to the potential business impact we are mitigating.
Here’s the moral of the story: We need to understand ALL of the elements that could impact our deployment, and then determine which ones we can effectively mitigate. So if we can’t afford to dig under the road to create a second physical data path to our building, or can’t even get the permits, we could rig a building-to-building Wi-Fi connection at considerably less cost.
The reality is that budgets are limited, so when given lemons, make lemonade. Be creative: mitigate what we CAN and understand what we CANNOT mitigate. This allows us to create contingency plans that work around these events. I mean, God forbid we have to use paper and pencil, but it may be an effective short-term solution that’s pretty cost effective. Design matters, and the architecture, design and hardware selected will have a tremendous impact on how available your final business system is. At the end of the day, thinking outside the box and leveraging creative solutions will get the most availability out of a limited investment.