For as long as I can remember, we built and engineered networks for the "3 R's": Resilience, Redundancy, and Reliability. Following this simple rule would protect you from the 50-year flood, the 100-year storm, and many other "rainy day" outages. When the network was up and running, humming away and performing nominally, it was considered a sunny day. Systems were online, everything was running well within specification, and network administrators would sit and babysit their huge collection of silicon and copper wires.
Every once in a while, the skies would become cloudy: network elements would fail, connectivity would be lost, and the data center would run at something less than full capacity. While this certainly needed to be addressed, stress levels remained tolerable because that engineered Resilience, Redundancy, and Reliability allowed data to keep flowing with little to no notice outside of those directly responsible for the systems' uptime.
Before the evolution of the Internet and the acceptance of cloud-based services in massive data centers, most facilities managed their own data centers, where they had full control over the building, the environmental systems, and even diverse carrier network connectivity. Then the IT "Big Bang" (a.k.a. the Internet) occurred and considerably shook up that model. Massive data centers sprang up around the country, out in the middle of farmland, housing thousands and thousands of servers, virtual machines, and facilities for nearly every industry.
There was no mistaking it: the cloud was here and everyone was in it. Performance and capacity rivaled those of localized data centers, and with the proper design, a mesh environment could be established where even if a portion of the network did go offline, several other nodes were standing at the ready to pick up the slack. To many, we had finally reached a utopia of computing power, and more and more critical applications were perfectly comfortable sitting in the public or quasi-public cloud.
Most of the time, I try to keep my thinking simple. I like to go back to the basics and understand the fundamentals of just about anything that I do. I believe that if you truly understand, at a very deep level, how a certain process operates, then when that process fails you're equipped to properly troubleshoot it, repair it, and, most importantly, design around a similar failure in the future.
The most recent victim, over the New Year's holiday, was the CenturyLink network. News reports over the weekend noted that areas of the country including Idaho, New Mexico, and Minnesota were affected, as were residential services in 35 states in total. 911 services were also affected across the country, which prompted nationwide alerts advising cell phone users to use local 10-digit numbers in case of an emergency. Initial signs of the outage were detected around 1 AM Pacific time Thursday morning, with resolution achieved by approximately 6 PM Pacific time on Friday, for a total of about 41 hours.
Five nines reliability??
When we build networks, we strive for five nines reliability, or 99.999% uptime calculated on an annual basis. Mathematically, this works out to just over five minutes of allowable outage per year. Based on 41 hours of disruption in the CenturyLink network, they're starting the brand-new year already down to just two and a half nines, or about 99.53% by my calculation.
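The arithmetic behind those numbers is simple enough to sketch out. A minimal calculation (using a non-leap 8,760-hour year):

```python
# A year has 24 * 365 = 8,760 hours, or 525,600 minutes.
HOURS_PER_YEAR = 24 * 365

def availability(outage_hours: float) -> float:
    """Annual availability as a percentage, given total outage hours."""
    return 100.0 * (HOURS_PER_YEAR - outage_hours) / HOURS_PER_YEAR

# The five nines downtime budget, in minutes per year:
five_nines_minutes = (1 - 0.99999) * HOURS_PER_YEAR * 60
print(round(five_nines_minutes, 2))   # ~5.26 minutes

# A 41-hour outage, as in the CenturyLink case:
print(round(availability(41), 3))     # ~99.532%
```

So a single 41-hour event burns through roughly 460 years' worth of five nines downtime budget.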
So let’s look a little deeper into this particular failure, keeping in mind that the full root cause analysis is still likely a week or so away.
In many cases, resilience is not a specific thing. Resilience is the ability to step back into action when any particular outage occurs. It doesn't define what that action should be, only that the problem is quickly identified and remediated. So while it's a bit nebulous, you might say that resilience is one of the most important pieces of any recovery plan: contingencies are expected, spare parts are readily available, monitoring tools have been deployed to quickly isolate problems, and personnel have the training and skill sets to use those tools and carry out any remediation tasks.
When we talk about reliability, we talk about the confidence level that a particular component will perform nominally. For example, an incandescent light bulb may operate for up to 2,000 hours, but new LED replacement lamps are routinely quoted at 50,000 hours of operation, making those lamps 25 times "more reliable". In telecommunications networks, if you cannot increase an individual component's reliability, a high-availability, active-active model can ultimately achieve the same goal: if one processor fails, the other processor is already running and takes over the operation. This shouldn't be confused with active-standby, where there is still a disruption, however minimal, as the secondary processor comes online. The critical component here is the detection of the failure and the redirection to the secondary processor.
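That detect-then-redirect step is the whole story of active-standby. A minimal sketch, with hypothetical names, of why the hand-off only happens after a failed health check:

```python
# Hypothetical active-standby sketch: a monitor polls a heartbeat and only
# redirects traffic once the active unit is declared dead. That detection
# window is exactly where the brief active-standby disruption comes from;
# in active-active, both units are already serving, so there is no hand-off.

class Processor:
    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def heartbeat(self) -> bool:
        """Pretend health check; a real one would probe over the network."""
        return self.healthy

def select_serving(active: Processor, standby: Processor) -> Processor:
    """Return the processor that should be serving traffic right now."""
    if active.heartbeat():
        return active
    # Failure detected: redirect to the standby.
    return standby

primary = Processor("proc-a")
secondary = Processor("proc-b")

assert select_serving(primary, secondary) is primary    # sunny day
primary.healthy = False                                 # active fails
assert select_serving(primary, secondary) is secondary  # redirected
```

The tunable here is how aggressively the heartbeat is polled: faster detection shrinks the disruption window, at the cost of more false positives.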
There is always strength in numbers. Redundancy goes hand-in-hand with both resilience and reliability. Nothing lasts forever, especially electronic components. We try to calculate an MTBF (mean time between failures); however, those numbers are usually unrealistic for day-to-day operations, as there are many contributing factors at the subcomponent level that could cause catastrophic failure.
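Even with their limitations, MTBF figures feed the classic steady-state availability formula, MTBF / (MTBF + MTTR). A small sketch with illustrative, hypothetical numbers:

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Classic steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Illustrative numbers only: a component quoted at one failure per
# 10,000 hours that takes 4 hours to repair.
print(round(steady_state_availability(10_000, 4) * 100, 3))  # 99.96
```

Note what the formula rewards: shrinking repair time (MTTR) buys availability just as surely as a longer MTBF, which is why sparing and fast fault isolation matter so much.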
The magic to running a solid and stable network is to closely manage, monitor, and statistically analyze every possible metric there is. When a failure does occur, careful root cause analysis must be undertaken to determine what in fact failed, but it doesn't end there. Taking it a step further and understanding the key indicators that were present prior to that failure is what will help you proactively avert that failure in the future. Burn me once, shame on you. Burn me twice, shame on me. Sunny day outages are the new uptime threat.