A few of you may remember, back in July 1969, what was then
the most famous, and furthest, long-distance phone call ever made. As for the
rest of you, you are now Googling whether phones even existed that long ago!
I can assure you that they did, and on July 20, 1969, then-President
Nixon, in the Oval Office, spoke via telephone-radio transmission with
Apollo XI astronauts Neil Armstrong and Edwin “Buzz” Aldrin while they were on
the surface of the Moon.
Of course, that call originated on landline circuits, was upconverted to a satellite link, and was then beamed into outer space over the Goldstone Deep Space Network. In many ways, this radio transmission is capable of carrying voice and data, similar to any terrestrial radio transmission. With modern advances in communications, just as we have Wi-Fi here on the surface, the International Space Station (ISS) is also connected.
The magic of VoIP allows any IP-based telephone to exist no
matter where the connectivity is coming from. That being said, it was really no
amazing feat to put an IP phone inside the ISS, which apparently was done a few
years ago. Unfortunately, IP phones don’t live on their own; they need to
register and connect with a call server that provides trunk resources to the
outside world. Once again, our space-based VoIP phone follows this same rule,
and is connected to an IP telephony system inside NASA headquarters.
When calling international numbers, many people
forget to dial the zero in the 011 international prefix. On the ISS phone, one
of the astronauts recently dialed ‘9’ for an outside line, forgot the ‘0’, and
then dialed ‘1 1’ followed by an international number. Of course, this being a
Kari’s Law compliant telephone system, as soon as the system processed 911, the call
was sent to public safety, triggering internal alarms along the way.
Fortunately, everyone realized it was just an accident, and
there was no emergency launch of a police cruiser to intercept the ISS in orbit!
So what’s the lesson learned? 911 needs to work everywhere, including “up there”!
But it might be a good time to put in a little misdial prevention programming :)
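To see why dropping a single ‘0’ lands on 911, here is a minimal sketch of the digit analysis involved. It is not the configuration of any real call server: the function names, the ‘9’ access code handling, and the sample number are all assumptions made for illustration.

```python
EMERGENCY_NUMBER = "911"
OUTSIDE_LINE_CODE = "9"        # assumed trunk access code
INTERNATIONAL_PREFIX = "011"   # international prefix mentioned in the text


def classify_dialed_digits(digits: str) -> str:
    """Return a coarse call type for a string of dialed digits (hypothetical helper)."""
    # Checked first, so 911 works directly with no access code in front of it.
    if digits.startswith(EMERGENCY_NUMBER):
        return "emergency"
    if digits.startswith(OUTSIDE_LINE_CODE + INTERNATIONAL_PREFIX):
        return "international"
    if digits.startswith(OUTSIDE_LINE_CODE):
        return "outside"
    return "internal"


def possible_misdial(digits: str) -> bool:
    """Flag an emergency match that is followed by more digits (hypothetical helper)."""
    return digits.startswith(EMERGENCY_NUMBER) and len(digits) > len(EMERGENCY_NUMBER)


intended = "9" + "011" + "442071234567"   # made-up international number
dialed = "9" + "11" + "442071234567"      # the same call with the '0' dropped

print(classify_dialed_digits(intended))   # international
print(classify_dialed_digits(dialed))     # emergency
print(possible_misdial(dialed))           # True
```

In this sketch the emergency match always wins and the call is never delayed. The misdial flag is only a hint that could be attached to the internal alarm so that whoever responds knows more digits followed the 911.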
For as long as I can remember, we have built and engineered networks for the “3 R’s”: Resilience, Redundancy, and Reliability. Following this simple rule would protect you from the 50-year flood, the 100-year storm, and many other “rainy day outages”. When the network was up and running, humming away and performing nominally, it was considered a sunny day. Systems were online, everything was running well within specification, and network administrators would sit and babysit their huge collection of silicon and copper wires.
Every once in a while, the skies would become cloudy,
network elements would fail, connectivity would be lost, and the data center
would run at something less than full capacity. While this was certainly
something that needed to be addressed, stress levels remained tolerable because the
engineered Resiliency, Redundancy, and Reliability were all there, allowing data
to be processed with little to no notice outside of those directly responsible
for the system’s uptime.
Before the evolution of the Internet, and the acceptance of cloud-based services in massive data centers, most facilities managed their own data centers, where they had full control over the building, environmentals, and even diverse carrier network connectivity. Despite this, the IT “Big Bang” (a.k.a. the Internet) occurred and considerably shook up the model. Massive data centers sprang up around the country, out in the middle of farmlands, housing thousands and thousands of servers, virtual machines, and facilities for nearly every industry.
There was no mistaking it: the cloud was here and everyone was in it. Performance and capacities rivaled those of localized data centers, and with the proper design, a mesh environment could be established where even if a portion of the network did go off-line, several other nodes were standing by at the ready to pick up the slack. To many, we had finally reached a utopia of computing power, and more and more critical applications were perfectly comfortable sitting in the public or quasi-public cloud.
Most of the time, I try to keep my thinking simple. I
like to go back to the basics and understand the fundamentals of just about
anything that I do. I believe that if you truly understand, at a very deep level,
how a certain process operates, then when that process fails you’re equipped with
the capabilities to properly troubleshoot it, repair it, and, most importantly,
design around a similar failure in the future.
The most recent victim over the New Year’s holiday was the CenturyLink network. News reports over the weekend noted that areas of the country including Idaho, New Mexico, and Minnesota were affected, as well as residential services in 35 total states. 911 services were also affected across the country, which prompted nationwide alerts to cell phone users advising them to utilize local 10-digit numbers in case of an emergency. Initial signs of the outage were detected around 1 AM Pacific time Thursday morning, with the resolution being achieved by approximately 6 PM Pacific time on Friday, for a total of about 41 hours.
Five nines reliability??
When we build networks, we strive for five nines reliability, or 99.999% uptime calculated on an annual basis. Mathematically, this works out to be just over five minutes of outage allowed per year. Based on 41 hours of disruption in the CenturyLink network, they’re starting off a brand-new year already down to just 2 1/2 nines, or roughly 99.53% by my calculation.
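For anyone who wants to check that arithmetic, here is a small sketch. It assumes a 365-day year and treats the 41-hour outage as the only downtime in that year; the function names are just illustrative.

```python
import math

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_pct: float) -> float:
    """Downtime budget per year for a given availability percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)


def availability_pct(outage_hours: float) -> float:
    """Annual availability percentage after a given number of outage hours."""
    return 100 * (1 - (outage_hours * 60) / MINUTES_PER_YEAR)


def nines(pct: float) -> float:
    """Number of nines on a log scale, e.g. 99.9 gives about 3."""
    return -math.log10(1 - pct / 100)


print(round(allowed_downtime_minutes(99.999), 2))  # 5.26 minutes per year
print(round(availability_pct(41), 2))              # 99.53 percent
print(round(nines(availability_pct(41)), 1))       # 2.3
```

On a strict log scale the result sits between two and three nines, which is in the same ballpark as the informal “2 1/2 nines” figure above.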
So let’s look a little deeper into this particular failure,
keeping in mind that the full root cause analysis is still likely a week or so
away.
In many cases, resilience is not a specific thing. Resilience is the ability to step back into action when any particular outage occurs. It doesn’t define what that action should be, only that the problem was quickly identified and remediated. So while it’s a bit nebulous, you might say that resiliency is likely one of the most important pieces of any recovery plan. Contingencies are expected, spare parts are readily available, monitoring tools have been deployed to quickly isolate problems, and personnel have the training and skill sets to utilize those tools and carry out any remediation tasks.
When we talk about reliability, we talk about the confidence level that a particular component will perform nominally. For example, an incandescent light bulb may operate for up to 2,000 hours, but new LED replacement lamps are routinely quoted at 50,000 hours of operation, making those lamps 25 times “more reliable”. In telecommunications networks, if you cannot increase an individual component’s reliability, utilizing a high-availability, active-active model can ultimately achieve the same goal. If one processor fails, the other processor is already running and takes over the operation. This shouldn’t be confused with active-standby, where there is still a disruption, although minimal, as the secondary processor comes online. The critical component here is the detection of the failure and the redirection to the secondary processor.
There is always strength in numbers. Redundancy goes hand-in-hand with both resiliency and reliability. Nothing ever lasts forever, especially electronic components. We try to calculate an MTBF (mean time between failures); however, those numbers are usually unrealistic for day-to-day operations, as there are many contributing factors at the subcomponent level that could cause catastrophic failure.
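As a rough illustration of that strength in numbers, the sketch below uses the textbook steady-state formula, availability = MTBF / (MTBF + MTTR), and then combines redundant units. It assumes failures are independent and that detection and failover are instantaneous, which real networks rarely deliver. The MTBF and repair-time figures are made up for the example.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Fraction of time a single unit is up, given MTBF and mean time to repair."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(single: float, copies: int) -> float:
    """Availability of N independent units where any one of them can carry the load."""
    return 1 - (1 - single) ** copies


# Hypothetical unit: fails about every 10,000 hours and takes 8 hours to repair.
one_unit = steady_state_availability(mtbf_hours=10_000, mttr_hours=8)
two_units = redundant_availability(one_unit, copies=2)

print(f"single unit: {one_unit:.5%}")
print(f"active-active pair: {two_units:.5%}")
```

Under these assumptions a single unit sits at roughly three nines while the pair clears six. A shared failure, such as a common software fault or a missed failover, breaks the independence assumption and erases most of that gain.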
The magic to running a solid and stable network is to closely manage, monitor, and statistically analyze every possible metric that there is. When a failure does occur, careful root cause analysis must be undertaken to determine what in fact failed, but it doesn’t end there. Taking it a step further and understanding the key indicators that were present prior to that failure is what will help you proactively avert that failure in the future. Burn me once, shame on you. Burn me twice, shame on me. Sunny day outages are the new uptime threat.