2,278 Days of Waiting

Back in the spring of 2012, I had the privilege of presenting at the Illinois Institute of Technology’s Real-Time Communication conference in Chicago. In the session, I presented my construct for emergency-services location information delivery in a new over-the-top model that did not require a next-generation 911 network or ESInet. Although many laughed at the thought, Dr. Henning Schulzrinne, a noted professor at Columbia University and one of the authors of several Session Initiation Protocol RFCs, invited me to present my ideas at a Federal Communications Commission Workshop on the Upcoming Test Bed to Improve Indoor Location Accuracy for Wireless 911 Calls.

Of course, I agreed and made my way down to Washington, DC, where I delivered my presentation. I laid out my over-the-top delivery methodology for additional data, which effectively bypassed the voice carrier networks by using the Internet, releasing Ma Bell’s grasp and control of emergency services location data and its strong-arm binding to pre-existing static location records and phone numbers.

While many saw the value of my architecture, there were, of course, a few naysayers. Nonetheless, the idea itself was simple and quickly solved the problem of getting data from the origination point to the resources that needed it. Instead of storing the information in a carrier-hosted database, where the subscriber would have to pay not only for storage but for maintenance and updates as well, a static but unique pointer is placed in the carrier database so that any query can be redirected back to the origination network. Not only would this remove the excessive costs charged by the 911 location database providers, but the actual information would also now be available in real time and be the most current available.

If any updates occurred, such as a location change or a change of descriptive information, they would only need to be made in the internal copy of the database. Because that copy is owned, managed, and entirely controlled by the enterprise, this model is far more efficient than the carrier-based model. The only piece missing was the connection to the PSAP; however, by publishing the URL to the data in the enterprise, public safety could reach out to the data whenever it was needed. The functional element on the enterprise side of this model filled the role of feeding the URL data and proved to be a practical and efficient solution. Based on this model, Avaya had the SENTRY™ emergency call management platform, along with the associated integration modules, developed by 911 Secure, LLC. Now enterprise networks could prepare for the NG911 services that were going to arrive shortly.
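To make the “by reference” idea concrete, here is a minimal sketch of the enterprise side of that model: the carrier record holds nothing but a stable URL, and the PSAP dereferences it at call time to pull the enterprise’s current data. The endpoint path, field names, and sample record below are hypothetical illustrations only; this is not the actual SENTRY™ or NENA interface.

```python
# Minimal sketch of the "by reference" model described above.
# The path, fields, and data are hypothetical; the point is that the
# carrier database stores only a stable URL, and the PSAP dereferences
# it at call time to retrieve the enterprise's most current record.

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Enterprise-maintained location records, updated in place whenever a
# device moves -- no carrier database update required.
LOCATION_RECORDS = {
    "+19015551234": {
        "civicAddress": "Shelby County Buildings Dept, Memphis, TN",
        "floor": "3",
        "room": "Permit Office 312",
        "floorPlanUrl": "https://example-enterprise.test/plans/floor3.pdf",
    }
}

class LocationHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # e.g. GET /location/+19015551234  (the URL published to the carrier)
        number = self.path.rsplit("/", 1)[-1]
        record = LOCATION_RECORDS.get(number)
        body = json.dumps(record or {"error": "unknown caller"}).encode()
        self.send_response(200 if record else 404)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # The PSAP (or its call-handling equipment) simply issues a GET against
    # this URL when the call arrives and always receives the current record.
    HTTPServer(("0.0.0.0", 8443), LocationHandler).serve_forever()
```

In production this would obviously sit behind TLS and access control; the sketch only shows why an update made inside the enterprise is instantly visible to public safety without touching the carrier database.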

The entire premise for this architecture was that the connection between the originating network and the public safety answering point was an analog circuit capable of voice communications only. What was missing, and remained absent for the next 6 years, 2 months, and 26 days, was a secure, high-speed connection between the origination and the destination.

Earlier this year, RapidSOS announced the interoperability released with iOS 12 devices. When those devices placed an emergency call, the location payload stored in the device would be transmitted to the NG911 Additional Data Repository (ADR) provided by RapidSOS. PSAPs could access the repository through a standard query, once vetted, and retrieve the location of devices that originated emergency calls within their service area. Just a short time afterward, Google announced similar capabilities, also utilizing the RapidSOS repository. Within months of availability, over 2,000 PSAPs added the capability to their centers, covering nearly 70% of the population of the US.

Full disclosure: for the past five years I have held a non-compensated position as a technical advisory board member to RapidSOS. Because of this, I saw firsthand the value this service brought to the table with this new national repository. Since RapidSOS could ingest data from any source through their published APIs, I immediately went to work with the software engineers at 911 Secure, LLC, the developers behind SENTRY™, and had them create an integration module allowing the enterprise to contribute location and additional data with emergency calls as well. Within a few short weeks, they delivered a working model placing data in the RapidSOS sandbox, and the search began for an Avaya customer to be part of a live pilot program.
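Conceptually, the integration module’s job is simple: the moment the call server reports a 911 call, push the caller’s current record to the repository so the PSAP can query it seconds later. The sketch below is only an illustration of that flow under assumptions of my own; the endpoint URL, authentication scheme, and JSON fields are placeholders, not the actual RapidSOS API.

```python
# Hedged sketch of an enterprise-side "contribute additional data" step.
# Endpoint, token handling, and payload shape are placeholders, not the
# real RapidSOS interface; they only illustrate the push-on-call flow.

import json
import urllib.request

def contribute_location(caller_id: str, record: dict,
                        api_base: str = "https://adr.example.test",
                        token: str = "SANDBOX-TOKEN") -> int:
    """POST the enterprise's current record for this caller to the repository."""
    payload = json.dumps({"caller_id": caller_id, "location": record}).encode()
    req = urllib.request.Request(
        url=f"{api_base}/v1/locations/{caller_id}",
        data=payload,
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status  # a 2xx means the PSAP can now look it up by caller ID

# Typically triggered by the call server's event feed when 911 is dialed:
# contribute_location("+19015551234", {"civicAddress": "...", "floor": "3"})
```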

Fortunately, Shelby County, Tennessee, a long-time Avaya customer, was in the process of upgrading their CS1K communications platform to the latest Avaya Aura. Over on the public safety side, Shelby County 911 had just implemented the embedded RapidSOS capability in their Motorola VESTA™ platform, as well as the RapidSOS Lite web-based functionality in the PSAP serving the county facility. After presenting our use case to both parties, we began installation of SENTRY™ just before the holidays.

Finally, the day of reckoning came. On January 18, 2019, 2,278 days after I presented my over-the-top architecture to the Federal Communications Commission, a live call to 911 was placed from the Shelby County Buildings Department and answered at the Shelby County 911 Center, where they received voice, precise location, and additional data in the form of floor plans. There it was: we had made public safety technology history. I couldn’t help but feel a sense of pride, as we changed the game forever and proved that NG911 was not only possible but a reality.

This past year I had the honor of being in the Oval Office with Hank Hunt as the President signed Kari’s Law into the law of the land, and I was invited to be part of the Haleyville, Alabama 50th anniversary 911 Day celebration and serve as a Grand Marshal, alongside several of my good friends, in the town parade. Now I was part of telecommunications history as Avaya, 911 Secure, RapidSOS, Shelby County Buildings, and Shelby County 911 worked in concert to enable the very first emergency call delivering NG911 additional data to the PSAP.

Not only will this technology help save lives, it will provide desperately needed location details to public safety first responders, as well as critical multimedia such as video and still pictures in the event of an emergency.

Follow me on Twitter: @Fletch911
Read my Avaya blogs: http://Avaya.com/Fletcher

Plan 911 From Outer Space

A few of you may remember, back in July 1969, what was then the most famous, and furthest, long-distance phone call ever made. As for the rest of you, you are now Googling whether phones even existed that long ago!

I can assure you that they did. On July 20, 1969, then-President Nixon spoke via a telephone-radio transmission with Apollo XI astronauts Neil Armstrong and Edwin “Buzz” Aldrin, the President in the Oval Office and the astronauts on the surface of the Moon.

Of course, that call originated on landline circuits, was upconverted to a satellite link, and was then beamed into outer space through the Goldstone Deep Space Network. In many ways, that radio transmission was capable of carrying voice and data, just like any terrestrial-based radio transmission. With modern advances in communications, just as we have Wi-Fi here on the surface, the International Space Station (ISS) is also connected.

The magic of VoIP allows an IP-based telephone to exist no matter where its connectivity is coming from. That being said, it was really no amazing feat to put an IP phone inside the ISS, which apparently was done a few years ago. IP phones don’t live on their own, though; they need to register and connect with a call server that provides trunk resources to the outside world. Once again, our space-based VoIP phone follows this same rule and is connected to an IP telephony system inside NASA headquarters.

As many people do when calling international numbers, one of the astronauts recently forgot to dial the zero in the 011 international prefix. On the ISS phone, they dialed ‘9’ for an outside line, forgot the ‘0’, and then dialed ‘11’ followed by the international number. Of course, being a Kari’s Law compliant telephone system, as soon as it processed 911, the call was sent to public safety, triggering internal alarms along the way.
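If you squint at the digits, you can see why the misdial routes as an emergency. Here is a toy digit-analysis sketch of that logic; the outside-line prefix and routing strings are my own assumptions, not any vendor’s actual dial plan.

```python
# Toy digit-analysis sketch: why "9" + "11" + number looks like 911.
# The outside-line prefix and result strings are assumptions for illustration.

OUTSIDE_LINE_PREFIX = "9"

def classify(dialed: str) -> str:
    digits = "".join(ch for ch in dialed if ch.isdigit())
    # Kari's Law: 911 with no prefix reaches public safety directly,
    # and 9-911 (outside line + 911) must work as well. Note that the
    # forgotten-zero case (9, then 11, then the number) also begins with
    # the digits 9-1-1, so it matches here too.
    if digits.startswith("911") or digits.startswith(OUTSIDE_LINE_PREFIX + "911"):
        return "EMERGENCY: route to PSAP and raise on-site notification"
    if digits.startswith(OUTSIDE_LINE_PREFIX + "011"):
        return "International call via outside line"
    return "Internal or other call"

print(classify("9 11 44 20 7946 0000"))   # the forgotten-zero misdial -> EMERGENCY
print(classify("9 011 44 20 7946 0000"))  # the intended international call
```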

Fortunately, everyone realized it was just an accident, and there was no emergency launch of a police cruiser to intercept the ISS in orbit! So what’s the lesson learned? 911 needs to work everywhere, including “up there”! But it might be a good time to put in a little missile prevention programming. :)

Sunny Day Outages . . . Uptime Threats

Click here for the Audio Podcast of this Blog

For as long as I can remember, we built and engineered networks around the “3 Rs”: Resilience, Redundancy, and Reliability. Following that simple rule would protect you from the 50-year flood, the 100-year storm, and many other “rainy day outages.” When the network was up and running, humming away and performing nominally, it was considered a sunny day. Systems were online, everything was running well within specification, and network administrators would sit and babysit their huge collection of silicon and copper wires.

Every once in a while, the skies would become cloudy, network elements would fail, connectivity would be lost, and the data center would run at something less than full capacity. While this was certainly something that needed to be addressed, stress levels remained tolerable, as that engineered Resilience, Redundancy, and Reliability were all there, allowing data to be processed with little to no notice outside of those directly responsible for the systems’ uptime.

Before the evolution of the Internet and the acceptance of cloud-based services in massive data centers, most facilities managed their own data centers, where they had full control over the building, the environmental systems, and even diverse carrier network connectivity. Then the IT “Big Bang” (a.k.a. the Internet) occurred and considerably shook up the model. Massive data centers sprang up around the country, out in the middle of farmland, housing thousands and thousands of servers, virtual machines, and facilities for nearly every industry.

There was no mistaking it: the cloud was here, and everyone was in it. Performance and capacity rivaled that of localized data centers, and with the proper design, a mesh environment could be established where, even if a portion of the network did go offline, several other nodes were standing by at the ready to pick up the slack. To many, we had finally reached a utopia of computing power, and more and more critical applications were perfectly comfortable sitting in the public or quasi-public cloud.

Most of the time, I try to keep my thinking simplistic. I like to go back to the basics and understand the fundamentals of just about anything that I do. I believe that if you truly understand, at a very deep level, how a certain process operates, then when that process fails you’re equipped to properly troubleshoot it, repair it, and, most importantly, design around a similar failure in the future.

The most recent victim, over the New Year’s holiday, was the CenturyLink network. News reports over the weekend noted that areas of the country including Idaho, New Mexico, and Minnesota were affected, as well as residential services in 35 states in total. 911 services were also affected across the country, which prompted nationwide alerts to cell phone users advising them to use local 10-digit numbers in case of an emergency. Initial signs of the outage were detected around 1 AM Pacific time on Thursday morning, with resolution achieved by approximately 6 PM Pacific time on Friday, for a total of about 41 hours.

Five nines reliability??

When we build networks, we strive for five nines reliability, or 99.999% uptime calculated on an annual basis. Mathematically, this works out to just over five minutes of allowed outage per year. Based on 41 hours of disruption in the CenturyLink network, they’re starting off the brand-new year already down to roughly two and a half nines, or about 99.53% by my calculation.
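If you want to check my math, here is the back-of-the-envelope version (a sketch, assuming a non-leap year of 8,760 hours and a single continuous outage):

```python
# Back-of-the-envelope availability arithmetic for the figures above.

HOURS_PER_YEAR = 365 * 24          # 8,760 hours (non-leap year assumed)

# Allowed downtime at five nines (99.999%):
five_nines_downtime_min = (1 - 0.99999) * HOURS_PER_YEAR * 60
print(f"Five nines allows ~{five_nines_downtime_min:.2f} minutes/year")  # ~5.26

# Availability after a 41-hour outage:
availability = (HOURS_PER_YEAR - 41) / HOURS_PER_YEAR
print(f"41 hours down = {availability * 100:.3f}% availability")         # ~99.532%
```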

So let’s look a little deeper into this particular failure, keeping in mind that the full root cause analysis is still likely a week or so away.

RESILIENCE

In many cases, resilience is not a specific thing. Resilience is the ability to step back into action when any particular outage occurs. It doesn’t define what that action should be, only that the problem is quickly identified and remediated. So while it’s a bit nebulous, you might say that resilience is one of the most important pieces of any recovery plan: contingencies are expected, spare parts are readily available, and monitoring tools have been deployed to quickly isolate problems, along with the training and skill sets the personnel need to use those tools and carry out any remediation tasks.

RELIABILITY

When we talk about reliability, we talk about the confidence level that a particular component will perform nominally. For example, an incandescent light bulb may operate for up to 2,000 hours, but new LED replacement lamps are routinely quoted at 50,000 hours of operation, making those lamps 25 times “more reliable.” In telecommunications networks, if you cannot increase an individual component’s reliability, utilizing a high-availability, active-active model can ultimately achieve the same goal. If one processor fails, the other processor is already running and takes over the operation. This shouldn’t be confused with active-standby, where there is still a disruption, although minimal, as the secondary processor comes online. The critical component here is the detection of the failure and the redirection to the secondary processor.

REDUNDANCY

There is always strength in numbers. Redundancy goes hand-in-hand with both resilience and reliability. Nothing lasts forever, especially electronic components. We try to calculate an MTBF (mean time between failures); however, those numbers are usually unrealistic for day-to-day operations, as there are many contributing factors at the subcomponent level that could cause a catastrophic failure.
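Here is a small sketch of why redundancy pays off anyway, using the classic steady-state availability formula A = MTBF / (MTBF + MTTR). The MTBF and MTTR figures are made-up examples, not measurements from any real network element.

```python
# Sketch: steady-state availability of one element vs. an active-active pair.
# MTBF/MTTR values are hypothetical examples for illustration only.

MTBF_HOURS = 50_000   # assumed mean time between failures
MTTR_HOURS = 8        # assumed mean time to repair

single = MTBF_HOURS / (MTBF_HOURS + MTTR_HOURS)

# Two independent elements in an active-active pair: the service is down
# only when both happen to be down at the same time.
redundant_pair = 1 - (1 - single) ** 2

print(f"Single element:  {single * 100:.4f}% available")
print(f"Redundant pair:  {redundant_pair * 100:.6f}% available")
```

Of course, the pair only helps if the failures really are independent. A shared routing card, a bad configuration push, or a single fiber path can take out both sides at once, which is exactly how sunny day outages happen.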

The magic of running a solid and stable network is to closely manage, monitor, and statistically analyze every possible metric there is. When a failure does occur, careful root cause analysis must be undertaken to determine what in fact failed, but it doesn’t end there. Taking it a step further and understanding the key indicators that were present prior to that failure is what will help you proactively avert a similar failure in the future. Burn me once, shame on you. Burn me twice, shame on me. Sunny day outages are the new uptime threat.

Follow me on Twitter http://twitter.com/fletch911
