The 9-1-1 network is there 24 hours a day, 7 days a week, 365 days a year. Public Safety touts 5 x 9’s reliability, resiliency, and a redundant network, ready for when the worst happens. Why is it, then, that when the worst does happen, it turns out to be a very bad day for all?
Exactly 3 years ago, during the last week of October 2012, Hurricane Sandy hit the Northeastern US. It was a bad week, with central offices flooded, networks out of service, and battery backups failing to react or not providing the uptime they were designed to provide. Much of the PSTN in the Northeast was offline, including many of our 9-1-1 centers.
Just before midnight on April 9th, 2014, the Pacific Northwest experienced a 9-1-1 outage that affected a total of 83 PSAPs. This included five PSAPs in Florida, South Carolina, and Pennsylvania that all relied on a common 9-1-1 routing service. The root cause here wasn’t Mother Nature. This was a classic “sunny day” outage, one that did not result from an extraordinary disaster or other unforeseeable catastrophe. It was caused by a database overflow that prevented new calls from getting the critical record identifier used to track them in the network, and, therefore, the calls failed to route and terminate properly.
Now we have yet another ‘9-1-1 Glitch’, this time in Western Pennsylvania. According to KDKA in Pittsburgh, the problem was the result of a computer communication fail-safe that failed to do its job. After about 2 hours, the unnamed software vendor was able to fix the bug. But there are a few minor details that are still missing. What exactly was the reported ‘bug’, and who is this mystery software vendor? These questions matter because CenturyLink was involved in both the Pacific Northwest outage and the Western PA outage. If there is a similarity between the April 2014 outage and the October 2015 outage, the public has a right to know and understand that. It should also be a red flag to any other environment that uses a similar topology.
We like to think our 9-1-1 networks have 5 x 9’s reliability. But to achieve that, you are only allowed 5.26 minutes of downtime a year. With this outage reportedly lasting 2 hours, that would mean that out of the 525,600 minutes in a year, they were up for only 525,480 of those minutes, which is about 99.98% of the time, or just under 4 x 9’s. Maybe 5 x 9’s is not high enough to strive for? Or maybe we are not holding our 9-1-1 vendors to strict enough SLAs?
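For anyone who wants to check the math, here is a minimal Python sketch of the downtime arithmetic. The function names are my own for illustration, and it assumes a 365-day year and a single outage of exactly 120 minutes, which is only the reported figure.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year


def allowed_downtime_minutes(nines: int) -> float:
    """Yearly downtime budget for an availability of N nines (5 -> 99.999%)."""
    unavailability = 10 ** -nines  # e.g. 0.00001 for five nines
    return MINUTES_PER_YEAR * unavailability


def availability_percent(downtime_minutes: float) -> float:
    """Percentage of the year the service was up, given total downtime."""
    uptime_minutes = MINUTES_PER_YEAR - downtime_minutes
    return 100 * uptime_minutes / MINUTES_PER_YEAR


if __name__ == "__main__":
    # Downtime budget at three, four, and five nines
    for n in (3, 4, 5):
        print(f"{n} x 9's: {allowed_downtime_minutes(n):.2f} minutes per year")

    # Assumed outage length: 2 hours (120 minutes), per the news reports
    print(f"2-hour outage: {availability_percent(120):.4f}% uptime")
```

A two-hour outage blows through the four-nines budget of roughly 52.6 minutes, and it uses up about 23 years' worth of the five-nines budget in one go.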
Based on the news reports, there was some problem with the primary system going into a state of partial failure. Since it had not completely failed, the backup never kicked in. This, in itself, is a failure of the Active-Standby design. High availability systems today are designed with Active-Active processing, so there is no switchover or failover time. The best example of this is any commercial airliner today. Although there are 2 engines on the plane, it is perfectly capable of flying on a single engine. Should one engine fail in flight, the plane can safely navigate and land at an alternate airport for repairs. Both engines are running from takeoff to touchdown; they don’t leave one off in case the first one fails.
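To make the difference concrete, here is a toy Python model of the two designs. It is only a sketch under my own assumptions: the `Node` class, its `alive` and `can_route` flags, and both dispatch functions are hypothetical, and nothing here reflects the actual vendor software, whose details have not been made public. The partial failure is modeled as a node that still answers a liveness probe but cannot complete calls.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical call-routing node (illustrative only)."""
    name: str
    alive: bool = True      # answers a basic liveness probe
    can_route: bool = True  # can actually complete a call

    def route(self, call_id: int) -> bool:
        # A call completes only if the node is up AND working properly
        return self.alive and self.can_route


def active_standby(primary: Node, standby: Node, call_id: int) -> bool:
    # Failover is driven by the liveness probe alone, so a primary that is
    # "up but broken" keeps receiving every call.
    target = primary if primary.alive else standby
    return target.route(call_id)


def active_active(nodes: list[Node], call_id: int) -> bool:
    # All nodes are in service all the time; a call that fails on one node
    # is simply offered to the next, so there is no switchover event.
    start = call_id % len(nodes)
    for offset in range(len(nodes)):
        node = nodes[(start + offset) % len(nodes)]
        if node.route(call_id):
            return True
    return False


def completed_calls(total_calls: int = 10) -> tuple[int, int]:
    """Return (active-standby completions, active-active completions)."""
    broken = Node("A", alive=True, can_route=False)  # partial failure
    healthy = Node("B")
    standby_ok = sum(active_standby(broken, healthy, c) for c in range(total_calls))
    active_ok = sum(active_active([broken, healthy], c) for c in range(total_calls))
    return standby_ok, active_ok


if __name__ == "__main__":
    standby_ok, active_ok = completed_calls()
    print(f"Active-Standby completed {standby_ok} of 10 calls")
    print(f"Active-Active completed {active_ok} of 10 calls")
```

In this simplified model the Active-Standby pair drops every call, because the standby is never told to take over, while the Active-Active pair keeps completing calls on the healthy node. Real systems are far more involved, but the failure mode is the same one the news reports describe.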
I am hoping the true story behind this outage is made public, and soon. Public Safety Administrators have the lives of millions in their hands, and they want to do the best they can and follow industry best practices. Unfortunately, it seems these system failures are becoming systemic, and while the FCC is stepping in with its Task Force on Optimal PSAP Architecture, I hope that is not too little, too late.
Mark J. Fletcher, ENP is the Chief Architect for Worldwide Public Safety Solutions at Avaya. As a seasoned professional with nearly 30 years of service, he provides the strategic roadmap and direction of Next Generation Emergency Services in both the Enterprise and Government portfolios at Avaya. In 2014, Fletcher was made a member of the NENA Institute Board in the US and co-chair of the EENA NG112 Committee in the EU, where he provides insight to State and Federal legislators globally, driving forward both innovation and compliance.