When the Single Point of Failure Actually Fails
While the heavy, wet snow continues to fall and cling to the power and fiber optic lines in our area, today’s Internet outage was not due to the first real storm this winter. Because we are fully in the cloud, any Internet outage could be a disaster, bringing business to a halt. In reality, the “single point of failure” really isn’t one. True, we do not have multiple routers. Nor do we have multiple broadband connections. What we do have is the ability to work over any form of Internet connection. Here is our case study (still in progress).
Late yesterday afternoon, our trusted Cisco ASA 5505 stopped working. Poof. Red Status light on; activity lights on the embedded switch ports blinking; no traffic. A few reboots and a few attempted hard resets later, the unit is still not working. A quick call and discussion, and our Cisco guru tells us “it’s a brick”. The unit is covered by warranty and a solid support/service plan, so a replacement will arrive in several days. In the meantime, we must continue to service our customers.
Quick Fix
The immediate response is to get our staff connected to the Internet in any way possible. A few mobile hotspots activated on our phones and one MiFi device booted up, and we are back in business. Performance is acceptable, not great, and we will plow through our data plan, but we are in business with only a few minutes of disruption.
Interim Fix
Our FiOS service enters our office through a service unit that converts the fiber to Gigabit Ethernet. We split this signal through a switch to 2 routers: one provided by our VoIP service, and the FiOS router/cable modem that comes with our service. The now-dead Cisco ASA plugs into the FiOS router.
Why two routers in sequence? Having 2 routers in sequence creates a physical DMZ: a network that can receive traffic from both the inside and the outside while letting us stop traffic from going all the way out or coming all the way in. It’s “old school”, as virtual DMZs are the trend. We use the DMZ and the FiOS router for a guest network and wireless. Guests can gain access to a physical or wireless connection while staying completely outside our secure network. The Cisco ASA, at the secure end of the DMZ, manages our inbound traffic, NAT, and legacy DMZ services (left over from the days when we had a few systems on-premises and needed remote access). Our secure WiFi runs off a Cisco/Linksys WAP inside the secure border of the ASA router.
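For readers who think better in code, the zone layout described above can be sketched as a tiny policy table. This is only an illustration, not our actual router configuration: the zone names, the `ALLOWED` set, and the `may_initiate` helper are all hypothetical, and the rule set is a simplification (for example, it ignores the inbound NAT rules the ASA handles).

```python
from enum import Enum


class Zone(Enum):
    OUTSIDE = "outside"  # the Internet, beyond the FiOS router
    DMZ = "dmz"          # between the FiOS router and the ASA (guest network)
    INSIDE = "inside"    # behind the ASA (secure network)


# Hypothetical, simplified policy: which (source, destination) zone pairs
# may start a connection. Anything not listed here is blocked.
ALLOWED = {
    (Zone.INSIDE, Zone.DMZ),      # staff can reach DMZ services
    (Zone.INSIDE, Zone.OUTSIDE),  # staff can reach the Internet
    (Zone.DMZ, Zone.OUTSIDE),     # guests can reach the Internet
    (Zone.OUTSIDE, Zone.DMZ),     # the outside can reach DMZ services
}


def may_initiate(src: Zone, dst: Zone) -> bool:
    """Return True if a host in src may open a connection to a host in dst."""
    return src == dst or (src, dst) in ALLOWED


if __name__ == "__main__":
    for src in Zone:
        for dst in Zone:
            verdict = "allow" if may_initiate(src, dst) else "block"
            print(f"{src.value:>7} -> {dst.value:<7} {verdict}")
```

The point of the sketch is the two missing pairs: nothing in the DMZ or on the outside can start a connection into the secure network, which is what keeps guests out.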
With a few minutes of work, we reconfigured the FiOS router, removing the DMZ and mimicking the settings and security configured in the ASA. Moving a few wires, we are up and running until the new ASA comes in.
Lessons Learned
Our focus has always been on the FiOS service as the single point of failure at greatest risk. Outages have traditionally been short, and because we have been able to adapt by using hotspots, MiFi, and working from home or other locations, we have not seen the need to bring in another ISP as an alternate service. The ASA failing was never really a consideration. The box is not yet out of warranty, and our prior Cisco routers lasted much longer than the 5-year extended warranty (we upgraded for features, not out of necessity).
Not having seen this scenario coming, we had to rebuild the FiOS router configuration from scratch. We have now saved this “emergency configuration” for future use. Once our new Cisco ASA arrives, we will create an emergency configuration that will let us remove the FiOS router from the network. Finally, we will build a configuration for the Cisco/Linksys WAP, as this has routing features and could replace the FiOS router in a pinch.
The biggest lesson, however, is the value of a cloud-based infrastructure with respect to business continuity. Storm or no storm, hardware failure or not, we know that we will always have options to keep our business up and running. Even when the “single point of failure” happens to fail.