Posts

When FiOS becomes “Frequent Interruption of Service”


We love the cloud.  We work and live in the cloud. So, when our Verizon FiOS service became unreliable, we had a plan.

The upside of cloud services is that they are available from anywhere you have Internet service.  The downside is that downtime means disruption.

We Have Options

Redundancy is the most common option considered to avoid Internet service disruptions. In our location, we could add a Charter business connection alongside our Verizon FiOS service, and reconfigure our Cisco ASA 5505 to load balance and manage fail-overs. The upside: we avoid outages and service quality issues. The downside: the setup cost and recurring fees.
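To make the fail-over idea concrete, here is a minimal sketch of the decision logic a dual-carrier setup relies on. It is not Cisco ASA configuration and not our actual setup. The class name, the thresholds, and the "primary"/"backup" labels are all assumptions chosen for illustration.

```python
class LinkMonitor:
    """Tracks probe results for the primary link and picks the active link.

    Hypothetical illustration: fail over after several consecutive failed
    probes, and fail back only after the primary has been healthy for a while,
    so short blips do not cause constant switching.
    """

    def __init__(self, fail_threshold: int = 3, recover_threshold: int = 5):
        self.fail_threshold = fail_threshold
        self.recover_threshold = recover_threshold
        self.active = "primary"
        self._consecutive_fails = 0
        self._consecutive_oks = 0

    def record_probe(self, primary_ok: bool) -> str:
        """Record one health probe of the primary link; return the active link."""
        if primary_ok:
            self._consecutive_oks += 1
            self._consecutive_fails = 0
        else:
            self._consecutive_fails += 1
            self._consecutive_oks = 0

        if self.active == "primary" and self._consecutive_fails >= self.fail_threshold:
            self.active = "backup"
        elif self.active == "backup" and self._consecutive_oks >= self.recover_threshold:
            self.active = "primary"
        return self.active
```

Requiring more good probes to fail back than bad probes to fail over is a deliberate choice for short, frequent interruptions like ours, where a link that has just recovered may well drop again.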

For now, we have taken a tactical approach. Our service issues are frequent, but short in duration (the side effect of DDoS attacks elsewhere on the network).  When quality drops or the service is down completely, we switch to our MiFi and smartphone HotSpots for Internet access.  By disabling WiFi and relying on the 4G LTE service, we run a softphone client on our iPhones and Android phones, providing full telephony capability and coverage.

While we may deploy a redundant carrier in the future, the flexibility of our cloud-based services and our cellular-based Internet access are sufficient in the short term.  Having options, and being able to work, is all part of our Business Continuity planning.

 

Incompetence 16; Microsoft 0

 

Last week, Microsoft’s new Outlook.com service suffered its second major outage since its launch earlier this year.  The most recent outage, a 16-hour fiasco impacting Outlook.com, Hotmail, and SkyDrive users, was due to a botched firmware update resulting in overheating servers in one of its data centers.  As reported in PC World, the switch-over to alternate servers also failed.

This outage follows a 9 1/2 hour Outlook.com outage in February that Microsoft acknowledged on Twitter but neglected to note on its status dashboard.  February also saw a major Azure outage, caused when Microsoft failed to renew and install new SSL security certificates (a mistake they also made one year earlier).  In November, the Office 365 service was down for most of a day when Microsoft was unable to allocate adequate resources.

This string of outages, all due to operational errors and architectural limitations, raises serious questions about Microsoft’s ability to manage a multi-tenant data center.

The outages also raise questions about Microsoft’s integrity with respect to marketing and customer expectations.  While Microsoft promotes Office 365 and its other services as redundant, these outages demonstrate that service reliability is facility-dependent.

 

Microsoft Azure Fail! Will Customers Bail?

 

Once again, a flagship Microsoft cloud service blows through the Service Level Agreement like a blizzard through the Midwest.  The February 22nd outage, impacting all Azure users worldwide, lasted more than 12 hours.

The culprit:  Microsoft failed to purchase and replace expiring SSL certificates.  In other words, Microsoft neglected to renew one of the most basic components that secure the Azure service.
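A check for expiring certificates does not need to be elaborate. The sketch below is an illustration only, not a description of Microsoft's process. It uses Python's standard `ssl.cert_time_to_seconds` to turn a certificate's `notAfter` timestamp into days remaining. The function names, the 30-day warning window, and the example dates are hypothetical.

```python
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days remaining before a certificate's notAfter timestamp.

    not_after uses the text format found in the dict returned by
    ssl.SSLSocket.getpeercert(), e.g. "Mar 01 00:00:00 2030 GMT".
    now is seconds since the epoch; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def needs_renewal(not_after: str, warn_days: int = 30,
                  now: Optional[float] = None) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days


# Hypothetical dates, for illustration only.
check_time = ssl.cert_time_to_seconds("Feb 14 00:00:00 2030 GMT")
print(needs_renewal("Mar 01 00:00:00 2030 GMT", now=check_time))  # True
```

Run on a schedule against every certificate in service, a check like this flags a renewal weeks before it can become an outage.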

As noted on RedmondMag.com:

“Furious customers wanted to know how something as simple as renewing a SSL cert could fall through the cracks. Even worse, how could that become a single point of failure capable of bringing down the entire service throughout the world?”

Once again, an operational error puts thousands of customers in the dark.  And this outage is one in a string of major service outages, including:

Microsoft described the issue as “A breakdown in our procedures”.  If not for the disruption and financial impact for thousands of companies, this statement might be considered almost comical.  Ironically, a different certificate error was behind a major Azure outage in February 2012.

To put this in perspective, how would you respond if your internal IT department had Microsoft’s track record of catastrophic failure?

 

It is difficult to trust that Microsoft has the operational maturity and rigor to design and manage multi-tenant, hosted services.  The Azure outage, and others like it, demonstrate immaturity, negligence, or incompetence.  Do the reasons matter given the frequency and impact?  With certificate outages in two consecutive annual renewal terms, it is hard to believe that Microsoft is learning from its mistakes.

 

Outlook.com Goes Dark This Time: Can Microsoft Run Cloud Services?

 

As reported by ZDnet on Feb 25th, Microsoft’s new Outlook.com service suffered an outage lasting more than seven (7) hours.  Many customers could not log in, and those that could experienced significant performance issues.

Even more disturbing, Microsoft did not acknowledge the outage until over 4 hours into the incident, and then only via Twitter.  And, 7 hours into the outage, the Outlook.com status page failed to note the outage at all.

This outage follows two Office 365 outages totaling more than 9.5 hours of downtime in November 2012.

While Microsoft has not commented on the cause of the Outlook.com outage, their apology to customers back in November disclosed that Microsoft cannot dynamically add and allocate resources to their infrastructure.  The best they can do is improve their ability to recover (related: Microsoft’s Apology Says Volumes about Office 365 Outages).

With a history of operational failures and acknowledged limitations in the underlying architecture, one has to wonder how well Microsoft is able to manage multi-tenant services.  Will the pattern of failures lead to a lack of trust?

 

Microsoft’s Apology Says Volumes about Office 365 Outages

 

It should be no secret that Microsoft’s Office 365 service continues to experience the types and frequency of outages that plagued its predecessor cloud service, BPOS.  While the outages receive little press coverage (are they so frequent that they are no longer newsworthy?), customers feel the impact.

In response to outages on Nov 8 and Nov 13, Microsoft sent customers a formal letter of apology (read it here).

Most disturbing to Office 365 customers is what Microsoft’s apology says about the quality and capabilities of Microsoft and the Office 365 platform.

With respect to the Nov 8th outage, Microsoft states the following:

“Office 365 utilizes multiple anti-virus engines to identify and clean virus messages from our customers’ inboxes. Going forward, we have built and implemented better recovery tools that allow us to remediate these situations much faster, and we are also adding some additional architectural safeguards that automatically remediate issues of this general nature.”

What this says is that, at times, significant virus traffic makes it to the email servers, and Microsoft has technology to remediate this problem by scanning servers and removing these messages from inboxes.  This is troublesome for a few reasons:

  • Best practice is to prevent viruses from reaching email servers, as any inbox remediation system allows the possibility that a virus is activated by a user before being cleaned.
  • Remediation of this problem has been manually driven, and automating the process is still in development.
  • Remediation of virus infections dramatically impacts performance, up to the level of an outage.
  • Microsoft has not yet built an infrastructure that is capable of preventing virus infections, and continues to be focused on remediation.

With respect to the Nov 13th outage, Microsoft states:

“This service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service.”

Microsoft acknowledges that they perform maintenance that can interrupt customer services outside of maintenance windows and that the Office 365 architecture lacks sufficient redundancy.  Microsoft is also admitting that the Office 365 infrastructure does not have sufficient capacity to handle peak demand loads and does not allow for automatic activation and allocation of resources based on demand.

In response to these outages, Microsoft promises the following:

“Significant capacity increases are already underway and we are also adding automated handling on these type of failures to speed recovery time.”

In essence, Microsoft cannot  predict or manage capacity, so they are throwing resources at the problem.   More importantly, Microsoft is not fixing the architecture in order to prevent load-based failures — they are automating how they respond to failures.

In other words:  Microsoft expects future Office 365 outages.  So, too, should Office 365 customers.

 

Friday Thought: All Outages are Not Equal

Last week Google Docs experienced an outage lasting about 30 minutes.  Almost immediately, the “reconsider the cloud” articles and blogs began to appear.   Articles like this one on Ars Technica immediately lump the Google Docs outage with other cloud outages, including Amazon’s outage earlier this year and the on-going problems with Microsoft’s BPOS and Office 365 services.

And while no outages are good, they are not all the same.  In most cases, the nature of the outages and their impact reflect the nature of the architecture and the service provider.

  • The Google Docs outage was caused by a memory error and was exposed by an update.  Google acknowledged the error and resolved the issue in under 45 minutes.
  • Amazon’s outage was a network failure that took an entire data center off-line.  Customers that signed up for redundancy were not impacted.
  • Microsoft’s flurry of outages, including a 6-hour outage that took Microsoft almost 90 minutes to fully acknowledge, appears to be related to DNS, load, and other operational issues.

Why is it important to understand the cause and nature of the outage?  With this understanding, you can provide rational comparisons between cloud and in-house systems and between vendors.

Every piece of software has bugs and some bugs are more serious than others.  Google’s architecture enables Google to roll forward and roll back changes rapidly across their entire infrastructure.  The fact that a problem was identified and corrected in under an hour is evidence of the effectiveness of their operations and architecture.

Compare Google to in-house systems.  Microsoft releases bug fixes and updates monthly, and these generally require server reboots.  Depending on the size and use of each server (file/print, Exchange, etc.), multiple reboots may be necessary and reboots can run well over an hour.  In the last two years, over 50% of all “patch Tuesday” releases have been followed up with updates, emergency patches, or hot-fixes with the recommendation of immediate action.  Fixing a bug in one of Microsoft’s releases can take from hours to days.  Comparatively, under an hour is not so shabby.

When looking across cloud vendors, the nature of the outage is also important.  Amazon customers that chose not to pay extra for redundancy knowingly assumed a small risk that their systems could become unavailable due to a large error or event.  Just like any IT decision, each business must make a cost/benefit analysis.

Customers should understand the level of redundancy provided with their service and the extra costs involved to ensure better availability.

The most troubling of the cloud outages are Microsoft’s.  Why?  Because the causes appear to relate to an inability to manage a high-volume, multi-tenant infrastructure.  Just like you cannot watch TV without electricity, you cannot run online services (or much of anything on a computer) without DNS.  That Microsoft continues to struggle with DNS, routing, and other operational issues leads me to believe that their infrastructure lacks the architecture and operating procedures to prove reliable.

Should cloud outages make us wary? Yes and no.  Yes to the extent that customers should understand what they are buying with a cloud solution — not just features and functions, but ecosystem.  No, to the extent that when put in perspective, cloud solutions are still generally proving more reliable and available than in-house systems.

 

 

Microsoft BPOS Outage Gets Little Attention

When it comes to Microsoft’s BPOS service, we often hear grumblings from the user community about downtime.  Yet, there is little or no press coverage.  Beyond the frequency of scheduled maintenance windows required for patches and security updates, we hear complaints about performance-related outages and emergency maintenance windows for fixes and updates to scheduled patches.

On Monday, August 23rd, Reuters reported that Microsoft’s US Data Centers experienced an outage lasting more than 2 hours … well beyond the downtime allowed by the 99.9% SLA touted by Microsoft for the BPOS and other hosted services.
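The arithmetic behind that claim is simple enough to sketch. The example below assumes the SLA is measured over a 30-day month, which is our assumption since the measurement period is not stated here, and it treats the outage as exactly 2 hours. The function names are ours.

```python
def allowed_downtime_minutes(sla_percent: float, period_days: int = 30) -> float:
    """Minutes of downtime permitted in the period at the given availability."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


def achieved_availability(outage_minutes: float, period_days: int = 30) -> float:
    """Availability percentage for the period, given total outage minutes."""
    total_minutes = period_days * 24 * 60
    return 100 * (1 - outage_minutes / total_minutes)


# Assumed 30-day measurement period; the 2-hour figure is a lower bound.
print(round(allowed_downtime_minutes(99.9), 1))  # 43.2
print(round(achieved_availability(120), 2))      # 99.72
```

Under those assumptions, a 99.9% commitment leaves roughly 43 minutes of downtime per month, so a single 2-hour outage uses nearly three months of allowance.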

The last Gmail outage made headlines on CNN and USA Today.   The Microsoft BPOS outage received relatively little coverage beyond a core set of reports in the IT press.  Maybe there is a difference in expectations?