Posts

Friday Thought: All Outages are Not Equal

Last week Google Docs experienced an outage lasting about 30 minutes. Almost immediately, the “reconsider the cloud” articles and blogs began to appear. Articles like this one on Ars Technica immediately lump the Google Docs outage with other cloud outages, including Amazon’s outage earlier this year and the ongoing problems with Microsoft’s BPOS and Office365 services.

And while no outages are good, they are not all the same. In most cases, the nature of the outages and their impact reflect the nature of the architecture and the service provider.

  • The Google Docs outage was caused by a memory error and was exposed by an update.  Google acknowledged the error and resolved the issue in under 45 minutes.
  • Amazon’s outage was a network failure that took an entire data center offline.  Customers that signed up for redundancy were not impacted.
  • Microsoft’s flurry of outages, including a 6-hour outage that took Microsoft almost 90 minutes to fully acknowledge, appears to be related to DNS, load, and other operational issues.

Why is it important to understand the cause and nature of the outage?  With this understanding, you can provide rational comparisons between cloud and in-house systems and between vendors.

Every piece of software has bugs and some bugs are more serious than others.  Google’s architecture enables Google to roll forward and roll back changes rapidly across their entire infrastructure.  The fact that a problem was identified and corrected in under an hour is evidence of the effectiveness of their operations and architecture.

Compare this to in-house systems. Microsoft releases bug fixes and updates monthly, and these generally require server reboots. Depending on the size and use of each server (file/print, Exchange, etc.), multiple reboots may be necessary and reboots can run well over an hour. In the last two years, over 50% of all “patch Tuesday” releases have been followed up with updates, emergency patches, or hot-fixes with the recommendation of immediate action. Fixing a bug in one of Microsoft’s releases can take from hours to days. Comparatively, under an hour is not so shabby.

When looking across cloud vendors, the nature of the outage is also important.  Amazon customers that chose not to pay extra for redundancy knowingly assumed a small risk that their systems could become unavailable due to a large error or event.  Just like any IT decision, each business must make a cost/benefit analysis.

Customers should understand the level of redundancy provided with their service and the extra costs involved to ensure better availability.

The most troubling of the cloud outages are Microsoft’s.  Why?  Because the causes appear to relate to an inability to manage a high-volume, multi-tenant infrastructure.  Just like you cannot watch TV without electricity, you cannot run online services (or much of anything on a computer) without DNS.  That Microsoft continues to struggle with DNS, routing, and other operational issues leads me to believe that their infrastructure lacks the architecture and operating procedures to prove reliable.

Should cloud outages make us wary? Yes and no.  Yes to the extent that customers should understand what they are buying with a cloud solution — not just features and functions, but ecosystem.  No, to the extent that when put in perspective, cloud solutions are still generally proving more reliable and available than in-house systems.

 

 

Tuesday Take-Away: The True Role of the SLA

As you look towards cloud solutions for more cost-effective applications, infrastructure, or services, you are going to hear (and learn) a lot about Service Level Agreements, or SLAs. Much of what you will hear is a big debate about the value of SLAs and what SLAs offer you, the customer.

Unfortunately, some vendors are framing the value of their SLAs based on the compensation customers receive when the vendor fails to meet its service level commitments. The best example of this attitude is Microsoft’s comparison of its cash payouts to Google’s SLA that provides free days of service. Microsoft touts its cash refunds as a better response to failure. Why any company would send out a marketing message that begins with “When we fail …” is beyond me. But that is a subject for another post someday.

That said, Microsoft and those customers comforted by the compensation are totally missing the point of the SLA in the first place. Any compensation for excessive downtime is irrelevant with respect to the actual cost and impact on your business. And unless a vendor is failing miserably and often, the compensation itself is not going to change the vendor’s track record.

The true role of the SLA is to communicate the vendor’s commitment to providing you with service that meets defined expectations for Performance, Availability, and Reliability (PAR). The SLA should also communicate how the vendor defines and sets priorities for problems and how they will respond based on those priorities. A good SLA will set expectations and define the method of measuring whether those expectations are met.

Consider the Microsoft and Google example. Microsoft sets an expectation that you will have downtime. While the downtime is normally scheduled in advance, it may not be. Google, in contrast, sets an expectation that you should have no downtime, ever. The details follow.

Microsoft’s SLA is typical in that it excludes maintenance windows, periods of time when the system will be unavailable for scheduled or emergency maintenance. While Microsoft does not schedule these windows at a regular weekly or monthly time, it does promise to give you reasonable notice for maintenance windows. The SLA, however, allows Microsoft to declare emergency maintenance windows with little or no notice.

In August 2010, Microsoft’s BPOS service had 6 emergency maintenance windows, totaling more than 10 hours, in response to customers losing connectivity to the service, along with 30 hours of scheduled maintenance windows. Customers experienced more than 40 hours of downtime that month, all of it within the boundaries of the SLA and its expectations. On August 17, 2011, Microsoft experienced a data center failure that resulted in loss of Exchange access for its Office365 customers in North America for as long as five hours. The system was down for 90 minutes before Microsoft acknowledged this as an outage.
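To see how maintenance exclusions change the picture, here is a minimal Python sketch using the August 2010 figures above (30 scheduled hours plus at least 10 emergency hours). The `availability` function and its exclusion formula are hypothetical illustrations, not Microsoft's published calculation; the sketch assumes excluded windows are simply removed from the measured period.

```python
def availability(total_hours, downtime_hours, excluded_hours=0.0):
    """Percent availability over a period.

    excluded_hours is downtime the SLA does not count (for example,
    maintenance windows). Assumption: excluded time is removed from
    both the downtime and the measured period.
    """
    counted_downtime = max(downtime_hours - excluded_hours, 0.0)
    measured_hours = total_hours - excluded_hours
    return 100.0 * (measured_hours - counted_downtime) / measured_hours


HOURS_IN_AUGUST = 31 * 24   # 744 hours in the month
scheduled = 30.0            # scheduled maintenance, from the post
emergency = 10.0            # "more than 10 hours", so a lower bound

# What customers actually experienced
raw = availability(HOURS_IN_AUGUST, scheduled + emergency)

# What an SLA that excludes all maintenance windows would report
sla_view = availability(
    HOURS_IN_AUGUST,
    scheduled + emergency,
    excluded_hours=scheduled + emergency,
)

print(f"experienced: {raw:.2f}%  reported under exclusions: {sla_view:.2f}%")
```

Under these assumptions the service was reachable less than 95% of the month, yet the excluded windows leave nothing to count against the SLA.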

Google’s SLA sets an expectation for system availability 24x7x365, with no scheduled downtime for maintenance and no emergency maintenance windows.

The difference in SLAs sets a very different expectation and makes a statement about how each vendor builds, manages, and provides the services you pay for.

When comparing SLAs, understand the role of maintenance windows and other “exceptions” that give the vendor an out.  Also, look at the following.

  • Definitions for critical, important, normal, and low priority issues
  • Initial response times for issues based on priority level
  • Target time to repair for issues based on priority level
  • Methods of communicating system status and health
  • Methods of informing customers of issues and actions/results

Remember, if you need to use the compensation clause, your vendor has already failed.

 

 

 

How Secure is YOUR Cloud?

The Microsoft Marketing Machine is in overdrive touting the security of Microsoft Business Productivity Online Suite (BPOS), Exchange Online, and their other online services.  Much of the hype is in response to Google’s recent announcement that Google Apps Premier Edition has received FISMA Certification along with both SAS 70 Type I and II certifications.

As of August 26, 2010, Microsoft’s own FAQs for its online services acknowledge the lack of security certifications.

The Standard version of the Business Productivity Online Standard Suite will be seeking a SAS 70 Type II audit attesting to the effectiveness of Microsoft’s internal controls. While our U.S. datacenters maintain a SAS 70 Type II for the physical controls of each facility, the Services (Live Meeting, EHS, Exchange Online, SharePoint Online and Office Communications Online) themselves do not. Live Meeting maintains both the CyberTrust Service Provider Certification and the CyberTrust Application Certification, which surpasses the control requirements for SOX. The Business Productivity Online Standard Suite Standard implementation is scheduled to undergo the CyberTrust certification within the next couple of months.

Microsoft BPOS Outage Gets Little Attention

When it comes to Microsoft’s BPOS service, we often hear grumblings from the user community about downtime.  Yet, there is little or no press coverage.  Beyond the frequency of scheduled maintenance windows required for patches and security updates, we hear complaints about performance-related outages and emergency maintenance windows for fixes and updates to scheduled patches.

On Monday, August 23rd, Reuters reported that Microsoft’s US data centers experienced an outage lasting more than 2 hours, well beyond what the 99.9% SLA touted by Microsoft for BPOS and other hosted services allows.
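For a sense of scale, the downtime a 99.9% commitment permits can be worked out directly. This is a rough Python sketch; the 30-day measurement period and the `allowed_downtime_minutes` helper are assumptions made for illustration, not terms taken from Microsoft's SLA.

```python
def allowed_downtime_minutes(sla_percent, period_days=30):
    """Downtime budget, in minutes, for an availability target.

    Assumption: the target is measured over a fixed period of
    period_days with no exclusions.
    """
    period_minutes = period_days * 24 * 60
    return period_minutes * (100.0 - sla_percent) / 100.0


budget = allowed_downtime_minutes(99.9)   # about 43.2 minutes per 30 days
outage = 2 * 60                           # "more than 2 hours", a lower bound

print(f"budget: {budget:.1f} min, outage: at least {outage} min")
print("exceeds budget:", outage > budget)
```

A single outage of two hours uses up nearly three months' worth of a 99.9% budget measured this way.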

The last Gmail outage made headlines on CNN and USA Today. The Microsoft BPOS outage received relatively little coverage beyond a core set of reports in the IT press. Maybe there is a difference in expectations?

Microsoft BPOS Outages: Cloud vs. Hosted Server

Last week, Microsoft experienced several outages of its Business Productivity Online Suite (BPOS). As ZDNet noted, with so few users, nobody really noticed (as opposed to every Google performance or service issue making headlines).

In our opinion, the outage tells a much more important story: the difference between a hosted server and a cloud-based solution. MS BPOS runs as hosted servers on shared physical servers. In effect, Microsoft is installing its servers on hardware the same way you would install them as virtual servers on shared hardware. Microsoft is upfront that none of the services run in a replicated or redundant way. With the exception of email, for which users should be able to send, receive, and access 30 days of history, if your virtual server or physical server has troubles, you are out of luck.

The implications are serious. Without redundant services or data, any failure puts you, the customer, at risk for data loss. Imagine a server failure that corrupts an underlying SharePoint database. Access to documents, wikis, and other content can easily be lost. As Microsoft offers no clear mechanism for backing up data, data you place in BPOS is likely at greater risk than data kept on in-house systems.

Granted, Microsoft’s big customers (like Coca-Cola) can negotiate for special services. For the rest of the user community, at $120 per user per year, you would just be out of luck.

Here is a blog posting praising the virtues of BPOS and possible backup strategies. Clearly, the author does not get it. Why would you trust your data to a service, like SharePoint, where after a disaster impacting your servers hosted by Microsoft, you are likely to wait 6 days to get data that is 7 days old from the point of failure? That is effectively a 13-day gap in information.
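The 13-day figure is simply the restore wait added to the age of the restored data. A small Python sketch of that arithmetic follows; the variable names are illustrative, and the 6-day and 7-day values are the ones quoted above.

```python
from datetime import timedelta

# Figures quoted in the post
restore_wait = timedelta(days=6)   # time spent waiting for the restore
backup_age = timedelta(days=7)     # age of the restored data at the time of failure

# By the time the restore arrives, the data is this far behind the business
information_gap = restore_wait + backup_age

print(f"information gap: {information_gap.days} days")
```

In disaster-recovery terms these two numbers correspond roughly to recovery time and recovery point, and both count against you.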

Fundamentally, MS BPOS misses the mark. Either Microsoft doesn’t understand the needs of businesses or it is incapable of providing the level of service smart businesses require.