It should be no secret that Microsoft’s Office 365 service continues to experience the types and frequency of outages that plagued its predecessor cloud service, BPOS. While the outages receive little press coverage (perhaps they are frequent enough that they are no longer newsworthy), customers feel the impact.
In response to outages on Nov 8 and Nov 13, Microsoft sent customers a formal letter of apology (read it here).
Most disturbing for Office 365 customers is what the apology reveals about the quality and capabilities of the Office 365 platform and of Microsoft as its operator.
With respect to the Nov 8th outage, Microsoft states the following:
“Office 365 utilizes multiple anti-virus engines to identify and clean virus messages from our customers’ inboxes. Going forward, we have built and implemented better recovery tools that allow us to remediate these situations much faster, and we are also adding some additional architectural safeguards that automatically remediate issues of this general nature.”
What this says is that, at times, significant virus traffic makes it to the email servers, and Microsoft has technology to remediate this problem by scanning servers and removing these messages from inboxes. This is troublesome for a few reasons:
- Best practice is to prevent viruses from reaching email servers, as any inbox remediation system allows the possibility that a virus is activated by a user before being cleaned.
- Remediation of this problem has been manually driven, and automation of the process is still in development.
- Remediation of virus infections dramatically impacts performance, up to the level of an outage.
- Microsoft has not yet built an infrastructure that is capable of preventing virus infections, and continues to be focused on remediation.
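To make the distinction between prevention and remediation concrete, here is a minimal sketch of a gateway that scans each message before it is delivered, so a flagged message never reaches an inbox. It is an illustration only: `Message`, `Gateway`, and `signature_scanner` are hypothetical names, the byte-pattern check stands in for a real anti-virus engine, and nothing here describes how Office 365 is actually built.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Message:
    recipient: str
    body: bytes


# A scanner returns True when it flags a message as malicious.
Scanner = Callable[[Message], bool]


@dataclass
class Gateway:
    """Hypothetical mail gateway that scans before delivery."""
    scanners: List[Scanner]
    delivered: List[Message] = field(default_factory=list)
    quarantined: List[Message] = field(default_factory=list)

    def accept(self, msg: Message) -> bool:
        # Quarantine the message if any engine flags it; otherwise deliver it.
        if any(scan(msg) for scan in self.scanners):
            self.quarantined.append(msg)
            return False
        self.delivered.append(msg)
        return True


def signature_scanner(msg: Message) -> bool:
    # Assumed stand-in for a real engine: flags a fixed byte pattern.
    return b"EICAR" in msg.body


gateway = Gateway(scanners=[signature_scanner])
clean_ok = gateway.accept(Message("alice@example.com", b"Quarterly report attached"))
infected_ok = gateway.accept(Message("bob@example.com", b"...EICAR test payload..."))
```

Because the check runs before delivery, there is no window in which a user can open the infected message, and no after-the-fact sweep of inboxes competing with normal mail traffic for server resources.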
With respect to the Nov 13th outage, Microsoft states:
“This service incident resulted from a combination of issues related to maintenance, network element failures, and increased load on the service.”
Microsoft acknowledges that they perform maintenance that can interrupt customer services outside of maintenance windows and that the Office 365 architecture lacks sufficient redundancy. Microsoft is also admitting that the Office 365 infrastructure does not have sufficient capacity to handle peak demand loads and does not allow for automatic activation and allocation of resources based on demand.
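As a rough illustration of what allocating resources based on demand involves, the sketch below computes how many servers a given load requires while keeping headroom for peaks and spare machines for failures or maintenance. The function name, parameters, and numbers are all assumptions made for the example, not figures from Microsoft.

```python
import math


def required_servers(load_rps: float,
                     capacity_rps: float,
                     target_utilization: float = 0.5,
                     spares: int = 1) -> int:
    """Servers needed to carry load_rps with headroom and redundancy.

    load_rps: current or forecast requests per second (assumed metric).
    capacity_rps: requests per second one server can sustain.
    target_utilization: fraction of capacity to use in normal operation.
    spares: extra servers kept to absorb failures or maintenance.
    """
    if capacity_rps <= 0 or not 0 < target_utilization <= 1:
        raise ValueError("capacity and utilization must be positive")
    usable_per_server = capacity_rps * target_utilization
    return math.ceil(load_rps / usable_per_server) + spares


normal = required_servers(load_rps=1000, capacity_rps=100)
peak = required_servers(load_rps=1500, capacity_rps=100)
to_activate = peak - normal  # servers an automated system would bring online
```

An architecture with automatic allocation runs a calculation of this kind continuously and activates the difference before the load arrives, rather than after a failure has already occurred.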
In response to these outages, Microsoft promises the following:
“Significant capacity increases are already underway and we are also adding automated handling on these type of failures to speed recovery time.”
In essence, Microsoft cannot predict or manage capacity, so they are throwing resources at the problem. More importantly, Microsoft is not fixing the architecture in order to prevent load-based failures — they are automating how they respond to failures.
In other words: Microsoft expects future Office 365 outages; so, too, should Office 365 customers.