# Wednesday, May 8, 2019

Keeping a computer system available all or almost all the time is a challenge.

Sometimes software patches or upgrades need to be installed on a server. Sometimes old hardware needs to be replaced. Sometimes hardware unexpectedly fails. Sometimes power to a building or part of a building fails.

All these things can contribute to downtime - some of it planned and some of it unplanned.

Monitoring, redundancy, and planning all reduce the risk of downtime in Azure.

Many resources in Azure are written in triplicate. Only one copy of that data or service is live at any given time. The other two exist in case the live copy becomes unavailable. If this happens, Azure will automatically route requests to one of these "backup" copies. The live copy is sometimes called a "hot" copy, while the two redundant backups are sometimes referred to as "cold" copies.

This works well during planned software and hardware upgrades. The cold copies' servers are upgraded first; then, new requests are routed to one of the upgraded cold copies, making it the hot copy, before the original hot copy is upgraded. Azure maintains something called "Update Domains" to help manage this. Systems in separate Update Domains will not be shut down for upgrades simultaneously, in order to avoid downtime.
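The upgrade order described above can be sketched in a few lines of Python. This is only a toy model to make the sequence concrete, not how Azure implements it; the `Copy` class, the `update_domain` field, and the `rolling_upgrade` function are all hypothetical names invented for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Copy:
    name: str
    update_domain: int  # hypothetical label; one copy per Update Domain
    version: str = "v1"
    hot: bool = False


def rolling_upgrade(copies, new_version):
    """Upgrade cold copies first, swap the hot role, then upgrade the old hot copy."""
    log = []
    old_hot = next(c for c in copies if c.hot)
    cold = sorted((c for c in copies if not c.hot), key=lambda c: c.update_domain)

    # One Update Domain at a time: the hot copy keeps serving throughout.
    for copy in cold:
        copy.version = new_version
        log.append(f"upgraded cold copy {copy.name}")

    # Route new requests to an upgraded cold copy, making it the hot copy.
    new_hot = cold[0]
    old_hot.hot, new_hot.hot = False, True
    log.append(f"{new_hot.name} is now hot")

    # The original hot copy no longer serves requests, so it can be upgraded.
    old_hot.version = new_version
    log.append(f"upgraded former hot copy {old_hot.name}")
    return log


copies = [Copy("a", 0, hot=True), Copy("b", 1), Copy("c", 2)]
for line in rolling_upgrade(copies, "v2"):
    print(line)
```

At every step exactly one copy is hot, which is the property that keeps the service available during the upgrade.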

Unexpected downtime is harder to manage. It is typically caused by a hardware or software failure, or by the failure of a system on which a service depends, such as a power supply. All hardware fails at some point, so this must be dealt with.

To handle these failures, Azure continuously monitors its systems to determine when a failure occurs. When a failure on a hot copy is detected, requests are routed to a cold copy; then, a new copy of the service or data is deployed onto available hardware in order to maintain two redundant cold copies. Redundant copies of a service are kept in different parts of a datacenter, so that they don't rely on a single point of failure. These independent parts of the datacenter are known as "Fault Domains" because a fault in one Fault Domain will not affect services in the other Fault Domains.
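The failover steps can be modeled with a short Python sketch. It is a simplified illustration under stated assumptions, not Azure's actual mechanism: the `handle_failure` function, the dictionary layout, and the `healthy_fault_domains` parameter are hypothetical, and real monitoring and placement logic is far more involved.

```python
def handle_failure(copies, failed_name, healthy_fault_domains):
    """Return the set of copies after the copy named `failed_name` goes down.

    copies: list of dicts such as {"name": "a", "fault_domain": 0, "hot": True}
    healthy_fault_domains: fault domains with spare hardware (an assumption
    of this sketch; the failed domain should not be in this list).
    """
    failed = next(c for c in copies if c["name"] == failed_name)
    survivors = [dict(c) for c in copies if c["name"] != failed_name]

    # If the hot copy failed, route requests to one of the cold copies.
    if failed["hot"] and survivors:
        survivors[0]["hot"] = True

    # Deploy a replacement in a fault domain no surviving copy already uses,
    # so the copies still do not share a single point of failure.
    used = {c["fault_domain"] for c in survivors}
    target = next(d for d in healthy_fault_domains if d not in used)
    survivors.append(
        {"name": f"{failed_name}-replacement", "fault_domain": target, "hot": False}
    )
    return survivors


copies = [
    {"name": "a", "fault_domain": 0, "hot": True},
    {"name": "b", "fault_domain": 1, "hot": False},
    {"name": "c", "fault_domain": 2, "hot": False},
]
after = handle_failure(copies, "a", healthy_fault_domains=[1, 2, 3])
for c in after:
    print(c)
```

After the call there is again one hot copy and two cold copies, each in its own Fault Domain.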

As a result of these practices, Azure can guarantee a certain level of uptime for each of its paid services. The level is dependent on the service and is usually expressed in terms of percentage uptime. Most Azure guaranteed uptimes range from 99.5% to 99.99%. This guaranteed uptime percentage is known as a "Service Level Agreement" or "SLA".

You can view the current uptime guarantee for each Azure service here.

An uptime of 99.5% would be down a maximum of 1.83 days per year and an uptime of 99.99% would be down a maximum of 52.6 minutes per year.
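These figures are easy to verify. The sketch below assumes a 365-day year; the function name `max_downtime_minutes` is made up for this example.

```python
def max_downtime_minutes(sla_percent, period_days=365):
    """Maximum downtime, in minutes, allowed by an uptime percentage over a period."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - sla_percent) / 100


# 99.5% over a year: about 1.83 days
print(round(max_downtime_minutes(99.5) / (24 * 60), 2), "days")

# 99.99% over a year: about 52.6 minutes
print(round(max_downtime_minutes(99.99), 1), "minutes")
```

Passing `period_days=30` gives the equivalent monthly allowance, which is the more relevant number when an SLA is evaluated month by month.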

Azure guarantees this by agreeing to credit all or part of a customer's charges if the uptime target is not met in any given month. The exact credit amount depends on how much the target is missed.
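To illustrate how a tiered credit could work, here is a minimal Python sketch. The tier thresholds and credit percentages in `EXAMPLE_TIERS` are placeholders chosen for illustration only; they are assumptions, not the terms of any particular Azure SLA, and the function names are hypothetical. Check the SLA of the specific service for the real values.

```python
def monthly_uptime_percent(downtime_minutes, days_in_month=30):
    """Uptime for the month as a percentage of total minutes in the month."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


# ASSUMPTION: illustrative tiers only, as (uptime below this %, credit %).
EXAMPLE_TIERS = [(99.0, 25), (99.9, 10)]


def service_credit_percent(uptime_percent, tiers=EXAMPLE_TIERS):
    """Return the credit for the lowest tier the measured uptime falls under."""
    for threshold, credit in sorted(tiers):
        if uptime_percent < threshold:
            return credit
    return 0  # target met, no credit


uptime = monthly_uptime_percent(downtime_minutes=120)
print(f"{uptime:.3f}% uptime, {service_credit_percent(uptime)}% credit")
```

The further the measured uptime falls below the target, the larger the credit, which matches the idea that the amount depends on how much the target is missed.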

As of this writing, here are the guaranteed uptimes for each Azure service.

| Service | Uptime | Notes |
| --- | --- | --- |
| Active Directory | 99.90% | |
| Active Directory B2C | 99.90% | |
| AD Domain Service | 99.90% | |
| Analysis Service | 99.90% | |
| API Management | 99.90% | |
| App Service | 99.50% | |
| Application Gateway | 99.50% | |
| Application Insights | 99.90% | |
| Automation | 99.90% | |
| DevOps | 99.90% | |
| Firewall | 99.50% | |
| Front Door Service | 99.99% | |
| Lab Services | 99.90% | |
| Maps | 99.90% | |
| Databricks | 99.50% | |
| Backup | 99.50% | |
| BizTalk Services | 99.90% | |
| Bot Service | 99.90% | |
| Cache | 99.90% | |
| Cognitive Services | 99.90% | |
| CDN | 99.90% | |
| Cloud Services | 99.50% | Assumes at least 2 instances |
| VMs | 99.50% | Assumes at least 2 instances |
| VMs | 99.90% | Assumes Premium storage |
| CosmosDB | 99.99% | |
| Data Catalog | 99.90% | |
| Data Explorer | 99.90% | |
| Data Lake Analytics | 99.90% | |
| Data Lake Storage Gen1 | 99.90% | |
| DDoS Protection | 99.99% | |
| DNS | 100.00% | |
| Event Grid | 99.99% | |
| Event Hubs | 99.90% | |
| ExpressRoute | 99.50% | |
| Azure Functions | 99.50% | |
| HockeyApp | 99.90% | |
| HDInsight | 99.90% | |
| IoT Central | 99.90% | |
| IoT Hub | 99.90% | |
| Key Vault | 99.90% | |
| AKS | 99.50% | |
| Log Analytics | 99.90% | |
| Load Balancer | 99.99% | |
| Logic Apps | 99.90% | |
| ML Studio | 99.95% | |
| Media Services | 99.90% | |
| Mobile Services | 99.90% | |
| Azure Monitor | 99.90% | |
| Multi-Factor Authentication | 99.90% | |
| MySQL | 99.99% | |
| Network Watcher | 99.90% | |
| PostgreSQL | 99.99% | |
| Power BI Embedded | 99.90% | |
| SAP HANA on Azure Large Instances | 99.99% | |
| Scheduler | 99.99% | |
| Azure Search | 99.90% | |
| Security Center | 99.90% | |
| Service Bus | 99.90% | |
| SignalR Service | 99.90% | |
| Site Recovery | 99.90% | |
| SQL Database | 99.99% | |
| SQL Data Warehouse | 99.90% | |
| SQL Server Stretch Database | 99.90% | |
| Storage Accounts | 99.99% | 99.9% for Cold Storage |
| StorSimple | 99.90% | |
| Stream Analytics | 99.90% | |
| Time Series Insights | 99.90% | |
| Traffic Manager | 99.99% | |
| Virtual WAN | 99.95% | |
| VS App Center | 99.90% | |
| VPN Gateway | 99.95% | |
| VPN Gateway for VPN or ExpressRoute | 99.90% | |
| Information Protection | 99.90% | |
| Win10 IoT Core Svcs | 99.90% | |
| VMWare Solution | 99.90% | |

Services like Azure Backup and Azure Functions, which can be easily retried, have the lowest guaranteed uptime.

The highest guaranteed uptimes are reserved for mission-critical services, such as DNS and Traffic Manager, along with most of the core database and storage offerings.

Free services are not listed here, as they almost never have a guaranteed uptime. Even if they did, there is nothing to credit to the account.

Azure has systems in place to provide high availability and reliability. Microsoft has enough confidence in those systems to guarantee a predictable level of uptime and to back that guarantee with monetary credits.
