Keeping a computer system available all or almost all the time is a challenge.
Sometimes software patches or upgrades need to be installed on a server. Sometimes old hardware needs to be replaced. Sometimes hardware fails unexpectedly. Sometimes power to a building or part of a building fails.
All these things can contribute to downtime - some of it planned and some of it unplanned.
Monitoring, redundancy, and planning all reduce the risk of downtime in Azure.
Many resources in Azure are written in triplicate. Only one copy of that data or service is live at any given time. The other two exist in case the live copy becomes unavailable. If this happens, Azure will automatically route requests to one of these "backup" copies. The live copy is sometimes called a "hot" copy, while the two redundant backups are sometimes referred to as "cold" copies.
This works well during planned software and hardware upgrades. The cold copies' servers are upgraded first; then, new requests are routed to one of the upgraded cold copies, making it the hot copy, before the original hot copy is upgraded. Azure maintains something called "Update Domains" to help manage this. To avoid downtime, systems in separate Update Domains will not be shut down for upgrades at the same time.
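The upgrade order described above can be sketched as a toy model. This is only an illustration, not how Azure implements it: the `Replica` class, the `rolling_upgrade` function, and the integer update-domain numbers are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    update_domain: int   # hypothetical update-domain number
    version: int = 1
    hot: bool = False    # True for the live copy


def rolling_upgrade(replicas, new_version):
    """Upgrade one update domain at a time, cold copies first.

    Returns a log of the actions taken, in order. Assumes there is at
    least one cold copy outside the hot copy's update domain.
    """
    log = []

    def holds_hot(domain):
        return any(r.hot for r in replicas if r.update_domain == domain)

    # Domains holding only cold copies sort first; the hot copy's domain is last.
    domains = sorted({r.update_domain for r in replicas},
                     key=lambda d: (holds_hot(d), d))
    for domain in domains:
        members = [r for r in replicas if r.update_domain == domain]
        if any(r.hot for r in members):
            # Route traffic to an already-upgraded cold copy before
            # touching the current hot copy.
            standby = next(r for r in replicas
                           if not r.hot and r.version == new_version)
            for r in members:
                r.hot = False
            standby.hot = True
            log.append(f"promote {standby.name}")
        for r in members:
            r.version = new_version
            log.append(f"upgrade {r.name}")
    return log


replicas = [Replica("a", 0, hot=True), Replica("b", 1), Replica("c", 2)]
for action in rolling_upgrade(replicas, new_version=2):
    print(action)
```

At no point in this sketch are two update domains taken down together, and a hot copy is always available to serve requests.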
Unexpected downtime is harder to manage. It is typically caused by a hardware or software failure, or by the failure of a system on which a service depends, such as a power supply. All hardware fails at some point, so this must be dealt with.
To handle these failures, Azure continuously monitors its systems to determine when a failure occurs. When a failure on a hot copy is detected, requests are routed to a cold copy; then, a new copy of the service or data is deployed onto available hardware in order to maintain two redundant cold copies. Redundant copies of a service are kept in different parts of a datacenter, so that they don't rely on a single point of failure. These independent parts of the datacenter are known as "Fault Domains" because a fault in one Fault Domain will not affect services in the other Fault Domains.
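The failover and redeployment steps can be illustrated with a small simulation. It is a simplified sketch under stated assumptions, not Azure's actual logic: the `fail_over` function, the dictionary layout, and the idea of a fourth spare fault domain are hypothetical and exist only for this example.

```python
def fail_over(copies, failed_domain, all_domains):
    """Simulate the reaction to a fault in one fault domain.

    copies:        dict mapping fault-domain number -> "hot" or "cold"
    failed_domain: the fault domain where a failure was detected
    all_domains:   every fault domain that could host a copy (assumed)
    Returns a new dict describing the copies after recovery.
    """
    # Copies in the failed fault domain are lost; the rest are unaffected.
    survivors = {d: role for d, role in copies.items() if d != failed_domain}
    if not survivors:
        raise RuntimeError("no surviving copy to fail over to")

    # If the hot copy was lost, route requests to a surviving cold copy.
    if "hot" not in survivors.values():
        survivors[min(survivors)] = "hot"

    # Deploy replacements onto unused, healthy fault domains until the
    # original number of copies is restored (or no spare domain is left).
    spare = [d for d in all_domains
             if d != failed_domain and d not in survivors]
    while len(survivors) < len(copies) and spare:
        survivors[spare.pop(0)] = "cold"
    return survivors


before = {0: "hot", 1: "cold", 2: "cold"}
after = fail_over(before, failed_domain=0, all_domains=[0, 1, 2, 3])
print(after)
```

In this run the hot copy in fault domain 0 is lost, the cold copy in domain 1 is promoted, and a new cold copy is placed in the spare domain 3, so the service again has one hot and two cold copies in three separate fault domains.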
As a result of these practices, Azure can guarantee a certain level of uptime for each of its paid services. The level depends on the service and is usually expressed as a percentage of uptime. Most Azure guaranteed uptimes range from 99.5% to 99.99%. This guaranteed uptime percentage is set out in a "Service Level Agreement" or "SLA".
You can view the current uptime guarantee for each Azure service here.
A service with 99.5% uptime could be down for a maximum of about 1.83 days per year, while a service with 99.99% uptime could be down for a maximum of about 52.6 minutes per year.
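These figures follow directly from the percentage. As a quick check, here is a minimal sketch of the arithmetic, assuming a 365-day year; the function name `max_downtime_minutes` is just a label for this example.

```python
def max_downtime_minutes(uptime_percent, period_days=365):
    """Largest downtime, in minutes, that still meets the uptime target."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (100 - uptime_percent) / 100


for pct in (99.5, 99.9, 99.95, 99.99):
    minutes = max_downtime_minutes(pct)
    print(f"{pct}% uptime: up to {minutes:.1f} minutes "
          f"({minutes / 60:.1f} hours) of downtime per year")
```

For 99.5% this gives 2,628 minutes, which is 43.8 hours or roughly 1.83 days; for 99.99% it gives about 52.6 minutes.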
Azure backs this guarantee by agreeing to credit all or part of a customer's charges if the uptime target is not met in any given month. The exact credit amount depends on how much the target is missed by.
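Because the target is evaluated per month, it can help to see what a month's downtime means as a percentage. The sketch below is a simplification: it assumes a 30-day month and treats uptime as simply the share of minutes in which the service was up, whereas each service's SLA defines its own calculation. The function names are hypothetical, and no credit amounts are computed because those depend on the individual SLA.

```python
def monthly_uptime_percent(downtime_minutes, days_in_month=30):
    """Uptime for the month as a percentage of total minutes."""
    total_minutes = days_in_month * 24 * 60
    return 100 * (total_minutes - downtime_minutes) / total_minutes


def sla_met(downtime_minutes, target_percent, days_in_month=30):
    """True if the month's uptime reached the guaranteed percentage."""
    return monthly_uptime_percent(downtime_minutes, days_in_month) >= target_percent


# Hypothetical months: 4 and 5 minutes of downtime against a 99.99% target.
for downtime in (4, 5):
    uptime = monthly_uptime_percent(downtime)
    print(f"{downtime} min down: {uptime:.4f}% uptime, "
          f"target met: {sla_met(downtime, 99.99)}")
```

Under these assumptions a 99.99% target leaves a budget of a little over 4 minutes of downtime in a 30-day month, so a fifth minute would miss the target.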
As of this writing, here are the guaranteed uptimes for each Azure service.
Service | Uptime | Notes |
---|---|---|
Active Directory | 99.90% | |
Active Directory B2C | 99.90% | |
AD Domain Service | 99.90% | |
Analysis Service | 99.90% | |
API Management | 99.90% | |
App Service | 99.50% | |
Application Gateway | 99.50% | |
Application Insights | 99.90% | |
Automation | 99.90% | |
DevOps | 99.90% | |
Firewall | 99.50% | |
Front Door Service | 99.99% | |
Lab Services | 99.90% | |
Maps | 99.90% | |
Databricks | 99.50% | |
Backup | 99.50% | |
BizTalk Services | 99.90% | |
Bot Service | 99.90% | |
Cache | 99.90% | |
Cognitive Services | 99.90% | |
CDN | 99.90% | |
Cloud Services | 99.50% | Assumes at least 2 instances |
VMs | 99.50% | Assumes at least 2 instances |
VMs | 99.90% | Assumes Premium storage |
CosmosDB | 99.99% | |
Data Catalog | 99.90% | |
Data Explorer | 99.90% | |
Data Lake Analytics | 99.90% | |
Data Lake Storage Gen1 | 99.90% | |
DDoS Protection | 99.99% | |
DNS | 100.00% | |
Event Grid | 99.99% | |
Event Hubs | 99.90% | |
ExpressRoute | 99.50% | |
Azure Functions | 99.50% | |
HockeyApp | 99.90% | |
HDInsight | 99.90% | |
IoT Central | 99.90% | |
IoT Hub | 99.90% | |
Key Vault | 99.90% | |
AKS | 99.50% | |
Log Analytics | 99.90% | |
Load Balancer | 99.99% | |
Logic Apps | 99.90% | |
ML Studio | 99.95% | |
Media Services | 99.90% | |
Mobile Services | 99.90% | |
Azure Monitor | 99.90% | |
Multi-Factor Authentication | 99.90% | |
MySQL | 99.99% | |
Network Watcher | 99.90% | |
PostgreSQL | 99.99% | |
Power BI Embedded | 99.90% | |
SAP HANA on Azure Large Instances | 99.99% | |
Scheduler | 99.99% | |
Azure Search | 99.90% | |
Security Center | 99.90% | |
Service Bus | 99.90% | |
SignalR Service | 99.90% | |
Site Recovery | 99.90% | |
SQL Database | 99.99% | |
SQL Data Warehouse | 99.90% | |
SQL Server Stretch Database | 99.90% | |
Storage Accounts | 99.99% | 99.9% for Cold Storage |
StorSimple | 99.90% | |
Stream Analytics | 99.90% | |
Time Series Insights | 99.90% | |
Traffic Manager | 99.99% | |
Virtual WAN | 99.95% | |
VS App Center | 99.90% | |
VPN Gateway | 99.95% | |
VPN Gateway for VPN or ExpressRoute | 99.90% | |
Information Protection | 99.90% | |
Win10 IoT Core Svcs | 99.90% | |
VMware Solution | 99.90% | |
Services like Azure Backup and Azure Functions, which can be easily retried, have the lowest guaranteed uptime.
The highest guaranteed uptimes are reserved for mission-critical services, such as DNS and Traffic Manager, along with core database and storage offerings such as SQL Database, CosmosDB, and Storage Accounts.
Free services are not listed here, as they almost never have a guaranteed uptime. Even if they did, there is nothing to credit to the account.
Azure has systems in place to provide high availability and reliability. Microsoft has enough confidence in those systems to guarantee a predictable level of uptime and to back that guarantee with monetary credits.