Usually, we need to have the right answers to build the right solution. But, to get these answers, we need to ask the right questions.

The Microsoft Azure Well-Architected Framework helps us to frame those questions.

The framework divides Azure concepts into five categories, which Microsoft refers to as "pillars". Just as physical pillars hold up a physical structure, these pillars hold up the design of your cloud application. Each pillar covers a broad area of web architecture and helps you formulate questions and answers about that area. The pillars (in no particular order) are:

  • Cost Optimization
  • Operational Excellence
  • Performance Efficiency
  • Reliability
  • Security

Let's briefly discuss what each one covers.

Cost Optimization

Azure resources provide value, but they cost money. This pillar helps you maximize the value you get for the price you pay: Are you getting sufficient value for the money spent, and are there ways to save money while still meeting your needs?

One way to improve cost optimization is to look for wasted resources, such as:

  • Services that are running during times they are not used
  • Disks that are not attached to a Virtual Machine
  • Services deployed with more capacity than they need

Monitoring can help you discover these inefficiencies, as can Azure Advisor.
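
As a rough illustration of the kind of check described above, here is a minimal Python sketch that scans an inventory for disks not attached to a Virtual Machine. The Disk class, the sample inventory, and the monthly costs are all hypothetical and invented for this example; they are not an Azure API. In practice, this information would come from a tool such as Azure Advisor or your own monitoring.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Disk:
    """Hypothetical inventory record; not an Azure SDK type."""
    name: str
    attached_vm: Optional[str]  # None means the disk is not attached to any VM
    monthly_cost: int           # made-up cost in whole dollars


def find_unattached_disks(disks: List[Disk]) -> List[Disk]:
    """Return the disks that no Virtual Machine is using."""
    return [d for d in disks if d.attached_vm is None]


def wasted_monthly_cost(disks: List[Disk]) -> int:
    """Total monthly cost of the disks that are not attached to a VM."""
    return sum(d.monthly_cost for d in find_unattached_disks(disks))


# Made-up sample data for illustration only
inventory = [
    Disk("web-os-disk", "web-vm-01", 20),
    Disk("old-data-disk", None, 38),
    Disk("test-scratch-disk", None, 10),
]

for disk in find_unattached_disks(inventory):
    print(f"Unattached: {disk.name} (${disk.monthly_cost}/month)")
print(f"Potential monthly savings: ${wasted_monthly_cost(inventory)}")
```

The same pattern (list the resources, filter for the ones nobody is using, total their cost) applies to the other kinds of waste listed above.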

Reliability

Reliability is the ability to keep an application or service running, to anticipate failures, and to have a plan to recover quickly from those failures.

Typically, a cloud application focuses reliability efforts on the ability to quickly recover from failure, rather than preventing failure. Often, we quantify a reliability target in terms of a Service Level Agreement (SLA), which is the promised percentage of time our service will be available. For example, our SLA may promise 99.99% uptime (often referred to as "four nines"), which means the application will be down no more than 0.01% of the time. That equates to less than an hour per year. However, reliability can also refer to maintaining a level of performance in terms of speed and features. Degradation of either of these decreases reliability.
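
To make the uptime arithmetic concrete, here is a small Python sketch that converts an uptime percentage into the downtime it permits. The function name is made up for this example, and it assumes a 365-day year.

```python
def allowed_downtime_minutes(sla_percent: float, period_hours: float = 365 * 24) -> float:
    """Minutes of downtime permitted by an uptime percentage over a period.

    Assumes a 365-day year by default.
    """
    downtime_fraction = 1 - sla_percent / 100
    return downtime_fraction * period_hours * 60


for sla in (99.0, 99.9, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% uptime allows about {minutes:.1f} minutes of downtime per year")
```

For 99.99% uptime, this works out to roughly 52.6 minutes per year, which is where "less than an hour" comes from.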

We can increase reliability by providing redundancy to eliminate single points of failure and by configuring failovers in case of trouble.

Despite your best efforts, your system may still go down unexpectedly, and you may lose some data. Therefore, it is essential that you have a detailed and tested plan to restore both the system and the data.

Azure takes care of some of this for us via such things as Update Domains, Availability Sets, Availability Zones, and built-in backup tools; but it is still up to us to opt into these services and configure them.

It is important to recognize that our application may be dependent on other services, so we need to consider the reliability of those services when considering the SLA of our app.
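
One common way to estimate the effect of dependencies is to multiply the availability of each service the application requires. The Python sketch below does this under two simplifying assumptions: every listed service must be up for the application to work, and the services fail independently of each other. The availability figures are made-up examples, not published SLAs for any Azure service.

```python
from math import prod  # available in Python 3.8 and later
from typing import Iterable


def composite_availability(availabilities: Iterable[float]) -> float:
    """Estimate combined availability of services that must all be up.

    Assumes the services fail independently; each value is a fraction
    between 0 and 1 (for example, 0.9995 for 99.95%).
    """
    return prod(availabilities)


# Hypothetical figures: a web app at 99.95% that requires a database at 99.99%
combined = composite_availability([0.9995, 0.9999])
print(f"Estimated composite availability: {combined:.4%}")
```

Note that the combined figure is lower than either individual figure, so an application cannot realistically promise a higher SLA than the services it depends on unless it adds redundancy or a fallback.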

Operational Excellence

Operational Excellence refers to the ability to deploy an application reliably and to verify that deployment. Automated deployment is your friend here. Infrastructure as Code tools, such as ARM templates, allow you to declare the state of an environment after a deployment. Automated build tools, such as Continuous Integration and Continuous Deployment pipelines in GitHub and Azure DevOps, allow you to consistently build, test, and deploy your code to an environment in a repeatable way. Using these tools, we can automate the initial deployment of our application and manage subsequent releases, even rolling back a release if something goes wrong.

Testing is crucial to ensure quality before deploying code. Monitoring can verify that the application remains in a good state and alert you when it does not.

These processes can and should be automated as much as possible to simplify assurance of Operational Excellence.
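
The deploy, verify, and roll back flow described in this section can be sketched in a few lines of Python. This is only an outline of the idea, not how GitHub or Azure DevOps pipelines are implemented: the release function and the deploy and is_healthy callables are hypothetical stand-ins for whatever your pipeline provides.

```python
from typing import Callable, List


def release(new_version: str,
            previous_version: str,
            deploy: Callable[[str], None],
            is_healthy: Callable[[], bool]) -> str:
    """Deploy new_version, verify it, and roll back if verification fails.

    Returns the version that is live when the function finishes.
    """
    deploy(new_version)
    if is_healthy():
        return new_version
    # Verification failed: redeploy the last known good version.
    deploy(previous_version)
    return previous_version


# Stand-ins for a real pipeline: record each deployment in a list
# and simulate a failing health check.
deployed: List[str] = []
live = release("1.1.0", "1.0.0", deploy=deployed.append, is_healthy=lambda: False)
print(f"Live version: {live}; deployment history: {deployed}")
```

Because the health check fails in this example, the function ends with the previous version live. In a real pipeline, the health check would typically be an automated test suite or a monitoring probe.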

Security

Security is an important pillar when architecting any application, whether or not it is in Azure. It is also one of the more complex topics in software development.

A key principle to keep in mind when designing your application is "Zero Trust" - never assume you can trust the person or account accessing your application.

A good approach is to recognize the layers that a user must get through to access your data and to add protection at each layer.

Azure implements some security for you by default, such as encryption of data in storage accounts. But you must be aware of potential areas of attack and defend against them. Tools like Azure AD Single Sign-On, Privileged Identity Management, and Azure Key Vault help you implement a secure solution.

Performance Efficiency

Planning for the demands of your application can help you determine how much capacity you need. However, application demand tends not to be constant over time. It may vary by day of the month or week or by hour of the day. It may even vary in response to expected or unexpected events. Ideally, we will constantly adjust our application's capacity based on the demand over time.

There are two ways to increase capacity: Scaling up and scaling out. Scaling up refers to using a more powerful virtual machine. Scaling out refers to deploying more instances of your application.
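
As a simple illustration of scaling out in response to demand, the Python sketch below calculates how many instances are needed for a given load. The function, the per-instance capacity, and the instance limits are all assumptions made up for this example; a real autoscale rule would usually be driven by monitored metrics rather than a single number.

```python
import math


def instances_needed(requests_per_second: float,
                     capacity_per_instance: float,
                     min_instances: int = 1,
                     max_instances: int = 10) -> int:
    """Number of instances required to serve the current demand.

    The result is clamped between min_instances and max_instances so the
    application never scales to zero and never exceeds a cost ceiling.
    """
    required = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(required, max_instances))


# Hypothetical demand at different times, with each instance
# assumed to handle 100 requests per second.
for demand in (0, 450, 5000):
    count = instances_needed(demand, capacity_per_instance=100)
    print(f"{demand} requests/second -> {count} instance(s)")
```

The upper limit is a design choice that ties this pillar back to Cost Optimization: without it, a spike in demand could translate directly into a spike in spending.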

Crossover

These categories are pillars, but they are not silos. Addressing one pillar can have implications in other pillars. For example, there are cost considerations for almost everything we do. And how we implement Operational Excellence may affect our Performance Efficiency. And monitoring is a key to almost all these pillars.

When designing an Azure application, it is important to consider each pillar and note that some tradeoffs may be necessary.

Next steps

For more information, visit the Azure Well-Architected Framework home page.

Microsoft Assessments contains an Azure Well-Architected Review that walks you through a series of questions to determine how well you have addressed some or all of the pillars.

In this article, we described the Microsoft Azure Well-Architected Framework and briefly discussed each of its five pillars.