Metrics, Logs, Traces, and Spans in Application Monitoring

Monitoring an application or service is essential to keeping it running smoothly and continuously improving it.

To monitor an app effectively, it helps to understand four important concepts: metrics, logs, traces, and spans.

Metrics and Logs

Metrics are numerical values that describe some aspect of a system at a particular point in time. Examples include the amount of memory or storage used, disk access time, and response time. The purpose of a metric is to inform a user about the health and operation of the system at that point in time. Because metrics are lightweight, they can be reported and stored quickly, are often collected at regular intervals, and can support near real-time scenarios. They can also be pre-aggregated, for example as the average CPU usage for each hour. Metrics are ideal for use in simple logic, such as sending an email when the temperature of a machine exceeds a dangerous level.
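As a minimal sketch of pre-aggregation and simple threshold logic, the Python example below averages raw CPU samples per hour and then flags the hours whose average is too high. The sample data, the function names, and the 80 percent threshold are all hypothetical, chosen only for illustration; a real monitoring system would collect and evaluate these values for you.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw samples: (hour of day, CPU usage in percent).
samples = [(9, 40.0), (9, 60.0), (10, 85.0), (10, 95.0)]


def hourly_average(samples):
    """Pre-aggregate raw samples into one average value per hour."""
    by_hour = defaultdict(list)
    for hour, value in samples:
        by_hour[hour].append(value)
    return {hour: mean(values) for hour, values in by_hour.items()}


def hours_over_threshold(averages, threshold):
    """Return, in ascending order, the hours whose average exceeds the threshold."""
    return sorted(hour for hour, avg in averages.items() if avg > threshold)


averages = hourly_average(samples)
# An alerting rule would send an email for each hour listed here.
alerts = hours_over_threshold(averages, threshold=80.0)
print(averages, alerts)
```

Storing only the hourly averages keeps the data small, at the cost of losing the individual samples.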

Logs store events and activities. Log entries can contain different kinds of data, organized into records with a different set of properties for each type. These records tend to be well-structured and verbose, so they are usually much larger than metrics data. Logs are not pre-aggregated; you use querying, reporting, and other tools to perform aggregations. Telemetry, such as events and traces, is stored as logs alongside performance data so that it can all be combined for analysis. Log data can come from an application, its underlying service, the platform, or the operating system.
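To illustrate, here is a small Python sketch of log records with different properties per type, along with an aggregation computed at query time rather than in advance. The record fields (type, path, status, and so on) are assumptions made up for this example and do not reflect the schema of any particular logging tool.

```python
from collections import Counter

# Hypothetical log records: each record type carries its own set of properties.
log_records = [
    {"type": "request", "path": "/orders", "status": 200, "duration_ms": 35},
    {"type": "request", "path": "/orders", "status": 500, "duration_ms": 910},
    {"type": "exception", "message": "database timeout", "component": "orders-db"},
    {"type": "request", "path": "/health", "status": 200, "duration_ms": 2},
]


def count_by_status(records):
    """Aggregate at query time: count request records per HTTP status code."""
    requests = (r for r in records if r["type"] == "request")
    return Counter(r["status"] for r in requests)


status_counts = count_by_status(log_records)
print(status_counts)
```

Because the raw records are kept, the same logs can answer a different question later, such as which paths are slowest.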

So, which one should we use? The answer is: both!

Use metrics to determine the health of a system. When an unusual metric indicates a problem, query the logs to determine the reason for that problem.
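This two-step workflow might look like the following sketch: a metric flags when something went wrong, and a log query explains why. The per-minute error rates, the log records, and the 10 percent threshold are hypothetical values invented for this example.

```python
# Hypothetical metric: fraction of failed requests per minute.
error_rate_by_minute = {"12:00": 0.01, "12:01": 0.02, "12:02": 0.35}

# Hypothetical log records covering the same period.
logs = [
    {"minute": "12:01", "level": "INFO", "message": "order created"},
    {"minute": "12:02", "level": "ERROR", "message": "database timeout"},
    {"minute": "12:02", "level": "ERROR", "message": "database timeout"},
    {"minute": "12:02", "level": "INFO", "message": "retry scheduled"},
]


def unhealthy_minutes(metric, threshold):
    """Step 1: use the metric to find when the system was unhealthy."""
    return [minute for minute, rate in metric.items() if rate > threshold]


def errors_during(records, minute):
    """Step 2: query the logs for error messages in that time window."""
    return [
        r["message"]
        for r in records
        if r["minute"] == minute and r["level"] == "ERROR"
    ]


suspects = unhealthy_minutes(error_rate_by_minute, threshold=0.10)
reasons = {minute: errors_during(logs, minute) for minute in suspects}
print(reasons)
```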

Traces and Spans

A trace shows a continuous view of a single request as it moves through an application. Imagine a button on a web page that calls an API, which calls a library, which queries a database and returns the resulting data up the stack. When a user clicks the button, each component in the stack can log some information. Each log entry can contain a correlation ID to indicate that it is part of the same request. Filtering the logs on this correlation ID allows us to view details of the request as it moves through the system. Each component's log entries are known as a "span," while the collection of log entries for the entire request is a "trace."

This trace will differ from the one produced when another user clicks the button (or when the same user clicks it later). Each trace will have a different correlation ID, allowing us to distinguish one request from another. Storing the correlation ID is particularly helpful when isolating problems that occur sporadically. We can see whether parameters are passed properly or whether factors in one layer contribute to the error.
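The button example can be sketched in Python as follows. Each click generates a new correlation ID, every component logs with that ID, and filtering on the ID recovers the trace and its spans. The component names, the in-memory log list, and the helper functions are all hypothetical; real tracing libraries generate and propagate these IDs for you.

```python
import uuid

# Stands in for a shared log store (an assumption made for this sketch).
log_entries = []


def log(correlation_id, component, message):
    """Record one log entry tagged with the request's correlation ID."""
    log_entries.append(
        {"correlation_id": correlation_id, "component": component, "message": message}
    )


def query_database(correlation_id):
    log(correlation_id, "database", "query executed")
    return ["row1", "row2"]


def library_call(correlation_id):
    log(correlation_id, "library", "building query")
    return query_database(correlation_id)


def api_handler(correlation_id):
    log(correlation_id, "api", "request received")
    rows = library_call(correlation_id)
    log(correlation_id, "api", f"returning {len(rows)} rows")
    return rows


def handle_click():
    """Each button click starts a new request with its own correlation ID."""
    correlation_id = str(uuid.uuid4())
    api_handler(correlation_id)
    return correlation_id


def get_trace(correlation_id):
    """The trace: every log entry that belongs to one request."""
    return [e for e in log_entries if e["correlation_id"] == correlation_id]


def get_span(correlation_id, component):
    """A span: one component's log entries within that trace."""
    return [e for e in get_trace(correlation_id) if e["component"] == component]


first = handle_click()
second = handle_click()
for entry in get_trace(first):
    print(entry["component"], "-", entry["message"])
```

Passing the correlation ID explicitly to each function keeps the sketch simple; in practice, the ID is often carried in request headers or context objects instead.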

Conclusion

In this article, I covered some key application and service monitoring concepts. In future articles, I will show tools for monitoring in Microsoft Azure.