SQL 101 - Part 4: Aggregating data

NOTE:

For demos in this article, we will use a table named Customer that contains 7 columns:

Name	Data Type
FirstName	nvarchar(50)
LastName	nvarchar(50)
StreetAddress	nvarchar(255)
City	nvarchar(255)
State	char(2)
ZipCode	nvarchar(10)
TotalSales	money

Afer adding a few rows to the table, the data looks like this.

CustID	FirstName	LastName	StreetAddress	City	State	ZipCode	TotalSales
1	David	Giard	123 Oxford Ct	Erlanger	KY	40111	500.00
2	Magic	Johnson	456 Hollywood Blvd	Lansing	MI	45222	1500.00
3	Bubba	Smith	789 Killbubbakill St	Baltimore	MD	10111	1000.00
4	Ron	Mason	501 E Grand River Ave	Lansing	MI	45333	2000.00
5	Steve	Smith	900 Belle St	Detroit	MI	48888	5000.00
6	Ryan	Miller	1 Shutout Ct	Buffalo	NY	32323	250.00
7	Brad	Van Pelt	99 Linebaker Ln	Owosso	MI	47777	4000.00

In the last article in this series, I expleined the basic functionality of the SELECT statement. I showed how to select columns from a table and sort or filter the results.

In this article, I'll show some more things you can do with the SELECT statemnts.

Sometimes we want our results to aggregate data. When aggregating data, we use functions that consolidate multiple rows and return the result of this aggregate function applied to many rows. The most common aggregate functions I use that SQL Server supports are:

Function	Descripiton
MAX	The maximum value of a column across all included rows
MIN	The minimum value of a column across all included rows
SUM	The sum of all values in a column across all included rows
AVG	The arithmetic average of all values in a column across all included rows
COUNT	The number of included rows

The MIN, MAX, SUM, and AVG functions accept a parameter - the name of the column on which to calculate these values. The column must hold a numeric data type, such as an INT, FLOAT or MONEY.

You may pass a column name as a parameter to the COUNT function but it doesn't matter which column because - in SQL Server - every column appears exactly once in each row and the COUNT function is used to count rows. By convention, we pass "*" as the parameter to the COUNT aggregate function to represent all rows.

A few examples will help clarify this. Since our table contains only one numeric column - TotalSales - we will use this in most of our aggregate functions.

The following query returns the maximum value of the TotalSales column.


SELECT MAX (TotalSales) AS MaxSales
    FROM Customer

Here are the results - one row with one column containing the highest numerical value in the TotalSales column.

MaxSales

5000.00

Similarly, the following query returns the sum of all values in the TotalSales column.

SELECT SUM (TotalSales) AS SumSales
    FROM Customer

This returns 14,750 which is the sum of 1000 + 1500 + 1000 + 2000 + 5000 + 250 + 4000

SumSales

14750.00

We can filter before applying an aggregate function. If we only want to sum of TotalSales for customers in michigan, we simply add a WHERE clause to our query.

SELECT SUM (TotalSales) AS SumSales
    FROM Customer
    WHERE State = 'MI'

This returns 12,500, which is 1500 + 2000 + 5000 + 4000, or the sum of the TotalSales column for only those customers in Michigan.

SumSales

12500.00

As mentioned before, we do not need to specify a particular column for the COUNT function.

SELECT COUNT (*) AS NumCusts
    FROM Customer

returns the a row and column containing number 7, which is how many rows are in our table.

So far, each query we have written has returned only one row. Often, however, we want to calculate an aggregation for each distinct value in a column or columns. We can do this by adding a GROUP BY clause to the query. For example, we may want to see the SUM of TotalSales for each state.

SELECT 
        State, 
        SUM (TotalSales) AS SumSales
    FROM Customer
    GROUP BY State

The above query returns a row for each distinct value in the State column and calculates the sum of TotalSales of all rows corresponding to that state.

State	SumSales
KY	500.00
MD	1000.00
MI	12500.00
NY	500.00

It's important to note that, when using the GROUP BY clause, you cannot return a column that is not part of the grouping. So

SELECT 
        State, 
        City, 
        SUM (TotalSales) AS SumSales
    FROM Customer
    GROUP BY State

results in an error because we are trying to return the city column, but we are not grouping on that column. Because a given state can have multiple cities, SQL does not know which one to display for the row returned.

You can group on multiple columns as in the following query

SELECT 
        State, 
        City, 
        SUM (TotalSales) AS SumSales
    FROM Customer
    GROUP BY State, City

In this case, we get a row with a sum for each combination of state and city.

State	City	SumSales
KY	Erlanger	500.00
MD	Baltimore	1000.00
MI	Lansing	1500.00
MI	Okemos	2000.00
MI	Owosso	4000.00
NY	Buffalo	250.00

By using the GROUP BY clause on a large table, we may end up with so many rows that it becomes difficult to find relevant data. Sometimes, we are only interested in those times when the aggregate value exceeds some threshhold. In these cases, it would be nice to only show aggregate rows that exceed that threshhold. This sounds like a good place to use a filter. Unfortunately, we cannot use the WHERE clause to accomplish this task because the WHERE clause filters data before the aggregation. We have to wait until after calculating the aggregate values becuase it is the aggregate values on which we want to filter. The HAVING clause is used to filter on aggregate values.

We can run our query to get the TotalSales sum for each state, but show only those states that have total sales of more than 2000, using the following query

SELECT 
        State, 
        SUM (TotalSales) AS SumSales
    FROM Customer
    GROUP BY State
    HAVING SUM (TotalSales) >= 1000

In this case, we don't see the row for Kentucky and New York because they had total sales summing less than $1000

State	SumSales
MD	1000.00
MI	112500.00

In this article, we showed how to use T-SQL's grouping and aggregate functions to return summary data from a database.