In my last article, I defined the basic concepts of database, table, column and row. Using these constructs, you can organize data into a rectangular format. This paradigm often works really well, because

  • You can group related information into a single container (a table)
  • Each row represents a single entity (such as a customer, employee, or invoice) and
  • Each column represents an attribute of the entity (such as FirstName, LastName, or TotalSales).

Using this model, we can create a table containing information about a customer's purchases. Each row in this table might represent a single purchase.

When a customer purchases an item, we probably would want to store some information about that purchase. These bits of information about each purchase are attributes of the purchase and are therefore candidates for columns. Below are examples of the information we might want to save about a customer's purchase.

  • Date of Purchase
  • Customer First Name
  • Customer Last Name
  • Customer Street Address
  • Customer City
  • Customer Zip Code
  • Item Purchased
  • Quantity Purchased
  • Price per Item

We can create a table CustomerPurchase with a column for each of the above attributes and begin populating with data each time a customer purchases something. The data would look something like this:

PurchaseDate  CustomerFirstName  CustomerLastName  CustomerStreetAddress  CustomerCity  CustomerZipCode  ItemPurchased  Quantity  PricePerItem
2/26/2009     John               Smith             123 Elm                Bigg City     48222            Lamp           1         40
2/26/2009     Bill               Jones             456 Maple              Smallville    48333            Chair          2         100
2/26/2009     Mary               Brown             789 Oak                Middleton     48444            Table          1         50

This model seems to capture the information we want. Do you see any problems with it?

What happens if a customer orders more than one item? If John Smith purchases a Chair in addition to his Lamp, we can just add another row to the table, like so.

PurchaseDate  CustomerFirstName  CustomerLastName  CustomerStreetAddress  CustomerCity  CustomerZipCode  ItemPurchased  Quantity  PricePerItem
2/26/2009     John               Smith             123 Elm                Bigg City     48222            Lamp           1         40
2/26/2009     Bill               Jones             456 Maple              Smallville    48333            Chair          2         100
2/26/2009     Mary               Brown             789 Oak                Middleton     48444            Table          1         50
2/26/2009     John               Smith             123 Elm                Bigg City     48222            Chair          1         100
2/27/2009     John               Smith             123 Elm                Bigg City     48222            Table          1         50

But notice that now we are storing John Smith's name and address multiple times.  Assuming John Smith will never change his name, this is a waste of space.  Granted, this isn't very much wasted space when we have only a few orders, but imagine a system with thousands of customers and millions of orders.  Do you really want all that redundant information cluttering up your database?

Also, imagine that we want to correct an error in the spelling of John's name.  With the current model, we must correct that error three times due to the redundant storage.
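
To make the cost of that redundancy concrete, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite, the in-memory database, and the corrected spelling "Jon" are assumptions made only for illustration; the article does not depend on any particular database product.

```python
import sqlite3

# In-memory database, used only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE CustomerPurchase (
        PurchaseDate TEXT,
        CustomerFirstName TEXT,
        CustomerLastName TEXT,
        CustomerStreetAddress TEXT,
        CustomerCity TEXT,
        CustomerZipCode TEXT,
        ItemPurchased TEXT,
        Quantity INTEGER,
        PricePerItem INTEGER
    )
""")

# The five purchases from the table above.
rows = [
    ("2/26/2009", "John", "Smith", "123 Elm", "Bigg City", "48222", "Lamp", 1, 40),
    ("2/26/2009", "Bill", "Jones", "456 Maple", "Smallville", "48333", "Chair", 2, 100),
    ("2/26/2009", "Mary", "Brown", "789 Oak", "Middleton", "48444", "Table", 1, 50),
    ("2/26/2009", "John", "Smith", "123 Elm", "Bigg City", "48222", "Chair", 1, 100),
    ("2/27/2009", "John", "Smith", "123 Elm", "Bigg City", "48222", "Table", 1, 50),
]
conn.executemany("INSERT INTO CustomerPurchase VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?)", rows)

# Correcting the spelling of one customer's name ("Jon" is a made-up correction).
cursor = conn.execute(
    "UPDATE CustomerPurchase SET CustomerFirstName = 'Jon' "
    "WHERE CustomerFirstName = 'John' AND CustomerLastName = 'Smith'"
)
rows_changed = cursor.rowcount
print(rows_changed)  # 3: one logical fix had to touch three rows
```

If any one of those three rows were missed, the table would hold two different spellings for the same customer.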

To address these issues, we can normalize the data.  Data normalization refers to structuring our data in order to remove redundancy. 

In our example, we accomplish this by creating a table of customers with the following structure

  • FirstName
  • LastName
  • StreetAddress
  • City
  • ZipCode

and moving the customer data to this table - one row per customer.

FirstName  LastName  StreetAddress  City        ZipCode
John       Smith     123 Elm        Bigg City   48222
Bill       Jones     456 Maple      Smallville  48333
Mary       Brown     789 Oak        Middleton   48444
 

Then we add an extra column to the Customer table.  This new column is special in that the value in it uniquely identifies each row - in other words, no two rows have the same value.  This unique column goes by many names, but we will call it a Primary Key here.  In this case, the Primary Key column will be named "CustomerID" and will hold an integer.

CustomerID  FirstName  LastName  StreetAddress  City        ZipCode
1           John       Smith     123 Elm        Bigg City   48222
2           Bill       Jones     456 Maple      Smallville  48333
3           Mary       Brown     789 Oak        Middleton   48444
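
As a rough sketch of what "uniquely identify each row" means in practice, the snippet below declares CustomerID as a primary key and then tries to reuse an existing value. It assumes SQLite through Python's sqlite3 module, and the customer "Jane Doe" is invented purely to trigger the error.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,  -- no two rows may share a value
        FirstName TEXT,
        LastName TEXT,
        StreetAddress TEXT,
        City TEXT,
        ZipCode TEXT
    )
""")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "John", "Smith", "123 Elm", "Bigg City", "48222"),
        (2, "Bill", "Jones", "456 Maple", "Smallville", "48333"),
        (3, "Mary", "Brown", "789 Oak", "Middleton", "48444"),
    ],
)

# Hypothetical fourth customer that wrongly reuses CustomerID 1.
try:
    conn.execute(
        "INSERT INTO Customer VALUES (1, 'Jane', 'Doe', '1 Pine', 'Smallville', '48333')"
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    # The database refuses a second row with the same primary key value.
    duplicate_rejected = True

print(duplicate_rejected)  # True
```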
 

Now we can go back to the CustomerPurchase table and replace the columns that describe the customer with a single column that holds the CustomerID.  This replacement column is known as a "Foreign Key".  It references a Primary Key in another table and is used to point to a single unique record in that other table.

PurchaseDate  CustomerID  ItemPurchased  Quantity  PricePerItem
2/26/2009     1           Lamp           1         40
2/26/2009     2           Chair          2         100
2/26/2009     3           Table          1         50
2/26/2009     1           Chair          1         100
2/27/2009     1           Table          1         50
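
The split from one wide table into two can be sketched in a few lines of plain Python. This is only an illustration: it assumes that two rows describe the same customer when their name and address fields match exactly, which is a simplification a real system would not rely on.

```python
# The five rows of the original, un-normalized CustomerPurchase table.
flat_rows = [
    ("2/26/2009", "John", "Smith", "123 Elm", "Bigg City", "48222", "Lamp", 1, 40),
    ("2/26/2009", "Bill", "Jones", "456 Maple", "Smallville", "48333", "Chair", 2, 100),
    ("2/26/2009", "Mary", "Brown", "789 Oak", "Middleton", "48444", "Table", 1, 50),
    ("2/26/2009", "John", "Smith", "123 Elm", "Bigg City", "48222", "Chair", 1, 100),
    ("2/27/2009", "John", "Smith", "123 Elm", "Bigg City", "48222", "Table", 1, 50),
]

customer_ids = {}  # customer details -> CustomerID (the new primary key)
purchases = []     # normalized purchase rows holding only the foreign key

for date, first, last, street, city, zip_code, item, quantity, price in flat_rows:
    customer = (first, last, street, city, zip_code)
    # Assign the next integer ID the first time a customer is seen.
    customer_id = customer_ids.setdefault(customer, len(customer_ids) + 1)
    purchases.append((date, customer_id, item, quantity, price))

# Invert the lookup to get the Customer table keyed by CustomerID.
customers = {cid: details for details, cid in customer_ids.items()}

print(len(customers))  # 3 customers
print(purchases[3])    # ('2/26/2009', 1, 'Chair', 1, 100)
```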
 

This is all we need because, given the CustomerID, we can look in the Customer table, find the record for that customer and get all information about that customer.
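
One way to perform that lookup is a join, which matches each purchase row to its customer row through the shared CustomerID. The sketch below again assumes SQLite via Python's sqlite3 module; the table aliases p and c are just local shorthand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Customer (
        CustomerID INTEGER PRIMARY KEY,
        FirstName TEXT, LastName TEXT,
        StreetAddress TEXT, City TEXT, ZipCode TEXT
    );
    CREATE TABLE CustomerPurchase (
        PurchaseDate TEXT,
        CustomerID INTEGER REFERENCES Customer (CustomerID),
        ItemPurchased TEXT, Quantity INTEGER, PricePerItem INTEGER
    );
""")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, "John", "Smith", "123 Elm", "Bigg City", "48222"),
        (2, "Bill", "Jones", "456 Maple", "Smallville", "48333"),
        (3, "Mary", "Brown", "789 Oak", "Middleton", "48444"),
    ],
)
conn.executemany(
    "INSERT INTO CustomerPurchase VALUES (?, ?, ?, ?, ?)",
    [
        ("2/26/2009", 1, "Lamp", 1, 40),
        ("2/26/2009", 2, "Chair", 2, 100),
        ("2/26/2009", 3, "Table", 1, 50),
        ("2/26/2009", 1, "Chair", 1, 100),
        ("2/27/2009", 1, "Table", 1, 50),
    ],
)

# Follow the foreign key from each purchase to the matching customer row.
john_purchases = conn.execute("""
    SELECT p.PurchaseDate, c.FirstName, c.LastName, p.ItemPurchased
    FROM CustomerPurchase AS p
    JOIN Customer AS c ON c.CustomerID = p.CustomerID
    WHERE p.CustomerID = 1
    ORDER BY p.rowid
""").fetchall()

for row in john_purchases:
    print(row)
# ('2/26/2009', 'John', 'Smith', 'Lamp')
# ('2/26/2009', 'John', 'Smith', 'Chair')
# ('2/27/2009', 'John', 'Smith', 'Table')
```

The name is now stored once, so correcting its spelling means updating a single row in Customer.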

This concept of using a key value to point to a row in another table is known as a relationship.  We say that the Customer table is related to the CustomerPurchase table.

This type of relationship is known as a one-to-many relationship, because each customer may have many purchases.  In this type of relationship, the table holding the single row (here, Customer) is known as the parent table, and the table holding (potentially) many rows (here, CustomerPurchase) is known as the child table.
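
The parent and child roles can be seen in a short sketch, again assuming SQLite through Python's sqlite3 module. SQLite only enforces foreign keys when the foreign_keys pragma is switched on, so the sketch enables it. The Customer table is trimmed to three columns for brevity, and the purchase for CustomerID 99 is a made-up row with no matching parent.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite checks foreign keys only when this pragma is enabled.
conn.execute("PRAGMA foreign_keys = ON")
conn.execute(
    "CREATE TABLE Customer (CustomerID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT)"
)
conn.execute("""
    CREATE TABLE CustomerPurchase (
        PurchaseDate TEXT,
        CustomerID INTEGER NOT NULL REFERENCES Customer (CustomerID),
        ItemPurchased TEXT, Quantity INTEGER, PricePerItem INTEGER
    )
""")
conn.executemany(
    "INSERT INTO Customer VALUES (?, ?, ?)",
    [(1, "John", "Smith"), (2, "Bill", "Jones"), (3, "Mary", "Brown")],
)
conn.executemany(
    "INSERT INTO CustomerPurchase VALUES (?, ?, ?, ?, ?)",
    [
        ("2/26/2009", 1, "Lamp", 1, 40),
        ("2/26/2009", 2, "Chair", 2, 100),
        ("2/26/2009", 3, "Table", 1, 50),
        ("2/26/2009", 1, "Chair", 1, 100),
        ("2/27/2009", 1, "Table", 1, 50),
    ],
)

# One parent row (a customer) maps to many child rows (purchases).
purchases_per_customer = dict(
    conn.execute("SELECT CustomerID, COUNT(*) FROM CustomerPurchase GROUP BY CustomerID")
)
print(purchases_per_customer)  # {1: 3, 2: 1, 3: 1}

# A child row must point at an existing parent; CustomerID 99 does not exist.
try:
    conn.execute("INSERT INTO CustomerPurchase VALUES ('2/28/2009', 99, 'Lamp', 1, 40)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)  # True
```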

This relationship is typically represented by a drawing similar to the one below.

Organizing data in this way can make storage of that data far more efficient and flexible.