Thursday, May 12, 2011

All graphs, charts and other data visualization pictures consist of “ink”. At one time, “ink” meant physical ink, because all charts were printed on paper. Now, we can think of ink as anything drawn on either paper or the screen, even if that drawing is never printed.

Data-Ink is that part of the ink that represents the actual data. Another way to think of data ink is: the ink that, if we erased it, would reduce the amount of information in the graphic.

So, if only some of the ink represents data, what is the rest of the ink? The rest of the ink is taken up with metadata, redundant data, and decorations.
Generally, the more data-ink in a graphic, the more effective that graphic will be. Tufte defines the “Data-Ink Ratio” as [The Amount of Data-Ink in a graphic] divided by [The total Ink in the Graphic]. When creating charts and graphics, our goal should be to maximize the Data-Ink Ratio, within reason.
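
As a rough illustration of the ratio defined above, here is a minimal Python sketch. The function name `data_ink_ratio` and the ink measurements are hypothetical; in practice you might estimate ink by counting non-background pixels, which is an assumption of this example and not something Tufte prescribes.

```python
def data_ink_ratio(data_ink: float, total_ink: float) -> float:
    """Return the share of a graphic's ink that encodes data (0.0 to 1.0)."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical ink measurements (for example, counts of non-background pixels).
# The data ink stays the same; only the non-data ink is erased.
before = data_ink_ratio(data_ink=1_200, total_ink=6_000)  # heavy gridlines and borders
after = data_ink_ratio(data_ink=1_200, total_ink=2_000)   # gridlines and borders removed

print(f"before: {before:.2f}, after: {after:.2f}")
```

Erasing non-data ink shrinks the denominator while the numerator stays fixed, so the ratio goes up.
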
Consider the single data point represented by a bar chart in Figure 5a.

Figure 5a

The value of that point is represented by the following:
•    The height of the vertical line along the left side of the bar;
•    The height of the vertical line along the right side of the bar;
•    The height of the horizontal line along the top of the bar;
•    The height of the colored area within the bar;
•    The height of the number atop the bar; and
•    The value of the number atop the bar.

Six different elements in this graph all represent the same value – a tremendous amount of redundant data. This graph has a very low Data-Ink Ratio.

The problem is even worse if we make the bar chart 3-dimensional as in Figure 5b.

Figure 5b

Let’s look at an example of a graph with a low Data-Ink Ratio and try to fix it. Figure 5c reports some linear data points on a surface that looks like graph paper.

Figure 5c

In this figure, the dark gridlines compete with data points for the viewer’s attention. We can eliminate some of these gridlines and lighten the others to increase the Data-Ink Ratio and make the data more obvious.

Figure 5d

Spreadsheet makers discovered this a long time ago when they decided to lighten the borders between cells in order to make these borders (metadata) less obvious than the numbers inside the cells (data). In the case of this graph, we probably don’t need gridlines at all. Eliminating them entirely (Figure 5e) increases the Data-Ink Ratio further with no loss of information.

Figure 5e

If we look around the remaining parts of the graph, we can find more non-Data-Ink that is a candidate for elimination. The top and right borders certainly don’t provide any information. And the axes are just as readable if we eliminate half the numbers.

Figure 5f

Figure 5g shows a graph by chemist Linus Pauling, mapping the Atomic Number and Atomic Volume of a number of different elements.

Figure 5g

Pauling has removed the gridlines, but he has left in the grid intersections – tiny crosses that distract from the data. We can safely eliminate these crosses to increase the Data-Ink Ratio and make the graph more readable (Figure 5h).

Figure 5h

One could argue that the dashed lines between the data points are metadata and that removing them would increase the Data-Ink Ratio. However, if we do so (Figure 5i), the graph becomes less clear, because the lines help group together elements in the same Periodic Table row.

Figure 5i

This is why our goal is to increase the Data-Ink Ratio, within reason. Sometimes it is necessary to add back some non-Data-Ink in order to enhance the graph.
Figure 5j shows another example when redundant data can enhance a graph’s readability.

Figure 5j

The top picture is the train schedule from part 2 of this series. Notice that some of the diagonal lines stop at the right edge and continue from the left edge of the chart. These are scheduled train rides that leave a station before 6AM, but don’t arrive at a destination until after 6AM. In the bottom picture, I have copied the first 12 hours of the chart and pasted it on the right, ensuring that every route line appears at least once without a break.

Figure 5k

Now, if we could just get rid of those gridlines…

This is an ongoing series discussing the research of Dr. Edward Tufte on Data Visualization.

Thursday, May 12, 2011 1:30:00 PM (GMT Daylight Time, UTC+01:00)
Wednesday, May 11, 2011

Unless a graph provides context, it can fail to give a complete picture of the data it represents. For example, Figure 4a shows the deaths due to traffic accidents in Connecticut in 1955 and 1956.

Figure 4a

These periods were chosen because the state of Connecticut decided to increase its enforcement of speed limits. From the graph, it appears that this increased enforcement saved about 40 lives. However, it’s not possible to draw this conclusion because we don’t know what happened prior to 1955 or after 1956. Were traffic deaths in Connecticut already on the increase before the increased enforcement? Did deaths go up again in the years following 1956? The graph during the rest of the decade could have looked like any of the following:

Figure 4b

In fact, the graph looked a lot like Figure 4c, which shows traffic deaths on the rise prior to 1955 and continuing to fall after 1956.

Figure 4c

Figure 4d shows even more context for the data.

Figure 4d

In this graph, we see the number of deaths per 100,000 for the entire decade for each state contiguous to Connecticut. While traffic deaths in New York, Massachusetts, and Rhode Island tended to increase or remain steady after 1956, Connecticut’s traffic death rate went down. This context provides strong evidence that Connecticut’s speeding enforcement was effective in its goal of saving lives.

To maximize the meaning supplied by a data graphic, always provide context for that data.


Wednesday, May 11, 2011 1:10:00 PM (GMT Daylight Time, UTC+01:00)
Tuesday, May 10, 2011

The primary goal of creating a chart or graph should always be to accurately represent the data on which that chart is based. In his research on data visualization, Dr. Edward Tufte found numerous charts that misled viewers about the underlying data.
The first example is from the annual report of a mining company and it illustrates net income of that company during a 5-year stretch.

Figure 3a

The income (or loss) each year is represented by a tall, vertical bar. What is not obvious from this picture is that the company lost \$11,014 in 1970. That loss is represented in the picture by a tall bar because the company chose an arbitrary baseline of about -\$4 million. Showing \$0 on the graph would have made it far more credible. It’s difficult for me to imagine any reason the company chose to represent this number in this way, other than to hide the loss and mislead potential investors. And once I notice something like this, I am inclined to doubt every graphic in the report.

Figure 3b shows a graphic from the New York Times that represents the average automobile fuel economy mandated by the US government each year between 1978 and 1985.

Figure 3b

The mandated miles per gallon increased each year as shown by the numbers along the right side of the drawing. The problem with this picture is that those numbers are represented by horizontal lines and those lines are not nearly proportional to the numbers. For example, the line representing 18 is 0.6 inches long, yet the line representing 27.5 is 5.3 inches long.

Tufte created a formula to quantify this kind of misleading graphic. He called it the Lie Factor. The Lie Factor is the size of the effect shown in the graphic divided by the size of the effect in the data (Figure 3c).

Figure 3c

In the fuel economy example, the Data Increase is 53%, but the Graphical Increase is 783%, resulting in a Lie Factor of 14.8!
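
The arithmetic behind those numbers can be reproduced with a short Python sketch. The function names `percent_change` and `lie_factor` are my own; the inputs are the line lengths and mileage figures quoted above.

```python
def percent_change(first: float, second: float) -> float:
    """Percentage change from the first value to the second."""
    if first == 0:
        raise ValueError("first value must be non-zero")
    return (second - first) / abs(first) * 100


def lie_factor(graphic_first: float, graphic_second: float,
               data_first: float, data_second: float) -> float:
    """Size of the effect shown in the graphic divided by the size of the effect in the data."""
    data_effect = percent_change(data_first, data_second)
    if data_effect == 0:
        raise ValueError("the data shows no effect, so the Lie Factor is undefined")
    return percent_change(graphic_first, graphic_second) / data_effect


data_increase = percent_change(18.0, 27.5)    # miles per gallon
graphic_increase = percent_change(0.6, 5.3)   # line lengths in inches
factor = lie_factor(0.6, 5.3, 18.0, 27.5)

print(f"data: {data_increase:.0f}%, graphic: {graphic_increase:.0f}%, lie factor: {factor:.1f}")
# prints "data: 53%, graphic: 783%, lie factor: 14.8"
```

A Lie Factor close to 1 means the graphic shows the effect at roughly its true size.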

Figure 3d

Figure 3e below shows a more accurate representation of the fuel economy standards, which increase each year, but at a much less dramatic rate than shown in the NY Times graphic.

Figure 3e

Another way that a chart can distort the underlying data is by attempting to represent 1-dimensional data points with 2-dimensional objects. Figure 3f shows an example of this.

Figure 3f

This figure shows that the percentage of doctors devoted to Family Practice dropped from 27% in 1984 to 12% in 1990. The number on the far left (27%) is a little over double the number on the far right (12%), so the picture of the doctor on the left is a little more than twice as tall as the doctor on the right. The problem is that these doctors have width in addition to height, so the size of each doctor is proportional to the product of its width and its height. As a result, the doctor on the left is far more than twice the size of the doctor on the right. Measured from the smaller value, the data differ by 125%, but the pictures differ by 406%, which is a Lie Factor of 406/125 ≈ 3.2.
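
To see where the 406% comes from, here is a small Python sketch. It assumes the picture is scaled by the same factor along every dimension, which is my simplification; the function name `visual_increase_percent` is hypothetical.

```python
def visual_increase_percent(small: float, large: float, dimensions: int) -> float:
    """Percent increase in apparent size when each of `dimensions` axes is scaled by large/small."""
    if small <= 0 or large <= 0:
        raise ValueError("values must be positive")
    scale = large / small
    return (scale ** dimensions - 1) * 100


small, large = 12.0, 27.0  # percentages of doctors in Family Practice

data_increase = visual_increase_percent(small, large, dimensions=1)    # height only
area_increase = visual_increase_percent(small, large, dimensions=2)    # height and width
volume_increase = visual_increase_percent(small, large, dimensions=3)  # a solid object

for label, shown in [("1D", data_increase), ("2D", area_increase), ("3D", volume_increase)]:
    print(f"{label}: {shown:.0f}% increase, lie factor {shown / data_increase:.2f}")
```

Each extra dimension multiplies the apparent size by the scale factor again, so the distortion grows quickly.
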
This problem is magnified when we try to represent 1-dimensional data with 3-dimensional drawings. In Figure 3g, each data point (the price of a barrel of oil in a given year) is represented by a picture of a barrel of oil.

Figure 3g

If we just looked at this as a 2D drawing, the Lie Factor would be about 9. But the metaphor presented by a 3D barrel causes the viewer to think about the capacity of each barrel. The capacity of the 1979 barrel is 27,000% more than the 1973 barrel, even though the price only increased by 554% during that time – a Lie Factor of 27,000 / 554 ≈ 48.7!

Figure 3g has one other problem. The dollars are presented in Nominal Dollars – that is, dollars that have not been adjusted for inflation. However, a dollar in 1979 was not nearly as valuable as a dollar in 1973. The data would be more realistic if it were presented in Real Dollars – dollars adjusted for inflation. Figure 3h is from the London Evening Times and shows similar data, but presents it with both Real Dollars and Nominal Dollars. You can see that the difference between the two lines is significant.

Figure 3h

In general, if you present monetary data across an extended period of time, you should adjust the monetary units for inflation during that time.
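
One common way to make that adjustment is to rescale each nominal amount by a price index. The Python sketch below assumes you have such an index for every year. The function name `to_real_dollars`, the index values and the prices are all made up for illustration and are not real economic data.

```python
def to_real_dollars(nominal: float, index_in_year: float, index_in_base_year: float) -> float:
    """Express a nominal amount in the dollars of the base year."""
    if index_in_year <= 0 or index_in_base_year <= 0:
        raise ValueError("price index values must be positive")
    return nominal * index_in_base_year / index_in_year


# Hypothetical price index and nominal prices (not real data).
price_index = {"Year 1": 100.0, "Year 2": 125.0, "Year 3": 160.0}
nominal_prices = {"Year 1": 10.0, "Year 2": 15.0, "Year 3": 24.0}
base_year = "Year 1"

real_prices = {
    year: to_real_dollars(price, price_index[year], price_index[base_year])
    for year, price in nominal_prices.items()
}

for year in nominal_prices:
    print(f"{year}: nominal {nominal_prices[year]:.2f}, real {real_prices[year]:.2f}")
```

In this made-up series the nominal price rises by 140%, while the real price rises by 50%.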

Figure 3i shows another way to mislead viewers.

Figure 3i

This bar chart shows commissions paid to travel agents by 4 different airlines during 3 consecutive periods. We can see that those commissions increased slightly from period 1 to period 2 and dropped significantly in period 3 for all 4 airlines. However, it is not at all obvious from this graph that period 3 is only 6 months long, while periods 1 and 2 are each 12 months long. It would be shocking if payments did not drop in the abbreviated period 3! This graph would be more accurate if the same units were used for all periods – either by annualizing Period 3 or by splitting the other periods into 6-month increments.
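
Annualizing is a simple rescaling, sketched below in Python. The period lengths match the chart described above, but the amounts and the function name `annualize` are hypothetical. Note that annualizing assumes the partial period is representative of the full year, which may not hold for seasonal data.

```python
def annualize(amount: float, months_covered: int) -> float:
    """Scale an amount reported over `months_covered` months to a 12-month equivalent."""
    if months_covered <= 0:
        raise ValueError("months_covered must be positive")
    return amount * 12 / months_covered


# (period name, months covered, amount paid) with hypothetical amounts.
periods = [
    ("Period 1", 12, 100.0),
    ("Period 2", 12, 110.0),
    ("Period 3", 6, 60.0),
]

annualized = {name: annualize(amount, months) for name, months, amount in periods}

for name, months, amount in periods:
    print(f"{name}: reported {amount:.0f} over {months} months, annualized {annualized[name]:.0f}")
```

With these made-up numbers, the reported amount for Period 3 looks like a sharp drop, yet the annualized amount is higher than in either earlier period.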

### Takeaways

The key takeaways of Graphical Integrity are:
•    Make sure that images are in proportion to the data they represent
•    #Dimensions in graph = #Dimensions in data
•    Use Real dollars instead of Nominal dollars


Tuesday, May 10, 2011 3:12:00 PM (GMT Daylight Time, UTC+01:00)
Monday, May 9, 2011

Episode 155

Monday, May 9, 2011 3:45:00 PM (GMT Daylight Time, UTC+01:00)
Friday, May 6, 2011

Figure 2a is a hand-drawn graph created by the French engineer Ibry in 1885. It represents a schedule of train trips in France.

Figure 2a

The times are listed along the top and bottom (x-axis), and the train stations are listed along the left side (y-axis). Each train route is represented by a diagonal line. The left end point of the diagonal line represents the departure of that train with the departure station on the y-axis and the departure time on the x-axis. The right endpoint of the diagonal line tells us when and where the train arrives at its destination. Using this graph, it’s not difficult to find the schedule of all trains leaving a given station each day. For example, in Figure 2b, I’ve highlighted one train trip that leaves Paris shortly after noon and arrives in Tonnerre around 6PM.

Figure 2b

Figure 2c is a chart created by the statistician William Playfair.

Figure 2c

The strength of this graph is that it displays 3 series of data on the same time scale, covering about 4 centuries: the average wages in England, the average price of wheat in England each decade, and the reign of each monarch. Presenting multiple series like this allows the viewer to quickly spot possible correlations between the series.

A map can be an effective data presentation tool, as evidenced by Figure 2d, which shows economic data from the 1960 census.

Figure 2d

Each map shows every county in the United States. The top map shows the concentration of very poor families in each county and the bottom map shows the concentration of very rich families. High percentages are represented by very dark shading, low percentages are represented by very light shading and the percentage of shading increases regularly with the increase of percentage. A map such as this aggregates millions of data points. Because it is so intuitive, the viewer can quickly form observations (lots of poor families in the southeastern US in 1960) and ask questions (why are there so many rich families and poor families in central Alaska?)

No discussion of historical graphical excellence would be complete without Minard’s diagram shown in Figure 2e.

Figure 2e

Tufte described this drawing – which shows Napoleon’s advance to and retreat from Moscow in the winter of 1812-1813 – as “the best statistical graph ever”. The tan line represents Napoleon’s march from the Polish-Russian border on the left to Moscow on the right, while the black line below it represents his retreat back into Poland. The width of each line represents the size of Napoleon’s army. From this information alone, we can see the disaster of this campaign – Napoleon entered Russia with 400,000 troops but arrived in a deserted Moscow with only 100,000 men. By the time he left Russia months later, he had barely 10,000 men. The retreating line is tied to a graph below showing the time and temperature during the march. The extreme cold undoubtedly was a factor in the decimation of this army. With a minimal amount of ink, this chart shows army size, location, direction of movement, time, and temperature – a startling amount of information.

In this article, we looked at some historical charts, graphs and maps that visualize data in a way that is more meaningful and more quickly grasped by the viewer than the raw data they represent. In the next section, we will explore some common problems with visualizations.


Friday, May 6, 2011 1:49:00 PM (GMT Daylight Time, UTC+01:00)
Thursday, May 5, 2011

Look at the four series of data below.

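| I: x | I: y | II: x | II: y | III: x | III: y | IV: x | IV: y |
|------|------|-------|-------|--------|--------|-------|-------|
| 10 | 8.04 | 10 | 9.14 | 10 | 7.46 | 8 | 6.58 |
| 8 | 6.95 | 8 | 8.14 | 8 | 6.77 | 8 | 5.76 |
| 13 | 7.58 | 13 | 8.74 | 13 | 12.74 | 8 | 7.71 |
| 9 | 8.81 | 9 | 8.77 | 9 | 7.11 | 8 | 8.84 |
| 11 | 8.33 | 11 | 9.26 | 11 | 7.81 | 8 | 8.47 |
| 14 | 9.96 | 14 | 8.1 | 14 | 8.84 | 8 | 7.04 |
| 6 | 7.24 | 6 | 6.13 | 6 | 6.08 | 8 | 5.25 |
| 4 | 4.26 | 4 | 3.1 | 4 | 5.39 | 19 | 12.5 |
| 12 | 10.84 | 12 | 9.13 | 12 | 8.15 | 8 | 5.59 |
| 7 | 4.82 | 7 | 7.26 | 7 | 6.42 | 8 | 7.91 |
| 5 | 5.68 | 5 | 4.74 | 5 | 5.72 | 8 | 6.89 |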

Is there a pattern to the data in each series? How do the series relate to one another? It’s difficult to answer these questions looking only at the raw data.

However, if we display the data as 4 scatter graphs on the same page (Figure 1a), we can quickly see the pattern in each series and we can use that pattern to predict the next value in the series. We can also see outliers in series III and IV and ask questions about why those outliers occur.

[Figure 1a]

Figure 1a is a good representation of the data because it allows us to understand the data quickly and easily and because it answers questions and sparks follow-up questions about the data.
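
These four series are often called Anscombe's quartet. The post above does not say this, but a well-known property of the quartet is that the usual summary statistics barely distinguish the four series, which is one more reason to plot them. The Python sketch below uses the values from the table as printed; the helper `least_squares` is my own name for an ordinary least-squares line fit.

```python
from statistics import mean

x_shared = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # series I, II and III share the same x values
series = {
    "I": (x_shared, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x_shared, [9.14, 8.14, 8.74, 8.77, 9.26, 8.1, 6.13, 3.1, 9.13, 7.26, 4.74]),
    "III": (x_shared, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.72]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.59, 7.91, 6.89]),
}


def least_squares(xs, ys):
    """Return (slope, intercept) of the ordinary least-squares line through the points."""
    mean_x, mean_y = mean(xs), mean(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


summary = {}
for name, (xs, ys) in series.items():
    slope, intercept = least_squares(xs, ys)
    summary[name] = (mean(xs), mean(ys), slope, intercept)
    print(f"{name}: mean x={mean(xs):.2f}, mean y={mean(ys):.2f}, "
          f"fit y={intercept:.2f}+{slope:.2f}x")
```

The printed means and fitted lines come out nearly the same for all four series, even though the scatter graphs look nothing alike.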

As a software developer, I spend a lot of time writing software to maintain data. Many tools and training resources help us store, update and retrieve data. But few people talk about the best way to present data in a meaningful way.

Professor Edward Tufte of Yale University is one person who is doing research in this area and writing about it. Tufte studied graphical representations of data to find out what makes an excellent visualization and what problems occur in data visualization. He has written several books on the topic, describing guidelines to follow and common traps to avoid. In my opinion, his best book on this subject is The Visual Display of Quantitative Information (ISBN 0961392142).

This series will review Dr. Tufte’s research, ideas and conclusions on Data Visualization.

Over the next couple weeks, I’ll explore excellent charts created throughout history and identify what makes them so excellent; graphs that lack integrity and serve to mislead the viewer; and some guidelines that Dr. Tufte suggests for improving data visualization.


Thursday, May 5, 2011 4:31:00 PM (GMT Daylight Time, UTC+01:00)
Wednesday, May 4, 2011

The Kalamazoo X conference isn’t like other conferences. Although it is targeted at technical people and the audience is filled with software developers, the content presented is typically not technical. Instead, sessions highlight soft skills, such as team building and education.

Another major difference between Kalamazoo X and other conferences is the session format: The length of each presentation is limited to 30 minutes – much shorter than the 60-90 minute presentations seen at most technical conferences. This serves to keep the audience focused. It’s rare to see any audience member get up out of his or her chair and walk out of a session, partly because they will miss a significant part of it and partly because the session is always close to the end.

The final major difference is that Kalamazoo X offers only one track. This gives all attendees a shared experience that they can discuss and compare afterwards. No one has to choose between sessions or feel they are missing something.

This year’s conference took place last Saturday at Kalamazoo Valley Community College and featured something for everyone. Nine speakers delivered ten presentations and the day ended with a panel discussion on Interviewing. A fishbowl exercise during lunch got the crowd excited. Five chairs were placed in the middle of the room and a topic was thrown out. The ground rules of the fishbowl were: you must be seated in one of the chairs in order to ask a question, and one chair must always be empty. Attendees entered and exited the fishbowl area frequently and the conversation grew excited as ideas fired back and forth.

Kalamazoo X is the brainchild of Michael Eaton, who envisioned a conference that would fill gaps he saw in the education of software developers. Technical information is readily available to technical people from a variety of venues, but soft skill training is much rarer, and this lack of training often shows up in the lack of soft skills displayed by the developer community.

Kalamazoo X is now in its third year. I have attended all three – including the one last Saturday. I have spoken at two of them. Each time, the success was evident – The room was full, the content was excellent, and the atmosphere was electric. I’ve learned about leadership from Jim Holmes, about Community from Mike Wood and Brian Prince, about self-promotion from Jeff Blankenburg, and about life from Leon Gersing.

Photos from 2011 Kalamazoo X

Photos from 2010 Kalamazoo X

Wednesday, May 4, 2011 3:20:00 PM (GMT Daylight Time, UTC+01:00)
Monday, May 2, 2011

Monday, May 2, 2011 3:48:00 PM (GMT Daylight Time, UTC+01:00)
Saturday, April 30, 2011

Below are slides from the Data Visualization talk I delivered at the Kalamazoo X conference today

Saturday, April 30, 2011 3:34:07 PM (GMT Daylight Time, UTC+01:00)
Monday, April 25, 2011

Monday, April 25, 2011 3:45:00 PM (GMT Daylight Time, UTC+01:00)