Thursday, May 12, 2011

All graphs, charts and other data visualization pictures consist of “ink”. At one time, “ink” referred to physical ink because at one time all charts were printed on paper. Now, we can think of ink as anything drawn on either paper or the screen, even if that drawing is never printed to a sheet of paper.

Data-Ink is that part of the ink that represents the actual data. Another way to think of data ink is: the ink that, if we erased it, would reduce the amount of information in the graphic.

So, if only some of the ink represents data, what is the rest of the ink? The rest of the ink is taken up with metadata, redundant data, and decorations.
Generally, the more data-ink in a graphic, the more effective that graphic will be. Tufte defines the “Data-Ink Ratio” as [The Amount of Data-Ink in a graphic] divided by [The total Ink in the Graphic]. When creating charts and graphics, our goal should be to maximize the Data-Ink Ratio, within reason.
Consider the single data point represented by a bar chart in Figure 5a.

Figure 5a

The value of that point is represented by the following
•    The height of the vertical line along the left side of the bar;
•    The height of the vertical line along the right side of the bar;
•    The height of the horizontal line along the top of the bar;
•    The height of the the colored area within the bar;
•    The height of the number atop the bar; and
•    The value of the number atop the bar.

Six different elements in this graph all represent the same value – a tremendous amount of redundant data. This graph has a very low Data-Ink Ratio.

The problem is even worse if we make the bar chart 3-dimensional as in Figure 5b.

Figure 5b

Let’s look at an example of a graph with a low Data-Ink Ratio and try to fix it. Figure 5c reports some linear data points on a surface that looks like graph paper.

Figure 5c

In this figure, the dark gridlines compete with data points for the viewer’s attention. We can eliminate some of these gridlines and lighten the others to reduce the Data-Ink Ratio and make the data more obvious.

Figure 5d

Spreadsheet makers discovered this a long time ago when they decided to lighten the borders between cells in order to make these borders (metadata) less obvious than the numbers inside the cells (data). In the case of this graph, we probably don’t need gridlines at all. Eliminating them entirely (Figure 5e) reduces the Data-Ink Ratio with no further loss of information.

Figure 5e

If we look around the remaining parts of the graph, we can find more non-Data-Ink that is a candidate for elimination. The top and right borders certainly don’t provide any information. And the axes are just as readable if we eliminate half the numbers.

Figure 5f

Figure 5g shows a graph by chemist Linus Pauling, mapping the Atomic Number and Atomic Volume of a number of different elements.

Figure 5g

Pauling has removed the gridlines, but he has left in the grid intersections – tiny crosses that distract from the data. We can safely eliminate these crosses to reduce the Data-Ink Ratio and make the graph more readable (Figure 5h)

Figure 5h

One could argue that the dashed lines between the data points are metadata and that removing them would increase the Data-Ink Ratio. However, if we do so (Figure 5i), the graph becomes less clear, because the lines help group together elements in the same Periodic Table row.

Figure 5i

This is why our goal is to increase the Data-Ink Ratio, within reason. Sometimes it is necessary to add back some non-Data-Ink in order to enhance the graph.
Figure 5j shows another example when redundant data can enhance a graph’s readability.

Figure 5j

The top picture is the train schedule from part 2 of this series. Notice that some of the diagonal lines stop at the right edge and continue from the left edge of the chart. These are scheduled train rides that leave a station before 6AM, but don’t arrive at a destination until after 6AM. In the bottom picture, I have copied the first 12 hours of the chart and pasted it on the right, ensuring that every route line appears at least without a break.

Figure 5k

Now, if we could just get rid of those gridlines…

This is an ongoing series discussing the research of Dr. Edward Tufte on Data Visualization.

Thursday, May 12, 2011 1:30:00 PM (GMT Daylight Time, UTC+01:00)