# Tuesday, May 10, 2011

The primary goal of creating a chart or graph should always be to accurately represent the data on which that chart is based. In his research on data visualization, Dr. Edward Tufte found numerous charts that misled viewers about the underlying data.
The first example (Figure 3a) is from the annual report of a mining company and illustrates that company's net income during a 5-year stretch.


Figure 3a

The income (or loss) each year is represented by a tall, vertical bar. What is not obvious from this picture is that the company lost $11,014 in 1970. That loss is represented in the picture by a tall bar because the company chose an arbitrary baseline of about -$4 million. Using $0 as the baseline would have made the graph far more credible. It’s difficult for me to imagine any reason the company chose to represent this number in this way, other than to hide the loss and mislead potential investors. And when I notice something like this, I am inclined to doubt every graphic in the report.
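
To see why the baseline matters, here is a minimal sketch of how a bar's drawn height depends on where the baseline sits. The function name `bar_height` is my own, and the baseline is rounded to -$4,000,000 to match the "about -$4 million" estimate above, so treat the output as approximate.

```python
def bar_height(value, baseline=0):
    """Signed height of a bar drawn from `baseline` to `value`."""
    return value - baseline


loss_1970 = -11_014           # the 1970 loss quoted in the text
chart_baseline = -4_000_000   # assumption: "about -$4 million", rounded

# With the chart's baseline, the loss is drawn as a tall bar pointing up.
print(bar_height(loss_1970, chart_baseline))  # 3988986

# With a $0 baseline, the same loss is a short bar pointing down.
print(bar_height(loss_1970))                  # -11014
```

The data point is identical in both cases; only the choice of baseline turns a loss into something that looks like a healthy profit.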

Figure 3b shows a graphic from the New York Times that represents the average automobile fuel economy mandated by the US government each year between 1978 and 1985.


Figure 3b

The mandated miles per gallon increased each year as shown by the numbers along the right side of the drawing. The problem with this picture is that those numbers are represented by horizontal lines and those lines are not nearly proportional to the numbers. For example, the line representing 18 is 0.6 inches long, yet the line representing 27.5 is 5.3 inches long.

Tufte created a formula to quantify this kind of misleading graphic. He called it the Lie Factor. The Lie Factor is the size of the effect shown in the graphic, divided by the size of the effect in the data (Figure 3c).


Figure 3c

In the fuel economy example, the Data Increase is 53%, but the Graphical Increase is 783%, resulting in a Lie Factor of 14.8!


Figure 3d
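
The fuel economy calculation can be reproduced with a few lines of Python. This is only a sketch: the helper names `percent_change` and `lie_factor` are mine, and the inputs are the line lengths and mileage figures quoted above.

```python
def percent_change(first, second):
    """Percentage change going from `first` to `second`."""
    return (second - first) / abs(first) * 100


def lie_factor(graphic_first, graphic_second, data_first, data_second):
    """Size of the effect shown in the graphic / size of the effect in the data."""
    graphic_effect = percent_change(graphic_first, graphic_second)
    data_effect = percent_change(data_first, data_second)
    return graphic_effect / data_effect


data_increase = percent_change(18, 27.5)      # mandated miles per gallon
graphic_increase = percent_change(0.6, 5.3)   # line lengths in inches

print(round(data_increase), round(graphic_increase))  # 53 783
print(round(lie_factor(0.6, 5.3, 18, 27.5), 1))       # 14.8
```

A Lie Factor of 1 means the graphic shows exactly the change that is in the data; the further the result is from 1, the more the picture distorts it.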

Figure 3e below shows a more accurate representation of the fuel economy standards, which increase each year, but at a much less dramatic rate than shown in the NY Times graphic.


Figure 3e

Another way that a chart can distort the underlying data is by attempting to represent 1-dimensional data points with 2-dimensional objects. Figure 3f shows an example of this.


Figure 3f

This figure shows that the percentage of doctors devoted to Family Practice dropped from 27% in 1984 to 12% in 1990. The number on the far left (27%) is a little over double the number on the far right (12%), so the picture of the doctor on the left is a little more than twice as tall as the doctor on the right. The problem is that these doctors have width in addition to height, and the area of each doctor is proportional to both its width and its height. So the doctor on the left covers far more than twice the area of the doctor on the right. Reading from right to left, the data increases by 125%, but the picture increases by 406%, which is a Lie Factor of 406 / 125 ≈ 3.25!
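
Here is a rough sketch of where those percentages come from. It assumes the picture is scaled by the same factor in width as in height, which is my assumption about how the drawing was made; the helper name `percent_larger` is also mine.

```python
def percent_larger(ratio):
    """How much larger one quantity is than another, given their ratio."""
    return (ratio - 1) * 100


height_ratio = 27 / 12           # left doctor vs. right doctor (same as the data)
area_ratio = height_ratio ** 2   # assumption: width scales along with height

data_effect = percent_larger(height_ratio)    # 125.0
picture_effect = percent_larger(area_ratio)   # 406.25

print(data_effect, picture_effect)            # 125.0 406.25
print(picture_effect / data_effect)           # 3.25
```

Squaring the ratio is what turns a 125% difference in the data into a 406% difference on the page.
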
This problem is magnified when we try to represent 1-dimensional data with 3-dimensional drawings. In Figure 3g, each data point (the price of a barrel of oil in a given year) is represented by a picture of a barrel of oil.


Figure 3g

If we just looked at this as a 2D drawing, the Lie Factor would be about 9. But the metaphor presented by a 3D barrel causes the viewer to think about the volume of each barrel. The capacity of the 1979 barrel is 27,000% more than that of the 1973 barrel, even though the price increased by only 554% during that time: a Lie Factor of 27,000 / 554 ≈ 48.7!

Figure 3g has one other problem. The dollars are presented in Nominal Dollars, that is, dollars that have not been adjusted for inflation. However, a dollar in 1979 was not nearly as valuable as a dollar in 1973. The data would be more realistic if it were presented in Real Dollars (dollars adjusted for inflation). Figure 3h is from the London Evening Times and shows similar data, but presents it in both Real Dollars and Nominal Dollars. You can see that the difference between the two lines is significant.


Figure 3h

In general, if you present monetary data across an extended period of time, you should adjust the monetary units for inflation during that time.
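
As a sketch of what that adjustment looks like, the snippet below restates nominal amounts in base-year dollars using a price index. The function name `to_real` is mine, and the index and price values are made up purely for illustration; they are not actual oil prices or inflation figures.

```python
def to_real(nominal, index_then, index_base):
    """Restate a nominal amount in base-year dollars using a price index."""
    return nominal * index_base / index_then


# Hypothetical values for illustration only (not real prices or index data).
price_index = {"year_1": 100.0, "year_2": 150.0}
nominal_price = {"year_1": 10.0, "year_2": 30.0}

base_index = price_index["year_1"]  # express everything in year_1 dollars
real_price = {
    year: to_real(nominal_price[year], price_index[year], base_index)
    for year in nominal_price
}

print(real_price)  # {'year_1': 10.0, 'year_2': 20.0}
```

In this made-up example the nominal price triples, but once inflation is removed the real price only doubles.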

Figure 3i shows another way to mislead viewers.


Figure 3i

This bar chart shows commissions paid to travel agents by 4 different airlines during 3 consecutive periods. We can see that those commissions increased slightly from period 1 to period 2 and dropped significantly in period 3 for all 4 airlines. However, it is not at all obvious from this graph that period 3 is only 6 months long, while periods 1 and 2 are each 12 months long. It would be shocking if payments did not drop in the abbreviated period 3! This graph would be more accurate if the same time span were used for all periods, either by annualizing period 3 or by splitting the other periods into 6-month increments.
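
Annualizing is simple arithmetic. The sketch below uses a helper I am calling `annualize` and made-up commission amounts (the chart's actual figures are not given here), so only the method carries over.

```python
def annualize(amount, months):
    """Scale an amount covering `months` months to a 12-month equivalent."""
    return amount * 12 / months


# Hypothetical commissions for one airline: (label, amount, months covered).
periods = [
    ("Period 1", 120.0, 12),
    ("Period 2", 130.0, 12),
    ("Period 3", 70.0, 6),   # only half a year of payments
]

for label, amount, months in periods:
    print(label, annualize(amount, months))
# Period 1 120.0
# Period 2 130.0
# Period 3 140.0
```

With these made-up numbers, the apparent drop in period 3 turns into an increase once every period covers the same 12 months.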

Takeaways

The key takeaways of Graphical Integrity are:
•    Make sure that images are in proportion to the data they represent
•    #Dimensions in graph = #Dimensions in data
•    Use Real dollars (adjusted for inflation), instead of Nominal dollars


This is an ongoing series discussing the research of Dr. Edward Tufte on Data Visualization.

Tuesday, May 10, 2011 3:12:00 PM (GMT Daylight Time, UTC+01:00)