data science

One of the most commonly used charts for data visualization is the **bar chart**, which encodes each number as the height of a bar and is typically used to compare the relative size of two or more values. In the example below, the red bar is twice as tall as the blue bar, accurately reflecting the relationship between the numbers 40 and 20.

This type of visualization makes sense when the y-axis starts at 0, because the height of each bar is then proportional to the value it represents. However, by changing the starting value of the y-axis, we can skew the interpretation of the chart. For example, in the chart below, the red bar appears to be more than twice as tall as the blue bar, which misrepresents the data. We achieve this by starting the y-axis at 15. Although the data values are shown, we tend to focus on the visuals before we process the numbers, and therefore draw incorrect conclusions.
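The size of the distortion can be worked out directly. The sketch below is a minimal illustration, assuming the truncated chart uses the same values as the first example (40 and 20); the function name `apparent_ratio` is hypothetical and introduced only for this example. It computes how many times taller one bar is drawn than the other for a given y-axis starting value.

```python
def apparent_ratio(a: float, b: float, baseline: float = 0.0) -> float:
    """Ratio of the drawn bar heights for values a and b
    when the y-axis starts at `baseline` instead of 0."""
    if baseline >= min(a, b):
        raise ValueError("baseline must be below both values")
    # A bar is only drawn from the baseline up to its value.
    return (a - baseline) / (b - baseline)


red, blue = 40, 20  # values from the first example

print(apparent_ratio(red, blue))      # 2.0 -> matches the data (40 / 20)
print(apparent_ratio(red, blue, 15))  # 5.0 -> red bar is drawn five times as tall
```

Under that assumption, moving the baseline from 0 to 15 turns a true 2:1 relationship into a visual 5:1.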

Misleading examples like the one above are frequently found in the real world. Below, you’ll see a chart that compares actual Obamacare enrollments with the established goal. I added three red arrows to the graph to show that the second bar is almost three times as tall as the first.

However, when we read the actual numbers (shown in the chart below), we learn that the goal of 7,066,000 enrollments is only about 18% larger than the 6,000,000 enrollments as of March 27th. We generally don’t do this mental calculation, because working out a percentage takes far more effort than comparing the relative heights of the bars. In this case, the heights lead us to believe that the gap between the two is much larger than it really is.
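The percentage is easy to verify with a few lines of arithmetic. The sketch below uses the two enrollment figures quoted above; the helper `implied_baseline` is a hypothetical name, and treating the "almost three times" visual ratio as exactly 3 is an assumption made only for illustration, since the original chart did not show its y-axis.

```python
goal = 7_066_000      # enrollment goal shown on the chart
enrolled = 6_000_000  # enrollments as of March 27th

# How much larger the goal is than the actual figure, in percent.
pct_larger = (goal - enrolled) / enrolled * 100
print(f"{pct_larger:.1f}%")  # 17.8%


def implied_baseline(high: float, low: float, visual_ratio: float) -> float:
    """Y-axis starting value at which `high` is drawn `visual_ratio`
    times as tall as `low`. Solves (high - b) / (low - b) = visual_ratio."""
    if visual_ratio <= 1:
        raise ValueError("visual_ratio must be greater than 1")
    return (visual_ratio * low - high) / (visual_ratio - 1)


# Assumption: the goal bar looks exactly 3x as tall as the enrollment bar.
print(implied_baseline(goal, enrolled, 3))  # 5467000.0
```

In other words, a gap of under 18% can be made to look like a threefold difference if the bars effectively start somewhere around 5.5 million rather than at 0.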

The chart was eventually corrected, as shown below, so that both bars start at 0.

Let’s look at another example. The 2013 presidential election in Venezuela had two main candidates, Nicolás Maduro (on the left) and Henrique Capriles (on the right). Maduro won the election, which is decided by popular vote. What percentage of the popular vote would you say each candidate obtained?

Based on the image above, we would guess that Nicolás Maduro won by a landslide. However, the percentage of votes was also displayed on the graph (as shown below), and we can see that this was a very close race, with a difference of only 1.59 percentage points.

Although the actual numbers of the results were presented, the heights of the bars do not match the data, and our brains are wired to focus on images over text. The y-axis, although not shown, does not start at 0, which provides the false impression that the difference between the percentage of votes obtained by each candidate was much larger. A more appropriate image for the election data with a y-axis starting at zero is presented below.

In this case, the heights of the bars show a much tighter race and match the data. In summary, we should be careful to check whether the y-axis of a bar chart starts at 0; otherwise, we can be fooled into drawing the wrong conclusions about our data.
