This post was written by Alex Nathan, Co-Founder of Aventrix Analytics and part-time Metis Corporate Training Instructor.
A retail company is looking to open a new store. Given its recent investments in analytics, Eve, the director of data science, decides to provide two members of her team (Bob and Alice) with every piece of available information – such as sales and demographics on all current stores – and asks them to come up with insights on the best location for the new store.
Bob and Alice independently tackle the problem for a few weeks, and after intense amounts of data cleaning, pre-processing, and modeling, they present their findings to Eve. To her surprise, Bob and Alice came up with very different recommendations. How is this possible given that both data scientists had access to the exact same data?
As it turns out, data alone is typically not enough to arrive at any conclusions. The following “equation” explains the source of discrepancy in the above anecdote:
DATA + ASSUMPTIONS = CONCLUSIONS*
The keyword here is ASSUMPTIONS. Different people can make different assumptions about the data, which introduces a high degree of subjectivity and bias in the data analysis process. The issue, however, is not with making assumptions, but rather with how these assumptions are being made.
Three common themes surrounding problematic assumptions include:
Pretty simple, right?
Not quite, as implicitly, we made a fairly strong assumption: we assumed that the proportion of “cheaters” in 90% of the student population who did not respond to the survey is the same as in the 10% who participated in the study. It is entirely possible, however, that “cheaters” are less likely to respond to a survey of this nature because they feel shame, or because they do not believe in the confidentiality of surveys.
So what is the right answer here? Unfortunately, there isn’t one. It all depends on what realistic assumptions we can make about the 90% of non-responders. For example, if we assume that the rate of “cheaters” among non-responders is 1 in 15, then the overall percentage of “cheaters” is 0.1 * (1/30) + 0.9 * (1/15) = 6.3%, which means that our initial 3.3% estimate was off by 92%!
While this survey example showcases just a single type of assumption, there are multiple forms present throughout the procedure of collecting and analyzing data. A common task in data science is building statistical models to predict the future and inform decision making. Simply applying an off-the-shelf model to data without understanding in depth the data collection mechanism or the assumptions that the model itself imposes on the data can have a significant impact on business outcomes. A good example of predictions gone wrong is Google Flu Trends, which after consistently overestimating the number of influenza cases around the world (at times by 140%), ended up shutting down.
Making assumptions is a necessary condition if we are to reach any conclusions when dealing with data. Effective and responsible data science starts by clearly recognizing what assumptions are being made and going to great lengths to ensure their validity.
Furthermore, data scientists should take care to:
_____
* Manski C, Identification for Prediction and Decision, Harvard University Press, 2008
By Emily Wilson • November 01, 2019
By Metis • August 30, 2019
By Kimberly Fessel • September 18, 2019