This post was written by Alex Nathan, Co-Founder of Aventrix Analytics and part-time Metis Corporate Training Instructor.
A retail company is looking to open a new store. Given its recent investments in analytics, Eve, the director of data science, decides to provide two members of her team (Bob and Alice) with every piece of available information – such as sales and demographics on all current stores – and asks them to come up with insights on the best location for the new store.
Bob and Alice independently tackle the problem for a few weeks, and after extensive data cleaning, pre-processing, and modeling, they present their findings to Eve. To her surprise, Bob and Alice have come up with very different recommendations. How is this possible given that both data scientists had access to the exact same data?
As it turns out, data alone is typically not enough to arrive at any conclusions. The following “equation” explains the source of discrepancy in the above anecdote:
DATA + ASSUMPTIONS = CONCLUSIONS*
The keyword here is ASSUMPTIONS. Different people can make different assumptions about the data, which introduces a high degree of subjectivity and bias in the data analysis process. The issue, however, is not with making assumptions, but rather with how these assumptions are being made.
Three common sources of problematic assumptions are:
- Convenience: Realistic assumptions are typically more complicated and require more work, so it is tempting to settle for simpler ones.
- Cognitive Biases: As humans, we are inherently biased beings, and oftentimes search for answers that confirm our worldview.
- Unawareness of the Underlying Assumptions: This point can be particularly tricky, as it constitutes an unknown unknown.
Consider the following example:
A survey asks college students whether they have ever cheated on their exams. It goes without saying that the anonymity of the participants is guaranteed. The response rate of the survey is 10%. The results indicate that 1 in 30 respondents has cheated on their exams, which means that 3.3% of the entire student body are “cheaters.”
Pretty simple, right?
Not quite. Implicitly, we made a fairly strong assumption: that the proportion of “cheaters” in the 90% of the student population who did not respond to the survey is the same as in the 10% who participated in the study. It is entirely possible, however, that “cheaters” are less likely to respond to a survey of this nature because they feel shame, or because they do not believe in the confidentiality of surveys.
So what is the right answer here? Unfortunately, there isn’t one. It all depends on what realistic assumptions we can make about the 90% of non-responders. For example, if we assume that the rate of “cheaters” among non-responders is 1 in 15, then the overall percentage of “cheaters” is 0.1 * (1/30) + 0.9 * (1/15) = 6.3%, which is roughly 90% higher than our initial 3.3% estimate!
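The arithmetic above can be sketched in a few lines of Python. The function name `overall_rate` and its parameters are illustrative, not part of any real analysis; the 1-in-15 rate among non-responders is the same hypothetical assumption used in the text.

```python
def overall_rate(response_rate, rate_responders, rate_nonresponders):
    """Weighted average of the cheating rate over responders and non-responders."""
    return (response_rate * rate_responders
            + (1 - response_rate) * rate_nonresponders)

# Naive assumption: non-responders cheat at the same rate as responders.
naive = overall_rate(0.10, 1 / 30, 1 / 30)

# Alternative (hypothetical) assumption: non-responders cheat at a rate of 1 in 15.
adjusted = overall_rate(0.10, 1 / 30, 1 / 15)

print(f"naive: {naive:.1%}, adjusted: {adjusted:.1%}")
```

The data fed into both calls is identical; only the assumption about the unobserved 90% differs.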
While this survey example showcases just a single type of assumption, there are multiple forms present throughout the procedure of collecting and analyzing data. A common task in data science is building statistical models to predict the future and inform decision making. Simply applying an off-the-shelf model to data without understanding in depth the data collection mechanism or the assumptions that the model itself imposes on the data can have a significant impact on business outcomes. A good example of predictions gone wrong is Google Flu Trends, which after consistently overestimating the number of influenza cases around the world (at times by 140%), ended up shutting down.
So, what can data scientists do about this?
Making assumptions is a necessary condition if we are to reach any conclusions when dealing with data. Effective and responsible data science starts by clearly recognizing what assumptions are being made and going to great lengths to ensure their validity.
Furthermore, data scientists should take care to:
- Clearly communicate assumptions with stakeholders and domain experts and solicit their feedback on whether assumptions are properly identified.
- Gain a deep understanding of the data generation process before diving into the analysis.
- Avoid sticking to a single assumption when working with data. Test a variety of assumptions and observe how the conclusions of the analysis change.
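One way to act on the last point, sketched in Python with the cheating survey from earlier, is to hold the observed data fixed and vary the assumption that cannot be verified. The helper `overall_cheating_rate` and the list of candidate rates are hypothetical choices for illustration; only the 10% response rate and the 1-in-30 observed rate come from the example.

```python
RESPONSE_RATE = 0.10    # share of students who answered the survey
OBSERVED_RATE = 1 / 30  # cheating rate among those who answered

def overall_cheating_rate(rate_nonresponders):
    """Overall rate implied by an assumed cheating rate among non-responders."""
    return (RESPONSE_RATE * OBSERVED_RATE
            + (1 - RESPONSE_RATE) * rate_nonresponders)

# Hypothetical assumptions, ranging from "no non-responder cheats" to "all of them do".
candidate_rates = [0.0, 1 / 30, 1 / 15, 0.25, 1.0]
estimates = {rate: overall_cheating_rate(rate) for rate in candidate_rates}

for rate, estimate in estimates.items():
    print(f"non-responder rate {rate:6.1%} -> overall {estimate:6.1%}")
```

The two extreme rows show how little the data alone pins down: with no assumption at all about non-responders, the overall rate could lie anywhere between about 0.3% and about 90.3%.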
* Manski C, Identification for Prediction and Decision, Harvard University Press, 2008