
Does the Data Speak for Itself?

By Alex Nathan • November 25, 2019

This post was written by Alex Nathan, Co-Founder of Aventrix Analytics and part-time Metis Corporate Training Instructor.

A retail company is looking to open a new store. Given its recent investments in analytics, Eve, the director of data science, decides to provide two members of her team (Bob and Alice) with every piece of available information – such as sales and demographics on all current stores – and asks them to come up with insights on the best location for the new store. 

Bob and Alice tackle the problem independently for a few weeks, and after extensive data cleaning, pre-processing, and modeling, they present their findings to Eve. To her surprise, they have come up with very different recommendations. How is this possible, given that both data scientists had access to exactly the same data?

As it turns out, data alone is typically not enough to arrive at any conclusions. The following “equation” explains the source of discrepancy in the above anecdote:

DATA + ASSUMPTIONS = CONCLUSIONS*

The keyword here is ASSUMPTIONS. Different people can make different assumptions about the data, which introduces a high degree of subjectivity and bias in the data analysis process. The issue, however, is not with making assumptions, but rather with how these assumptions are being made. 

Three common themes surrounding problematic assumptions include:

  1. Convenience: Realistic assumptions are typically more complicated and require more work.
  2. Cognitive Biases: As humans, we are inherently biased beings, and oftentimes search for answers that confirm our worldview.
  3. Unawareness of the Underlying Assumptions: This point can be particularly tricky, as it constitutes an unknown unknown.

Consider the following example:

    A survey asks college students whether they have ever cheated on their exams. It goes without saying that the anonymity of the participants is guaranteed. The response rate of the survey is 10%. The results indicate that 1 in 30 respondents has cheated, which means that 3.3% of the entire student body are “cheaters.” 

Pretty simple, right?

Not quite. We implicitly made a fairly strong assumption: that the proportion of “cheaters” among the 90% of students who did not respond to the survey is the same as among the 10% who did. It is entirely possible, however, that “cheaters” are less likely to respond to a survey of this nature, because they feel shame or because they do not trust the confidentiality of surveys. 

So what is the right answer here? Unfortunately, there isn’t one. It all depends on what realistic assumptions we can make about the 90% of non-responders. For example, if we assume that the rate of “cheaters” among non-responders is 1 in 15, then the overall percentage of “cheaters” is 0.1 * (1/30) + 0.9 * (1/15) = 6.3%, which is 90% higher than our initial 3.3% estimate! 
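
To make the arithmetic concrete, here is a minimal Python sketch of the weighted average used above. The function name `overall_rate` and its parameter names are illustrative and not part of the original analysis; the inputs are the 10% response rate and the two assumed rates from the example.

```python
def overall_rate(response_rate, rate_responders, rate_nonresponders):
    """Population-wide rate as a weighted average of responders and non-responders."""
    return (response_rate * rate_responders
            + (1 - response_rate) * rate_nonresponders)


# Naive estimate: non-responders are assumed to behave exactly like responders.
naive = overall_rate(0.10, 1 / 30, 1 / 30)

# Alternative assumption from the text: 1 in 15 non-responders has cheated.
adjusted = overall_rate(0.10, 1 / 30, 1 / 15)

# How far the adjusted figure sits from the naive one, relative to the naive one.
relative_change = (adjusted - naive) / naive

print(f"naive: {naive:.1%}, adjusted: {adjusted:.1%}, "
      f"relative change: {relative_change:.0%}")
```

Changing `rate_nonresponders` shows how strongly the overall estimate depends on an assumption that the survey itself cannot verify.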

While this survey example showcases just a single type of assumption, assumptions of many kinds are made throughout the process of collecting and analyzing data. A common task in data science is building statistical models to predict the future and inform decision making. Simply applying an off-the-shelf model to data, without understanding in depth how the data was collected or what assumptions the model itself imposes on the data, can seriously hurt business outcomes. A good example of predictions gone wrong is Google Flu Trends, which was shut down after consistently overestimating the number of influenza cases around the world (at times by 140%). 

So, what can data scientists do about this? 

Making assumptions is necessary if we are to reach any conclusions from data. Effective and responsible data science starts by clearly recognizing what assumptions are being made and going to great lengths to ensure their validity.

Furthermore, data scientists should take care to: 

  • Clearly communicate assumptions to stakeholders and domain experts, and solicit their feedback on whether the assumptions have been properly identified.
  • Avoid diving into an analysis without first gaining a deep understanding of the data generation process.
  • Avoid sticking to a single assumption when working with data. Test a variety of assumptions and observe how the conclusions of the analysis change.
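
One way to put the last recommendation into practice is to rerun the same calculation under several assumptions at once. The sketch below revisits the cheating survey, with its 10% response rate and 1-in-30 rate among responders. The scenario labels and the non-responder rates assigned to them are hypothetical choices made for illustration, not estimates.

```python
RESPONSE_RATE = 0.10        # share of students who answered the survey
RATE_RESPONDERS = 1 / 30    # share of responders who reported cheating


def overall_rate(rate_nonresponders):
    """Population-wide rate for a given assumed rate among non-responders."""
    return (RESPONSE_RATE * RATE_RESPONDERS
            + (1 - RESPONSE_RATE) * rate_nonresponders)


# Hypothetical assumptions about the 90% of students who did not respond.
scenarios = {
    "no non-responder cheated": 0.0,
    "same as responders": 1 / 30,
    "twice the responder rate": 1 / 15,
    "every non-responder cheated": 1.0,
}

results = {label: overall_rate(rate) for label, rate in scenarios.items()}

for label, value in results.items():
    print(f"{label:<30} {value:6.1%}")
```

The two extreme scenarios bracket every possible answer, which makes it plain how little the responders alone can tell us about the whole student body.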

_____

* Manski C, Identification for Prediction and Decision, Harvard University Press, 2008

