
Does the Data Speak for Itself?

By Alex Nathan • November 25, 2019

This post was written by Alex Nathan, Co-Founder of Aventrix Analytics and part-time Metis Corporate Training Instructor.

A retail company is looking to open a new store. Given its recent investments in analytics, Eve, the director of data science, decides to provide two members of her team (Bob and Alice) with every piece of available information – such as sales and demographics on all current stores – and asks them to come up with insights on the best location for the new store. 

Bob and Alice independently tackle the problem for a few weeks, and after extensive data cleaning, pre-processing, and modeling, they present their findings to Eve. To her surprise, Bob and Alice have come up with very different recommendations. How is this possible given that both data scientists had access to the exact same data?

As it turns out, data alone is typically not enough to arrive at any conclusions. The following “equation” explains the source of the discrepancy in the above anecdote:

Data + Assumptions = Conclusions*

The keyword here is ASSUMPTIONS. Different people can make different assumptions about the data, which introduces a high degree of subjectivity and bias in the data analysis process. The issue, however, is not with making assumptions, but rather with how these assumptions are being made. 

Three common themes surrounding problematic assumptions include:

  1. Convenience: Realistic assumptions are typically more complicated and require more work.
  2. Cognitive Biases: As humans, we are inherently biased beings, and oftentimes search for answers that confirm our worldview.
  3. Unawareness of the Underlying Assumptions: This point can be particularly tricky, as it constitutes an unknown unknown.

    Consider the following example:

    A survey asks college students whether they have ever cheated on an exam. It goes without saying that the anonymity of the participants is guaranteed. The response rate of the survey is 10%. The results indicate that 1 in 30 respondents have cheated, which means that 3.3% of the entire student body are “cheaters.”

Pretty simple, right?

Not quite. Implicitly, we made a fairly strong assumption: that the proportion of “cheaters” in the 90% of the student population who did not respond to the survey is the same as in the 10% who did. It is entirely possible, however, that “cheaters” are less likely to respond to a survey of this nature because they feel shame, or because they do not believe in the confidentiality of surveys.

So what is the right answer here? Unfortunately, there isn’t one. It all depends on what realistic assumptions we can make about the 90% of non-responders. For example, if we assume that the rate of “cheaters” among non-responders is 1 in 15, then the overall percentage of “cheaters” is 0.1 * (1/30) + 0.9 * (1/15) ≈ 6.3%, which is 90% higher than our initial 3.3% estimate!
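The arithmetic above can be sketched in a few lines of Python. The function name `overall_rate` and its parameter names are illustrative, not taken from any survey tool, and the 1-in-15 rate for non-responders is just the assumption from the example. The relative gap is computed from the unrounded values, which comes out to about 90%.

```python
def overall_rate(response_rate, rate_responders, rate_nonresponders):
    """Population-wide rate as a weighted average of responders and non-responders."""
    return (response_rate * rate_responders
            + (1 - response_rate) * rate_nonresponders)

# Naive reading: non-responders cheat at the same rate as responders.
naive = overall_rate(0.10, 1 / 30, 1 / 30)

# Alternative assumption from the example: non-responders cheat at 1 in 15.
adjusted = overall_rate(0.10, 1 / 30, 1 / 15)

# How far the adjusted figure sits above the naive one, relative to the naive one.
relative_gap = (adjusted - naive) / naive

print(f"naive estimate:    {naive:.1%}")
print(f"adjusted estimate: {adjusted:.1%}")
print(f"relative gap:      {relative_gap:.0%}")
```

The same data produce both numbers; only the assumption about the unobserved 90% differs.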

While this survey example showcases just a single type of assumption, assumptions of many kinds are present throughout the process of collecting and analyzing data. A common task in data science is building statistical models to predict the future and inform decision making. Simply applying an off-the-shelf model to data, without understanding in depth the data collection mechanism or the assumptions that the model itself imposes on the data, can have a significant impact on business outcomes. A good example of predictions gone wrong is Google Flu Trends, which, after consistently overestimating the number of influenza cases around the world (at times by 140%), ended up shutting down.

So, what can data scientists do about this? 

Making assumptions is necessary if we are to reach any conclusions when dealing with data. Effective and responsible data science starts by clearly recognizing what assumptions are being made and going to great lengths to ensure their validity.

Furthermore, data scientists should take care to: 

  • Clearly communicate assumptions to stakeholders and domain experts, and solicit their feedback on whether the assumptions are properly identified.
  • Avoid diving into an analysis without first gaining a deep understanding of the data generation process.
  • Avoid sticking to a single assumption when working with data. Test a variety of assumptions and observe how the conclusions of the analysis change.
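As a rough sketch of the last point, applied to the cheating survey from earlier: rather than committing to one assumed rate for non-responders, sweep a range of candidate rates and watch the conclusion move. The candidate rates below are hypothetical and chosen only for illustration. The endpoints 0 and 1 give the widest range consistent with the survey data when nothing at all is assumed about non-responders.

```python
RESPONSE_RATE = 0.10      # from the survey example
RATE_RESPONDERS = 1 / 30  # from the survey example

# Hypothetical assumptions about the cheating rate among non-responders.
candidate_rates = [0.0, 1 / 30, 1 / 15, 0.10, 0.25, 1.0]

def overall_rate(rate_nonresponders):
    """Overall rate implied by one assumption about the non-responders."""
    return (RESPONSE_RATE * RATE_RESPONDERS
            + (1 - RESPONSE_RATE) * rate_nonresponders)

# Map each assumption to the conclusion it leads to.
results = {rate: overall_rate(rate) for rate in candidate_rates}

for assumed, overall in results.items():
    print(f"assumed non-responder rate {assumed:6.1%} -> overall rate {overall:6.1%}")

# Range of conclusions across all assumptions considered.
lower, upper = min(results.values()), max(results.values())
print(f"conclusions range from {lower:.1%} to {upper:.1%}")
```

If the conclusion swings widely across assumptions that stakeholders consider plausible, that is a sign the assumption deserves more scrutiny before any decision is made.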


* Manski C, Identification for Prediction and Decision, Harvard University Press, 2008
