
Scoping a Data Science Project

By Damien Martin • April 22, 2019


This post was written by Damien Martin, Sr. Data Scientist on the Corporate Training team at Metis.

In a previous article, we discussed the benefits of up-skilling your employees so they could investigate trends within data to help find high-impact projects. If you implement these suggestions, you will have everyone thinking about business problems at a strategic level, and you will be able to add value based on insight from each person’s specific job function. Having a data literate and empowered workforce allows the data science team to work on projects rather than ad hoc analyses.

Once we have identified an opportunity (or a problem) where we think that data science could help, it is time to scope out our data science project.


Evaluating the business problem

The first step in project planning should come from business concerns. This step can typically be broken down into the following subquestions:

  • What is the problem that we want to solve?
  • Who are the key stakeholders?
  • How do we plan to measure if the problem is solved?
  • What is the value (both upfront and ongoing) of this project?

There is nothing in this evaluation process that is specific to data science. The same questions could be asked about adding a new feature to your website, changing the opening hours of your store, or changing the logo for your company.

The owner for this stage is the stakeholder, not the data science team. We are not telling the data scientists how to accomplish their goal, but we are telling them what the goal is.

Is it a data science project?

Just because a project involves data doesn't make it a data science project. Consider a company that wants a dashboard that tracks a key metric, such as weekly revenue. Using our previous rubric, we have:

    Problem: We want visibility on sales revenue.

    Stakeholders: Primarily the sales and marketing teams, but this should impact everyone.

    Measurement: A dashboard indicating the amount of revenue for each week.

    Value: $10k upfront + $10k/year ongoing.

Even though we may use a data scientist (particularly in small companies without dedicated analysts) to build this dashboard, this isn't really a data science project. This is the sort of project that can be managed like a typical software engineering project. The goals are well-defined, and there isn't a lot of uncertainty. Our data scientist just needs to write the queries, and there is a "correct" answer to check against. The value of the project isn't the amount we expect to spend, but the amount we are willing to spend on creating the dashboard. If we already have sales data sitting in a database, and a license for dashboarding software, this might be an afternoon's work. If we need to build the infrastructure from scratch, then that cost would be included in this project (or, at least, amortized over projects that share the same resource).
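To make "just write the queries" concrete, here is a minimal sketch of the aggregation behind such a dashboard, assuming sales records with a date and an amount column (the data and column names are hypothetical):

```python
import pandas as pd

# Hypothetical sales records; in practice these would come from the
# sales database (column names are assumptions for this sketch).
sales = pd.DataFrame({
    "date": pd.to_datetime(["2019-04-01", "2019-04-03",
                            "2019-04-08", "2019-04-10"]),
    "amount": [1200.0, 800.0, 1500.0, 500.0],
})

# Aggregate revenue by week (weeks end on Sunday by default).
weekly_revenue = sales.set_index("date")["amount"].resample("W").sum()
print(weekly_revenue)
```

In practice this would likely be a SQL query against the sales database feeding the dashboarding tool, but the point stands: the work is well-defined aggregation, not open-ended modeling.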

One way of thinking about the difference between a software engineering project and a data science project is that features in a software project are often scoped out separately by a project manager (perhaps in conjunction with user stories). For a data science project, determining the "features" to be added is a part of the project.

Scoping a data science project: Failure IS an option

A data science project might have a well-defined problem (e.g. too much churn), but the solution might have unknown effectiveness. While the project goal might be "reduce churn by 20 percent," we don't know whether this goal is achievable with the information we have.

Adding additional data to your project is typically expensive (either building infrastructure for internal sources, or paying subscriptions to external data sources). That's why it is so crucial to set an upfront value for your project. A lot of time can be spent generating models that fail to reach the targets before realizing that there is not enough signal in the data. By keeping track of model progress through different iterations, along with ongoing costs, we are better able to project whether we need to add additional data sources (and price them appropriately) to hit the desired performance goals.
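As an illustration of that kind of tracking, here is a hypothetical iteration log and a rough go/no-go check. The numbers, the metric, and the `needs_more_data` helper are all invented for this sketch:

```python
# Illustrative iteration log: metric reached vs. cumulative cost so far.
# These numbers are made up; a real log would come from experiment tracking.
iterations = [
    {"model": "baseline",       "churn_reduction": 0.02, "cost": 2_000},
    {"model": "logistic",       "churn_reduction": 0.08, "cost": 5_000},
    {"model": "gradient boost", "churn_reduction": 0.11, "cost": 9_000},
]

target = 0.20           # the goal: reduce churn by 20 percent
project_value = 15_000  # the value assigned to the project up front

def needs_more_data(log, target, budget):
    """Flag when progress has plateaued short of the target within budget."""
    latest = log[-1]
    gain_per_dollar = latest["churn_reduction"] / latest["cost"]
    remaining_budget = budget - latest["cost"]
    # Optimistic linear projection of what the remaining budget could buy.
    projected = latest["churn_reduction"] + gain_per_dollar * remaining_budget
    return projected < target

print(needs_more_data(iterations, target, project_value))
```

Even a crude projection like this surfaces the decision early: either price in a new data source, or stop before the project quietly consumes its own value.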

Many of the data science projects that you try to implement will fail, but you want to fail quickly (and cheaply), saving resources for projects that show promise. A data science project that fails to meet its target after 2 weeks of investment is part of the cost of doing exploratory data work. A data science project that fails to meet its target after 2 years of investment, on the other hand, is a failure that could probably be avoided.

When scoping, you want to bring the business problem to the data scientists and work with them to make a well-posed problem. For example, you may not have access to the data you need for your proposed measurement of whether the project succeeded, but your data scientists could give you a different metric that might serve as a proxy. Another element to consider is whether your hypothesis has been clearly stated (and you can read a great post on that topic from Metis Sr. Data Scientist Kerstin Frailey here).

Checklist for scoping

Here are some high-level areas to consider when scoping a data science project:

  • Evaluate the data collection pipeline costs
    Before doing any data science, we need to make sure that data scientists have access to the data they need. If we need to invest in additional data sources or tools, there can be (significant) costs associated with that. Often, improving infrastructure can benefit several projects, so we should amortize costs amongst all these projects. We should ask:

    • Will the data scientists need additional tools they don't have?
    • Are many projects repeating the same work?

      Note: If you do add to the pipeline, it is probably worth making a separate project to evaluate the return on investment for this piece.
  • Rapidly make a model, even if it is simple
    Simpler models are often more robust than complicated ones. It is okay if the simple model doesn't reach the desired performance.

  • Get an end-to-end version of the simple model to internal stakeholders
    Ensure that a simple model, even if its performance is poor, gets put in front of internal stakeholders as soon as possible. This allows rapid feedback from your users, who might tell you that a type of data that you expect them to provide is not available until after a sale is made, or that there are legal or ethical implications with some of the data you are trying to use. In some cases, data science teams make extremely quick "junk" models to present to internal stakeholders, just to check if their understanding of the problem is correct.

  • Iterate on your model
    Keep iterating on your model, as long as you continue to see improvements in your metrics. Continue to share results with stakeholders.

  • Stick to your value propositions
    The reason for setting the value of the project before doing any work is to guard against the sunk cost fallacy.

  • Make space for documentation
    Hopefully, your organization has documentation for the systems you have in place. You should also document the failures! If a data science project fails, give a high-level description of what seemed to be the problem (e.g. too much missing data, not enough data, needed different types of data). It is possible that these problems go away in the future and the problem is worth addressing, but more importantly, you don't want another group trying to solve the same problem in two years and coming across the same stumbling blocks.
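The "rapidly make a model" and "iterate" steps above can be sketched in a few lines. This example uses synthetic stand-in data and a deliberately simple model pair; the dataset, metric, and models are illustrative, not a prescription:

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a churn dataset.
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Iteration 0: a deliberately "junk" baseline (predicts the majority class).
baseline = DummyClassifier(strategy="most_frequent").fit(X_train, y_train)

# Iteration 1: a simple, robust first real model.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Score every iteration against the same metric before iterating further.
for name, clf in [("baseline", baseline), ("logistic regression", model)]:
    auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
    print(f"{name}: test AUC = {auc:.3f}")
```

The baseline exists to be shown to stakeholders and to anchor the metric; each later iteration has to beat the one before it on the same test set, or the iteration stops.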

Maintenance costs

While the bulk of the cost for a data science project lies in the initial setup, there are also recurring costs to consider. Some of these costs are obvious because they are explicitly billed. If you require the use of an external service or need to rent a server, you receive a monthly bill for that ongoing cost.

But in addition to these explicit costs, you should consider the following:

  • How often does the model need to be retrained?
  • Are the results of the model being monitored? Is someone being alerted when model performance drops? Or is someone responsible for checking the performance by visiting a dashboard?
  • Who is responsible for monitoring the model? How much time per week is this expected to take?
  • If subscribing to a paid data source, how much is that per billing cycle? Who is monitoring that service’s changes in cost?
  • Under what conditions should this model be retired or replaced?

The expected maintenance costs (both in terms of data scientist time and external subscriptions) should be estimated up front.
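For the monitoring questions above, the simplest version is a threshold alert on the tracked metric. Everything in this sketch (the metric, the values, and the threshold) is a hypothetical placeholder:

```python
# A threshold alert on the tracked metric. The metric name, values, and
# threshold here are all hypothetical placeholders.
ALERT_THRESHOLD = 0.75  # minimum acceptable weekly model AUC, agreed up front

weekly_auc = [0.82, 0.81, 0.79, 0.72]  # most recent value last

def check_performance(history, threshold):
    """Return an alert message if the latest metric breaches the threshold."""
    latest = history[-1]
    if latest < threshold:
        return f"ALERT: metric {latest:.2f} dropped below threshold {threshold:.2f}"
    return None

print(check_performance(weekly_auc, ALERT_THRESHOLD))
```

The check itself is trivial; the maintenance cost is in deciding who receives the alert, who investigates it, and when a breach means retraining versus retiring the model.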


Summary

When scoping a data science project, there are several steps, and each of them has a different owner. The evaluation stage is owned by the business team, as they set the goals for the project. This involves a careful evaluation of the value of the project, both the upfront cost and the ongoing maintenance.

Once a project is deemed worth pursuing, the data science team works on it iteratively. The data used, and progress against the main metric, should be tracked and compared to the initial value assigned to the project.

Need help?

Metis offers training to upskill your technical team in data science and machine learning, as well as trainings to help executives become more fluent in understanding data and its value. If you have managers who would like training on managing data science projects, please get in touch with our corporate training team, or fill out this form.
