
Scoping a Data Science Project

By Damien Martin • April 22, 2019

Photo by Kelly Sikkema via Unsplash

This post was written by Damien Martin, Sr. Data Scientist on the Corporate Training team at Metis.

In a previous article, we discussed the benefits of up-skilling your employees so they could investigate trends within data to help find high-impact projects. If you implement these suggestions, you will have everyone thinking about business problems at a strategic level, and you will be able to add value based on insight from each person’s specific job function. Having a data literate and empowered workforce allows the data science team to work on projects rather than ad hoc analyses.

Once we have identified an opportunity (or a problem) where we think that data science could help, it is time to scope out our data science project.

Evaluation

The first step in project planning should come from business concerns. This step can typically be broken down into the following subquestions:

  • What is the problem that we want to solve?
  • Who are the key stakeholders?
  • How do we plan to measure if the problem is solved?
  • What is the value (both upfront and ongoing) of this project?

There is nothing in this evaluation process that is specific to data science. The same questions could be asked about adding a new feature to your website, changing the opening hours of your store, or changing the logo for your company.

The owner for this stage is the stakeholder, not the data science team. We are not telling the data scientists how to accomplish their goal, but we are telling them what the goal is.

Is it a data science project?

Just because a project involves data doesn't make it a data science project. Consider a company that wants a dashboard that tracks a key metric, such as weekly revenue. Using our previous rubric, we have:

  • WHAT IS THE PROBLEM?
    We want visibility into sales revenue.

  • WHO ARE THE KEY STAKEHOLDERS?
    Primarily the sales and marketing teams, but this should impact everyone.

  • HOW DO WE PLAN TO MEASURE IF THE PROBLEM IS SOLVED?
    A solution would have a dashboard indicating the amount of revenue for each week.

  • WHAT IS THE VALUE OF THIS PROJECT?
    $10k upfront, plus $10k/year ongoing.

Even though we may use a data scientist (particularly in small companies without dedicated analysts) to build this dashboard, this isn't really a data science project. It is the sort of project that can be managed like a typical software engineering project: the goals are well-defined, and there isn't a lot of uncertainty. Our data scientist just needs to write the queries, and there is a "correct" answer to check against. Note that the value of the project isn't the amount we expect to spend, but the amount we are willing to spend on creating the dashboard. If we already have sales data sitting in a database and a license for dashboarding software, this might be an afternoon's work. If we need to build the infrastructure from scratch, then that cost would be included in this project (or at least amortized over the projects that share the same resource).
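As a toy illustration of amortizing a shared infrastructure cost across projects (all figures here are hypothetical, not from the example above):

```python
# Hypothetical figures: a $30k data warehouse build shared by 3 projects,
# plus $2k of dashboard-specific work for this particular project.
shared_infra_cost = 30_000
projects_sharing = 3
project_specific_cost = 2_000

# Amortize the shared cost evenly across the projects that use it.
amortized_share = shared_infra_cost / projects_sharing
total_project_cost = amortized_share + project_specific_cost

print(f"cost attributed to this project: ${total_project_cost:,.0f}")  # $12,000
```

The point is only that the dashboard project carries its fair share of the warehouse, not the whole bill.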

One way of thinking about the difference between a software engineering project and a data science project is that features in a software project are often scoped out separately by a project manager (perhaps in conjunction with user stories). For a data science project, determining the "features" to be added is a part of the project.

Scoping a data science project: Failure IS an option

A data science project might have a well-defined problem (e.g. too much churn), but the solution might have unknown effectiveness. While the project goal might be "reduce churn by 20 percent", we don't know whether this goal is achievable with the information we have.

Adding additional data to your project is typically expensive (either building infrastructure for internal sources, or subscriptions to external data sources). That's why it is so crucial to set an upfront value to your project. A lot of time can be spent generating models and failing to reach the targets before realizing that there is not enough signal in the data. By keeping track of model progress through different iterations and ongoing costs, we are better able to project if we need to add additional data sources (and price them appropriately) to hit the desired performance goals.

Many of the data science projects that you try to implement will fail, but you want to fail quickly (and cheaply), saving resources for projects that show promise. A data science project that fails to meet its target after 2 weeks of investment is part of the cost of doing exploratory data work. A data science project that fails to meet its target after 2 years of investment, on the other hand, is a failure that could probably be avoided.

When scoping, you want to bring the business problem to the data scientists and work with them to turn it into a well-posed problem. For example, you may not have access to the data needed for your proposed measure of success, but your data scientists could suggest a different metric to serve as a proxy. Another element to consider is whether your hypothesis has been clearly stated (and you can read a great post on that topic from Metis Sr. Data Scientist Kerstin Frailey here).

Checklist for scoping

Here are some high-level areas to consider when scoping a data science project:

  • Evaluate the data collection pipeline costs
    Before doing any data science, we need to make sure that data scientists have access to the data they need. If we need to invest in additional data sources or tools, there can be (significant) costs associated with that. Often, improving infrastructure can benefit several projects, so we should amortize costs amongst all these projects. We should ask:

    • Will the data scientists need additional tools they don't have?
    • Are many projects repeating the same work?

      Note: If you do add to the pipeline, it is probably worth making a separate project to evaluate the return on investment for this piece.
  • Rapidly make a model, even if it is simple
    Simpler models are often more robust than complicated ones. It is okay if the simple model doesn't reach the desired performance.

  • Get an end-to-end version of the simple model to internal stakeholders
    Ensure that a simple model, even if its performance is poor, gets put in front of internal stakeholders as soon as possible. This allows rapid feedback from your users, who might tell you that a type of data that you expect them to provide is not available until after a sale is made, or that there are legal or ethical implications with some of the data you are trying to use. In some cases, data science teams make extremely quick "junk" models to present to internal stakeholders, just to check if their understanding of the problem is correct.

  • Iterate on your model
    Keep iterating on your model, as long as you continue to see improvements in your metrics. Continue to share results with stakeholders.

  • Stick to your value propositions
    The reason for setting the value of the project before doing any work is to guard against the sunk cost fallacy.

  • Make space for documentation
    Hopefully, your organization has documentation for the systems you have in place. You should also document the failures! If a data science project fails, give a high-level description of what seemed to be the problem (e.g. too much missing data, not enough data, needed different types of data). It is possible that these problems go away in the future and the problem is worth addressing, but more importantly, you don't want another group trying to solve the same problem in two years and coming across the same stumbling blocks.
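The "rapidly make a model" step above can be as simple as a majority-class baseline. A minimal sketch, using hypothetical churn labels (1 = churned, 0 = stayed):

```python
from collections import Counter

# Hypothetical labels for ten customers.
y_true = [0, 0, 1, 0, 1, 0, 0, 0, 1, 0]

# Simplest possible model: always predict the most common class.
majority_class, _ = Counter(y_true).most_common(1)[0]
y_pred = [majority_class] * len(y_true)

accuracy = sum(p == t for p, t in zip(y_pred, y_true)) / len(y_true)
print(f"majority-class baseline accuracy: {accuracy:.0%}")  # 70%
```

A baseline like this takes minutes to build, yet it anchors every later iteration: a fancier model that can't beat it isn't adding value.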

Maintenance costs

While the bulk of the cost for a data science project comes from the initial setup, there are also recurring costs to consider. Some of these costs are obvious because they are explicitly billed. If you require the use of an external service or need to rent a server, you receive a monthly bill for that ongoing cost.

But in addition to these explicit costs, you should consider the following:

  • How often does the model need to be retrained?
  • Are the results of the model being monitored? Is someone being alerted when model performance drops? Or is someone responsible for checking the performance by visiting a dashboard?
  • Who is responsible for monitoring the model? How much time per week is this expected to take?
  • If subscribing to a paid data source, how much is that per billing cycle? Who is monitoring that service’s changes in cost?
  • Under what conditions should this model be retired or replaced?

The expected maintenance costs (both in terms of data scientist time and external subscriptions) should be estimated up front.
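Monitoring for performance drops, one of the recurring costs above, can be sketched as a simple threshold check (the metric values and tolerance here are hypothetical):

```python
def needs_attention(current_metric, baseline_metric, tolerance=0.05):
    """Flag the model for review when performance falls more than
    `tolerance` (relative) below its level at deployment."""
    return current_metric < baseline_metric * (1 - tolerance)

# Hypothetical weekly check against the accuracy measured at deployment (0.84).
print(needs_attention(current_metric=0.82, baseline_metric=0.84))  # False: within tolerance
print(needs_attention(current_metric=0.75, baseline_metric=0.84))  # True: degraded, alert someone
```

Wiring a check like this into an alert removes the need for someone to remember to visit a dashboard, which directly reduces the weekly monitoring cost asked about above.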

Summary

When scoping a data science project, there are several steps, and each of them has a different owner. The evaluation stage is owned by the business team, as they set the goals for the project. This involves a careful evaluation of the value of the project, covering both the upfront cost and the ongoing maintenance.

Once a project is deemed worth pursuing, the data science team works on it iteratively. The data used, and progress against the main metric, should be tracked and compared to the initial value assigned to the project.

Need help?

Metis offers training to upskill your technical team in data science and machine learning, as well as training to help executives become more fluent in understanding data and its value. If you have managers who would like training on managing data science projects, please get in touch with our corporate training team, or fill out this form.

