Best Practices for Applying Data Science Techniques in Consulting Engagements (Part 1): Introduction and Data Collection

By Jonathan Balaban • December 05, 2017

This is part 1 of a 3-part series written by Metis Sr. Data Scientist Jonathan Balaban. In it, he distills best practices learned over a decade of consulting with dozens of organizations in the private, public, and philanthropic sectors.

Image credit: Lánluas Consulting

Introduction

Data Science is all the rage; it seems like no industry is immune. IBM recently predicted that 2.7 million open roles will be advertised by 2020, many in generally untapped sectors. The internet, digitization, surging data, and ubiquitous sensors allow even ice cream parlors, surf shops, fashion boutiques, and humanitarian organizations to quantify and capture every minutia of business operations.

If you’re a data scientist considering the freelance lifestyle, or a seasoned consultant with strong technical chops thinking of running your own engagements, opportunities abound! Yet caution is in order: in-house data science is already a demanding endeavor, with the proliferation of algorithms, confusing higher-order effects, and difficult implementation among the ever-present obstacles. These problems compound under the higher pressure, faster timeframes, and ambiguous scope typical of a consulting effort.


This series of posts is my attempt to distill best practices learned over a decade of consulting with dozens of organizations in the private, public, and philanthropic sectors.

I’m also in the throes of an engagement with an undisclosed client who supports numerous overseas humanitarian projects through hundreds of millions in funding. This NGO manages partners and stakeholder organizations, thousands of traveling volunteers, and over a hundred staff across four continents. The amazing staff manages projects and generates key data that tracks community health in third-world countries. Every engagement brings new lessons, and I’ll also share what I can from this unique client.

Throughout, I attempt to balance my unique experience with lessons and tips gleaned from colleagues, mentors, and experts. I also hope you, my courageous readers, share your comments with me on Twitter at @ultimetis.

This series of posts will rarely delve into technical code…for good reason. I believe, in the past few years, we data scientists have crossed a hidden threshold. Thanks to open source, support sites, forums, and code visibility through platforms like GitHub, you can get help for almost any technical challenge or bug you’ll ever encounter. What’s bottlenecking our progress, however, is the paradox of choice and complication of process.

At the end of the day, data science is about making better decisions. While I can’t deny the mathematical beauty of SVD or multilayer perceptrons, my recommendations — and my current client’s decisions — help define the future of communities and people groups living on the ragged edge of survival.

These communities crave results, not theoretical beauty.


Data Collection

There’s a general concern among data science practitioners that hard facts are too often ignored, and that subjective, agenda-driven decisions take precedence. This is countered with the equally valid concern that business is being wrested from humans by impersonal algorithms, leading to the eventual rise of artificial intelligence and the demise of humanity. The truth, and the proper art of consulting, is to bring both humans and data to the table.

So, how to begin?

1. Start with Stakeholders

First things first: the individual or organization writing your check is rarely the only entity you are accountable to. And, just as a data architect creates a data schema, we must map out the stakeholders and their relationships. The smart leaders I’ve worked under perceived, through experience, the implications of their endeavor. The smartest ones carved out time to personally meet with stakeholders and discuss potential impact.

In addition, these expert consultants collected business rules and hard data from stakeholders. Truth is, data coming from your primary stakeholder can be cherry-picked, or may measure only one of numerous key metrics. Collecting a complete set from every stakeholder gives the clearest view of how changes are working.

I recently had the opportunity to chat with project managers in Africa and Latin America, who gave me a transformative understanding of data I really thought I knew. And, honestly, I still don’t know everything. So I include these managers in key conversations; they bring stark reality to the table.
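One way to make the stakeholder map concrete is to write it down as a small graph and walk it from whoever signs the check. The sketch below is only an illustration under my own assumptions: the stakeholder names and the `reachable_stakeholders` helper are hypothetical, and a real map would carry far more context than a list of names.

```python
from collections import deque


def reachable_stakeholders(relationships, start):
    """Breadth-first walk of everyone connected to `start`, directly or not.

    `relationships` maps a stakeholder to the parties they directly affect
    or answer to. All names used with it here are made up for illustration.
    """
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in relationships.get(current, []):
            if neighbor not in seen:
                seen.add(neighbor)
                order.append(neighbor)
                queue.append(neighbor)
    return order


# Hypothetical map: the paying client is only the first link in the chain.
relationships = {
    "funder": ["ngo"],
    "ngo": ["project managers", "volunteers"],
    "project managers": ["communities"],
}
print(reachable_stakeholders(relationships, "funder"))
```

Even a toy map like this tends to surface parties, such as field project managers, who never appear on the contract but should be in key conversations.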

2. Start Early

I don’t remember a single engagement where we (the consulting team) received all the data we needed to properly start working on kickoff day. I learned quickly that no matter how tech-savvy the client is, or how vehemently data is promised, key puzzle pieces are always missing. Always.

So, start early, and prepare for an iterative process. Everything will take twice as long as promised or expected.

Get to know the data engineering team (or intern) intimately, and keep in mind that they’re often given little to no notice that extra, disruptive ETL tasks are landing on their desk. Find a cadence and method for asking small, granular questions about fields or tables that the data dictionary may not cover. Schedule deeper dives before questions arise (it’s easier to cancel a meeting than to drop a last-minute request on a calendar!), and always document your understanding, interpretation, and assumptions about the data.
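Documenting your understanding does not require special tooling. As a minimal sketch, assuming you keep one note per field, the structure below records an interpretation, the assumptions behind it, and who confirmed it, then lists what is still undocumented or unconfirmed. The `FieldNote` class, the helper functions, and the table and column names are all hypothetical.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class FieldNote:
    """One documented understanding of a client field (hypothetical structure)."""
    table: str
    column: str
    interpretation: str
    assumptions: List[str] = field(default_factory=list)
    confirmed_by: Optional[str] = None  # who on the client side verified this
    recorded_on: date = field(default_factory=date.today)


def undocumented_columns(columns, notes, table):
    """Return the columns of `table` that have no FieldNote yet, in order."""
    documented = {n.column for n in notes if n.table == table}
    return [c for c in columns if c not in documented]


def open_questions(notes):
    """Return notes whose interpretation nobody at the client has confirmed."""
    return [n for n in notes if n.confirmed_by is None]


# Made-up example: two fields documented so far, one still unconfirmed.
notes = [
    FieldNote("visits", "clinic_id", "Clinic identifier",
              confirmed_by="data engineer"),
    FieldNote("visits", "status", "Visit outcome code",
              assumptions=["0 means cancelled"]),
]
print(undocumented_columns(["clinic_id", "status", "visit_dt"], notes, "visits"))
print([n.column for n in open_questions(notes)])
```

The two lists double as an agenda for the next conversation with the data engineering team: columns nobody has explained yet, and interpretations that are still only your guess.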

3. Build the Proper Structure 

Here’s an investment often worth making: learn the client data, collect it, and structure it in a way that maximizes your ability to do proper analysis! Chances are that many years ago, when someone long gone from the company built the database the way they did, they weren’t thinking of you, or of data science.

I’ve regularly seen clients using traditional relational databases when a NoSQL or document-based approach would have served them best. MongoDB could have allowed partitioning or parallelization appropriate for the scale and speed needed. Well…MongoDB didn’t exist when the data started pouring in!

I’ve occasionally had the opportunity to “upgrade” my client as an à la carte service. This was a fantastic way to get paid for something I honestly wanted to do anyway in order to complete my primary objectives. If you see potential, broach the topic!
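To illustrate the relational-versus-document contrast in this section, here is a rough sketch of folding flat parent and child rows into one nested record per parent, which is the shape a document store expects. It uses plain Python dictionaries rather than any database API, and the `projects` and `visits` tables, their columns, and the sample values are invented for the example.

```python
from collections import defaultdict


def nest_rows(projects, visits):
    """Fold flat child rows into one document per project.

    `projects` and `visits` are hypothetical tables, given as lists of dicts
    such as a CSV reader or database cursor might return.
    """
    visits_by_project = defaultdict(list)
    for row in visits:
        # Drop the foreign key; nesting under the parent makes it redundant.
        child = {k: v for k, v in row.items() if k != "project_id"}
        visits_by_project[row["project_id"]].append(child)

    return [
        {**project, "visits": visits_by_project.get(project["project_id"], [])}
        for project in projects
    ]


# Made-up sample rows, not client data.
projects = [
    {"project_id": 1, "country": "Kenya"},
    {"project_id": 2, "country": "Peru"},
]
visits = [
    {"project_id": 1, "date": "2017-03-01", "volunteers": 12},
    {"project_id": 1, "date": "2017-06-15", "volunteers": 8},
]
for doc in nest_rows(projects, visits):
    print(doc)
```

Whether this shape actually serves the analysis depends on the questions being asked; nesting helps when you usually read a project together with its visits, and hurts when you mostly aggregate across visits.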

4. Backup, Duplicate, Sandbox 

I can’t tell you how many times I’ve seen someone (myself included) make “just this tiny little change” or run “this harmless little script,” and wake up to a data hellscape. So much of data is intricately connected, automated, and dependent; this can be a fantastic productivity and quality-control boon and a perilous house of cards, all at once.

So, back everything up!

All the time!

And especially when you’re making changes!

I love the ability to create a duplicate dataset within a sandbox environment and go to town. Salesforce is great at this, as the platform regularly offers the option when you make major changes, install an application, or run root code. But even when sandbox code works perfectly, I jump into the backup module and download a manual package of key client data. Why not?
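For data that lives in files rather than on a platform with a built-in sandbox, the same habit can be approximated with a few lines of standard-library Python. This is a minimal sketch, not a full backup strategy: `backup_file` and `sha256_of` are hypothetical helpers, and the file name in the usage comment is a placeholder.

```python
import hashlib
import shutil
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Hex SHA-256 digest of a file, read in chunks to handle large exports."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source, backup_dir):
    """Copy `source` into `backup_dir` under a timestamped name and verify it."""
    source = Path(source)
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)

    # UTC timestamp with microseconds keeps successive backups distinct.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S%fZ")
    target = backup_dir / f"{source.stem}.{stamp}{source.suffix}"
    shutil.copy2(source, target)  # copy2 also preserves file metadata

    # Refuse to continue if the copy does not match the original byte for byte.
    if sha256_of(source) != sha256_of(target):
        raise IOError(f"Backup of {source} failed verification")
    return target


# Usage (placeholder path): take a verified copy before running any change.
# backup_path = backup_file("client_export.csv", "backups")
```

Calling this immediately before the “tiny little change” gives you a verified copy to restore from, and the timestamped names mean earlier backups are never overwritten.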


The four strategies above save my team countless hours with every engagement and give us a clean data foundation for lucid analysis. What are your best practices and lessons learned wrangling data?

In this series’ next post, I tackle the thorny subject of scoping and client expectations.
