This is part 1 of a 3-part series written by Metis Sr. Data Scientist Jonathan Balaban. In it, he distills best practices learned over a decade of consulting with dozens of organizations in the private, public, and philanthropic sectors.
Credit: Lánluas Consulting
Data Science is all the rage; it seems like no industry is immune. IBM recently predicted that 2.7 million open roles will be advertised by 2020, many in generally untapped sectors. The internet, digitization, surging data, and ubiquitous sensors allow even ice cream parlors, surf shops, fashion boutiques, and humanitarian organizations to quantify and capture every minutia of business operations.
If you’re a data scientist considering the freelance lifestyle, or a seasoned consultant with strong technical chops thinking of running your own engagements, opportunities abound! Yet caution is in order: in-house data science is already a challenging endeavor, with the proliferation of algorithms, confusing higher-order effects, and difficult implementation among the ever-present obstacles. These problems compound with the higher pressure, faster timeframes, and ambiguous scope typical of a consulting effort.
This series of posts is my attempt to distill best practices learned over a decade of consulting with dozens of organizations in the private, public, and philanthropic sectors.
I’m also in the throes of an engagement with an undisclosed client who supports numerous overseas humanitarian projects through hundreds of millions in funding. This NGO manages partners and stakeholder organizations, thousands of traveling volunteers, and over a hundred staff across four continents. The amazing staff manages projects and generates key data that tracks community health in third-world countries. Every engagement brings new lessons, and I’ll also share what I can from this unique client.
Throughout, I attempt to balance my unique experience with lessons and tips gleaned from colleagues, mentors, and experts. I also hope you, my courageous readers, will share your comments with me on Twitter at @ultimetis.
This series of posts will rarely delve into technical code…for good reason. I believe, in the past few years, we data scientists have crossed a hidden threshold. Thanks to open source, support sites, forums, and code visibility through platforms like GitHub, you can get help for almost any technical challenge or bug you’ll ever encounter. What’s bottlenecking our progress, however, is the paradox of choice and complication of process.
At the end of the day, data science is about making better decisions. While I can’t deny the mathematical beauty of SVD or multilayer perceptrons, my recommendations — and my current client’s decisions — help define the future of communities and people groups living on the ragged edge of survival.
These communities crave results, not theoretical beauty.
There’s a general concern among data science practitioners that hard facts are too often ignored, and that subjective, agenda-driven decisions take precedence. This is countered with the equally valid concern that business is being wrested from humans by impersonal algorithms, leading to the eventual rise of artificial intelligence and the demise of humanity. The truth, and the proper art of consulting, lies in bringing both humans and data to the table.
So, how to begin?
1. Start with Stakeholders
First things first: the individual or organization writing your check is rarely the only entity you are accountable to. And, just as a data architect creates a data schema, we must map out the stakeholders and their relationships. The smart leaders I’ve worked under perceived, through experience, the implications of their endeavor. The smartest ones carved out time to personally meet stakeholders and discuss potential impact.
In addition, these expert consultants collected business rules and hard data from stakeholders. Truth is, data coming from your primary stakeholder can be cherry-picked, or may measure only one of numerous key metrics. Collecting a complete set gives the clearest picture of whether changes are working.
I recently had the opportunity to chat with project managers in Africa and Latin America, who gave me a transformative understanding of data I really thought I knew. And, honestly, I still don’t know everything. So I include these managers in key conversations; they bring stark reality to the table.
2. Start Early
I don’t remember a single engagement where we (the consulting team) received all the data we needed to properly start working on kickoff day. I learned quickly that no matter how tech-savvy the client is, or how vehemently data is promised, key puzzle pieces are always missing. Always.
So, start early, and prepare for an iterative process. Everything will take twice as long as promised or expected.
Get to know the data engineering team (or intern) intimately, and keep in mind that they’re often given little to no notice that extra, disruptive ETL tasks are landing on their desks. Find a cadence and method to ask small, granular questions about fields or tables that the data dictionary may not cover. Schedule deeper dives before questions arise (it’s easier to cancel than to drop a last-minute request on a calendar!), and always document your understanding, interpretation, and assumptions about the data.
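One lightweight way to keep those questions and assumptions organized is a small log that lives alongside the analysis. The sketch below is only an illustration: the `FieldNote` structure, the helper functions, and the sample column names are all hypothetical, not taken from any client's system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FieldNote:
    """One documented assumption about a column (hypothetical structure)."""
    table: str
    column: str
    assumption: str
    confirmed_by: Optional[str] = None  # stays None until the client confirms it


def undocumented_columns(rows, data_dictionary):
    """Return columns that appear in the data but not in the data dictionary."""
    columns = set()
    for row in rows:
        columns.update(row)
    return sorted(columns - set(data_dictionary))


def open_questions(notes):
    """Return the assumptions nobody on the client side has confirmed yet."""
    return [note for note in notes if note.confirmed_by is None]


# Made-up sample data, purely for illustration.
data_dictionary = {
    "site_id": "Unique project site",
    "visit_date": "Date of field visit",
}
rows = [
    {"site_id": 12, "visit_date": "2017-03-01", "hh_cnt": 4},
    {"site_id": 15, "visit_date": "2017-03-02", "hh_cnt": 7, "flag2": "Y"},
]
notes = [
    FieldNote("visits", "hh_cnt", "Households surveyed per visit"),
    FieldNote("visits", "visit_date", "Local time zone, not UTC",
              confirmed_by="data engineer"),
]

gaps = undocumented_columns(rows, data_dictionary)
pending = [note.column for note in open_questions(notes)]
print("Ask about:", gaps)
print("Still unconfirmed:", pending)
```

The list of gaps becomes the agenda for the next short check-in with the data engineering team, and the unconfirmed notes are the assumptions to flag in any deliverable.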
3. Build the Proper Structure
Here’s an investment often worth making: learn the client data, collect it, and structure it in a way that maximizes your ability to do proper analysis! Chances are that many years ago, when someone long gone from the company designed the database, they weren’t thinking of you, or of data science.
I’ve regularly seen clients using traditional relational databases when a NoSQL or document-based approach would have served them best. MongoDB could have allowed partitioning or parallelization appropriate for the scale and speed needed. Well…MongoDB didn’t exist when the data started pouring in!
I’ve occasionally had the opportunity to “upgrade” my client as an à la carte service. This was a fantastic way to get paid for something I honestly wanted to do anyway in order to complete my primary objectives. If you see potential, broach the topic!
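To make the relational-versus-document contrast concrete, here is a minimal sketch of reshaping two relational-style tables into nested documents of the kind a document store holds. It uses plain Python dictionaries rather than any database driver, and the table and field names (`projects`, `visits`, `project_id`) are invented for illustration.

```python
import json
from collections import defaultdict


def to_documents(projects, visits):
    """Nest each project's visit rows inside the matching project record."""
    visits_by_project = defaultdict(list)
    for visit in visits:
        # Drop the foreign key; the nesting itself now carries the relationship.
        child = {k: v for k, v in visit.items() if k != "project_id"}
        visits_by_project[visit["project_id"]].append(child)
    return [
        {**project, "visits": visits_by_project.get(project["project_id"], [])}
        for project in projects
    ]


# Made-up rows standing in for two relational tables.
projects = [
    {"project_id": 1, "region": "East Africa"},
    {"project_id": 2, "region": "Latin America"},
]
visits = [
    {"project_id": 1, "date": "2017-03-01", "households": 4},
    {"project_id": 1, "date": "2017-04-01", "households": 6},
]

documents = to_documents(projects, visits)
print(json.dumps(documents, indent=2))
```

Whether this shape actually serves a client better depends on how the data is queried, so treat it as a conversation starter rather than a recommendation.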
4. Backup, Duplicate, Sandbox
I can’t tell you how many times I’ve seen someone (myself included) make “just this tiny little change” or run “this harmless little script,” and wake up to a data hellscape. So much of data is intricately connected, automated, and dependent; this can be a fantastic productivity and quality-control boon and a perilous house of cards, all at once.
So, back everything up!
All the time!
And especially when you’re making changes!
I love the ability to create a duplicate dataset within a sandbox environment and go to town. Salesforce is great at this, as the platform regularly offers the option when you make major changes, install an application, or run root code. But even when sandbox code works perfectly, I jump into the backup module and download a manual package of key client data. Why not?
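The same duplicate-then-experiment habit can be sketched outside Salesforce. The example below swaps in SQLite from Python's standard library, purely as a stand-in: it copies a "live" database into a sandbox with `Connection.backup` and runs a risky change only against the copy. The `run_in_sandbox` helper and the `visits` table are hypothetical.

```python
import sqlite3


def run_in_sandbox(live, change_sql):
    """Copy the live database into an in-memory sandbox and apply the change there."""
    sandbox = sqlite3.connect(":memory:")
    live.backup(sandbox)  # full copy; the live database is only read
    sandbox.executescript(change_sql)
    sandbox.commit()
    return sandbox


def count_rows(conn, table):
    """Count rows in a table (table name is trusted here, for illustration only)."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]


# A made-up "live" database, held in memory so the sketch is self-contained.
live = sqlite3.connect(":memory:")
live.execute(
    "CREATE TABLE visits (id INTEGER PRIMARY KEY, site TEXT, households INTEGER)"
)
live.executemany(
    "INSERT INTO visits (site, households) VALUES (?, ?)",
    [("A", 4), ("B", 7)],
)
live.commit()

# The "harmless little script" runs against the copy, never the original.
sandbox = run_in_sandbox(live, "DELETE FROM visits WHERE households > 5;")

live_count = count_rows(live, "visits")
sandbox_count = count_rows(sandbox, "visits")
print("live rows:", live_count, "| sandbox rows:", sandbox_count)
```

In a real engagement the backup target would more likely be a dated file or the platform's own export than an in-memory copy, but the order of operations is the point: copy first, change second, compare before touching anything live.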
The four strategies above save my team countless hours with every engagement and give us a clean data foundation for lucid analysis. What are your best practices and lessons learned wrangling data?
In this series’ next post, I tackle the thorny subject of scoping and client expectations. You can read that here.