Speaker Series: Dave Robinson, Data Scientist at Stack Overflow
By Emily Wilson • February 09, 2016
As part of our ongoing speaker series, we had Dave Robinson in class last week in NYC to discuss his experience as a Data Scientist at Stack Overflow. Metis Sr. Data Scientist Michael Galvin interviewed him before his talk.
Watch the video to check out the conversation and read the full Q&A below.
Mike: First off, thanks for coming in and joining us. We have Dave Robinson from Stack Overflow here today. Can you tell me a little bit about your background and how you got into data science?
Dave: I did my Ph.D. at Princeton, which I finished last May. Near the end of the Ph.D., I was considering opportunities both inside academia and outside. I'd been a really long-time user of Stack Overflow and a huge fan of the site. I got to talking with them and I ended up becoming their first data scientist.
Mike: What did you get your Ph.D. in?
Dave: Quantitative and Computational Biology, which is kind of the interpretation and understanding of really large sets of gene expression data, telling when genes are turned on and off. That involves statistical and computational and biological insights all combined.
Mike: How did you find that transition?
Dave: I found it a lot easier than expected. I was really interested in the product at Stack Overflow, so getting to analyze that data was at least as interesting as analyzing biological data. I think that if you use the right tools, they can be applied to any domain, which is one of the things I love about data science. It wasn't using tools that would just work for one thing. Largely I work with R and Python and statistical methods that are equally applicable everywhere.
The biggest change has been switching from a scientific-minded culture to an engineering-minded culture. I used to have to convince people to use version control; now everyone around me does, and I am picking up things from them. On the other hand, I'm used to everyone knowing how to interpret a p-value, so what I'm learning and what I'm teaching have been sort of inverted.
Mike: That's a cool transition. What kinds of problems are you guys working on at Stack Overflow now?
Dave: We look at a lot of things, and some of them I'll talk about in my talk with the class today. My biggest example is, almost every developer in the world is going to visit Stack Overflow at least a couple times a week, so we have a picture, like a census, of the entire world's developer population. The things we can do with that are really great.
We have a jobs site where people post developer jobs, and we advertise them on the main site. We can then target those based on what kind of developer you are. When someone visits the site, we can recommend to them the jobs that best match them. Similarly, when they sign up to look for jobs, we can match them well with recruiters. That's a problem that we're really the only company with the data to solve.
Mike: What kind of advice would you give to junior data scientists who are getting into the field, especially coming from academics in the non-traditional hard science or data science?
Dave: The first thing is, people coming from academics, it's all about programming. I think sometimes people think that it's all learning more complicated statistical methods, learning more complicated machine learning. I'd say it's all about comfort programming and especially comfort programming with data. I came from R, but Python's equally good for these approaches. I think, especially academics are often used to having someone hand them their data in a clean form. I'd say go out to get it and clean the data yourself and work with it in programming rather than in, say, an Excel spreadsheet.
Mike: Where are most of your problems coming from?
Dave: One of the great things is that we had a back-log of things that data scientists could look at even when I joined. There were a few data engineers there who do really terrific work, but they come from mostly a programming background. I'm the first person from a statistical background. A lot of the questions we wanted to answer about statistics and machine learning, I got to jump into right away. The presentation I'm doing today is about the question of what programming languages are growing in popularity and decreasing in popularity over time, and that's something we have a really good data set to answer.
Mike: Yeah. That's actually a really good point, because there's this huge debate, but being at Stack Overflow you probably have the best insight, or data set in general.
Dave: We have even better insight into the data. We have traffic information, so not just how many questions are asked, but also how many are visited. On the career site, we also have people filling out their resumes over the past 20 years. So we can say, in 1996, how many employees used a language, or in 2000 how many people are using these languages, and other data questions like that.
Other questions we have are, how does the gender imbalance differ between languages? Our career data has names with them that we can identify, and we see that there are actually some differences of as much as 2- to 3-fold between programming languages in terms of the gender imbalance.
Mike: Now that you have insight into it, can you give us a little preview into where you think data science, meaning the tool stack, is going to be in the next 5 years? What do you guys use now? What do you think you're going to use in the future?
Dave: When I started, people weren't using any data science tools except things that we did in our production language, C#. I think the one thing that's clear is that both R and Python are growing really rapidly. While Python's a bigger language, in terms of usage for data science, the two are neck and neck. You can really see that in how people ask questions, visit questions, and fill out their resumes. They're both terrific and growing quickly, and I think they're going to take over more and more.
Mike: That's really cool. Well thanks again for coming in and chatting with me. I'm really looking forward to hearing your talk today.