Upcoming Seattle Events: RSVP here to attend the February 14 webinar on Building a Song Recommendation Engine Based Off Radio Playlist Data with Metis Intro to Data Science instructor Trent Hauck. Additionally, we're excited to have Rob McDaniel, Lead Data Scientist at LiveStories, give a talk on Effective Keyword Generation and Topic Modeling Using Python on February 16. RSVP here.
Last month, we had the pleasure of hosting a panel event on the topic of "Demystifying Data Science." The event was also our official Grand Opening in Seattle, a wonderful city where we can't wait to teach and train! We're kicking things off with an Introduction to Data Science part-time course, along with our full-time, 12-week Data Science Bootcamp, and more to come in the near future.
At the event, guests heard from Erin Shellman, Senior Data Scientist at Zymergen, Trey Causey, Senior Product Manager at Socrata, Joel Grus, Research Engineer at Allen Institute for Artificial Intelligence, and Claire Jaja, Senior Data Scientist at Atlas Informatics. Each provided insight into their personal journeys and current roles through a series of lightning talks followed by a moderated panel discussion.
Each of their full presentation decks is available here:
During the panel, the group discussed how the title of "data scientist" is often so loaded that its meaning isn't entirely clear.
"I think one of the ideas is that it's kind of an umbrella term, and anyone you find who's a data scientist could be totally different from another person who's a data scientist," said Joel Grus.
Each panelist broke down their daily work to give the audience a better idea of what a data scientist can mean in practice.
"A large part of what I do is analytical automation," said Erin Shellman. "At Zymergen, we are largely a testing company, we do a lot of comparing things against other things, and then we try to improve based on the comparisons we make. A lot of what I do is automate the processing that comes with that, and then test it to make it easier for our scientists to interpret the results and figure out what happened. Often we're asking hundreds of questions, and at the same time, we want to be able to figure out what happened, and what's good."
"It depends a lot on the size of the organization you work for," added Trey Causey. "For instance, say you work for a big social media company, where they might ask, 'What does engagement look like for the news feed this month, for stories that have images attached to them?' So you say, "Okay, I need to go look at the table for news feed interactions,' and there's going to be a flag on each of those interactions, whether or not that particular news item had a picture attached to it or not, and what was the dwell time, meaning how long was it in view for, and things like that."
Claire Jaja chimed in next, saying, "My job is a lot of a hodgepodge, and it's part of what working at a startup is. I run a lot of the production code, and I talk to designers, and I talk to people all over the place. Also, I help people think about things in a way where we can actually use the tools to approach it. I'm thinking about, 'Okay, is this the problem we're actually trying to solve? Is this actually the hypothesis we're trying to prove, or disprove? Okay, now here's how we could do that.'"
She emphasized the idea of being flexible if your company and position call for it, and being communicative with coworkers to ensure the job gets done well. "Sometimes it means we have to start gathering more data that we don't have currently; sometimes it means we have to see what we can do with what we have right now. There's a lot of scrappiness to it, and sometimes it feels like you're making your own work, because it's not very well defined a lot of times. You have to talk to people and massage it out to figure out what you actually want," she said.
Joel Grus went on to describe a recent project he's been working on with his team.
"Last month, I worked on this project called Aristo, and it's a sort of generalized approach to answering science questions," he said. "On my team, we were taking a look at the question: Can we answer science questions about a very specific sub-topic using a corpus of data only about that sub-topic? And the kinds of questions we were trying to answer are the sort of things you might find on a fourth-grade science exam. To give an example, and this was not our question, but a question might be: Jimmy wants to go rollerskating, which of the following would be the best choice of surface? A: Sand. B: Ice. C: Blacktop. D: Dirt.
"It's the sort of thing where, if you go to Google and type in that question, you're not going to get an exact answer," he continued. "You first have to know something about what roller skating means, what it entails, what the surfaces are like. It's a more subtle problem than it sounds like at first. So I was doing a lot of collecting of corpus data about specific topics by scraping the web and extracting sentences from that. I was trying a bunch of different approaches to answer the questions; I was training a Word2Vec model on those sentences, building IR lookup models on those sentences, and then trying to untangle those models to come up with the right answers to the questions."
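Joel's actual pipeline isn't public, but the IR-lookup half of the idea can be illustrated with a toy sketch: score each answer choice by how often it co-occurs with the question's keywords in a topic corpus. The corpus, scoring rule, and function names below are all invented for illustration; a real system would use a large scraped corpus and pair this with a trained Word2Vec model.

```python
# Tiny invented corpus of sentences "scraped" about the roller skating topic.
CORPUS = [
    "roller skating works best on a smooth hard surface",
    "skaters prefer blacktop because the surface is smooth",
    "blacktop is a smooth hard surface used for playgrounds",
    "sand is loose and soft so wheels sink into it",
    "ice is slippery and used for ice skating not roller skating",
    "dirt is bumpy and uneven",
]

def score_choice(question_terms, choice, corpus):
    """Count sentences mentioning the choice alongside any question term."""
    score = 0
    for sentence in corpus:
        words = set(sentence.split())
        if choice in words and words & question_terms:
            score += 1
    return score

def answer(question_terms, choices, corpus):
    """Pick the answer choice with the highest co-occurrence score."""
    return max(choices, key=lambda c: score_choice(question_terms, c, corpus))

terms = {"roller", "skating", "surface"}
choices = ["sand", "ice", "blacktop", "dirt"]
print(answer(terms, choices, CORPUS))  # "blacktop" wins on this toy corpus
```

This captures the "subtle problem" Joel describes: the right answer only falls out because the corpus encodes background knowledge connecting surfaces to skating, which a bare keyword search on the question alone would not supply.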
Audience members then asked the panelists a number of great questions. Here is a truncated version of that Q&A session:
Q: If somebody were entering the field and joining your company as a new data scientist, can you give an idea of what that person's work might look like?
Joel: Every job has a pretty idiosyncratic stack of tools. Especially with a junior person, you're probably not going to expect them to have experience using all those tools, so you have to be pretty mindful about saying, "Okay, I'm going to give this person projects where they can get acclimated to what we're doing."
Erin: I have an intern right now, so I'm thinking a little bit about the exercises I'm going through with him. I'm trying to put him in a position where he knows who in the company to talk to, because there are a lot of parts. He's going to be working on a model that's going to make predictions about things we should build and then test. He needs to talk to the people who are going to do the tests, and figure out the other players in the business who are going to be advocates for his work and consumers of it. And he needs to understand how to deliver his stuff to them so they can actually make use of it, so it doesn't become this demoralizing project where you've done a bunch of work and nobody can do anything with it.
Claire: Yes, having the answerable question, or helping [the new employee] frame it – that's where a lot of the learning happens, in how to frame the question. And then they can try different things, and you can be like, "Well, what have you learned here? Can we actually do this?"
Q: It seems like the main part of your jobs is knowing how to ask the right questions. So my question to you is: How do you train your management to ask you the right questions, so they can use data science more effectively?
Trey: That's a super question. I think that fits nicely with being careful of people who buy into the idea that data science solves everything. Setting expectations is hard to do for junior people a lot of the time – being able to say, "Here's what we're probably going to be able to accomplish. Here's what we're not." It's about product knowledge and business knowledge.
It's a lot about trust, on a number of levels. If a senior person asks you a question, sometimes you have to say, "That's not something we're going to be able to answer." Once you've established that trust, that's a legitimate answer – but before you have that trust, building it is your job.
Erin: A technique that I use that I find really effective is to think about the solution, assume that you have it, and then think about the inputs that would be required to get to that solution. Then you're able to lay that out, which provides you with a roadmap to say, "This is the state we all agree we want to be in, and here are the inputs you would need in order to get there. You need that, that, and that to be able to even start answering this question. So how do we get all of it?" That at least gives you a framework where you start with an agreement and then work up to saying, "Here's where we are now."
Trey: I really like that approach, and I actually use it in interviews a little bit, where I say, "Hey, here is a problem. Let's say you're trying to detect fraud or something like that. What kind of data would you need to build that model? What would some of your inputs look like?" Working backward from that end state tells you a lot about how a person approaches a problem, but you can use the other direction as well: here's where we're starting from; let's think about what we need to get there.
Q: I want to ask about the backgrounds and the traits that somebody should have coming into data science. On the background side, Trent, you made the point that a Ph.D. does not matter. I'm curious about your perspectives on the significance of an academic degree. At Metis, half of the bootcamp students come in with a master's or Ph.D. and half do not, so I'm really curious to hear your perspectives there.
Then on the traits side: curiosity has been mentioned repeatedly, and you've mentioned passion a few times. You just talked about creativity, so I'm curious whether there are certain traits that you believe are the DNA of a great data scientist?
Joel: I don't care what degree a person has at all. I used to, and I got burned by that in many different ways: this guy went to an awesome school and studied something hard, so he must be good. Nope. Not good. Conversely, I've worked with great people who didn't even go to college. So I don't give a lot of weight to that.
In terms of skills, I think the ones you listed are right. For me personally, I put a lot of weight on rigorous thinking, as well as abstraction. Those things are more important to me than they necessarily are to everyone.
Erin: I similarly don't really care about the degree at all. I will say that the training in a Ph.D. program – even if you're in a CS department – is not, in general, super technical. You're not getting a lot of technical skills. In general, the point of a Ph.D. program is to gain the skills to do your own research: the skills to frame questions, and to figure out what the next question would be if that question fails or becomes unanswerable.
You learn a lot of administrative things, and you also learn how to run your own research program. Those aren't necessarily skills you need day to day as a data scientist. But it really just depends on what you want to do. If you are interested in framing questions or running a research organization, then that might be more important to you. If you're not, and you just care about technical skills and general curiosity, then I don't think it matters really at all.
I would say that mine was not a waste of my time – but I was also in a band, doing all kinds of fun stuff, so I recommend grad school.
Claire: One thing I want to say, which I think is an important skill – not just for data scientists, but definitely for data scientists – is: not being afraid to admit when you're wrong.
When you're like, "I thought this would totally work," and it didn't work, be okay with that, because that's a big part of what this is. The faster you figure that out and say, "I was wrong. I thought that'd work, and it didn't," or "I tried this out, and whoops" – whatever it is – the faster you can recover and get to the next step.
I think that's a really big thing – and especially people entering the field are really nervous to say they're wrong or admit they don't know the answer. The sooner you become okay with saying that, the sooner you can learn more. That's always an opportunity to learn something else.
Q: If you have one piece of advice you want to leave everybody with, what would that advice be?
Claire: I think I've had a theme this whole night, so I'll just repeat that theme, which is: find your passion and think about what you're interested in – and it doesn't have to be data science; it can be something else. We all have interests and passions and things that really excite us, and there's always a way to apply data science to them. That's really cool; there's a lot of data out there.
Think to yourself, "Okay, here's this thing I really like doing on the side; how can I apply data science to it?" I think that's a really good path forward.
Trey: Learn the business; learn the products. Why are people using your product? What are they trying to accomplish with it? Keep asking that – and ask whether the data reflects the actual reason.
Joel: I'm a big skeptic of the long-term demand for data scientists who aren't really solid coders, so become a really solid coder.
Erin: I would say, do things before you feel like you're ready to do them. As an example, applying and interviewing for a job is a great way to learn how interviews work. I would recommend, for example, applying to your C- and B-list jobs before your A-list jobs to get some experience doing that. If, in your mind, you're telling yourself, "I'm not ready to do this thing" – whatever that is – just do it, and you'll learn from the experience. And don't take it personally; it's not about you, you're still a good person.