Data science has become the central approach to tackling data-heavy problems in both business and academia. In this course, students learn how data science is done in the wild, with a focus on data acquisition, cleaning, and aggregation, exploratory data analysis and visualization, feature engineering, and model creation and validation. Students use the Python scientific stack to work through real-world examples that illustrate these concepts. Concurrently, students learn some of the statistical and mathematical foundations that power the data-scientific approach to problem solving.
Who is this course for?
Introduction to Data Science is for anyone with a basic understanding of data analysis techniques and anyone interested in improving their ability to tackle problems involving multi-dimensional data in a systematic, principled way. Familiarity with a programming language is helpful but not required, provided the pre-work for the course is completed (more on that below). No advanced mathematical training beyond an introductory statistics course is necessary.
Students should have some experience with Python and have some familiarity with basic statistical and linear algebraic concepts such as mean, median, mode, standard deviation, correlation, and the difference between a vector and a matrix. In Python, it will be helpful to know basic data structures such as lists, tuples, and dictionaries, and what distinguishes them (that is, when they should be used).
Students should skip the pre-work if they can accomplish all of the following:
Write a program in Python that finds the most frequently occurring word in a given sentence.
Explain the difference between correlation and covariance, and why the difference between the two terms matters.
Multiply two small matrices together (e.g., a 3×2 matrix and a 2×4 matrix).
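For orientation, the three self-check tasks above could be approached roughly as follows. This is only a sketch using the Python standard library; the function names (`most_frequent_word`, `covariance`, `correlation`, `matmul`) are illustrative choices, not part of the course materials.

```python
from collections import Counter
import math
import re


def most_frequent_word(sentence):
    """Return the most common word in `sentence`, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", sentence.lower())
    # most_common(1) returns a one-element list holding a (word, count) pair.
    return Counter(words).most_common(1)[0][0]


def covariance(xs, ys):
    """Sample covariance: its size depends on the units of xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the unit-free range [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def matmul(a, b):
    """Multiply an n x k matrix by a k x m matrix, both given as lists of rows."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

Rescaling one variable (say, from meters to centimeters) multiplies the covariance by the same factor but leaves the correlation unchanged, which is one reason the distinction matters. Multiplying a 3×2 matrix by a 2×4 matrix with `matmul` yields a 3×4 result.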
Otherwise, students should complete the following pre-work (approximately 8 hours) before the first day of class:
Charles Givre has worked as a Senior Lead Data Scientist at Booz Allen Hamilton for the last six years, where he works at the intersection of cyber security and data science. For the last few years, he worked on one of Booz Allen's largest analytic programs, where he led data science efforts and worked to expand the role of data science in the program. He is passionate about teaching others data science and analytic skills and has taught data science classes all over the world at conferences, at universities, and for clients. Most recently, he taught a data science class at the BlackHat conference in Las Vegas and at the Center for Research in Applied Cryptography and Cyber Security at Bar Ilan University. He is a sought-after speaker and has delivered presentations at major industry conferences such as Strata-Hadoop World, BlackHat, Open Data Science Conference, and others.
One of Charles Givre's research interests is increasing the productivity of data science and analytic teams. Toward that end, he has been working extensively to promote the use of Apache Drill in security applications and has contributed to the code base. He teaches online classes for O'Reilly about Drill and Security Data Science and is a coauthor of the forthcoming O'Reilly book about Apache Drill. Prior to joining Booz Allen, he worked as a counterterrorism analyst at the Central Intelligence Agency for five years. He holds a Master's degree in Middle Eastern Studies from Brandeis University, as well as a Bachelor of Science in Computer Science and a Bachelor of Music, both from the University of Arizona. He holds various certifications, including CISSP, Security+, Network+, Certified Penetration Tester, and CDIA+. He speaks French reasonably well, plays trombone, lives in Baltimore with his family, and, in his nonexistent spare time, is restoring a classic British sports car. He blogs at thedataist.com and tweets @cgivre.
Trent Hauck is currently consulting in the insurance and e-commerce industries, using Data Science to improve operations and customer experience for his clients. Prior to that, he was a Data Scientist at Zulily working on the Relevancy team, and prior to that led reporting and analysis for several clients at a marketing analytics agency. He is also the author of two books through Packt Publishing: Instant Data Intensive Apps with pandas How-to, and Scikit-Learn Cookbook. In his free time he enjoys drinking coffee and staring off into space.
Course Structure and Syllabus
Class sessions are a mix of lectures/instruction and hands-on programming/lab work. See below for a week-by-week breakdown:
CS/Statistics/Linear Algebra Short Course
Start with the basics. In the CS portion, we briefly cover basic data structures/types, program control flow, and syntax in Python. For statistics, we go over basic probability and probability distributions, along with general properties of some common distributions. As for linear algebra, we cover matrices, vectors, and some of their properties and how to use them in Python.
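As a rough illustration of the statistics and linear algebra pieces, the sketch below uses NumPy, which is assumed here as the array library from the Python scientific stack. The distribution parameters and the arrays `v` and `A` are arbitrary values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the run is repeatable

# Statistics: draw samples from a normal distribution and compare the sample
# statistics with the parameters used to generate them.
samples = rng.normal(loc=10.0, scale=2.0, size=100_000)
sample_mean = samples.mean()
sample_std = samples.std(ddof=1)  # ddof=1 gives the sample standard deviation

# Linear algebra: a vector is a 1-D array, a matrix is a 2-D array.
v = np.array([1.0, 2.0])
A = np.array([[2.0, 0.0],
              [0.0, 3.0],
              [1.0, 1.0]])  # shape (3, 2)
Av = A @ v                  # matrix-vector product, shape (3,)
```

With this many samples, `sample_mean` and `sample_std` should land close to 10 and 2, and `Av` has one entry per row of `A`.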
Exploratory Data Analysis and Visualization
We spend a considerable amount of time using the Pandas Python package to attack a dataset we’ve never seen before and to uncover some useful information from it. At this point, students decide on a course project that would benefit from the data-scientific approach. The project must involve public (freely accessible and usable) data and must answer an interesting question – or collection of questions – about that data. (Note: Several sources of free data will be provided.)
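A first pass over an unfamiliar dataset with Pandas might look something like the sketch below. The tiny table is made-up toy data standing in for a real public dataset, and the column names are hypothetical.

```python
import pandas as pd

# Made-up toy records; a real project would load public data, e.g. with pd.read_csv.
df = pd.DataFrame({
    "city": ["NY", "NY", "SF", "SF", "CHI", "CHI"],
    "price": [12.0, 15.0, 20.0, None, 9.0, 11.0],
    "units": [3, 5, 2, 4, 6, 1],
})

missing_per_column = df.isna().sum()                       # missing values in each column
summary = df.describe()                                    # count, mean, std, quartiles
mean_price_by_city = df.groupby("city")["price"].mean()    # missing prices are skipped
price_units_corr = df["price"].corr(df["units"])           # Pearson correlation
```

Checking for missing values before aggregating is a common habit, since summaries such as the per-city mean quietly skip them.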
Data Modeling: Supervised / Unsupervised Learning and Model Evaluation
We learn about the two basic kinds of statistical models that have classically been used for prediction (supervised learning): Linear Regression and Logistic Regression. We also look at clustering using K-Means, one of the ways you can glean information from unlabeled data.
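To make the three techniques concrete, here is a minimal sketch using scikit-learn, which is assumed here as the modeling library. All data is synthetic and generated inside the example, so the numbers carry no meaning beyond illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(seed=0)

# Linear regression predicts a number. Synthetic data: y = 3x + 2 plus small noise.
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=200)
linreg = LinearRegression().fit(X, y)
slope, intercept = linreg.coef_[0], linreg.intercept_

# Logistic regression predicts a class. Here the label is 1 when x is above 5.
labels = (X[:, 0] > 5).astype(int)
logreg = LogisticRegression().fit(X, labels)
train_accuracy = logreg.score(X, labels)

# K-Means groups unlabeled points. Two well-separated synthetic blobs.
blob_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([blob_a, blob_b])
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = kmeans.labels_  # one cluster index per point, no labels required
```

The fitted `slope` and `intercept` should come out close to the 3 and 2 used to generate the data. Note that `train_accuracy` is measured on the training data itself, so it overstates how the model would do on new data.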
Data Modeling: Feature Selection, Engineering, and Data Pipelines
We switch gears from talking about algorithms to talk about features. What are they? How do we engineer them? And what can be done (Principal Component Analysis / Independent Component Analysis, regularization) to create and use them given the data at hand? We also cover how to construct complete data pipelines, going from data ingestion and preprocessing to model construction and evaluation.
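One way such a pipeline could be assembled is sketched below with scikit-learn's `Pipeline`, which is an assumption about tooling rather than a statement of what the course uses. The dataset is synthetic, and the step names and the choice of 10 components are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an ingested dataset: 20 features, only 5 informative.
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           n_redundant=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

pipeline = Pipeline([
    ("scale", StandardScaler()),           # preprocessing: zero mean, unit variance
    ("pca", PCA(n_components=10)),         # feature extraction: keep 10 components
    ("model", LogisticRegression(C=1.0)),  # L2-regularized; smaller C penalizes more
])
pipeline.fit(X_train, y_train)             # every step is fit on training data only
test_accuracy = pipeline.score(X_test, y_test)
```

Bundling the steps means the scaler and PCA are fit only on the training split, which avoids leaking information from the test split into preprocessing.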
Data Modeling: Advanced Supervised / Unsupervised Learning
We delve into more advanced supervised learning approaches, during which we get a feel for linear support vector machines, decision trees, and random forest models for regression and classification. We also explore DBSCAN, an additional unsupervised learning approach.
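The sketch below tries the models named above on a small synthetic "two moons" dataset, again assuming scikit-learn. The hyperparameters (`max_depth`, `n_estimators`, `eps`, `min_samples`) are illustrative guesses, not recommended settings.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.tree import DecisionTreeClassifier

# Two interleaved half-circles: a shape a straight line cannot separate perfectly.
X, y = make_moons(n_samples=400, noise=0.05, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

models = {
    "linear_svm": LinearSVC(),
    "decision_tree": DecisionTreeClassifier(max_depth=5, random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Fit each classifier on the training split and score it on the held-out split.
scores = {name: model.fit(X_train, y_train).score(X_test, y_test)
          for name, model in models.items()}

# DBSCAN ignores the labels and groups points by density; -1 marks noise points.
cluster_ids = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)
n_clusters = len(set(cluster_ids.tolist()) - {-1})
```

Comparing `scores` across the three classifiers gives a feel for how a linear decision boundary differs from tree-based ones on curved data. Unlike K-Means, DBSCAN does not need the number of clusters in advance.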
Data Modeling: Advanced Model Evaluation and Data Pipelines | Presentations
We explore more sophisticated model evaluation approaches (cross-validation and bootstrapping) with the goal of understanding how we can make our models as generalizable as possible. Students complete their data science projects and share learnings and discoveries.
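Both evaluation ideas can be sketched in a few lines, assuming scikit-learn and NumPy. The data is synthetic, and the fold count (5) and number of bootstrap resamples (1000) are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

X, y = make_classification(n_samples=400, n_features=10, random_state=0)
model = LogisticRegression(max_iter=1000)

# Cross-validation: train on 4 folds, score on the 5th, rotate through all 5.
cv_scores = cross_val_score(model, X, y, cv=5)

# Bootstrapping: resample held-out results with replacement to estimate how
# much the measured accuracy could vary.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)
correct = model.fit(X_train, y_train).predict(X_test) == y_test
rng = np.random.default_rng(seed=0)
boot_acc = [rng.choice(correct, size=correct.size, replace=True).mean()
            for _ in range(1000)]
ci_low, ci_high = np.percentile(boot_acc, [2.5, 97.5])  # rough 95% interval
```

The spread of `cv_scores` and the width of the interval from `ci_low` to `ci_high` both indicate how far a single accuracy number can be trusted.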
I believe Sergey is an EXCELLENT teacher and he has created an excellent curriculum. There are not that many teachers out there that can take a relatively difficult topic, like Data Science, and teach it clearly, succinctly and in a fun way. Sergey is clearly very knowledgeable on the topic and brought cutting edge teachings to the class. Fabulous! He also genuinely wants to see everyone succeed. That inspired us all to do our very best.
— Robyn Reid, Introduction to Data Science (NY) alumnus
I felt supported in my journey into Data Science throughout the class. I started the class with only a bare minimum knowledge of Python and ended knowing how to create several machine learning models in the span of six weeks. I'd say that's a win. My only regret was that it ended so quickly.
— Drace Zhan
As a BI professional with 20+ years experience, I found the Metis Intro to Data Science course to be exactly the shot in the arm my career needed to upgrade my skills.
— Ronald Haynes
The Intro to Data Science class provides a great overview on Data Science topics. [The instructor] clearly explains the main points and always takes extra mileage to help us to understand the issue whether in the class or after hours via group chat. I am really impressed on the lectures and have learned a lot from this intensive class.
— Linda Fu
The class was an excellent introduction to this topic, and I now feel so much more prepared to continue learning about data science on my own.
— Laura Pederson