Introduction to Data Science

Introduction to Data Science Overview

Data science has become the central approach to tackling data-heavy problems in both business and academia. In this course, students learn how data science is done in the wild, with a focus on data acquisition, cleaning, and aggregation; exploratory data analysis and visualization; feature engineering; and model creation and validation. Students use the Python scientific stack to work through real-world examples that illustrate these concepts. Concurrently, students learn some of the statistical and mathematical foundations that power the data-scientific approach to problem solving.

This course is offered both in-person at Metis campuses and Live Online from anywhere. Sign up for a free Live Online sample class.

Who is this course for?

Introduction to Data Science is for anyone with a basic understanding of data analysis techniques and anyone interested in improving their ability to tackle problems involving multi-dimensional data in a systematic, principled way. Familiarity with a programming language is helpful but not required, provided the pre-work for the course is completed (more on that below). No prior advanced mathematical training beyond an introductory statistics course is necessary.

Considering the data science immersive bootcamp?

Part-Time Alumni can apply the amount of tuition paid for one part-time professional development course towards enrollment in an upcoming bootcamp upon admittance.


Students should have some experience with Python and have some familiarity with basic statistical and linear algebraic concepts such as mean, median, mode, standard deviation, correlation, and the difference between a vector and a matrix. In Python, it will be helpful to know basic data structures such as lists, tuples, and dictionaries, and what distinguishes them (that is, when they should be used).

Students should skip the pre-work if they can accomplish all of the following:

  1. Write a program in Python that finds the most frequently occurring word in a given sentence.
  2. Explain the difference between correlation and covariance, and why the difference between the two terms matters.
  3. Multiply two small matrices together (e.g., 3×2 and 2×4 matrices).
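
As a rough gauge of the level these three tasks assume, here is a minimal sketch using only the Python standard library. It is one possible approach, not an official solution. The function names, the punctuation handling, and the choice of sample (n - 1) covariance are illustrative assumptions.

```python
from collections import Counter
from statistics import mean


def most_frequent_word(sentence):
    """Return the most common word, ignoring case and surrounding punctuation."""
    words = [w.strip(".,;:!?\"'()").lower() for w in sentence.split()]
    words = [w for w in words if w]
    return Counter(words).most_common(1)[0][0]


def covariance(xs, ys):
    """Sample covariance: its size depends on the units of xs and ys."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to the unitless range [-1, 1]."""
    return covariance(xs, ys) / (covariance(xs, xs) ** 0.5 * covariance(ys, ys) ** 0.5)


def matmul(a, b):
    """Multiply an m-by-n matrix by an n-by-p matrix, both given as lists of rows."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

Rescaling one variable (say, metres to centimetres) changes the covariance but leaves the correlation unchanged, which is the practical reason the distinction in task 2 matters.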

Otherwise, students should complete the following pre-work (approximately 8 hours) before the first day of class:

  1. Exercises 1-7, 13, 18-21, 27-35, 38, 39 of Learn Python The Hard Way.
  2. Videos 1-6 of the Linear Algebra Review from Andrew Ng’s Machine Learning course (labeled "III. Linear Algebra Review (Week 1, Optional)").
  3. The exercises in Chapters 2 and 3 of OpenIntro Statistics.


Upon completing the course, students have:

An understanding of problems solvable with data science and an ability to attack those problems from a statistical perspective.
An understanding of when to use supervised and unsupervised statistical learning methods on labeled and unlabeled data-rich problems.
The ability to create data analytical pipelines and applications in Python.
Familiarity with the Python data science ecosystem and the various tools one can use to continue developing as a data scientist.
Omoju Miller

Omoju Miller is a Senior Machine Learning Data Scientist at GitHub. She has over a decade of experience in computational intelligence and holds a Ph.D. from UC Berkeley. In the past, she co-led the non-profit investment in Computer Science Education for Google and served as a volunteer advisor to the Obama administration's White House Presidential Innovation Fellows. She is considered one of the folks to watch as part of Bloomberg Beta's Future Founders program. She is a member of the World Economic Forum Expert Network in AI.

Trent Hauck

Trent Hauck is currently a Senior Data Scientist at Zymergen, where he builds data products that help biologists and other scientists improve their decision outcomes. Prior to that role, he spent 18 months consulting in the insurance and e-commerce industries, and before that he worked at Zulily on the relevancy team, building recommendation tools for the e-commerce site and conducting operational analysis. He is also the author of two books published by Packt Publishing: Instant Data Intensive Apps with pandas How-to and Scikit-Learn Cookbook. Ask him to get coffee; he can't say no.

Sergey Fogelson

Sergey Fogelson is the vice president of analytics and measurement sciences at Viacom. He began his career as an academic at Dartmouth College in Hanover, New Hampshire, where he researched the neural bases of visual category learning and obtained his Ph.D. in Cognitive Neuroscience. After leaving academia, Sergey got into the rapidly growing startup scene in the NYC metro area, where he has worked as a data scientist in alternative energy analytics, digital advertising, cybersecurity, finance, and media. He is heavily involved in the NYC-area teaching community: he has taught courses at various bootcamps and has been a volunteer computer science teacher through TEALSK12. When Sergey is not working or teaching, he is probably hiking. (He thru-hiked the Appalachian Trail before graduate school.)

Course Structure and Syllabus

Class sessions are a mix of lectures/instruction and hands-on programming/lab work. See below for a week-by-week breakdown:

Week 1

CS/Statistics/Linear Algebra Short Course

Start with the basics. In the CS portion, we briefly cover basic data structures/types, program control flow, and syntax in Python. For statistics, we go over basic probability and probability distributions, along with general properties of some common distributions. As for linear algebra, we cover matrices, vectors, and some of their properties and how to use them in Python.

Week 2

Exploratory Data Analysis and Visualization

We spend a considerable amount of time using the pandas Python package to attack a dataset we’ve never seen before and to uncover some useful information from it. At this point, students decide on a course project that would benefit from the data-scientific approach. The project must involve public (freely accessible and usable) data and must answer an interesting question – or collection of questions – about that data. (Note: Several sources of free data will be provided.)

Week 3

Data Modeling: Supervised / Unsupervised Learning and Model Evaluation

We learn about the two basic kinds of statistical models that have classically been used for prediction (supervised learning): linear regression and logistic regression. We also look at clustering using K-Means, one of the ways you can glean information from unlabeled data.
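
To make the supervised/unsupervised distinction concrete, here is a minimal standard-library sketch of a one-feature least-squares line fit and a small K-Means loop. It is illustrative only and is not the course's own material. The function names and the deterministic "first k points" initialization are assumptions made to keep the example short, and logistic regression is omitted.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (supervised: ys are labels)."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return my - slope * mx, slope


def k_means(points, k, iterations=20):
    """Lloyd's algorithm on 2-D points (unsupervised: no labels are used).

    Centroids start at the first k points, an illustrative choice that keeps
    the sketch deterministic; practical implementations use random restarts.
    """
    centroids = [tuple(p) for p in points[:k]]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Move each centroid to the mean of its cluster; leave it alone if the cluster is empty.
        centroids = [
            (mean(p[0] for p in c), mean(p[1] for p in c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters
```

The line fit needs a known target value for every input, while K-Means receives only the points themselves and groups them by proximity.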

Week 4

Data Modeling: Feature Selection, Engineering, and Data Pipelines

We switch gears from talking about algorithms to talk about features. What are they? How do we engineer them? And what can be done (Principal Component Analysis / Independent Component Analysis, regularization) to create and use them given the data at hand? We also cover how to construct complete data pipelines, going from data ingestion and preprocessing to model construction and evaluation.

Week 5

Data Modeling: Advanced Supervised / Unsupervised Learning

We delve into more advanced supervised learning approaches, during which we get a feel for linear support vector machines, decision trees, and random forest models for regression and classification. We also explore DBSCAN, an additional unsupervised learning approach.

Week 6

Data Modeling: Advanced Model Evaluation and Data Pipelines | Presentations

We explore more sophisticated model evaluation approaches (cross-validation and bootstrapping) with the goal of understanding how we can make our models as generalizable as possible. Students complete their data science projects and share learnings and discoveries.
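
As a rough illustration of the two evaluation ideas named above, the sketch below builds k-fold train/test splits and bootstrap resamples using only the standard library. The function names are hypothetical. The "predict the training mean" baseline stands in for a real model, and the folds are not shuffled, which a practical workflow would usually do first.

```python
import random
from statistics import mean


def k_fold_indices(n, k):
    """Yield (train, test) index lists; each index is held out exactly once."""
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validated_mse(ys, k=5):
    """Score a baseline that predicts the training mean, averaged over k folds."""
    scores = []
    for train, test in k_fold_indices(len(ys), k):
        prediction = mean(ys[i] for i in train)
        scores.append(mean((ys[i] - prediction) ** 2 for i in test))
    return mean(scores)


def bootstrap_means(values, n_resamples=1000, seed=0):
    """Resample with replacement and return the mean of each resample."""
    rng = random.Random(seed)
    return [mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)]
```

Cross-validation estimates how a model performs on data it was not fit to, while the spread of the bootstrap means indicates how much an estimate would vary across samples of the same size.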