Data Science

Data science has become the central approach to tackling data-heavy problems in both business and academia. In this course, students learn how data science is done in the wild, with a focus on data acquisition, cleaning, and aggregation; exploratory data analysis and visualization; feature engineering; and model creation and validation. Students use the Python scientific stack to work through real-world examples that illustrate these concepts. Concurrently, students learn some of the statistical and mathematical foundations that power the data-scientific approach to problem solving.

Introduction to Data Science is for anyone with a basic understanding of data analysis techniques and anyone interested in improving their ability to tackle problems involving multi-dimensional data in a systematic, principled way. Familiarity with a programming language is helpful but not required, provided the pre-work for the course is completed (more on that below). No advanced mathematical training beyond an introductory statistics course is necessary.

Students should have some experience with Python and have some familiarity with basic statistical and linear algebraic concepts such as mean, median, mode, standard deviation, correlation, and the difference between a vector and a matrix. In Python, it will be helpful to know basic data structures such as lists, tuples, and dictionaries, and what distinguishes them (that is, when they should be used).

Students should skip the pre-work if they can accomplish all of the following:

- Write a program in Python that finds the most frequently occurring word in a given sentence.
- Explain the difference between correlation and covariance, and why the difference between the two terms matters.
- Multiply two small matrices together (e.g. 3×2 and 2×4 matrices).
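
As a rough self-check, the three tasks above can be sketched in plain Python. This is only an illustration, not an official course solution: the function names (`most_frequent_word`, `covariance`, `correlation`, `matmul`) are hypothetical, ties between equally common words are not handled specially, and the sample (n - 1) form of covariance is an assumed choice.

```python
from collections import Counter
import math
import re


def most_frequent_word(sentence):
    """Return the most common word, ignoring case and punctuation."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return Counter(words).most_common(1)[0][0]


def covariance(xs, ys):
    """Sample covariance; its size depends on the units of xs and ys."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    products = ((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sum(products) / (len(xs) - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie in [-1, 1]."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


def matmul(a, b):
    """Multiply an m x n matrix by an n x p matrix (both lists of rows)."""
    if len(a[0]) != len(b):
        raise ValueError("inner dimensions must match")
    # zip(*b) iterates over the columns of b.
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]
```

Multiplying one variable by 10 multiplies the covariance by 10 but leaves the correlation unchanged, which is one reason the distinction between the two matters. Passing a 3×2 and a 2×4 matrix to `matmul` yields a 3×4 result.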

Otherwise, students should complete the following pre-work (approximately 8 hours) before the first day of class:

- Exercises 1-7, 13, 18-21, 27-35, 38, and 39 of Learn Python The Hard Way.
- Videos 1-6 of the Linear Algebra Review from Andrew Ng’s Machine Learning course (labeled "III. Linear Algebra Review (Week 1, Optional)").
- The exercises in Chapters 2 and 3 of OpenIntro Statistics.

Upon completing the course, students have:

- An understanding of problems solvable with data science and an ability to attack those problems from a statistical perspective.
- An understanding of when to use supervised and unsupervised statistical learning methods on labeled and unlabeled data-rich problems.
- The ability to create data analytical pipelines and applications in Python.
- Familiarity with the Python data science ecosystem and the various tools one can use to continue developing as a data scientist.

##### Drew Fustin

###### CHI Instructor

##### Trent Hauck

###### SEA Instructor

##### Sergey Fogelson

###### NYC Instructor

##### T.J. Bay

###### SF Instructor

Drew is a reformed physicist with a heart for the Chicago tech scene. He currently serves as the Lead Data Scientist at SpotHero, where his responsibilities range from building a marketing attribution model and optimizing ad spend to creating a rate recommendation engine for parking garages to forecasting future company revenues. His prior experience includes a stint with GrubHub as the Insights Analyst, turning food facts into media content for the PR department and transforming data into actionable initiatives within the organization. In the startup space, he was a Data Scientist with Digital H2O, providing water intelligence for the oil/gas industry. He holds a PhD in physics from the University of Chicago, where he studied dark matter by looking for tiny bubbles in a chamber over a mile underground in a Canadian nickel mine.

Trent Hauck is currently a Senior Data Scientist at Zymergen. Prior to that, he was a Data Scientist at Zulily working on the Relevancy team, and prior to that led reporting and analysis for several clients at a marketing analytics agency. He is also the author of two books through Packt Publishing: Instant Data Intensive Apps with pandas How-to, and Scikit-Learn Cookbook. In his free time he enjoys drinking coffee and staring off into space.

Sergey Fogelson is a data science consultant currently working in the financial industry. He began his career as an academic at Dartmouth College in Hanover, New Hampshire, where he researched the neural bases of visual category learning and obtained his Ph.D. in Cognitive Neuroscience. After leaving academia, Sergey got into the rapidly growing startup scene in the NYC metro area, where he has worked as a data scientist in alternative energy analytics, digital advertising, and cybersecurity. He is heavily involved in the NYC-area teaching community: he has taught courses at various bootcamps and has volunteered as a computer science teacher through TEALSK12. When Sergey is not working or teaching, he is probably hiking. (He thru-hiked the Appalachian Trail before graduate school.)

T.J. is a data scientist at Alpine Data, where he has worked on projects in finance, healthcare, manufacturing, and government. Prior to leaving academia, he taught astronomy and received a Ph.D. in physics from Stanford, where he worked on experiments to measure deviations in the strength of short-distance gravity and on a novel photon detector based on superconductors. He is nearing the end of a very long project that involves whiskey and rewatching every episode of Star Trek, even the animated ones.