for Data Science and Machine Learning

This course serves as an introduction to the basic statistical principles that data scientists and applied statisticians use most often. Many of the concepts will be reinforced through exercises in the statistical programming language R, one of the two most popular languages for data science.

The intent of this course is to expose students to common statistical issues and teach them how to avoid statistical fallacies. We begin with a high-level overview of probability and common statistical estimates, then move on to more advanced topics such as multiple hypothesis testing, independence, sample size and power calculations, and bootstrapping.

By the end of the course, students will have a fundamental understanding of many of the statistical principles that underlie machine learning and data science.

This course is open to beginners, but students should have some experience with coding (Python or R preferred but not required) and a basic understanding of calculus, linear algebra and probability. A brief review will be provided, but prior experience would be very helpful.

Students may opt to skip the pre-work if they:

- Have taken an introductory college course in statistics or probability
- Are familiar with linear algebra (through either coursework or work experience)
- Are able to:
  - Run a hypothesis test to determine whether a coin is fair given 100 flips
  - Calculate a confidence interval for the mean height given 100 observations
  - Explain how to test whether events are independent
  - Use Bayes' rule to find the probability of an event given another event
  - Fit a linear model in R
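
As a rough self-check for the first two tasks in the list above, the sketch below runs an exact two-sided binomial test for a fair coin and computes a normal-approximation confidence interval for a mean. It is written in Python with only the standard library rather than R, and the function names and the data (61 heads, simulated heights in cm) are illustrative assumptions, not course material.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def binomial_two_sided_p(heads, flips, p=0.5):
    """Exact two-sided binomial p-value: total probability of outcomes
    no more likely than the observed count under H0: P(heads) = p."""
    pmf = [math.comb(flips, k) * p**k * (1 - p) ** (flips - k)
           for k in range(flips + 1)]
    observed = pmf[heads]
    # Small relative tolerance guards against floating-point ties.
    return min(1.0, sum(q for q in pmf if q <= observed * (1 + 1e-9)))


def mean_confidence_interval(values, level=0.95):
    """Normal-approximation confidence interval for the mean."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    centre = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return centre - half_width, centre + half_width


# Hypothetical data: 61 heads observed in 100 flips.
p_value = binomial_two_sided_p(heads=61, flips=100)

# Hypothetical heights in cm, simulated with a fixed seed.
rng = random.Random(0)
heights = [rng.gauss(170, 10) for _ in range(100)]
low, high = mean_confidence_interval(heights)

print(f"coin p-value: {p_value:.4f}")
print(f"95% CI for mean height: ({low:.1f}, {high:.1f})")
```

A small p-value suggests the coin may not be fair. With 100 observations the normal approximation is usually reasonable, though a t-based interval would be slightly wider.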

Otherwise, students should familiarize themselves with Chapters 1-6 of CK-12 Foundation’s Basic Probability and Statistics – A Short Course. Each chapter should take between 1 and 2 hours.

Upon completion of the course, students will have:

An understanding of basic statistical hypothesis testing and confidence intervals.

The ability to model data using well-known statistical distributions and to handle both continuous and categorical data.

The ability to perform linear regression and adjust for multiple hypothesis testing.

An understanding of how to calculate the number of samples needed to achieve required sensitivity and specificity.

An understanding of bootstrapping and Monte Carlo simulation.

##### Greg Ryslik

###### SF Instructor

##### Paul Trowbridge

###### NY Instructor

Greg Ryslik graduated summa cum laude from Rutgers College and Rutgers Business School with a triple major in mathematics, computer science and finance. He then went on to complete a master's degree in statistics from Columbia University as well as a PhD in biostatistics from Yale University.

He has extensive experience as a teacher and a tutor, and has given talks in the United States and internationally. In the fall of 2011, Gregory was one of a select few chosen to participate in UCLA’s prestigious Institute for Pure and Applied Mathematics. In academia, he has helped students learn categorical data analysis, design & analysis of epidemiological models, longitudinal data analysis, introduction to statistics and calculus. He is currently an Adjunct Assistant Professor with the statistics department at the Pennsylvania State University.

During his career, he has co-founded several companies, published an actuarial textbook, and worked both on Wall Street and in biotech. He has written several publicly available bioinformatics software packages and is an author of numerous scientific publications in journals such as Nature and BMC Bioinformatics. More recently, he led the Data Science team for Service at Tesla Motors, and he is currently the Head of Data Science and Analytics at Faraday Future.

Paul Trowbridge received advanced training in statistics, demography and sociology from the University of Washington and Rutgers University. He has worked in applied fields such as fMRI, epidemiology and public health, international relations, urban planning and micro-simulation modeling. He has taught statistics, data science and data visualization through New York University's School of Professional Studies.