
If you want to become an expert in machine learning, the learning process cannot be rushed. However, by taking a tactical and focused approach, you can efficiently learn a wide range of foundational machine learning tools and effectively apply them to real-world projects. While there are many paths that might take you to this goal, we’d like to share our own suggestions drawn from years of experience in accelerated data science training. In this post, we’ll provide a roadmap highlighting major points on a path to learn machine learning over 6 months.

The *somewhat* unfortunate truth is that machine learning is an enormous and quickly evolving field. It can feel overwhelming just to get started, but it should also feel exciting! You may have already tried jumping in at the point where you want to **use** machine learning to build models, only to be daunted by the sheer number of options and details. This can make machine learning feel

We strongly believe that the key to using machine learning well is to start far upstream, grappling with the building blocks of the field. You need to understand what’s happening “under the hood” of the many machine learning algorithms before you can be ready to properly apply them to ‘real’ data. You also need fundamental programming skills before you can effectively work with more specialized tools for machine learning code. So let’s dive into those building blocks.

There are 3 overarching skill sets that make up machine learning (there are actually many more, but these 3 are the root topics):

- “Pure” Math (Calculus, Linear Algebra)
- Probability and Statistics (applied Math)
- Programming (generally in Python/R)

In practice, you have to be comfortable with the underlying mathematics before machine learning will make any sense. For instance, if you aren’t familiar with thinking in vector spaces and working with matrices, then thinking about feature spaces and decision boundaries will be a real struggle. These concepts are central to classification algorithms in machine learning, and they are a clear example of how needlessly complex the algorithms will seem if you don’t have the math down.
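To make the link between vectors and decision boundaries concrete, here is a minimal sketch in plain Python. It assumes a two-dimensional feature space and a straight-line boundary; the weights `w`, the offset `b`, and the sample points are made-up values for illustration, not the output of any trained model.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def classify(x, w, b):
    """Return 1 if x lies on the positive side of the line w.x + b = 0, else 0."""
    return 1 if dot(w, x) + b > 0 else 0


# Hypothetical weights: the boundary is the line x1 = x2
w = [1.0, -1.0]
b = 0.0

# Each point is a feature vector; which side of the line it falls on is its class
points = [[3.0, 1.0], [1.0, 3.0], [2.0, 0.5]]
labels = [classify(p, w, b) for p in points]
print(labels)  # [1, 0, 1]
```

A linear classifier is, at heart, a way of choosing `w` and `b` from data, which is why comfort with vectors pays off so quickly.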

Beyond that, everything in machine learning is code driven. To get the data, you'll need code. To process the data, you'll need code. To interact with the machine learning algorithms, you'll need code (even if using algorithms someone else wrote).

Within the broad umbrella of math, the place to start is linear algebra. MIT has an open course on Linear Algebra. This should introduce you to all of the core concepts, and you should pay particular attention to vectors, Euclidean distance, matrix multiplication, determinants, and eigenvector decomposition, all of which play heavy roles as the cogs that make machine learning algorithms go.
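As a quick illustration of the concepts listed above, the following sketch uses NumPy (our choice here; any linear algebra library would do) on small, made-up vectors and a made-up symmetric matrix.

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Euclidean distance: sqrt((4 - 1)^2 + (6 - 2)^2) = 5
distance = np.linalg.norm(a - b)

M = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Matrix-vector multiplication: [2*1 + 1*2, 1*1 + 2*2] = [4, 5]
Ma = M @ a

# Determinant: 2*2 - 1*1 = 3
det = np.linalg.det(M)

# Eigendecomposition of a symmetric matrix (eigenvalues come back in ascending order)
eigenvalues, eigenvectors = np.linalg.eigh(M)

# The matrix can be rebuilt from its eigenvectors and eigenvalues
rebuilt = eigenvectors @ np.diag(eigenvalues) @ eigenvectors.T

print(distance, Ma, det, eigenvalues)
```

Working through each of these by hand once, and then checking the answer in code, is a good way to make the definitions stick.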

After that, calculus should be your next focus. Here we’re most interested in understanding what derivatives mean and how we can use them for optimization. There are tons of great calculus resources out there, but at a minimum, you should make sure to get through all topics in Single Variable Calculus and at least sections 1 and 2 of Multivariable Calculus. This is a great time to look into gradient descent, a critical tool for many of the algorithms used in machine learning, which is just an application of partial derivatives.
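To show how partial derivatives drive optimization, here is a minimal gradient descent sketch in plain Python. The function being minimized, the starting point, the learning rate, and the step count are all illustrative choices, not recommendations.

```python
def gradient_descent(grad, start, learning_rate=0.1, steps=100):
    """Repeatedly step against the gradient, starting from `start`."""
    x = list(start)
    for _ in range(steps):
        g = grad(x)
        # Move each coordinate a small step in the downhill direction
        x = [xi - learning_rate * gi for xi, gi in zip(x, g)]
    return x


def grad_f(p):
    """Partial derivatives of f(x, y) = (x - 3)^2 + (y + 1)^2."""
    x, y = p
    return [2 * (x - 3), 2 * (y + 1)]


# The true minimum of f is at (3, -1); the result should land very close to it
minimum = gradient_descent(grad_f, start=[0.0, 0.0])
print(minimum)
```

Try changing `learning_rate` to 1.5 and watch the iterates diverge; that experiment is a small preview of why step size matters when tuning real models.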

Next up, it’s time to dig into probability and statistics. We recommend checking out the page for Joe Blitzstein’s Probability course at Harvard, which includes a comprehensive set of recorded lectures and exercises. Here you can get an introduction to concepts such as independence, conditional probability, and probability distributions, all of which play key roles in the assumptions, design, and training of machine learning models.
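For a taste of how conditional probability and independence show up in practice, the sketch below applies Bayes’ rule to a hypothetical spam filter. Every probability in it is invented for illustration.

```python
from fractions import Fraction

# Made-up probabilities for a toy spam filter
p_spam = Fraction(1, 5)                  # P(spam)
p_word_given_spam = Fraction(1, 2)       # P(word appears | spam)
p_word_given_ham = Fraction(1, 20)       # P(word appears | not spam)

# Law of total probability: P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word

# The word and the spam label are independent only if P(word | spam) == P(word)
independent = p_word_given_spam == p_word

print(p_word, p_spam_given_word, independent)  # 7/50 5/7 False
```

Seeing the word raises the probability of spam from 1/5 to 5/7, which is the kind of reasoning that underlies classifiers such as naive Bayes.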

Finally, you can dive into the programming aspect. We highly recommend Python, because it is a widely supported industry standard with excellent, pre-built machine learning tools. There are many online resources for learning Python, so we recommend that you research an option that works best for you. After getting the basics down, make sure to learn about and practice using key libraries such as pandas for data manipulation and matplotlib for data visualization. To get quick access to a compendium of Python data science and machine learning tools, we recommend installing the Anaconda distribution, which ships with the conda package manager. Through Anaconda you’ll get the libraries mentioned above along with the ever-popular scikit-learn, a library of optimized, pre-built machine learning algorithms.
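As a small example of the kind of data manipulation pandas is used for, the sketch below builds a tiny table, fills in a missing value, and parses dates. The movie data, column names, and the choice of median imputation are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical movie data; the columns and values are made up
df = pd.DataFrame({
    "title": ["A", "B", "C", "D"],
    "release": ["03/01/2015", "17/06/2016", "25/12/2016", "09/09/2017"],
    "budget_musd": [10.0, None, 30.0, 50.0],
})

# Fill the missing budget with the median of the known budgets
df["budget_musd"] = df["budget_musd"].fillna(df["budget_musd"].median())

# Parse day/month/year strings into real dates, then pull out the year
df["release"] = pd.to_datetime(df["release"], format="%d/%m/%Y")
df["year"] = df["release"].dt.year

# Average budget per release year
mean_budget_by_year = df.groupby("year")["budget_musd"].mean()
print(mean_budget_by_year)
```

From here, a one-line call such as `mean_budget_by_year.plot(kind="bar")` hands the result to matplotlib for a quick chart.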

R is also an excellent, widely supported language to learn for data science work. Python’s growing popularity and power as a general programming language make us prefer it, but it is certainly possible to accomplish great machine learning work with R.

This is where the fun begins. At this point, you’ll have the background needed to start working with real-world data. Most machine learning projects have a very similar workflow -- we map it out below, and include the skill set relevant to that stage.

- **Acquire the target data set** (web scraping, API calls, image libraries): *coding background*.
- **Clean/munge the data.** This takes all sorts of forms. Maybe you have incomplete data; how can you handle that? Maybe you have a date, but it’s in a weird format and you need to convert it to day, month, year. This takes a *coding background* combined with a healthy dose of *creative problem solving*.
- **Choose a machine learning algorithm(s).** Once you have the data cleaned up, you can start trying different algorithms. The image below is a **rough** guide to the algorithm landscape. Its most important aspect is that it gives you a lot of information to read about. You can look through the names of the many algorithms (e.g. Lasso) and say, “that seems to fit what I want to do based on the flow chart… but I’m not sure what it is,” then jump over to Google and learn about it: *math background*.
- **Tune your machine learning algorithm.** Here’s where your math background work pays off the most: all of these algorithms have a ton of adjustable buttons and knobs. For example, if you’re using gradient descent, what do you want the learning rate to be? Then you can think back to your calculus and realize that learning rate is just the step-size, and know that you need to tune that based on your understanding of the loss function. And then you adjust all your bells and whistles on your model to work toward a good result (measured with accuracy, recall, precision, F1 score, etc.; you should look these up). Finally, check for overfitting/underfitting with cross-validation and testing methods (these are critical to practical machine learning): *math and coding background*.
- **Visualize your results.** Here’s where your coding background pays off some more, because you now know how to create plots and what plot functions can do what.

*Image: algorithm selection flow chart, from scikit-learn’s documentation.*
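The workflow above can be sketched end to end with scikit-learn. This is one possible arrangement under stated assumptions, not a prescribed recipe: the built-in iris data set stands in for data you would acquire yourself, logistic regression is an arbitrary choice of algorithm, and the grid of `C` values is illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Acquire: a small built-in data set stands in for scraped or API data
X, y = load_iris(return_X_y=True)

# Hold out a test set that is never touched during tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Prepare and choose: scale the features, then fit a logistic regression
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Tune: compare regularization strengths with 5-fold cross-validation
search = GridSearchCV(
    model,
    param_grid={"logisticregression__C": [0.01, 0.1, 1.0, 10.0]},
    cv=5,
    scoring="accuracy",
)
search.fit(X_train, y_train)

# Check for overfitting: score the tuned model on the held-out test set
y_pred = search.predict(X_test)
test_accuracy = accuracy_score(y_test, y_pred)
test_f1 = f1_score(y_test, y_pred, average="macro")
print(search.best_params_, round(test_accuracy, 3), round(test_f1, 3))
```

If the cross-validated score in `search.best_score_` is much higher than `test_accuracy`, that gap is a sign of overfitting worth investigating.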

For this stage in your journey, we highly recommend the book “Data Science from Scratch” by Joel Grus. If you’re trying to go it alone (not using MOOCs or an immersive course such as Metis’), this provides a nice, readable introduction to most of the algorithms and also teaches you how to code them up. He doesn’t go too far in depth into the math side of things, so we would again emphasize learning the math prior to diving into the book. His book gives a particularly nice overview on all the different types of algorithms, covering topics such as classification vs regression and the different types of classifiers, and also shows you the inner workings of the algorithms in Python.

As we’ve mentioned, the key is to break your learning process into accessible building blocks and lay out a timeline for achieving your goal. It may not feel glamorous to put your linear algebra training before your first stab at computer vision, but this is the best way to really get yourself on the right track.

- Start with learning the “pure” math needed for machine learning (2–3 months)
- Move into programming tutorials purely on the language you’re using; don’t get caught up in the machine learning side of coding until you feel confident writing ‘regular’ code (1 month)
- Start jumping into machine learning code, following tutorials. Kaggle is an excellent resource for some great tutorials (e.g. see the Titanic data set). Pick an algorithm you see in tutorials and look up how to write it from scratch. Really dig into it. Follow along with tutorials using pre-made datasets like this: Tutorial To Implement k-Nearest Neighbors in Python From Scratch (1–2 months)
- Really jump into one (or several) short-term project(s) you are passionate about, but that aren’t overly complex. Don’t try to cure cancer with data (yet)… maybe try to predict how successful a movie will be based on the actors it hired and its budget. Maybe try to predict all-stars in your favorite sport based on their stats (and the stats of all the previous all-stars). (1+ month)
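To give a sense of what writing an algorithm from scratch looks like, here is a minimal k-nearest neighbors classifier in plain Python. It is our own sketch rather than the code from the tutorial mentioned above, and the training points and labels are toy values.

```python
import math
from collections import Counter


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length feature vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Sort training examples by distance to the query and keep the k closest
    neighbors = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean_distance(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Toy, made-up data: two well-separated clusters
train_points = [[1.0, 1.0], [1.5, 2.0], [2.0, 1.0], [6.0, 6.0], [7.0, 7.5], [6.5, 8.0]]
train_labels = ["low", "low", "low", "high", "high", "high"]

print(knn_predict(train_points, train_labels, [2.0, 2.0]))  # low
print(knn_predict(train_points, train_labels, [6.0, 7.0]))  # high
```

Once a version like this works, comparing its predictions against a library implementation on the same data is a useful check on your understanding.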

Don’t be afraid to fail. The majority of your time in machine learning will be spent trying to figure out why an algorithm didn’t pan out as well as you expected or why you got the error XYZ… that’s normal. Tenacity is key. Just go for it. If you think logistic regression might work… try it with a small set of data and see how it does. These early projects are a sandbox for learning the methods by failing, so make use of them and try everything that makes sense.

Finally, if you’re keen to make a living doing machine learning: BLOG. Make a website that highlights all the projects you’ve worked on. Show how you did them. Show the end results. Make it beautiful and use nice visuals. Make it digestible. Make a product that someone else can learn from, and then hope that an employer can see and value all the work you put in.

_____

*Metis teaches machine learning as part of our **Data Science & Machine Learning Bootcamp** and our Short Immersive **Machine Learning Classification course**.*
