
Understanding Natural Language Processing

By Tony Yiu • May 22, 2020

This post by data scientist Tony Yiu is a summary of a longer article he published on his Medium account.

Natural language processing (NLP) is one of the trendier areas of data science. Its end applications are many – chatbots, recommender systems, search, virtual assistants, etc. – so it would be beneficial to at least understand the basics even if you only occasionally dabble in analytics. And who knows, some topics extracted through NLP might just give your next model or analysis an extra boost. In this post, we seek to understand why topic modeling is important and how it helps us.

Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It has a lot in common with PCA (principal components analysis), which identifies the directions that explain the most variance in your features. The outputs of PCA are a way of summarizing those features. For example, PCA lets us go from something like 500 features to 10 summary features, and these 10 summary features are basically topics.
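To make the 500-to-10 idea concrete, here is a minimal sketch of PCA done with a plain SVD in NumPy. The data is synthetic and hypothetical (200 samples whose 500 features are driven by 10 hidden factors plus a little noise), and names like `X_summary` are made up for illustration, so treat it as one way to see the idea and not a recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 500 features driven by 10 hidden factors plus noise.
n_samples, n_features, n_factors = 200, 500, 10
factors = rng.normal(size=(n_samples, n_factors))
loadings = rng.normal(size=(n_factors, n_features))
X = factors @ loadings + 0.1 * rng.normal(size=(n_samples, n_features))

# PCA via SVD of the centered data matrix.
X_centered = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)

k = 10
components = Vt[:k]                    # each row is one principal direction, shape (10, 500)
X_summary = X_centered @ components.T  # the 10 summary features per sample, shape (200, 10)

# Share of total variance captured by the first k components.
explained = S**2 / np.sum(S**2)
explained_top_k = explained[:k].sum()

print(X_summary.shape)
print(f"variance explained by {k} components: {explained_top_k:.3f}")
```

Because this toy data really was generated from 10 factors, the 10 summary features keep nearly all of the variance. Real data is rarely that tidy, and choosing how many components to keep is a judgment call.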

In NLP, topic modeling works almost exactly the same way. We want to distill our text data, potentially millions of documents, into a set of digestible topics that tell us that this group of documents is primarily about computers, while that group is about finance, and so on.
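As a rough illustration of the computers-versus-finance example, the sketch below applies latent semantic analysis (an SVD of a word-count matrix, which mirrors the PCA analogy) to a tiny made-up corpus. This summary does not name a specific algorithm, so LSA here is an assumption chosen for its similarity to PCA. The six documents, the two-word stop list, and the variable names are all hypothetical.

```python
import numpy as np

# Hypothetical toy corpus: three documents about computers, three about finance.
docs = [
    "computer software runs on fast processor memory",
    "new computer processor memory upgrade",
    "software update improves computer memory",
    "bank raised interest rates on loans",
    "investors watch interest rates and stock prices",
    "stock market and bank loans follow interest rates",
]
stop_words = {"on", "and"}  # tiny illustrative stop list

# Build a document-by-term count matrix.
tokenized = [[w for w in d.lower().split() if w not in stop_words] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}

counts = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(tokenized):
    for w in doc:
        counts[row, index[w]] += 1

# LSA: SVD of the (uncentered) count matrix. Each of the leading singular
# directions acts as one "topic".
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
n_topics = 2

# Signs of singular vectors are arbitrary, so compare magnitudes.
doc_weights = np.abs(U[:, :n_topics] * S[:n_topics])  # documents x topics
term_weights = np.abs(Vt[:n_topics])                  # topics x terms

doc_topics = doc_weights.argmax(axis=1)
top_terms = [
    [vocab[i] for i in np.argsort(-term_weights[k])[:2]]
    for k in range(n_topics)
]

for k in range(n_topics):
    members = [i for i, t in enumerate(doc_topics) if t == k]
    print(f"topic {k}: top terms {top_terms[k]}, documents {members}")
```

The two groups separate cleanly here only because they share no vocabulary once the stop words are removed. On real text the topics overlap, and you would normally reach for a library implementation with proper tokenization and term weighting.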

There’s a ton of data out there in the world (especially text data); topic modeling is a critical tool for organizing and making sense of all of it.


Before transitioning into data science, Tony Yiu spent nine years in the investments industry as a quantitative researcher, where he worked on portfolio optimization and economic simulation and built numerous forecasting models to predict everything from emerging market equity returns to household spending in retirement. He now works as a data scientist at Solovis, where he uses his experience in statistics, finance, and machine learning to design and build risk analytics software for financial institutions. Tony is also a Metis Bootcamp graduate, and we’re excited to have him back with us as a contributor to the blog, where he’ll write about data science and analytics in business and industry.
