
Understanding Natural Language Processing

By Tony Yiu • May 22, 2020

This post by Data Scientist Tony Yiu is a summary of a longer blog he published on his Medium account, which you can read in full here.

Natural language processing (NLP) is one of the trendier areas of data science. It has many end applications (chatbots, recommender systems, search, virtual assistants, and more), so it is worth understanding the basics even if you only occasionally dabble in analytics. And who knows, topics extracted through NLP might just give your next model or analysis an extra boost. In this post, we look at one NLP technique, topic modeling: why it is important and how it helps us.

Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It has a lot in common with PCA (principal components analysis), which identifies the directions in your features that explain the most variance. The outputs of PCA are a way of summarizing our features. For example, it allows us to go from something like 500 features to 10 summary features, and these 10 summary features are, in effect, topics.
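To make the PCA analogy concrete, here is a minimal sketch of compressing 500 features into 10 summary features. It uses plain NumPy and synthetic data; the helper name `pca_reduce`, the sample count, and the noise level are assumptions made for illustration, not something from the original post.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    explained = (S ** 2) / np.sum(S ** 2)
    return X_centered @ components.T, explained[:n_components]

# Synthetic data (an assumption for this sketch): 200 samples with 500
# features that are really driven by 10 hidden factors plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))
mixing = rng.normal(size=(10, 500))
X = latent @ mixing + 0.1 * rng.normal(size=(200, 500))

summary, explained = pca_reduce(X, n_components=10)
print("summary feature matrix shape:", summary.shape)
print("share of variance kept:", round(float(explained.sum()), 3))
```

Because the synthetic features were built from 10 hidden factors, the 10 summary features retain nearly all of the variance. Real data is rarely this tidy, so the share kept will usually be lower.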

In NLP, topic modeling works in much the same way. We want to distill our text data, which may run to millions of documents, into a set of digestible topics that tell us that this group of documents is primarily about computers, while that group is about finance, and so on.
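As a rough illustration of the same idea on text, the sketch below applies latent semantic analysis (a truncated SVD of a word-count matrix) to six toy documents. LSA is only one of several topic modeling techniques, chosen here because it mirrors PCA closely; it is not necessarily the method used in the full post. The documents, the stopword list, and the choice of two topics are all made up for this example.

```python
import numpy as np

# Toy corpus (hypothetical): three documents about computers, three about finance.
docs = [
    "software on the laptop with cpu and memory",
    "the cpu in the laptop and the software",
    "memory and cpu for the software",
    "the bank and the interest rate on a loan",
    "a loan with the bank interest rate",
    "stock returns and the interest rate",
]
STOPWORDS = {"the", "and", "on", "of", "for", "a", "in", "with"}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

tokens = [tokenize(d) for d in docs]
vocab = sorted({w for doc in tokens for w in doc})
index = {w: i for i, w in enumerate(vocab)}

# Document-term matrix: one row per document, one column per word.
counts = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(tokens):
    for w in doc:
        counts[row, index[w]] += 1

# Truncated SVD: keep only the strongest n_topics directions.
n_topics = 2
U, S, Vt = np.linalg.svd(counts, full_matrices=False)
doc_topic = U[:, :n_topics] * S[:n_topics]   # how strongly each document loads on each topic
topic_term = Vt[:n_topics]                   # how strongly each word loads on each topic

# Singular vectors have arbitrary sign, so compare magnitudes.
labels = np.abs(doc_topic).argmax(axis=1)
top_terms = [
    [vocab[i] for i in np.argsort(-np.abs(topic_term[t]))[:2]]
    for t in range(n_topics)
]

for t, words in enumerate(top_terms):
    print(f"topic {t}: {words}")
print("document topics:", labels.tolist())
```

The two groups of documents share no words after stopword removal, so each topic lines up cleanly with one group. On real text the topics overlap, and choosing the number of topics becomes a judgment call.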

There’s a ton of data out there in the world (especially text data); topic modeling is a critical tool for organizing and making sense of all of it.


Before transitioning into data science, Tony Yiu spent nine years in the investments industry as a quantitative researcher, where he worked on portfolio optimization and economic simulation and built numerous forecasting models to predict everything from emerging market equity returns to household spending in retirement. He now works as a data scientist at Solovis, where he uses his experience in statistics, finance, and machine learning to design and build risk analytics software for financial institutions. Tony is also a Metis Bootcamp graduate, and we're excited to have him back with us as a contributor to the blog, where he'll write about data science and analytics in business and industry.
