
Understanding Natural Language Processing

By Tony Yiu • May 22, 2020

This post by Data Scientist Tony Yiu is a summary of a longer blog post he published on his Medium account, which you can read in full here.

Natural language processing (NLP) is one of the trendier areas of data science. Its end applications are many – chatbots, recommender systems, search, virtual assistants, etc. – so it pays to understand the basics, even if you only occasionally dabble in analytics. And who knows, some topics extracted through NLP might just give your next model or analysis an extra boost. In this post, we look at one NLP technique, topic modeling: why it is important and how it helps us.

Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It has a lot in common with principal component analysis (PCA), which identifies the key quantitative trends (the directions that explain the most variance) within your features. The outputs of PCA are a way of summarizing our features. For example, PCA allows us to go from something like 500 features to 10 summary features, and these 10 summary features play much the same role that topics play for text.
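To make the PCA comparison concrete, here is a minimal sketch in plain Python. Everything in it is an illustrative assumption rather than something from the original post: the toy data (6 features driven by 2 hidden factors, standing in for the 500-to-10 example), the helper names, and the use of power iteration with deflation to find the leading components. In practice you would reach for a library implementation instead.

```python
import random

def center(rows):
    # Subtract each feature's mean so the covariance is computed around zero.
    means = [sum(col) / len(rows) for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(rows):
    # Sample covariance matrix of already-centered rows.
    n, d = len(rows), len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(d)]
            for i in range(d)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def leading_eigenpair(cov, rng, iters=300):
    # Power iteration: repeatedly multiply by cov and renormalize.
    v = [rng.gauss(0, 1) for _ in cov]
    for _ in range(iters):
        w = mat_vec(cov, v)
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    eigenvalue = sum(a * b for a, b in zip(v, mat_vec(cov, v)))
    return eigenvalue, v

def pca(rows, n_components, seed=0):
    rng = random.Random(seed)
    centered = center(rows)
    cov = covariance(centered)
    total_variance = sum(cov[i][i] for i in range(len(cov)))
    components, explained = [], []
    for _ in range(n_components):
        eigenvalue, v = leading_eigenpair(cov, rng)
        components.append(v)
        explained.append(eigenvalue / total_variance)
        # Deflation: remove the direction just found before searching again.
        cov = [[cov[i][j] - eigenvalue * v[i] * v[j] for j in range(len(v))]
               for i in range(len(v))]
    # Project every centered row onto the components (the summary features).
    reduced = [mat_vec(components, row) for row in centered]
    return reduced, explained

# Hypothetical toy data: 6 observed features built from 2 hidden factors.
data_rng = random.Random(42)
loadings = [[1.0, 0.0], [0.9, 0.1], [0.8, -0.2],
            [0.0, 1.0], [0.1, 0.7], [-0.2, 0.6]]
rows = []
for _ in range(200):
    f1, f2 = data_rng.gauss(0, 1), data_rng.gauss(0, 1)
    rows.append([a * f1 + b * f2 + data_rng.gauss(0, 0.1) for a, b in loadings])

reduced, explained = pca(rows, n_components=2)
print("summary features per row:", len(reduced[0]))
print("share of variance explained:", round(sum(explained), 3))
```

Because the six features were generated from only two underlying factors plus a little noise, two summary features should capture nearly all of the variance, which is the sense in which PCA summarizes a wide feature set.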

In NLP, it works almost exactly the same way. We want to distill our text data, which may run to millions of documents, into a set of digestible topics that tell us that one group of documents is primarily about computers, another is about finance, and so on.
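As a rough illustration of what distilling documents into topics can look like, here is a small sketch using non-negative matrix factorization (NMF) with multiplicative updates. NMF is just one common topic modeling technique, chosen here for brevity; this summary does not say which algorithm the author has in mind. The six-document corpus, the function names, and the choice of two topics are all made up for the example.

```python
import random

# Hypothetical toy corpus: three documents about computers, three about finance.
docs = [
    "computer software code keyboard",
    "software code computer laptop",
    "laptop keyboard computer code",
    "bank loan interest stock",
    "stock bank market interest",
    "market loan stock bank",
]

def build_matrix(docs):
    # Document-term count matrix: one row per document, one column per word.
    vocab = sorted({word for doc in docs for word in doc.split()})
    index = {word: j for j, word in enumerate(vocab)}
    matrix = [[0.0] * len(vocab) for _ in docs]
    for i, doc in enumerate(docs):
        for word in doc.split():
            matrix[i][index[word]] += 1.0
    return vocab, matrix

def transpose(a):
    return [list(col) for col in zip(*a)]

def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def squared_error(v, w, h):
    approx = matmul(w, h)
    return sum((x - y) ** 2 for rv, ra in zip(v, approx) for x, y in zip(rv, ra))

def nmf(v, n_topics, iters=200, seed=0, eps=1e-9):
    # Factor v (docs x terms) into w (docs x topics) times h (topics x terms).
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.uniform(0.1, 1.0) for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.uniform(0.1, 1.0) for _ in range(n_terms)] for _ in range(n_topics)]
    initial_error = squared_error(v, w, h)
    for _ in range(iters):
        # Multiplicative update for h: h * (w^T v) / (w^T w h)
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # Multiplicative update for w: w * (v h^T) / (w h h^T)
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h, initial_error

vocab, v = build_matrix(docs)
w, h, initial_error = nmf(v, n_topics=2)
final_error = squared_error(v, w, h)

# Highest-weighted words per topic, and each document's strongest topic.
top_words = [[vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -h[k][j])[:3]]
             for k in range(2)]
dominant_topic = [max(range(2), key=lambda k: row[k]) for row in w]
print("top words per topic:", top_words)
print("dominant topic per document:", dominant_topic)
```

Each row of `h` is a topic expressed as weights over words, and each row of `w` says how strongly a document expresses each topic. On a corpus this cleanly separated you would typically expect one topic to load on the computer vocabulary and the other on the finance vocabulary, though the numbering of the topics is arbitrary and real text is far messier.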

There’s a ton of data out there in the world (especially text data); topic modeling is a critical tool for organizing and making sense of all of it.


Before transitioning into data science, Tony Yiu spent nine years in the investments industry as a quantitative researcher, where he worked on portfolio optimization and economic simulation and built numerous forecasting models to predict everything from emerging market equity returns to household spending in retirement. He now works as a data scientist at Solovis, where he uses his experience in statistics, finance, and machine learning to design and build risk analytics software for financial institutions. Tony is also a Metis Bootcamp graduate, and we’re excited to have him back with us as a contributor to the blog, where he’ll write about data science and analytics in business and industry.
