This post by Data Scientist Tony Yiu is a summary of a longer blog post he published on his Medium account, which you can read in full here.
Natural language processing (NLP) is one of the trendier areas of data science. Its end applications are many – chatbots, recommender systems, search, virtual assistants, etc. – so it would be beneficial to at least understand the basics even if you only occasionally dabble in analytics. And who knows, some topics extracted through NLP might just give your next model or analysis an extra boost. In this post, we seek to understand why topic modeling is important and how it helps us.
Topic modeling is the practice of using a quantitative algorithm to tease out the key topics that a body of text is about. It has a lot in common with PCA (principal components analysis), which identifies the key quantitative trends (the ones that explain the most variance) within your features. The outputs of PCA are a way of summarizing our features. For example, PCA allows us to go from something like 500 features to 10 summary features, and these 10 summary features are basically topics.
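To make the PCA analogy concrete, here is a minimal sketch that compresses 500 features into 10 summary features. The data is synthetic and purely illustrative (200 samples driven by 10 hidden factors plus a little noise), and the helper name pca_summary is made up for this example rather than taken from any library.

```python
import numpy as np

def pca_summary(X, n_components):
    """Project X (samples x features) onto its top principal components."""
    # PCA works on centered data, so subtract each feature's mean.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by variance explained.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Each sample's coordinates along the top directions: the summary features.
    scores = X_centered @ components.T
    # Share of total variance carried by each component.
    explained_ratio = (S ** 2) / np.sum(S ** 2)
    return scores, explained_ratio[:n_components]

# Synthetic, illustrative data: 500 features generated from 10 hidden factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 10))
loadings = rng.normal(size=(10, 500))
X = latent @ loadings + 0.1 * rng.normal(size=(200, 500))

scores, ratio = pca_summary(X, n_components=10)
print(scores.shape)   # (200, 10)
print(ratio.sum())    # share of variance kept by the 10 summary features
```

Because this toy data really is built from 10 factors, the 10 summary features keep nearly all of the variance. Real data is rarely that tidy, so the share retained is something to check rather than assume.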
In NLP, topic modeling works in much the same way. We want to distill our text data, and its potentially millions of documents, into a set of digestible topics that tell us that this group of documents is primarily about computers, while this group is about finance, and so on.
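As a rough illustration of the idea, the sketch below factors a tiny document-term count matrix into two topics using non-negative matrix factorization (NMF) with simple multiplicative updates. NMF is just one common topic modeling technique, picked here because it fits in a few lines; this summary does not say which algorithm the full post uses. The six documents, the stopword list, and the nmf helper are all made up for this example.

```python
import numpy as np

# Toy corpus (invented for illustration): three computing and three finance documents.
docs = [
    "the laptop cpu and gpu run the software",
    "new software update for the laptop keyboard",
    "cpu gpu memory and software benchmarks",
    "the bank raised interest rates on loans",
    "stock market investors watch interest rates",
    "bank loans and stock market returns",
]
stopwords = {"the", "and", "for", "on", "new"}

# Build a document-term count matrix V (documents x vocabulary).
tokenized = [[w for w in d.split() if w not in stopwords] for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})
index = {w: i for i, w in enumerate(vocab)}
V = np.zeros((len(docs), len(vocab)))
for row, doc in enumerate(tokenized):
    for w in doc:
        V[row, index[w]] += 1

def nmf(V, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Approximate V as W @ H with non-negative W (doc-topic) and H (topic-word)."""
    rng = np.random.default_rng(seed)
    W = rng.random((V.shape[0], n_topics))
    H = rng.random((n_topics, V.shape[1]))
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

W, H = nmf(V, n_topics=2)

# Show the heaviest words in each topic and each document's dominant topic.
for k, row in enumerate(H):
    top_words = [vocab[i] for i in np.argsort(row)[::-1][:4]]
    print(f"topic {k}: {top_words}")
print("dominant topic per document:", W.argmax(axis=1))
```

Each row of H weights the vocabulary for one topic, and each row of W says how much of each topic a document contains. On a corpus this cleanly separated, one would expect one topic to lean toward computing words and the other toward finance words, though which topic gets which index depends on the random start.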
There’s a ton of data out there in the world (especially text data); topic modeling is a critical tool for organizing and making sense of all of it.
Before transitioning into data science, Tony Yiu spent nine years in the investments industry as a quantitative researcher, where he worked on portfolio optimization and economic simulation and built numerous forecasting models to predict everything from emerging market equity returns to household spending in retirement. He now works as a data scientist at Solovis, where he uses his experience in statistics, finance, and machine learning to design and build risk analytics software for financial institutions. Tony is also a Metis Bootcamp graduate, and we’re excited to have him back with us as a contributor to the blog, where he’ll write about data science and analytics in business and industry.