This post was written by Adam Wearne, Sr. Data Scientist at Metis.
In an increasingly competitive financial landscape, gaining an informational edge is vital to constructing and maintaining a superior equity portfolio. Alternative data sources such as satellite images, sensor data from IoT devices, text, and video are becoming important sources of insight for active equity strategies. The volume of such unstructured data is so large that, by some estimates, it accounts for over 90% of the entire digital universe. In this post, we'll highlight some of the common Natural Language Processing (NLP) techniques that are used in asset management.
Text data, in particular, is one of the largest and fastest-growing forms of alternative data. Uncovering investment insights requires not only domain knowledge of finance, but also a strong grasp of data science and machine learning principles. In the past, the volume and velocity of textual data were manageable enough to be analyzed manually by teams of human experts. Given the volume of text now produced every day, it is no longer feasible for even a large team of fundamental researchers to wade through it all. Pairing fundamental analysis with NLP techniques is now critical to unlocking the complete picture of how the experts and the masses feel about the market.
Sentiment analysis is perhaps the most common method for gaining investment insight from text. The intuition is pretty apparent here: if we want to gain insight about the expected future return of a stock, it makes sense to know how people feel about that company! The most common methods of sentiment analysis in finance can be largely divided into two camps: lexical and statistical. Lexical approaches have an inherent psychological component built into the system. They typically involve a panel of domain experts annotating a dictionary of words with their associated semantic polarity and strength. The semantic polarity of a given term is highly domain dependent, and care must be taken when deciding which lexicon to apply to a given problem. For example, a word like "liability" reads as negative in everyday language but is often neutral in a financial filing. Perhaps one of the most notable sentiment dictionaries used in finance and investment management is the Loughran-McDonald dictionary.
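To make the lexical approach concrete, here is a minimal sketch of dictionary-based scoring. The lexicon below is a toy one made up for this example (its words and weights are assumptions, not entries from the Loughran-McDonald dictionary), and the function name lexicon_score is likewise just illustrative.

```python
import re

# Toy lexicon for illustration only; these words and weights are invented
# and are NOT taken from the Loughran-McDonald dictionary.
TOY_LEXICON = {
    "gain": 1.0,
    "growth": 1.0,
    "strong": 0.5,
    "loss": -1.0,
    "lawsuit": -1.0,
    "weak": -0.5,
}


def lexicon_score(text, lexicon=TOY_LEXICON):
    """Return the average polarity of the lexicon words found in the text.

    Words that are not in the lexicon are ignored; a text with no
    matching words scores 0.0.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [lexicon[t] for t in tokens if t in lexicon]
    return sum(hits) / len(hits) if hits else 0.0


# Three lexicon words match here: "strong", "growth", and "loss".
score = lexicon_score("Strong growth offsets a small loss")
print(score)
```

Averaging over matched words is only one possible design choice. Real systems often also handle negation ("not strong") and weight terms by how informative they are, which this sketch leaves out.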
Aside from lexicon-based methods of sentiment analysis, statistical approaches draw on many of the standard supervised learning techniques you may have seen in the past. Logistic regression, ensemble methods, and deep learning all fit the bill here. The interesting research challenges in this arena lie less in applying these models than in two harder questions: how can one reliably assign a sentiment score to a longer piece of text, and how do we determine relative sentiment? A news headline that reads "Company X wins large lawsuit against Company Y" is certainly good news for Company X, and potentially very bad news for Company Y. An interesting problem in modern applications of NLP to quantitative finance lies in understanding how the same document may have very different implications for different companies.
In addition to the specific method of sentiment analysis, one must also be mindful of the source text that is being analyzed. Are we looking at press releases? News headlines? Conference calls or corporate filings? Social media chatter? All of these sources contain potentially useful insights for generating alpha, but how information is conveyed across these different mediums varies wildly. A very negative statement in a Reuters headline looks quite different from a very negative tweet.
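As a rough illustration of the statistical camp, the sketch below fits a bag-of-words logistic regression with plain stochastic gradient descent. It is written in pure Python to stay self-contained; in practice one would more likely reach for an established machine learning library. The four labeled headlines, the function names, and the hyperparameters are all assumptions made up for this example.

```python
import math
import re
from collections import defaultdict

# Invented training data: 1 = positive headline, 0 = negative headline.
DOCS = [
    "profits rise on strong demand",
    "record growth in quarterly earnings",
    "shares plunge after profit warning",
    "company reports heavy losses",
]
LABELS = [1, 1, 0, 0]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(docs, labels, epochs=200, lr=0.5):
    """Fit one weight per word plus a bias, updating after every document."""
    weights = defaultdict(float)
    bias = 0.0
    for _ in range(epochs):
        for text, y in zip(docs, labels):
            tokens = tokenize(text)
            p = sigmoid(bias + sum(weights[t] for t in tokens))
            err = y - p  # gradient of the log-likelihood w.r.t. the logit
            bias += lr * err
            for t in tokens:
                weights[t] += lr * err
    return weights, bias


def predict_proba(text, weights, bias):
    """Probability that the text is positive; unseen words contribute 0."""
    z = bias + sum(weights.get(t, 0.0) for t in tokenize(text))
    return sigmoid(z)


weights, bias = train_logistic(DOCS, LABELS)
print(predict_proba("strong growth in earnings", weights, bias))
print(predict_proba("heavy losses and plunge", weights, bias))
```

Note that a model like this scores the whole headline at once. It has no notion of which company the sentiment applies to, which is exactly the relative-sentiment problem described above.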
Of course, these statistical methods require us to have large labeled datasets. This requirement is compounded if multiple emotional valences are being considered. So, if we're unable to produce such a large labeled dataset, are we stuck? Not at all! One can also employ unsupervised learning strategies along with some human-in-the-loop intervention to produce novel and sustainable investment strategies. Non-sentiment-based approaches may combine elements of topic modeling and clustering, which have the potential for interesting investment applications. In one approach, news headlines are first analyzed using modern dependency parsing and named-entity-recognition techniques in an attempt to determine what is happening and to whom, distilling their contents down into a simple Subject-Verb-Object (SVO) format. Taking the Verb-Object portion of this triplet across many news headlines, one can cluster them and begin to develop a picture of the effect they have on the Subject of a given headline by examining stock returns relative to the publication date of the headline.
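The last step of the SVO idea can be sketched in a few lines. This example assumes the headlines have already been parsed into (subject, verb, object) triples and paired with a return for the subject's stock; the parsing itself is not shown, and every company name and return value below is invented. It also uses exact matching on the Verb-Object pair as the crudest possible stand-in for clustering, whereas a real system would group similar events using something richer.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical pre-parsed headlines: (subject, verb, object, next-day return).
# In practice the triples would come from a dependency parser and the
# returns from market data; all values here are made up for illustration.
PARSED = [
    ("AlphaCo", "wins", "lawsuit", 0.021),
    ("BetaInc", "wins", "lawsuit", 0.013),
    ("GammaLtd", "loses", "lawsuit", -0.034),
    ("DeltaCorp", "recalls", "product", -0.018),
    ("EpsilonPLC", "recalls", "product", -0.026),
]


def average_return_by_event(parsed):
    """Group headlines by (verb, object) and average the subject's return."""
    groups = defaultdict(list)
    for _subject, verb, obj, ret in parsed:
        groups[(verb.lower(), obj.lower())].append(ret)
    return {event: mean(rets) for event, rets in groups.items()}


event_returns = average_return_by_event(PARSED)
for event, avg in sorted(event_returns.items()):
    print(event, round(avg, 4))
```

With enough headlines per group, the average return attached to each event type starts to describe how the market tends to react to that kind of news, without ever needing a labeled sentiment dataset.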
Apart from using sentiment-like approaches to aid in forecasting returns, one can also incorporate text information from corporate filings to provide an alternative perspective on risk modeling. Publicly traded companies in the United States are required by the SEC to submit an annual form (the 10-K) that details the financial performance of the company. There are many standardized sections that companies are obliged to submit, including the Risk Factors identified by the company. By running a topic model such as Latent Dirichlet Allocation (LDA) over the text in the risk section of corporate 10-Ks and examining how the distribution of topics overlaps between companies, one can gain insight into which companies share common underlying risk factors. This method provides an alternative view of portfolio risk that can be used to enhance standard returns-based approaches.
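To illustrate the overlap step, suppose a topic model has already been fitted and has produced a topic mixture for each company's risk section. The sketch below compares those mixtures with cosine similarity, which is one reasonable choice among several (other distance measures between distributions would also work). The company names, topic labels, and numbers are all hypothetical, and the topic model itself is not shown.

```python
import math

# Hypothetical per-company topic mixtures, standing in for the output of a
# topic model fitted to 10-K risk sections. The three topics might be read
# as fuel costs, regulation, and cyber risk; all numbers are invented.
TOPIC_MIX = {
    "AirlineA": [0.70, 0.20, 0.10],
    "AirlineB": [0.65, 0.25, 0.10],
    "SoftwareC": [0.05, 0.25, 0.70],
}


def cosine_similarity(p, q):
    """Cosine of the angle between two topic vectors (1.0 = same direction)."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm


def risk_overlap(mix):
    """Similarity for every unordered pair of companies, keyed by sorted names."""
    names = sorted(mix)
    return {
        (a, b): cosine_similarity(mix[a], mix[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


overlap = risk_overlap(TOPIC_MIX)
for pair, sim in overlap.items():
    print(pair, round(sim, 3))
```

Pairs with high similarity describe their risks in similar terms, so they may be exposed to the same underlying factors even when their historical returns have not yet shown it.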
Perspectives on the future
There are still many open problems in the space of NLP applied to finance: aspect-based sentiment analysis, coreference resolution, evaluating how novel or "surprising" a news article is, and many others. What's needed is more creative minds and enthusiastic data scientists to help drive the field into the future!