Companies in the pharma industry are doing amazing work, and this is especially apparent in the current COVID-19 pandemic. Laypeople who want to invest in pharma company stocks may have difficulty understanding the technology or the business. Company press releases, which are official communications intended for the press, investors, and the public, are a rich source of information about the companies’ products and activities. Analyzing pharma company press releases for topic (vocabulary words) and sentiment might help laypeople better understand the business, and also understand what constitutes positive versus negative company news.
For this project, press release text was analyzed, and the target variables were sentiment and change in stock price. The initial idea was to build a model using the text data and sentiment to try to predict movements in the stock prices in the days after the press release. However, the wide differences in date formats of scraped press releases, plus the fact that there are not press releases every day whereas the market is open every business day, result in mismatched data points: there is ‘missing’ press release data on some days, and marking those days as ‘neutral’ or zero is not a good solution.
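The mismatch can be illustrated with a minimal sketch that uses only Python's standard library. The date formats and sample dates below are invented for illustration and are not the ones found in the scraped press releases; weekdays stand in for trading days, and market holidays are ignored.

```python
from datetime import date, datetime, timedelta

# Hypothetical formats; the real scraped releases may use others.
KNOWN_FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d %b %Y", "%m/%d/%Y")

def parse_release_date(raw):
    """Try each known format in turn; return a date, or None if none match."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None

def business_days(start, end):
    """Yield every Monday-to-Friday date from start to end, inclusive."""
    day = start
    while day <= end:
        if day.weekday() < 5:  # 0-4 are Monday-Friday
            yield day
        day += timedelta(days=1)

# Made-up release dates in mixed formats, including one that cannot be parsed.
raw_dates = ["2018-03-05", "March 7, 2018", "09 Mar 2018", "sometime in March"]
parsed = [parse_release_date(r) for r in raw_dates]
release_days = {d for d in parsed if d is not None}
unparsed = [r for r, d in zip(raw_dates, parsed) if d is None]

# One trading week: which days have a price but no press release?
trading_days = list(business_days(date(2018, 3, 5), date(2018, 3, 9)))
days_without_release = [d for d in trading_days if d not in release_days]
```

Here `days_without_release` holds the trading days that have a stock price but no text to score. Those are the gaps that cannot sensibly be filled with a ‘neutral’ value.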
Press releases of companies that had drugs approved by the US Food and Drug Administration (FDA) during the 2016–2019 time period were chosen for analysis. This time period was selected because it is recent, but does not overlap with the COVID-19 pandemic. Press release text was scraped from 10 companies; due to issues with website formatting and date format types, only four companies were ultimately analyzed (Novartis, Vertex, Eli Lilly, and Clovis). Historical stock price data for these companies was collected using Tiingo.
I had expected that the topics that came up in my topic modeling would be the different purposes of the press releases; for example, announcing quarterly financial results, announcing presentations at scientific meetings, announcing FDA drug approvals, announcing clinical trial data, etc. Surprisingly, the topics that popped up in the modeling had to do mainly with the different disease areas addressed by the companies: Oncology, Diabetes, Cystic Fibrosis, Migraine, Heart Disease. There were also the business-related topics of Investing and Financial Operations. Using Tableau, I created word clouds to illustrate these topics. Some of them are below.
For the sentiment analysis, I decided to focus on one company. I chose the Eli Lilly company, for no particular reason. They happen to have an extremely well-organized archive of press releases, which made the analysis much easier. Would that all companies had such well-organized archives! I used Tableau to plot sentiment polarity of each press release over the 2016–2019 time period:
Each press release is represented by a red circle on the above plot. Sentiment, on the y-axis, is expressed on a continuum from -1 (most negative sentiment) to +1 (most positive sentiment). Not too surprisingly, the sentiment of most of the press releases was determined to be positive; these are official company communications after all, and the company wants to communicate its products and mission as positively as possible. However, some press releases are more positive than others! Most of the press releases had a sentiment between 0 (neutral sentiment) and +0.2, indicated on the above plot with a blue rectangle. There are a few press releases which are more positive (circled in green) and a couple that fall below the neutral line into the negative zone (circled in blue). Let’s dig into the individual documents to see what’s going on.
First, the three most positive press releases. Excerpts from these are shown above, along with their sentiment polarity. Notably, all three of these press releases deal with some type of public outreach that the Eli Lilly company was doing: a scholars program for business students at Florida A&M and Howard University, an award sponsored by Lilly given to the Mayor of Boston and the Alzheimer’s Association MA/NH Chapter, and announcement of the winner of an Innovation Challenge for Inflammatory Bowel Disease, sponsored by Lilly. These sorts of public outreach projects are good publicity, and good for business. No formal analysis was done, so this is anecdotal and no causality can be determined, but in each case the closing stock price rose in the 3 days following the above press releases: from $60.90 to $61.43 in the first case, from $105.43 to $107.59 in the second case, and from $113.95 to $115.83 in the third case. The more modest increase in the first case may be due to the time period: the month after the US 2016 election, a time when the markets were somewhat unsettled.
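To compare the three moves on the same scale, the closing prices quoted above can be converted to percentage changes. This is plain arithmetic on the figures in the text, not a statistical analysis; the labels `first`, `second`, and `third` simply follow the order in which the prices are listed.

```python
# Closing prices quoted in the text: (price at release, price three days later).
price_pairs = {
    "first": (60.90, 61.43),
    "second": (105.43, 107.59),
    "third": (113.95, 115.83),
}

def percent_change(start, end):
    """Relative change from start to end, expressed in percent."""
    return (end - start) / start * 100.0

changes = {
    label: round(percent_change(start, end), 2)
    for label, (start, end) in price_pairs.items()
}

for label, pct in changes.items():
    print(f"{label}: {pct:+.2f}%")
```

The first increase works out to a bit under 1%, while the other two are between roughly 1.5% and 2%.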
How about the ‘negative’ press releases? Notice that the word ‘negative’ is in quotation marks.
Excerpts from the two press releases rated as ‘negative’ in this analysis are pictured above. These press releases are much more technical than the three positive ones we discussed earlier, with an extremely specialized vocabulary. They both deal with Lilly’s treatment for psoriatic arthritis, Taltz (ixekizumab). The first press release announces that Lilly will present data from its Phase 3 clinical trial of ixekizumab: 29 abstracts in all, with new data and safety and efficacy results. The second, more negatively rated press release announces that the treatment improved the condition of psoriatic arthritis patients. Both of these press releases contain good news: positive data on a new, efficacious treatment for a serious, debilitating disease. So why did the sentiment analysis give these texts negative ratings?
Here is what is going on. Looking only at the excerpts above, you will read the words ‘arthritis,’ ‘severe,’ ‘inadequate,’ ‘intolerance,’ ‘inhibitors,’ and ‘intolerant.’ In non-technical writing, these words usually have a negative connotation. In everyday usage, phrases like ‘severe weather,’ ‘this thing is inadequate for my needs,’ and ‘being intolerant of other people’s ideas’ evoke negative sentiment. However, in the context of an official communication from a company that makes a product targeted to arthritis, these words are, at the very least, neutral, and may even be positive depending on the context. People with psoriatic arthritis are the target market for this product, so the word ‘arthritis’ is simply an indicator of the market and the target of the treatment. The phrase ‘prior inadequate response or intolerance’ sounds pretty negative, but in this context, it is again pointing toward the target market for this product: people with a severe condition, whose disease has not responded to other treatments. These people may benefit from the new treatment, according to the text in the press release. When an off-the-shelf sentiment analyzer is used, these words are flagged as negative, because the analyzer was trained to see them as negative words.
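A toy example makes the effect concrete. The sketch below is not how TextBlob or NLTK work internally, and the word scores are invented for illustration rather than taken from any real lexicon; the sentence is a made-up one modeled on the phrases quoted above. It only shows how averaging general-purpose word scores can push good news below zero.

```python
import re

# Hypothetical general-purpose lexicon: polarity scores in [-1, 1],
# invented for this example (not TextBlob's or NLTK's actual values).
GENERAL_LEXICON = {
    "severe": -0.6,
    "inadequate": -0.5,
    "intolerance": -0.5,
    "intolerant": -0.5,
    "improved": 0.5,
}

def lexicon_polarity(text, lexicon):
    """Average the scores of the words found in the lexicon; 0.0 if none match."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else 0.0

# Made-up sentence in the style of the press release excerpts.
sentence = (
    "The treatment improved symptoms in patients with severe disease "
    "who had a prior inadequate response or intolerance to other therapies."
)
polarity = lexicon_polarity(sentence, GENERAL_LEXICON)
print(polarity)
```

One positive word is outvoted by three ‘negative’ ones, so the sentence scores below zero even though it describes a treatment that helped patients.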
What is needed is a sentiment analysis tool that uses a specialized dictionary, and considers word context when analyzing these texts. For example, the word ‘intolerance’ is used here in the context of patients who were intolerant of an existing treatment (i.e., the treatment made them ill) benefitting from using Lilly’s Taltz product. This should be rated as contributing to a positive sentiment of the text. However, there is sometimes a scenario where a company announces that some patients are intolerant of that company’s own product; that is, the product being used to treat their disease makes them ill. For example, some patients may be allergic to a component of the treatment. This should be rated as contributing to a negative sentiment of the text.
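One way such a tool could start is sketched below, purely as an illustration: a small domain dictionary that treats population-describing words as neutral, plus a crude context rule for ‘intolerance’ that checks whether the company’s own product is named in the next few words. All scores, the window size, and the example sentences are assumptions made up for this sketch; a real tool would need a curated dictionary and proper syntactic analysis.

```python
import re

# Hypothetical domain dictionary: words that only describe the patient
# population are scored as neutral. Scores are invented for illustration.
DOMAIN_LEXICON = {"severe": 0.0, "inadequate": 0.0, "arthritis": 0.0}
INTOLERANCE_TERMS = {"intolerance", "intolerant"}

def domain_polarity(text, own_product, window=3):
    """Average word scores, applying a context rule to intolerance terms."""
    words = re.findall(r"[a-z]+", text.lower())
    scores = []
    for i, word in enumerate(words):
        if word in INTOLERANCE_TERMS:
            # Crude context rule: is the company's own product named
            # within the next `window` words?
            following = words[i + 1 : i + 1 + window]
            if own_product.lower() in following:
                scores.append(-0.5)  # patients cannot tolerate our product
            else:
                scores.append(0.3)   # patients failed by another treatment
        elif word in DOMAIN_LEXICON:
            scores.append(DOMAIN_LEXICON[word])
    return sum(scores) / len(scores) if scores else 0.0

# Made-up sentences covering the two scenarios described in the text.
other_treatment = "Patients with prior intolerance to TNF inhibitors benefited from Taltz."
own_treatment = "Some patients were intolerant of Taltz."

score_other = domain_polarity(other_treatment, "Taltz")
score_own = domain_polarity(own_treatment, "Taltz")
```

With this rule, the same word contributes positively in the first sentence and negatively in the second. A fixed word window is fragile (a longer sentence could easily fool it), which is why real context handling needs more than a dictionary lookup.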
The goal of this project was to build something that might help make pharma press releases more understandable to the layperson. However, the sentiment analysis done using NLTK and TextBlob was not the right one for this use case. This type of highly specialized, technical literature requires a tuned, specialized dictionary of words rated as having a positive or negative sentiment to be used as the basis for determining the sentiment of a press release. Further work will incorporate a more appropriate dictionary, and use context, when analyzing sentiment of press release texts.
Tools used for this project were: scikit-learn, pandas, NumPy, Tiingo, pandas-datareader, NLTK, TextBlob, Seaborn, Matplotlib, and Tableau. I also used the plot_top_words function written by Olivier Grisel, Lars Buitinck, and Chyi-Kwei Yau to plot my topic modeling. I modified this code to create csv files for each topic, with weights for each word in the topic, to make the word clouds in Tableau.
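The csv export can be sketched roughly as follows. This is not the modified code used in the project, just a minimal standard-library illustration of the idea; the function names and the toy weights are hypothetical. In practice, the weights would come from a fitted scikit-learn topic model (its `components_` attribute) and the words from the vectorizer's vocabulary.

```python
import csv
import io

def top_words(weights, feature_names, n_top_words=20):
    """Return the n highest-weighted (word, weight) pairs for one topic."""
    ranked = sorted(zip(feature_names, weights), key=lambda pair: pair[1], reverse=True)
    return ranked[:n_top_words]

def write_topic_csv(rows, handle):
    """Write (word, weight) rows with a header to an open file-like object."""
    writer = csv.writer(handle)
    writer.writerow(["word", "weight"])
    writer.writerows(rows)

# Toy stand-ins for a vocabulary and a topics-by-words weight matrix.
feature_names = ["diabetes", "insulin", "revenue", "tumor"]
components = [
    [0.9, 0.7, 0.0, 0.1],  # a diabetes-like topic
    [0.0, 0.1, 0.2, 0.8],  # an oncology-like topic
]

# Write the first topic to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
write_topic_csv(top_words(components[0], feature_names, n_top_words=2), buffer)
csv_text = buffer.getvalue()
```

To produce one file per topic, loop over `components` and pass a handle from `open(path, "w", newline="")` instead of the in-memory buffer. Each resulting file has a `word` column and a `weight` column, which is the shape a word cloud in Tableau needs.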