Investigating public perceptions regarding the Long COVID on Twitter using sentiment analysis and topic modeling

2022-11-03 Yu-Bo Fu

Medical Data Mining, 2022, Issue 4

Yu-Bo Fu

1 Hasan School of Business, Colorado State University Pueblo, Pueblo, CO 81001, USA.

Abstract Background: An estimated 10 to 30 percent of people infected with severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) experience persistent symptoms after recovering from Coronavirus Disease 2019 (COVID-19), a condition known as Long COVID. Social media platforms such as Facebook and Twitter are primary sources for gathering and examining people’s opinions and sentiments on various topics. Methods: We examined sentiments and identified key themes and associated topics in Long COVID-related messages posted by Twitter users in the US between March 2022 and April 2022, using sentiment analysis and topic modeling. Results: A total of 117,789 tweets were examined, from which three dominant themes were identified: symptoms, social and economic impacts, and preventive measures. We also found that tweets about Long COVID expressed more negative than positive sentiment. Conclusions: Our research sheds light on the dominant themes, topics and sentiments surrounding this ongoing public health crisis. Based on these insights, we discuss the major implications of the study for health practitioners and policymakers.

Keywords: Long COVID, Twitter, social media, sentiment analysis, topic modeling

Introduction

COVID-19 is an infectious disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). The earliest cases of COVID-19 were reported in late December 2019 in Wuhan, China, and soon afterwards the novel virus spread across the world. As of May 11, 2022, around 517 million confirmed cases of COVID-19, including 6,258,023 deaths, had been reported to WHO. The virus has infected and killed millions of people worldwide. Vaccinations helped us ward off the worst-case scenario, and the worst of the Omicron variant wave has passed. The number of new COVID-19 cases has started to decline and we are finally moving on; however, the pandemic is not over yet [1]. We are in the transition toward managing COVID-19 as an endemic disease.

Survivors of COVID-19 now exceed 472 million. However, 49% of survivors reported persistent symptoms 4 months after diagnosis [2]. Around 10% to 30% of patients experience persistent symptoms after recovering, even if they had mild illness or no symptoms from COVID-19, a condition commonly referred to as Long COVID [3]. It is also known as post-COVID, long-haul COVID or long-term COVID [4]. According to the Centers for Disease Control and Prevention, post-COVID conditions are a wide range of symptoms that can last four weeks or even months after first being infected with the virus [5]. Symptoms of Long COVID may vary from person to person. The commonly reported symptoms are fatigue, difficulty breathing, cough, brain fog, joint pain, chest pain, muscle pain, headache, anxiety, depression, concentration and sleep problems, hair loss, skin rash and dyspnea [6,7]. One study found around 203 symptoms associated with Long COVID; however, no universal clinical definition exists. Health experts are still trying to understand more about Long COVID symptoms, causes, severity, and impact on long haulers’ daily lives [8]. Long COVID sufferers often have no idea what to do about their persistent symptoms [9]. In addition to its lingering health effects, studies have shown that Long COVID is having a dramatic economic impact in most countries. It affects people’s ability to return to work, which may have contributed to labor shortages. It also has a serious impact on people’s social lives [10,11]. The consequences of Long COVID could turn out to be a more severe public health problem than excess deaths from COVID-19 [12].

Long COVID is not fully understood by researchers and doctors, as it has yet to be consistently and thoroughly investigated. Its causes, its treatments and who is at highest risk of developing it remain unknown [13]. A major benefit of social and digital platforms during infectious disease outbreaks is the effective dissemination of information to the public, including updates on health crises and essential medical information [14]. There are currently 206 million daily active users on Twitter [15]. Posts published by Twitter users tend to reflect their ideas about a variety of topics and events in real time, including health conditions, which makes Twitter particularly well suited as a source for identifying public health conditions and concerns on a global scale.

Text mining can be used to extract meaningful insights and nontrivial patterns from a large volume of textual data. People started talking about Long COVID on Twitter before physicians and clinicians came to know about it [16]. Analyzing tweets with text mining techniques can therefore provide valuable insights into this emerging public health issue, and it offers researchers, practitioners, communities and health policymakers a possible toolset for better understanding Long COVID through conversational data. In this paper, a conceptual framework for the analysis of Twitter messages was developed and described (Figure 1). It contributes to the academic research community by providing a text analysis framework that helps researchers understand how to approach and analyze Twitter data to study the world’s health and monitor infectious outbreaks, and it serves as a useful tool for the early detection of public health threats from social media.

Very little research based on social media data has been done to study the causes and symptoms of long-term COVID or to examine public awareness and emergent conversations about it. This study aims to analyze posts about Long COVID on a major social media platform, Twitter, to better understand the opinions and sentiments of the general public. This is critical for enhancing public awareness of Long COVID and supporting policymakers’ decision making. The objective of this study is twofold: (1) to identify and classify the sentiments toward Long COVID that are expressed in tweets, and (2) to examine the public discourse and emergent themes surrounding long-haulers.

Methods

Our text analysis takes unstructured text data as input, preprocesses and transforms it with text mining techniques, identifies the sentiments expressed in the texts, and finally uncovers themes and topics using topic models.

Data Collection

Twitter posts served as a valuable source of data in our study because Twitter is widely used for public health surveillance, event detection, disease tracking and forecasting [17]. Twitter provides a suite of Application Programming Interfaces (APIs) that let researchers and developers access and gather tweets containing specific keywords or hashtags. The Python programming language was used for data collection and analysis. Using Twitter’s API, we collected English-language tweets posted between March 26, 2022 and April 26, 2022 by Twitter users located in the US who mentioned Long COVID or post-COVID. The keywords used were Long COVID and Post-Covid, including their hashtag equivalents. In addition, we used the retweets feature in GetOldTweets3 to remove retweets. This resulted in a dataset of 117,789 unique messages.
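The keyword and retweet filtering described above can be illustrated on records that have already been downloaded. The sketch below is not the authors' collection code and makes no API calls: the record layout (a dict with a `text` field and an optional `is_retweet` flag), the `RT @` prefix check and the helper name `select_tweets` are all assumptions made for illustration.

```python
import re

# Matches "Long COVID" / "Post-Covid" and their hashtag forms (#LongCovid, #PostCovid).
KEYWORD_RE = re.compile(r"long\s*covid|post[-\s]?covid", re.IGNORECASE)


def select_tweets(records):
    """Keep unique, non-retweet records whose text mentions a study keyword.

    `records` is assumed to be a list of dicts with a "text" key and an
    optional boolean "is_retweet" key (a hypothetical layout).
    """
    seen = set()
    kept = []
    for rec in records:
        text = rec["text"]
        # Drop retweets, flagged either explicitly or by the classic "RT @" prefix.
        if rec.get("is_retweet") or text.startswith("RT @"):
            continue
        # Drop tweets that do not mention any keyword.
        if not KEYWORD_RE.search(text):
            continue
        # Drop exact duplicates (case-insensitive, ignoring outer whitespace).
        key = text.strip().lower()
        if key in seen:
            continue
        seen.add(key)
        kept.append(rec)
    return kept


sample = [
    {"text": "Month three of #LongCovid and still exhausted"},
    {"text": "RT @someone: Long COVID is real"},
    {"text": "Post-Covid fatigue is no joke"},
    {"text": "Nice weather today"},
    {"text": "month three of #longcovid and still exhausted"},
]
unique_tweets = select_tweets(sample)
```

On the toy sample, only the first and third records survive: the retweet, the off-topic tweet and the duplicate are dropped.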

Figure 1 Text analysis framework. VADER, Valence Aware Dictionary and Sentiment Reasoner; LDA, Latent Dirichlet Allocation.

Data Preprocessing

Not all characters and words in tweets add value to text analysis. We preprocessed the tweets prior to topic modeling using the Natural Language Toolkit (NLTK), Twitter Preprocessor and regular expressions. In the first round of preprocessing, we removed special characters, numbers, hashtags, mentions, URLs and hyperlinks using Python’s regular expression library. In the second round, all text was converted to lower case and then split into tokens using an NLTK function. Since stopwords do not carry useful information, we removed them from further analysis. We then applied the WordNet Lemmatizer to convert each word to its meaningful base form. Finally, we created the dictionary (id2word) and corpus needed for topic modeling using the Gensim Python library. Gensim assigned a unique id to each token and calculated the term frequency to reflect how important a token is in the corpus.
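The two preprocessing rounds and the dictionary/corpus step can be sketched roughly as follows. This is a standard-library stand-in rather than the authors' code: the short `STOPWORDS` set replaces NLTK's English stopword list, lemmatization is omitted, whole hashtags are dropped (the paper does not say whether only the `#` sign was removed), and `clean_tweet`, `build_dictionary` and `doc2bow` are hypothetical helper names that only mimic the id-and-count structure the paper attributes to Gensim.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; the paper used NLTK's stopwords instead.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "and", "in",
             "i", "my", "after", "still", "have", "from"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_LETTER_RE = re.compile(r"[^A-Za-z\s]")


def clean_tweet(text):
    """Round 1: strip URLs, mentions, hashtags, digits and special characters.
    Round 2: lowercase, split on whitespace, drop stopwords."""
    text = URL_RE.sub(" ", text)
    text = MENTION_HASHTAG_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    tokens = text.lower().split()
    return [tok for tok in tokens if tok not in STOPWORDS]


def build_dictionary(docs):
    """Assign a unique integer id to every distinct token (an id2word analogue)."""
    token2id = {}
    for doc in docs:
        for tok in doc:
            token2id.setdefault(tok, len(token2id))
    return token2id


def doc2bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, term_frequency) pairs."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())


tokens = clean_tweet("Still have brain fog 3 months after #LongCovid @CDC https://t.co/abc")
docs = [tokens, ["fog", "fatigue", "fog"]]
token2id = build_dictionary(docs)
corpus = [doc2bow(doc, token2id) for doc in docs]
```

Here `tokens` comes out as `["brain", "fog", "months"]`; a lemmatizer would additionally reduce `months` to `month`.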

In addition, we experimented with part-of-speech (POS) tagging. The output of a topic model is often difficult to interpret, so we explored POS tagging as a way to enhance interpretability. POS tagging assigns each word in a sentence its appropriate part of speech (e.g., noun, adjective, etc.). First, we built a topic model without tags as a baseline. Second, we used the NLTK package to tag parts of speech and built topic models on nouns and adjectives. Finally, we removed all words except nouns, as nouns are better indicators of the topic being discussed [18]. Previous work suggests that reducing a news corpus to nouns only improves the topics’ semantic coherence [19].
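The three corpus variants (no tags, nouns and adjectives, nouns only) differ only in which tagged tokens are kept. A minimal sketch of that filtering step, assuming the tokens have already been tagged with Penn Treebank tags (the tag set NLTK's `pos_tag` returns by default): the example tags below are written by hand, and `keep_pos` is a hypothetical helper, not part of NLTK.

```python
NOUN_PREFIXES = ("NN",)             # NN, NNS, NNP, NNPS
NOUN_ADJ_PREFIXES = ("NN", "JJ")    # nouns plus JJ, JJR, JJS


def keep_pos(tagged_tokens, prefixes):
    """Keep only the words whose POS tag starts with one of `prefixes`."""
    return [word for word, tag in tagged_tokens if tag.startswith(tuple(prefixes))]


# Hand-tagged example in the (word, tag) format that a POS tagger would produce.
tagged = [("severe", "JJ"), ("fatigue", "NN"), ("lasted", "VBD"),
          ("for", "IN"), ("months", "NNS")]

baseline = [word for word, _ in tagged]              # no tag filtering
nouns_adjectives = keep_pos(tagged, NOUN_ADJ_PREFIXES)
nouns_only = keep_pos(tagged, NOUN_PREFIXES)
```

Each filtered token list would then go through the same dictionary and corpus construction before topic modeling.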

Sentiment Analysis & Topic Modeling

Sentiment analysis is the process of identifying and extracting people’s opinions about a variety of topics. To assess the sentiments in Twitter messages, a sentiment score for each tweet was calculated using the VADER (Valence Aware Dictionary and Sentiment Reasoner) package in Python. VADER is one of the most popular lexicon- and rule-based sentiment analysis tools and is specifically attuned to sentiments expressed in social media [20]. VADER not only tells whether a tweet is positive or negative, but also reveals the intensity (strength) of the emotion. The sentiment of each tweet was grouped into one of three categories (negative, neutral or positive) based on its compound score. The compound score is calculated by summing the valence scores of the words in the tweet that appear in the lexicon, and the sum is then normalized to lie between -1 (most extreme negative sentiment) and +1 (most extreme positive sentiment). If the compound score of a tweet is less than -0.05, the associated sentiment is negative. If it is greater than +0.05, the sentiment is classified as positive, and if it falls within the range of -0.05 to +0.05, the associated sentiment is neutral. Examples of tweet sentiments can be found in Table 1.
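The thresholding rule can be written down in a few lines. The sketch below assumes the compound score has already been obtained from VADER (its `polarity_scores` method returns a dict with a `compound` entry); `label_sentiment` and `normalize` are hypothetical helper names. The normalization shown, x / sqrt(x² + α) with α = 15, reflects our understanding of VADER's reference implementation and should be read as an assumption rather than as something stated in the paper.

```python
import math


def normalize(score_sum, alpha=15.0):
    """Squash an unbounded sum of valence scores into the open interval (-1, 1).

    alpha = 15 is assumed to match VADER's default.
    """
    return score_sum / math.sqrt(score_sum * score_sum + alpha)


def label_sentiment(compound, threshold=0.05):
    """Map a compound score to the three categories used in the paper."""
    if compound > threshold:
        return "positive"
    if compound < -threshold:
        return "negative"
    return "neutral"  # -0.05 <= compound <= +0.05


labels = [label_sentiment(c) for c in (0.62, -0.40, 0.0, 0.05, -0.05)]
```

With these thresholds, scores exactly at ±0.05 fall in the neutral band, following the wording of the paper.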

Topic modeling is an unsupervised machine learning technique for discovering the abstract topics that occur in unstructured text data. It is commonly used to automatically discover hidden topical patterns and recurring patterns in texts, and it provides a good way to identify what people are saying and to understand their thoughts on social media platforms. Many techniques can be used to obtain a topic model, including the Gaussian Mixture Model, Formal Concept Analysis, hierarchical latent Dirichlet allocation and Latent Dirichlet Allocation (LDA). LDA is one of the most commonly used topic modeling techniques. We ran LDA from the Gensim package to infer the themes of the 117,789 unique tweets. One of the key advantages of LDA is that no prior knowledge about the themes is required for topic modeling to work, which allows for the discovery of new topics [21]. We built an LDA model with 20 topics, where each topic is a combination of keywords and each keyword contributes a certain weight to that topic. The weights reflect the importance of a keyword in that topic. After the LDA model was built, we examined the resulting topics and their associated keywords through word clouds.
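To make the idea of "topics as weighted keywords" concrete, the following is a small, self-contained LDA sketch using collapsed Gibbs sampling. It is a swapped-in teaching implementation, not the method used in the study: the paper ran Gensim's LDA, which, as far as we know, is trained with variational inference rather than Gibbs sampling. The function name `lda_gibbs`, the hyperparameter values and the toy corpus are all assumptions.

```python
import random


def lda_gibbs(docs, num_topics, vocab_size, iters=200, alpha=0.1, beta=0.01, seed=0):
    """Fit LDA by collapsed Gibbs sampling.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns a num_topics x vocab_size matrix of per-topic word probabilities.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * num_topics for _ in docs]                 # words in doc d assigned to topic k
    topic_word = [[0] * vocab_size for _ in range(num_topics)]   # times word w is assigned to topic k
    topic_total = [0] * num_topics                               # total words assigned to topic k
    assignments = []

    # Random initial topic assignment for every word occurrence.
    for d, doc in enumerate(docs):
        doc_assignments = []
        for w in doc:
            k = rng.randrange(num_topics)
            doc_assignments.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assignments.append(doc_assignments)

    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment from the counts.
                k = assignments[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Resample the topic from its conditional distribution.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(num_topics)
                ]
                k = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Smoothed per-topic word probabilities (the keyword weights of each topic).
    return [
        [(topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
         for w in range(vocab_size)]
        for k in range(num_topics)
    ]


# Toy corpus over a 6-word vocabulary: ids 0-2 and ids 3-5 tend to co-occur.
toy_docs = [[0, 1, 2, 0, 1], [3, 4, 5, 3, 4], [0, 2, 1, 2], [5, 4, 3, 5]]
topic_word_probs = lda_gibbs(toy_docs, num_topics=2, vocab_size=6)
```

Each row of `topic_word_probs` is a probability distribution over the vocabulary, so the largest entries in a row are that topic's most heavily weighted keywords.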

Table 1 Sentiment values of tweets

Results

Frequency of Keywords

A word cloud was used to visualize word frequency in the tweets. As shown in Figure 2, the most common word was "people". "Vaccine" and "symptom" were also frequently mentioned.

Sentiment Analysis

Sentiment analysis tags individual tweets with their respective sentiment polarities. The share of negative tweets concerning Long COVID was higher than the shares of positive and neutral tweets. Of the 117,789 tweets in total, 53,359 were negative (45.3%), 45,702 were positive (38.8%) and 18,728 were neutral (15.9%) (Figure 3).
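The reported percentages follow directly from the counts, as a quick check shows. The counts below are the ones stated above; only the variable names are ours.

```python
# Tweet counts per sentiment class, as reported in the text.
counts = {"negative": 53359, "positive": 45702, "neutral": 18728}

total = sum(counts.values())
shares = {label: round(100 * n / total, 1) for label, n in counts.items()}
```

The three counts add up to the 117,789 tweets in the dataset, and the rounded shares are 45.3%, 38.8% and 15.9%.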

Topic Modeling

To obtain more interpretable topics and assign appropriate labels to them, we experimented with different POS tags.

First, a 20-topic model was built without tags as the baseline model. The topics extracted by this LDA model did not help make sense of the data (Figure 4).

Then, another 20-topic model was created on nouns and adjectives.

Figure 2 Word cloud showing keyword frequencies.

Figure 3 Distribution of sentiments.

Figure 4 Topic model without tags.

As shown in Figure 5, the topics started to make more sense but still lacked clear distinctions. Building topic models without tags or on nouns and adjectives did not yield anything meaningful, surprising or insightful. Therefore, we also created a topic model on nouns only, which produced clearer topics. The emergent topics and the 10 most frequent words in each topic extracted by topic modeling are shown in Table 2.

Twitter users were discussing three Long COVID-related questions. (1) What are some symptoms of COVID-19 long haulers? (2) How will the invisible disease change the way we live? (3) How do we prevent long-term COVID-19?

To visualize the topics and associated keywords generated by topic modeling, we applied the word cloud technique. Three topic clouds were created to identify and visualize the themes and high-frequency topic keywords regarding Long COVID (Figure 6).

As shown in Figure 7, the top 10 most common words in each topic were visualized by beta value. These words were used to give each topic a degree of semantic interpretation in context through relevant topic descriptions. The higher a word’s beta value, the more likely that word is to appear in the given topic.
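Ranking words by beta amounts to sorting one row of the topic-word probability matrix. A minimal sketch, in which the vocabulary and the beta values are invented purely for illustration and `top_words_by_beta` is a hypothetical helper name:

```python
def top_words_by_beta(beta_row, id2word, n=10):
    """Return the n (word, beta) pairs with the largest beta in one topic."""
    ranked = sorted(enumerate(beta_row), key=lambda pair: pair[1], reverse=True)
    return [(id2word[word_id], beta) for word_id, beta in ranked[:n]]


# Made-up vocabulary and per-topic word probabilities (not the study's values).
id2word = {0: "fatigue", 1: "work", 2: "vaccine", 3: "fog"}
toy_beta_row = [0.45, 0.05, 0.15, 0.35]

top_two = top_words_by_beta(toy_beta_row, id2word, n=2)
```

In this toy topic, `fatigue` and `fog` carry the largest beta values, so they would be the first words shown in a chart like Figure 7.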

Comparing the word clouds for the nouns-only model with those for the untagged model and the nouns-and-adjectives model shows that reducing the corpus to nouns prior to topic modeling leads to more interpretable, better separated and more meaningful topics.

Discussion

The results reveal several aspects of public awareness of, and concern about, one of the most serious consequences of the coronavirus: long-haul COVID.

Figure 5 Topic model on nouns & adjectives.

Figure 6 Word cloud for noun tags only.

Figure 7 Top 10 most common words in each topic by beta value.

Table 2 The emergent topics and themes in tweets about Long COVID with nouns only structure

To begin with, sentiment analysis showed that 38.8% of the tweets contained positive sentiments while 45.3% contained negative sentiments, which indicates that Twitter users expressed a more negative than positive outlook toward Long COVID. Negative emotions may cause misperceptions. COVID-19 can cause persistent ill health, and our understanding of how to manage and treat Long COVID is still evolving. Policymakers need to consider how to deal with the negative emotions that the pandemic elicits and how to inform the public about Long COVID effectively. While there is no ‘best practice’ for communication during a complex public health crisis, an effective communication strategy involves providing clear and specific information, communicated with openness and empathy, delivered through appropriate social media platforms, tailored to diverse community needs and shared by trusted authorities [22].

Additionally, the topic modeling results revealed three themes that people discussed frequently regarding Long COVID. Topic 1 describes the symptoms of long-haul COVID. Fatigue was one of the most commonly reported problems. Some people reported experiencing brain fog, a term used to describe short-term memory loss and difficulty concentrating, as a result of Long COVID [23]. Like adults, children can experience Long COVID. A study has shown that Long COVID affects children with the same wide and disturbing range of symptoms as adults, from fatigue to brain fog and trouble breathing [24]. However, which children will be affected and how badly they will suffer remains unknown. To date, data on long-term COVID-19 in children and adolescents remain scarce, since they are typically less severely affected by acute COVID-19 [25]. The potential for Long COVID in children looms large for many cautious parents and educational leaders. Researchers and health experts do not yet fully understand the risk factors, causes and effects of Long COVID; they are only starting to define the condition. Therefore, further research is urgently needed to investigate the causes, symptoms and treatments of Long COVID.

Topic 2 concerns the life-changing, lingering effects of Long COVID. Patients with Long COVID are struggling to get back to work and to lead normal social lives. The condition affects not only their physical and mental health but may also have significant economic consequences for them and for society. People who are suffering from post-COVID syndromes may be too sick to work. Some patients might have to change jobs, work fewer hours or work from home because of health issues. Therefore, while Long COVID is taking a heavy toll on the individuals affected, it also represents a disaster in the making for businesses and society, potentially pushing significant numbers of people out of labor markets. Although there is no evidence that the labor shortages seen in the U.S. are directly related to Long COVID, it is time for employment law and the Disabilities Act to catch up with the new condition. Governments need to clarify to what extent Long COVID should be treated as a disability or an occupational disease. Federal agencies should be directed to support long haulers as they seek treatment and attempt to return to work. Policymakers also need to take steps to address the potential economic effects of Long COVID.

Moreover, topic 3 suggests that people are looking for ways to prevent post-COVID conditions. Since millions of people suffer from long-haul COVID and we are still in the process of understanding the clinical patterns of long-term COVID-19, the best way to prevent post-COVID syndrome is to protect ourselves from becoming infected in the first place. The CDC suggests that face masks are effective at preventing infection with COVID-19. Although masks are no longer required in public settings, wearing a mask indoors is strongly recommended for everyone, regardless of vaccination status, because transmission remains a significant risk.

Vaccination is another important way to reduce the risk of suffering from Long COVID. Vaccinations helped us ward off the worst-case scenario of COVID-19. A recent study suggests that those who are vaccinated are less likely than unvaccinated individuals to report post-COVID conditions [26].

The main public health policy recommendations based on our study and findings are summarized in Table 3.

Conclusion

Long-haul COVID is a terrible and debilitating disease. Our research examined Twitter data to discover public opinions about Long COVID. Traditional LDA was employed to obtain the patterns, topics and associated themes in the Twitter text. We also examined the sentiments associated with the tweets using VADER. We found that Twitter users expressed negative feelings about the ongoing symptomatic COVID. While the diagnosis of Long COVID is unclear, its social and economic impact is visible. Symptoms, social and economic impacts, and ways to prevent long-term COVID were the most discussed topics.

Table 3 Public health policy recommendations

This study offers several research insights for policymakers. Health leaders should communicate effectively with the public and build platforms that provide details such as where the public can go for information and help. Supporting research initiatives and improving data collection on Long COVID are also critical, as its causes, symptoms and treatments remain unclear. Action needs to be taken to address the wider social and economic consequences of long-haul COVID, and support programs for those with Long COVID need to be developed.

This study has a few limitations. We collected text data from Twitter, so the research findings reflect Twitter users only. Future research should include other data sources such as Facebook, Instagram and TikTok. Another important limitation is that we focused only on Twitter users in the U.S. Regional and cultural differences could affect people’s opinions, concerns and sentiments regarding Long COVID. As a future research extension, tweets posted in other geographical regions and languages should be collected and analyzed to make the study more generalizable. In addition, the data used in this study were collected over a limited period of time. It would be worthwhile to study public health events as time series, monitoring social media sentiment scores over time to check for spikes and identify what might have caused the changes. Moreover, Twitter influencers may strongly affect people’s thoughts, opinions and sentiments towards Long COVID, which in turn may affect the results of text analysis. To minimize the effect of influential Twitter users, multiple retweets of the same original tweet were removed from the analysis, and only unique tweets were analyzed. Future studies can address the issues that may be caused by influential Twitter users and measure their influence on people’s sentiments and opinions on social media platforms. Furthermore, an application could be built in the future to monitor and visualize real-time topics and sentiments associated with a keyword.
