By Rachael White ’23, Biochemistry, Biophysics, and Computer Science
I’m a sophomore studying biochemistry and computer science at Rensselaer Polytechnic Institute. I’m no expert in big data, or even social media for that matter, but I recently helped design a tool that analyzes tweets to measure public attitudes about mask-wearing during the COVID-19 pandemic as part of a Health Analytics Challenge Lab.
The Health Analytics Challenge Lab, run by the Rensselaer Institute for Data Exploration and Applications (IDEA), offers students a chance to use data analytics tools in research. One of the IDEA research projects, the COVID-Twitter project, is a research pipeline that uses Twitter data to identify and visually represent topic clusters in social media discussions. I was hugely intrigued by this research framework and joined the project team right away.
The COVID-Twitter project analyzes tweets on an immense scale. Tweets are first harvested from Twitter’s Application Programming Interface (API), then processed by a machine learning tool that computes the textual similarities among them, and finally deposited in a statistical computing platform for analysis. There, they can be rendered as three-dimensional clusters that display and highlight the central themes of the discussion. During weekly lab breakouts, our COVID-Twitter team developed and refined the tools that handle this data stream.
Twitter provided its samples at random, without regard to location, but I wondered whether we could study conversations by region if we knew where each tweet came from. I proposed that we introduce geospatial analysis in a user-friendly way with a location filter, and a fellow lab member had the feature running the next day. This allowed us to study and visualize trends in pandemic-era Twitter discourse on a state-by-state basis. Taking hints from the national news cycles, we zeroed in on the conversations about masks taking place across the United States. By indexing a new database, we were able to select tweets that were strictly related to coronavirus and mask usage from our larger pool of tweets posted between Jan. 1 and July 31, 2020.
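To make the selection step concrete, here is a minimal Python sketch of filtering a pool of tweets by topic, date range, and state. It is an illustration only, not the project's actual code: the keyword lists, the `select_tweets` function, and the record fields (`text`, `state`, `date`) are all assumptions, and the real pipeline indexed a database rather than looping over a list.

```python
from datetime import date

# Hypothetical keyword lists; the project's actual search terms are not given.
COVID_TERMS = ("covid", "coronavirus")
MASK_TERMS = ("mask", "face covering")


def select_tweets(tweets, state=None,
                  start=date(2020, 1, 1), end=date(2020, 7, 31)):
    """Return tweets that mention both COVID-19 and masks within a date range.

    Each tweet is assumed to be a dict with 'text', 'state', and 'date' keys.
    If `state` is given, only tweets tagged with that state are kept.
    """
    selected = []
    for tweet in tweets:
        text = tweet["text"].lower()
        in_window = start <= tweet["date"] <= end
        # Simple substring matching, so "mask" also matches "masks".
        on_topic = (any(term in text for term in COVID_TERMS)
                    and any(term in text for term in MASK_TERMS))
        in_state = state is None or tweet.get("state") == state
        if in_window and on_topic and in_state:
            selected.append(tweet)
    return selected


# Made-up records, for illustration only.
sample_tweets = [
    {"text": "Wearing a mask slows COVID spread", "state": "NY",
     "date": date(2020, 4, 3)},
    {"text": "Coronavirus cases are rising", "state": "NY",
     "date": date(2020, 4, 5)},
    {"text": "Masks and coronavirus in Texas", "state": "TX",
     "date": date(2020, 6, 1)},
]

ny_mask_tweets = select_tweets(sample_tweets, state="NY")
```

Calling `select_tweets(sample_tweets)` without a state keeps every on-topic tweet in the date range, which corresponds to the national view before the location filter is applied.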
One lab mate added a sentiment analysis tool that allows us to study attitudes toward mask-wearing based on tweet text. Another optimized the clustering method we use to summarize and display the tweets. I worked on a Python script that helped us evaluate our sentiment analysis tool against other known means of measuring public sentiment. We also introduced a highly anticipated analysis feature: automatic cluster labeling, courtesy of a nifty machine learning tool one lab member discovered. When the feature rolled out, our lab group watched in awe as the clever text interpreter went to town producing highly accurate and informative cluster summaries for our visualizations.
In our finalized analysis, we showed that the volume of COVID-19-related Twitter activity increased between January and the end of July, as might be expected. More notably, we found that the average sentiment of the discussion over this time grew consistently negative, and the sentiment scores of individual tweets grew more polarized. The collected corpus of tweets, full source code for the data collection, and analysis pipeline are freely accessible.
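As a rough illustration of how trends like these can be summarized, the sketch below groups sentiment scores by month and reports the mean and the spread for each month. It is a sketch under stated assumptions, not the project's method: the post does not say how polarization was measured, so standard deviation is used here as one plausible stand-in, and the `monthly_sentiment` function, the score range, and the sample values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def monthly_sentiment(scored_tweets):
    """Summarize sentiment per month.

    `scored_tweets` is an iterable of (month, score) pairs, where month is a
    string such as "2020-01" and score is assumed to lie in [-1, 1].
    Returns {month: (mean_score, spread)}, with spread measured as the
    population standard deviation (an assumed proxy for polarization).
    """
    by_month = defaultdict(list)
    for month, score in scored_tweets:
        by_month[month].append(score)
    return {month: (mean(scores), pstdev(scores))
            for month, scores in sorted(by_month.items())}


# Made-up scores, for illustration only.
scores = [
    ("2020-01", 0.2), ("2020-01", 0.4),
    ("2020-07", -0.8), ("2020-07", 0.6),
]

summary = monthly_sentiment(scores)
for month, (avg, spread) in summary.items():
    print(f"{month}: mean={avg:.2f}, spread={spread:.2f}")
```

In this made-up sample, the later month has a lower mean and a larger spread, which is the pattern described above: more negative on average and more polarized.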
Overall, official reports of infection rates can tell us the scope of the pandemic from an objective, biological standpoint, and are unquestionably crucial for epidemiological analyses. But when we consider the broader scope of the pandemic’s impact on society, those numbers only tell us part of the story. Powerful visualizations like the clusters developed as part of the COVID-Twitter project offer insight into what people are actually saying and feeling in the midst of a pandemic. This is important because these social factors may prove as predictive as the running disease tolls alone, if not more so, of how quickly and effectively society and the economy will recover.