
BEHIND THE MASK: TRACING THE COVID-19 RESEARCH TIMELINE THROUGH TOPIC MODELING AND EVOLUTION

The COVID-19 pandemic has reshaped the world in ways we never imagined. With 682.6 million cases and 7 million deaths reported by April 2024, its effects have been felt everywhere, especially in the workforce, where the World Health Organization (WHO) warns that half of the global workforce could face job losses. Amid these challenges, the research community has rallied, publishing over 350,000 COVID-related articles on PubMed Central alone, many of which offer fresh insights. Interestingly, nearly half of these researchers normally work in energy physics and condensed-matter physics, which shows how adaptable and dedicated the scientific community is.

This project is a tribute to their tireless efforts. We aim to trace the evolution of COVID-19 research through topic modeling, creating a tool that is useful not only for tracking the pandemic but also in other domains, from academia to media and government policy. While the need is especially urgent now, with rising cases and new variants emerging, we believe the methods developed here can help researchers in their future work well beyond COVID-19.

In short, this project isn’t just about COVID. It’s about harnessing the power of data to support innovation, help researchers, and ultimately make the world a better place. We’re committed to making this work as impactful and accessible as possible while keeping the hard work of researchers front and center.

HIGHLIGHTS 🌟

  • Data Collection: The dataset used in this project was obtained directly from the Registry of Open Data on AWS. The team connected to the S3 bucket containing the CORD-19 dataset to load and subsequently process it in preparation for analysis. The data includes full-text and metadata information, which are crucial for executing the project's main methodology. Additionally, the team converted the data into Parquet format to facilitate easy access whenever the code needs to be re-run.
  • Data Pre-processing: The raw data could not be used as-is, so the team implemented several steps to prepare it for analysis. First, the publication date of each article was extracted and split into two separate columns: month and year. This split is crucial for exploratory analysis and for tracking how topics evolve over time. Next, articles lacking substantive content were filtered out by removing those whose abstracts contain fewer than 100 words. Rows with null values in any of the abstract, title, year, or month columns were also removed to prevent empty articles from contaminating the final dataset. Finally, all duplicate articles were removed to avoid over-representing certain data points.
  • Data Exploration: The team conducted an exploratory data analysis (EDA) of COVID-19 research to present initial metrics about the data and to demonstrate that a simple technique cannot fully address the problem statement. The EDA begins by plotting the number of articles published per month and per year, which serves as a rough proxy for the severity of the virus during different periods. Next, the team generated word clouds from the article titles alone to see what simple visualization techniques can reveal about the evolution of research studies.
  • Topic Modeling and Evolution: The topic modeling process involves two key steps. First, text pre-processing prepares the data for Latent Dirichlet Allocation (LDA) by tokenizing, removing stop words, and vectorizing the text. Then, LDA is used to group articles into topics, with each topic represented by its most important words. The team used pyLDAvis to refine their COVID-19 topic model, choosing the number of topics based on how articles were distributed across topics and how distinct the topics were from one another. The tool also helped identify key words within each topic by balancing how common a word is overall against how specific it is to that topic. This approach revealed how research topics evolved over time, from routine medical and coronavirus studies in 2018 to more retrospective topics by 2022.
  • Results and Discussion: From 2018 to 2022, scientific research evolved in response to the global health crisis of COVID-19. In 2018, research focused on general medical topics and early studies on coronaviruses like MERS-CoV. In 2019, attention shifted to coronaviruses, particularly those linked to bats, foreshadowing the emergence of COVID-19. By 2020, research concentrated on understanding COVID-19, public health responses, and its broader impacts on mental health and education. In 2021, the focus expanded to include vaccine development, pandemic effects on education, and the use of machine learning in medical imaging for diagnosis. By 2022, research had broadened further to encompass the socio-economic, health, and environmental impacts of the pandemic, reflecting COVID-19’s lasting influence on global scientific and policy efforts.
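
As a rough illustration of the pre-processing and monthly-count steps described above, here is a minimal, library-free Python sketch. It is not the team's actual code: the field names `title`, `abstract`, and `publish_time`, the ISO `YYYY-MM-DD` date format, the lowercase title-plus-abstract duplicate key, and the helper names `preprocess` and `monthly_counts` are all assumptions made for illustration. A DataFrame library would express the same steps more compactly.

```python
from collections import Counter
from datetime import date


def preprocess(records, min_words=100):
    """Clean article records (hypothetical schema: title, abstract, publish_time).

    Steps mirror the README: drop rows with missing fields, split the
    publication date into year and month, drop abstracts shorter than
    `min_words` words, and remove duplicates.
    """
    seen = set()
    cleaned = []
    for rec in records:
        title = rec.get("title")
        abstract = rec.get("abstract")
        published = rec.get("publish_time")
        # Drop rows with a missing title, abstract, or date.
        if not title or not abstract or not published:
            continue
        # Assumption: dates are ISO strings; unparseable dates are dropped.
        try:
            parsed = date.fromisoformat(published)
        except ValueError:
            continue
        # Drop articles whose abstract has fewer than `min_words` words.
        if len(abstract.split()) < min_words:
            continue
        # Assumption: two rows are duplicates if title and abstract match
        # after trimming whitespace and lowercasing.
        key = (title.strip().lower(), abstract.strip().lower())
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({**rec, "year": parsed.year, "month": parsed.month})
    return cleaned


def monthly_counts(cleaned):
    """Count articles per (year, month), as used for the EDA distribution plot."""
    return Counter((rec["year"], rec["month"]) for rec in cleaned)
```

Calling `monthly_counts(preprocess(records))` on a list of dictionaries yields the per-month article counts that the distribution plot would be drawn from.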

KEY TAKEAWAYS 🔑

  • Topic Modeling and Evolution: The team used Latent Dirichlet Allocation (LDA) and pyLDAvis to detect and label topics across different COVID-19 time periods. This methodology is not only applicable to the current use case but also has broad potential for academic research, media information, public health, and government policy making.
  • Threat Identification: The methodology demonstrated by the team can flag new COVID-19 threats by pinpointing emerging research areas (topics with low marginal distributions), enabling early recognition and quicker implementation of preventive measures.
  • Trend Analysis: It also helps identify trends in COVID-19 research by highlighting key areas of focus (topics with high marginal distributions), forecasting future research directions, promoting collaboration, and encouraging more impactful studies.
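
To make the notion of a topic's marginal distribution concrete, the sketch below computes it from a document-topic matrix such as the one a fitted LDA model would produce. It uses one common definition, a length-weighted average of each document's topic weights; whether this matches the exact quantity the team read off pyLDAvis is an assumption. The function names and the toy numbers are hypothetical and chosen only for illustration.

```python
def marginal_topic_distribution(doc_topic, doc_lengths):
    """Return each topic's share of the corpus.

    doc_topic:   one row per document, each row a list of topic weights
                 that sums to 1 (e.g. the output of a fitted LDA model).
    doc_lengths: number of tokens in each document.
    """
    n_topics = len(doc_topic[0])
    totals = [0.0] * n_topics
    for weights, length in zip(doc_topic, doc_lengths):
        for k, weight in enumerate(weights):
            # Longer documents contribute proportionally more tokens.
            totals[k] += weight * length
    grand_total = sum(totals)
    return [t / grand_total for t in totals]


def rank_topics(marginals):
    """Topic indices ordered from largest to smallest marginal share."""
    return sorted(range(len(marginals)), key=lambda k: marginals[k], reverse=True)


# Toy example: three documents, three topics (made-up numbers).
doc_topic = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
]
doc_lengths = [100, 200, 100]

marginals = marginal_topic_distribution(doc_topic, doc_lengths)
ranking = rank_topics(marginals)
dominant_topic = ranking[0]   # candidate key area of focus (high marginal)
emerging_topic = ranking[-1]  # candidate emerging area (low marginal)
```

In this toy corpus the first topic holds the largest share and the last topic the smallest. Under the reading above, the former would be treated as a key area of focus and the latter as a candidate emerging area worth monitoring across time periods.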