This project demonstrates the application of Non-Negative Matrix Factorization (NMF) for topic modeling on a dataset of abstracts. Below, you will find a detailed explanation of the dataset, target variable, mathematical background of NMF, and evaluation methodology.
"""
## 📂 Project Structure
├── data/
│ └── NLP_Topic_modeling_Data.csv # abstracts with 31 discipline labels
├── NMF_TOPIC_MODELING.ipynb # Full analysis pipeline
└── README.md # Documentation
"""
The dataset consists of research abstracts across various scientific disciplines. Each entry contains:

- `id`: Unique identifier for each abstract.
- `ABSTRACT`: The text of the research abstract, the main input used for topic modeling.
- `Physics`, `Mathematics`, `Statistics`, etc.: Binary columns indicating the fields of study associated with each abstract.

The target variable for this project is the `ABSTRACT` column, which contains the text used for topic extraction and modeling.
We utilized Non-negative Matrix Factorization (NMF) for topic modeling. NMF is a dimensionality reduction technique that factorizes a non-negative matrix $V$ into two non-negative matrices $W$ and $H$, such that:

$$V \approx W H$$

- $V$: Document-term matrix.
- $W$: Document-topic matrix.
- $H$: Topic-term matrix.

The optimization problem solved by NMF is:

$$\min_{W \geq 0,\, H \geq 0} \; \lVert V - W H \rVert_F^2$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm.
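As a minimal sketch of this factorization (assuming scikit-learn's `TfidfVectorizer` and `NMF`; the toy corpus and parameters below are illustrative, not the notebook's exact settings):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the ABSTRACT column.
docs = [
    "quantum entanglement in photonic systems",
    "stochastic gradient descent for deep networks",
    "bayesian inference and markov chain monte carlo",
]

# Build the non-negative document-term matrix V.
vectorizer = TfidfVectorizer(stop_words="english")
V = vectorizer.fit_transform(docs)

# Factorize V ≈ WH with k = 2 topics.
nmf = NMF(n_components=2, init="nndsvd", random_state=42)
W = nmf.fit_transform(V)   # document-topic matrix
H = nmf.components_        # topic-term matrix
```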
The coherence score measures the semantic similarity between words in a topic. A higher coherence score indicates more interpretable topics. Here’s how it is calculated:
**1. Preprocessing** (a minimal sketch follows this step):

- The text data is cleaned by removing stop words, punctuation, and irrelevant tokens.
- The text is tokenized into individual words (tokens).
- Words are lemmatized to their root forms.
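A preprocessing sketch using NLTK (which the listed dependencies include); the `preprocess` helper and its filtering thresholds are illustrative:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Clean, tokenize, and lemmatize a single abstract."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # strip punctuation/digits
    tokens = text.split()                           # simple whitespace tokenization
    return [lemmatizer.lemmatize(t) for t in tokens
            if t not in STOP_WORDS and len(t) > 2]  # drop stop words, short tokens

print(preprocess("Quantum entanglement was observed in photonic systems."))
# ['quantum', 'entanglement', 'observed', 'photonic', 'system']
```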
**2. Topic Extraction**

- After applying NMF, each topic is represented as a ranked list of words: the most significant words for each topic, determined by their weights in the topic-term matrix $H$ (see the sketch below).
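Continuing the scikit-learn sketch above, the top words per topic can be read off the rows of `H` (the `top_words` helper and `top_n` value are illustrative):

```python
import numpy as np

def top_words(H, feature_names, top_n=10):
    """Return the top_n highest-weighted words for each topic (each row of H)."""
    topics = []
    for topic_weights in H:
        best = np.argsort(topic_weights)[::-1][:top_n]
        topics.append([feature_names[i] for i in best])
    return topics

feature_names = vectorizer.get_feature_names_out()
for i, words in enumerate(top_words(H, feature_names)):
    print(f"Topic {i}: {', '.join(words)}")
```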
**3. Pairwise Word Similarity**

- For each topic, pairs of the top $N$ words are created.
- A similarity measure, such as Pointwise Mutual Information (PMI), is calculated for each pair based on their co-occurrence in the original dataset.
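A sketch of the PMI calculation from document-level co-occurrence, where $\text{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$. The toy corpus and the smoothing constant `eps` (which avoids $\log 0$) are illustrative choices:

```python
import math
from itertools import combinations

def pmi(w1, w2, doc_tokens, eps=1e-12):
    """PMI of a word pair, estimated from document-level co-occurrence."""
    n = len(doc_tokens)
    p1 = sum(w1 in d for d in doc_tokens) / n
    p2 = sum(w2 in d for d in doc_tokens) / n
    p12 = sum(w1 in d and w2 in d for d in doc_tokens) / n
    return math.log((p12 + eps) / (p1 * p2 + eps))

# Toy corpus: one set of preprocessed tokens per document.
doc_tokens = [
    {"quantum", "photon", "entanglement"},
    {"quantum", "photon", "laser"},
    {"gradient", "network", "training"},
]

top = ["quantum", "photon", "entanglement"]  # top-N words of one topic
pairwise = {(a, b): pmi(a, b, doc_tokens) for a, b in combinations(top, 2)}
```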
**4. Average Coherence**

- The coherence score for a topic is the average of the pairwise similarities of its words.
- The overall coherence score $C$ across all topics is:

$$C = \frac{1}{N} \sum_{i=1}^{N} \text{Coherence}(\text{Topic}_i)$$

where $N$ is the number of topics and $\text{Coherence}(\text{Topic}_i)$ is the coherence score of the $i$-th topic.
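Continuing the PMI sketch above, the per-topic and overall scores are simple averages (the helper name is illustrative):

```python
def topic_coherence(words, doc_tokens):
    """Coherence of one topic: average pairwise PMI over its top words."""
    pairs = list(combinations(words, 2))
    return sum(pmi(a, b, doc_tokens) for a, b in pairs) / len(pairs)

topics = [["quantum", "photon", "entanglement"],
          ["gradient", "network", "training"]]
per_topic = [topic_coherence(words, doc_tokens) for words in topics]
C = sum(per_topic) / len(per_topic)  # overall coherence across N topics
```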
To identify the optimal number of topics $k$, multiple values were tested. The best $k$ was chosen based on:

- **Maximizing the coherence score.** The `CoherenceModel` class from `gensim.models` evaluates the quality of a topic model by measuring how semantically coherent its topics are. It works by comparing the words within each topic and checking their co-occurrence patterns or similarity. It supports several coherence measures (e.g., `c_v`, `c_uci`, `u_mass`, and the NPMI-based `c_npmi`), which differ in how they calculate word relationships, such as cosine similarity between word vectors or frequency of co-occurrence. Given the topics, the corpus or tokenized texts, and the dictionary, it returns a coherence score, where higher values indicate better topic coherence (see the sketch after this list).
- **Minimizing the reconstruction error.**
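A minimal sketch of scoring topics with gensim's `CoherenceModel` (the tokenized texts and the `c_v` measure are illustrative choices):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Tokenized, preprocessed abstracts (one token list per document).
texts = [
    ["quantum", "photon", "entanglement"],
    ["quantum", "photon", "laser"],
    ["gradient", "network", "training"],
]
dictionary = Dictionary(texts)

# Topics as ranked word lists, e.g. taken from the rows of H.
topics = [["quantum", "photon"], ["gradient", "network"]]

cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())             # overall score: higher is better
print(cm.get_coherence_per_topic())   # one score per topic
```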
The evaluation was conducted using:

- **Topic Coherence Score**: ensures that the extracted topics are interpretable and meaningful.
- **Reconstruction Error**: measures how well the factorized matrices $W$ and $H$ approximate the original matrix $V$. A lower reconstruction error indicates a better approximation (see the sketch below).
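With scikit-learn's `NMF`, the reconstruction error is exposed directly after fitting; a sketch continuing the earlier example:

```python
import numpy as np

# Frobenius-norm reconstruction error, stored by sklearn after fitting.
print(nmf.reconstruction_err_)

# Equivalent manual computation (dense form, fine for small matrices).
error = np.linalg.norm(V.toarray() - W @ H, ord="fro")
print(error)
```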
- Clone this repository:

  ```bash
  git clone https://github.com/Topic-Modeling-with-NMF.git
  cd Topic-Modeling-with-NMF
  ```

- Install the required libraries:

  ```bash
  pip install pandas numpy nltk scikit-learn gensim seaborn matplotlib wordcloud
  ```