This project demonstrates the application of Non-Negative Matrix Factorization (NMF) for topic modeling on a dataset of abstracts. Below, you will find a detailed explanation of the dataset, target variable, mathematical background of NMF, and evaluation methodology.
"""
## 📂 Project Structure
├── data/
│ └── NLP_Topic_modeling_Data.csv # abstracts with 31 discipline labels
├── NMF_TOPIC_MODELING.ipynb # Full analysis pipeline
└── README.md # Documentation
"""
The dataset consists of research abstracts across various scientific disciplines. Each entry contains:

- `id`: Unique identifier for each abstract.
- `ABSTRACT`: The text of the research abstract, the main input used for topic modeling.
- `Physics`, `Mathematics`, `Statistics`, etc.: Binary columns indicating the fields of study associated with each abstract.

The target variable for this project is the `ABSTRACT` column, which contains the text used for topic extraction and modeling.
We utilized Non-negative Matrix Factorization (NMF) for topic modeling. NMF is a dimensionality reduction technique that factorizes a non-negative matrix $V$ into two non-negative matrices $W$ and $H$, such that:

$$V \approx W H$$

- $V$: Document-term matrix.
- $W$: Document-topic matrix.
- $H$: Topic-term matrix.

The optimization problem solved by NMF is:

$$\min_{W \geq 0,\, H \geq 0} \; \lVert V - W H \rVert_F^2$$

where $\lVert \cdot \rVert_F$ denotes the Frobenius norm.
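As a minimal sketch of this factorization (assuming scikit-learn's `TfidfVectorizer` and `NMF`; the toy corpus and parameters below are illustrative, not the notebook's exact settings):

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus standing in for the ABSTRACT column.
docs = [
    "quantum entanglement in photonic systems",
    "stochastic gradient descent for deep networks",
    "bayesian inference and markov chain monte carlo",
]

# Build the non-negative document-term matrix V.
vectorizer = TfidfVectorizer(stop_words="english")
V = vectorizer.fit_transform(docs)

# Factorize V ≈ WH with k = 2 topics.
nmf = NMF(n_components=2, init="nndsvd", random_state=42)
W = nmf.fit_transform(V)   # document-topic matrix
H = nmf.components_        # topic-term matrix
```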
The coherence score measures the semantic similarity between words in a topic. A higher coherence score indicates more interpretable topics. Here’s how it is calculated:
**1. Preprocessing** (a minimal sketch follows this step):

- The text data is cleaned by removing stop words, punctuation, and irrelevant tokens.
- The text is tokenized into individual words (tokens).
- Words are lemmatized to their root forms.
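A preprocessing sketch using NLTK (which the listed dependencies include); the `preprocess` helper and its filtering thresholds are illustrative:

```python
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    """Clean, tokenize, and lemmatize a single abstract."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())   # strip punctuation/digits
    tokens = text.split()                           # simple whitespace tokenization
    return [lemmatizer.lemmatize(t) for t in tokens
            if t not in STOP_WORDS and len(t) > 2]  # drop stop words, short tokens

print(preprocess("Quantum entanglement was observed in photonic systems."))
# ['quantum', 'entanglement', 'observed', 'photonic', 'system']
```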
**2. Topic Extraction**

- After applying NMF, each topic is represented as a ranked list of words: the most significant words for each topic, determined by their weights in the topic-term matrix $H$ (see the sketch below).
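Continuing the scikit-learn sketch above, the top words per topic can be read off the rows of `H` (the `top_words` helper and `top_n` value are illustrative):

```python
import numpy as np

def top_words(H, feature_names, top_n=10):
    """Return the top_n highest-weighted words for each topic (each row of H)."""
    topics = []
    for topic_weights in H:
        best = np.argsort(topic_weights)[::-1][:top_n]
        topics.append([feature_names[i] for i in best])
    return topics

feature_names = vectorizer.get_feature_names_out()
for i, words in enumerate(top_words(H, feature_names)):
    print(f"Topic {i}: {', '.join(words)}")
```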
**3. Pairwise Word Similarity**

- For each topic, pairs of the top $N$ words are created.
- A similarity measure, such as Pointwise Mutual Information (PMI), is calculated for each pair based on their co-occurrence in the original dataset.
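A sketch of the PMI calculation from document-level co-occurrence, where $\text{PMI}(w_1, w_2) = \log \frac{P(w_1, w_2)}{P(w_1)\,P(w_2)}$. The toy corpus and the smoothing constant `eps` (which avoids $\log 0$) are illustrative choices:

```python
import math
from itertools import combinations

def pmi(w1, w2, doc_tokens, eps=1e-12):
    """PMI of a word pair, estimated from document-level co-occurrence."""
    n = len(doc_tokens)
    p1 = sum(w1 in d for d in doc_tokens) / n
    p2 = sum(w2 in d for d in doc_tokens) / n
    p12 = sum(w1 in d and w2 in d for d in doc_tokens) / n
    return math.log((p12 + eps) / (p1 * p2 + eps))

# Toy corpus: one set of preprocessed tokens per document.
doc_tokens = [
    {"quantum", "photon", "entanglement"},
    {"quantum", "photon", "laser"},
    {"gradient", "network", "training"},
]

top = ["quantum", "photon", "entanglement"]  # top-N words of one topic
pairwise = {(a, b): pmi(a, b, doc_tokens) for a, b in combinations(top, 2)}
```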
**4. Average Coherence**

- The coherence score for a topic is the average of the pairwise similarities of its words.
- The overall coherence score $C$ across all topics is:

$$C = \frac{1}{N} \sum_{i=1}^{N} \text{Coherence}(\text{Topic}_i)$$

where $N$ is the number of topics and $\text{Coherence}(\text{Topic}_i)$ is the coherence score of the $i$-th topic.
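Continuing the PMI sketch above, the per-topic and overall scores are simple averages (the helper name is illustrative):

```python
def topic_coherence(words, doc_tokens):
    """Coherence of one topic: average pairwise PMI over its top words."""
    pairs = list(combinations(words, 2))
    return sum(pmi(a, b, doc_tokens) for a, b in pairs) / len(pairs)

topics = [["quantum", "photon", "entanglement"],
          ["gradient", "network", "training"]]
per_topic = [topic_coherence(words, doc_tokens) for words in topics]
C = sum(per_topic) / len(per_topic)  # overall coherence across N topics
```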
To identify the optimal number of topics $k$, multiple values were tested. The best $k$ was chosen based on:

- **Maximizing the coherence score.** The `CoherenceModel` class from `gensim.models` evaluates the quality of a topic model by measuring how semantically coherent its topics are. It works by comparing the words within each topic and checking their co-occurrence patterns or similarity. It supports several coherence measures (e.g., `c_v`, `c_uci`, `u_mass`, and the NPMI-based `c_npmi`), which differ in how they calculate word relationships, such as cosine similarity between word vectors or frequency of co-occurrence. Given the topics, the corpus or tokenized texts, and the dictionary, it returns a coherence score, where higher values indicate better topic coherence (see the sketch after this list).
- **Minimizing the reconstruction error.**
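A minimal sketch of scoring topics with gensim's `CoherenceModel` (the tokenized texts and the `c_v` measure are illustrative choices):

```python
from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

# Tokenized, preprocessed abstracts (one token list per document).
texts = [
    ["quantum", "photon", "entanglement"],
    ["quantum", "photon", "laser"],
    ["gradient", "network", "training"],
]
dictionary = Dictionary(texts)

# Topics as ranked word lists, e.g. taken from the rows of H.
topics = [["quantum", "photon"], ["gradient", "network"]]

cm = CoherenceModel(topics=topics, texts=texts,
                    dictionary=dictionary, coherence="c_v")
print(cm.get_coherence())             # overall score: higher is better
print(cm.get_coherence_per_topic())   # one score per topic
```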
The evaluation was conducted using:

- **Topic Coherence Score**: ensures that the extracted topics are interpretable and meaningful.
- **Reconstruction Error**: measures how well the factorized matrices $W$ and $H$ approximate the original matrix $V$. A lower reconstruction error indicates a better approximation (see the sketch below).
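With scikit-learn's `NMF`, the reconstruction error is exposed directly after fitting; a sketch continuing the earlier example:

```python
import numpy as np

# Frobenius-norm reconstruction error, stored by sklearn after fitting.
print(nmf.reconstruction_err_)

# Equivalent manual computation (dense form, fine for small matrices).
error = np.linalg.norm(V.toarray() - W @ H, ord="fro")
print(error)
```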
- Clone this repository:

  ```bash
  git clone https://github.com/Topic-Modeling-with-NMF.git
  cd Topic-Modeling-with-NMF
  ```

- Install the required libraries:

  ```bash
  pip install pandas numpy nltk scikit-learn gensim seaborn matplotlib wordcloud
  ```