The dataset consists of news articles on business, politics, entertainment, sports, and technology, stored in JSON format. The articles are not separated by topic in the dataset; they are all mixed together. The task is to group the articles and determine which topic (business, politics, entertainment, sports, or technology) each article belongs to.
- To tackle this problem, I implemented K-means clustering.
- First, I applied K-means clustering without dimensionality reduction.
- Second, I applied K-means clustering with dimensionality reduction (using PCA), which extracts a smaller set of features before clustering and can improve the model's performance.
- In addition, I found:
a) which cluster has the most articles (before and after applying PCA)
b) the top 50 words in the entertainment cluster, printing the 50th word (before and after applying PCA)
Python 3+, Jupyter Notebook, Pandas, NumPy, scikit-learn, K-means, PCA
The purpose of this project is to gain insights, develop competency, and help others with:
a) Practical implementation of K-means clustering in Python
b) Clustering news articles with the K-means algorithm to find suitable, related topic clusters
c) Dimensionality reduction using PCA
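As a hedged sketch of point c), PCA can be applied to the article vectors before running K-means. The `articles` list below is hypothetical sample data, TF-IDF is an assumed feature representation, and `n_components=2` is chosen only because the sample is tiny; a real run would need a value tuned to the dataset.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical sample texts standing in for the JSON dataset.
articles = [
    "The stock market rallied as banks reported strong quarterly profits.",
    "Shares fell after the company cut its profit forecast for the year.",
    "The striker scored twice as the team won the cup final match.",
    "The coach praised the team after the league match ended in a win.",
    "The film won best picture and the actor took the award for best star.",
    "The singer released a new album and the film star attended the show.",
]

X = TfidfVectorizer(stop_words="english").fit_transform(articles)

# scikit-learn's PCA expects a dense array, so convert the sparse matrix.
pca = PCA(n_components=2, random_state=0)
X_reduced = pca.fit_transform(X.toarray())

# Cluster in the reduced space (3 clusters for the toy sample only).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels_pca = kmeans.fit_predict(X_reduced)

explained = pca.explained_variance_ratio_.sum()
print("features:", X.shape[1], "->", X_reduced.shape[1])
print("variance kept by the retained components:", round(float(explained), 3))
```

After PCA the centroids live in the reduced space, so listing top words per cluster would first require mapping them back to the term space, for example with `pca.inverse_transform(kmeans.cluster_centers_)`.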