Developed a binary classification model that labels e-mails as spam or non-spam (ham). Analyzed the dataset and visualized the most frequent spam and ham words with the WordCloud Python library. Measured the performance of the classification algorithms with accuracy, precision, F1-score, and the confusion matrix. Combining the models in a voting classifier improved performance further.
- NumPy
- Pandas
- Matplotlib
- Seaborn
- NLTK
- WordCloud
- Scikit-learn
- Naive Bayes (Gaussian, Bernoulli, Multinomial)
- SVM
- Decision Tree
- Random Forest
- XGBoost
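
A minimal sketch of combining several of the algorithms listed above in a voting classifier and scoring the result with the metrics named in the summary. The toy corpus, the TF-IDF features, the choice of three estimators, and hard (majority) voting are all assumptions made for illustration; the project's actual features, estimator set, and voting mode are not stated here.

```python
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score)
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy corpus invented for illustration only (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now", "free cash offer claim now",
    "urgent winner claim your prize", "cheap loans free entry",
    "are we meeting for lunch today", "please send me the report",
    "see you at the gym tonight", "thanks for the notes from class",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["claim your free prize", "lunch at noon tomorrow"]
test_labels = [1, 0]

# Hard voting: each estimator casts one vote and the majority label wins.
voter = VotingClassifier(
    estimators=[
        ("nb", MultinomialNB()),
        ("svm", SVC(kernel="linear")),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)

# Vectorize the text and fit the ensemble in a single pipeline.
model = make_pipeline(TfidfVectorizer(), voter)
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)

# The evaluation metrics named in the summary.
acc = accuracy_score(test_labels, pred)
prec = precision_score(test_labels, pred, zero_division=0)
f1 = f1_score(test_labels, pred, zero_division=0)
cm = confusion_matrix(test_labels, pred, labels=[0, 1])

print(f"accuracy={acc:.2f} precision={prec:.2f} f1={f1:.2f}")
print(cm)
```

Precision is often the metric to watch for spam filtering, because a false positive sends a legitimate e-mail to the spam folder. An `XGBClassifier` could be added to the `estimators` list in the same way, assuming the `xgboost` package is installed.
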
- Spam and Ham e-mail ratio in dataset
- Number of Characters, Words, and Sentences (Spam and Ham e-mail)
- Correlation Matrix Graphical Representation
- Spam Word Cloud Representation
- Top 50 Spam e-mail words
- Top 50 Ham e-mail words
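
The analyses listed above could be computed roughly as follows. This is a sketch under stated assumptions: the column names `label` and `text`, the four sample messages, and the regex-based tokenization are hypothetical stand-ins (the project lists NLTK, whose tokenizers would normally be used instead), and stop words are not removed.

```python
import re
from collections import Counter

import pandas as pd

# Hypothetical sample data; the real dataset and its column names may differ.
df = pd.DataFrame({
    "label": ["spam", "spam", "ham", "ham"],
    "text": [
        "Win a free prize now! Claim your free prize.",
        "Free entry! Win cash now.",
        "Are we still meeting for lunch today?",
        "Please send me the report. Thanks!",
    ],
})

# Spam and ham ratio in the dataset.
ratio = df["label"].value_counts(normalize=True)

# Length features per message. Sentences are approximated by counting runs of
# terminal punctuation, a rough stand-in for a proper sentence tokenizer.
df["num_characters"] = df["text"].str.len()
df["num_words"] = df["text"].str.split().str.len()
df["num_sentences"] = df["text"].str.count(r"[.!?]+")

# Correlation matrix between the target and the length features.
df["is_spam"] = (df["label"] == "spam").astype(int)
corr = df[["is_spam", "num_characters", "num_words", "num_sentences"]].corr()


def top_words(texts, n=50):
    """Return the n most frequent lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


top_spam = top_words(df.loc[df["label"] == "spam", "text"])
top_ham = top_words(df.loc[df["label"] == "ham", "text"])

print(ratio)
print(corr.round(2))
print(top_spam[:5])
```

The `(word, count)` pairs can be turned into a dictionary and passed to `WordCloud().generate_from_frequencies(...)` to draw the word cloud, and `corr` can be rendered with `seaborn.heatmap(corr, annot=True)`.
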