Add files via upload
dipayan22 authored Feb 19, 2024
1 parent 7137bb5 commit 6f323e3
Showing 13 changed files with 12,706 additions and 0 deletions.
5,172 changes: 5,172 additions & 0 deletions Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/newData.csv

Large diffs are not rendered by default.

@@ -0,0 +1,98 @@
# SPAM vs HAM Email Classification<br>

## 🎯 Goal<br>

The main goal of this project is to develop a robust deep learning model for classifying emails as either spam (SPAM) or legitimate (HAM). Additionally, the project aims to create a Streamlit GUI for a user-friendly interface in real-time email classification.<br>

## 🧵 Dataset<br>

The dataset used for this project is available [here](https://www.kaggle.com/datasets/omokennanna/simple-spam-classification). It consists of two columns: 'text' containing the email content and 'label' indicating whether the email is spam or ham.<br>

## 🧾 Description<br>

This project utilizes a deep learning model with an embedding layer, bidirectional LSTM layers, and a dense output layer. The model is trained on the provided dataset to classify emails as spam or ham. A Streamlit GUI is implemented to enable users to perform real-time email classification.<br>

## 🧮 What I have done<br>

### 1. Data Preparation<br>

- The dataset is loaded and split into training and testing sets.<br>
- Missing values are handled, and the text data is preprocessed.<br>
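
The split step can be sketched in plain Python. This is a minimal illustration, not the project's actual code: the helper name `stratified_split`, the 20% test fraction, and the seed are assumptions, and the toy rows only mimic the 'text' and 'label' columns of the dataset. A stratified split keeps the spam/ham ratio similar in both sets, which matters when one class is much rarer than the other.<br>

```python
import random
from collections import defaultdict


def stratified_split(rows, label_key="label", test_fraction=0.2, seed=42):
    """Split rows into train/test so each label keeps roughly the same share in both."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[label_key]].append(row)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = max(1, round(len(group) * test_fraction))
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy rows (hypothetical data) with an 88/12 ham/spam ratio.
rows = [{"text": f"ham {i}", "label": "ham"} for i in range(88)]
rows += [{"text": f"spam {i}", "label": "spam"} for i in range(12)]

train, test = stratified_split(rows)
spam_in_test = sum(row["label"] == "spam" for row in test)
print(len(train), len(test), spam_in_test)
```

In practice, scikit-learn's `train_test_split` with its `stratify` argument does the same job.<br>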

### 2. Model Architecture<br>

- For the machine learning model, I tried several algorithms to get better accuracy.<br>

- The first deep learning model comprises an embedding layer, LSTM layers, and a dense output layer with a sigmoid activation function.<br>

- The second deep learning model comprises an embedding layer, bidirectional LSTM layers, and a dense output layer with a sigmoid activation function.<br>

### 3. Training the Model<br>

The model is trained on the preprocessed dataset using the Adam optimizer and binary crossentropy loss. The training process is monitored for convergence and effectiveness.<br>
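
To make the loss concrete, here is a small plain-Python sketch of binary cross-entropy for a sigmoid output. It is only an illustration of the formula; the function name, the clipping constant `eps`, and the sample values are assumptions, and during training Keras computes this loss internally.<br>

```python
import math


def binary_crossentropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


# Confident, correct predictions give a low loss; undecided ones a higher loss.
confident = binary_crossentropy([1, 0], [0.9, 0.1])
unsure = binary_crossentropy([1, 0], [0.5, 0.5])
print(round(confident, 4), round(unsure, 4))
```

The loss falls as the predicted probability moves toward the true label, which is the signal the optimizer follows.<br>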

### 4. Streamlit GUI<br>

A Streamlit GUI is implemented for real-time email classification. Users can input an email, and the model predicts whether it is spam or ham.<br>
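
Before the deep learning model can score an input, the text has to be turned into a fixed-length sequence of integers. The sketch below shows that idea in plain Python under simplifying assumptions: the helper names are hypothetical, words are numbered in order of first appearance (the Keras tokenizer orders them by frequency), and the sample texts are made up. The key point is that the word index must come from the training texts, otherwise the integers mean nothing to the model.<br>

```python
def build_word_index(texts):
    """Map each word to an integer id, in order of first appearance."""
    word_index = {}
    for text in texts:
        for word in text.lower().split():
            if word not in word_index:
                word_index[word] = len(word_index) + 1  # 0 is reserved for padding
    return word_index


def text_to_padded_sequence(text, word_index, max_length):
    """Convert text to ids, then truncate and pad at the end ('post')."""
    # Words that were not seen when building the index are dropped.
    sequence = [word_index[w] for w in text.lower().split() if w in word_index]
    sequence = sequence[:max_length]
    return sequence + [0] * (max_length - len(sequence))


# Hypothetical training texts; the real index comes from the training data.
training_texts = ["win a free prize now", "meeting moved to friday"]
word_index = build_word_index(training_texts)

padded = text_to_padded_sequence("free prize on friday", word_index, max_length=6)
truncated = text_to_padded_sequence("win a free prize now", word_index, max_length=3)
print(padded, truncated)
```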

## 🚀 Models Implemented<br>

1. Machine Learning Model
2. Deep Learning Model with LSTM Layers
3. Deep Learning Model with Bidirectional LSTM Layers

**Why these models:**<br>

1. **Machine Learning Model:**<br>

This traditional machine learning model serves as a baseline and allows us to compare the performance of deep learning models against a more conventional approach.<br>

2. **Deep Learning Model with LSTM Layers:**<br>

LSTM layers are particularly effective for sequential data, making them suitable for capturing long-range dependencies and patterns within the input data.<br>

3. **Deep Learning Model with Bidirectional LSTM Layers:**<br>

Bidirectional LSTM layers enhance the LSTM model by processing sequences in both forward and backward directions, allowing the model to capture information from past and future time steps simultaneously.<br>
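
The bidirectional idea can be illustrated with a toy recurrence. This is not an LSTM: the update rule (a decayed running sum), the function names, and the input values are assumptions chosen only to show how a forward pass and a backward pass are paired at every time step.<br>

```python
def run_recurrence(inputs, decay=0.5):
    """Toy recurrent pass: each state mixes the previous state with the current input."""
    state, states = 0.0, []
    for x in inputs:
        state = decay * state + x
        states.append(state)
    return states


def bidirectional(inputs, decay=0.5):
    """Pair a forward pass with a backward pass, aligned by time step."""
    forward = run_recurrence(inputs, decay)
    backward = run_recurrence(inputs[::-1], decay)[::-1]
    return list(zip(forward, backward))


outputs = bidirectional([1.0, 0.0, 0.0, 2.0])
for step, (fwd, bwd) in enumerate(outputs):
    print(step, fwd, bwd)
```

At the first step the forward state has only seen the first input, while the backward state already carries information from the last one. A one-directional model has no access to that later context.<br>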

## 📚 Libraries Needed<br>

1. TensorFlow
2. scikit-learn
3. pandas
4. matplotlib
5. seaborn
6. streamlit

## 📊 Exploratory Data Analysis Results<br>

### Insight<br>

In the dataset, 88% of the emails are ham and 12% are spam. This class imbalance makes accurate classification harder, because a model can score well on accuracy while still missing most of the spam.<br>
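
A quick way to see the effect of the imbalance is to score a classifier that always predicts ham. The sketch below uses made-up labels with the same 88/12 ratio (an assumption about how a test set might look) and a hypothetical helper `classification_metrics`. It suggests that precision and recall for the spam class are worth reporting next to accuracy.<br>

```python
def classification_metrics(y_true, y_pred, positive="spam"):
    """Accuracy plus precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


# Hypothetical labels with the same class ratio as the dataset.
y_true = ["ham"] * 88 + ["spam"] * 12
always_ham = ["ham"] * 100

baseline = classification_metrics(y_true, always_ham)
print(baseline)
```

The always-ham baseline reaches 88% accuracy without catching a single spam email, so any accuracy score near 88% should be read with that in mind.<br>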

![Spam vs Ham dataset](./../Image/Spam-vs-ham-piechart.jpg)

![Pairplot of Dataset](./../Image/PairPlot_withHue.png)



| Model                                    | Accuracy Score |
| ---------------------------------------- | -------------- |
| Machine Learning Model (BernoulliNB)     | 96%            |
| Deep Learning Model (LSTM)               | 88.58%         |
| Deep Learning Model (Bidirectional LSTM) | 98.56%         |

## 📈 Performance of the Models based on the Accuracy Scores<br>

1. Machine Learning Model (BernoulliNB) : 96%
2. Deep Learning Model (LSTM) : 88.58%
3. Deep Learning Model (Bidirectional LSTM) : 98.56%

## 📢 Conclusion<br>

The SPAM vs HAM Email Classification project, coupled with the Streamlit GUI, provides an effective solution for real-time email categorization. The bidirectional LSTM model achieves the highest accuracy of the three models (98.56%), and the user-friendly interface makes it accessible for practical use.<br>

## ✒️ Your Signature<br>

Dipayan Majumder<br>
[GitHub: dipayan22](https://github.com/dipayan22)
61 changes: 61 additions & 0 deletions Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app1.py
@@ -0,0 +1,61 @@
# Streamlit app for the machine learning model.

import pickle
import string

import nltk
import streamlit as st
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# The NLTK tokenizer, stopword and WordNet data must be available (see nltk.download).
lemmatizer = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every token.
stop_words = set(stopwords.words('english'))


def transform_text(text):
    """Lowercase, tokenize, drop non-alphanumeric tokens and stopwords, then lemmatize."""
    tokens = nltk.word_tokenize(text.lower())
    tokens = [t for t in tokens if t.isalnum()]
    tokens = [t for t in tokens if t not in stop_words and t not in string.punctuation]
    return " ".join(lemmatizer.lemmatize(t) for t in tokens)


# Load the fitted TF-IDF vectorizer and the trained model.
# Both .pkl files are expected in the working directory; adjust the paths if they are stored elsewhere.
with open('vectorizer.pkl', 'rb') as f:
    tfidf = pickle.load(f)
with open('bnb.pkl', 'rb') as f:
    model = pickle.load(f)

st.title('SMS Spam Classification')

sms_input = st.text_area("Enter the text")

if st.button('Predict'):
    # Preprocess the input, vectorize it, and predict.
    transform_sms = transform_text(sms_input)
    vector_input = tfidf.transform([transform_sms])
    result = model.predict(vector_input)[0]

    # A prediction of 1 means spam.
    if result == 1:
        st.title("SMS is Spam")
    else:
        st.title("SMS is not Spam")
40 changes: 40 additions & 0 deletions Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app2.py
@@ -0,0 +1,40 @@
# Streamlit GUI for the deep learning model.

import streamlit as st
import tensorflow as tf
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Load the trained model (replace the placeholder path with the saved model's location).
model2 = tf.keras.models.load_model('path/to/your/trained/model')

# Tokenizer. The fit below is a placeholder: for meaningful predictions, use the
# tokenizer fitted on the training texts so that word indices match the model.
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(['your', 'list', 'of', 'common', 'words'])

# Maximum sequence length (must match the length used when training the model).
max_length = 100


def main():
    st.title("SPAM vs HAM Email Classification")

    # User input
    user_input = st.text_area("Enter the email text:")

    if st.button("Predict"):
        # Tokenize and pad the input text
        input_sequence = tokenizer.texts_to_sequences([user_input])
        padded_input = pad_sequences(input_sequence, maxlen=max_length, padding='post', truncating='post')

        # Make the prediction
        prediction = model2.predict(padded_input)

        # Display the result. This assumes the model outputs the probability of HAM
        # (label 1 = ham); swap the branches if the model was trained with 1 = spam.
        if prediction[0][0] > 0.5:
            st.success("Prediction: HAM (Legitimate Email)")
        else:
            st.error("Prediction: SPAM")


if __name__ == '__main__':
    main()