Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Showing 13 changed files with 12,706 additions and 0 deletions.
5,172 changes: 5,172 additions & 0 deletions
Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/newData.csv
Large diffs are not rendered by default.
5,575 changes: 5,575 additions & 0 deletions
Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/spam-vs-ham-dataset.csv
Large diffs are not rendered by default.
Binary file added
BIN
+424 KB
Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/PairPlot_withHue.png
Binary file added
BIN
+10.9 KB
... Vs Ham Mail Classification [With Streamlit GUI]/Image/Spam-vs-ham-piechart.jpg
Binary file added
BIN
+29.7 KB
Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_chr.jpg
Binary file added
BIN
+25.8 KB
Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_sent.jpg
Binary file added
BIN
+30.3 KB
Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_word.jpg
98 changes: 98 additions & 0 deletions
Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/README.md
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,98 @@ | ||
# SPAM vs HAM Email Classification<br> | ||
|
||
## 🎯 Goal<br> | ||
|
||
The main goal of this project is to develop a robust deep learning model that classifies emails as either spam (SPAM) or legitimate (HAM). The project also provides a Streamlit GUI as a user-friendly interface for real-time email classification.<br> | ||
|
||
## 🧵 Dataset<br> | ||
|
||
The dataset used for this project is available [here](https://www.kaggle.com/datasets/omokennanna/simple-spam-classification). It consists of two columns: 'text' containing the email content and 'label' indicating whether the email is spam or ham.<br> | ||
|
||
## 🧾 Description<br> | ||
|
||
This project utilizes a deep learning model with an embedding layer, bidirectional LSTM layers, and a dense output layer. The model is trained on the provided dataset to classify emails as spam or ham. A Streamlit GUI is implemented to enable users to perform real-time email classification.<br> | ||
|
||
## 🧮 What I have done<br> | ||
|
||
### 1. Data Preparation<br> | ||
|
||
- The dataset is loaded and split into training and testing sets.<br> | ||
- Missing values are handled, and the text data is preprocessed.<br> | ||
|
||
### 2. Model Architecture<br> | ||
|
||
- For the machine learning model, I tried several algorithms to improve accuracy.<br> | ||
|
||
- The first deep learning model comprises an embedding layer, LSTM layers, and a dense output layer with a sigmoid activation function.<br> | ||
|
||
- The second deep learning model comprises an embedding layer, bidirectional LSTM layers, and a dense output layer with a sigmoid activation function.<br> | ||
|
||
### 3. Training the Model<br> | ||
|
||
The model is trained on the preprocessed dataset using the Adam optimizer and binary crossentropy loss. The training process is monitored for convergence and effectiveness.<br> | ||
|
||
### 4. Streamlit GUI<br> | ||
|
||
A Streamlit GUI is implemented for real-time email classification. Users can input an email, and the model predicts whether it is spam or ham.<br> | ||
|
||
## 🚀 Models Implemented<br> | ||
|
||
1. Machine Learning Model | ||
2. Deep Learning Model with LSTM Layers | ||
3. Deep Learning Model with Bidirectional LSTM Layers | ||
|
||
**Why these models:**<br> | ||
|
||
1. **Machine Learning Model:**<br> | ||
|
||
This traditional machine learning model serves as a baseline and allows us to compare the performance of deep learning models against a more conventional approach.<br> | ||
|
||
2. **Deep Learning Model with LSTM Layers:**<br> | ||
|
||
LSTM layers are particularly effective for sequential data, making them suitable for capturing long-range dependencies and patterns within the input data.<br> | ||
|
||
3. **Deep Learning Model with Bidirectional LSTM Layers:**<br> | ||
|
||
Bidirectional LSTM layers enhance the LSTM model by processing sequences in both forward and backward directions, allowing the model to capture information from past and future time steps simultaneously.<br> | ||
|
||
## 📚 Libraries Needed<br> | ||
|
||
1. TensorFlow | ||
2. scikit-learn | ||
3. pandas | ||
4. matplotlib | ||
5. seaborn | ||
6. streamlit | ||
|
||
## 📊 Exploratory Data Analysis Results<br> | ||
|
||
### Insight<br> | ||
|
||
The dataset is 88% ham and 12% spam. This class imbalance makes accurate classification harder, because a model can score high accuracy simply by favoring the majority class.<br> | ||
|
||
 | ||
|
||
 | ||
|
||
|
||
|
||
| Model | Accuracy Score | | ||
| ---------------------------------- | -------------- | | ||
| Machine Learning Model (BernoulliNB)| 96% | | ||
| Deep Learning Model (LSTM) | 88.58% | | ||
| Deep Learning Model (Bidirectional LSTM)| 98.56% | | ||
|
||
## 📈 Performance of the Models based on the Accuracy Scores<br> | ||
|
||
1. Machine Learning Model (BernoulliNB) : 96% | ||
2. Deep Learning Model (LSTM) : 88.58% | ||
3. Deep Learning Model (Bidirectional LSTM) : 98.56% | ||
|
||
## 📢 Conclusion<br> | ||
|
||
The SPAM vs HAM Email Classification project, coupled with the Streamlit GUI, provides a practical solution for real-time email categorization. The bidirectional LSTM model achieves the highest reported accuracy (98.56%), and the user-friendly interface makes it accessible for practical use.<br> | ||
|
||
## ✒️ Signature<br> | ||
|
||
Dipayan Majumder<br> | ||
[GitHub: dipayan22](https://github.com/dipayan22) |
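
With the 88% ham / 12% spam split reported in the README's EDA section, accuracy alone can be misleading. The following is a minimal pure-Python sketch of why, not part of the committed code: the function names `confusion_counts` and `summarize` are hypothetical, the label encoding (0 = ham, 1 = spam) is an assumption, and the 100 sample labels are made up only to mirror the 88/12 ratio.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, FN, TN, treating `positive` as the spam label (an assumption)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def summarize(y_true, y_pred):
    """Return accuracy plus precision and recall for the spam class."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Illustrative labels only: 100 messages with an 88/12 ham/spam split.
y_true = [0] * 88 + [1] * 12
always_ham = [0] * 100  # baseline that never predicts spam

baseline = summarize(y_true, always_ham)
print(baseline)
```

A baseline that never predicts spam already reaches 88% accuracy while catching no spam at all, so accuracy scores close to that level (for example the 88.58% reported for the LSTM model) are worth checking against precision and recall before drawing conclusions.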
61 changes: 61 additions & 0 deletions
Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app1.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,61 @@ | ||
# This Streamlit App is for the Machine Learning model. | ||
|
||
|
||
import streamlit as st | ||
from nltk.stem import WordNetLemmatizer | ||
import pickle | ||
import nltk | ||
import string | ||
from nltk.corpus import stopwords | ||
|
||
|
||
lemmatizer = WordNetLemmatizer() | ||
|
||
# function of Data Preprocessing. | ||
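def transform_text(text): | ||
    """Lowercase, tokenize, keep alphanumeric non-stopword tokens, and lemmatize.""" | ||
    # Build the stopword set once per call instead of re-reading the list for every token. | ||
    stop_words = set(stopwords.words('english')) | ||
|
||
    # Tokenize the lowercased text. | ||
    tokens = nltk.word_tokenize(text.lower()) | ||
|
||
    # Keep alphanumeric tokens only; this also discards punctuation tokens. | ||
    tokens = [t for t in tokens if t.isalnum()] | ||
|
||
    # Drop English stopwords. | ||
    tokens = [t for t in tokens if t not in stop_words] | ||
|
||
    # Reduce each remaining token to its lemma and rebuild the string. | ||
    return " ".join(lemmatizer.lemmatize(t) for t in tokens) | ||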
|
||
|
||
# Load the fitted TF-IDF vectorizer and the trained model from their pickle files. | ||
# Both .pkl files can be kept in a specific folder; adjust the paths below accordingly. | ||
with open('vectorizer.pkl', 'rb') as f: | ||
    tfidf = pickle.load(f) | ||
with open('bnb.pkl', 'rb') as f: | ||
    model = pickle.load(f) | ||
|
||
st.title('SMS Spam Classification') | ||
|
||
sms_input=st.text_area("Enter the text") | ||
|
||
if st.button('Predict'): | ||
    transform_sms = transform_text(sms_input) | ||
|
||
    vector_input = tfidf.transform([transform_sms]) | ||
|
||
    result = model.predict(vector_input)[0] | ||
|
||
    if result == 1: | ||
        st.title("SMS is Spam") | ||
|
||
    else: | ||
        st.title("SMS is not Spam") |
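
The stages of `transform_text` (lowercase, tokenize, keep alphanumeric tokens, drop stopwords) can be tried out without downloading any NLTK data. The sketch below is a dependency-free stand-in, not the app's code: it assumes a regex tokenizer in place of `nltk.word_tokenize`, a deliberately tiny hypothetical `STOP_WORDS` set in place of NLTK's English stopword list, and it skips lemmatization entirely.

```python
import re

# Hypothetical, deliberately tiny stopword set; the app uses NLTK's English list.
STOP_WORDS = {"a", "an", "the", "is", "to", "you", "your", "and", "of"}


def simple_transform(text):
    """Lowercase, split into alphanumeric tokens, and drop stopwords."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return " ".join(t for t in tokens if t not in STOP_WORDS)


cleaned = simple_transform("Congratulations! You WON a prize, claim your reward now.")
print(cleaned)
```

Output from this stand-in will differ from the NLTK pipeline on contractions and on words that lemmatization would change, so it is only useful for seeing the shape of the text that reaches the vectorizer.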
40 changes: 40 additions & 0 deletions
Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app2.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,40 @@ | ||
# This Streamlit GUI is used for the Deep Learning Model. | ||
|
||
import streamlit as st | ||
import tensorflow as tf | ||
from tensorflow.keras.preprocessing.sequence import pad_sequences | ||
import numpy as np | ||
|
||
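# Load the trained model (replace the placeholder path with your saved model). | ||
model2 = tf.keras.models.load_model('path/to/your/trained/model') | ||
|
||
# Tokenizer: this must be the tokenizer that was fitted on the training texts, | ||
# otherwise the word indices will not match the model's embedding layer. | ||
# The word list below is only a placeholder. | ||
tokenizer = tf.keras.preprocessing.text.Tokenizer() | ||
tokenizer.fit_on_texts(['your', 'list', 'of', 'common', 'words']) | ||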
|
||
# Define the maximum sequence length (adjust based on your model) | ||
max_length = 100 | ||
|
||
# Streamlit App | ||
def main(): | ||
    st.title("SPAM vs HAM Email Classification") | ||
|
||
    # User input | ||
    user_input = st.text_area("Enter the email text:") | ||
|
||
    if st.button("Predict"): | ||
        # Tokenize and pad the input text | ||
        input_sequence = tokenizer.texts_to_sequences([user_input]) | ||
        padded_input = pad_sequences(input_sequence, maxlen=max_length, padding='post', truncating='post') | ||
|
||
        # Make the prediction | ||
        prediction = model2.predict(padded_input) | ||
|
||
        # Display the result. | ||
        # This assumes the sigmoid output is the probability of HAM. app1.py treats | ||
        # label 1 as spam, so verify the label encoding used during training. | ||
        if prediction[0][0] > 0.5: | ||
            st.success("Prediction: HAM (Legitimate Email)") | ||
        else: | ||
            st.error("Prediction: SPAM") | ||
|
||
if __name__ == '__main__': | ||
    main() |
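
The tokenize-and-pad step in `main()` can be illustrated without TensorFlow. The sketch below is a pure-Python approximation of what `texts_to_sequences` followed by `pad_sequences(..., padding='post', truncating='post')` produces. The names `word_index`, `to_sequence`, and `pad_post` are hypothetical, the vocabulary is made up, and the sketch assumes a tokenizer with no out-of-vocabulary token, so unknown words are simply dropped.

```python
# Hypothetical vocabulary; a real one comes from the tokenizer fitted on the training texts.
word_index = {"free": 1, "prize": 2, "call": 3, "now": 4}


def to_sequence(text, word_index):
    """Map known words to integer ids; unknown words are dropped (no OOV token assumed)."""
    return [word_index[w] for w in text.lower().split() if w in word_index]


def pad_post(seq, maxlen, value=0):
    """Truncate from the end to `maxlen`, then pad with `value` at the end."""
    seq = seq[:maxlen]
    return seq + [value] * (maxlen - len(seq))


seq = to_sequence("Call now for a FREE prize", word_index)
padded = pad_post(seq, maxlen=6)
truncated = pad_post([1, 2, 3, 4, 1, 2, 3, 4], maxlen=6)
print(seq, padded, truncated)
```

Because unknown words are dropped, a tokenizer fitted only on a handful of placeholder words would turn most emails into nearly all-zero input, which is why the tokenizer from training needs to be reused at prediction time.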