Add files via upload

abhisheks008 · Feb 19, 2024 · 6f323e3
1 parent 7137bb5
commit 6f323e3
Showing 13 changed files with 12,706 additions and 0 deletions.
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/newData.csv b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/newData.csv
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/spam-vs-ham-dataset.csv b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Dataset/spam-vs-ham-dataset.csv
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/PairPlot_withHue.png b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/PairPlot_withHue.png
diff --git a/... Vs Ham Mail Classification [With Streamlit GUI]/Image/Spam-vs-ham-piechart.jpg b/... Vs Ham Mail Classification [With Streamlit GUI]/Image/Spam-vs-ham-piechart.jpg
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_chr.jpg b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_chr.jpg
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_sent.jpg b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_sent.jpg
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_word.jpg b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Image/spam-ham-num_word.jpg
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/README.md b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/README.md
@@ -0,0 +1,98 @@
+# SPAM vs HAM Email Classification<br>
+
+## 🎯 Goal<br>
+
+The main goal of this project is to develop a robust deep learning model for classifying emails as either spam (SPAM) or legitimate (HAM). Additionally, the project aims to create a Streamlit GUI for a user-friendly interface in real-time email classification.<br>
+
+## 🧵 Dataset<br>
+
+The dataset used for this project is available [here](https://www.kaggle.com/datasets/omokennanna/simple-spam-classification). It consists of two columns: 'text' containing the email content and 'label' indicating whether the email is spam or ham.<br>
+
+## 🧾 Description<br>
+
+This project utilizes a deep learning model with an embedding layer, bidirectional LSTM layers, and a dense output layer. The model is trained on the provided dataset to classify emails as spam or ham. A Streamlit GUI is implemented to enable users to perform real-time email classification.<br>
+
+## 🧮 What I have done<br>
+
+### 1. Data Preparation<br>
+
+- The dataset is loaded and split into training and testing sets.<br>
+- Missing values are handled, and the text data is preprocessed.<br>
+
+### 2. Model Architecture<br>
+
+- For the machine learning model, I tried several algorithms to improve accuracy; the reported results are for BernoulliNB.<br>
+
+- The first deep learning model comprises an embedding layer, LSTM layers, and a dense output layer with a sigmoid activation function.<br>
+
+- The second deep learning model comprises an embedding layer, bidirectional LSTM layers, and a dense output layer with a sigmoid activation function.<br>
+
+### 3. Training the Model<br>
+
+The model is trained on the preprocessed dataset using the Adam optimizer and binary crossentropy loss. The training process is monitored for convergence and effectiveness.<br>
+
+### 4. Streamlit GUI<br>
+
+A Streamlit GUI is implemented for real-time email classification. Users can input an email, and the model predicts whether it is spam or ham.<br>
+
+## 🚀 Models Implemented<br>
+
+1. Machine Learning Model
+2. Deep Learning Model with LSTM Layers
+3. Deep Learning Model with Bidirectional LSTM Layers
+
+**Why these models:**<br>
+
+1. **Machine Learning Model:**<br>
+
+    This traditional machine learning model serves as a baseline and allows us to compare the performance of deep learning models against a more conventional approach.<br>
+
+2. **Deep Learning Model with LSTM Layers:**<br>
+
+    LSTM layers are particularly effective for sequential data, making them suitable for capturing long-range dependencies and patterns within the input data.<br>
+
+3. **Deep Learning Model with Bidirectional LSTM Layers:**<br>
+
+    Bidirectional LSTM layers enhance the LSTM model by processing sequences in both forward and backward directions, allowing the model to capture information from past and future time steps simultaneously.<br>
+
+## 📚 Libraries Needed<br>
+
+1. TensorFlow
+2. scikit-learn
+3. pandas
+4. matplotlib
+5. seaborn
+6. streamlit
+
+## 📊 Exploratory Data Analysis Results<br>
+
+### Insight<br>
+
+The dataset contains 88% ham and 12% spam emails. This class imbalance makes it harder to classify the minority spam class accurately.<br>
+
+![Spam vs Ham dataset](./../Image/Spam-vs-ham-piechart.jpg)
+
+![Pairplot of Dataset](./../Image/PairPlot_withHue.png)
+
+
+## 📈 Performance of the Models based on the Accuracy Scores<br>
+
+| Model                                    | Accuracy Score |
+| ---------------------------------------- | -------------- |
+| Machine Learning Model (BernoulliNB)     | 96%            |
+| Deep Learning Model (LSTM)               | 88.58%         |
+| Deep Learning Model (Bidirectional LSTM) | 98.56%         |
+
+## 📢 Conclusion<br>
+
+The SPAM vs HAM Email Classification project, coupled with the Streamlit GUI, provides a working solution for real-time email classification. The bidirectional LSTM model achieves the highest accuracy of the three models (98.56%), and the Streamlit interface makes it easy to try on new emails.<br>
+
+## ✒️ Your Signature<br>
+
+Dipayan Majumder<br>  
+[GitHub: dipayan22](https://github.com/dipayan22)
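
The README reports an 88% / 12% ham-to-spam split and compares models by accuracy alone. The sketch below is a hedged illustration of why that split matters: a classifier that always predicts ham already scores 88% accuracy, close to the 88.58% reported for the plain LSTM, while catching no spam at all. The function names and the 1,000-email counts are hypothetical and are not taken from the project's notebooks.

```python
# Class shares reported in the README's EDA section.
HAM_SHARE = 0.88
SPAM_SHARE = 0.12


def majority_baseline_accuracy(class_shares):
    """Accuracy of a classifier that always predicts the most common class."""
    return max(class_shares.values())


def precision_recall(tp, fp, fn):
    """Precision and recall for the positive (spam) class from raw counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


baseline = majority_baseline_accuracy({"ham": HAM_SHARE, "spam": SPAM_SHARE})
print(f"Always-ham baseline accuracy: {baseline:.0%}")

# Hypothetical counts for 1,000 emails (880 ham, 120 spam) scored by a model
# that predicts ham for everything: no spam is ever caught.
precision, recall = precision_recall(tp=0, fp=0, fn=120)
print(f"Spam precision: {precision:.2f}, spam recall: {recall:.2f}")
```

Reporting spam precision and recall next to accuracy would make the comparison between the three models easier to interpret on an imbalanced dataset like this one.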
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app1.py b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app1.py
@@ -0,0 +1,61 @@
+# This Streamlit App is for the Machine Learning model.
+
+
+import pickle
+
+import nltk
+import streamlit as st
+from nltk.corpus import stopwords
+from nltk.stem import WordNetLemmatizer
+
+lemmatizer = WordNetLemmatizer()
+
+# Build the stopword set once; calling stopwords.words() for every token is slow.
+STOP_WORDS = set(stopwords.words('english'))
+
+
+# Preprocess raw text: lowercase, tokenize, filter, and lemmatize.
+def transform_text(text):
+    tokens = nltk.word_tokenize(text.lower())
+
+    # Keep alphanumeric tokens that are not stopwords. The isalnum() check
+    # already drops punctuation-only tokens.
+    tokens = [t for t in tokens if t.isalnum() and t not in STOP_WORDS]
+
+    # Reduce each remaining token to its lemma and rejoin into one string.
+    return " ".join(lemmatizer.lemmatize(t) for t in tokens)
+
+
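+# Load the fitted TF-IDF vectorizer and the trained model (bnb.pkl).
+# Both .pkl files are expected in the working directory; adjust the paths
+# if they are stored in a specific folder.
+with open('vectorizer.pkl', 'rb') as f:
+    tfidf = pickle.load(f)
+with open('bnb.pkl', 'rb') as f:
+    model = pickle.load(f)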
+
+st.title('SMS Spam Classification')
+
+sms_input = st.text_area("Enter the text")
+
+if st.button('Predict'):
+    # Preprocess the input, then vectorize it with the loaded TF-IDF vectorizer.
+    transform_sms = transform_text(sms_input)
+    vector_input = tfidf.transform([transform_sms])
+
+    # A prediction of 1 is shown as spam; anything else as not spam.
+    result = model.predict(vector_input)[0]
+
+    if result == 1:
+        st.title("SMS is Spam")
+    else:
+        st.title("SMS is not Spam")
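
The filtering stages of `transform_text` in `app1.py` can be checked without downloading NLTK data. The sketch below is a dependency-free approximation under stated assumptions: `simple_tokenize` is a regex stand-in for `nltk.word_tokenize`, `STOP_WORDS` is a deliberately tiny hypothetical list rather than NLTK's English stopwords, and lemmatization is left out because it needs the WordNet corpus.

```python
import re

# Hypothetical, deliberately tiny stopword set; the app uses NLTK's English list.
STOP_WORDS = frozenset({"a", "an", "the", "is", "to", "you", "your", "for", "of"})


def simple_tokenize(text):
    """Stand-in for nltk.word_tokenize: lowercase, then split into words and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def filter_tokens(tokens, stop_words=STOP_WORDS):
    """Keep alphanumeric tokens that are not stopwords, preserving their order."""
    return [t for t in tokens if t.isalnum() and t not in stop_words]


sample = "Congratulations! You have WON a free prize, claim it now."
print(" ".join(filter_tokens(simple_tokenize(sample))))
```

Keeping the stopwords in a set (here a `frozenset`) makes each membership test constant time, which is the main reason to build the collection once instead of once per token.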
diff --git a/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app2.py b/Spam Vs Ham Mail Classification [With Streamlit GUI]/Model/app2.py
@@ -0,0 +1,40 @@
+# This Streamlit GUI is used for the Deep Learning Model.
+
+import streamlit as st
+import tensorflow as tf
+from tensorflow.keras.preprocessing.sequence import pad_sequences
+import numpy as np
+
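+# Load the trained model (replace the placeholder path with the saved model's location)
+model2 = tf.keras.models.load_model('path/to/your/trained/model')
+
+# Placeholder tokenizer: the word list below is only an example. Replace it
+# with the tokenizer fitted on the training data, because the word indices
+# must match the ones the model was trained with.
+tokenizer = tf.keras.preprocessing.text.Tokenizer()
+tokenizer.fit_on_texts(['your', 'list', 'of', 'common', 'words'])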
+
+# Define the maximum sequence length (adjust based on your model)
+max_length = 100
+
+# Streamlit App
+def main():
+    st.title("SPAM vs HAM Email Classification")
+
+    # User input
+    user_input = st.text_area("Enter the email text:")
+
+    if st.button("Predict"):
+        # Tokenize and pad the input text
+        input_sequence = tokenizer.texts_to_sequences([user_input])
+        padded_input = pad_sequences(input_sequence, maxlen=max_length, padding='post', truncating='post')
+
+        # Make the prediction
+        prediction = model2.predict(padded_input)
+
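+        # Display the result. Scores above 0.5 are reported as HAM, which
+        # assumes ham was encoded as 1 when the model was trained.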
+        if prediction[0][0] > 0.5:
+            st.success("Prediction: HAM (Legitimate Email)")
+        else:
+            st.error("Prediction: SPAM")
+
+if __name__ == '__main__':
+    main()
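
In `app2.py` the tokenizer is fitted on a five-word placeholder list, so it may help to see what that does to an input. The sketch below is a pure-Python stand-in that only loosely mimics word-index encoding with post-padding; it is not the Keras implementation. `fit_word_index`, `text_to_padded`, and the sample training sentences are hypothetical names and data introduced for illustration.

```python
def fit_word_index(texts):
    """Map each word to an integer starting at 1, most frequent first (0 is reserved for padding)."""
    counts = {}
    for text in texts:
        for word in text.lower().split():
            counts[word] = counts.get(word, 0) + 1
    # sorted() is stable, so words with equal counts keep first-seen order.
    ordered = sorted(counts, key=lambda w: -counts[w])
    return {word: i for i, word in enumerate(ordered, start=1)}


def text_to_padded(text, word_index, max_length):
    """Encode known words, drop unknown ones, then truncate and zero-pad at the end."""
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[:max_length]
    return ids + [0] * (max_length - len(ids))


email = "claim your free prize now"

# Vocabulary fitted on (hypothetical) training emails.
train_index = fit_word_index(["free prize inside", "claim your free gift now", "meeting at noon"])
# Vocabulary fitted on the placeholder list from app2.py.
placeholder_index = fit_word_index(["your", "list", "of", "common", "words"])

print(text_to_padded(email, train_index, max_length=8))        # [4, 5, 1, 2, 7, 0, 0, 0]
print(text_to_padded(email, placeholder_index, max_length=8))  # [1, 0, 0, 0, 0, 0, 0, 0]
```

With the placeholder vocabulary almost every word is dropped and the model would see little more than padding. Saving the tokenizer fitted during training and loading it in the app is one way to avoid this.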