Spam Email Classification

Introduction

In today's digital world, spam emails are a constant nuisance. They clog inboxes, waste time, and can pose security threats when they carry phishing attempts or malware. This document details the development of a machine learning model that tackles this problem by automatically classifying incoming emails as spam or legitimate (ham).

Data Cleaning

Data Description: Our data set has 5 columns. The first column is the index of the email; the second column, “# sent email”, gives the number of times the email has been sent; the third column, “label”, marks the email as ham or spam; the fourth column, “text”, holds the email content; and the final column, “label_num”, encodes the third column as 0s and 1s.
Data Cleaning: First, we check for null values and duplicates and find none of either. We then rename “label_num” to “IS_Spam” for readability, since this is the target column.
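
The cleaning steps described above can be sketched with pandas as follows. This is a minimal illustration rather than the notebook's actual code: the three rows are made up to stand in for spam_ham_dataset.csv, and only the column names “label”, “text”, “label_num”, and “IS_Spam” come from this README.

```python
import pandas as pd

# Tiny made-up stand-in for spam_ham_dataset.csv (illustrative rows only).
df = pd.DataFrame(
    {
        "label": ["ham", "spam", "ham"],
        "text": [
            "Subject: meeting at noon",
            "Subject: win a free prize",
            "Subject: invoice attached",
        ],
        "label_num": [0, 1, 0],
    }
)

# Count missing values per column and fully duplicated rows.
null_counts = df.isnull().sum()
duplicate_count = int(df.duplicated().sum())

# Rename the numeric target column to the more readable name.
df = df.rename(columns={"label_num": "IS_Spam"})

print(null_counts.to_dict())
print(duplicate_count)
print(list(df.columns))
```

On the real data, the same checks would presumably run on the frame returned by `pd.read_csv("spam_ham_dataset.csv")`.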

Data Exploring

Histogram

Pie Chart

Word cloud for Spam emails

Word cloud for ham emails

Top 30 words in Spam Emails

Top 30 words in ham emails
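
A top-words ranking like the ones above can be produced with a simple frequency count. The sketch below is an assumption about the approach, not the notebook's code: the helper `top_words`, the tokenizing regular expression, and the sample texts are all hypothetical, and no stop-word removal is applied.

```python
import re
from collections import Counter


def top_words(texts, n=30):
    """Return the n most frequent lowercase word tokens across texts."""
    counts = Counter()
    for text in texts:
        # Simple tokenizer: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return counts.most_common(n)


# Made-up spam snippets for illustration.
spam_texts = ["free prize free entry", "claim your free prize now"]
print(top_words(spam_texts, n=2))  # [('free', 3), ('prize', 2)]
```

Calling the same helper on the ham texts with `n=30` would give the corresponding ranking for legitimate emails.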

Data Modeling

We apply TF-IDF to the “text” column and train the following models on the resulting features.
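
As a rough sketch of this step, the snippet below vectorizes a handful of made-up emails with scikit-learn's `TfidfVectorizer` and fits the five model families listed in this section. The hyperparameters, the toy texts, and the convention that 1 means spam and 0 means ham are assumptions for illustration; the notebook's actual settings may differ, and a real run would also hold out a test split.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Made-up emails; labels assume 1 = spam and 0 = ham.
texts = [
    "win a free prize now",
    "claim your free cash reward",
    "cheap meds free shipping",
    "meeting moved to friday",
    "please review the attached report",
    "lunch with the team tomorrow",
]
labels = [1, 1, 1, 0, 0, 0]

# Learn the vocabulary and IDF weights, then turn each email into a TF-IDF vector.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# One estimator per model family; hyperparameters are illustrative guesses.
models = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "SVM": SVC(kernel="linear"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=3),
}

# New text must be transformed with the already fitted vectorizer.
new_X = vectorizer.transform(["free prize waiting for you"])

predictions = {}
for name, model in models.items():
    model.fit(X, labels)
    predictions[name] = int(model.predict(new_X)[0])

print(predictions)
```

The fitted vectorizer has to be saved together with the chosen model, since new emails must be transformed with the same vocabulary before prediction.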

Naïve Bayes

Accuracy of training = 0.9400386

Precision of training = 0.94599

Recall of training = 0.84245

Accuracy of testing = 0.9333

Precision of testing = 0.95599

Recall of testing = 0.80245

Confusion Matrix:

Logistic Regression

Accuracy of training = 0.9700

Precision of training = 0.9307

Recall of training = 0.96932

Accuracy of testing = 0.96908

Precision of testing = 0.93939

Recall of testing = 0.9522

Confusion Matrix:

SVM

Accuracy of training = 0.97340

Precision of training = 0.92349

Recall of training = 0.9908

Accuracy of testing = 0.97294

Precision of testing = 0.91536

Recall of testing = 0.9965

Confusion Matrix:

Decision Tree

Accuracy of training = 0.99854

Precision of training = 0.9950

Recall of training = 1.0

Accuracy of testing = 0.956

Precision of testing = 0.9078

Recall of testing = 0.9419

Confusion Matrix:

KNN

Accuracy of training = 0.9514

Precision of training = 0.9122

Recall of training = 0.92205

Accuracy of testing = 0.9062

Precision of testing = 0.8379

Recall of testing = 0.8293

Confusion Matrix:
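
For reference, the accuracy, precision, and recall figures reported for each model can all be derived from the four cells of a confusion matrix. The sketch below uses hypothetical helper names and made-up labels, and assumes spam is the positive class encoded as 1.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1  # spam correctly flagged
        elif pred == positive:
            fp += 1  # ham wrongly flagged as spam
        elif truth == positive:
            fn += 1  # spam that slipped through
        else:
            tn += 1  # ham correctly passed
    return tp, fp, fn, tn


def scores(y_true, y_pred):
    """Return (accuracy, precision, recall) computed from the confusion counts."""
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall


# Made-up labels: 3 spam and 5 ham emails.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

accuracy, precision, recall = scores(y_true, y_pred)
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 4)
print(accuracy)  # 0.75
```

Precision penalizes legitimate emails flagged as spam, while recall penalizes spam that reaches the inbox, which is why the two can diverge even when accuracy is high.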

Contributors

Zeina Wady

Sara Darwish

Ruba AbdELSalam

Basmala Ayman

Sara Habib

Bassant Ahmed

sara-saye/Spam_Email_Classification
