Analysing Patterns in an SMS Spam Dataset Using Data Mining Techniques

Project Overview

Course Code: MIT 816

Course Title: Data Mining

This assignment was submitted as part of my coursework requirements for the Master in Information Technology (MIT) program at the University of Lagos, Nigeria.

This project analyzes and predicts patterns in an SMS Spam Collection dataset using two data mining techniques: classification and clustering. The primary goal is to classify SMS messages as spam or ham (legitimate) using text processing, exploratory data analysis (EDA), and machine learning models.

Dataset

The dataset used is the SMS Spam Collection Dataset, which consists of 5,574 SMS messages labeled as either spam or ham.

Dataset Attributes:

v1: A categorical label indicating whether the message is spam or ham.
v2: The text content of the message.
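
As a rough illustration of how the two columns can be read, the sketch below parses CSV lines into (label, text) pairs using only the standard library. The function name `read_sms` is hypothetical, and the `latin-1` encoding mentioned in the usage comment is an assumption about how `spam.csv` is encoded, not something this README states.

```python
import csv

def read_sms(lines):
    """Parse CSV lines with 'v1' (label) and 'v2' (text) columns into pairs."""
    rows = []
    for record in csv.DictReader(lines):
        # Keep only the two documented columns; any extra columns are ignored.
        rows.append((record["v1"].strip(), record["v2"]))
    return rows

# Toy input standing in for the real file (not actual dataset rows).
sample = ["v1,v2", "ham,See you at lunch", 'spam,"Free entry, call now"']
pairs = read_sms(sample)

# Possible usage with the real file (encoding is an assumption):
# with open("spam.csv", newline="", encoding="latin-1") as f:
#     pairs = read_sms(f)
```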

Project Structure

├── plots/                   # Visualizations in PNG format (word clouds, histograms, confusion matrix, etc.)
├── spam.csv                 # SMS Spam Collection dataset
├── requirements.txt         # Python dependencies
├── README.md                # Project documentation
└── analysis.py              # Main script

Installation & Setup

Prerequisites

Python 3.x
Required libraries (install using pip):
```
pip3 install -r requirements.txt
```
Run the main script:
```
python3 analysis.py
```

Data Preprocessing

The dataset is clean and does not contain any missing values in the SMS text or labels (spam or ham). However, we need to preprocess the text data to handle inconsistencies:

Normalize the text by converting it to lowercase.
Remove punctuation and special characters.
Tokenize the text into words.
Remove stop words (words that carry little meaning on their own, such as "is" and "the").
Perform stemming using PorterStemmer.
Apply a TF-IDF (Term Frequency-Inverse Document Frequency) transformation for feature extraction.
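
The steps above can be sketched roughly as follows. This is a minimal illustration, not the exact code in `analysis.py`: the `preprocess` function, the tiny `STOP_WORDS` set, and the two sample messages are all assumptions made for the example. A real run would use a full stop-word list (for instance NLTK's stopwords corpus, which requires a one-time download). `PorterStemmer` comes from NLTK and `TfidfVectorizer` from scikit-learn.

```python
import re

from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative subset only; a full stop-word list would be used in practice.
STOP_WORDS = {"a", "an", "and", "are", "is", "of", "the", "to", "you", "your"}
stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase, strip punctuation, tokenize, drop stop words, and stem."""
    text = text.lower()
    # Replace anything that is not a lowercase letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    tokens = text.split()  # simple whitespace tokenization
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return " ".join(stemmer.stem(t) for t in tokens)

# Toy messages, not taken from the dataset.
messages = [
    "Free entry! Call now to claim your prize.",
    "Are you coming to the meeting later?",
]
cleaned = [preprocess(m) for m in messages]

# TF-IDF turns the cleaned strings into a sparse document-term matrix.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(cleaned)
print(tfidf.shape)  # (number of messages, vocabulary size)
```

Returning a joined string from `preprocess` keeps the output directly usable by `TfidfVectorizer`, which expects raw text documents by default.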

Exploratory Data Analysis (EDA)

EDA helps in understanding the distribution of the data, identifying trends, and visualizing relationships between different features. In our case, we will focus on the distribution of the target labels (spam vs ham), the characteristics of the message text (e.g., message length), and the most frequently occurring words.

Label distribution (spam vs ham). We need to analyze how many messages are labeled as “spam” and how many as “ham” to check for any class imbalance.
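
A minimal sketch of the class-balance check, assuming the labels are available as a plain list of strings. The helper name `label_distribution` and the toy labels are hypothetical.

```python
from collections import Counter

def label_distribution(labels):
    """Return {label: (count, share_of_total)} for a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in counts.items()}

# Toy labels, not the real dataset counts.
labels = ["ham", "ham", "spam", "ham"]
distribution = label_distribution(labels)
for label, (count, share) in distribution.items():
    print(f"{label}: {count} messages ({share:.0%})")
```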

Message length analysis. We also explore the length of messages by calculating the number of characters and words in each message. This is important because spam messages tend to be either very short (containing promotional phrases) or very long (to look legitimate).
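
The length features described here could be computed along these lines. This is a sketch under the assumption that the data is a list of (label, text) pairs; `length_features` and `mean_lengths_by_label` are hypothetical helper names.

```python
from statistics import mean

def length_features(text):
    """Character and whitespace-separated word counts for one message."""
    return {"chars": len(text), "words": len(text.split())}

def mean_lengths_by_label(rows):
    """Average character and word counts per label for (label, text) rows."""
    grouped = {}
    for label, text in rows:
        grouped.setdefault(label, []).append(length_features(text))
    return {
        label: {
            "chars": mean(f["chars"] for f in feats),
            "words": mean(f["words"] for f in feats),
        }
        for label, feats in grouped.items()
    }

# Toy rows, not taken from the dataset.
rows = [("ham", "ok see you"), ("spam", "WIN a FREE prize now")]
summary = mean_lengths_by_label(rows)
```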

[Figures: Message Length Distribution (hist 1), Message Length Distribution (hist 2)]

Word cloud visualization for spam and ham messages. We create separate word clouds for “spam” and “ham” messages to understand the common vocabulary used in each category.

Top words frequency analysis. We extract the most frequent words in both spam and ham messages after preprocessing. This provides insight into the content of spam messages versus legitimate ones.
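
One way to count the most frequent words per class is sketched below. It assumes the messages have already been preprocessed into space-separated tokens; the helper name `top_words` and the toy rows are hypothetical.

```python
from collections import Counter

def top_words(rows, label, n=3):
    """Most common tokens among messages carrying the given label."""
    counter = Counter()
    for row_label, text in rows:
        if row_label == label:
            counter.update(text.split())
    return counter.most_common(n)

# Toy, already-preprocessed rows (not from the dataset).
rows = [
    ("spam", "free call free"),
    ("spam", "call free"),
    ("ham", "see you later"),
]
spam_top = top_words(rows, "spam", n=2)
```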

Pattern Discovery

We apply two data mining techniques to the SMS Spam Collection Dataset: classification, to predict whether a message is spam or ham, and clustering, to explore natural groupings in the dataset even though we already know the labels.

Classification using Logistic Regression:

Using PyCaret, we set up the dataset, train multiple classification models, compare their performance, and select the best model for spam detection. Once the models are compared, PyCaret automatically ranks them based on performance metrics such as accuracy, precision, recall, and F1-score.
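
The project itself relies on PyCaret for model comparison. As a lighter stand-in, the sketch below uses scikit-learn directly to show what a TF-IDF plus logistic regression classifier looks like. It is not the PyCaret workflow, and the toy messages and labels are invented for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data, not taken from the dataset.
texts = [
    "free prize call now",
    "win cash text claim",
    "urgent free offer call",
    "claim your free ringtone",
    "see you at lunch",
    "are you coming home",
    "ok talk later",
    "thanks for the lift",
]
labels = ["spam"] * 4 + ["ham"] * 4

# Vectorize the text and fit the classifier in one pipeline.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

predictions = model.predict(["free prize call now"])
print(predictions[0])
```

Wrapping both steps in a pipeline means the vectorizer's vocabulary is learned only from the training text, which avoids leaking information from held-out messages when the data is split.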

Clustering with K-Means:

We also perform K-Means Clustering to explore the possibility of identifying natural clusters in the data. While we already have labels (spam or ham), clustering allows us to examine if the messages group naturally into distinct categories.
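
A minimal sketch of this idea with scikit-learn, assuming TF-IDF features and two clusters. The toy messages, the `random_state`, and the cross-tabulation step are illustrative choices rather than the settings used in `analysis.py`.

```python
from collections import Counter

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, not taken from the dataset.
texts = [
    "free prize call now",
    "win cash text claim",
    "urgent free offer call",
    "see you at lunch",
    "are you coming home",
    "ok talk later",
]
labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

features = TfidfVectorizer().fit_transform(texts)

# Two clusters, to see whether they line up with spam and ham.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(features)

# Count how often each (cluster, known label) combination occurs.
table = Counter(zip(cluster_ids.tolist(), labels))
for (cluster, label), count in sorted(table.items()):
    print(f"cluster {cluster} / {label}: {count}")
```

K-Means assigns cluster numbers arbitrarily, so the known labels are only used afterwards to interpret which cluster is which.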

Results & Insights

The best-performing model was logistic regression, with an accuracy of 89%, precision of 90%, and recall of 89%.

The confusion matrix shows that the model has 100% precision on the spam class, meaning that when it classifies a message as spam, it is never wrong. However, its recall on the spam class is very low (26.85%), meaning the model fails to detect a large portion of actual spam messages (high false negative rate).
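
To make the relationship between these two metrics concrete, the sketch below computes precision and recall from confusion-matrix counts. The counts are hypothetical round numbers chosen to show a perfectly precise but low-recall classifier; they are not the project's actual confusion matrix.

```python
def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall

# Hypothetical counts: no false positives, many false negatives.
precision, recall = precision_recall(tp=30, fp=0, fn=70)
print(f"precision={precision:.2%}, recall={recall:.2%}")
```

With zero false positives the precision is 100% regardless of how many spam messages are missed, which is why precision alone can overstate how useful a spam filter is.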

The ROC curve also demonstrates the final model’s ability to distinguish between spam and ham. The AUC score was 0.91, indicating strong classification capability.

The clustering results showed that the messages group into two clusters that align with the spam/ham labels in the dataset: Cluster 0 corresponds to ham and Cluster 1 to spam.
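
One way to read an AUC value is as the probability that a randomly chosen spam message receives a higher score than a randomly chosen ham message. The sketch below computes it that way with plain Python. The function name `roc_auc` and the toy scores are hypothetical, and libraries such as scikit-learn provide an equivalent `roc_auc_score`.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    # A tie between a positive and a negative score counts as half a win.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy labels (1 = spam, 0 = ham) and predicted scores.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```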

Conclusion

Our results show that the model rarely flags legitimate messages as spam (high precision) but misses a large share of actual spam (low recall). While the ROC curve indicates strong overall performance (AUC of 0.91), the model's recall for spam detection needs improvement. The K-Means clustering results also show a distinction between spam and legitimate messages. This is visible in the word clouds as well, where spam messages often contain promotional or urgent language such as "text", "call" and "free". The logistic regression model performed reasonably well but needs further improvement before it can be considered useful in real-world applications such as spam filters.

Possible Improvements

Improve recall by using ensemble methods like Random Forest or XGBoost.
Use deep learning (LSTMs, Transformers) for better text representation.

Author

Efe Omoregie
