mragetsars/Statistical-Learning-and-Probabilistic-Classification


Statistical Learning and Probabilistic Classification

Engineering Probability and Statistics – University of Tehran – Department of Electrical & Computer Engineering


Overview

This repository is a Python and Jupyter Notebook implementation of numerical computing exercises, Condorcet-style majority-vote analysis, and Naive Bayes spam-email classification. It was developed as Computer Assignment Zero for the Engineering Probability and Statistics course at the University of Tehran.

The project follows a complete experimental pipeline, including assignment inspection, deterministic NumPy validation, probability-based majority-vote modeling, dataset loading, text preprocessing, Bag-of-Words feature extraction, Naive Bayes training, evaluation, visualization, and final analysis.

Project Objectives

  • ✅ Implement foundational NumPy operations for array construction, indexing, reshaping, vectorization, and normalization.
  • ✅ Analyze binary majority voting using posterior probability, exact binomial computation, Monte Carlo simulation, and heatmap visualization.
  • ✅ Build a leakage-free spam-email classification pipeline using train-only Bag-of-Words vectorization.
  • ✅ Apply Laplace smoothing and log-likelihood prediction for a Multinomial Naive Bayes classifier.
  • ✅ Compare the effect of stop-word removal on held-out test accuracy.

Methodology

1️⃣ NumPy Foundation Exercises

The NumPy section validates helper functions in src/numpy_basic.py. It covers scalar counting, array mutation, slice indexing, integer-array indexing, one-hot encoding, reshape-based layout transformation, batched matrix multiplication, row-wise minimum replacement, and sample-standard-deviation column normalization.
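Two of the operations listed above, one-hot encoding and sample-standard-deviation column normalization, can be sketched as follows. The function names `one_hot` and `normalize_columns` are hypothetical and are not taken from src/numpy_basic.py; this is only an illustration of the techniques, assuming integer labels in `[0, num_classes)` and columns with non-zero variance.

```python
import numpy as np


def one_hot(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Hypothetical helper: encode integer labels as one-hot rows."""
    encoded = np.zeros((labels.size, num_classes), dtype=np.int64)
    # Integer-array indexing: set column labels[i] to 1 in row i.
    encoded[np.arange(labels.size), labels] = 1
    return encoded


def normalize_columns(x: np.ndarray) -> np.ndarray:
    """Hypothetical helper: center each column and scale by its sample std."""
    mean = x.mean(axis=0)
    # ddof=1 divides by (n - 1), i.e. the sample standard deviation.
    std = x.std(axis=0, ddof=1)
    return (x - mean) / std


labels = np.array([0, 2, 1])
print(one_hot(labels, num_classes=3))

data = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
print(normalize_columns(data))
```

Passing `ddof=1` matters here because NumPy's default (`ddof=0`) computes the population standard deviation instead.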

2️⃣ Majority-Vote Probability Analysis

The majority-vote notebook computes the posterior probability that the observed majority option is correct for the five assignment scenarios. It also compares exact majority-vote accuracy with Monte Carlo simulation for 12 voters and visualizes the effect of individual accuracy p and voter count n.
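The comparison between exact and simulated majority-vote accuracy can be sketched as below. This is a minimal illustration, not the notebook's code: the function names are hypothetical, voters are assumed independent with the same accuracy p, and ties (possible for an even number of voters such as 12) are assumed to be broken by a fair coin flip, which the notebook may handle differently.

```python
from math import comb

import numpy as np


def exact_majority_accuracy(n: int, p: float) -> float:
    """Exact P(majority correct) from the binomial distribution.

    Assumption: a tie counts as correct with probability 0.5.
    """
    pmf = [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]
    win = sum(pmf[k] for k in range(n + 1) if 2 * k > n)
    tie = pmf[n // 2] if n % 2 == 0 else 0.0
    return win + 0.5 * tie


def simulated_majority_accuracy(
    n: int, p: float, trials: int = 200_000, seed: int = 0
) -> float:
    """Monte Carlo estimate under the same tie-breaking assumption."""
    rng = np.random.default_rng(seed)
    # Each draw is the number of correct voters in one simulated election.
    correct = rng.binomial(n, p, size=trials)
    wins = np.mean(2 * correct > n)
    ties = np.mean(2 * correct == n)
    return float(wins + 0.5 * ties)


for p in (0.5, 0.7, 0.9, 1.0):
    exact = exact_majority_accuracy(12, p)
    approx = simulated_majority_accuracy(12, p)
    print(f"p={p:.1f}  exact={exact:.4f}  monte_carlo={approx:.4f}")
```

With enough trials the simulated value should agree with the exact one to within sampling noise, which is a useful sanity check on both implementations.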

3️⃣ Spam-Email Preprocessing and Bag-of-Words Modeling

The spam-detection notebook loads data/emails.csv, normalizes email text with a deterministic regex tokenizer, performs a stratified train/test split, and fits CountVectorizer only on the training data to avoid test-set vocabulary leakage.
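The leakage-avoidance step described above can be sketched with scikit-learn as follows. The toy emails, the `normalize` function, and the split parameters are assumptions made for illustration; the notebook's actual tokenizer, column names, and settings are not shown in this README.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split


def normalize(text: str) -> str:
    """Hypothetical normalizer: lowercase and keep alphanumeric runs only."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


# Toy stand-in for data/emails.csv (1 = spam, 0 = ham).
emails = [
    "WIN a FREE prize now!!!",
    "Meeting moved to 3pm tomorrow",
    "Free cash offer, click here",
    "Please review the attached report",
    "Claim your free vacation today",
    "Lunch on Friday?",
    "Cheap loans approved instantly",
    "Project deadline extended to Monday",
]
labels = [1, 0, 1, 0, 1, 0, 1, 0]

# Stratified split keeps the spam/ham ratio the same in both partitions.
docs_train, docs_test, y_train, y_test = train_test_split(
    emails, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on training text only, then reuse it for the test text.
vectorizer = CountVectorizer(preprocessor=normalize, token_pattern=r"(?u)\b\w+\b")
X_train = vectorizer.fit_transform(docs_train)
X_test = vectorizer.transform(docs_test)

print(X_train.shape, X_test.shape)
```

Because `transform` only counts words already in the fitted vocabulary, words that appear exclusively in the test emails are ignored rather than added as features.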

4️⃣ Naive Bayes Evaluation and Analysis

A custom Multinomial Naive Bayes implementation estimates P(y) and P(x_i | y) from the training matrix. The classifier uses Laplace smoothing and log-likelihood prediction, then cross-checks its accuracy against sklearn.naive_bayes.MultinomialNB.
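A compact sketch of the approach described above is shown below. The class name `TinyMultinomialNB`, the toy count matrix, and the smoothing value `alpha=1.0` are illustrative assumptions rather than the notebook's actual implementation.

```python
import numpy as np


class TinyMultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes on word-count features."""

    def __init__(self, alpha: float = 1.0):
        self.alpha = alpha  # Laplace (add-alpha) smoothing strength

    def fit(self, X: np.ndarray, y: np.ndarray) -> "TinyMultinomialNB":
        self.classes_ = np.unique(y)
        # log P(y): class frequencies in the training labels.
        self.log_prior_ = np.log(
            np.array([(y == c).mean() for c in self.classes_])
        )
        # Smoothed per-class word counts, so unseen words never get P = 0.
        counts = np.vstack(
            [X[y == c].sum(axis=0) for c in self.classes_]
        ) + self.alpha
        # log P(x_i | y): normalize each class row to sum to 1.
        self.log_likelihood_ = np.log(counts / counts.sum(axis=1, keepdims=True))
        return self

    def predict(self, X: np.ndarray) -> np.ndarray:
        # Summing logs avoids underflow from multiplying many small numbers.
        scores = X @ self.log_likelihood_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]


# Toy counts for a 4-word vocabulary: ["free", "win", "meeting", "report"].
X_train = np.array([[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]])
y_train = np.array([1, 1, 0, 0])  # 1 = spam, 0 = ham

model = TinyMultinomialNB().fit(X_train, y_train)
print(model.predict(np.array([[1, 1, 0, 0], [0, 0, 1, 0]])))
```

With `alpha=1.0` this should correspond to the default smoothing of `sklearn.naive_bayes.MultinomialNB`, which is what makes a cross-check between the two meaningful.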

Repository Structure

The project is organized as follows:

Statistical-Learning-and-Probabilistic-Classification/
├── data/                    # Assignment dataset for spam-email classification
│   └── emails.csv
├── docs/                    # Original assignment specification
│   └── assignment-specification.pdf
├── notebooks/               # Standardized executable Jupyter notebooks
│   ├── 01_numpy_basic.ipynb
│   ├── 02_majority_vote.ipynb
│   └── 03_spam_email_detection.ipynb
├── src/                     # Reusable Python helper functions
│   ├── __init__.py
│   └── numpy_basic.py
├── tests/                   # Pytest validation for NumPy helper functions
│   └── test_numpy_basic.py
├── .gitignore               # Git ignore rules
├── Makefile                 # Convenience commands for setup and testing
├── README.md                # Project documentation
└── requirements.txt         # Python dependencies

Setup & Usage

Create a virtual environment and install the required packages:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Launch the notebooks:

jupyter notebook notebooks

Run the NumPy helper-function tests:

make test

Results

The NumPy validation notebook passes all deterministic checks for the implemented helper functions.

For the majority-vote scenarios, the posterior probabilities of a correct majority decision are:

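| Scenario | Individual Accuracy p | Votes for 1 | Votes for 0 | P(majority correct) |
|----------|-----------------------|-------------|-------------|---------------------|
| 1        | 0.7                   | 8           | 4           | 0.9674              |
| 2        | 0.7                   | 10          | 2           | 0.9989              |
| 3        | 0.3                   | 8           | 4           | 0.0326              |
| 4        | 0.5                   | 9           | 3           | 0.5000              |
| 5        | 0.5                   | 5           | 7           | 0.5000              |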
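The tabulated values are consistent with a simple Bayesian model: if both options are equally likely a priori and voters are independent, the binomial coefficients cancel and the posterior depends only on p and the two vote counts. The sketch below makes those assumptions explicit; the function name is hypothetical and the notebook may derive the values differently.

```python
def posterior_majority_correct(p: float, votes_for_1: int, votes_for_0: int) -> float:
    """Posterior probability that the majority option is the correct one.

    Assumptions: uniform prior over the two options, independent voters,
    each voting for the correct option with probability p.
    """
    majority = max(votes_for_1, votes_for_0)
    minority = min(votes_for_1, votes_for_0)
    # Likelihood of the observed split if the majority option is correct ...
    like_majority = p**majority * (1 - p) ** minority
    # ... and if the minority option is correct.
    like_minority = p**minority * (1 - p) ** majority
    return like_majority / (like_majority + like_minority)


scenarios = [(0.7, 8, 4), (0.7, 10, 2), (0.3, 8, 4), (0.5, 9, 3), (0.5, 5, 7)]
for number, (p, ones, zeros) in enumerate(scenarios, start=1):
    print(number, f"{posterior_majority_correct(p, ones, zeros):.4f}")
```

Under these assumptions, scenario 3 shows why a majority of unreliable voters (p = 0.3) is evidence against the option they favor, and scenarios 4 and 5 show that votes from p = 0.5 voters carry no information.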

For 12 voters, majority-vote accuracy reaches exactly 100% only at p = 1.0; on the assignment grid, it first reaches at least 99% at p = 0.9.

For spam-email detection, the final custom Naive Bayes classifier achieved 97.47% accuracy on the stratified held-out test set. The confusion matrix was:

[[846  26]
 [  3 271]]
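As a quick arithmetic check, the reported accuracy can be recovered from the matrix, since the diagonal entries count the correctly classified emails. The README does not state which row or column corresponds to spam, so the sketch below computes only the overall accuracy.

```python
import numpy as np

confusion = np.array([[846, 26], [3, 271]])

# Correct predictions lie on the diagonal; every entry is one test email.
accuracy = np.trace(confusion) / confusion.sum()
print(f"{confusion.sum()} test emails, accuracy = {accuracy:.4f}")
```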

Removing English stop words produced 97.29% accuracy, a slight decrease of 0.18 percentage points on this dataset under the same split and vocabulary size.
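One common way to run such a comparison is scikit-learn's built-in English stop-word list, sketched below on toy sentences. Whether the notebook uses this list or a different one is an assumption, as are the example documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the offer is a prize for you",
    "the meeting is at noon",
    "claim the prize today",
]

# Same documents, vectorized with and without stop-word filtering.
keep_all = CountVectorizer().fit(docs)
drop_stop_words = CountVectorizer(stop_words="english").fit(docs)

print(sorted(keep_all.vocabulary_))
print(sorted(drop_stop_words.vocabulary_))
```

Stop-word removal shrinks the vocabulary, but very common words can still carry some class signal in spam data, which is one possible reason accuracy does not necessarily improve.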

Notes

The original assignment PDF is retained in docs/. The dataset is included in data/ because it was part of the submitted assignment archive. No license file is included because the assignment specification and dataset provenance are course-specific.
