Statistical Learning and Probabilistic Classification

Engineering Probability and Statistics – University of Tehran – Department of Electrical & Computer Engineering

Overview

This repository contains Statistical Learning and Probabilistic Classification, a Python and Jupyter Notebook implementation of numerical computing exercises, Condorcet-style majority-vote analysis, and Naive Bayes spam-email classification. It was developed as Computer Assignment Zero for the Engineering Probability and Statistics course at the University of Tehran.

The project follows a complete experimental pipeline, including assignment inspection, deterministic NumPy validation, probability-based majority-vote modeling, dataset loading, text preprocessing, Bag-of-Words feature extraction, Naive Bayes training, evaluation, visualization, and final analysis.

Project Objectives

✅ Implement foundational NumPy operations for array construction, indexing, reshaping, vectorization, and normalization.
✅ Analyze binary majority voting using posterior probability, exact binomial computation, Monte Carlo simulation, and heatmap visualization.
✅ Build a leakage-free spam-email classification pipeline using train-only Bag-of-Words vectorization.
✅ Apply Laplace smoothing and log-likelihood prediction for a Multinomial Naive Bayes classifier.
✅ Compare the effect of stop-word removal on held-out test accuracy.

Methodology

1️⃣ NumPy Foundation Exercises

The NumPy section validates helper functions in src/numpy_basic.py. It covers scalar counting, array mutation, slice indexing, integer-array indexing, one-hot encoding, reshape-based layout transformation, batched matrix multiplication, row-wise minimum replacement, and sample-standard-deviation column normalization.
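Two of the listed operations can be sketched as follows. This is a minimal illustration, not the code in src/numpy_basic.py: the function names `one_hot` and `normalize_columns` are hypothetical, and the inputs are assumed to be small dense arrays.

```python
import numpy as np


def one_hot(labels, num_classes):
    """One-hot encode integer labels using integer-array indexing."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=np.int64)
    # Row i gets a 1 in column labels[i].
    encoded[np.arange(labels.size), labels] = 1
    return encoded


def normalize_columns(x):
    """Center each column and divide by its sample standard deviation."""
    x = np.asarray(x, dtype=np.float64)
    mean = x.mean(axis=0)
    # ddof=1 gives the sample (n - 1) standard deviation.
    std = x.std(axis=0, ddof=1)
    return (x - mean) / std


encoded = one_hot([0, 2, 1], num_classes=3)
z = normalize_columns([[1, 2], [3, 6], [5, 10]])
```

Passing `ddof=1` is what separates the sample standard deviation from NumPy's default population estimate (`ddof=0`).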

2️⃣ Majority-Vote Probability Analysis

The majority-vote notebook computes the posterior probability that the observed majority option is correct for the five assignment scenarios. It also compares exact majority-vote accuracy with Monte Carlo simulation for 12 voters and visualizes the effect of individual accuracy p and voter count n.
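The posterior calculation can be sketched as below, assuming a uniform prior over the two options and independent voters who are each correct with probability p. The function name `posterior_majority_correct` is hypothetical and is not taken from the notebook.

```python
def posterior_majority_correct(p, votes_for_1, votes_for_0):
    """P(the majority option is the true answer | observed votes).

    Assumes a uniform prior over the two options and independent voters,
    each correct with probability p. The binomial coefficient is the same
    under both hypotheses, so it cancels out of the ratio.
    """
    majority = max(votes_for_1, votes_for_0)
    minority = min(votes_for_1, votes_for_0)
    # Likelihood of the votes if the majority option is the truth.
    like_majority_true = p**majority * (1 - p) ** minority
    # Likelihood of the votes if the minority option is the truth.
    like_minority_true = p**minority * (1 - p) ** majority
    return like_majority_true / (like_majority_true + like_minority_true)


# (p, votes for 1, votes for 0) for the five scenarios in the Results table.
scenarios = [(0.7, 8, 4), (0.7, 10, 2), (0.3, 8, 4), (0.5, 9, 3), (0.5, 5, 7)]
posteriors = [posterior_majority_correct(*s) for s in scenarios]
```

Under these assumptions the posterior depends only on p and the vote margin, which is why every p = 0.5 scenario yields 0.5 regardless of the split.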

3️⃣ Spam-Email Preprocessing and Bag-of-Words Modeling

The spam-detection notebook loads data/emails.csv, normalizes email text with a deterministic regex tokenizer, performs a stratified train/test split, and fits CountVectorizer only on the training data to avoid test-set vocabulary leakage.
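The leakage-free ordering (split first, then fit the vocabulary on the training portion only) can be sketched as below. The toy emails, the token regex, and the split parameters are illustrative assumptions; the notebook's actual tokenizer and the column names in data/emails.csv are not reproduced here.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical token pattern: lowercase alphanumeric runs.
TOKEN_RE = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Deterministic regex tokenizer: lowercase, then extract tokens."""
    return TOKEN_RE.findall(text.lower())


# Toy stand-in for the email dataset (1 = spam, 0 = not spam).
texts = [
    "WIN a FREE prize now!!!",
    "Free entry: claim your reward",
    "Cheap loans, apply now",
    "You have won a free voucher",
    "Meeting moved to 3pm tomorrow",
    "Please review the attached report",
    "Lunch on Friday?",
    "Notes from today's lecture",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# Stratify so both splits keep the same class proportions.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Fit the vocabulary on training text only; test text is only transformed.
vectorizer = CountVectorizer(tokenizer=tokenize, lowercase=False, token_pattern=None)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
```

Tokens that appear only in the test emails are dropped by `transform`, so the test set cannot influence the feature space.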

4️⃣ Naive Bayes Evaluation and Analysis

A custom Multinomial Naive Bayes implementation estimates P(y) and P(x_i | y) from the training matrix. The classifier uses Laplace smoothing and log-likelihood prediction, then cross-checks its accuracy against sklearn.naive_bayes.MultinomialNB.
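A minimal sketch of such a classifier is shown below. It is not the notebook's implementation: the class name `MultinomialNaiveBayes`, the smoothing parameter `alpha`, and the toy count matrix are assumptions, and the inputs are assumed to be dense arrays rather than the sparse matrices that CountVectorizer returns.

```python
import numpy as np


class MultinomialNaiveBayes:
    """Multinomial Naive Bayes with Laplace (add-alpha) smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, X, y):
        X = np.asarray(X, dtype=np.float64)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(y): class frequencies in the training labels.
        class_counts = np.array([(y == c).sum() for c in self.classes_])
        self.log_prior_ = np.log(class_counts / y.size)
        # Per-class token totals, one row per class.
        token_counts = np.vstack([X[y == c].sum(axis=0) for c in self.classes_])
        # Laplace smoothing: add alpha to every count before normalizing.
        smoothed = token_counts + self.alpha
        self.log_likelihood_ = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=np.float64)
        # Log-space score: sum_i x_i * log P(x_i | y) + log P(y).
        scores = X @ self.log_likelihood_.T + self.log_prior_
        return self.classes_[np.argmax(scores, axis=1)]


# Toy counts over a hypothetical vocabulary: free, win, meeting, report.
X_toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y_toy = [1, 1, 0, 0]
model = MultinomialNaiveBayes(alpha=1.0).fit(X_toy, y_toy)
predictions = model.predict([[1, 1, 0, 0], [0, 0, 1, 1]])
```

Working in log space avoids the numerical underflow that multiplying many small probabilities would cause, and smoothing keeps a token unseen in one class from driving that class's score to negative infinity.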

Repository Structure

The project is organized as follows:

Statistical-Learning-and-Probabilistic-Classification/
├── data/                    # Assignment dataset for spam-email classification
│   └── emails.csv
├── docs/                    # Original assignment specification
│   └── assignment-specification.pdf
├── notebooks/               # Standardized executable Jupyter notebooks
│   ├── 01_numpy_basic.ipynb
│   ├── 02_majority_vote.ipynb
│   └── 03_spam_email_detection.ipynb
├── src/                     # Reusable Python helper functions
│   ├── __init__.py
│   └── numpy_basic.py
├── tests/                   # Pytest validation for NumPy helper functions
│   └── test_numpy_basic.py
├── .gitignore               # Git ignore rules
├── Makefile                 # Convenience commands for setup and testing
├── README.md                # Project documentation
└── requirements.txt         # Python dependencies

Setup & Usage

Create a virtual environment and install the required packages:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Launch the notebooks:

jupyter notebook notebooks

Run the NumPy helper-function tests:

make test

Results

The NumPy validation notebook passes all deterministic checks for the implemented helper functions.

For the majority-vote scenarios, the posterior probabilities of a correct majority decision are:

| Scenario | Individual Accuracy `p` | Votes for `1` | Votes for `0` | `P(majority correct)` |
|---|---|---|---|---|
| 1 | 0.7 | 8 | 4 | 0.9674 |
| 2 | 0.7 | 10 | 2 | 0.9989 |
| 3 | 0.3 | 8 | 4 | 0.0326 |
| 4 | 0.5 | 9 | 3 | 0.5000 |
| 5 | 0.5 | 5 | 7 | 0.5000 |

For 12 voters, majority-vote accuracy reaches exactly 100% only at p = 1.0. On the assignment grid, p = 0.9 is the smallest value at which it reaches at least 99%.
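The comparison between the exact binomial value and a Monte Carlo estimate can be sketched with the standard library alone. Two assumptions are made here that the notebook may handle differently: a tie (6 of 12 correct) counts as an incorrect group decision, and the grid runs from 0.1 to 1.0 in steps of 0.1. The function names are hypothetical.

```python
import random
from math import comb


def exact_majority_accuracy(n, p):
    """P(strictly more than half of n independent voters are correct)."""
    threshold = n // 2 + 1  # assumption: ties count as incorrect
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(threshold, n + 1)
    )


def simulated_majority_accuracy(n, p, trials=10_000, seed=0):
    """Monte Carlo estimate of the same probability with a fixed seed."""
    rng = random.Random(seed)
    threshold = n // 2 + 1
    wins = 0
    for _ in range(trials):
        # Each voter is independently correct with probability p.
        correct_votes = sum(rng.random() < p for _ in range(n))
        wins += correct_votes >= threshold
    return wins / trials


# Assumed grid of individual accuracies: 0.1, 0.2, ..., 1.0.
grid = [round(0.1 * i, 1) for i in range(1, 11)]
for p in grid:
    exact = exact_majority_accuracy(12, p)
    estimate = simulated_majority_accuracy(12, p)
    print(f"p={p:.1f}  exact={exact:.4f}  simulated={estimate:.4f}")
```

Under the strict-majority assumption, the exact value at p = 0.8 stays below 0.99 while the value at p = 0.9 exceeds it, in line with the threshold reported above.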

For spam-email detection, the final custom Naive Bayes classifier achieved 97.47% accuracy on the stratified held-out test set. The confusion matrix was:

[[846  26]
 [  3 271]]
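The reported accuracy can be recovered directly from this matrix. The sketch below assumes the scikit-learn layout (rows are true labels, columns are predicted labels) with the class order non-spam then spam. Neither is stated above, so the precision and recall lines are illustrative, while the accuracy does not depend on that assumption.

```python
import numpy as np

confusion = np.array([[846, 26],
                      [3, 271]])

total = confusion.sum()                 # number of test emails
correct = np.trace(confusion)           # diagonal = correct predictions
accuracy = correct / total

# Assumed layout: rows = true class, columns = predicted class,
# class order = (non-spam, spam).
spam_precision = confusion[1, 1] / confusion[:, 1].sum()
spam_recall = confusion[1, 1] / confusion[1, :].sum()

print(f"accuracy = {correct}/{total} = {accuracy:.4f}")
print(f"spam precision = {spam_precision:.4f}, spam recall = {spam_recall:.4f}")
```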

Removing English stop words produced 97.29% accuracy, a decrease of 0.18 percentage points on this dataset under the same split and vocabulary size.

Notes

The original assignment PDF is retained in docs/. The dataset is included in data/ because it was part of the submitted assignment archive. No license file is included because the assignment specification and dataset provenance are course-specific.

Author

Meraj Rastegar
