Engineering Probability and Statistics – University of Tehran – Department of Electrical & Computer Engineering
This repository contains Statistical Learning and Probabilistic Classification, a Python and Jupyter Notebook implementation of numerical computing exercises, Condorcet-style majority-vote analysis, and Naive Bayes spam-email classification. The project was developed as Computer Assignment Zero for the Engineering Probability and Statistics course at the University of Tehran.
The project follows a complete experimental pipeline, including assignment inspection, deterministic NumPy validation, probability-based majority-vote modeling, dataset loading, text preprocessing, Bag-of-Words feature extraction, Naive Bayes training, evaluation, visualization, and final analysis.
- ✅ Implement foundational NumPy operations for array construction, indexing, reshaping, vectorization, and normalization.
- ✅ Analyze binary majority voting using posterior probability, exact binomial computation, Monte Carlo simulation, and heatmap visualization.
- ✅ Build a leakage-free spam-email classification pipeline using train-only Bag-of-Words vectorization.
- ✅ Apply Laplace smoothing and log-likelihood prediction for a Multinomial Naive Bayes classifier.
- ✅ Compare the effect of stop-word removal on held-out test accuracy.
The NumPy section validates helper functions in src/numpy_basic.py. It covers scalar counting, array mutation, slice indexing, integer-array indexing, one-hot encoding, reshape-based layout transformation, batched matrix multiplication, row-wise minimum replacement, and sample-standard-deviation column normalization.
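As an illustration of two of the operations listed above, the following is a minimal sketch of one-hot encoding and sample-standard-deviation column normalization. The function names `one_hot` and `normalize_columns` are hypothetical and are not necessarily the names used in src/numpy_basic.py.

```python
import numpy as np


def one_hot(labels, num_classes):
    """Hypothetical helper: encode integer labels as one-hot rows."""
    labels = np.asarray(labels)
    encoded = np.zeros((labels.size, num_classes), dtype=int)
    # Integer-array indexing sets exactly one entry per row.
    encoded[np.arange(labels.size), labels] = 1
    return encoded


def normalize_columns(matrix):
    """Hypothetical helper: zero-mean, unit sample-std columns (ddof=1)."""
    matrix = np.asarray(matrix, dtype=float)
    return (matrix - matrix.mean(axis=0)) / matrix.std(axis=0, ddof=1)


encoded = one_hot([0, 2, 1], num_classes=3)
normalized = normalize_columns([[1.0, 10.0], [2.0, 20.0], [3.0, 60.0]])
```

Passing `ddof=1` makes NumPy divide by n - 1 rather than n, which is the sample (not population) standard deviation.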
The majority-vote notebook computes the posterior probability that the observed majority option is correct for the five assignment scenarios. It also compares exact majority-vote accuracy with Monte Carlo simulation for 12 voters and visualizes the effect of individual accuracy p and voter count n.
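The comparison between exact and simulated majority-vote accuracy can be sketched as follows. This is an illustrative version, not the notebook's code: the function names are hypothetical, and the tie rule (a 6-6 split counts as correct with probability 0.5) is an assumption, since the handling of ties for an even number of voters is not stated here.

```python
from math import comb

import numpy as np


def exact_majority_accuracy(n, p):
    """Exact probability that the majority of n independent voters is correct.

    Assumption: ties are broken by a fair coin flip.
    """
    total = 0.0
    for k in range(n + 1):
        prob_k = comb(n, k) * p**k * (1 - p) ** (n - k)
        if 2 * k > n:
            total += prob_k
        elif 2 * k == n:
            total += 0.5 * prob_k
    return total


def simulated_majority_accuracy(n, p, trials=100_000, seed=0):
    """Monte Carlo estimate under the same tie rule."""
    rng = np.random.default_rng(seed)
    correct_votes = rng.binomial(n, p, size=trials)
    outcome = (2 * correct_votes > n).astype(float)
    outcome[2 * correct_votes == n] = 0.5  # expected value of the coin flip
    return outcome.mean()


exact = exact_majority_accuracy(12, 0.7)
estimate = simulated_majority_accuracy(12, 0.7)
```

With enough trials the two values should agree to within Monte Carlo noise, which is a useful sanity check on both implementations.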
The spam-detection notebook loads data/emails.csv, normalizes email text with a deterministic regex tokenizer, performs a stratified train/test split, and fits CountVectorizer only on the training data to avoid test-set vocabulary leakage.
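A minimal sketch of the train-only vectorization step is shown below. It uses a small in-memory toy corpus instead of data/emails.csv, because the file's column names are not shown here; the `tokenize` function and its regex are assumptions for illustration, not the notebook's exact tokenizer.

```python
import re

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Toy stand-in for the real dataset (1 = spam, 0 = ham).
texts = [
    "Win a FREE prize now", "Free money, click here", "Claim your free reward",
    "Winner! Exclusive offer inside", "Meeting moved to Monday",
    "Please review the attached report", "Lunch tomorrow at noon?",
    "Project deadline is next week",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]


def tokenize(text):
    """Hypothetical deterministic tokenizer: lowercase, keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


# Stratification keeps the spam/ham ratio equal in both partitions.
train_texts, test_texts, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

vectorizer = CountVectorizer(analyzer=tokenize)
X_train = vectorizer.fit_transform(train_texts)  # vocabulary learned here only
X_test = vectorizer.transform(test_texts)        # unseen test words are dropped
```

Calling `transform` rather than `fit_transform` on the test texts is what prevents leakage: words that appear only in test emails never enter the vocabulary.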
A custom Multinomial Naive Bayes implementation estimates P(y) and P(x_i | y) from the training matrix. The classifier uses Laplace smoothing and log-likelihood prediction, then cross-checks its accuracy against sklearn.naive_bayes.MultinomialNB.
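The estimation and prediction steps can be sketched in NumPy as follows. This is an illustrative version under stated assumptions, not the notebook's implementation: the class name `TinyMultinomialNB` is hypothetical, and the attribute names are borrowed from scikit-learn's conventions for readability.

```python
import numpy as np


class TinyMultinomialNB:
    """Hypothetical minimal Multinomial Naive Bayes with Laplace smoothing."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha  # alpha = 1 corresponds to Laplace smoothing

    def fit(self, X, y):
        X = np.asarray(X, dtype=float)
        y = np.asarray(y)
        self.classes_ = np.unique(y)
        # log P(y): relative class frequencies in the training labels
        self.class_log_prior_ = np.log(
            np.array([(y == c).mean() for c in self.classes_])
        )
        # log P(x_i | y): smoothed share of each token within each class
        counts = np.vstack([X[y == c].sum(axis=0) for c in self.classes_])
        smoothed = counts + self.alpha
        self.feature_log_prob_ = np.log(smoothed) - np.log(
            smoothed.sum(axis=1, keepdims=True)
        )
        return self

    def predict(self, X):
        X = np.asarray(X, dtype=float)
        # Summing log-probabilities avoids underflow from long products.
        scores = X @ self.feature_log_prob_.T + self.class_log_prior_
        return self.classes_[np.argmax(scores, axis=1)]


# Toy count matrix over a 4-word vocabulary (1 = spam, 0 = ham).
X_toy = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
y_toy = [1, 1, 0, 0]
model = TinyMultinomialNB().fit(X_toy, y_toy)
predictions = model.predict([[1, 1, 0, 0], [0, 0, 1, 1]])
```

Smoothing guarantees that a token never seen in one class still has a nonzero probability there, so a single unseen word cannot drive a class score to negative infinity.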
The project is organized as follows:
Statistical-Learning-and-Probabilistic-Classification/
├── data/ # Assignment dataset for spam-email classification
│ └── emails.csv
├── docs/ # Original assignment specification
│ └── assignment-specification.pdf
├── notebooks/ # Standardized executable Jupyter notebooks
│ ├── 01_numpy_basic.ipynb
│ ├── 02_majority_vote.ipynb
│ └── 03_spam_email_detection.ipynb
├── src/ # Reusable Python helper functions
│ ├── __init__.py
│ └── numpy_basic.py
├── tests/ # Pytest validation for NumPy helper functions
│ └── test_numpy_basic.py
├── .gitignore # Git ignore rules
├── Makefile # Convenience commands for setup and testing
├── README.md # Project documentation
└── requirements.txt # Python dependencies
Create a virtual environment and install the required packages:
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
Launch the notebooks:
jupyter notebook notebooks
Run the NumPy helper-function tests:
make test
The NumPy validation notebook passes all deterministic checks for the implemented helper functions.
For the majority-vote scenarios, the posterior probabilities of a correct majority decision are:
| Scenario | Individual accuracy p | Votes for 1 | Votes for 0 | P(majority correct) |
|---|---|---|---|---|
| 1 | 0.7 | 8 | 4 | 0.9674 |
| 2 | 0.7 | 10 | 2 | 0.9989 |
| 3 | 0.3 | 8 | 4 | 0.0326 |
| 4 | 0.5 | 9 | 3 | 0.5000 |
| 5 | 0.5 | 5 | 7 | 0.5000 |
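
The table values are consistent with a Bayes-rule calculation that assumes an equal prior on the two options and independent voters who are each correct with probability p. The sketch below reproduces them under those assumptions; the function name is hypothetical and the notebook may organize the computation differently.

```python
def posterior_majority_correct(p, votes_for_1, votes_for_0, prior_1=0.5):
    """Posterior probability that the observed majority option is the truth.

    Assumptions: independent voters, each correct with probability p,
    and a prior of prior_1 that option 1 is the true answer.
    """
    # Likelihood of the observed votes under each hypothesis; the binomial
    # coefficient is identical in both and cancels in the ratio.
    like_if_1 = p**votes_for_1 * (1 - p) ** votes_for_0
    like_if_0 = p**votes_for_0 * (1 - p) ** votes_for_1
    post_1 = prior_1 * like_if_1 / (
        prior_1 * like_if_1 + (1 - prior_1) * like_if_0
    )
    return post_1 if votes_for_1 >= votes_for_0 else 1 - post_1


scenarios = [(0.7, 8, 4), (0.7, 10, 2), (0.3, 8, 4), (0.5, 9, 3), (0.5, 5, 7)]
posteriors = [round(posterior_majority_correct(*s), 4) for s in scenarios]
print(posteriors)  # [0.9674, 0.9989, 0.0326, 0.5, 0.5]
```

Scenario 3 shows why the majority can be unreliable: when each voter is right only 30% of the time, an 8-to-4 majority is strong evidence for the opposite option.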
For 12 voters, majority-vote accuracy is exactly 100% only at p = 1.0; on the assignment grid, it first reaches at least 99% at p = 0.9.
For spam-email detection, the final custom Naive Bayes classifier achieved 97.47% accuracy on the stratified held-out test set. The confusion matrix was:
[[846 26]
[ 3 271]]
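The reported accuracy follows directly from the confusion matrix: correct predictions sit on the diagonal, so accuracy is the trace divided by the total count. This check does not depend on which axis holds the true labels.

```python
import numpy as np

confusion = np.array([[846, 26], [3, 271]])

# Diagonal entries are correct predictions; the sum is the test-set size.
accuracy = np.trace(confusion) / confusion.sum()
print(f"{accuracy:.2%}")  # 97.47%
```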
Removing English stop words produced 97.29% accuracy, a slight decrease on this dataset under the same split and vocabulary size.
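One way to run such a comparison is scikit-learn's built-in English stop-word list, sketched below on toy sentences. Whether the notebook uses this exact option is an assumption; the source states only that English stop words were removed.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the prize is free", "review the report"]

with_stop_words = CountVectorizer().fit(docs)
without_stop_words = CountVectorizer(stop_words="english").fit(docs)

# Content words survive in both vocabularies; "the" is dropped from the second.
kept = sorted(without_stop_words.vocabulary_)
```

Stop-word removal shrinks the vocabulary, but very common words can still carry some class signal, which may explain why accuracy does not necessarily improve.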
The original assignment PDF is retained in docs/. The dataset is included in data/ because it was part of the submitted assignment archive. No license file is included because the assignment specification and dataset provenance are course-specific.