Permutation Decision Trees and Forests

This repository contains the implementation and performance evaluation of various tree-based machine learning models, including Decision Trees (DT), Random Forests (RF), and Permutation Decision Forests (PDF). Additionally, we used a novel impurity measure, Effort-To-Compress (ETC), and empirically evaluate its performance compared to traditional impurity measures like Shannon Entropy and Gini Index.

📂 Project Structure

├── load.py                                            # Script for loading all datasets
├── classical_decision_tree_using_entropy.ipynb        # DT using Shannon Entropy
├── classical_decision_tree_using_gini.ipynb           # DT using Gini Index
├── random_forest_entropy.ipynb                        # RF using Shannon Entropy
├── random_forest_gini.ipynb                           # RF using Gini Index
├── permutated_decision_tree_and_forest.ipynb          # PDF and PDT Implemenataion using PDT
├── barplots.ipynb                                     # Script for generating barplots for results
└── README.md                                          # Project documentation

📊 Datasets Used

The following datasets were analyzed:

Appendicitis
Breast Cancer Wisconsin
Diabetes Pima Indian
Ionosphere
Iris
Sonar
Wine

Each dataset was preprocessed and split into training and testing sets using a random seed (42) and time series split (validation size = 15).

⚙️ Hyperparameter Tuning

The following parameters were optimized:

Decision Trees (DT)

Depth: 2 to 20
Impurity Measures: Shannon Entropy and Gini Index

Random Forests (RF)

n_estimators: [10, 50, 100, 150, 1000]
Depth: 2 to 20
Impurity Measures: Shannon Entropy and Gini Index

Permutation Decision Tree (PDT) and Permutation Decision Forest (PDF)

Permutation of Data:
- 21 unique seed values for datasets: Ionosphere, Iris, Wine, Diabetes, Appendicitis
- 100 and 200 unique seed values for datasets: Breast Cancer Wisconsin and Sonar
Depth Tuning:
- PDT depths range from 2 to 20.
- For each permutation, the ideal depth that produces the highest macro F1-score is determined using time series split (five splits, validation size = 15).
Testing Phase:
- PDT models are retrained on the shuffled data with optimal depth and seed.
- For PDF, the test data is passed through multiple PDTs, and a voting scheme is employed to predict the final label.

🧪 Results

The models were evaluated based on the following metrics:

Accuracy
Macro F1-Score
Precision
Recall

Key Findings:

Permutation Decision Tree (PDT) outperformed classical Decision Trees on average.
Permutation Decision Forest (PDF) achieved performance close to Random Forests with significantly fewer estimators:
- PDF: 21–200 estimators
- RF: 50–1000 estimators
ETC impurity measure behaved comparably to Shannon Entropy in empirical evaluations.

📈 Results Visualization

Results are visualized using barplots to compare Macro F1-Scores across all models and datasets. Execute barplots.ipynb to generate the plots.

🚀 How to Run the Code

Prerequisites

Ensure you have the required libraries installed:

pip install numpy pandas scikit-learn matplotlib

Steps to Run

Clone the repository:

git clone https://github.com/arham-jain12/Permutation-Decision-Tree.git
cd Permutation-Decision-Tree

Load the datasets by running load.py:
```
python load.py
```
Run the specific Jupyter Notebooks for model implementation:
- Decision Trees (DT):
  - classical_decision_tree_using_entropy.ipynb
  - classical_decision_tree_using_gini.ipynb
- Random Forests (RF):
  - random_forest_entropy.ipynb
  - random_forest_gini.ipynb
- Permutation Decision Forests:
  - permutated_decision_tree_and_forest.ipynb
Generate the barplots using barplots.ipynb.

🔬 Future Work

Investigate the relationship between ETC and Kolmogorov complexity.
Test PDT and PDF on non-iid datasets to evaluate their robustness further.

👤 Contributions

Contributions are welcome! If you encounter any issues or have suggestions for improvements, feel free to open an issue or submit a pull request.

📄 License

This project is licensed under the MIT License.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Permutation Decision Trees and Forests

📂 Project Structure

📊 Datasets Used

⚙️ Hyperparameter Tuning

Decision Trees (DT)

Random Forests (RF)

Permutation Decision Tree (PDT) and Permutation Decision Forest (PDF)

🧪 Results

Key Findings:

📈 Results Visualization

🚀 How to Run the Code

Prerequisites

Steps to Run

🔬 Future Work

👤 Contributions

📄 License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

Permutation Decision Trees and Forests

📂 Project Structure

📊 Datasets Used

⚙️ Hyperparameter Tuning

Decision Trees (DT)

Random Forests (RF)

Permutation Decision Tree (PDT) and Permutation Decision Forest (PDF)

🧪 Results

Key Findings:

📈 Results Visualization

🚀 How to Run the Code

Prerequisites

Steps to Run

🔬 Future Work

👤 Contributions

📄 License