The University of New Brunswick's Canadian Institute for Cybersecurity published the open-source CIC-IoT-2023 dataset. The dataset supports research on detecting 7 kinds of cyberattacks across 100 IoT devices.
While leading a team of 15 at the University of Waterloo, I put together this repository of notebooks for sampling, preprocessing, visualising, and training models on the CIC-IoT-2023 dataset.
downsampling.ipynb
- This notebook samples 0.1%, 0.5%, 1%, 5%, and 10% of the rows from each cyberattack class in the dataset. This reduces the dataset size from 14 GB to between 12 MB and 600 MB, which makes feature visualisation and feature selection easier. Kaggle. Blog.
heatmaps.ipynb
- This notebook explores which of the 46 features matter most for training ML models. It notes some of the problems with simple correlational analysis and heatmaps. Kaggle. Blog.
greyWolf.ipynb
- This notebook uses the Grey Wolf Optimiser to select useful features from the 46 features in the dataset. Kaggle. Blog.
geneticAlgorithm.ipynb
- This notebook reduces the 46 features in the dataset to 20 useful features. It also visualises how the genetic algorithm works while doing this. Kaggle. Blog.
unsupervisedClustering.ipynb
- This notebook compares the negative selection algorithm, k-means clustering, and DBSCAN as ways to generate signatures of benign network requests. Kaggle. Blog.
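
The per-class sampling described for downsampling.ipynb can be sketched roughly as below. This is a minimal illustration, not the notebook's actual code: the `downsample_per_class` helper, the `label` and `flow_duration` column names, and the toy data are all assumptions made for the example.

```python
import pandas as pd


def downsample_per_class(df, label_col="label", fraction=0.01, seed=42):
    """Draw the same fraction of rows from every class in `label_col`.

    Hypothetical helper: sampling inside each group keeps the class
    proportions of the full dataset in the smaller sample.
    """
    return (
        df.groupby(label_col, group_keys=False)
        .sample(frac=fraction, random_state=seed)
        .reset_index(drop=True)
    )


# Toy stand-in for the real data (assumed column names and class labels).
toy = pd.DataFrame(
    {
        "flow_duration": range(4500),
        "label": ["Benign"] * 1000 + ["DDoS"] * 3000 + ["Mirai"] * 500,
    }
)

# Keep 10% of each class, then inspect how many rows each class retained.
sample = downsample_per_class(toy, fraction=0.1)
print(sample["label"].value_counts())
```

Sampling within each class, rather than across the whole table, means rare attack classes are not dropped by chance at the smaller fractions.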
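
The heatmaps.ipynb entry mentions problems with simple correlational analysis. One commonly cited limitation, which may or may not be among those the notebook covers, is that the Pearson coefficient only measures linear association. The sketch below uses synthetic data (not the dataset) to show a feature that fully determines another yet has a correlation of roughly zero.

```python
import numpy as np

# Synthetic feature, symmetric around zero (an assumption for the demo).
x = np.linspace(-1.0, 1.0, 201)

linear = 2.0 * x + 1.0  # perfectly linear in x
quadratic = x ** 2      # fully determined by x, but not linearly

# Off-diagonal entry of the 2x2 correlation matrix is the Pearson r.
r_linear = np.corrcoef(x, linear)[0, 1]
r_quadratic = np.corrcoef(x, quadratic)[0, 1]

print(f"linear: {r_linear:.3f}, quadratic: {abs(r_quadratic):.3f}")
```

A heatmap cell for the quadratic pair would look like "no relationship", which is one reason to pair correlation heatmaps with other feature selection methods.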
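
For readers new to the idea behind geneticAlgorithm.ipynb, here is a minimal sketch of genetic-algorithm feature selection over 46-bit masks. It is not the notebook's implementation: the `INFORMATIVE` set and the toy `fitness` function are hypothetical stand-ins, and in practice each mask would more likely be scored by training and validating a model on the selected columns.

```python
import random

N_FEATURES = 46  # feature count stated in this README

# Hypothetical: pretend every third feature is informative.
INFORMATIVE = set(range(0, N_FEATURES, 3))


def fitness(mask):
    """Toy score: +1 per informative feature kept, -0.25 per feature kept."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    return hits - 0.25 * sum(mask)


def tournament(pop, rng, k=3):
    """Pick the fittest of k randomly chosen individuals."""
    return max(rng.sample(pop, k), key=fitness)


def crossover(a, b, rng):
    """One-point crossover: head of parent a, tail of parent b."""
    cut = rng.randrange(1, N_FEATURES)
    return a[:cut] + b[cut:]


def mutate(mask, rng, rate=0.02):
    """Flip each bit independently with probability `rate`."""
    return [bit ^ 1 if rng.random() < rate else bit for bit in mask]


def run_ga(pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    history = []  # best fitness per generation, handy for plotting
    for _ in range(generations):
        elite = max(pop, key=fitness)
        history.append(fitness(elite))
        children = [elite]  # elitism: the best mask survives unchanged
        while len(children) < pop_size:
            child = crossover(tournament(pop, rng), tournament(pop, rng), rng)
            children.append(mutate(child, rng))
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history


best, history = run_ga()
selected = [i for i, bit in enumerate(best) if bit]
print(f"kept {len(selected)} of {N_FEATURES} features, fitness {history[-1]}")
```

Because the best mask is carried over unchanged each generation, the `history` list never decreases, which makes it a convenient series to plot when visualising how the search progresses.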