This repo serves as a template for the exam assignment of the AutoML SS25 course at the University of Freiburg.
The aim of this repo is to provide a minimal installable template to help you get up and running.
To install the repository, first create an environment of your choice and activate it.
You can change the Python version in the commands below to the version you prefer.
### Virtual Environment

```bash
python3 -m venv automl-tabular-env
source automl-tabular-env/bin/activate
```

### Conda Environment

You can also use conda; the choice is left to individual preference.

```bash
conda create -n automl-tabular-env python=3.11
conda activate automl-tabular-env
```
Then install the repository by running the following command:

```bash
pip install -e .
```
You can test that the installation was successful by running the following command:

```bash
python -c "import automl"
```
We place no restrictions on the Python version or libraries you use, but we recommend using Python 3.10 or higher.
We provide the following:

- `download-datasets.py`: This script downloads the suggested training datasets that we provide ahead of time, before the official exam dataset becomes available.
- `run.py`: A script that loads a downloaded dataset, trains an AutoML system, and then generates predictions for `X_test`, saving those predictions to a file. For the training datasets, you will also have access to `y_test`, which is present in the `./data` folder; however, you will not have access to `y_test` for the test dataset we provide later. Instead, you will generate predictions for `X_test` and submit those to us through GitHub Classroom.
- `./src/automl`: This is a Python package that will be installed above and contains your source code for whatever system you would like to build. We have provided a dummy `AutoML` class to serve as an example (a minimal sketch of such a class follows this list).
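For illustration only, such a class might look like the sketch below. This is not the provided dummy implementation: the `fit`/`predict` interface and the fixed random forest are assumptions made purely for this example.

```python
# Hypothetical minimal AutoML class; the actual dummy class in ./src/automl
# may expose a different interface. Names and model choice are assumptions.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


class AutoML:
    """A deliberately minimal stand-in: one fixed model, no search at all."""

    def __init__(self, seed: int = 42) -> None:
        self.seed = seed
        self.model = RandomForestRegressor(random_state=seed)

    def fit(self, X_train: pd.DataFrame, y_train: pd.Series) -> "AutoML":
        self.model.fit(X_train, y_train)
        return self

    def predict(self, X_test: pd.DataFrame) -> np.ndarray:
        return self.model.predict(X_test)
```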
You are completely free to modify the code, install new libraries, make changes, and in general do whatever you want with it. The only requirement for the exam is that you can generate predictions for `X_test` in a `.npy` file that we can then use to give you a test score through GitHub Classroom.
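Saving predictions in that format is straightforward with NumPy. A minimal sketch, where the array and filename are placeholders for your system's actual output:

```python
import numpy as np

# Placeholder: in practice this would come from your AutoML system's predict().
predictions = np.zeros(100, dtype=np.float64)

# Write the .npy file that gets submitted through GitHub Classroom.
np.save("predictions.npy", predictions)
```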
The following datasets are provided for practice purposes:
- bike_sharing_demand
- brazilian_houses
- wine_quality
- superconductivity
- yprop_4_1
You can download the practice data using:
```bash
python download-datasets.py
```
By default, this will download the data to the `./data` folder with the following structure. The fold numbers `1, ..., n` refer to outer folds, meaning each can be treated as a separate dataset for training and validation. You can use the `--fold` argument to specify which fold you would like.
```
./data
├── bike_sharing_demand
│   ├── 1
│   │   ├── X_test.parquet
│   │   ├── X_train.parquet
│   │   ├── y_test.parquet
│   │   └── y_train.parquet
│   ├── 2
│   │   ├── X_test.parquet
│   │   ├── X_train.parquet
│   │   ├── y_test.parquet
│   │   └── y_train.parquet
│   ├── 3
...
├── wine_quality
│   ├── 1
│   │   ├── X_test.parquet
│   │   ├── X_train.parquet
│   │   ├── y_test.parquet
│   │   └── y_train.parquet
...
```
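Given this layout, you can also load a single fold yourself, for example with pandas. A minimal sketch (assuming a parquet engine such as pyarrow is installed):

```python
from pathlib import Path

import pandas as pd

# Paths follow the directory tree above: ./data/<dataset>/<fold>/*.parquet
fold_dir = Path("./data") / "bike_sharing_demand" / "1"

X_train = pd.read_parquet(fold_dir / "X_train.parquet")
y_train = pd.read_parquet(fold_dir / "y_train.parquet")
X_test = pd.read_parquet(fold_dir / "X_test.parquet")
y_test = pd.read_parquet(fold_dir / "y_test.parquet")

print(X_train.shape, X_test.shape)
```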
The following will train a dummy AutoML system and generate predictions for `X_test`:

```bash
python run.py --task bike_sharing_demand --seed 42 --output-path preds-42-bsd.npy
```
You are free to modify these files and command line arguments as you see fit.
| Dataset             | Test performance |
|---------------------|------------------|
| bike_sharing_demand | 0.9457           |
| brazilian_houses    | 0.9896           |
| superconductivity   | 0.9311           |
| wine_quality        | 0.4410           |
| yprop_4_1           | 0.0778           |
The scores listed are the R² values calculated using scikit-learn's `metrics.r2_score`.
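Since `y_test` is provided for the practice datasets, you can compute the same metric locally. A minimal sketch, assuming predictions for fold 1 of `bike_sharing_demand` were saved as `preds-42-bsd.npy` by the command above:

```python
from pathlib import Path

import numpy as np
import pandas as pd
from sklearn.metrics import r2_score

fold_dir = Path("./data") / "bike_sharing_demand" / "1"

# y_test is available for the practice datasets, so you can score yourself.
y_test = pd.read_parquet(fold_dir / "y_test.parquet")
predictions = np.load("preds-42-bsd.npy")

print(r2_score(y_test, predictions))
```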
- If you need to add dependencies, you can modify the `pyproject.toml` file and add them there. This will ensure that you and your teammates are all on the same page and have the same dependencies (see the example after this list).
- Please feel free to modify the `.gitignore` file to exclude files generated by your experiments, such as models, predictions, etc. Also, be a friendly teammate and ignore your virtual environment and any additional folders/files created by your IDE.
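For reference, added dependencies might look like the hypothetical excerpt below; the package names and version pins are placeholders, not part of this template.

```toml
# Hypothetical excerpt from pyproject.toml; the packages listed are examples.
[project]
dependencies = [
    "numpy>=1.24",
    "pandas>=2.0",
    "scikit-learn>=1.3",
]
```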