
Commit c999ba6

Initial commit
0 parents  commit c999ba6


12 files changed: +786 -0 lines changed


.flake8

Lines changed: 43 additions & 0 deletions
#########################
# Flake8 Configuration  #
# (.flake8)             #
#########################
[flake8]
ignore =
    # pickle
    S301
    S403
    S404
    S603
    # Line break before binary operator (flake8 is wrong)
    W503
    # Ignore the spaces black puts before columns.
    E203
    # allow path extensions for testing.
    E402
    DAR101
    DAR201
    # flake and pylance disagree on linebreaks in strings.
    N400
    # asserts are ok in test.
    S101
exclude =
    .tox,
    .git,
    __pycache__,
    docs/source/conf.py,
    build,
    dist,
    tests/fixtures/*,
    *.pyc,
    *.bib,
    *.egg-info,
    .cache,
    .eggs,
    data.
max-line-length = 120
max-complexity = 20
import-order-style = pycharm
application-import-names =
    seleqt
    tests

.github/workflows/test.yml

Lines changed: 41 additions & 0 deletions
name: Tests

on: [ push, pull_request ]

jobs:
  tests:
    name: Tests
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ ubuntu-latest ]
        python-version: [3.11.0]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install nox
      - name: Test with pytest
        run:
          nox -s test
  lint:
    name: Lint
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: [3.11.0]
    steps:
      - uses: actions/checkout@v2
      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v2
        with:
          python-version: ${{ matrix.python-version }}
      - name: Install dependencies
        run: pip install nox
      - name: Run flake8
        run: nox -s lint
      - name: Run mypy
        run: nox -s typing

.gitignore

Lines changed: 163 additions & 0 deletions
.vscode/
.pytest_cache/

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
# For a library or package, you might want to ignore these files since the code is
# intended to run in multiple environments; otherwise, check them in:
# .python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# poetry
# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
# This is especially recommended for binary packages to ensure reproducibility, and is more
# commonly ignored for libraries.
# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
#poetry.lock

# pdm
# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
#pdm.lock
# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
# in version control.
# https://pdm.fming.dev/#use-with-ide
.pdm.toml

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Celery stuff
celerybeat-schedule
celerybeat.pid

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyCharm
# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/

README.md

Lines changed: 83 additions & 0 deletions
# k-Nearest Neighbors Classification Exercise

Today we will get to know the package `scikit-learn` (sklearn). It already implements many different machine learning algorithms, so we will be using it for the next five classes. The first algorithm we are going to learn today is the k-nearest neighbors algorithm. It can be used for classification as well as for regression.

Take a look at the file `src/nn_iris.py`. We will implement the TODOs step by step:

### Task 1: Loading the data

1. Install the `scikit-learn` package with
```bash
pip install -r requirements.txt
```
or directly via `pip install scikit-learn`.
The dataset <em>iris</em> is very popular among machine learning practitioners for example tasks. For this reason it ships directly with the sklearn package.

2. Navigate to the `__main__` function of `src/nn_iris.py` and load the iris dataset from `sklearn.datasets`.
The dataset contains several plants from different species of the genus Iris. For each example, the width and the length of the flower's petal and sepal were measured.
![A petal and a sepal of a flower (Wikipedia)](./figures/Petal_sepal.jpg)

3. Find out how to access the attributes of the dataset (Hint: set a breakpoint and examine the variable). Print the shape of the data matrix and the number of target entries. Print the names of the labels. Print the names of the features. A minimal sketch of this task follows below.
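
If you want to sanity-check your solution, here is a minimal sketch of steps 2 and 3. It assumes only that `scikit-learn` is installed; the variable name `iris` (reused in the later sketches) is a suggestion, not part of the assignment.

```python
from sklearn import datasets

# Load the iris dataset; it comes back as a Bunch (dictionary-like) object.
iris = datasets.load_iris()

# Shape of the data matrix and number of target entries.
print(iris.data.shape)    # (150, 4)
print(iris.target.shape)  # (150,)

# Names of the labels (species) and of the features (measurements).
print(iris.target_names)
print(iris.feature_names)
```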

### Task 2: Examining the data (optional)

Your goal is to determine the species of an example based on the dimensions of its petals and sepals. But first we need to inspect the dataset.

1. Use a histogram (class distribution) to check whether the iris dataset is balanced. To plot a histogram you can, for example, use `pandas.Series.hist` or `matplotlib.pyplot.hist`.
Fortunately, the iris dataset is balanced: it has the same number of samples for each species. Balanced datasets make it possible to proceed directly to the classification phase. Otherwise we would have to take additional steps to reduce the negative effects (e.g. collect more data) or use algorithms other than k-Nearest Neighbors (e.g. Random Forests).

2. We can also use pandas' `scatter_matrix` to visualize trends in our data. A scatter matrix (pairs plot) compactly plots all the numeric variables of a dataset against each other.
Plot the scatter matrix. To make the different species visually distinguishable, use the parameter `c=iris.target` in `pandas.plotting.scatter_matrix` to colorize the data points according to their target species.
In the scatter matrix you can see the ranges of values as well as the distribution of each attribute, and you can compare the groups in scatter plots over all pairs of attributes. The groups appear well separated, although two of them overlap slightly. A sketch of both plots follows below.
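
A possible way to produce both plots, assuming `iris` was loaded as in the previous sketch; the bin count and figure size are arbitrary choices:

```python
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix

# Class distribution: the iris targets are the integers 0, 1 and 2.
pd.Series(iris.target).hist(bins=3)
plt.xlabel("species index")
plt.ylabel("number of samples")
plt.show()

# Pairs plot of the four features, colored by species.
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
scatter_matrix(iris_df, c=iris.target, figsize=(10, 10))
plt.show()
```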

### Task 3: Training

First, we need to split the dataset into train and test data. Then we are ready to train the model.

1. Use `train_test_split` from `sklearn.model_selection` to create a train and a test set with the ratio 75:25. Print the dimensions of the train and the test set. You can use the parameter `random_state` to set the seed of the random number generator, which makes your results reproducible. Set this value to 29.

2. Define a classifier `knn` from the class `KNeighborsClassifier` and set the hyperparameter `n_neighbors` to 1.

3. Train the classifier on the training set. The method `fit()` is present in all estimators of the package `scikit-learn`. A minimal sketch of these steps follows below.
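
A minimal sketch of the split and the training step; the 75:25 ratio and `random_state=29` follow the task description, while the variable names are only suggestions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# 75:25 split with a fixed seed for reproducibility.
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=29
)
print(x_train.shape, x_test.shape)

# 1-nearest-neighbor classifier trained on the training split.
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(x_train, y_train)
```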

### Task 4: Prediction and Evaluation

The trained model is now able to receive input data and produce predictions of the labels.
1. Predict the labels first for the train and then for the test data.

2. Comparing the predicted labels with the true labels tells us a lot about how well our model performs. The simplest performance measure is the ratio of correct predictions to all predictions, called accuracy. Implement a function `compute_accuracy` to calculate the accuracy of predictions. Use your function to evaluate your model by calculating the accuracy on the train set and the test set. Print both results.

3. To judge whether our model performs well, we compare its performance to that of other models. Since we only know one classifier so far, we compare it to dummy models. A most-frequent model always predicts the label that occurs most often in the train set; if the train set is balanced, we simply choose one of the classes. Implement the function `accuracy_most_frequent` to compute the accuracy of the most-frequent model. (Hint: the function `numpy.bincount` might be helpful.) Print the result.

4. (Optional) Another dummy model is the stratified model, which assigns random labels based on the ratio of the labels in the train set. Implement the function `accuracy_stratified` to compute the accuracy of the stratified model. (Hint: `numpy.random.choice` might help.) Call the function several times and print the results; you will see that they differ between calls. To make them reproducible, it is useful to set a seed: call `numpy.random.seed` before calling the function and set the seed to 29. A sketch of these helpers follows below.
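
One way the predictions and the helper functions could look; this is a sketch that assumes integer-encoded labels and reuses the `knn`, `x_train`, `x_test`, `y_train` and `y_test` names from the previous sketch, not the reference solution:

```python
import numpy as np

# Task 4.1: predictions for the train and the test data.
y_pred_train = knn.predict(x_train)
y_pred_test = knn.predict(x_test)


def compute_accuracy(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Return the ratio of correct predictions to all predictions."""
    return float(np.mean(y_true == y_pred))


def accuracy_most_frequent(y_train: np.ndarray, y_test: np.ndarray) -> float:
    """Accuracy of always predicting the most frequent training label."""
    most_frequent = np.argmax(np.bincount(y_train))
    return compute_accuracy(y_test, np.full_like(y_test, most_frequent))


def accuracy_stratified(y_train: np.ndarray, y_test: np.ndarray) -> float:
    """Accuracy of random labels drawn with the training-set ratios."""
    classes = np.unique(y_train)
    ratios = np.bincount(y_train) / len(y_train)
    y_pred = np.random.choice(classes, size=len(y_test), p=ratios)
    return compute_accuracy(y_test, y_pred)


print(compute_accuracy(y_train, y_pred_train))
print(compute_accuracy(y_test, y_pred_test))
print(accuracy_most_frequent(y_train, y_test))
np.random.seed(29)
print(accuracy_stratified(y_train, y_test))
```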

### Task 5: Confusion matrix

Another common way to evaluate the performance of a classifier is to construct a confusion matrix, which shows not only the accuracy for each class (label) but also which classes the classifier confuses most often.

1. Use the function `confusion_matrix` to compute the confusion matrix for the test set.

2. (Optional) The accuracy of the prediction can be derived from the confusion matrix as the sum of the matrix diagonal divided by the sum of the whole matrix. Compute the accuracy from the confusion matrix and print the result.

3. We can also visualize the confusion matrix in the form of a heatmap. Use `ConfusionMatrixDisplay` to plot a heatmap of the confusion matrix for the test set. Use `display_labels=iris.target_names` for better readability. A sketch follows below.
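
A minimal sketch of all three steps, assuming `iris`, `y_test` and `y_pred_test` from the earlier sketches:

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# Confusion matrix for the test set: rows are true classes, columns predictions.
cm = confusion_matrix(y_test, y_pred_test)
print(cm)

# Accuracy recovered from the matrix: diagonal sum over total sum.
print(np.trace(cm) / np.sum(cm))

# Heatmap with the species names on the axes.
ConfusionMatrixDisplay(cm, display_labels=iris.target_names).plot()
plt.show()
```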

### Task 6: Hyperparameter tuning

Now we need to find the best value for our hyperparameter `k`. We will use a common procedure called <em>grid search</em> to search the space of possible values. Since our train dataset is small, we will perform cross-validation in order to compute the validation error for each value of `k`. Implement this hyperparameter tuning in the function `cv_knearest_classifier`, following these steps:

1. Define a second classifier `knn2`. Define a grid of parameter values for `k` from 1 to 25 (Hint: `numpy.arange`). The grid must be stored in a dictionary with `n_neighbors` as the key in order to use it with `GridSearchCV`.

2. Use the class `GridSearchCV` to perform the grid search. It can also perform n-fold cross-validation for you, so use the parameter `cv` to set the number of folds to 3. When everything is set up, you can train your `knn2`. A sketch of the grid search follows below.
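
One way to set up the search, assuming the training data from the Task 3 sketch; the name `grid_search` is illustrative:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Candidate values for k, keyed by the estimator's parameter name.
knn2 = KNeighborsClassifier()
param_grid = {"n_neighbors": np.arange(1, 26)}

# 3-fold cross-validated grid search over k.
grid_search = GridSearchCV(knn2, param_grid, cv=3)
grid_search.fit(x_train, y_train)

print(grid_search.best_params_, grid_search.best_score_)
```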

### Task 7: Testing

After the training you can access the best parameter via `best_params_`, the corresponding validation accuracy via `best_score_`, and the corresponding estimator via `best_estimator_`.

1. Use the best estimator to compute the accuracy on your train and test sets. Print the results. Has the test accuracy improved after the hyperparameter tuning?

2. Plot the new confusion matrix for the test set. A sketch of this evaluation follows below.
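
The evaluation might look roughly like this, reusing `compute_accuracy`, `grid_search` and the data splits from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

best_knn = grid_search.best_estimator_

# Accuracy of the tuned model on both splits.
print(compute_accuracy(y_train, best_knn.predict(x_train)))
print(compute_accuracy(y_test, best_knn.predict(x_test)))

# New confusion matrix for the test set.
ConfusionMatrixDisplay.from_estimator(
    best_knn, x_test, y_test, display_labels=iris.target_names
)
plt.show()
```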


## k-Nearest Neighbors Regression Exercise (Optional)

Navigate to the `__main__` function of `nn_regression.py` in the `src` directory and fill in the blanks by implementing the TODOs.

figures/Petal_sepal.jpg

48.9 KB

noxfile.py

Lines changed: 53 additions & 0 deletions
"""This module implements our CI function calls."""
import nox


@nox.session(name="test")
def run_test(session):
    """Run pytest."""
    session.install("-r", "requirements.txt")
    session.install("pytest")
    session.run("pytest")


@nox.session(name="lint")
def lint(session):
    """Check code conventions."""
    session.install("flake8")
    session.install(
        "flake8-black",
        "flake8-docstrings",
        "flake8-bugbear",
        "flake8-broken-line",
        "pep8-naming",
        "pydocstyle",
        "darglint",
    )
    session.run("flake8", "src", "tests", "noxfile.py")


@nox.session(name="typing")
def mypy(session):
    """Check type hints."""
    session.install("-r", "requirements.txt")
    session.install("mypy")
    session.run(
        "mypy",
        "--install-types",
        "--non-interactive",
        "--ignore-missing-imports",
        "--no-strict-optional",
        "--no-warn-return-any",
        "--implicit-reexport",
        "--allow-untyped-calls",
        "src",
    )


@nox.session(name="format")
def format(session):
    """Fix common convention problems automatically."""
    session.install("black")
    session.install("isort")
    session.run("isort", "src", "tests", "noxfile.py")
    session.run("black", "src", "tests", "noxfile.py")

pytest.ini

Lines changed: 4 additions & 0 deletions
[pytest]
markers =
    slow: this test is slow and should only run locally.
pythonpath = .
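
The `slow` marker declared above is attached to tests with `pytest.mark`; a hypothetical example (the test body is illustrative only):

```python
import time

import pytest


@pytest.mark.slow
def test_expensive_computation():
    """Runs only locally; deselect in CI with `pytest -m "not slow"`."""
    time.sleep(2)  # stand-in for an expensive computation
    assert True
```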
