Commit 0787bb4

some clarifications

1 parent 1b9987e

1 file changed: +11 -11 lines

README.md
Today we will get to know the package `scikit-learn` (sklearn). It has many diff…

Take a look at the file `src/nn_iris.py`. We will implement the TODOs step by step:

### Task 1: Loading the data

1. Install the `scikit-learn` package with
```bash
pip install scikit-learn
```

3. Find out how to access the attributes of the dataset (Hint: set a breakpoint and examine the variable). Print the shape of the data matrix and the number of target entries. Print the names of the labels and the names of the features (see the sketch below).
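A minimal sketch of step 3, assuming the dataset is the iris set loaded via `sklearn.datasets.load_iris` (the attribute names below are the ones that loader provides):

```python
from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)     # shape of the data matrix: (150, 4)
print(iris.target.shape)   # number of target entries: (150,)
print(iris.target_names)   # names of the labels (the three species)
print(iris.feature_names)  # names of the features
```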
### Task 2: Examining the data (optional)

Your goal is to determine the species of an example based on the dimensions of its petals and sepals. But first we need to inspect the dataset.
Plot the scatter matrix. To make the different species visually distinguishable, use the parameter `c=iris.target` in `pandas.plotting.scatter_matrix` to colorize the datapoints according to their target species.

In the scatter matrix you can see the domains of values as well as the distributions of each of the attributes. It also lets you compare the groups in scatter plots over all pairs of attributes. From those plots the groups seem well separated, although two of them slightly overlap.
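A minimal sketch of the scatter matrix plot; wrapping the data matrix in a `pandas` DataFrame (an assumption about how the exercise feeds `scatter_matrix`) labels the axes with the feature names:

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.datasets import load_iris

iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)
# colorize the datapoints according to their target species
pd.plotting.scatter_matrix(iris_df, c=iris.target, figsize=(10, 10))
plt.show()
```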
### Task 3: Training

First, we need to split the dataset into train and test data. Then we are ready to train the model.
3. Train the classifier on the training set. The method `fit()` is present in all estimators of the `scikit-learn` package (see the sketch below).
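A minimal sketch of the training steps, assuming a `train_test_split` and a `KNeighborsClassifier` (the variable names, split ratio, and seed are placeholders):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
# hold out a quarter of the data as the test set (ratio and seed are assumptions)
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=0
)
knn = KNeighborsClassifier(n_neighbors=3)  # placeholder k; tuned in Task 6
knn.fit(X_train, y_train)
```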
### Task 4: Prediction and Evaluation

The trained model is now able to receive input data and produce predictions of the labels.

1. Predict the labels first for the train and then for the test data.
2. Comparing a predicted label with the true label tells us valuable information about how well our model performs. The simplest performance measure is the ratio of correct predictions to all predictions, called accuracy. Implement a function `compute_accuracy` to calculate the accuracy of predictions. Use your function to evaluate your model by calculating the accuracy on the train set and the test set. Print both results.
3. To evaluate whether our model performs well, we compare its performance to that of other models. Since we only know one classifier so far, we will compare it to dummy models. Dummy models are not trained on the data; instead, they follow some simple rule to decide which prediction to make. One dummy model is the "most frequent" model: it always predicts the label that occurs most often in our train set. If the train set is balanced, we choose one of the classes. Implement the function `accuracy_most_frequent` to compute the accuracy of the most frequent model. (Hint: the function `numpy.bincount` might be helpful.) Print the result.
4. (Optional) Another dummy model is the stratified model. A stratified model assigns random labels based on the ratio of the labels in the train set, so labels that occur more frequently have a higher chance of being chosen, but there is still a chance for a rarer label to be picked. Implement the function `accuracy_stratified` to compute the accuracy of the stratified model. (Hint: `numpy.random.choice` might help.) Call the function several times and print the results. You will see that the results differ. In order to reproduce the results, it is useful to set a seed. Use `numpy.random.seed` before calling the function and set it to 29. A sketch of all three accuracy functions follows below.
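One possible implementation of the three functions (the exact signatures are an assumption; the exercise file may define them differently):

```python
import numpy as np

def compute_accuracy(y_true, y_pred):
    """Ratio of correct predictions to all predictions."""
    return np.mean(np.asarray(y_true) == np.asarray(y_pred))

def accuracy_most_frequent(y_train, y_true):
    """Accuracy of a dummy model that always predicts the most frequent train label."""
    most_frequent = np.bincount(y_train).argmax()
    return compute_accuracy(y_true, np.full(len(y_true), most_frequent))

def accuracy_stratified(y_train, y_true):
    """Accuracy of a dummy model that draws labels with the train-set label ratios."""
    ratios = np.bincount(y_train) / len(y_train)
    y_pred = np.random.choice(len(ratios), size=len(y_true), p=ratios)
    return compute_accuracy(y_true, y_pred)
```

Calling `numpy.random.seed(29)` before `accuracy_stratified` makes its random draws reproducible.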
### Task 5: Confusion matrix

Another common method to evaluate the performance of a classifier is to construct a confusion matrix, which shows not only the accuracy for each of the classes (labels) but also which classes the classifier is most confused about.
3. We can also visualize the confusion matrix in the form of a heatmap. Use `ConfusionMatrixDisplay` to plot a heatmap of the confusion matrix for the test set. Use `display_labels=iris.target_names` for a better visualization (see the sketch below).
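A minimal sketch of the heatmap step, reusing `knn`, `X_test`, and `y_test` from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

# rows are the true classes, columns the predicted classes
cm = confusion_matrix(y_test, knn.predict(X_test))
ConfusionMatrixDisplay(cm, display_labels=iris.target_names).plot()
plt.show()
```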
### Task 6: Hyperparameter tuning

Now we need to find the best value for our hyperparameter `k`. We will use a common procedure called <em>grid search</em> to search the space of possible values. Since our train dataset is small, we will perform cross-validation in order to compute the validation error for each value of `k`. Implement this hyperparameter tuning in the function `cv_knearest_classifier` following these steps:
1. Define a second classifier `knn2`. Define a grid of parameter values for `k` from 1 to 25 (Hint: `numpy.arange`). This grid must be stored in a dictionary with `n_neighbors` as the key in order to use it with `GridSearchCV`.
2. Use the class `GridSearchCV` to perform the grid search. It also gives you the possibility to perform n-fold cross-validation, so use the parameter `cv` to set the number of folds to 3. When everything is set, you can train your `knn2` by fitting the `GridSearchCV` object (see the sketch below).
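A short sketch of the grid search, reusing `X_train` and `y_train` from the Task 3 sketch:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# grid of k values from 1 to 25, keyed by the estimator's parameter name
param_grid = {"n_neighbors": np.arange(1, 26)}
knn2 = GridSearchCV(KNeighborsClassifier(), param_grid, cv=3)
knn2.fit(X_train, y_train)  # runs 3-fold cross-validation for every k
```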
### Task 7: Testing

After the training you can access the best parameter via `best_params_`, the corresponding validation accuracy via `best_score_`, and the corresponding estimator via `best_estimator_`.
2. Plot the new confusion matrix for the test set (see the sketch below).
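A short sketch of this task, assuming `knn2` is the fitted `GridSearchCV` object and `X_test`, `y_test`, and `iris` come from the earlier sketches:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay, confusion_matrix

print(knn2.best_params_)   # best value of k found by the grid search
print(knn2.best_score_)    # corresponding cross-validation accuracy
best_knn = knn2.best_estimator_

# confusion matrix of the tuned model on the test set
cm = confusion_matrix(y_test, best_knn.predict(X_test))
ConfusionMatrixDisplay(cm, display_labels=iris.target_names).plot()
plt.show()
```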
## ✪ Task 8: k-Nearest Neighbors Regression Exercise (Optional)

Navigate to the `__main__` function of `nn_regression.py` in the `src` directory and fill in the blanks by implementing the TODOs.
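The exercise file is not reproduced here; as a rough illustration of k-NN regression with `scikit-learn` (the data below is synthetic, not from the exercise):

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# synthetic 1-D regression problem, purely for illustration
rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(40, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=40)

# the prediction is the mean of the 5 nearest neighbors' targets
reg = KNeighborsRegressor(n_neighbors=5)
reg.fit(X, y)
print(reg.predict([[2.5]]))  # roughly sin(2.5)
```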
