class: center, middle
Gerard Escudero, 2020
class: left, middle, inverse
-
.cyan[Introduction]
-
Distances
-
Probabilities
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
.center[
class | sepal length | sepal width | petal length | petal width |
---|---|---|---|---|
setosa | 5.1 | 3.5 | 1.4 | 0.2 |
setosa | 4.9 | 3.0 | 1.4 | 0.2 |
versicolor | 6.1 | 2.9 | 4.7 | 1.4 |
versicolor | 5.6 | 2.9 | 3.6 | 1.3 |
virginica | 7.6 | 3.0 | 6.6 | 2.1 |
virginica | 4.9 | 2.5 | 4.5 | 1.7 |

150 rows or examples (50 per class).red[*]
]
-
The .blue[class] or .blue[target] column is usually referred to as the vector .blue[Y]
-
The matrix of the remaining columns (.blue[attributes] or .blue[features]) is usually referred to as the matrix .blue[X]
.footnote[.red[*] Source : Iris problem UCI repository (Frank & Asunción, 2010)]
.large[Build a .blue[model] from data that is able to give predictions for new .blue[unseen] examples.]
where:
- data = previous table
- unseen = [4.9, 3.1, 1.5, 0.1]
- prediction = "setosa"
.center[
quality | density | pH | sulphates | alcohol |
---|---|---|---|---|
6 | 0.998 | 3.16 | 0.58 | 9.8 |
4 | 0.9948 | 3.51 | 0.43 | 11.4 |
8 | 0.9973 | 3.35 | 0.86 | 12.8 |
3 | 0.9994 | 3.16 | 0.63 | 8.4 |
7 | 0.99514 | 3.44 | 0.68 | 10.55 |
1599 examples & 12 columns (11 attributes + 1 target).red[*]
]
The main difference between classification and regression is the Y or target values:
-
.blue[Classification]: discrete or nominal values
Example: Iris, {“setosa”, “virginica”, “versicolor”}.
-
.blue[Regression]: continuous or real values
Example: WineQuality, values from 0 to 10.
.footnote[.red[*] Source : wine quality problem from UCI repository (Frank & Asunción, 2010)]
-
Medicine: diagnosis of diseases
-
Engineering: fault diagnosis and detection
-
Computer Vision: face recognition
-
Natural Language Processing: spam filtering
-
Medicine: estimation of life after an organ transplant
-
Engineering: process simulation, prediction
-
Computer Vision: Face completion
-
Natural Language Processing: opinion detection
class: left, middle, inverse
-
.brown[Introduction]
-
.cyan[Distances]
-
.cyan[kNN]
-
Centroids
-
-
Probabilities
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
- euclidean: $d(X,Z)=\sqrt{\sum_i (x_i-z_i)^2}$
- hamming: $d(X,Z)=\sum_i 1[x_i \neq z_i]$ (number of differing components)
.tiny[https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html]
-
k Nearest Neighbors (kNN)
-
Centroids (or Linear Classifier)
- How can we give a prediction for the following examples?
.center[
class | sep-len | sep-wid | pet-len | pet-wid |
---|---|---|---|---|
?? | 4.9 | 3.1 | 1.5 | 0.1 |
Unseen classification example on Iris]
.center[
target | density | pH | sulphates | alcohol |
---|---|---|---|---|
?? | 0.99546 | 3.29 | 0.54 | 10.1 |
Unseen regression example on WineQuality]
- Let’s begin with a representation of the problems...
- classification & regression
-
.blue[classification] example (Iris):
- distances: [0.47, 0.17, 3.66, 2.53, 6.11, 3.45]
- prediction = setosa (0.17)
-
.blue[regression] example (WineQuality):
- distances: [0.33, 1.32, 2.72, 1.71, 0.49]
- prediction = 6 (0.33)
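A minimal numpy sketch of the 1NN classification above, using the six Iris rows from the introduction; it reproduces the distance values listed for the classification example:

```python
# Minimal 1NN sketch on the six Iris rows shown in the introduction (numpy only).
import numpy as np

X = np.array([[5.1, 3.5, 1.4, 0.2],   # setosa
              [4.9, 3.0, 1.4, 0.2],   # setosa
              [6.1, 2.9, 4.7, 1.4],   # versicolor
              [5.6, 2.9, 3.6, 1.3],   # versicolor
              [7.6, 3.0, 6.6, 2.1],   # virginica
              [4.9, 2.5, 4.5, 1.7]])  # virginica
y = np.array(['setosa', 'setosa', 'versicolor',
              'versicolor', 'virginica', 'virginica'])
T = np.array([4.9, 3.1, 1.5, 0.1])    # unseen example

d = np.sqrt(((X - T) ** 2).sum(axis=1))  # euclidean distances to T
print(d.round(2))      # [0.47 0.17 3.66 2.53 6.11 3.45]
print(y[d.argmin()])   # setosa (nearest example at 0.17)
```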
-
build the set $S$ of the $k$ $y_i$'s with minimum distance to the unseen example $T$ (as in 1NN)
-
prediction: mode of $S$ (classification) or mean of $S$ (regression)
- .blue[classification]: Iris & euclidean distance
- .blue[regression]: WineQuality & euclidean distance
-
the k value is usually an odd or prime number to avoid .blue[ties]
-
a usual modification of the algorithm is to .blue[weight] points by the inverse of their distance in the mode or average functions
-
.blue[lazy learning]: it does nothing in learning step; it calculates all in classification step
-
This can produce some problems in real-time applications
-
This makes kNN one of the most useful algorithms for missing value imputation
-
-
.blue[nominal features]
-
changing the distance (e.g., hamming)
-
codifying them as numerical (to see in lab)
-
from sklearn.neighbors import KNeighborsClassifier
k = 3
clf = KNeighborsClassifier(n_neighbors=k)
from sklearn.neighbors import KNeighborsRegressor
k = 3
rgs = KNeighborsRegressor(n_neighbors=k)
clf = KNeighborsClassifier(n_neighbors=k, weights='distance')
# for weighted majority votes or average
.tiny[https://scikit-learn.org/stable/modules/neighbors.html]
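A possible end-to-end sketch on the full Iris data set (`load_iris` from `sklearn.datasets` is assumed here; class indices 0, 1, 2 stand for setosa, versicolor and virginica):

```python
# kNN usage sketch on the full Iris data set.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```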
class: left, middle, inverse
-
.brown[Introduction]
-
.cyan[Distances]
-
.brown[kNN]
-
.cyan[Centroids]
-
-
Probabilities
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
centroids | sep-len | sep-wid | pet-len | pet-wid |
---|---|---|---|---|
setosa | 5.0 | 3.25 | 1.4 | 0.2 |
versicolor | 5.85 | 2.9 | 4.15 | 1.35 |
virginica | 6.25 | 2.75 | 5.55 | 1.9 |
from sklearn.neighbors import NearestCentroid
clf = NearestCentroid()
.tiny[https://scikit-learn.org/stable/modules/neighbors.html#nearest-centroid-classifier]
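A small sketch, reusing the six Iris rows from the kNN example; `centroids_` reproduces the table above and the unseen example is again predicted as setosa:

```python
# Nearest-centroid sketch on the six Iris rows from the introduction.
import numpy as np
from sklearn.neighbors import NearestCentroid

X = np.array([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],
              [6.1, 2.9, 4.7, 1.4], [5.6, 2.9, 3.6, 1.3],
              [7.6, 3.0, 6.6, 2.1], [4.9, 2.5, 4.5, 1.7]])
y = np.array(['setosa', 'setosa', 'versicolor',
              'versicolor', 'virginica', 'virginica'])

clf = NearestCentroid()
clf.fit(X, y)
print(clf.centroids_)                        # the centroid table above
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # ['setosa']
```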
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.cyan[Probabilities]
-
.cyan[Naïve Bayes]
-
LDA
-
Logistic Regression
-
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
- Induction principle: .blue[probabilities]
class | cap-shape | cap-color | gill-size | gill-color |
---|---|---|---|---|
poisonous | convex | brown | narrow | black |
edible | convex | yellow | broad | black |
edible | bell | white | broad | brown |
poisonous | convex | white | narrow | brown |
edible | convex | yellow | broad | brown |
edible | bell | white | broad | brown |
poisonous | convex | white | narrow | pink |
.center[up to 8124 examples & 22 attributes .red[*]]
- What is .blue[$P(poisonous)$]?
.footnote[.red[*] Source : Mushroom problem from UCI repository (Frank & Asunción, 2010)]
- In most cases we estimate it from data (.blue[maximum likelihood estimation]): in the table above, $P(poisonous) = 3/7 \approx 0.429$
- How can we give a prediction from probabilities for the next example?
class | cap-shape | cap-color | gill-size | gill-color |
---|---|---|---|---|
?? | convex | brown | narrow | black |
-
Some algorithms:
-
Naïve Bayes
-
LDA (Linear Discriminant Analysis)
-
Logistic regression
-
.col5050[ .col1[
class | $P(y)$ |
---|---|
poisonous | 0.429 |
edible | 0.571 |
]
.col2[
attr:value | poisonous | edible |
:--- | ---: | ---: |
cap-shape:convex | 1 | 0.5 |
cap-shape:bell | 0 | 0.5 |
cap-color:brown | 0.33 | 0 |
cap-color:yellow | 0 | 0.5 |
cap-color:white | 0.67 | 0.5 |
gill-size:narrow | 1 | 0 |
gill-size:broad | 0 | 1 |
gill-color:black | 0.33 | 0.25 |
gill-color:brown | 0.33 | 0.75 |
gill-color:pink | 0.33 | 0 |
]
]
- Test example $T$:

class | cap-shape | cap-color | gill-size | gill-color |
---|---|---|---|---|
?? | convex | brown | narrow | black |

- Numbers:

$$P(poisonous|T) = 0.429 \cdot 1 \cdot 0.33 \cdot 1 \cdot 0.33 = 0.047$$
$$P(edible|T) = 0.571 \cdot 0.5 \cdot 0 \cdot 0 \cdot 0.25 = 0$$

- Prediction:

$$h(T) = poisonous$$
-
It needs a smoothing technique to avoid zero counts
- Example: Laplace
$$P(x_i|y)\approx\frac{N(x_i|y)+1}{N(y)+N}$$
-
It assumes conditional independence between every pair of features
-
It is empirically a decent classifier but a bad estimator
- This means that $P(y|T)$ is not a good probability
.blue[Gaussian Naïve Bayes] is an implementation that assumes a Gaussian distribution of the features:
.blue[Classification]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
It has no .blue[parameters]
.blue[User guide]:
.tiny[https://scikit-learn.org/stable/modules/naive_bayes.html]
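A possible usage sketch on Iris, whose continuous features suit the Gaussian assumption (`load_iris` is assumed):

```python
# Gaussian Naive Bayes usage sketch on Iris.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, Y = load_iris(return_X_y=True)
clf = GaussianNB()
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))                # expected: [0] (setosa)
print(clf.predict_proba([[4.9, 3.1, 1.5, 0.1]]).round(3)) # class probabilities
```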
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.cyan[Probabilities]
-
.brown[Naïve Bayes]
-
.cyan[LDA]
-
Logistic Regression
-
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
also from Bayes rule & Gaussian distributions:
where
.tiny[.red[Source]: https://sebastianraschka.com/faq/docs/lda-vs-pca.html]
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
Xpca = pca.fit(X).transform(X)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
Xlda = lda.fit(X, Y).transform(X)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis(shrinkage='auto',solver='lsqr')
.tiny[https://scikit-learn.org/stable/modules/lda_qda.html]
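A possible usage sketch of LDA as a classifier on Iris (`load_iris` is assumed; the parameters are the ones shown above):

```python
# LDA as a classifier on Iris.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, Y = load_iris(return_X_y=True)
clf = LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr')
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```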
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.cyan[Probabilities]
-
.brown[Naïve Bayes]
-
.brown[LDA]
-
.cyan[Logistic Regression]
-
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
-
Also known as Maximum Entropy
-
Regression of the probability: $P(y=1|X)=\sigma(w\cdot X+b)$, where $\sigma(z)=1/(1+e^{-z})$ is the logistic function
-
.blue[Binary Classification]:
.tiny[.red[Source]: https://en.wikipedia.org/wiki/Logistic_function]
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
solver = default 'lbfgs'
max_iter = default 100
.tiny[https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression]
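A possible usage sketch on Iris (`load_iris` is assumed; `max_iter` is raised a bit so the default lbfgs solver converges on the raw features):

```python
# Logistic regression usage sketch on Iris (multiclass handled automatically).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, Y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200)
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))                # expected: [0] (setosa)
print(clf.predict_proba([[4.9, 3.1, 1.5, 0.1]]).round(3)) # class probabilities
```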
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.cyan[Rules]
-
.cyan[Decision Trees]
-
Ensembles
-
-
Hyperplanes
-
Learning Theory
-
References
- Induction principle: .blue[rules]
class | cap-shape | cap-color | gill-color |
---|---|---|---|
poisonous | convex | brown | black |
edible | convex | yellow | black |
edible | bell | white | brown |
poisonous | convex | white | brown |
edible | convex | yellow | brown |
edible | bell | white | brown |
poisonous | convex | white | pink |
.center[up to 8124 examples & 22 attributes.red[*]]
- Which rules can be extracted from data?
.blue[$$\text{gill-color}=\text{pink}\Longrightarrow\text{poisonous}$$]
.footnote[.red[*] .red[Source]: ``mushroom'' problem from UCI Repository (Frank & Asunción, 2010)]
.blue[Resulting tree]:
.blue[Classification]: exploring the tree using test
class | cap-shape | cap-color | gill-color |
---|---|---|---|
?? | convex | brown | black |
.center[.blue[prediction: poisonous]]
.blue[Learn]:
build a tree by recursively splitting on one of the attributes with the highest accuracy
.blue[Step 1]: every attribute is evaluated.red[*]
.footnote[.red[*] the number of examples with the mode is assigned to each value]
.blue[Step 2]: one of the best attributes is selected as a node of the tree:
.blue[Step 3]: for every value with only a class a leaf is created:
.blue[Step 4]: a new set is built for the rest of values
.center[white examples without cap-color]
class | cap-shape | gill-color |
---|---|---|
edible | bell | brown |
poisonous | convex | brown |
edible | bell | brown |
poisonous | convex | pink |
.blue[Step 5]: the algorithm restarts with the previous set:
.blue[Step 6]: the algorithm ends when no attributes are left or full accuracy is achieved
- What about .blue[numerical attributes]?
class | length | width |
---|---|---|
versicolor | 6.1 | 2.9 |
versicolor | 5.6 | 2.9 |
virginica | 7.6 | 3.0 |
virginica | 4.9 | 2.5 |
- .blue[Cutting points] for width attribute
class | width | cutting points | accuracy |
---|---|---|---|
virginica | 2.5 | ||
versicolor | 2.9 | 2.7 | |
versicolor | 2.9 | ||
virginica | 3.0 | 2.95 |
Entropy is a measure of the amount of uncertainty in the data:
$$H(S) = -\sum_i p_i \log_2 p_i$$
Information gain is a measure of the difference of entropy before and after splitting on an attribute $A$:
$$IG(S,A) = H(S) - \sum_{v \in values(A)} \frac{\vert S_v \vert}{\vert S \vert} H(S_v)$$
- Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled:
$$G(S) = 1 - \sum_i p_i^2$$
.center[It is used instead of entropy]
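A minimal sketch of both impurity measures, applied to the class column of the seven mushroom rows above (3 poisonous, 4 edible):

```python
# Entropy and Gini impurity for a list of class labels (numpy only).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

y = ['poisonous'] * 3 + ['edible'] * 4   # class column of the mushroom sample
print(round(entropy(y), 3))   # 0.985
print(round(gini(y), 3))      # 0.49
```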
-
It allows regression
-
Example:
.center[.tiny[https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/]]
.center[It also contains a regression example]
Models look like.red[*]:
-
The resulting rules are very easy for humans to understand
-
Normalization does not affect trees
-
Complex and big trees tend to overfit (they do not generalize well)
-
Small changes in the data may produce very different trees
.footnote[.red[*] .red[Source]: https://scikit-learn.org/stable/modules/tree.html]
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
from sklearn.tree import DecisionTreeRegressor
rgs = DecisionTreeRegressor()
criterion = 'entropy' or 'gini'
max_depth = int or None
.tiny[https://scikit-learn.org/stable/modules/tree.html]
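A possible usage sketch on Iris that also prints the learned rules with `export_text` (assumed from `sklearn.tree`):

```python
# Decision tree usage sketch on Iris, printing the learned rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, Y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf.fit(X, Y)
print(export_text(clf, feature_names=load_iris().feature_names))
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```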
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.cyan[Rules]
-
.brown[Decision Trees]
-
.cyan[Ensembles]
-
-
Hyperplanes
-
Learning Theory
-
References
-
Ensembles: combination of classifiers to improve generalization
-
Meta-learners
-
Two big families:
-
.blue[Averaging]:
average of predictions;
improves by reducing variance;
examples: Bagging & Random Forests
-
.blue[Boosting]:
incrementally emphasizes errors;
improves by reducing bias;
examples: AdaBoost & Gradient Boosting
-
-
Any algorithm can be used as the base estimator, but Decision Trees are the most used
-
User Guide:
https://scikit-learn.org/stable/modules/ensemble.html
-
Build a collection of data sets by randomly sampling with replacement from the original data
-
An estimator is built for each of the previous sets
-
The prediction is obtained by averaging the predictions of the previous estimators
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier())
...
# In a similar way with BaggingRegressor
.tiny[https://scikit-learn.org/stable/modules/ensemble.html#bagging]
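A possible end-to-end sketch on Iris, scoring the bagged kNN with 5-fold cross-validation (`load_iris` and `cross_val_score` are assumed):

```python
# Bagging usage sketch: kNN base estimators evaluated with cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_iris(return_X_y=True)
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=10)
print(cross_val_score(bagging, X, Y, cv=5).mean())   # mean accuracy
```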
It is a variant of bagging with decision trees as the base estimator
Nodes in the decision trees are selected among a random subset of the features
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
...
# In a similar way with RandomForestRegressor
https://scikit-learn.org/stable/modules/ensemble.html#forest
-
Learn a weak classifier at each iteration (on a reweighted sample distribution)
-
At each iteration, the weights of misclassified examples are increased and the rest are decreased
.footnote[.red[*] .red[Source]: https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture10.pdf]
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
...
# In a similar way with AdaBoostRegressor
Parameters:
n_estimators=10
User Guide:
https://scikit-learn.org/stable/modules/ensemble.html#adaboost
Generalization of boosting obtained by optimizing arbitrary loss functions with gradient descent
.tiny[Instead of training on a newly sample distribution, the weak learner trains on the remaining errors of the strong learner.]
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
...
# In a similar way with GradientBoostingRegressor
Parameters:
n_estimators=10
max_depth=1
User Guide:
https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
XGBoost: another implementation
https://pandas-ml.readthedocs.io/en/latest/xgboost.html
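A rough comparison sketch of both boosting flavours on Iris (assumed setup; scores depend on the chosen parameters):

```python
# Comparing AdaBoost and Gradient Boosting with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_iris(return_X_y=True)
for clf in (AdaBoostClassifier(n_estimators=10),
            GradientBoostingClassifier(n_estimators=10, max_depth=1)):
    print(clf.__class__.__name__,
          cross_val_score(clf, X, Y, cv=5).mean().round(3))
```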
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.cyan[Hyperplanes]
-
.cyan[Kernels]
-
SVM
-
Neural Networks
-
-
Learning Theory
-
References
.cols5050[ .col1[
where:
.center[.tiny[https://sv.wikipedia.org/wiki/Fil:Scalar-product-dot-product.svg]]
]]
The next data set is .blue[not linearly separable]
.blue[Kernels] try to convert data sets into linearly separable ones through projections
Given a projection function $\phi(X)$,
a .blue[kernel] is defined as $K(X,Z)=\langle\phi(X),\phi(Z)\rangle$
.footnote[.red[*] .red[source]: https://en.wikipedia.org/wiki/File:Kernel_Machine.svg]
Kernel matrix: $K_{ij}=K(x_i,x_j)$, for every pair of training examples
Usual Kernels.red[*]
-
.blue[linear]: $\langle X,Z\rangle$
-
.blue[polynomial]: $(\gamma\langle X,Z\rangle+r)^d$
-
.blue[rbf] (radial basis function): $\exp(-\gamma\Vert X-Z\Vert^2)$
.footnote[.red[*] in sklearn: https://scikit-learn.org/stable/modules/svm.html#svm-kernels]
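A minimal numpy sketch of the rbf kernel, checked against sklearn's `rbf_kernel` (the three rows and the gamma value are arbitrary):

```python
# The rbf kernel matrix computed explicitly and checked against sklearn.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[5.1, 3.5, 1.4, 0.2],
              [6.1, 2.9, 4.7, 1.4],
              [7.6, 3.0, 6.6, 2.1]])
gamma = 0.1

# K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dists)
print(np.allclose(K, rbf_kernel(X, gamma=gamma)))   # True
```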
Prediction function:
-
Kernels have been applied to many algorithms:
-
Centroids
-
PCA
-
k-means
-
SVMs
-
-
Kernels can be adapted to the problem as another way to represent data
- There are many kernels for structured data: trees, graphs, sets...
Example: kernel for sets
Representation of the previous not linearly separable data set:
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(kernel='rbf', gamma=1)
https://scikit-learn.org/stable/modules/decomposition.html#kernel-pca
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.cyan[Hyperplanes]
-
.brown[Kernels]
-
.cyan[SVM]
-
Neural Networks
-
-
Learning Theory
-
References
Those that maximize the .blue[margin].
Those nearest the margin.
- Linear: $h(T)=\mathrm{sign}(\langle w,T\rangle + b)$
- General kernel: $h(T)=\mathrm{sign}\left(\sum_i \alpha_i y_i K(x_i,T) + b\right)$
This is called .blue[soft margin].
- SVMs support kernels.red[*]:
- It also support .blue[custom kernels].
.footnote[.red[*] .red[Source]: https://scikit-learn.org/stable/modules/svm.html#svm-classification]
- Which is the model for .blue[regression]?
- It has an additional parameter: the $\varepsilon$-tube
.footnote[.red[*] .red[Source]: https://www.saedsayad.com/support_vector_machine_reg.htm]
.cols5050[ .col1[
from sklearn.svm import SVC
clf = SVC()
]
.col2[
from sklearn.svm import SVR
rgs = SVR()
]]
kernel = 'linear', 'poly', 'rbf', 'precomputed'...
degree = 2, 3...
gamma = 'scale', 1, 0.1, 10...
C = 1, 10, 0.1 # penalty of soft margin
epsilon = 0.1
max_iter = -1, 1000...
https://scikit-learn.org/stable/modules/svm.html#svm-classification
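A possible usage sketch of an rbf-kernel SVC on Iris (`load_iris` is assumed; the parameter values are illustrative):

```python
# SVC usage sketch on Iris with an rbf kernel.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, Y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf', C=1, gamma='scale')
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
print(clf.support_vectors_.shape)            # number of support vectors found
```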
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.cyan[Hyperplanes]
-
.brown[Kernels]
-
.brown[SVM]
-
.cyan[Neural Networks]
-
-
Learning Theory
-
References
.footnote[Source: Artificial Neuron]
.cols5050[ .col1[
-
Classification and regression
-
Linear model
-
Classification: $h(x) = \mathrm{sign}(\langle w, x\rangle + b)$
- Learning rule: $w \leftarrow w + \eta\,(y - h(x))\,x$
.tiny[.red[*] Source: wikipedia]
]]
from sklearn.linear_model import Perceptron
clf = Perceptron()
max_iter: default=1000
.tiny[https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html]
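A minimal numpy sketch of the perceptron learning rule on a toy linearly separable problem with labels in $\{-1,+1\}$ (data and learning rate are arbitrary):

```python
# Perceptron learning rule on a toy 2D problem (numpy only).
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
eta, w, b = 0.1, np.zeros(2), 0.0

for epoch in range(10):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi       # learning rule: move w towards y * x
            b += eta * yi
print(w, b)
print(np.sign(X @ w + b))            # [ 1.  1. -1. -1.]
```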
.col5050[ .col1[
-
Hidden layers
-
Non-linear model
-
Classification & regression
-
Forward propagation of perceptrons
-
Backpropagation as training algorithm
Gradient descent (optimization)
]]
.footnote[Source: wikipedia]
-
Regression: mean squared error or root mean squared error
-
Binary classification: binary cross entropy
-
Multiclass classification: categorical cross entropy
-
Hidden units: ReLU, $f(x)=\max(0,x)$
-
Output units:
- Regression: linear, $f(x) = x$
- Binary classification: sigmoid, $f(x) = 1 / (1 + e^{-x})$
- Multiclass classification: softmax (one unit per class); prediction is the maximum value after normalizing the outputs as a distribution
-
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(25,))
from sklearn.neural_network import MLPRegressor
rgs = MLPRegressor(hidden_layer_sizes=(25,))
# number of units per hidden layer
hidden_layer_sizes: default (100,)
# training examples per step
batch_size: default=min(200, n_samples)
# epochs: training of all training examples one time
max_iter: default=200
# learning rate (eta)
learning_rate_init: default=0.001
...
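A possible usage sketch of a one-hidden-layer MLP on Iris (`load_iris` is assumed; `max_iter` and `random_state` are illustrative choices):

```python
# MLP usage sketch on Iris with one hidden layer of 25 units.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, Y = load_iris(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(25,), max_iter=1000,
                    learning_rate_init=0.001, random_state=1)
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```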
-
Experimentally adjusted:
- low values $\rightarrow$ good performance, high computational time
- high values $\rightarrow$ erratic (jumping) search, low computational time
- trade-off (set experimentally)
-
It can be adaptive
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.brown[Hyperplanes]
-
.cyan[Learning Theory]
-
References
- if 11 red points are the available data set:
-
it can be approximated by:
- a 10-degree polynomial: $R^2=1.0$
- a straight line: $R^2=0.122$
-
What happens at $x=5.5$?
.footnote[.red[Source]: https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577]
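A toy sketch of the same effect with numpy, assuming 11 points generated from a noisy straight line (not the original figure's data):

```python
# Over- vs underfitting sketch: degree-10 interpolation vs a straight line.
import numpy as np

x = np.arange(11, dtype=float)                    # x = 0..10
rng = np.random.default_rng(0)
y = 0.5 * x + rng.normal(scale=1.0, size=x.size)  # linear trend + noise

p10 = np.polyfit(x, y, deg=10)   # interpolates all 11 points (R^2 = 1)
p1 = np.polyfit(x, y, deg=1)     # straight-line fit

# Both fit the training points well but can disagree in between, e.g. at x = 5.5
print(np.polyval(p10, 5.5), np.polyval(p1, 5.5))
```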
.col2[
-
trade-off
-
overfitting == low bias, high variance
-
underfitting == high bias, low variance
-
noise dominates ]]
.footnote[.red[Source]: https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577]
.cols5050[ .col1[
-
Symptoms: Training error too high
-
Causes:
- model too simple
- not enough training
-
Solutions:
- increase model complexity
- train longer ] .col2[
-
Symptoms: training error low but test error high
-
Causes:
- model too complex
- too much training
- training set too small
-
Solutions:
- reduce model complexity
- stop training (early stopping)
- get more training data / data augmentation
]]
-
.blue[Occam's razor in learning]:
simpler models are more likely to be correct than complex ones
-
.blue[No free lunch theorem]:
there is no method which outperforms all others for all data sets
-
.blue[Curse of dimensionality]:
when the dimensionality increases, the amount of data needed to support the result often grows exponentially
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.brown[Hyperplanes]
-
.brown[Learning Theory]
-
.cyan[References]
-
Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition. O'Reilly, 2019.
-
Samir Kanaan. Neural Networks and Deep Learning. Improving your Neural Networks, 2020.
-
Gerard Escudero. Machine Learning for Games, 2019. (url)
-
UCI Machine Learning Repository (url)
-
jupyter: interactive computing (url)
-
pandas: python data analysis library (url)
-
scikit-learn: machine learning in python (url)
-
pandas-ml: pandas machine learning (url)
-
tutorial markdown: lightweight syntax for writing (url)