class: center, middle
Gerard Escudero, 2020
class: left, middle, inverse
-
.cyan[Introduction]
-
Distances
-
Probabilities
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
.center[
class | sepal length | sepal width | petal length | petal width |
---|---|---|---|---|
setosa | 5.1 | 3.5 | 1.4 | 0.2 |
setosa | 4.9 | 3.0 | 1.4 | 0.2 |
versicolor | 6.1 | 2.9 | 4.7 | 1.4 |
versicolor | 5.6 | 2.9 | 3.6 | 1.3 |
virginica | 7.6 | 3.0 | 6.6 | 2.1 |
virginica | 4.9 | 2.5 | 4.5 | 1.7 |

150 rows or examples (50 per class).red[*]
]
-
The .blue[class] or .blue[target] column is usually referred to as the vector .blue[Y]
-
The matrix of the remaining columns (.blue[attributes] or .blue[features]) is usually referred to as the matrix .blue[X]
.footnote[.red[*] Source : Iris problem UCI repository (Frank & Asunción, 2010)]
.large[Build a .blue[model] from data that is able to give predictions for new .blue[unseen] examples.]
where:
- data = previous table
- unseen = [4.9, 3.1, 1.5, 0.1]
- prediction = "setosa"
.center[
quality | density | pH | sulphates | alcohol |
---|---|---|---|---|
6 | 0.998 | 3.16 | 0.58 | 9.8 |
4 | 0.9948 | 3.51 | 0.43 | 11.4 |
8 | 0.9973 | 3.35 | 0.86 | 12.8 |
3 | 0.9994 | 3.16 | 0.63 | 8.4 |
7 | 0.99514 | 3.44 | 0.68 | 10.55 |
1599 examples & 12 columns (11 attributes + 1 target).red[*]
]
The main difference between classification and regression is the Y or target values:
-
.blue[Classification]: discrete or nominal values
Example: Iris, {“setosa”, “virginica”, “versicolor”}.
-
.blue[Regression]: continuous or real values
Example: WineQuality, values from 0 to 10.
.footnote[.red[*] Source : wine quality problem from UCI repository (Frank & Asunción, 2010)]
-
Medicine: diagnosis of diseases
-
Engineering: fault diagnosis and detection
-
Computer Vision: face recognition
-
Natural Language Processing: spam filtering
-
Medicine: estimation of life after an organ transplant
-
Engineering: process simulation, prediction
-
Computer Vision: Face completion
-
Natural Language Processing: opinion detection
class: left, middle, inverse
-
.brown[Introduction]
-
.cyan[Distances]
-
.cyan[kNN]
-
Centroids
-
-
Probabilities
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
- euclidean: $d(X,Z)=\sqrt{\sum_i (x_i-z_i)^2}$
- hamming: $d(X,Z)=\sum_i 1[x_i \neq z_i]$ (number of differing components)
.tiny[https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.DistanceMetric.html]
-
k Nearest Neighbors (kNN)
-
Centroids (or Linear Classifier)
- How can we give a prediction for the following examples?
.center[
class | sep-len | sep-wid | pet-len | pet-wid |
---|---|---|---|---|
?? | 4.9 | 3.1 | 1.5 | 0.1 |
Unseen classification example on Iris]
.center[
target | density | pH | sulphates | alcohol |
---|---|---|---|---|
?? | 0.99546 | 3.29 | 0.54 | 10.1 |
Unseen regression example on WineQuality]
- Let’s begin with a representation of the problems...
- classification & regression
-
.blue[classification] example (Iris):
- distances: [0.47, 0.17, 3.66, 2.53, 6.11, 3.45]
- prediction = setosa (0.17)
-
.blue[regression] example (WineQuality):
- distances: [0.33, 1.32, 2.72, 1.71, 0.49]
- prediction = 6 (0.33)
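A minimal numpy sketch of the 1NN classification above, using the six Iris rows from the introduction; it reproduces the distance values listed for the classification example:

```python
# Minimal 1NN sketch on the six Iris rows shown in the introduction (numpy only).
import numpy as np

X = np.array([[5.1, 3.5, 1.4, 0.2],   # setosa
              [4.9, 3.0, 1.4, 0.2],   # setosa
              [6.1, 2.9, 4.7, 1.4],   # versicolor
              [5.6, 2.9, 3.6, 1.3],   # versicolor
              [7.6, 3.0, 6.6, 2.1],   # virginica
              [4.9, 2.5, 4.5, 1.7]])  # virginica
y = np.array(['setosa', 'setosa', 'versicolor',
              'versicolor', 'virginica', 'virginica'])
T = np.array([4.9, 3.1, 1.5, 0.1])    # unseen example

d = np.sqrt(((X - T) ** 2).sum(axis=1))  # euclidean distances to T
print(d.round(2))      # [0.47 0.17 3.66 2.53 6.11 3.45]
print(y[d.argmin()])   # setosa (nearest example at 0.17)
```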
-
build the set $S$ of the $k$ $y_i$'s with minimum distance to the unseen example $T$ (as in 1NN)
-
prediction: mode of $S$ (classification) or mean of $S$ (regression)
- .blue[classification]: Iris & euclidean distance
- .blue[regression]: WineQuality & euclidean distance
-
the k value is usually an odd or prime number to avoid .blue[ties]
-
a usual modification of the algorithm is to .blue[weight] points by the inverse of their distance in the mode or average functions
-
.blue[lazy learning]: it does nothing in learning step; it calculates all in classification step
-
This can produce some problems in real-time applications
-
This makes kNN one of the most useful algorithms for missing value imputation
-
-
.blue[nominal features]
-
changing the distance (e.g., hamming)
-
codifying them as numerical (to see in lab)
-
from sklearn.neighbors import KNeighborsClassifier
k = 3
clf = KNeighborsClassifier(n_neighbors=k)
from sklearn.neighbors import KNeighborsRegressor
k = 3
rgs = KNeighborsRegressor(n_neighbors=k)
clf = KNeighborsClassifier(n_neighbors=k, weights='distance')
# for weighted majority votes or average
.tiny[https://scikit-learn.org/stable/modules/neighbors.html]
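A possible end-to-end sketch on the full Iris data set (`load_iris` from `sklearn.datasets` is assumed here; class indices 0, 1, 2 stand for setosa, versicolor and virginica):

```python
# kNN usage sketch on the full Iris data set.
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, Y = load_iris(return_X_y=True)
clf = KNeighborsClassifier(n_neighbors=3, weights='distance')
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```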
class: left, middle, inverse
-
.brown[Introduction]
-
.cyan[Distances]
-
.brown[kNN]
-
.cyan[Centroids]
-
-
Probabilities
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
centroids | sep-len | sep-wid | pet-len | pet-wid |
---|---|---|---|---|
setosa | 5.0 | 3.25 | 1.4 | 0.2 |
versicolor | 5.85 | 2.9 | 4.15 | 1.35 |
virginica | 6.25 | 2.75 | 5.55 | 1.9 |
from sklearn.neighbors import NearestCentroid
clf = NearestCentroid()
.tiny[https://scikit-learn.org/stable/modules/neighbors.html#nearest-centroid-classifier]
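A small sketch, reusing the six Iris rows from the kNN example; `centroids_` reproduces the table above and the unseen example is again predicted as setosa:

```python
# Nearest-centroid sketch on the six Iris rows from the introduction.
import numpy as np
from sklearn.neighbors import NearestCentroid

X = np.array([[5.1, 3.5, 1.4, 0.2], [4.9, 3.0, 1.4, 0.2],
              [6.1, 2.9, 4.7, 1.4], [5.6, 2.9, 3.6, 1.3],
              [7.6, 3.0, 6.6, 2.1], [4.9, 2.5, 4.5, 1.7]])
y = np.array(['setosa', 'setosa', 'versicolor',
              'versicolor', 'virginica', 'virginica'])

clf = NearestCentroid()
clf.fit(X, y)
print(clf.centroids_)                        # the centroid table above
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # ['setosa']
```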
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.cyan[Probabilities]
-
.cyan[Naïve Bayes]
-
LDA
-
Logistic Regression
-
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
- Induction principle: .blue[probabilities]
class | cap-shape | cap-color | gill-size | gill-color |
---|---|---|---|---|
poisonous | convex | brown | narrow | black |
edible | convex | yellow | broad | black |
edible | bell | white | broad | brown |
poisonous | convex | white | narrow | brown |
edible | convex | yellow | broad | brown |
edible | bell | white | broad | brown |
poisonous | convex | white | narrow | pink |
.center[up to 8124 examples & 22 attributes .red[*]]
- What is .blue[$P(poisonous)$]?
.footnote[.red[*] Source : Mushroom problem from UCI repository (Frank & Asunción, 2010)]
- In most cases we estimate it from data (.blue[maximum likelihood estimation]): in the table above, $P(poisonous) = 3/7 \approx 0.429$
- How can we give a prediction from probabilities for the next example?
class | cap-shape | cap-color | gill-size | gill-color |
---|---|---|---|---|
?? | convex | brown | narrow | black |
-
Some algorithms:
-
Naïve Bayes
-
LDA (Linear Discriminant Analysis)
-
Logistic regression
-
.col5050[ .col1[
class | $P(y)$ |
---|---|
poisonous | 0.429 |
edible | 0.571 |
]
.col2[
attr:value | poisonous | edible |
:--- | ---: | ---: |
cap-shape:convex | 1 | 0.5 |
cap-shape:bell | 0 | 0.5 |
cap-color:brown | 0.33 | 0 |
cap-color:yellow | 0 | 0.5 |
cap-color:white | 0.67 | 0.5 |
gill-size:narrow | 1 | 0 |
gill-size:broad | 0 | 1 |
gill-color:black | 0.33 | 0.25 |
gill-color:brown | 0.33 | 0.75 |
gill-color:pink | 0.33 | 0 |
]
]
- Test example $T$:

class | cap-shape | cap-color | gill-size | gill-color |
---|---|---|---|---|
?? | convex | brown | narrow | black |

- Numbers:

$$P(poisonous|T) = 0.429 \cdot 1 \cdot 0.33 \cdot 1 \cdot 0.33 = 0.047$$
$$P(edible|T) = 0.571 \cdot 0.5 \cdot 0 \cdot 0 \cdot 0.25 = 0$$

- Prediction:

$$h(T) = poisonous$$
-
It needs a smoothing technique to avoid zero counts
- Example: Laplace
$$P(x_i|y)\approx\frac{N(x_i|y)+1}{N(y)+N}$$
-
It assumes conditional independence between every pair of features
-
It is empirically a decent classifier but a bad estimator
- This means that $P(y|T)$ is not a good probability
.blue[Gaussian Naïve Bayes] is an implementation that assumes a Gaussian distribution of the features:
.blue[Classification]:
from sklearn.naive_bayes import GaussianNB
clf = GaussianNB()
It has no .blue[parameters]
.blue[User guide]:
.tiny[https://scikit-learn.org/stable/modules/naive_bayes.html]
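A possible usage sketch on Iris, whose continuous features suit the Gaussian assumption (`load_iris` is assumed):

```python
# Gaussian Naive Bayes usage sketch on Iris.
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, Y = load_iris(return_X_y=True)
clf = GaussianNB()
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))                # expected: [0] (setosa)
print(clf.predict_proba([[4.9, 3.1, 1.5, 0.1]]).round(3)) # class probabilities
```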
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.cyan[Probabilities]
-
.brown[Naïve Bayes]
-
.cyan[LDA]
-
Logistic Regression
-
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
also from Bayes rule & Gaussian distributions:
where
.tiny[.red[Source]: https://sebastianraschka.com/faq/docs/lda-vs-pca.html]
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
Xpca = pca.fit(X).transform(X)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
Xlda = lda.fit(X, Y).transform(X)
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
clf = LinearDiscriminantAnalysis(shrinkage='auto',solver='lsqr')
.tiny[https://scikit-learn.org/stable/modules/lda_qda.html]
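A possible usage sketch of LDA as a classifier on Iris (`load_iris` is assumed; the parameters are the ones shown above):

```python
# LDA as a classifier on Iris.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, Y = load_iris(return_X_y=True)
clf = LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr')
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```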
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.cyan[Probabilities]
-
.brown[Naïve Bayes]
-
.brown[LDA]
-
.cyan[Logistic Regression]
-
-
Rules
-
Hyperplanes
-
Learning Theory
-
References
-
Also known as Maximum Entropy
-
Regression of the probability: $P(y=1|X)=\sigma(w\cdot X+b)$, where $\sigma(z)=1/(1+e^{-z})$ is the logistic function
-
.blue[Binary Classification]:
.tiny[.red[Source]: https://en.wikipedia.org/wiki/Logistic_function]
from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
solver = default 'lbfgs'
max_iter = default 100
.tiny[https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression]
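A possible usage sketch on Iris (`load_iris` is assumed; `max_iter` is raised a bit so the default lbfgs solver converges on the raw features):

```python
# Logistic regression usage sketch on Iris (multiclass handled automatically).
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, Y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=200)
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))                # expected: [0] (setosa)
print(clf.predict_proba([[4.9, 3.1, 1.5, 0.1]]).round(3)) # class probabilities
```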
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.cyan[Rules]
-
.cyan[Decision Trees]
-
Ensembles
-
-
Hyperplanes
-
Learning Theory
-
References
- Induction principle: .blue[rules]
class | cap-shape | cap-color | gill-color |
---|---|---|---|
poisonous | convex | brown | black |
edible | convex | yellow | black |
edible | bell | white | brown |
poisonous | convex | white | brown |
edible | convex | yellow | brown |
edible | bell | white | brown |
poisonous | convex | white | pink |
.center[up to 8124 examples & 22 attributes.red[*]]
- Which rules can be extracted from data?
.blue[$$\text{gill-color}=\text{pink}\Longrightarrow\text{poisonous}$$]
.footnote[.red[*] .red[Source]: ``mushroom'' problem from UCI Repository (Frank & Asunción, 2010)]
.blue[Resulting tree]:
.blue[Classification]: exploring the tree using test
class | cap-shape | cap-color | gill-color |
---|---|---|---|
?? | convex | brown | black |
.center[.blue[prediction: poisonous]]
.blue[Learn]:
build a tree by recursively splitting on one of the attributes with the highest accuracy
.blue[Step 1]: every attribute is evaluated.red[*]
.footnote[.red[*] the number of examples with the mode is assigned to each value]
.blue[Step 2]: one of the best attributes is selected as a node of the tree:
.blue[Step 3]: for every value with only a class a leaf is created:
.blue[Step 4]: a new set is built for the rest of values
.center[white examples without cap-color]
class | cap-shape | gill-color |
---|---|---|
edible | bell | brown |
poisonous | convex | brown |
edible | bell | brown |
poisonous | convex | pink |
.blue[Step 5]: the algorithm restarts with the previous set:
.blue[Step 6]: the algorithm ends when no attributes are left or full accuracy is achieved
- What about .blue[numerical attributes]?
class | length | width |
---|---|---|
versicolor | 6.1 | 2.9 |
versicolor | 5.6 | 2.9 |
virginica | 7.6 | 3.0 |
virginica | 4.9 | 2.5 |
- .blue[Cutting points] for width attribute
class | width | cutting points | accuracy |
---|---|---|---|
virginica | 2.5 | ||
versicolor | 2.9 | 2.7 | |
versicolor | 2.9 | ||
virginica | 3.0 | 2.95 |
Entropy is a measure of the amount of uncertainty in the data:
$$H(S) = -\sum_i p_i \log_2 p_i$$
Information gain is a measure of the difference of entropy before and after splitting on an attribute $A$:
$$IG(S,A) = H(S) - \sum_{v \in values(A)} \frac{\vert S_v \vert}{\vert S \vert} H(S_v)$$
- Gini impurity is a measure of how often a randomly chosen element from the set would be incorrectly labeled:
$$G(S) = 1 - \sum_i p_i^2$$
.center[It is used instead of entropy]
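A minimal sketch of both impurity measures, applied to the class column of the seven mushroom rows above (3 poisonous, 4 edible):

```python
# Entropy and Gini impurity for a list of class labels (numpy only).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

y = ['poisonous'] * 3 + ['edible'] * 4   # class column of the mushroom sample
print(round(entropy(y), 3))   # 0.985
print(round(gini(y), 3))      # 0.49
```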
-
It allows regression
-
Example:
.center[.tiny[https://sefiks.com/2018/08/27/a-step-by-step-cart-decision-tree-example/]]
.center[It also contains a regression example]
Models look like.red[*]:
-
The resulting rules are very easy for humans to understand
-
Normalization does not affect trees
-
Complex and big trees tend to overfit (they do not generalize well)
-
Small changes in the data may produce very different trees
.footnote[.red[*] .red[Source]: https://scikit-learn.org/stable/modules/tree.html]
from sklearn.tree import DecisionTreeClassifier
clf = DecisionTreeClassifier()
from sklearn.tree import DecisionTreeRegressor
rgs = DecisionTreeRegressor()
criterion = 'entropy' or 'gini'
max_depth = int or None
.tiny[https://scikit-learn.org/stable/modules/tree.html]
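A possible usage sketch on Iris that also prints the learned rules with `export_text` (assumed from `sklearn.tree`):

```python
# Decision tree usage sketch on Iris, printing the learned rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, Y = load_iris(return_X_y=True)
clf = DecisionTreeClassifier(criterion='entropy', max_depth=3)
clf.fit(X, Y)
print(export_text(clf, feature_names=load_iris().feature_names))
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```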
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.cyan[Rules]
-
.brown[Decision Trees]
-
.cyan[Ensembles]
-
-
Hyperplanes
-
Learning Theory
-
References
-
Ensembles: combination of classifiers to improve generalization
-
Meta-learners
-
Two big families:
-
.blue[Averaging]:
average of predictions;
improves by reducing variance;
examples: Bagging & Random Forests
-
.blue[Boosting]:
incrementally emphasizes errors;
improves by reducing bias;
examples: AdaBoost & Gradient Boosting
-
-
Any algorithm can be used as the base estimator, but Decision Trees are the most used
-
User Guide:
https://scikit-learn.org/stable/modules/ensemble.html
-
Build a collection of data sets by randomly sampling with replacement from the original data
-
An estimator is built for each of the previous sets
-
The prediction is obtained by averaging the predictions of the previous estimators
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
bagging = BaggingClassifier(KNeighborsClassifier())
...
# In a similar way with BaggingRegressor
.tiny[https://scikit-learn.org/stable/modules/ensemble.html#bagging]
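A possible end-to-end sketch on Iris, scoring the bagged kNN with 5-fold cross-validation (`load_iris` and `cross_val_score` are assumed):

```python
# Bagging usage sketch: kNN base estimators evaluated with cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_iris(return_X_y=True)
bagging = BaggingClassifier(KNeighborsClassifier(), n_estimators=10)
print(cross_val_score(bagging, X, Y, cv=5).mean())   # mean accuracy
```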
It is a variant of bagging with decision trees as the base estimator
Nodes in the decision trees are selected among a random subset of the features
from sklearn.ensemble import RandomForestClassifier
clf = RandomForestClassifier(n_estimators=10)
...
# In a similar way with RandomForestRegressor
https://scikit-learn.org/stable/modules/ensemble.html#forest
-
Learn a weak classifier at each iteration (on a reweighted sample distribution)
-
At each iteration, the weights of misclassified examples are increased and the rest are decreased
.footnote[.red[*] .red[Source]: https://www.cs.cmu.edu/~aarti/Class/10701/slides/Lecture10.pdf]
from sklearn.ensemble import AdaBoostClassifier
clf = AdaBoostClassifier()
...
# In a similar way with AdaBoostRegressor
Parameters:
n_estimators=10
User Guide:
https://scikit-learn.org/stable/modules/ensemble.html#adaboost
Generalization of boosting obtained by optimizing arbitrary loss functions with gradient descent
.tiny[Instead of training on a newly sample distribution, the weak learner trains on the remaining errors of the strong learner.]
from sklearn.ensemble import GradientBoostingClassifier
clf = GradientBoostingClassifier()
...
# In a similar way with GradientBoostingRegressor
Parameters:
n_estimators=10
max_depth=1
User Guide:
https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting
XGBoost: another implementation
https://pandas-ml.readthedocs.io/en/latest/xgboost.html
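A rough comparison sketch of both boosting flavours on Iris (assumed setup; scores depend on the chosen parameters):

```python
# Comparing AdaBoost and Gradient Boosting with 5-fold cross-validation.
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score

X, Y = load_iris(return_X_y=True)
for clf in (AdaBoostClassifier(n_estimators=10),
            GradientBoostingClassifier(n_estimators=10, max_depth=1)):
    print(clf.__class__.__name__,
          cross_val_score(clf, X, Y, cv=5).mean().round(3))
```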
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.cyan[Hyperplanes]
-
.cyan[Kernels]
-
SVM
-
Neural Networks
-
-
Learning Theory
-
References
.cols5050[ .col1[
where:
.center[.tiny[https://sv.wikipedia.org/wiki/Fil:Scalar-product-dot-product.svg]]
]]
The next data set is .blue[not linearly separable]
.blue[Kernels] try to convert data sets into linearly separable ones through projections
Given a projection function $\phi(X)$,
a .blue[kernel] is defined as $K(X,Z)=\langle\phi(X),\phi(Z)\rangle$
.footnote[.red[*] .red[source]: https://en.wikipedia.org/wiki/File:Kernel_Machine.svg]
Kernel matrix: $K_{ij}=K(x_i,x_j)$, for every pair of training examples
Usual Kernels.red[*]
-
.blue[linear]: $\langle X,Z\rangle$
-
.blue[polynomial]: $(\gamma\langle X,Z\rangle+r)^d$
-
.blue[rbf] (radial basis function): $\exp(-\gamma\Vert X-Z\Vert^2)$
.footnote[.red[*] in sklearn: https://scikit-learn.org/stable/modules/svm.html#svm-kernels]
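A minimal numpy sketch of the rbf kernel, checked against sklearn's `rbf_kernel` (the three rows and the gamma value are arbitrary):

```python
# The rbf kernel matrix computed explicitly and checked against sklearn.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.array([[5.1, 3.5, 1.4, 0.2],
              [6.1, 2.9, 4.7, 1.4],
              [7.6, 3.0, 6.6, 2.1]])
gamma = 0.1

# K[i, j] = exp(-gamma * ||x_i - x_j||^2)
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
K = np.exp(-gamma * sq_dists)
print(np.allclose(K, rbf_kernel(X, gamma=gamma)))   # True
```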
Prediction function:
-
Kernels have been applied to many algorithms:
-
Centroids
-
PCA
-
k-means
-
SVMs
-
-
Kernels can be adapted to the problem as another way to represent data
- There are many kernels for structured data: trees, graphs, sets...
Example: kernel for sets
Representation of the previous not linearly separable data set:
from sklearn.decomposition import KernelPCA
kpca = KernelPCA(kernel='rbf', gamma=1)
https://scikit-learn.org/stable/modules/decomposition.html#kernel-pca
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.cyan[Hyperplanes]
-
.brown[Kernels]
-
.cyan[SVM]
-
Neural Networks
-
-
Learning Theory
-
References
Those that maximize the .blue[margin].
Those nearest the margin.
- Linear: $h(T)=\mathrm{sign}(\langle w,T\rangle + b)$
- General kernel: $h(T)=\mathrm{sign}\left(\sum_i \alpha_i y_i K(x_i,T) + b\right)$
This is called .blue[soft margin].
- SVMs support kernels.red[*]:
- It also support .blue[custom kernels].
.footnote[.red[*] .red[Source]: https://scikit-learn.org/stable/modules/svm.html#svm-classification]
- Which is the model for .blue[regression]?
- It has an additional parameter: the $\varepsilon$-tube
.footnote[.red[*] .red[Source]: https://www.saedsayad.com/support_vector_machine_reg.htm]
.cols5050[ .col1[
from sklearn.svm import SVC
clf = SVC()
]
.col2[
from sklearn.svm import SVR
rgs = SVR()
]]
kernel = 'linear', 'poly', 'rbf', 'precomputed'...
degree = 2, 3...
gamma = 'scale', 1, 0.1, 10...
C = 1, 10, 0.1 # penalty of soft margin
epsilon = 0.1
max_iter = -1, 1000...
https://scikit-learn.org/stable/modules/svm.html#svm-classification
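A possible usage sketch of an rbf-kernel SVC on Iris (`load_iris` is assumed; the parameter values are illustrative):

```python
# SVC usage sketch on Iris with an rbf kernel.
from sklearn.datasets import load_iris
from sklearn.svm import SVC

X, Y = load_iris(return_X_y=True)
clf = SVC(kernel='rbf', C=1, gamma='scale')
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
print(clf.support_vectors_.shape)            # number of support vectors found
```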
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.cyan[Hyperplanes]
-
.brown[Kernels]
-
.brown[SVM]
-
.cyan[Neural Networks]
-
-
Learning Theory
-
References
.footnote[Source: Artificial Neuron]
.cols5050[ .col1[
-
Classification and regression
-
Linear model
-
Classification: $h(x) = \mathrm{sign}(\langle w, x\rangle + b)$
- Learning rule: $w \leftarrow w + \eta\,(y - h(x))\,x$
.tiny[.red[*] Source: wikipedia]
]]
from sklearn.linear_model import Perceptron
clf = Perceptron()
max_iter: default=1000
.tiny[https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Perceptron.html]
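A minimal numpy sketch of the perceptron learning rule on a toy linearly separable problem with labels in $\{-1,+1\}$ (data and learning rate are arbitrary):

```python
# Perceptron learning rule on a toy 2D problem (numpy only).
import numpy as np

X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
eta, w, b = 0.1, np.zeros(2), 0.0

for epoch in range(10):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:   # misclassified (or on the boundary)
            w += eta * yi * xi       # learning rule: move w towards y * x
            b += eta * yi
print(w, b)
print(np.sign(X @ w + b))            # [ 1.  1. -1. -1.]
```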
.col5050[ .col1[
-
Hidden layers
-
Non-linear model
-
Classification & regression
-
Forward propagation of perceptrons
-
Backpropagation as training algorithm
Gradient descent (optimization)
]]
.footnote[Source: wikipedia]
-
Regression: mean squared error or root mean squared error
-
Binary classification: binary cross entropy
-
Multiclass classification: categorical cross entropy
-
Hidden units: ReLU, $f(x)=\max(0,x)$
-
Output units:
- Regression: linear, $f(x) = x$
- Binary classification: sigmoid, $f(x) = 1 / (1 + e^{-x})$
- Multiclass classification: softmax (one unit per class); prediction is the maximum value after normalizing the outputs as a distribution
-
from sklearn.neural_network import MLPClassifier
clf = MLPClassifier(hidden_layer_sizes=(25,))
from sklearn.neural_network import MLPRegressor
rgs = MLPRegressor(hidden_layer_sizes=(25,))
# number of units per hidden layer
hidden_layer_sizes: default (100,)
# training examples per step
batch_size: default=min(200, n_samples)
# epochs: training of all training examples one time
max_iter: default=200
# learning rate (eta)
learning_rate_init: default=0.001
...
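A possible usage sketch of a one-hidden-layer MLP on Iris (`load_iris` is assumed; `max_iter` and `random_state` are illustrative choices):

```python
# MLP usage sketch on Iris with one hidden layer of 25 units.
from sklearn.datasets import load_iris
from sklearn.neural_network import MLPClassifier

X, Y = load_iris(return_X_y=True)
clf = MLPClassifier(hidden_layer_sizes=(25,), max_iter=1000,
                    learning_rate_init=0.001, random_state=1)
clf.fit(X, Y)
print(clf.predict([[4.9, 3.1, 1.5, 0.1]]))   # expected: [0] (setosa)
```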
-
Experimentally adjusted:
- low values $\rightarrow$ good performance, high computational time
- high values $\rightarrow$ erratic (jumping) search, low computational time
- trade-off (set experimentally)
-
It can be adaptive
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.brown[Hyperplanes]
-
.cyan[Learning Theory]
-
References
- if 11 red points are the available data set:
-
it can be approximated by:
- a 10-degree polynomial: $R^2=1.0$
- a straight line: $R^2=0.122$
-
What happens at $x=5.5$?
.footnote[.red[Source]: https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577]
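A toy sketch of the same effect with numpy, assuming 11 points generated from a noisy straight line (not the original figure's data):

```python
# Over- vs underfitting sketch: degree-10 interpolation vs a straight line.
import numpy as np

x = np.arange(11, dtype=float)                    # x = 0..10
rng = np.random.default_rng(0)
y = 0.5 * x + rng.normal(scale=1.0, size=x.size)  # linear trend + noise

p10 = np.polyfit(x, y, deg=10)   # interpolates all 11 points (R^2 = 1)
p1 = np.polyfit(x, y, deg=1)     # straight-line fit

# Both fit the training points well but can disagree in between, e.g. at x = 5.5
print(np.polyval(p10, 5.5), np.polyval(p1, 5.5))
```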
.col2[
-
trade-off
-
overfitting == low bias, high variance
-
underfitting == high bias, low variance
-
noise dominates ]]
.footnote[.red[Source]: https://towardsdatascience.com/regularization-the-path-to-bias-variance-trade-off-b7a7088b4577]
.cols5050[ .col1[
-
Symptoms: Training error too high
-
Causes:
- model too simple
- not enough training
-
Solutions:
- increase model complexity
- train longer ] .col2[
-
Symptoms: training error low but test error high
-
Causes:
- model too complex
- too much training
- training set too small
-
Solutions:
- reduce model complexity
- stop training (early stopping)
- get more training data / data augmentation
]]
-
.blue[Occam's razor in learning]:
simpler models are more likely to be correct than complex ones
-
.blue[No free lunch theorem]:
there is no method which outperforms all others for all data sets
-
.blue[Curse of dimensionality]:
when the dimensionality increases, the amount of data needed to support the result often grows exponentially
class: left, middle, inverse
-
.brown[Introduction]
-
.brown[Distances]
-
.brown[Probabilities]
-
.brown[Rules]
-
.brown[Hyperplanes]
-
.brown[Learning Theory]
-
.cyan[References]
-
Aurélien Géron. Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow, 2nd Edition. O'Reilly, 2019.
-
Samir Kanaan. Neural Networks and Deep Learning. Improving your Neural Networks, 2020.
-
Gerard Escudero. Machine Learning for Games, 2019. (url)
-
UCI Machine Learning Repository (url)
-
jupyter: interactive computing (url)
-
pandas: python data analysis library (url)
-
scikit-learn: machine learning in python (url)
-
pandas-ml: pandas machine learning (url)
-
tutorial markdown: lightweight syntax for writing (url)