---
layout: post
title: "GBO notes: Machine learning basics (Part 1)"
tags: []
mathjax: true
---
In this series of notes we will review some basic concepts that are usually covered in an Intro to ML course. These are based on this course from Cornell.
In Part 1, we will look at the basic problem of supervised learning, simple classifiers such as k-NN and perceptron, and MLE and MAP estimators.
In supervised learning, we are given a dataset $D = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where the $\mathbf{x}_i \in \mathbb{R}^d$ are inputs and the $y_i$ are labels, and we want to learn a function $h$ such that $h(\mathbf{x}) \approx y$ on new, unseen examples.
We usually have to make some assumption about the type of function $h$ we are looking for, i.e., we choose a hypothesis class $\mathcal{H}$ (for example, linear classifiers or decision trees).
Once we have selected the hypothesis class, we find the best function $h \in \mathcal{H}$ by minimizing a loss function (such as the zero-one or squared loss) on the training data.
To ensure that we are generalizing and not just memorizing the training data, we keep a held-out validation set on which we periodically measure the loss while training. The loss on held-out data is an unbiased estimate of the true expected loss, and by the weak law of large numbers it concentrates around that expected loss as the held-out set grows.
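As a concrete illustration, here is a minimal sketch of holding out a validation set and estimating the error on it (the `train_val_split` helper and the zero-one loss are my own choices for illustration, not something prescribed by the course):

```python
import numpy as np

def train_val_split(X, y, val_frac=0.2, seed=0):
    """Randomly hold out a fraction of the data for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_val = int(len(X) * val_frac)
    return X[idx[n_val:]], y[idx[n_val:]], X[idx[:n_val]], y[idx[:n_val]]

# Toy data: the label depends only on the sign of the first feature.
X = np.random.randn(1000, 5)
y = (X[:, 0] > 0).astype(int)
X_tr, y_tr, X_val, y_val = train_val_split(X, y)
# After training any classifier h on (X_tr, y_tr), its held-out zero-one loss is:
#   val_error = np.mean(h(X_val) != y_val)
```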
k-nearest neighbors (k-NN) is a non-parametric classification method. The idea is that given a test input $\mathbf{x}$, we find its $k$ nearest neighbors in the training set (under some distance metric, usually Euclidean) and predict the label by a majority vote among their labels.
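A minimal NumPy sketch of this prediction rule, using Euclidean distance and a majority vote (the function and variable names are mine, for illustration only):

```python
import numpy as np

def knn_predict(X_train, y_train, x_test, k=3):
    """Predict the label of x_test by majority vote among its k nearest neighbors."""
    # Euclidean distances from the test point to every training point.
    dists = np.linalg.norm(X_train - x_test, axis=1)
    # Indices of the k closest training points.
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels.
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]

# Toy usage: two clusters in 2D.
X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.95, 0.9]), k=3))  # -> 1
```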
Theorem: For large $n$ (i.e., as the number of training points goes to infinity), the error of the 1-NN classifier is at most twice the error of the Bayes optimal classifier.
The proof uses the fact that for large $n$, the nearest neighbor of a test point lies arbitrarily close to it, and therefore shares (approximately) the same conditional label distribution $P(y|\mathbf{x})$. This assumption breaks down in high dimensions: pairwise distances between points grow with the dimensionality (the curse of dimensionality), so the "nearest" neighbor is no longer actually near.
However, if we consider a hyperplane in the high-dimensional space, the distance of points to the hyperplane does not change regardless of the dimensionality! This means that in high dimensions, distances between points are large but distances to hyperplanes are tiny. Classifiers such as SVMs and perceptrons use this idea.
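A quick numerical check of this contrast (a small experiment written for illustration; the sample sizes are arbitrary): sample points uniformly in the unit hypercube and compare the typical distance between two points with the distance to a fixed hyperplane through the center.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(n, d))                    # points in the unit hypercube
    # Typical distance between two randomly chosen points.
    i, j = rng.integers(n, size=(2, 5000))
    avg_pair = np.linalg.norm(X[i] - X[j], axis=1).mean()
    # Distance to the hyperplane x_1 = 0.5 (unit normal along the first axis).
    avg_plane = np.abs(X[:, 0] - 0.5).mean()
    print(f"d={d:4d}  pairwise ~ {avg_pair:.2f}  to hyperplane ~ {avg_plane:.2f}")
```

The pairwise distance grows roughly like $\sqrt{d/6}$, while the distance to the hyperplane stays around $0.25$ no matter what $d$ is.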
The perceptron is used for binary classification and assumes that the data is linearly separable, i.e.,
separable by a hyperplane. The hypothesis class is defined as $h(\mathbf{x}) = \mathrm{sign}(\mathbf{w}^{\top}\mathbf{x} + b)$, where the bias $b$ can be absorbed into $\mathbf{w}$ by appending a constant 1 to every input.
To learn $\mathbf{w}$, we start from $\mathbf{w} = \mathbf{0}$ and loop over the training points: whenever a point $(\mathbf{x}, y)$ with $y \in \{-1, +1\}$ is misclassified, i.e., $y(\mathbf{w}^{\top}\mathbf{x}) \le 0$, we update $\mathbf{w} \leftarrow \mathbf{w} + y\mathbf{x}$, and we repeat until no point is misclassified.
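A minimal sketch of this training loop (labels assumed to be in $\{-1, +1\}$; the bias is absorbed by appending a constant feature; the names are mine):

```python
import numpy as np

def perceptron_train(X, y, max_epochs=100):
    """Perceptron: add y*x whenever a point is misclassified; stop when all are correct."""
    X = np.hstack([X, np.ones((len(X), 1))])   # absorb the bias into w
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x_i, y_i in zip(X, y):
            if y_i * (w @ x_i) <= 0:           # misclassified (or on the boundary)
                w += y_i * x_i
                mistakes += 1
        if mistakes == 0:                      # converged: all points classified correctly
            break
    return w

# Toy usage on linearly separable data.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = perceptron_train(X, y)
preds = np.sign(np.hstack([X, np.ones((4, 1))]) @ w)
print(preds)  # -> [ 1.  1. -1. -1.]
```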
Theorem: Suppose there is a hyperplane separating the points in a unit-normalized hypersphere (i.e., $\|\mathbf{x}_i\| \le 1$ for all $i$),
defined by the unit normal $\mathbf{w}^*$, and let $\gamma = \min_i y_i (\mathbf{w}^{*\top}\mathbf{x}_i)$ be the margin. Then the perceptron algorithm makes at most $1/\gamma^2$ updates before it converges.
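A brief sketch of why this bound holds (the standard argument; $\mathbf{w}_t$ denotes the weight vector after $t$ updates, starting from $\mathbf{w}_0 = \mathbf{0}$): each update on a misclassified point $(\mathbf{x}, y)$ increases the alignment with $\mathbf{w}^*$ by at least $\gamma$, while the norm of $\mathbf{w}$ grows slowly:

$$\mathbf{w}^{*\top}\mathbf{w}_{t+1} = \mathbf{w}^{*\top}(\mathbf{w}_t + y\mathbf{x}) \ge \mathbf{w}^{*\top}\mathbf{w}_t + \gamma, \qquad \|\mathbf{w}_{t+1}\|^2 = \|\mathbf{w}_t\|^2 + 2y(\mathbf{w}_t^{\top}\mathbf{x}) + \|\mathbf{x}\|^2 \le \|\mathbf{w}_t\|^2 + 1.$$

After $M$ updates, $M\gamma \le \mathbf{w}^{*\top}\mathbf{w}_M \le \|\mathbf{w}_M\| \le \sqrt{M}$, which gives $M \le 1/\gamma^2$.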
Maximum likelihood estimate (MLE): We first assume some distribution from which the data was sampled, and then compute the parameters of the distribution which maximize the probability of generating the data, i.e., $\hat{\theta}_{\mathrm{MLE}} = \textrm{arg}\max_{\theta} P(D;\theta)$.
Example: for estimating the probability of heads in a coin toss given some observations,
we can assume that the outcomes come from a Bernoulli distribution with parameter $\theta$ (the probability of heads). If we observe $n_H$ heads and $n_T$ tails, the MLE is $\hat{\theta} = \frac{n_H}{n_H + n_T}$.
However, if we have a small number of observations, this method of estimating $\theta$ can be badly off: for instance, if we toss the coin twice and see two heads, the MLE is $\hat{\theta} = 1$.
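A tiny sketch of the MLE computation for the coin example (illustrative only; tosses are encoded as 1 for heads and 0 for tails):

```python
import numpy as np

def bernoulli_mle(tosses):
    """MLE for P(head): the fraction of heads among the observed tosses."""
    return np.asarray(tosses).mean()

print(bernoulli_mle([1, 1, 0, 1]))   # n_H=3, n_T=1 -> 0.75
print(bernoulli_mle([1, 1]))         # two heads only -> 1.0, an overconfident estimate
```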
Maximum a Posteriori (MAP): Instead of treating $\theta$ as a fixed parameter, we place a prior $P(\theta)$ on it and pick the value that is most probable given the data. The MAP estimate is given as $\hat{\theta}_{\mathrm{MAP}} = \textrm{arg}\max_{\theta} P(\theta|D) = \textrm{arg}\max_{\theta} P(D|\theta)P(\theta)$. In the coin example, a Beta prior acts like pseudo-counts of heads and tails and smooths the estimate when we have few observations.
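Continuing the coin example, here is a sketch of the MAP estimate under a Beta$(\alpha, \beta)$ prior; the closed form comes from maximizing the Beta-Bernoulli posterior, and the default hyperparameter values below are arbitrary choices for illustration:

```python
import numpy as np

def bernoulli_map(tosses, alpha=2.0, beta=2.0):
    """MAP for P(head) with a Beta(alpha, beta) prior:
    (n_H + alpha - 1) / (n + alpha + beta - 2)."""
    tosses = np.asarray(tosses)
    n_h, n = tosses.sum(), len(tosses)
    return (n_h + alpha - 1) / (n + alpha + beta - 2)

print(bernoulli_map([1, 1]))   # -> 0.75 instead of the MLE's 1.0
```

With $\alpha = \beta = 2$ the prior acts like one extra head and one extra tail, so two observed heads give $0.75$ instead of the MLE's $1.0$.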
Quick summary:
- MLE: prediction $P(y|\mathbf{x};\theta)$; learning $\theta = \textrm{arg}\max_{\theta}P(D;\theta)$. Here $\theta$ is purely a model parameter.
- MAP: prediction $P(y|\mathbf{x},\theta)$; learning $\theta = \textrm{arg}\max_{\theta}P(\theta|D) = \textrm{arg}\max_{\theta} P(D|\theta)P(\theta)$. Here $\theta$ is a random variable.