---
layout: post
title: "GBO notes: Machine learning basics (Part 4)"
tags:
mathjax: true
---
In this series of notes we will review some basic concepts that are usually covered in an Intro to ML course. These are based on this course from Cornell.
In Part 4, we will look at kernels, including kernel SVMs, and Gaussian processes.
How can we use linear classifiers (like perceptron, SVM, logistic regression, etc.) when the data is not linearly separable? One way might be to increase the data dimensionality by adding features that are non-linear combinations of other features. But the problem with this is that the data may become too high-dimensional, which would make the learning algorithm very slow.
This is where the kernel trick comes in. Consider learning a linear classifier with gradient
descent. Since the update at each step is just a linear combination of the inputs, we can
show by induction that the learned weight vector always lies in the span of the inputs, i.e.,

$$\mathbf{w} = \sum_{i=1}^{n} \alpha_i \mathbf{x}_i$$

for some coefficients $\alpha_1, \ldots, \alpha_n$. Every dot product $\mathbf{w}^T\mathbf{x}$ can therefore be written purely in terms of pairwise dot products between inputs.
A kernel function is defined as the inner product of the high-dimensional transformations of two vectors:

$$K(\mathbf{x},\mathbf{z}) = \phi(\mathbf{x})^T\phi(\mathbf{z}).$$

Usually, all pairwise kernel values are pre-computed and stored in a kernel matrix $\mathbf{K} \in \mathbb{R}^{n\times n}$ with entries $K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j)$.

So what exactly is a kernel function? A kernel function must correspond to real inner products
after some transformation $\phi$. Equivalently, the kernel matrix it induces on any set of points must be symmetric and positive semi-definite (Mercer's condition).
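As a quick sanity check of this definition, here is a minimal NumPy sketch (the helper names `phi` and `poly_kernel` are mine) showing that an explicit degree-2 polynomial feature map and the corresponding kernel $(1+\mathbf{x}^T\mathbf{z})^2$ produce the same kernel matrix, while the kernel never forms the high-dimensional features:

```python
import numpy as np

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel (1 + x.z)^2."""
    # [1] + [sqrt(2) * x_i] + [x_i * x_j for all ordered pairs (i, j)]
    return np.concatenate(([1.0], np.sqrt(2) * x, np.outer(x, x).ravel()))

def poly_kernel(x, z, d=2):
    """The same quantity computed directly -- no high-dimensional phi needed."""
    return (1.0 + x @ z) ** d

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))  # 5 points in 3 dimensions

K_explicit = np.array([[phi(a) @ phi(b) for b in X] for a in X])
K_trick = np.array([[poly_kernel(a, b) for b in X] for a in X])
assert np.allclose(K_explicit, K_trick)
```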
Some popular kernels:

- Linear: $K(\mathbf{x},\mathbf{z}) = \mathbf{x}^T\mathbf{z}$
- Polynomial: $K(\mathbf{x},\mathbf{z}) = (1+\mathbf{x}^T\mathbf{z})^d$
- Gaussian: $K(\mathbf{x},\mathbf{z}) = e^{-\frac{\Vert\mathbf{x}-\mathbf{z}\Vert^2}{\sigma^2}}$
- Exponential: $K(\mathbf{x},\mathbf{z}) = e^{-\frac{\Vert\mathbf{x}-\mathbf{z}\Vert}{2\sigma^2}}$
- Laplacian: $K(\mathbf{x},\mathbf{z}) = e^{-\frac{\Vert\mathbf{x}-\mathbf{z}\Vert}{\sigma}}$
- Sigmoid: $K(\mathbf{x},\mathbf{z}) = \mathrm{tanh}(a\mathbf{x}^T\mathbf{z}+c)$
In general, we can combine these simple kernels using a set of closure rules (addition, multiplication by a positive scalar, products of kernels, composition with certain functions, etc.) to obtain new well-defined kernels.
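To make these rules concrete, the following sketch (plain NumPy; `gram` and `is_psd` are my own helpers) builds Gram matrices for a linear and a Gaussian kernel and checks that their sum and a positive scaling are still symmetric positive semi-definite, i.e., still valid kernels:

```python
import numpy as np

def gram(kernel, X):
    """Pairwise kernel (Gram) matrix over the rows of X."""
    return np.array([[kernel(a, b) for b in X] for a in X])

def is_psd(K, tol=1e-8):
    # A valid kernel matrix is symmetric with non-negative eigenvalues.
    return np.allclose(K, K.T) and np.all(np.linalg.eigvalsh(K) >= -tol)

linear = lambda x, z: x @ z
gaussian = lambda x, z, sigma=1.0: np.exp(-np.sum((x - z) ** 2) / sigma**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))

K_lin, K_rbf = gram(linear, X), gram(gaussian, X)
# Sums and positive scalings of valid kernels are again valid kernels.
for K in (K_lin, K_rbf, K_lin + K_rbf, 3.0 * K_rbf):
    assert is_psd(K)
```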
How to kernelize an algorithm?
- Prove that the solution is spanned by the input vectors.
- Rewrite updates/algorithm to only use dot products of inputs.
- Define some kernel function $k$ and substitute dot products with $k(\mathbf{x}_i,\mathbf{x}_j)$.
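As a worked instance of this recipe, here is a minimal kernelized perceptron in NumPy (the names are my own): the weight vector is never formed explicitly; we keep one coefficient $\alpha_i$ per training point and classify with $\hat{y} = \mathrm{sign}\left(\sum_j \alpha_j y_j k(\mathbf{x}_j, \mathbf{x})\right)$.

```python
import numpy as np

def train_kernel_perceptron(X, y, kernel, epochs=10):
    """Kernel perceptron: learn one coefficient alpha_i per training point."""
    n = len(X)
    alpha = np.zeros(n)
    K = np.array([[kernel(a, b) for b in X] for a in X])  # precomputed kernel matrix
    for _ in range(epochs):
        for i in range(n):
            # Prediction uses only kernel values, never an explicit weight vector.
            if np.sign(np.sum(alpha * y * K[:, i])) != y[i]:
                alpha[i] += 1.0  # mistake-driven update, as in the perceptron
    return alpha

def predict(x, X, y, alpha, kernel):
    return np.sign(sum(a * yi * kernel(xi, x) for a, yi, xi in zip(alpha, y, X)))

# Usage on an XOR-like problem that no linear classifier can solve:
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([-1., 1., 1., -1.])
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
alpha = train_kernel_perceptron(X, y, rbf, epochs=20)
print([predict(x, X, y, alpha, rbf) for x in X])  # recovers [-1, 1, 1, -1]
```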
Let us see how to apply this idea to linear regression and SVMs.
We want to express the weight vector as a linear combination of the (transformed) inputs, $\mathbf{w} = \sum_{i=1}^{n} \alpha_i \phi(\mathbf{x}_i)$. Substituting this into the regularized least-squares objective and solving for the coefficients gives the closed-form solution

$$\boldsymbol{\alpha} = (\mathbf{K} + \lambda\mathbf{I})^{-1}\mathbf{y},$$

where we have defined the kernel matrix $\mathbf{K}$ with $K_{ij} = \phi(\mathbf{x}_i)^T\phi(\mathbf{x}_j) = k(\mathbf{x}_i,\mathbf{x}_j)$. Predictions for a new point $\mathbf{z}$ then require only kernel evaluations: $h(\mathbf{z}) = \sum_{i=1}^{n} \alpha_i k(\mathbf{x}_i, \mathbf{z})$.
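A minimal NumPy sketch of this kernelized (ridge) regression, assuming an RBF kernel and a regularization weight `lam` of my choosing:

```python
import numpy as np

def rbf(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / sigma**2)

def fit_kernel_ridge(X, y, kernel, lam=1e-3):
    """Solve alpha = (K + lam * I)^{-1} y, where K_ij = k(x_i, x_j)."""
    K = np.array([[kernel(a, b) for b in X] for a in X])
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def predict(z, X, alpha, kernel):
    """h(z) = sum_i alpha_i k(x_i, z) -- only kernel evaluations, no explicit w."""
    return sum(a * kernel(x, z) for a, x in zip(alpha, X))

# Usage: fit a clearly non-linear function with a "linear" method.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
alpha = fit_kernel_ridge(X, y, rbf)
print(predict(np.array([1.0]), X, alpha, rbf))  # close to sin(1) ~ 0.84
```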
Based on primal-dual optimization methods, the original SVM formulation can be rewritten
in a dual form, in which every constraint gets its own parameter (a Lagrange multiplier). Recall that in the primal problem,
the parameter was the weight vector $\mathbf{w}$ (along with the bias and slack variables); in the dual, we instead optimize one multiplier $\alpha_i$ per training point:
$$ \begin{array}{ll} \min_{\alpha_{1}, \ldots, \alpha_{n}} & \frac{1}{2} \sum_{i, j} \alpha_{i} \alpha_{j} y_{i} y_{j} K_{i j} - \sum_{i=1}^{n} \alpha_{i} \\ \text{s.t.} & 0 \leq \alpha_{i} \leq C \\ & \sum_{i=1}^{n} \alpha_{i} y_{i} = 0 \end{array} $$
In some ways, kernel SVM can be thought of as a "soft" version of the k-NN algorithm: instead
of considering only the $k$ nearest neighbors, every training point with $\alpha_i > 0$ (i.e., every support vector) gets a vote on the label of a test point, weighted by its kernel similarity to that point.
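To illustrate this interpretation, here is a small sketch of the kernel SVM decision rule, assuming the dual coefficients `alpha` and the bias `b` have already been obtained from some solver:

```python
import numpy as np

def svm_decision(z, X, y, alpha, b, kernel):
    """h(z) = sign(sum_i alpha_i * y_i * k(x_i, z) + b).

    Points with alpha_i = 0 contribute nothing (non-support vectors); the rest
    vote with a weight given by their kernel similarity to z, which is what
    makes this look like a "soft", weighted k-NN rule.
    """
    score = sum(a * yi * kernel(xi, z) for a, yi, xi in zip(alpha, y, X) if a > 0)
    return np.sign(score + b)
```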
A Gaussian random variable is characterized by the following distribution:

$$p(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),$$

where $\mu$ is the mean and $\sigma^2$ is the variance (with the analogous multivariate density $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ for random vectors). It often occurs in the real world because of the Central Limit Theorem, which states that the (suitably normalized) sum of many independent random variables tends toward a normal distribution.
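A quick NumPy check of that statement (a toy example of my own): sums of i.i.d. uniform variables already look close to a Gaussian with the matching mean and variance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 30, 100_000

# Sum of n i.i.d. Uniform(0, 1) variables, repeated many times.
sums = rng.uniform(0, 1, size=(trials, n)).sum(axis=1)

# CLT: approximately Normal(mean = n/2, variance = n/12).
print(sums.mean(), n / 2)   # ~15.0 vs 15.0
print(sums.var(), n / 12)   # ~2.5  vs 2.5
sigma = np.sqrt(n / 12)
# Fraction within one standard deviation of the mean, vs ~0.683 for a Gaussian:
print(np.mean(np.abs(sums - n / 2) < sigma))
```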
Recall that in linear regression, we first made the assumption that the data lies approximately
on a straight line with some normally-distributed error, and then used either MLE or MAP
to solve for the weight vector $\mathbf{w}$.
But this is a frequentist approach, i.e., we are committing ourselves to a single point estimate of $\mathbf{w}$. The Bayesian alternative is to keep the full posterior over $\mathbf{w}$ and average over it when predicting the label of a new test point $\mathbf{x}_*$:

$$P(y_* \mid \mathbf{x}_*, D) = \int_{\mathbf{w}} P(y_* \mid \mathbf{x}_*, \mathbf{w})\, P(\mathbf{w} \mid D)\, d\mathbf{w}.$$
Here, both the terms within the integral are Gaussian (if we assume a Gaussian prior on $\mathbf{w}$ and Gaussian observation noise with variance $\sigma^2$), so the predictive distribution is itself Gaussian, $P(y_* \mid \mathbf{x}_*, D) = \mathcal{N}(\mu_*, \Sigma_*)$,

where

$$\mu_* = \mathbf{K}_*^T(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{y}$$

and

$$\Sigma_* = K_{**} - \mathbf{K}_*^T(\mathbf{K} + \sigma^2\mathbf{I})^{-1}\mathbf{K}_*,$$

where $\mathbf{K}$ is the kernel matrix over the training points, $\mathbf{K}_*$ holds the kernel values between the training points and the test point, and $K_{**}$ is the kernel value of the test point with itself. This is exactly Gaussian process regression: the prediction comes with both a mean and an uncertainty estimate, and it depends on the data only through kernel evaluations.
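To close, here is a minimal NumPy sketch of Gaussian process regression under these assumptions (RBF kernel, noise variance `sigma2`; all names are mine), computing the predictive mean and variance at a single test point:

```python
import numpy as np

def rbf(a, b, ell=1.0):
    return np.exp(-np.sum((a - b) ** 2) / ell**2)

def gp_predict(x_star, X, y, kernel, sigma2=0.01):
    """Predictive mean and variance of GP regression at one test point:

    mu  = K_*^T (K + sigma2 I)^{-1} y
    var = K_** - K_*^T (K + sigma2 I)^{-1} K_*
    """
    K = np.array([[kernel(a, b) for b in X] for a in X])    # train-train
    K_star = np.array([kernel(a, x_star) for a in X])       # train-test
    K_ss = kernel(x_star, x_star)                           # test-test
    M_inv = np.linalg.inv(K + sigma2 * np.eye(len(X)))
    mu = K_star @ M_inv @ y
    var = K_ss - K_star @ M_inv @ K_star
    return mu, var

# Usage: noisy samples of a sine; uncertainty grows away from the data.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(30, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=30)
print(gp_predict(np.array([1.0]), X, y, rbf))   # mean near sin(1), small variance
print(gp_predict(np.array([10.0]), X, y, rbf))  # mean near 0, variance near 1
```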