---
layout: post
title: "GBO notes: Machine learning basics (Part 3)"
tags:
mathjax: true
---
In this series of notes we will review some basic concepts that are usually covered in an Intro to ML course. These are based on this course from Cornell.
In Part 3, we will look at SVMs, empirical risk minimization, model selection, and the bias-variance tradeoff.
SVMs are an extension of the perceptron model discussed earlier. They are again a linear
classifier, so we define the hypothesis class as
$$ h(\mathbf{x})=\operatorname{sign}\left(\mathbf{w}^{T} \mathbf{x}+b\right). $$
The difference between perceptron and SVM is that while perceptron returns any hyperplane
separating the classes, SVM returns the hyperplane with the maximum margin. Let us first define
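the margin: it is the minimum distance from the hyperplane to any point in the dataset $D$,
$$ \gamma(\mathbf{w}, b)=\min_{\mathbf{x} \in D} \frac{\left|\mathbf{w}^{T} \mathbf{x}+b\right|}{\|\mathbf{w}\|_{2}}. $$
Such a hyperplane can be computed by maximizing $\gamma(\mathbf{w}, b)$ subject to every point being classified correctly. Since rescaling $\mathbf{w}$ and $b$ does not change the hyperplane, this is equivalent to
$$ \min_{\mathbf{w}, b} \mathbf{w}^{T} \mathbf{w} \quad \text{s.t.} \quad \forall i, \; y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right) \geq 1. $$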
The points for which the constraint is satisfied with equality are called the "support vectors". The hyperplane only changes if we change the support vectors. Later, we will look at the dual formulation of SVMs, which makes use of support vectors.
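As a small illustration of the margin and of support vectors, here is a minimal sketch in plain Python. The toy points, the labels, and the helper names `dot`, `margin`, and `support_vectors` are assumptions made for this example, and the hyperplane $x_1 = 0$ is picked by hand for this data rather than found by a solver.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def margin(w, b, points):
    """Minimum distance from the hyperplane w.x + b = 0 to any point."""
    norm = math.sqrt(dot(w, w))
    return min(abs(dot(w, x) + b) / norm for x in points)

def support_vectors(w, b, points, labels):
    """Points whose constraint y_i (w.x_i + b) >= 1 holds with equality."""
    return [x for x, y in zip(points, labels)
            if math.isclose(y * (dot(w, x) + b), 1.0)]

# Hypothetical toy data: two classes on either side of the line x1 = 0.
points = [(1.0, 0.0), (3.0, 2.0), (-1.0, 0.0), (-2.0, -1.0)]
labels = [1, 1, -1, -1]
w, b = (1.0, 0.0), 0.0

gamma = margin(w, b, points)
sv = support_vectors(w, b, points, labels)
print(gamma, sv)  # 1.0 [(1.0, 0.0), (-1.0, 0.0)]
```

The two classes are 2 units apart at their closest points, so no separating hyperplane can have a margin larger than 1. Moving either of the other two points, as long as it stays outside the margin, leaves `gamma` and `sv` unchanged.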
If the data is not linearly separable, we can add slack variables that allow some points to violate the margin constraint. We can also write an unconstrained formulation of SVMs that folds the constraints into the loss function (in the form of a hinge loss):
$$ \min_{\mathbf{w}, b} \underbrace{\mathbf{w}^{T} \mathbf{w}}_{l_{2}\text{-regularizer}}+C \sum_{i=1}^{n} \underbrace{\max \left[1-y_{i}\left(\mathbf{w}^{T} \mathbf{x}_{i}+b\right), 0\right]}_{\text{hinge-loss}}. $$
We can see that if the constraint is satisfied, the hinge loss term is 0, otherwise it is equal
to the "slack" times the constant $C$.
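To make the role of the slack concrete, the following sketch evaluates the hinge-loss objective for two hand-picked weight vectors. The toy data, the value `C=1.0`, and the function name `svm_objective` are assumptions for illustration only; no optimization is performed here.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(ui * vi for ui, vi in zip(u, v))

def svm_objective(w, b, X, y, C):
    """Return (w^T w + C * sum of slacks, per-point slacks)."""
    regularizer = dot(w, w)
    # Slack of point i: how far y_i (w^T x_i + b) falls short of 1.
    slacks = [max(1.0 - yi * (dot(w, xi) + b), 0.0) for xi, yi in zip(X, y)]
    return regularizer + C * sum(slacks), slacks

# Hypothetical toy data, linearly separable by the line x1 = 0.
X = [(1.0, 0.0), (3.0, 2.0), (-1.0, 0.0), (-2.0, -1.0)]
y = [1, 1, -1, -1]

# All constraints satisfied: every slack is 0, only the regularizer remains.
obj_ok, slacks_ok = svm_objective((1.0, 0.0), 0.0, X, y, C=1.0)

# Shrinking w lowers the regularizer but two points now violate the constraint.
obj_small, slacks_small = svm_objective((0.5, 0.0), 0.0, X, y, C=1.0)

print(obj_ok, slacks_ok)        # 1.0 [0.0, 0.0, 0.0, 0.0]
print(obj_small, slacks_small)  # 1.25 [0.5, 0.0, 0.5, 0.0]
```

A larger `C` penalizes the slack more heavily and pushes the solution toward the hard-margin one; a smaller `C` tolerates more violations in exchange for a smaller $\mathbf{w}^{T} \mathbf{w}$.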
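Empirical risk minimization (ERM) means minimizing some continuous loss function $l$ averaged over the training data, usually together with a regularizer $r$:
$$ \min_{\mathbf{w}} \frac{1}{n} \sum_{i=1}^{n} l\left(h_{\mathbf{w}}\left(\mathbf{x}_{i}\right), y_{i}\right)+\lambda r(\mathbf{w}). $$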
Binary classification losses:
- Hinge loss: used in SVMs
- Log-loss: used in logistic regression
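Both losses can be written as functions of the signed score $z = y \, h(\mathbf{x})$, which is positive when the prediction has the correct sign. The sketch below assumes labels in $\{-1, +1\}$ (the log-loss takes a different form for $\{0, 1\}$ labels); the function names are chosen for this example.

```python
import math

def hinge_loss(z):
    """max(1 - z, 0): exactly zero once the score clears the margin."""
    return max(1.0 - z, 0.0)

def log_loss(z):
    """log(1 + exp(-z)), written to avoid overflow for very negative z."""
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

for z in (-2.0, 0.0, 1.0, 3.0):
    print(f"z={z:+.1f}  hinge={hinge_loss(z):.4f}  log={log_loss(z):.4f}")
```

The hinge loss is exactly 0 for $z \geq 1$, while the log-loss stays strictly positive and only approaches 0 as $z$ grows.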
Regression losses:
- Mean squared error: sensitive to outliers
- Absolute error: not differentiable at 0
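- Log-cosh: behaves like squared error for small residuals and like absolute error for large ones, and is differentiable everywhere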
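The three regression losses can be compared directly as functions of the residual $r = h(\mathbf{x}) - y$. This is a minimal sketch with function names chosen for the example; `log_cosh` uses the identity $\log \cosh r = |r| + \log(1 + e^{-2|r|}) - \log 2$ so that large residuals do not overflow.

```python
import math

def squared_error(r):
    return r * r

def absolute_error(r):
    return abs(r)

def log_cosh(r):
    """log(cosh(r)), computed in a numerically stable form."""
    a = abs(r)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

for r in (0.1, 1.0, 10.0, 100.0):
    print(f"r={r:6.1f}  squared={squared_error(r):10.4f}  "
          f"absolute={absolute_error(r):8.4f}  log_cosh={log_cosh(r):8.4f}")
```

For small residuals `log_cosh(r)` is close to `r**2 / 2`, and for large residuals it is close to `abs(r) - log(2)`, so a single outlier contributes linearly rather than quadratically.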
Regularizers: L1, L2, Lp norms.
Overfitting and underfitting correspond to high variance and high bias, respectively. Usually we need to select a regularization coefficient (for example, by checking the error on a held-out validation set) that avoids both high bias and high variance.
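One common way to pick the regularization coefficient is to fit the model for several candidate values and keep the one with the lowest error on held-out data. The sketch below does this for a deliberately tiny case: one-dimensional ridge regression without an intercept. The data, the candidate values, and the function names are all assumptions made for illustration.

```python
def fit_ridge_1d(xs, ys, lam):
    """Closed-form minimizer of sum_i (w * x_i - y_i)^2 + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

def mse(w, xs, ys):
    """Mean squared error of the predictor w * x."""
    return sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Hypothetical noisy training data from roughly y = 2x, plus a clean validation set.
train_x, train_y = [1.0, 2.0, 3.0], [2.2, 3.9, 6.3]
val_x, val_y = [4.0, 5.0], [8.0, 10.0]

candidates = [0.0, 0.1, 1.0, 10.0]
val_errors = {
    lam: mse(fit_ridge_1d(train_x, train_y, lam), val_x, val_y)
    for lam in candidates
}
best_lam = min(val_errors, key=val_errors.get)

for lam in candidates:
    print(f"lambda={lam:5.1f}  validation MSE={val_errors[lam]:.4f}")
print("selected lambda:", best_lam)
```

On this toy data the largest coefficient shrinks the weight far too much (high bias), no regularization follows the training noise slightly too closely, and a small positive value gives the lowest validation error.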
What to do when we have high variance (overfitting)?
- Add more training data
- Reduce model complexity
What to do when we have high bias (underfitting)?
- Add more features
- Increase model complexity (e.g., non-linear models)