---
layout: post
title: "GBO notes: Machine learning basics (Part 5)"
tags:
mathjax: true
---
In this series of notes we will review some basic concepts that are usually covered in an Intro to ML course. These are based on this course from Cornell.
In this final part, we will look at k-dimensional trees, decision trees, bagging, and boosting.
In the k-NN algorithm, at test time, we have to compute the distance between the test point
and every other point in the training set. This can take a long time if the training set is large. k-d trees are a data structure that speeds up this nearest-neighbor search.
The idea is that we partition the feature space into boxes. We pick a feature, split the
dataset in half at the median value of that feature, and repeat this process recursively on each half, cycling through the features. At test time, we then only need to compute distances to training points in boxes near the test point.
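To make the construction concrete, here is a minimal sketch of building such a tree with NumPy; the node layout and function name are my own illustration, not a reference implementation.

```python
import numpy as np

def build_kdtree(points, depth=0):
    """Recursively split `points` (an n x d array) at the median of one feature.

    Features are cycled through by depth; each node stores the split feature,
    the split value, and the two halves as children. Illustrative only.
    """
    n, d = points.shape
    if n <= 1:
        return {"point": points[0] if n == 1 else None, "left": None, "right": None}

    feature = depth % d                      # cycle through the features
    order = np.argsort(points[:, feature])   # sort along the chosen feature
    points = points[order]
    median = n // 2                          # split the set in half

    return {
        "feature": feature,
        "value": points[median, feature],
        "point": points[median],
        "left": build_kdtree(points[:median], depth + 1),
        "right": build_kdtree(points[median + 1:], depth + 1),
    }

tree = build_kdtree(np.random.rand(100, 3))  # 100 points in 3 dimensions
```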
A limitation of this approach is that in high dimensions, all the points are far apart, and so it is likely that every point will be in its own box. Also, all the splits are axis-aligned.
An improvement to the simple k-d tree is the Ball-tree, in which instead of boxes, we split based on hyperspheres.
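In practice, libraries already provide both structures. A small usage sketch with scikit-learn (assuming it is installed), where both trees expose the same query interface:

```python
import numpy as np
from sklearn.neighbors import KDTree, BallTree

rng = np.random.default_rng(0)
X = rng.random((1000, 10))       # 1000 training points in 10 dimensions
queries = rng.random((5, 10))    # 5 test points

# Axis-aligned boxes (k-d tree) vs. hyperspheres (ball tree)
kd = KDTree(X, leaf_size=40)
ball = BallTree(X, leaf_size=40)

dist_kd, idx_kd = kd.query(queries, k=3)      # 3 nearest neighbors per query
dist_ball, idx_ball = ball.query(queries, k=3)

assert np.allclose(dist_kd, dist_ball)        # same neighbors, different index structure
```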
Although k-d trees improve upon the test-time complexity of the k-NN algorithm, they still require storing a lot of information about the training data and which boxes the points lie in. For classification, we don't really need all this storage overhead. Decision trees build a non-linear classifier that does not need to store the training data.
This is done by recursively splitting the dataset (i.e., building a tree) by thresholding some chosen feature at each node. We stop the splitting when the sets at the leaves are pure (i.e., all points in a leaf have the same label). Once we have constructed the tree, we do not need to store the training points anymore. Trees are also very fast at test time, since a prediction only requires traversing the tree with a few feature comparisons, and they are generally considered non-parametric (more on this below).
Although it is NP-hard to find a minimum-size decision tree that fits the data, we can find good approximations greedily, by choosing at each node the split that most reduces an impurity function such as the Gini impurity.
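As an illustration, a greedy split search based on the Gini impurity might look like the following sketch (the function names are mine, not from the notes):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a set of class labels: 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def best_split(X, y):
    """Greedily pick the (feature, threshold) whose split minimizes the
    weighted Gini impurity of the two children."""
    n, d = X.shape
    best = (None, None, np.inf)
    for feature in range(d):
        for threshold in np.unique(X[:, feature]):
            left = y[X[:, feature] <= threshold]
            right = y[X[:, feature] > threshold]
            if len(left) == 0 or len(right) == 0:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (feature, threshold, score)
    return best

X = np.array([[2.0, 1.0], [3.0, 1.0], [10.0, 2.0], [11.0, 2.0]])
y = np.array([0, 0, 1, 1])
print(best_split(X, y))  # splits on feature 0 at 3.0 with weighted impurity 0.0
```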
An important model related to decision trees is CART (Classification and Regression Trees). It can also be used to perform regression, i.e., predict continuous labels: each leaf simply predicts the average of the training labels that fall into it.
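For instance, a shallow regression tree fit with scikit-learn (a sketch on synthetic data) produces a piecewise-constant approximation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=200)

# A shallow CART: each leaf predicts the mean label of the points that reach it
reg = DecisionTreeRegressor(max_depth=3)
reg.fit(X, y)

print(reg.predict([[1.0], [4.0]]))  # piecewise-constant predictions
```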
Models such as perceptron, logistic regression, etc. are parametric, meaning that they have a fixed number of parameters independent of the size of training data. Algorithms like k-NN are non-parametric since the apparent number of parameters scales with the size of the training data (we need to store all training data to make predictions).
Kernel SVMs are an interesting case since their category depends on the type of kernel used. Linear kernel SVMs are parametric since they are functionally similar to perceptrons. RBF kernel SVMs need to store the support vectors, and so are non-parametric. Polynomial kernels are technically parametric, but in practice, it is much more convenient to just store the support vectors, and so they behave like non-parametric models.
Decision trees also behave both ways. If trained to full depth (which is $\mathcal{O}(\log n)$ for a balanced tree), they are non-parametric, since the tree grows with the training data. But in practice, we limit the depth to some maximum value, and so they behave like parametric models.
Bagging is an ensemble technique used to reduce variance (overfitting). It stands for "bootstrap aggregating".
The idea comes from the weak law of large numbers: if we could train $m$ classifiers on $m$ independent training sets drawn from the data distribution, their average would converge to the expected classifier as $m$ grows, and the variance of the ensemble would shrink. Since we only have a single training set, we simulate this by drawing $m$ bootstrap samples (sampling $n$ points with replacement from the training set) and averaging, or taking a majority vote over, the classifiers trained on them.
Random forest: It is the most popular bagging-based method --- it is nothing but decision
trees with bagging, with one small modification: at each split, we randomly sample a subset of $k \le d$ features (a common choice is $k = \sqrt{d}$) and only consider those features for the split. This makes the individual trees less correlated with each other.
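To make the procedure concrete, here is a minimal bagging sketch (assuming scikit-learn for the base trees; the helper names `bagging_fit` and `bagging_predict` are mine). Passing `max_features="sqrt"` adds the per-split feature subsampling that turns it into a basic random forest.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, m=25, max_features=None, seed=0):
    """Train m trees, each on a bootstrap sample of (X, y).

    Setting max_features (e.g. "sqrt") makes each split consider only a
    random subset of features, which is the random-forest modification.
    """
    rng = np.random.default_rng(seed)
    n = len(X)
    trees = []
    for _ in range(m):
        idx = rng.integers(0, n, size=n)            # sample n points with replacement
        tree = DecisionTreeClassifier(max_features=max_features)
        trees.append(tree.fit(X[idx], y[idx]))
    return trees

def bagging_predict(trees, X):
    """Majority vote over the ensemble's predictions (for 0/1 labels)."""
    votes = np.stack([t.predict(X) for t in trees])  # shape: (m, n_test)
    return np.round(votes.mean(axis=0))

X = np.random.rand(200, 5)
y = (X[:, 0] + X[:, 1] > 1).astype(int)
forest = bagging_fit(X, y, m=25, max_features="sqrt")
print(bagging_predict(forest, X[:5]), y[:5])
```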
While bagging is used to reduce variance, boosting is a method to reduce bias, i.e., solve underfitting. Suppose we have a bunch of "weak learners" (i.e., high training error). The question is how can we combine these to form a strong learner?
The idea is to create an ensemble classifier as

$$ H_T(\mathbf{x}) = \sum_{t=1}^{T} \alpha_t h_t(\mathbf{x}), $$

i.e., a weighted combination of weak learners $h_t$ from some hypothesis class $\mathcal{H}$. To find such an ensemble, we build it greedily, adding one weak learner at a time: $H_{t+1} = H_t + \alpha h_{t+1}$, where $h_{t+1}$ is chosen to reduce the loss $\ell$ as much as possible.
Gradient descent in functional space: Again, using a Taylor approximation, we get

$$ \ell(H_t + \alpha h) \approx \ell(H_t) + \alpha \langle \nabla \ell(H_t), h \rangle. $$
So the solution is

$$ h_{t+1} = \operatorname{argmin}_{h\in \mathcal{H}} \ell(H_t + \alpha h) \approx \operatorname{argmin}_{h\in \mathcal{H}} \langle \nabla \ell(H_t), h \rangle, $$

where we have fixed the step size $\alpha$ to some small constant.
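Assuming the loss decomposes over training points (as the usual losses do), i.e., $\ell(H) = \sum_{i=1}^{n} \ell(H(\mathbf{x}_i), y_i)$, the inner product is just a sum over the training set:

$$ \langle \nabla \ell(H_t), h \rangle = \sum_{i=1}^{n} \frac{\partial \ell}{\partial H_t(\mathbf{x}_i)} \, h(\mathbf{x}_i). $$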
This means that if we have some algorithm to compute the minimizer of this sum over $h \in \mathcal{H}$ (i.e., a weak learner whose outputs correlate well with the negative gradient at the training points), we can call it repeatedly, adding one weak learner per iteration to drive down the loss.
GBRT (gradient boosted regression trees): The weak learners are fixed-depth decision trees (or CART). In this case, the hypothesis
that minimizes the loss turns out to be the one that is closest, in the squared-error sense, to the negative gradient, so each new tree is simply fit by regression to the residuals $t_i = -\frac{\partial \ell}{\partial H_t(\mathbf{x}_i)}$, which for the squared loss are just $y_i - H_t(\mathbf{x}_i)$.
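Here is a minimal sketch of this loop for the squared loss, using scikit-learn's regression trees as fixed-depth weak learners (the data and hyperparameters are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 5, size=(300, 1))
y = np.sin(X).ravel() + 0.1 * rng.normal(size=300)

n_rounds, alpha = 100, 0.1          # number of boosting rounds and step size
H = np.zeros(len(y))                # current ensemble prediction H_t(x_i)
trees = []

for _ in range(n_rounds):
    residuals = y - H                           # negative gradient of the squared loss
    h = DecisionTreeRegressor(max_depth=2)      # fixed-depth weak learner
    h.fit(X, residuals)                         # regress onto the residuals
    H += alpha * h.predict(X)                   # H_{t+1} = H_t + alpha * h_{t+1}
    trees.append(h)

def predict(trees, X_new, alpha=0.1):
    return alpha * sum(t.predict(X_new) for t in trees)

print(np.mean((predict(trees, X) - y) ** 2))    # training MSE shrinks with more rounds
```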
AdaBoost is another powerful boosting algorithm over binary learners, but we will skip it here.