---
layout: post
title: "GBO notes: Machine learning basics (Part 2)"
tags:
mathjax: true
---
In this series of notes we will review some basic concepts that are usually covered in an Intro to ML course. These are based on this course from Cornell.
In Part 2, we will look at Naive Bayes, logistic regression, gradient descent, and linear regression.
If we have an estimate of $P(y \mid \mathbf{x})$, we can simply predict the most likely label, i.e., use the Bayes optimal classifier $h(\mathbf{x})=\underset{y}{\operatorname{argmax}}\, P(y \mid \mathbf{x})$. So how can we estimate $P(y \mid \mathbf{x})$ from a finite training set? Estimating it directly is infeasible in high dimensions, since the number of samples required grows exponentially with the number of features.

To solve this problem, we can use Bayes rule and the Naive Bayes assumption. First, recall that Bayes rule gives

$$P(y \mid \mathbf{x})=\frac{P(\mathbf{x} \mid y) P(y)}{P(\mathbf{x})}.$$

The Naive Bayes assumption is

$$P(\mathbf{x} \mid y)=\prod_{\alpha=1}^{d} P\left(x_{\alpha} \mid y\right),$$

i.e., we assume that the $d$ feature dimensions are independent of each other given the label.
Suppose we have a real-valued feature vector. We will assume that for every class, each feature is generated from a Gaussian distribution. This is equivalent to saying that

$$P\left(x_{\alpha} \mid y=c\right)=\mathcal{N}\left(\mu_{\alpha c}, \sigma_{\alpha c}^{2}\right),$$

where the parameters $\mu_{\alpha c}$ and $\sigma_{\alpha c}^{2}$ are estimated as the empirical mean and variance of feature $\alpha$ over the training samples with label $c$, and the prior $P(y=c)$ is the fraction of training samples with label $c$.
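To make this concrete, here is a minimal NumPy sketch of a Gaussian Naive Bayes classifier (the class and variable names, and the small variance floor, are my own choices rather than anything from the course):

```python
import numpy as np

class GaussianNaiveBayes:
    """Per-class, per-feature Gaussian likelihoods with empirical class priors."""

    def fit(self, X, y):
        self.classes = np.unique(y)
        # Priors P(y=c) and per-class feature means/variances (MLE estimates).
        self.priors = np.array([np.mean(y == c) for c in self.classes])
        self.means = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.vars = np.array([X[y == c].var(axis=0) + 1e-9 for c in self.classes])
        return self

    def predict(self, X):
        # log P(y=c) + sum_alpha log N(x_alpha; mu_{alpha c}, sigma^2_{alpha c})
        log_lik = -0.5 * (
            np.log(2 * np.pi * self.vars[None, :, :])
            + (X[:, None, :] - self.means[None, :, :]) ** 2 / self.vars[None, :, :]
        ).sum(axis=2)
        log_post = np.log(self.priors)[None, :] + log_lik
        return self.classes[np.argmax(log_post, axis=1)]
```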
Logistic regression is the discriminative counterpart of Naive Bayes. It defines

$$P(y \mid \mathbf{x})=\frac{1}{1+e^{-y\left(\mathbf{w}^{\top} \mathbf{x}+b\right)}}, \quad y \in\{-1,+1\}.$$

Note that this form is the same as what is obtained using Naive Bayes by taking a Gaussian form for the likelihood function. The model is trained by maximizing the conditional likelihood, or equivalently, minimizing the negative log-likelihood, which gives

$$\hat{\mathbf{w}}, \hat{b}=\underset{\mathbf{w}, b}{\operatorname{argmin}} \sum_{i=1}^{n} \log \left(1+e^{-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)}\right).$$
There is no closed-form solution to this optimization problem, so we have to use gradient descent.
In addition to the MLE estimate, we can also compute the MAP estimate by treating $\mathbf{w}$ as a random variable with a zero-mean Gaussian prior, $\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \tau^{2} I\right)$. Maximizing the posterior adds an L2-regularization term to the objective:

$$\hat{\mathbf{w}}, \hat{b}=\underset{\mathbf{w}, b}{\operatorname{argmin}} \sum_{i=1}^{n} \log \left(1+e^{-y_{i}\left(\mathbf{w}^{\top} \mathbf{x}_{i}+b\right)}\right)+\frac{1}{2 \tau^{2}}\Vert \mathbf{w} \Vert^{2},$$

which again needs to be solved by gradient descent.
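As a concrete illustration, here is a minimal NumPy sketch of this training procedure with plain gradient descent (function names, learning rate, and step count are my own choices; the bias is assumed absorbed into $\mathbf{w}$ by appending a constant 1 to each feature vector, and setting `lam > 0` gives the L2-regularized MAP version, in which the penalty then also covers the absorbed bias):

```python
import numpy as np

def nll_gradient(w, X, y, lam=0.0):
    """Gradient of sum_i log(1 + exp(-y_i w^T x_i)) + lam * ||w||^2, with y_i in {-1, +1}."""
    margins = y * (X @ w)                  # y_i * w^T x_i
    coeffs = -y / (1.0 + np.exp(margins))  # derivative of each loss term w.r.t. w^T x_i
    return X.T @ coeffs + 2.0 * lam * w

def train_logistic_regression(X, y, lr=0.01, lam=0.0, steps=2000):
    """Plain gradient descent on the (optionally regularized) negative log-likelihood."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        w = w - lr * nll_gradient(w, X, y, lam)
    return w
```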
Hill climbing methods use Taylor's approximation of the function. Suppose we want to minimize a continuously differentiable function $\ell(\mathbf{w})$. Around the current point $\mathbf{w}$, the second-order Taylor approximation is

$$\ell(\mathbf{w}+\mathbf{s}) \approx \ell(\mathbf{w})+\mathbf{g}(\mathbf{w})^{\top} \mathbf{s}+\frac{1}{2} \mathbf{s}^{\top} H(\mathbf{w}) \mathbf{s},$$

where $\mathbf{g}(\mathbf{w})=\nabla \ell(\mathbf{w})$ is the gradient and $H(\mathbf{w})$ is the Hessian. We repeatedly take a step $\mathbf{w} \leftarrow \mathbf{w}+\mathbf{s}$ that decreases this approximation.

In gradient descent, we use only the first-order part of the approximation and choose $\mathbf{s}=-\alpha\, \mathbf{g}(\mathbf{w})$, i.e., a small step in the direction of the negative gradient, where $\alpha>0$ is the step size (learning rate).

Newton's method sets the gradient of the second-order approximation to zero, which gives $\mathbf{s}=-H(\mathbf{w})^{-1} \mathbf{g}(\mathbf{w})$. It typically needs fewer iterations, but computing and inverting the Hessian can be expensive when the dimensionality is large.
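The sketch below contrasts the two update rules on a simple quadratic objective (the matrix `A`, vector `b`, step size, and iteration counts are hypothetical choices of mine, just to make the example runnable):

```python
import numpy as np

def gradient_descent(grad, w0, lr=0.1, steps=100):
    """w <- w - lr * g(w): first-order update with a fixed step size."""
    w = w0.copy()
    for _ in range(steps):
        w = w - lr * grad(w)
    return w

def newtons_method(grad, hess, w0, steps=10):
    """w <- w - H(w)^{-1} g(w): second-order update using curvature."""
    w = w0.copy()
    for _ in range(steps):
        w = w - np.linalg.solve(hess(w), grad(w))
    return w

# Example: minimize ell(w) = 0.5 * w^T A w - b^T w, whose minimizer is A^{-1} b.
A = np.array([[3.0, 0.5], [0.5, 1.0]])
b = np.array([1.0, -2.0])

def grad(w):
    return A @ w - b

def hess(w):
    return A

w_gd = gradient_descent(grad, np.zeros(2))
w_newton = newtons_method(grad, hess, np.zeros(2))  # exact after one step for a quadratic
```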
So far we have looked mainly at classification tasks. We will now see a regression task, i.e., the labels $y_i$ are real-valued. Linear regression assumes that

$$y_{i}=\mathbf{x}_{i}^{\top} \mathbf{w}+\epsilon_{i},$$

where $\epsilon_{i} \sim \mathcal{N}\left(0, \sigma^{2}\right)$ is i.i.d. Gaussian noise. Equivalently,

$$P\left(y_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right)=\frac{1}{\sqrt{2 \pi \sigma^{2}}} e^{-\frac{\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2}}{2 \sigma^{2}}},$$

where $\mathbf{w}$ is the weight vector we want to estimate (the bias term can be absorbed into $\mathbf{w}$ by appending a constant feature).
Estimating with MLE:
$$
\begin{aligned}
\mathbf{w} &=\underset{\mathbf{w}}{\operatorname{argmax}}\; P\left(y_{1}, \mathbf{x}_{1}, \ldots, y_{n}, \mathbf{x}_{n} \mid \mathbf{w}\right) \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{n} P\left(y_{i}, \mathbf{x}_{i} \mid \mathbf{w}\right) \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{n} P\left(y_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right) P\left(\mathbf{x}_{i} \mid \mathbf{w}\right) \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{n} P\left(y_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right) P\left(\mathbf{x}_{i}\right) \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} \prod_{i=1}^{n} P\left(y_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right) \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} \sum_{i=1}^{n} \log \left[P\left(y_{i} \mid \mathbf{x}_{i}, \mathbf{w}\right)\right] \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} \sum_{i=1}^{n}\left[\log \left(\frac{1}{\sqrt{2 \pi \sigma^{2}}}\right)+\log \left(e^{-\frac{\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2}}{2 \sigma^{2}}}\right)\right] \\
&=\underset{\mathbf{w}}{\operatorname{argmax}} -\frac{1}{2 \sigma^{2}} \sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2} \\
&=\underset{\mathbf{w}}{\operatorname{argmin}} \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2}
\end{aligned}
$$
Therefore, computing the MLE for linear regression is equivalent to minimizing the mean squared error. This can be done using gradient descent (see above), but it also has a closed-form solution given as

$$\hat{\mathbf{w}}=\left(\mathbf{X}^{\top} \mathbf{X}\right)^{-1} \mathbf{X}^{\top} \mathbf{y},$$

where $\mathbf{X}$ is the matrix whose rows are the $\mathbf{x}_{i}^{\top}$ and $\mathbf{y}$ is the vector of labels.
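A minimal sketch of this closed-form solution, assuming $\mathbf{X}$ stacks the $\mathbf{x}_i^{\top}$ as rows and $\mathbf{y}$ stacks the labels (the function name is mine):

```python
import numpy as np

def ols_closed_form(X, y):
    """Least-squares solution w = (X^T X)^{-1} X^T y.
    np.linalg.solve is used instead of an explicit inverse for numerical stability."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```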
Estimating with MAP:
We assume that $\mathbf{w}$ has a zero-mean Gaussian prior, $\mathbf{w} \sim \mathcal{N}\left(\mathbf{0}, \tau^{2} I\right)$. Applying Bayes rule and following the same steps as in the MLE derivation, the MAP estimate becomes
$$\mathbf{w} = \underset{\mathbf{w}}{\operatorname{argmin}} \frac{1}{n}\sum_{i=1}^{n}\left(\mathbf{x}_{i}^{\top} \mathbf{w}-y_{i}\right)^{2} + \frac{\sigma^2}{n\tau^2}\Vert \mathbf{w} \Vert^2,$$
which is also known as ridge regression. Conceptually, it adds an L2-regularization term to the mean squared error loss.
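A matching sketch for ridge regression, where `lam` plays the role of $\sigma^2 / \tau^2$ in the MAP objective above (again, the function name is my own):

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Ridge regression: w = (X^T X + lam * I)^{-1} X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```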