## Wednesday, July 23, 2014

### Deriving Least Squares Error for Linear Regression

In this article we'll take a deeper look into machine learning, in specific we'll delve deeper into linear regression and see why we didn't just pull a rabbit out of a hat to get the cost function for linear regression, which is denoted by $$J(\theta) = \frac{1}{2} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^{2}$$

This formula is derived the fact that we can model distributions over continuous values, as is the case with linear regression,  with a Gaussian (Normal) distribution.  From a probabilistic point of view, we're asking ourselves the question, $P(y^{(i)}|x^{(i)};\theta)$, or in other words, what is the probability that $y^{(i)}$ takes on a specific value given a training example $x^{(i)}$ and paramaterized by vector $\theta$. This problem blends in nicely to be modeled by a Gaussian, so we can rewrite our probability to say

$$P(y^{(i)}|x^{(i)};\theta) = \frac{1}{\sqrt(2\pi)\sigma}e^{-\frac{1}{2} \left(\frac{\large(y^{(i)} - h_{\theta}(x^{(i)})^2}{\sigma^2}\right)}$$

Each training example, and hence target values for training examples, are independent of each other. This allows us to express the probability of the entire vector $y$ as a product of probabilities.

$$P(Y|X;\theta) = \prod\limits_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta)$$

Given that $\theta$ is paramaterizing the Gaussian, we can find the likelihood of this vector $\theta$ by setting it equal to the probability of the entire Y vector. $L(\theta)$ will denote the likelihood of $\theta$.

$$L(\theta) = P(Y|X;\theta) = \prod\limits_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta)$$

Given this equation, our primary objective know will be to find the argument $\theta$ that maximizes this term. In other words, we want to do maximum likelihood estimation. Our optimization goal becomes:

$$\mathop{\arg\,\max}\limits_\theta L(\theta)$$

Although we could approach this problem in its raw form as it is right now, a common first step is to take the logarithm of the optimization objective and try to optimize that instead. The reason this a valid step is because log is a monotonic function, so the larger the parameter plugged into log, the larger the result and it works the same way for minimization. Logs also have the nice property that the log of a product such as $log(ab) = log(a) + log(b)$. So if we take the log of our likelihood function, $L(\theta)$, let's call it $l(\theta)$.
$$l(\theta) = \log(L(\theta))$$
$$l(\theta) = \log( \prod\limits_{i=1}^{m} P(y^{(i)}|x^{(i)};\theta))$$
$$l(\theta) = \log(\prod\limits_{i=1}^{m}\frac{1}{\sqrt(2\pi)\sigma}e^{-\frac{1}{2} \left(\frac{\large(y^{(i)} - h_{\theta}(x^{(i)})^2}{\sigma^2}\right)})$$
$$l(\theta) = \sum\limits_{i=1}^{m}\log(\frac{1}{\sqrt(2\pi)\sigma}) + \log(e^{-\frac{1}{2} \left(\frac{\large(y^{(i)} - h_{\theta}(x^{(i)})^2}{\sigma^2}\right)})$$
$$l(\theta) = \frac{1}{\sqrt(2\pi)\sigma}\sum\limits_{i=1}^{m}-\frac{1}{2} \left(\frac{\large(y^{(i)} - h_{\theta}(x^{(i)})^2}{\sigma^2}\right)$$

The first equation is our definition of what $l(\theta)$ is. The following 2 equations are direct substitutions of what we stated previously with regards to modelling our probabilities as a Gaussian distribution. The following equation exploits the product property of logarithms I stated earlier where we turn a log of products into a sum of logs.  The final equation factors out the normalizer term since it does not depend on the iterated value, and we also realize the $\log(e^{x}) = x$, so by taking the logarithm of the exponential term in the Gaussian, we just get back whatever was in the exponent.

For the purposes of optimization, the normalizer term $\frac{1}{\sqrt(2\pi)\sigma}$ can be ignored and so can the term $\sigma^2$ inside the summation. By taking out the negative  inside the sum we can write our new optimization objective as
$$l(\theta) = -\sum\limits_{i=1}^{m} \left(\frac{\large(y^{(i)} - h_{\theta}(x^{(i)})^2}{2}\right)$$

Our objective is still to maximize this term. However, by minimizing the term inside the sum we also maximize the resulting value of $l(\theta)$. Why is this the case? We have a negative outside of the sum. So if we minimize the term inside the sum we'll have $l(\theta) = -(minimal)$ instead of $l(\theta) = -(maximal)$. Thus, our new optimization objective is to minimize:

$$\sum\limits_{i=1}^{m} \left(\frac{\large(y^{(i)} - h_{\theta}(x^{(i)})^2}{2}\right)$$

Factoring out the $\frac{1}{2}$ becomes

$$\frac{1}{2}\sum\limits_{i=1}^{m} \left(\large(y^{(i)} - h_{\theta}(x^{(i)})^2\right)$$

We can reorder the term $\large(y^{(i)} - h_{\theta}(x^{(i)})^2$ because it's squared. So, our final, resulting optimization objective becomes:

$$\frac{1}{2} \sum\limits_{i=1}^m (h_{\theta}(x^{(i)}) - y^{(i)})^{2}$$.

Look familiar? It should! This is the least squares error equation we "pulled out of a hat" earlier :) Minimizing this can be done in several ways. The first approach that is taught  is usually gradient descent, where we continue optimizing based on the gradient of our optimization objective. This will converge (eventually) because of the fact that a quadratic function in multiple variables ends up producing a bowl-shaped figure which has 1 global minimum. We can also use Newton's algorithm, which would involve deriving the Hessian of the function and inverting it but can converge in 1 iteration.  A closed form solution exists due to the Normal equations. Although I won't derive them in this article, you can figure out optimal values of $\theta$ with this equation:

$$\theta = (X^{\intercal}X)^{-1}X^{\intercal}Y$$

Several of the ideas and equations in machine learning can be derived from first thinking about how to model your data with a certain distribution. In a future article, we'll see how to derive the logistic regression algorithm by modelling our data as a Bernoulli distribution.

I hope this article was informative! Let me know what you think :)