How to make a Linear Regressor? (theory)

A beautiful sight

This post presents the formulas for coming up with the best-fitted linear regression line. If you want to practically make a regressor, read the post about How to make a Linear Regressor in Python.

Prerequisites: some Linear Algebra and Calculus are required to follow this post.

Test your knowledge

How to make a Linear Regressor? (theory) - Quiz 1

1 / 3

What is an OLS?

2 / 3

For OLS, does Gradient Descent guarantee to converge to the optimal states?

3 / 3

Which word is closest in meaning to Stochastic?

Your score is


Please rate this quiz

There are essentially 2 ways to make a Linear regressor from given training data and a cost function. Those 2 are:

  • Closed-form Linear regression algorithm
  • Gradient descent for Linear regression

For the cost function, we will use the Least Squared Error (LSE) in this post. The cost function’s formula is:

J_{(w)} = \frac{1}{2} \sum_{1 \leq i \leq n} (h_w(x^{(i)}) - y^{(i)})^2


  • w: the coefficients (the weights) in Linear regressor,
  • J_{(w)}: the cost of the regressor,
  • n: the size of training data,
  • x^{(i)}: the i-th training data,
  • y^{(i)}: the true response-value of the i-th training data,
  • h_w(x^{(i)}): the predicted-value for the i-th training data,
  • \frac{1}{2}: a constant added for mathematical convenience.

Note that the Linear regression that uses this cost function is also called an Ordinary Least Squares (OLS) regression.

1. Closed-form Linear Regression algorithm

Caveat: This closed-form algorithm of Linear Regression is only applicable if the cost function is LSE. For other cost functions, we should use the Gradient Descent approach.

The main idea behind this method is to minimize the cost function (J), we find its derivatives with respect to w and set this value to 0 , which is:

\begin{aligned}\nabla_wJ(w)    &= \nabla_w \frac{1}{2} \sum_{1 \leq i \leq n} (h_w(x^{(i)}) - y^{(i)})^2 \\                            &= \nabla_w \frac{1}{2} (Xw - y)^T(Xw - y) \\                            &= \frac{1}{2} \nabla_w ((Xw)^TXw - (Xw)^Ty - y^T(Xw) + y^Ty) \\                            &= \frac{1}{2} \nabla_w (w^T(X^TX)w - y^T(Xw) - y^T(Xw)) \\                            &= \frac{1}{2} \nabla_w (w^T(X^TX)w - 2(X^Ty)^Tw) \\                            &= \frac{1}{2} (2X^TXw - wX^Ty) \\                            &= X^TXw - X^Ty \\\end{aligned}

where X is the training data.

In order to minimize J(w), we set its derivatives to 0, thus:

X^TXw = X^Ty

which gives:

w = (X^TX)^{-1}X^Ty

Here we have the best-fitted line for the given data!

2. Gradient Descent

Unlike the Closed-form solution which can only work for LSE cost function, the Gradient Descent method can be applied for all cost functions, as long as they are differentiable.

If you haven’t heard about Gradient Descent, I suggest taking some time to read tutorials about it, as Gradient Descent is the basic technique for many optimization problems that we will encounter frequently.

In simple words, Gradient Descent (or Accent) is a greedy-like, iterative method that simulates the process of mountain climbing. Imagine optimizing the objective function is like climbing a mountain filled with fog, the higher position you can get to, the better your objective function is (or say the lower your cost function). Because of the fog, you cannot see far away, but only a radius of 1 meter around you is visible. At each step of the iteration, you look around to choose which direction to go next. As Gradient Descent is a greedy algorithm, you choose the local-best choice – the direction that gets you to the highest position in the 1-meter-radius around.

illustration of Gradient Accent method

More specifically, the mountain climbing process is a metaphor for Gradient Descent as below:

Mountain ClimbingGradient Descent
Climb the mountain.Minimize the cost function.
Get to the peak of the mountain.The cost function converges to the global minimum.
The current position.The current regressor (the current set of parameters of the regressor).
The visibility is 1 meter.We examine 1 or some data samples at each iteration.
Look around and choose the local best direction to go.Take derivatives of the cost function on the sample data and change the regressor a little bit accordingly.

For simplicity, in the following, we examine 1 sample at a time.

Let’s call d the number of predictor variables. Our regressor is consisting of d+1 parameters (one for each predictor variable and a constant). The Gradient Descent method works as:

Loop until converge:
  For each sample data i (1 \leq i \leq n):
   Simultaneously for each j (0 \leq j \leq d):
   w_j = w_j - \alpha \frac{\partial}{\partial w_j}J_{(w)}^{(i)}


  • \alpha being the Learning rate, which adjusts the intensity of our modification at each step.
  • J_{(w)}^{(i)} is the cost (error) of i-th sample.

We can simplify the formula with:

\begin{aligned}\frac{\partial}{\partial w_j}J_{(w)}^{(i)} &= \frac{\partial}{\partial w_j} \frac{1}{2}(h_w(x^{(i)}) - y)^2 \\                                                                        &= 2 * \frac{1}{2} * (h_w(x^{(i)}) - y) * \frac{\partial}{\partial w_j} (h_w(x^{(i)}) - y) \\                                                                        &= (h_w(x^{(i)}) - y) * \frac{\partial}{\partial w_j} (\sum_{0 \leq k \leq d}w_kx^{(i)}_k - y) \\                                                                        &= (h_w(x^{(i)}) - y)x_j \end{aligned}

More generally,

\frac{\partial}{\partial w}J_{(w)}^{(i)} = (h_w(x^{(i)}) - y)x

In short, this is how we iteratively update the regressor’s parameters:

Loop until converge:
  For each sample data i (1 \leq i \leq n):
   w = w - \alpha (h_w(x^{(i)}) - y)x

It is so beautiful that after math transformations, the formula becomes very intuitive. We adjust the regressor’s parameters directly by the difference between predicted value and the true value.

In general, Gradient Descent, like many other optimization algorithms, has problems with struggling around local minima. However, in this case, our cost function (LSE) is convex and has only 1 global minimum, no local minimum, hence the algorithm always converges to the best possible result.

Practically, to reduce the number of updates on the regressor’s weights (to make training faster), at each step, we can examine a batch of samples instead of 1 sample alone (Mini-batch Gradient Descent). These batches can be mutually exclusive or just randomly chosen (Stochastic Gradient Descent).

Test your understanding

How to make a Linear Regressor? (theory) - Quiz 2

1 / 5

What best describes Mini-batch Gradient Descent?

2 / 5

Screenshot From 2020 03 20 17 08 08

Given the above cost function of OLS, why do we add Screenshot From 2020 03 20 17 09 42 at the beginning of the function?

3 / 5

Screenshot From 2020 03 20 17 08 08

Given the above cost function of OLS, what does Screenshot From 2020 03 20 17 15 15 mean?

4 / 5

In general, the Gradient Descent method ...

5 / 5

What are the 2 types of fitting a Linear regression to data?

Your score is


Please rate this quiz


Above, we walked through 2 methods to derive the best-fitted regressor from input data.

  • The Closed-form formula gives a straight-forward recipe to make the regressor using Matrix multiplication. Note that this only works for LSE cost function.
  • Gradient Descent is an iterative approach to this problem. This optimization algorithm can be used with any differentiable cost function, but can only guarantee to produce the optimal solution if the cost function is convex (e.g. LSE).

You can find the full series of blogs on Linear regression here.


  • Andrew Ng’s Machine Learning course CS229, note 1: link

Leave a Reply