Regression Objective and Evaluation Functions

Objective function vs Evaluation function

The objective function is the target that your model tries to optimize when training on a dataset.

The Evaluation function (or evaluation metric) is, as suggested by its name, a function to evaluate the performance of a machine learning model on a dataset.

New researchers often confuse objective functions with evaluation functions. The difference is this: the objective function is what your model/algorithm sees and tries to optimize during training. The evaluation function, on the other hand, is observed only by the researchers themselves and is computed after training completes.

A common question is: why do we care about the evaluation function but have the model optimize the objective function? If the evaluation function is what we want, why don’t we use it as the target for our model to optimize?

The answer is: in a perfect world, a separate objective function would not exist. Our model would directly optimize the evaluation function, which is the function we researchers really care about. But in our world, it is not so easy. Some evaluation functions cannot be optimized directly by the machine, which is why we need an objective function to act as a proxy that approximates the evaluation function. To go a bit further, a common reason is that they are not differentiable, and differentiability is a requirement for gradient-based optimization algorithms like Gradient Descent.

On a side note, a model can have only one objective function but many evaluation functions. It is also advisable for researchers to evaluate model performance from various points of view (that is, with various evaluation functions).
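To make the split concrete, here is a minimal sketch of one objective function and several evaluation functions working together. The toy data, the learning rate and the number of iterations are all assumptions chosen for illustration; the model is a plain one-variable linear regression trained with Gradient Descent on MSE.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 3x + 2 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=100)
y = 3.0 * x + 2.0 + rng.normal(0.0, 0.1, size=100)

# Objective function: MSE. It is the only function the optimizer ever sees.
w, b = 0.0, 0.0
learning_rate = 0.1  # assumed value, not tuned
for _ in range(2000):
    error = (w * x + b) - y
    # Gradients of MSE with respect to w and b.
    w -= learning_rate * 2.0 * np.mean(error * x)
    b -= learning_rate * 2.0 * np.mean(error)

# Evaluation functions: computed by us after training, from several angles.
residual = y - (w * x + b)
evaluation = {
    "MAE": np.mean(np.abs(residual)),
    "MSE": np.mean(residual ** 2),
    "MaxAE": np.max(np.abs(residual)),
}
print(w, b, evaluation)
```

The training loop only ever touches the MSE gradient, while the dictionary at the end can hold as many evaluation functions as we like.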

Every objective function can work as an evaluation function, but not vice versa.

There are other terms that are closely related to Objective function, like Loss function or Cost function.

In high-level usage, you can just assume that those terms have the same meaning and are just other names for Objective function.

But in some literature, the authors may use them a little bit differently:

The Loss function is sometimes referred to as a function to compute the error of your prediction on 1 data point, while
The Cost function measures the total error of your predictions on the entire dataset.
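Under that stricter usage, the distinction can be sketched as follows. The numbers are made up, and squared error is just one possible choice of loss.

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])  # made-up true responses
y_pred = np.array([2.5, 5.0, 4.0])  # made-up predictions

# Loss: the error of each individual data point (here, the squared error).
per_sample_loss = (y_true - y_pred) ** 2

# Cost: the per-sample losses aggregated over the whole dataset (here, their mean).
cost = per_sample_loss.mean()

print(per_sample_loss)  # one value per data point
print(cost)             # a single value for the dataset
```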

The functions

Below, I will list out some of the most common objective/evaluation functions for regression models. Note that these functions measure the error of the whole dataset, not just an individual sample like loss functions.

We choose a suitable objective function (and one or more evaluation functions) depending on the problem we want to solve.

Mean Absolute Error

Mean Absolute Error (MAE) is also called the L1 cost function.

MAE = $\frac{1}{n} \sum_{1 \leq i \leq n}|y_i - y_i'|$ ,

where:
n is the number of samples,
$y_i$ is the true response of the i-th sample,
$y'_i$ is the predicted response of the i-th sample.

The value of MAE is in the range $[0, +\infty)$ .

For our model to be better, we should minimize MAE.

MAE can be used as an Objective function.
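A minimal NumPy sketch of the formula above; the function name `mean_absolute_error` and the sample values are my own choices for illustration.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Average of |y_i - y'_i| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred))

# Absolute errors are 0.5, 0 and 2, so the MAE is 2.5 / 3.
print(mean_absolute_error([3.0, 5.0, 2.0], [2.5, 5.0, 4.0]))
```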

Mean Squared Error

Mean Squared Error (MSE) is also called the L2 cost function.

MSE = $\frac{1}{n} \sum_{1 \leq i \leq n}(y_i - y_i')^2$ ,

where:
n is the number of samples,
$y_i$ is the true response of the i-th sample,
$y'_i$ is the predicted response of the i-th sample.

The value of MSE is in the range $[0, +\infty)$ .

For our model to be better, we should minimize MSE.

MSE can be used as an Objective function.
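The sketch below computes MSE and compares it with MAE on two made-up sets of predictions. Both sets have the same MAE, but one concentrates all of its error in a single sample. The helper names are hypothetical.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    return np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)))

def mean_squared_error(y_true, y_pred):
    """Average of (y_i - y'_i)^2 over all samples."""
    return np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2)

y_true = np.zeros(4)
spread_out = np.array([1.0, 1.0, 1.0, 1.0])  # four small errors
one_big = np.array([0.0, 0.0, 0.0, 4.0])     # one large error

# MAE cannot tell the two apart; MSE penalizes the single large error more.
print(mean_absolute_error(y_true, spread_out), mean_absolute_error(y_true, one_big))
print(mean_squared_error(y_true, spread_out), mean_squared_error(y_true, one_big))
```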

Max Absolute Error

Max Absolute Error (MaxAE) is also called the $L_\infty$ cost function.

MaxAE = $\max_{1 \leq i \leq n}|y_i - y'_i|$ ,

where:
n is the number of samples,
$y_i$ is the true response of the i-th sample,
$y'_i$ is the predicted response of the i-th sample.

The value of MaxAE is in the range $[0, +\infty)$ .

For our model to be better, we should minimize the MaxAE.

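MaxAE is generally not used as an Objective function: its value depends only on the single worst sample, so a gradient step would adjust the model for that one sample alone.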
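As an evaluation function, MaxAE is a one-liner. The sketch below (hypothetical function name, made-up values) also reports which sample is responsible for the worst error, which is often the more useful piece of information.

```python
import numpy as np

def max_absolute_error(y_true, y_pred):
    """Largest |y_i - y'_i| over all samples."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.max(np.abs(y_true - y_pred))

y_true = np.array([3.0, 5.0, 2.0, 7.0])
y_pred = np.array([2.5, 5.0, 4.0, 6.0])

# Index of the sample with the worst prediction.
worst_index = int(np.argmax(np.abs(y_true - y_pred)))
print(max_absolute_error(y_true, y_pred), worst_index)
```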

Root Mean Squared Error

RMSE = $\sqrt{\frac{1}{n} \sum_{1 \leq i \leq n}(y_i - y_i')^2}$ ,

where:
n is the number of samples,
$y_i$ is the true response of the i-th sample,
$y'_i$ is the predicted response of the i-th sample.

The value of RMSE is in the range $[0, +\infty)$ .

For our model to be better, we should minimize RMSE.

RMSE can be used as an Objective function.
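A short sketch of RMSE as the square root of MSE. The responses are made-up prices, used only to show that RMSE comes out in the same unit as the response while MSE is in squared units.

```python
import numpy as np

def root_mean_squared_error(y_true, y_pred):
    """Square root of the average squared error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Made-up prices in dollars.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mse = np.mean((y_true - y_pred) ** 2)            # in squared dollars
rmse = root_mean_squared_error(y_true, y_pred)   # in dollars
print(mse, rmse)
```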

R-Squared

R-squared (or $R^2$ or $r^2$ ) represents the proportion of the variance of the response variable that is predictable from the predictor variables.

Suppose $\overline{y}$ is the mean of all observed responses:

$\overline{y} = \frac{1}{n} \sum_{1 \leq i \leq n} y_i$

The total sum of squares, which measures the variance of the responses:

$SS_{tot} = \sum_{1 \leq i \leq n}(y_i - \overline{y})^2$

The sum of squares of residuals (a residual is the difference between the true response and the predicted response; the term is sometimes used interchangeably with error):

$SS_{res} = \sum_{1 \leq i \leq n} (y_i - y'_i)^2$

The R-squared is computed by:

$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ .

Normally, the value of $R^2$ is in the range [0, 1]. $R^2$ is negative only if the model is worse than a simple baseline that outputs the mean of the responses for every sample.

A value of $R^2$ of, for example, 0.78 means that, using our model, 78% of the variance in the response variable can be explained by the predictor variables.

For our model to be better, we should maximize $R^2$ .

Note that $R^2$ is generally not recommended for non-linear models: those models can be extremely complex and can fit almost any input data perfectly, so $R^2$ will over-rate their strength. Even for linear models, $R^2$ alone is not enough to determine whether a model is acceptable.

$R^2$ cannot be used as an Objective function.
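The definitions above translate directly into code. In this sketch (hypothetical function name, made-up values), the three prediction sets illustrate a good model, the mean-only baseline with $R^2 = 0$, and a model worse than the baseline with a negative $R^2$.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """1 - SS_res / SS_tot, following the definitions above."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([1.0, 2.0, 3.0, 4.0])
good = np.array([1.1, 1.9, 3.2, 3.8])     # close to the truth
mean_only = np.full(4, y_true.mean())     # always predicts the mean
bad = np.array([4.0, 3.0, 2.0, 1.0])      # worse than predicting the mean

print(r_squared(y_true, good))       # close to 1
print(r_squared(y_true, mean_only))  # exactly 0
print(r_squared(y_true, bad))        # negative
```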

Adjusted R-Squared

A disadvantage of R-squared is that it favors models with more predictor variables. The more predictors there are, the more your linear model can over-fit the training data, which results in a higher $R^2$ . In other words, we want a measurement that gives a higher value when the model is better in general, not just on the training data, yet R-squared gives a higher value whenever the model fits the training data better, even if it over-fits.

What should be emphasized here is that $R^2$ never decreases when we add more predictors to the model, which is a problem because it prevents a fair comparison between 2 models with different numbers of predictors. Adjusted $R^2$ was created to solve this problem.

For Adjusted $R^2$ , adding a new predictor will only increase Adjusted $R^2$ if that predictor increases the model’s performance more than expected by chance. That is to say, adding a good predictor will help the Adjusted $R^2$ increase, while adding a bad one will make Adjusted $R^2$ decrease.

Adjusted $R^2 = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}$ ,

where:
n is the number of samples,
k is the number of predictor variables in the model.

You may also notice that:

Adjusted $R^2 = 1 - \frac{VAR_{res}}{VAR_{tot}}$ ,

where:
$VAR_{res} = \frac{SS_{res}}{n - k - 1}$ is the unbiased variance of residuals.
$VAR_{tot} = \frac{SS_{tot}}{n - 1}$ is the unbiased variance of responses.
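
To see why the two forms agree, substitute the definitions of $VAR_{res}$ and $VAR_{tot}$ and use $R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$ (a short check that uses only the symbols defined above):

$1 - \frac{VAR_{res}}{VAR_{tot}} = 1 - \frac{SS_{res} / (n - k - 1)}{SS_{tot} / (n - 1)} = 1 - \frac{SS_{res}}{SS_{tot}} \cdot \frac{n - 1}{n - k - 1} = 1 - (1 - R^2) \frac{n - 1}{n - k - 1}$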

Adjusted $R^2$ is usually in the range [0, 1] but can sometimes be negative. Adjusted $R^2 \leq R^2$ .

For our model to be better, we should maximize the Adjusted $R^2$ .

Adjusted $R^2$ cannot be used as an Objective function.
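A small sketch of the adjustment, assuming $R^2$ has already been computed. The function name and the example numbers ($R^2 = 0.78$, 50 samples) are made up; the point is only to show how the same $R^2$ is discounted more heavily as the number of predictors grows.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 for n samples and k predictor variables."""
    if n - k - 1 <= 0:
        raise ValueError("need more samples than predictors plus one (n > k + 1)")
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

# Same R^2 and sample count, different numbers of predictors.
few_predictors = adjusted_r_squared(0.78, n=50, k=5)
many_predictors = adjusted_r_squared(0.78, n=50, k=20)
print(few_predictors, many_predictors)
```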

Predicted R-Squared

Predicted $R^2$ is simply $R^2$ computed on the testing data. When using $R^2$ or Adjusted $R^2$ , we don’t separate our data into a training set and a testing set; with Predicted $R^2$ , we do.

We can use many methods to split data into training and testing sets (e.g. random split, k-fold cross-validation). K-fold is preferred because it is more robust against over-fitting.

The idea here is: we train our model on the training set, and Predicted $R^2$ is the $R^2$ measured on the testing set. By calculating $R^2$ on a separate set of data, we can be more confident in the generalization power of the model.

Note that how we split the data is very important. If any information from the testing data leaks into the training data, it will ruin our measurement.

Predicted $R^2$ cannot be used as an Objective function.
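Here is a minimal sketch of the procedure with a single random split (k-fold would repeat it over several splits). Everything in it is an assumption for illustration: the synthetic data with one informative and nine noise predictors, the 30/10 split, and ordinary least squares via `np.linalg.lstsq` as the model.

```python
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def add_intercept(features):
    """Prepend a column of ones so the linear model has an intercept."""
    return np.column_stack([np.ones(len(features)), features])

# Made-up data: 1 informative predictor and 9 pure-noise predictors.
rng = np.random.default_rng(0)
n = 40
X = rng.normal(size=(n, 10))
y = 2.0 * X[:, 0] + rng.normal(0.0, 1.0, size=n)

# Random split: shuffle the indices once; test rows never touch training.
indices = rng.permutation(n)
train, test = indices[:30], indices[30:]

# Fit ordinary least squares on the training rows only.
coef, *_ = np.linalg.lstsq(add_intercept(X[train]), y[train], rcond=None)

train_r2 = r_squared(y[train], add_intercept(X[train]) @ coef)
predicted_r2 = r_squared(y[test], add_intercept(X[test]) @ coef)
print(train_r2, predicted_r2)
```

Comparing the two printed values gives a rough sense of how much of the training-set $R^2$ survives on data the model has not seen.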

Mean Squared Log Error

MSLE = $\frac{1}{n} \sum_{1 \leq i \leq n} [\ln(y_i) - \ln(y'_i)]^2$ ,

where:
n is the number of samples,
$y_i$ is the true response of the i-th sample,
$y'_i$ is the predicted response of the i-th sample.

MSLE is often used when the response can span several orders of magnitude. E.g. when your problem is to predict house prices, the response variable can vary from several thousand to several million. The error then depends on the ratio of the predicted response to the true response, rather than on their absolute difference.

Consider the example below:

	sample 1	sample 2
true response	20	2000
predicted response	30	3000
Squared Log Error	0.164	0.164

Using MSLE, the errors of the 2 samples are the same, even though the absolute differences are not.
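
The reason is that a difference of logarithms is the logarithm of a ratio, so the squared log error depends only on $y_i / y'_i$ . A quick check with the numbers from the table:

$[\ln(y_i) - \ln(y'_i)]^2 = [\ln(\frac{y_i}{y'_i})]^2$

$[\ln(\frac{20}{30})]^2 = [\ln(\frac{2000}{3000})]^2 = [\ln(\frac{2}{3})]^2 \approx 0.164$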

Note that, for the same absolute difference, MSLE penalizes under-estimation more than over-estimation. Hence, the final model is more likely to over-estimate than to under-estimate. Look at the following example.

	sample 1	sample 2
true response	20	20
predicted response	15	25
Squared Log Error	0.083	0.05

One more note: to use MSLE, the responses must be positive, as $\ln$ cannot take zero as its argument ( $\ln(0)$ is undefined). However, there is quite a high chance that some values $y_i$ or $y'_i$ are zero, so in practice, we often add 1 to $y_i$ and $y'_i$ when computing MSLE. In other words, we often use this modified version of MSLE:

MSLE = $\frac{1}{n} \sum_{1 \leq i \leq n} [\ln(1 + y_i) - \ln(1 + y'_i)]^2$

The value of MSLE is in the range $[0, +\infty)$ .

For our model to be better, we should minimize MSLE.

MSLE can be used as an Objective function.
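Both versions of MSLE can be sketched in one function. The name `msle` and the `add_one` switch are my own; `np.log1p(x)` computes $\ln(1 + x)$ . The calls at the bottom reuse the numbers from the two tables above with the plain (no +1) version.

```python
import numpy as np

def msle(y_true, y_pred, add_one=True):
    """Mean squared log error; add_one=True uses ln(1 + y) so zeros are allowed."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    if add_one:
        log_diff = np.log1p(y_true) - np.log1p(y_pred)
    else:
        log_diff = np.log(y_true) - np.log(y_pred)
    return np.mean(log_diff ** 2)

# Same ratio, different scale: the errors match.
print(msle([20], [30], add_one=False), msle([2000], [3000], add_one=False))

# Same absolute difference: under-estimation (15) costs more than over-estimation (25).
print(msle([20], [15], add_one=False), msle([20], [25], add_one=False))

# The +1 version accepts zero responses.
print(msle([0, 20], [0, 25]))
```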

Summary

In this post, we introduced the Objective function and the Evaluation function along with their differences. The Objective function is the target that your model tries to optimize, while the Evaluation function is what we humans see and care about (and want to optimize).

We also presented some of the most common objective/evaluation functions for regression problems. Each of them has its own advantages and disadvantages.

MAE is simple and easily interpreted.

MSE is an alternative to MAE if you want to penalize larger errors more heavily.

MaxAE is less common than the 2 above. Use it if you only care about the worst-case scenario.

RMSE is quite similar to MSE. Use it instead of MSE if you want the error to have the same unit as the response.

$\boldsymbol{R^2}$ , Adjusted $\boldsymbol{R^2}$ and Predicted $\boldsymbol{R^2}$ are used only for linear regression. $R^2$ is the original version, so it is the simplest among the 3. Adjusted $R^2$ penalizes models that have useless predictors. Predicted $R^2$ is the most robust one against over-fitting.

MSLE should be used when your response is non-negative and spans several orders of magnitude, and you want the error to depend on the ratio rather than the absolute difference. Notice that MSLE penalizes under-estimation more than over-estimation.

So, to close this topic, I would say that choosing which objective/evaluation function to use depends on your specific problem and how you want your outcome to be. You can choose one from above, or just create your own custom function.

Tung M Phung's Blog
