Confidence Interval of Coefficients?
Not only does Linear regression give us a model for prediction, but it also tells us how accurate the estimated model is, by means of Confidence Intervals.
If you are not familiar with the term Confidence Intervals, there is an introduction here: Confidence Level and Confidence Interval.
For simplicity, let's consider a simple linear regression (SLR): $y = \beta_0 + \beta_1 x$, where $\beta_0$ is the intercept and $\beta_1$ is the slope. As $\beta_0$ and $\beta_1$ are estimated from data, we cannot be 100% sure that the estimated values are really the best parameters for this problem. The actual best parameters might be some other values, and the Confidence Interval tells us how close our estimates are likely to be to these true, best parameters.
For example, suppose the regression line we compute differs from the true regression line of the population by 0.1 in one coefficient and 0.2 in the other. These differences are the coefficients' errors. They exist because we estimate the coefficients from a limited, noisy sample rather than from the whole population, so the estimates vary from one sample to the next.
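A small simulation can make this concrete. The sketch below assumes a hypothetical "true" line (intercept 1.0, slope 2.0, values chosen only for illustration), draws a noisy sample from it, and fits an ordinary least-squares line. The helper name `fit_simple_linear_regression` is made up for this example.

```python
import random


def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Sum of squared deviations of x, and co-deviations of x and y.
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


# Hypothetical "true" population line (illustrative values only).
TRUE_INTERCEPT, TRUE_SLOPE = 1.0, 2.0

rng = random.Random(0)  # fixed seed so the run is repeatable
xs = [rng.uniform(0, 10) for _ in range(30)]
# Each observation is the true line plus Gaussian noise.
ys = [TRUE_INTERCEPT + TRUE_SLOPE * x + rng.gauss(0, 1) for x in xs]

intercept_hat, slope_hat = fit_simple_linear_regression(xs, ys)
print(f"estimated intercept: {intercept_hat:.3f} (true {TRUE_INTERCEPT})")
print(f"estimated slope:     {slope_hat:.3f} (true {TRUE_SLOPE})")
```

The estimates land near the true values but not exactly on them, and a different seed gives slightly different estimates.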
To quantify this uncertainty, Linear Regression allows us to compute Confidence Intervals, which give the range in which each regressor coefficient is expected to lie at a chosen Confidence Level.
Note that the resulting Confidence Intervals will not be reliable if the Assumptions of Linear regression are not met. Hence, before calculating the Intervals, we should test these assumptions to ensure none of them is violated.
How to compute the Confidence Interval of the Slope?
In this blog post, we are going to find the confidence interval of the slope.
In Hypothesis Testing, the Confidence Interval is computed as:
CI = Mean value ± (t-statistic or z-statistic)*std
where:
- t-statistic (or z-statistic) is deduced from the Confidence Level (e.g. the Confidence Level of 95% yields a z-statistic of around 2, or more precisely 1.96).
- std is the standard deviation of the value to be measured.
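The general formula above can be sketched with Python's standard library. This is a minimal illustration for the z-statistic case only, and the function names `z_critical` and `z_confidence_interval` are made up for this example.

```python
from statistics import NormalDist


def z_critical(confidence_level):
    """Two-sided z-statistic for a confidence level such as 0.95."""
    tail = (1 - confidence_level) / 2
    # inv_cdf returns the point below which (1 - tail) of the
    # standard normal distribution lies.
    return NormalDist().inv_cdf(1 - tail)


def z_confidence_interval(mean_value, std, confidence_level=0.95):
    """CI = mean value +/- z-statistic * std."""
    margin = z_critical(confidence_level) * std
    return mean_value - margin, mean_value + margin


print(round(z_critical(0.95), 2))  # the "around 2" value for 95%
print(z_confidence_interval(10.0, 2.0))
```

A higher Confidence Level gives a larger z-statistic and therefore a wider interval.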
The formula is exactly the same for Confidence Intervals of Regressor Coefficients. We use the t-statistic instead of the z-statistic because the standard deviation has to be estimated from sample data rather than known for the whole population. Thus, the Confidence Interval of the slope is:
CI = estimated slope ± t-statistic*std
where:
- the value of t-statistic depends on the Confidence Level, and we use the degree of freedom = n – 2 instead of the classical n – 1, because our regressor has 2 coefficients (the intercept and the slope).
- std: the formula for this value is a little bit involved. Ocram on StackExchange gave a full explanation here using Matrix computation. In simple words, you can think of the factors that can make the standard deviation of the estimated slope increase or decrease:
- The prediction errors (or residuals) should have a direct effect on std, because the higher the errors, the more erroneous our regressor is, hence the wider the Confidence Interval.
- The standard deviation of x should have an inverse effect on std, because the more diverse x is, the more information it gives, hence the more accurate our regressor.
- The sample size (n) should have an inverse effect on std, because the bigger the sample set, the better it represents the whole population, hence the more accurate our regressor.
In short,
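std = a / b
where:
- a = $\sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}$, the standard deviation of the residuals, computed with n – 2 degrees of freedom ($\hat{y}_i$ is the model's prediction for $x_i$).
- b = $\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}$, which grows with both the standard deviation of x and the sample size n ($\bar{x}$ is the mean of x).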
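As a rough sketch, the whole computation can be put together in plain Python. It uses the standard textbook formula for the standard error of the slope: the residual standard deviation (with n – 2 degrees of freedom) divided by the square root of the summed squared deviations of x. The function name `slope_confidence_interval` and the toy data are made up for illustration. The t-statistic is passed in as an argument; it can be read from a t-table or obtained with `scipy.stats.t.ppf(0.975, df)` for a 95% Confidence Level.

```python
import math


def slope_confidence_interval(xs, ys, t_critical):
    """Return (slope, std, (low, high)) for simple linear regression."""
    n = len(xs)
    if n < 3:
        raise ValueError("need at least 3 points for n - 2 degrees of freedom")
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean

    # Residual standard deviation with n - 2 degrees of freedom.
    residual_ss = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    residual_std = math.sqrt(residual_ss / (n - 2))
    # Spread of x: grows with both the variance of x and the sample size.
    x_spread = math.sqrt(sxx)

    std = residual_std / x_spread
    margin = t_critical * std
    return slope, std, (slope - margin, slope + margin)


# Toy data (illustrative only); n = 5, so df = 3 and t(0.975, 3) is about 3.182.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
slope, std, (low, high) = slope_confidence_interval(xs, ys, t_critical=3.182)
print(f"slope = {slope:.3f}, std = {std:.3f}, 95% CI = ({low:.3f}, {high:.3f})")
```

With so few points the interval is wide and includes 0, so this toy sample alone does not show that the slope differs from zero.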
Why do we compute the Confidence Intervals?
Conclusion
This blog post gives an introduction to the Confidence Intervals of Linear Regression Coefficients. The Confidence Intervals help us test if the predictor variable is valuable and if it is well utilized or not.
Note that we should make sure the assumptions of Linear Regression hold before computing the CIs, as violating some of them might make our CIs inaccurate.
You can find the full series of blogs on Linear regression here.
References: