
Colinear Wine

Introduction and Motivation

Linear regression is a widely used statistical method for modelling the relationship between predictor variables and a dependent variable. Practitioners are typically advised to check several assumptions, one of which is the absence of strong colinearity among the predictor variables.

Here, we explore what happens if this assumption is violated using a practical example:

Alice wants to buy wine aged between 5 and 15 years from either shop A or shop B, and she wants to compare their prices. She decides this is an ideal scenario to quickly fit a linear regression model (price ~ age + shop) and collects a few data points from each shop.

If the two shops don't offer wines across the same age range, the variables "shop" and "age" are strongly colinear.

Data

Let's generate some synthetic data to explore the effect of colinearity in three different scenarios:

  • "full overlap" - both shops offer wines across the same age range. No colinearity.
  • "no overlap" - shop A only sells young wines up to 10 years, and store B is more high-end with wines ages more than 10 years; high colinearity.
  • finally, an intermediate "small overlap" - medium colinearity.

We define the true data distribution as follows: $$ price = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot \mathbf{1}(\text{shop} = \text{'B'}) + \varepsilon $$ with $\beta_0 = 10$, $\beta_1 = 1$, $\beta_2 = 5$, and $\varepsilon \sim N(0, 1.5)$. I.e., the price is a linear function of age and shop with some noise. Shop B is more expensive than shop A by 5 units.
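To make the scenarios concrete, here is a minimal sketch of the data generation under these assumptions. The function name draw_dataset, the sample size, and the exact age endpoints for the "small overlap" scenario are my own illustrative choices, not necessarily those used in this repository:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

def draw_dataset(scenario: str, n_per_shop: int = 20) -> pd.DataFrame:
    """Draw one synthetic dataset according to the true model above."""
    # Age ranges per shop; the "small overlap" endpoints are assumed,
    # chosen only to create a partial overlap between the shops.
    ranges = {
        "full overlap":  {"A": (5.0, 15.0), "B": (5.0, 15.0)},
        "small overlap": {"A": (5.0, 11.0), "B": (9.0, 15.0)},
        "no overlap":    {"A": (5.0, 10.0), "B": (10.0, 15.0)},
    }[scenario]
    frames = []
    for shop, (lo, hi) in ranges.items():
        age = rng.uniform(lo, hi, size=n_per_shop)
        # price = 10 + 1*age + 5*1(shop == 'B') + noise
        price = 10 + 1.0 * age + 5.0 * (shop == "B") + rng.normal(0, 1.5, n_per_shop)
        frames.append(pd.DataFrame({"age": age, "shop": shop, "price": price}))
    return pd.concat(frames, ignore_index=True)
```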

Fig. 1 Illustration of a single draw from each scenario.

Experiments

Measuring colinearity

Let's start by sampling one dataset from each scenario and running a colinearity check: we compute the VIF (Variance Inflation Factor) for the shop feature. The VIF quantifies colinearity by measuring how much of the variance of a feature can be explained by the other features; concretely, VIF = 1/(1 − R²) from regressing that feature on the rest. As a rule of thumb, a value of 1 is ideal, and values of 5 or more indicate high colinearity.

| Scenario | VIF (shop) |
| --- | --- |
| Full overlap | 1.03 |
| Small overlap | 4.15 |
| No overlap | 5.65 |
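Values like these can be reproduced along the following lines with statsmodels, reusing the hypothetical draw_dataset helper from the sketch above (the exact numbers depend on the draw):

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

for scenario in ["full overlap", "small overlap", "no overlap"]:
    df = draw_dataset(scenario)
    # Design matrix with an intercept and a 0/1 dummy for shop B.
    X = sm.add_constant(pd.DataFrame({
        "age": df["age"],
        "shop_B": (df["shop"] == "B").astype(float),
    }))
    vif_shop = variance_inflation_factor(X.values, list(X.columns).index("shop_B"))
    print(f"{scenario}: VIF(shop) = {vif_shop:.2f}")
```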

We observe that this test would have successfully warned us to be careful with the "small overlap" scenario, and even more so with the "no overlap" scenario.

Fitting the regression

Focusing on the most problematic "no overlap" dataset, let's fit a linear regression and look at the coefficients.
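A minimal sketch of the fit using statsmodels' formula API, again assuming the hypothetical draw_dataset helper; the categorical shop column is dummy-coded automatically:

```python
import statsmodels.formula.api as smf

df = draw_dataset("no overlap")
fit = smf.ols("price ~ age + shop", data=df).fit()
print(fit.params)  # Intercept, shop[T.B], age
```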

For this particular draw of the data, we get the following estimates:

| Parameter | Estimate | True Value |
| --- | --- | --- |
| Intercept | 10.70 | 10.00 |
| Coef (Age) | 0.93 | 1.00 |
| Coef (Shop B) | 5.23 | 5.00 |

We can see that the coefficient for shop B is a bit too high, and the coefficient for age is too low. To get a better understanding of the situation, let's look at the distribution of the estimates across multiple draws.
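A sketch of this repeated-sampling experiment, once more assuming the hypothetical draw_dataset helper (5000 fits per scenario, so it takes a moment to run):

```python
import pandas as pd
import statsmodels.formula.api as smf

results = {}
for scenario in ["full overlap", "small overlap", "no overlap"]:
    # Re-draw the data and refit the model many times.
    params = [
        smf.ols("price ~ age + shop", data=draw_dataset(scenario)).fit().params
        for _ in range(5000)
    ]
    results[scenario] = pd.DataFrame(params)
    # Means should be close to the true (10, 5, 1); spreads grow with colinearity.
    print(scenario, results[scenario].mean().round(2).to_dict())
```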

Fig. 2 Coefficient estimate distribution across 5000 new draws. The variance of all estimates increases as the overlap in age decreases and colinearity increases; however, the estimates look unbiased.

To get a clearer picture of what is happening, let's look at a scatterplot of the age coefficient against the shop B coefficient:

Fig. 3 Age coefficient vs. shop B coefficient across draws, per scenario.

In the full-overlap datasets, there is no correlation between the two coefficient estimates - after all, age and shop contribute to the price independently. In the small- and no-overlap scenarios, there is an inverse correlation: if the estimated coefficient for age increases, the shop coefficient decreases, and vice versa.

This shows that the model has trouble disentangling the effects of shop and age - together, the two coefficients still add up to the right (unbiased) overall estimate, but the effect is incorrectly distributed between the two.

Experiment: bootstrapping

As the distribution across new data draws is unbiased, I wonder if we could do bootstrapping to recover the true estimates.

A brief experiment shows that this doesn't work - we recover almost exactly the same estimates.
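The experiment can be sketched as follows: resample the rows of one fixed dataset with replacement and refit each time (again using the hypothetical draw_dataset helper):

```python
import pandas as pd
import statsmodels.formula.api as smf

df = draw_dataset("no overlap")  # one fixed "no overlap" draw
boot = pd.DataFrame([
    smf.ols("price ~ age + shop", data=df.sample(frac=1.0, replace=True)).fit().params
    for _ in range(2000)
])
# The bootstrap distribution centres on the original point estimates,
# not on the true coefficients: resampling adds no new information.
print(boot.mean().round(2))
```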

In hindsight this is unsurprising - bootstrapping doesn't introduce additional information. OLS is unbiased and deterministic (not path-dependent), so we can only recover the distribution already implied by the data.

Conclusion

What does this mean for Alice?

It depends on what Alice wants to achieve.

Colinearity is mostly a problem for coefficient estimates, not for goodness of fit. As such, if Alice uses the regression to identify a good deal (wines priced lower than predicted), colinearity should not be an issue. Caveat: predictions outside the range of the collected data are unreliable due to unstable coefficient estimates.

However, if Alice wants to know which shop is more expensive (controlling for age), colinearity is an issue. In this case, the only robust solution is to collect additional data covering overlapping ages at both shops. Without such data, Alice simply lacks sufficient information to resolve the ambiguity.

Idea for further exploration: in some situations, it might make sense to compare the explained variance of nested models, e.g. price ~ age vs. price ~ age + shop. While this would not solve the colinearity issue either, it could provide some understanding of the importance of the shop effect relative to the age effect. This could also be extended to non-linear models.
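A sketch of what this comparison could look like for a single draw, once more assuming the hypothetical draw_dataset helper:

```python
import statsmodels.formula.api as smf

df = draw_dataset("no overlap")
# R² of the reduced model (age only) vs. the full model (age + shop).
r2_age = smf.ols("price ~ age", data=df).fit().rsquared
r2_full = smf.ols("price ~ age + shop", data=df).fit().rsquared
print(f"R² price ~ age:        {r2_age:.3f}")
print(f"R² price ~ age + shop: {r2_full:.3f}")
```

The gap between the two R² values gives a rough sense of how much the shop feature adds on top of age, even when the individual coefficient estimates are unstable.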
