Linear regression is a widely used statistical method for modeling the relationship between predictor variables and a dependent variable. Practitioners are typically advised to check several assumptions, one of which is avoiding strong collinearity among the predictor variables.
Here, we use a practical example to explore what happens when this assumption is violated:
Alice wants to buy wine aged between 5 and 15 years from either shop A or shop B, and she wants to compare their prices. She decides this is an ideal scenario to quickly fit a linear regression model (price ~ age + shop) and collects a few data points from each shop.
If the two shops do not offer wines across the same age range, the variables "shop" and "age" are strongly collinear.
Let's generate some synthetic data to explore the effect of collinearity in three different scenarios:
- "full overlap" - both shops offer wines across the same age range; no collinearity.
- "no overlap" - shop A only sells young wines up to 10 years, while shop B is more high-end, with wines aged more than 10 years; high collinearity.
- finally, an intermediate "small overlap" scenario - medium collinearity.
We define the true data distribution as follows:
$$
price = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot \mathbf{1}(\text{shop} = \text{'B'}) + \varepsilon
$$
with true values $\beta_0 = 10$, $\beta_1 = 1$, $\beta_2 = 5$, and random noise $\varepsilon$.
Illustration of a single draw from each scenario.
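The sampling procedure can be sketched as follows. This is a minimal numpy version; the exact age ranges per scenario, the sample sizes, and the noise level are assumptions for illustration, since only the age cut-offs at 5, 10, and 15 years are stated above.

```python
import numpy as np

def sample_data(scenario, n_per_shop=30, sigma=1.0, seed=None):
    """Draw one dataset from price = 10 + 1*age + 5*1(shop='B') + noise.

    Assumed age ranges per scenario (years):
      full overlap  - shop A: 5-15, shop B: 5-15
      small overlap - shop A: 5-11, shop B: 9-15
      no overlap    - shop A: 5-10, shop B: 10-15
    """
    rng = np.random.default_rng(seed)
    ranges = {
        "full overlap":  ((5, 15), (5, 15)),
        "small overlap": ((5, 11), (9, 15)),
        "no overlap":    ((5, 10), (10, 15)),
    }
    (a_lo, a_hi), (b_lo, b_hi) = ranges[scenario]
    age = np.concatenate([rng.uniform(a_lo, a_hi, n_per_shop),
                          rng.uniform(b_lo, b_hi, n_per_shop)])
    shop_b = np.repeat([0.0, 1.0], n_per_shop)   # indicator 1(shop == 'B')
    price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, sigma, 2 * n_per_shop)
    return age, shop_b, price
```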
Let's start by sampling one dataset from each scenario and running a collinearity check: we compute the VIF (Variance Inflation Factor) for the shop feature. The VIF quantifies collinearity by measuring how much of the variance of a feature can be explained by the other features. As a rule of thumb, a value of 1 is ideal, while values of 5 or more indicate high collinearity.
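Computing the VIF by hand takes only a few lines: regress the feature of interest on the remaining features (with an intercept) and apply $1/(1 - R^2)$. A numpy-only sketch:

```python
import numpy as np

def vif(feature, others):
    """Variance inflation factor of `feature` given the other predictors.

    Regress `feature` on `others` (plus an intercept) and return 1/(1 - R^2).
    """
    X = np.column_stack([np.ones(len(feature)), others])
    beta = np.linalg.lstsq(X, feature, rcond=None)[0]
    resid = feature - X @ beta
    r2 = 1.0 - resid.var() / feature.var()
    return 1.0 / (1.0 - r2)
```

For the shop feature, `others` is just the age column; with more predictors it would be the full matrix of remaining features.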
| Scenario | VIF (shop) |
|---|---|
| Full overlap | 1.03 |
| Small overlap | 4.15 |
| No overlap | 5.65 |
We observe that this test would have successfully warned us to be careful with the "small overlap" scenario, and even more so with the "no overlap" scenario.
Focusing on the most problematic "no overlap" dataset, let's fit a linear regression and look at the coefficients.
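Fitting the regression needs nothing beyond ordinary least squares on the design matrix `[1, age, shop_b]`. A sketch on one simulated "no overlap" draw (the age ranges, sample size, and noise level are again assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# One "no overlap" draw: shop A ages 5-10, shop B ages 10-15 (assumed).
age = np.concatenate([rng.uniform(5, 10, 30), rng.uniform(10, 15, 30)])
shop_b = np.repeat([0.0, 1.0], 30)
price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 60)

# OLS: solve for [intercept, age coef, shop B coef].
X = np.column_stack([np.ones(60), age, shop_b])
intercept, coef_age, coef_shop = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"intercept={intercept:.2f}, age={coef_age:.2f}, shop B={coef_shop:.2f}")
```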
For this particular draw of data, we get the following estimates
| Parameter | Estimate | True Value |
|---|---|---|
| Intercept | 10.70 | 10.00 |
| Coef (Age) | 0.93 | 1.00 |
| Coef (Shop B) | 5.23 | 5.00 |
We can see that the coefficient for shop B is a bit too high, and the coefficient for age is too low. To get a better understanding of the situation, let's look at the distribution of the estimates across multiple draws.
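One way to sketch this simulation (again under the assumed ranges and noise level): fit OLS on many fresh draws per scenario and compare the spread of the estimates.

```python
import numpy as np

def estimate_draws(a_range, b_range, n_rep=500, n=30, sigma=1.0, seed=0):
    """Return an (n_rep, 3) array of [intercept, age, shop B] OLS estimates."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_rep, 3))
    for i in range(n_rep):
        age = np.concatenate([rng.uniform(*a_range, n), rng.uniform(*b_range, n)])
        shop_b = np.repeat([0.0, 1.0], n)
        price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, sigma, 2 * n)
        X = np.column_stack([np.ones(2 * n), age, shop_b])
        out[i] = np.linalg.lstsq(X, price, rcond=None)[0]
    return out

full = estimate_draws((5, 15), (5, 15))    # no collinearity
none = estimate_draws((5, 10), (10, 15))   # high collinearity
```

Comparing `full.mean(axis=0)` and `none.mean(axis=0)` against the true values (10, 1, 5), and the corresponding `.std(axis=0)`, reproduces the unbiased-but-noisier pattern described below.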
We see an increase in the variance of all estimates as the overlap in age decreases and collinearity increases. However, the estimates look unbiased.
To get a clearer picture of what is happening, let's look at the scatterplot between age coefficient and shop B coefficient:
In the full-overlap datasets, there is no correlation between the two coefficient estimates - after all, age and shop contribute to the price independently. In the small- and no-overlap scenarios, there is an inverse correlation: when the coefficient for age increases, the shop coefficient decreases, and vice versa.
This shows that the model has trouble disentangling the effects of shop and age: together, the two coefficients sum up to the right (unbiased) estimate, but the effect is incorrectly distributed between them.
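The same trade-off shows up numerically as the correlation between the age and shop-B estimates across repeated draws (same assumed setup as before):

```python
import numpy as np

def coef_corr(a_range, b_range, n_rep=500, n=30, seed=1):
    """Correlation between the age and shop-B OLS estimates across draws."""
    rng = np.random.default_rng(seed)
    coefs = np.empty((n_rep, 2))
    for i in range(n_rep):
        age = np.concatenate([rng.uniform(*a_range, n), rng.uniform(*b_range, n)])
        shop_b = np.repeat([0.0, 1.0], n)
        price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 2 * n)
        X = np.column_stack([np.ones(2 * n), age, shop_b])
        coefs[i] = np.linalg.lstsq(X, price, rcond=None)[0][1:]
    return np.corrcoef(coefs.T)[0, 1]

r_full = coef_corr((5, 15), (5, 15))     # close to 0
r_none = coef_corr((5, 10), (10, 15))    # strongly negative
```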
As the distribution across new data draws is unbiased, I wonder if we could do bootstrapping to recover the true estimates.
A brief experiment shows that this doesn't work - we recover almost exactly the same estimates.
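A sketch of that experiment: hold one "no overlap" dataset fixed, resample its rows with replacement, refit, and compare the average bootstrap estimate to the original fit (dataset assumptions as before).

```python
import numpy as np

rng = np.random.default_rng(7)

# One fixed "no overlap" dataset.
age = np.concatenate([rng.uniform(5, 10, 30), rng.uniform(10, 15, 30)])
shop_b = np.repeat([0.0, 1.0], 30)
price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 60)
X = np.column_stack([np.ones(60), age, shop_b])
ols = np.linalg.lstsq(X, price, rcond=None)[0]

# Bootstrap: resample rows with replacement, refit, average.
boot = np.empty((1000, 3))
for i in range(1000):
    idx = rng.integers(0, 60, 60)
    boot[i] = np.linalg.lstsq(X[idx], price[idx], rcond=None)[0]

# boot.mean(axis=0) lands almost exactly on `ols` - no new information gained.
```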
In hindsight this is unsurprising - bootstrapping doesn't introduce additional information. OLS is unbiased and not path-dependent, so we can only recover the distribution implied by the data.
What does this mean for Alice?
It depends on what Alice wants to achieve.
Collinearity is mostly a problem for coefficient estimates, not for goodness of fit. As such, if Alice uses the regression to identify a good deal (wines priced lower than predicted), collinearity should not be an issue. Caveat: predictions outside the range of the collected data are unreliable due to unstable coefficient estimates.
However, if Alice wants to know which shop is more expensive (controlling for age), collinearity is an issue. In this case, the only robust solution is to collect additional data covering overlapping ages at both shops. Without such data, Alice simply lacks sufficient information to resolve the ambiguity.
Idea for further exploration: in some situations, it might make sense to compare the explained variance of nested models, e.g. price ~ age vs. price ~ age + shop. While this would not solve the collinearity issue either, it could provide some understanding of the importance of the shop effect relative to the age effect. This could also be extended to non-linear models.
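A sketch of that comparison on one simulated "no overlap" draw (same assumed data-generating setup): fit both nested models and compare their in-sample $R^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
age = np.concatenate([rng.uniform(5, 10, 30), rng.uniform(10, 15, 30)])
shop_b = np.repeat([0.0, 1.0], 30)
price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 60)

def r2(X, y):
    """In-sample R^2 of an OLS fit (X must include the intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

ones = np.ones(60)
r2_age = r2(np.column_stack([ones, age]), price)            # price ~ age
r2_both = r2(np.column_stack([ones, age, shop_b]), price)   # price ~ age + shop
```

Because age and shop are nearly collinear here, `r2_age` already captures most of the variance; the gap to `r2_both` gives a rough sense of the incremental contribution of the shop effect.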