Linear regression is a widely used statistical method for modeling the relationship between predictor variables and a dependent variable. Practitioners are typically advised to check several assumptions, one of which is avoiding strong collinearity among the predictor variables.
Here, we use a practical example to explore what happens when this assumption is violated:
Alice wants to buy wine aged between 5 and 15 years from either shop A or shop B, and she wants to compare their prices. She decides this is an ideal scenario to quickly fit a linear regression model (price ~ age + shop) and collects a few data points from each shop.
If the two shops do not offer wines across the same age range, the variables "shop" and "age" are strongly collinear.
Let's generate some synthetic data to explore the effect of collinearity in three different scenarios:
- "full overlap" - both shops offer wines across the same age range; no collinearity.
- "no overlap" - shop A only sells young wines up to 10 years, while shop B is more high-end, with wines aged more than 10 years; high collinearity.
- finally, an intermediate "small overlap" scenario - medium collinearity.
We define the true data distribution as follows:
$$
price = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot \mathbf{1}(\text{shop} = \text{'B'}) + \varepsilon
$$
with true values $\beta_0 = 10$, $\beta_1 = 1$, $\beta_2 = 5$, and random noise $\varepsilon$.
Illustration of a single draw from each scenario.
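The sampling procedure can be sketched as follows. This is a minimal numpy version; the exact age ranges per scenario, the sample sizes, and the noise level are assumptions for illustration, since only the age cut-offs at 5, 10, and 15 years are stated above.

```python
import numpy as np

def sample_data(scenario, n_per_shop=30, sigma=1.0, seed=None):
    """Draw one dataset from price = 10 + 1*age + 5*1(shop='B') + noise.

    Assumed age ranges per scenario (years):
      full overlap  - shop A: 5-15, shop B: 5-15
      small overlap - shop A: 5-11, shop B: 9-15
      no overlap    - shop A: 5-10, shop B: 10-15
    """
    rng = np.random.default_rng(seed)
    ranges = {
        "full overlap":  ((5, 15), (5, 15)),
        "small overlap": ((5, 11), (9, 15)),
        "no overlap":    ((5, 10), (10, 15)),
    }
    (a_lo, a_hi), (b_lo, b_hi) = ranges[scenario]
    age = np.concatenate([rng.uniform(a_lo, a_hi, n_per_shop),
                          rng.uniform(b_lo, b_hi, n_per_shop)])
    shop_b = np.repeat([0.0, 1.0], n_per_shop)   # indicator 1(shop == 'B')
    price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, sigma, 2 * n_per_shop)
    return age, shop_b, price
```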
Let's start by sampling one dataset from each scenario and running a collinearity check: we compute the VIF (Variance Inflation Factor) for the shop feature. The VIF quantifies collinearity by measuring how much of the variance of a feature can be explained by the other features. As a rule of thumb, a value of 1 is ideal, while values of 5 or more indicate high collinearity.
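Computing the VIF by hand takes only a few lines: regress the feature of interest on the remaining features (with an intercept) and apply $1/(1 - R^2)$. A numpy-only sketch:

```python
import numpy as np

def vif(feature, others):
    """Variance inflation factor of `feature` given the other predictors.

    Regress `feature` on `others` (plus an intercept) and return 1/(1 - R^2).
    """
    X = np.column_stack([np.ones(len(feature)), others])
    beta = np.linalg.lstsq(X, feature, rcond=None)[0]
    resid = feature - X @ beta
    r2 = 1.0 - resid.var() / feature.var()
    return 1.0 / (1.0 - r2)
```

For the shop feature, `others` is just the age column; with more predictors it would be the full matrix of remaining features.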
| Scenario | VIF (shop) |
|---|---|
| Full overlap | 1.03 |
| Small overlap | 4.15 |
| No overlap | 5.65 |
We observe that this test would have successfully warned us to be careful with the "small overlap" scenario, and even more so with the "no overlap" scenario.
Focusing on the most problematic "no overlap" dataset, let's fit a linear regression and look at the coefficients.
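Fitting the regression needs nothing beyond ordinary least squares on the design matrix `[1, age, shop_b]`. A sketch on one simulated "no overlap" draw (the age ranges, sample size, and noise level are again assumptions):

```python
import numpy as np

rng = np.random.default_rng(42)

# One "no overlap" draw: shop A ages 5-10, shop B ages 10-15 (assumed).
age = np.concatenate([rng.uniform(5, 10, 30), rng.uniform(10, 15, 30)])
shop_b = np.repeat([0.0, 1.0], 30)
price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 60)

# OLS: solve for [intercept, age coef, shop B coef].
X = np.column_stack([np.ones(60), age, shop_b])
intercept, coef_age, coef_shop = np.linalg.lstsq(X, price, rcond=None)[0]
print(f"intercept={intercept:.2f}, age={coef_age:.2f}, shop B={coef_shop:.2f}")
```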
For this particular draw of data, we get the following estimates
| Parameter | Estimate | True Value |
|---|---|---|
| Intercept | 10.70 | 10.00 |
| Coef (Age) | 0.93 | 1.00 |
| Coef (Shop B) | 5.23 | 5.00 |
We can see that the coefficient for shop B is a bit too high, and the coefficient for age is too low. To get a better understanding of the situation, let's look at the distribution of the estimates across multiple draws.
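One way to sketch this simulation (again under the assumed ranges and noise level): fit OLS on many fresh draws per scenario and compare the spread of the estimates.

```python
import numpy as np

def estimate_draws(a_range, b_range, n_rep=500, n=30, sigma=1.0, seed=0):
    """Return an (n_rep, 3) array of [intercept, age, shop B] OLS estimates."""
    rng = np.random.default_rng(seed)
    out = np.empty((n_rep, 3))
    for i in range(n_rep):
        age = np.concatenate([rng.uniform(*a_range, n), rng.uniform(*b_range, n)])
        shop_b = np.repeat([0.0, 1.0], n)
        price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, sigma, 2 * n)
        X = np.column_stack([np.ones(2 * n), age, shop_b])
        out[i] = np.linalg.lstsq(X, price, rcond=None)[0]
    return out

full = estimate_draws((5, 15), (5, 15))    # no collinearity
none = estimate_draws((5, 10), (10, 15))   # high collinearity
```

Comparing `full.mean(axis=0)` and `none.mean(axis=0)` against the true values (10, 1, 5), and the corresponding `.std(axis=0)`, reproduces the unbiased-but-noisier pattern described below.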
We see an increase in the variance of all estimates as the overlap in age decreases and collinearity increases. However, the estimates look unbiased.
To get a clearer picture of what is happening, let's look at the scatterplot between age coefficient and shop B coefficient:
In the full-overlap datasets, there is no correlation between the two coefficient estimates - after all, age and shop contribute to the price independently. In the small- and no-overlap scenarios, there is an inverse correlation: when the coefficient for age increases, the shop coefficient decreases, and vice versa.
This shows that the model has trouble disentangling the effects of shop and age: together, the two coefficients sum up to the right (unbiased) estimate, but the effect is incorrectly distributed between them.
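The same trade-off shows up numerically as the correlation between the age and shop-B estimates across repeated draws (same assumed setup as before):

```python
import numpy as np

def coef_corr(a_range, b_range, n_rep=500, n=30, seed=1):
    """Correlation between the age and shop-B OLS estimates across draws."""
    rng = np.random.default_rng(seed)
    coefs = np.empty((n_rep, 2))
    for i in range(n_rep):
        age = np.concatenate([rng.uniform(*a_range, n), rng.uniform(*b_range, n)])
        shop_b = np.repeat([0.0, 1.0], n)
        price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 2 * n)
        X = np.column_stack([np.ones(2 * n), age, shop_b])
        coefs[i] = np.linalg.lstsq(X, price, rcond=None)[0][1:]
    return np.corrcoef(coefs.T)[0, 1]

r_full = coef_corr((5, 15), (5, 15))     # close to 0
r_none = coef_corr((5, 10), (10, 15))    # strongly negative
```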
As the distribution across new data draws is unbiased, I wonder if we could do bootstrapping to recover the true estimates.
A brief experiment shows that this doesn't work - we recover almost exactly the same estimates.
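A sketch of that experiment: hold one "no overlap" dataset fixed, resample its rows with replacement, refit, and compare the average bootstrap estimate to the original fit (dataset assumptions as before).

```python
import numpy as np

rng = np.random.default_rng(7)

# One fixed "no overlap" dataset.
age = np.concatenate([rng.uniform(5, 10, 30), rng.uniform(10, 15, 30)])
shop_b = np.repeat([0.0, 1.0], 30)
price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 60)
X = np.column_stack([np.ones(60), age, shop_b])
ols = np.linalg.lstsq(X, price, rcond=None)[0]

# Bootstrap: resample rows with replacement, refit, average.
boot = np.empty((1000, 3))
for i in range(1000):
    idx = rng.integers(0, 60, 60)
    boot[i] = np.linalg.lstsq(X[idx], price[idx], rcond=None)[0]

# boot.mean(axis=0) lands almost exactly on `ols` - no new information gained.
```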
In hindsight this is unsurprising - bootstrapping doesn't introduce additional information. OLS is unbiased and not path-dependent, so we can only recover the distribution implied by the data.
What does this mean for Alice?
It depends on what Alice wants to achieve.
Collinearity is mostly a problem for coefficient estimates, not for goodness of fit. As such, if Alice uses the regression to identify a good deal (wines priced lower than predicted), collinearity should not be an issue. Caveat: predictions outside the range of the collected data are unreliable due to unstable coefficient estimates.
However, if Alice wants to know which shop is more expensive (controlling for age), collinearity is an issue. In this case, the only robust solution is to collect additional data covering overlapping ages at both shops. Without such data, Alice simply lacks sufficient information to resolve the ambiguity.
Idea for further exploration: in some situations, it might make sense to compare the explained variance of nested models, e.g. price ~ age vs. price ~ age + shop. While this would not solve the collinearity issue either, it could provide some understanding of the importance of the shop effect relative to the age effect. This could also be extended to non-linear models.
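A sketch of that comparison on one simulated "no overlap" draw (same assumed data-generating setup): fit both nested models and compare their in-sample $R^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
age = np.concatenate([rng.uniform(5, 10, 30), rng.uniform(10, 15, 30)])
shop_b = np.repeat([0.0, 1.0], 30)
price = 10.0 + age + 5.0 * shop_b + rng.normal(0.0, 1.0, 60)

def r2(X, y):
    """In-sample R^2 of an OLS fit (X must include the intercept column)."""
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    resid = y - X @ beta
    return 1.0 - resid.var() / y.var()

ones = np.ones(60)
r2_age = r2(np.column_stack([ones, age]), price)            # price ~ age
r2_both = r2(np.column_stack([ones, age, shop_b]), price)   # price ~ age + shop
```

Because age and shop are nearly collinear here, `r2_age` already captures most of the variance; the gap to `r2_both` gives a rough sense of the incremental contribution of the shop effect.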