Multicollinearity in Multiple Linear Regression

Mari Galdina
Published in Analytics Vidhya
Jan 17, 2021



I often hear different variants of questions about multicollinearity in linear regression in interviews. They sound like: How would you tackle multicollinearity in multiple linear regression? How do you solve for multicollinearity? Why is multicollinearity a potential problem? I decided to bring these questions together and find a precise answer to them.

Let’s start from the beginning. What is multiple linear regression? Multiple linear regression (MLR), or multiple regression, is an extension of simple linear regression. We use it to estimate the relationship between two or more independent variables and one dependent variable. We can write the MLR model with the formula:

y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ + ε

Where:

  • y is the dependent variable;
  • X₁ … Xₚ are the independent variables;
  • β₀ is the intercept (the value of y when every X is zero);
  • β₁ … βₚ are the regression coefficients of the independent variables;
  • ε is the error term (the model residual).

Multiple regression is closer to real life than simple regression because it can show how strong the relationship between the variables is and how they interact with each other. For example, how rainfall, temperature, and the amount of fertilizer added affect plant growth, or how the interest rate and unemployment affect a stock index price.

Use MLR when you need to know:

— How strong the relationship is between two or more independent variables and one dependent variable.

— The value of the dependent variable at a certain value of the independent variables.
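In Python, a minimal sketch of fitting an MLR could look like the following (the file name plants.csv and the column names are hypothetical, chosen to match the plant-growth example above):

# a minimal sketch of fitting a multiple linear regression with statsmodels
# (the data file and the column names below are hypothetical)
import pandas as pd
import statsmodels.api as sm

data = pd.read_csv('plants.csv')
X = data[['Rainfall', 'Temperature', 'Fertilizer']]  # independent variables
X = sm.add_constant(X)                               # adds the intercept term
y = data['Growth']                                   # dependent variable

model = sm.OLS(y, X).fit()   # ordinary least squares fit
print(model.summary())       # coefficients, R-squared, p-values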

Why is Multicollinearity a Potential Problem?

Every regression analysis has the goal of finding and isolating the relationship between each independent variable and the dependent variable. These relationships are expressed in the regression coefficients. When we interpret a regression coefficient, we know that it represents the mean change in the dependent variable for each one-unit change in an independent variable, holding all of the other independent variables constant. But sometimes we face a phenomenon where one independent variable is highly correlated with one or more of the other independent variables in the multiple regression equation, that is, it can be linearly predicted from them. This makes the model harder to interpret and its results less trustworthy.

Multicollinearity occurs when two or more independent variables are highly correlated with one another in a regression model.

How can we detect multicollinearity in our data?

There are two simple ways to detect multicollinearity in a dataset during EDA using Python:

  1. Variance Inflation Factor (VIF).
  2. Heat map or correlation matrix.

The Variance Inflation Factor is a measure of collinearity among the predictor variables within a multiple regression. It is calculated for each predictor as the ratio of the variance of that predictor’s coefficient in the full model to the variance of the coefficient if the predictor were fit alone. Equivalently, VIF = 1 / (1 − R²), where R² comes from regressing that predictor on all of the other predictors.
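To make that definition concrete, here is a minimal sketch that computes the VIF of a single predictor from first principles (the helper name vif_by_hand is hypothetical, and X is assumed to be a DataFrame that holds only the independent variables):

# a minimal sketch: VIF of one predictor computed from its definition
# (assumes X is a pandas DataFrame containing only the independent variables)
import statsmodels.api as sm

def vif_by_hand(X, column):
    # regress the chosen predictor on all the other predictors
    others = sm.add_constant(X.drop(columns=[column]))
    r_squared = sm.OLS(X[column], others).fit().rsquared
    # VIF = 1 / (1 - R^2): the better the other predictors explain this one,
    # the larger the inflation factor
    return 1.0 / (1.0 - r_squared)

# hypothetical usage once X is built below: vif_by_hand(X, 'Height')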

For example, creating a variable for BMI from the height and weight variables would include redundant information in the model.

Dataset for BMI

The dataset has three independent variables:

  • gender — we can create dummies for this column
  • height
  • weight

I use the variance_inflation_factor function from statsmodels.stats.outliers_influence to calculate the VIF for each feature. If the VIF is between 5 and 10, multicollinearity is likely to be present, and you should consider dropping the variable.

# import necessary libraries
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# the independent variables set
# (Gender is assumed to be already encoded as a numeric dummy variable)
X = data[['Gender', 'Height', 'Weight']]

# VIF dataframe
vif_df = pd.DataFrame()
vif_df["feature"] = X.columns

# calculating VIF for each feature
vif_df["VIF"] = [variance_inflation_factor(X.values, i)
                 for i in range(len(X.columns))]

print(vif_df)

As we can see here, the Height and Weight variables are highly correlated because their VIF values are greater than 10.

The heat map is a correlation matrix with a color-gradient background. The scale runs from 0 to 1 (in absolute value), with 1 being perfectly correlated:

  • values between 0.9 and 1.0 indicate very highly correlated variables;
  • values between 0.7 and 0.9 indicate variables that can be considered highly correlated;
  • values between 0.5 and 0.7 indicate variables that can be considered moderately correlated;
  • values between 0.3 and 0.5 indicate low correlation.
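A minimal sketch of such a heat map in Python, reusing the DataFrame X of independent variables built above (seaborn is one common choice for drawing it):

# a minimal sketch: correlation heat map for the independent variables
# (assumes X is the DataFrame of predictors from the VIF example above)
import matplotlib.pyplot as plt
import seaborn as sns

corr = X.corr().abs()   # absolute correlations, so the scale runs from 0 to 1
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=0, vmax=1)
plt.title('Correlation matrix of independent variables')
plt.show()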

How can we fix multicollinearity in our data?

When we want to avoid highly correlated variables in our prediction, we can use one of these solutions:

  1. Feature Engineering.
  2. Drop One.

Feature engineering means aggregating or combining the two highly correlated features into one variable.
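For the BMI example above, a minimal sketch of this approach could look like the following (it assumes Height is in metres and Weight in kilograms, and reuses X and the imports from the VIF code above):

# a minimal sketch: combine the two correlated features into one and drop them
# (assumes Height is in metres and Weight in kilograms)
X_fe = X.copy()
X_fe['BMI'] = X_fe['Weight'] / (X_fe['Height'] ** 2)
X_fe = X_fe.drop(columns=['Height', 'Weight'])

# recompute VIF on the engineered features to check that the collinearity is gone
vif_fe = pd.DataFrame()
vif_fe["feature"] = X_fe.columns
vif_fe["VIF"] = [variance_inflation_factor(X_fe.values, i)
                 for i in range(len(X_fe.columns))]
print(vif_fe)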

Dropping one variable only sounds easy; it needs a proper EDA before you decide which variable you should remove.
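A minimal sketch of the drop-one approach, again reusing X and the VIF code from above (dropping Height here is only an illustration; your EDA should decide which variable goes):

# a minimal sketch: drop one of the two correlated features and recheck VIF
# (dropping Height is only an example; EDA should drive the choice)
X_drop = X.drop(columns=['Height'])

vif_drop = pd.DataFrame()
vif_drop["feature"] = X_drop.columns
vif_drop["VIF"] = [variance_inflation_factor(X_drop.values, i)
                   for i in range(len(X_drop.columns))]
print(vif_drop)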

Summary

Multicollinearity is an interesting problem to solve. When we are checking for multicollinearity, we should check multiple indicators and look for patterns among them.
