Multi-Collinearity: Problems and Solutions
Part I
Modelling the drivers of anything is a costly process. What clients want from marketing models is a clean, hierarchical list of mutually exclusive drivers. Instead, invariably what they receive is a list of drivers suffering from some degree of multi-collinearity. Multi-collinearity is said to exist when there is a linear relationship amongst the drivers (predictor variables) within the same regression model. Multi-collinearity becomes a problem when the predictor variables are so highly correlated that it becomes challenging to distinguish their individual explanatory power.
Take, for example, a manager who wishes to explain customer satisfaction with provision of information. Two of the significant predictors are satisfaction with brochures and satisfaction with the website. If these two predictors are highly correlated, the regression model estimates may be unstable and highly dependent on the presence or absence of other predictor variables.
Given that the purpose of the study is to estimate the contributions of individual predictors of satisfaction with provision of information, multi-collinearity detracts from the ability of the manager to allocate resources in line with each variable's importance. In this context, multi-collinearity represents duplication of effort for no additional return.
In practice, regression models will almost always suffer from some degree of multi-collinearity. The ideal model is one where each predictor variable is linearly related to, say, satisfaction with provision of information but not to the other predictors. However, because the predictors all have the dependent variable in common, some correlation among them is to be expected.
Often, where multi-collinearity is not considered to be dangerously high, it is allowed to remain in the model. Whilst it might not be causing estimate instability problems, it may nonetheless represent budgetary inefficiencies.
Technically speaking, multiple regression analysis is a statistical technique that is used to analyse the relationship between a single dependent variable and a set of independent (predictor) variables. The objective of such analysis is two-fold:
- To predict the unknown dependent variable using the known, predictor variables
- To identify the relative contribution each predictor variable makes to this prediction.
This process, however, can be complicated by the presence of multi-collinearity. Multi-collinearity limits the explanatory power of the model (R²) and makes it harder to add unique explanatory power with additional variables. It also makes it more difficult to determine the individual contribution made by each predictor variable because their effects overlap.
Take, for example, a predictor variable, X1, that has a correlation of 0.70 with the dependent variable and a second predictor variable, X2, that has a correlation of 0.40. If there were no multi-collinearity, X1 would explain 49% (0.70 squared) of the variance in the dependent variable and X2 would explain 16% (0.40 squared). With no overlap, together they contribute 65% unique variance explained. Any overlap, or multi-collinearity, would reduce the collective predictive power of the independent variables because some of the predictive power is being shared.
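The arithmetic in this example can be checked with the standard formula for the R² of a two-predictor regression written in terms of correlations. The sketch below is illustrative only: the function name and the overlap values of 0.3 and 0.5 are assumptions chosen for the example, not figures from any study.

```python
def r_squared_two_predictors(r_y1: float, r_y2: float, r_12: float) -> float:
    """R-squared for a regression of y on two predictors.

    r_y1, r_y2: correlation of each predictor with the dependent variable.
    r_12: correlation between the two predictors (the overlap).
    """
    if abs(r_12) >= 1.0:
        raise ValueError("perfectly collinear predictors: R-squared is undefined")
    return (r_y1**2 + r_y2**2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12**2)


# X1 correlates 0.70 and X2 correlates 0.40 with the dependent variable.
# The overlap values 0.3 and 0.5 are hypothetical.
for r_12 in (0.0, 0.3, 0.5):
    r2 = r_squared_two_predictors(0.70, 0.40, r_12)
    print(f"overlap {r_12:.1f} -> R-squared {r2:.3f}")
# overlap 0.0 -> R-squared 0.650
# overlap 0.3 -> R-squared 0.530
# overlap 0.5 -> R-squared 0.493
```

With no overlap the two predictors deliver the full 65%. For these particular correlations, each increase in overlap lowers the joint R², which is the sharing of predictive power described above.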
A useful approach to assessing multi-collinearity commonly involves a two-step process:
- Identify the degree of collinearity
- Assess the extent to which the parameter estimates are impacted.
The degree of multi-collinearity can be gauged simply by examining a correlation matrix of the predictor variables. Correlations between predictor variables of 0.50 – 0.75 are described as moderate to strong multi-collinearity, with 0.90 or above reflecting extreme multi-collinearity.
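A correlation-matrix screen of this kind might be sketched as follows. The data are simulated and the predictor names are hypothetical. The cut-offs follow the rules of thumb above, and treating the unlabelled band between 0.75 and 0.90 as "moderate to strong" is an assumption made for the sketch.

```python
import numpy as np


def flag_correlated_pairs(X, names, moderate=0.50, extreme=0.90):
    """Return predictor pairs whose absolute correlation reaches a cut-off."""
    corr = np.corrcoef(np.asarray(X, dtype=float), rowvar=False)
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            r = float(corr[i, j])
            if abs(r) >= extreme:
                label = "extreme"
            elif abs(r) >= moderate:
                label = "moderate to strong"
            else:
                continue  # below the lowest cut-off: not flagged
            flagged.append((names[i], names[j], round(r, 2), label))
    return flagged


# Simulated satisfaction scores (hypothetical): website is built to
# correlate with brochures; call_centre is independent of both.
rng = np.random.default_rng(0)
brochures = rng.standard_normal(200)
website = 0.8 * brochures + 0.6 * rng.standard_normal(200)
call_centre = rng.standard_normal(200)

X = np.column_stack([brochures, website, call_centre])
for pair in flag_correlated_pairs(X, ["brochures", "website", "call_centre"]):
    print(pair)
```

A pairwise screen only detects overlap between two variables at a time, which is one reason the impact measures that follow are used alongside it.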
The impact of the multi-collinearity is commonly assessed by the following two measures:
- Tolerance value and/or the Variance Inflation Factor (VIF), which is the inverse of the tolerance value
- Condition Index (CI).
Tolerance and the VIF tell us the extent to which each predictor variable is explained by the remaining predictor variables. The tolerance of a predictor is one minus the R² obtained when that predictor is regressed on all the other predictors. High tolerance values indicate little multi-collinearity and correspond to a low VIF value. Common rules of thumb describe tolerance values as high if they are greater than 0.2 and the VIF as low if it is 5.0 or below.
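As a minimal sketch of how tolerance and the VIF can be computed, the code below regresses each predictor on the others using ordinary least squares. The function name, the simulated data and the predictor names are all hypothetical. Statistical packages provide their own routines and may differ in detail.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return a (tolerance, VIF) pair for each column of X."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    results = []
    for j in range(p):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # intercept + other predictors
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        centred = target - target.mean()
        r2 = 1.0 - (resid @ resid) / (centred @ centred)
        tolerance = 1.0 - r2          # share of the predictor NOT explained by the others
        results.append((tolerance, 1.0 / tolerance))  # VIF is the inverse of tolerance
    return results


# Simulated predictors (hypothetical names), two of them overlapping.
rng = np.random.default_rng(0)
brochures = rng.standard_normal(200)
website = 0.8 * brochures + 0.6 * rng.standard_normal(200)
call_centre = rng.standard_normal(200)
X = np.column_stack([brochures, website, call_centre])

for name, (tol, vif) in zip(["brochures", "website", "call_centre"], tolerance_and_vif(X)):
    verdict = "ok" if tol > 0.2 else "check"   # rule of thumb: tolerance > 0.2, VIF <= 5
    print(f"{name:12s} tolerance={tol:.2f} VIF={vif:.2f} {verdict}")
```

Because the VIF is the inverse of tolerance, the two rules of thumb are the same test: a tolerance of 0.2 corresponds exactly to a VIF of 5.0.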
Condition Indices indicate the multi-collinearity of combinations of the predictor variables in the data. The CI rule-of-thumb cut-off for tolerable levels of multi-collinearity ranges from 15 to 30, with 30 being the most commonly used.
Again, we can follow a two-step approach:
- Identify any condition indices (each one corresponds to a combination, or dimension, of the predictors) that exceed the given cut-off, say 30
- For each of those, identify which variables have variance proportions greater than 0.90. Multi-collinearity is problematic if this occurs for two or more variables.
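The two steps above can be sketched with a singular value decomposition of the predictor matrix. Scaling each column to unit length and keeping an intercept column is one common convention, and packages differ on this, so treat the choices here as assumptions. The function names, data and predictor names are hypothetical.

```python
import numpy as np


def collinearity_diagnostics(X, names, add_intercept=True):
    """Return condition indices, variance proportions and row labels."""
    X = np.asarray(X, dtype=float)
    labels = list(names)
    if add_intercept:
        X = np.column_stack([np.ones(X.shape[0]), X])
        labels = ["intercept"] + labels
    scaled = X / np.linalg.norm(X, axis=0)       # each column scaled to unit length
    _, d, vt = np.linalg.svd(scaled, full_matrices=False)
    cond_idx = d.max() / d                        # one index per dimension
    phi = (vt.T ** 2) / d**2                      # phi[j, k]: variable j, dimension k
    props = phi / phi.sum(axis=1, keepdims=True)  # each variable's row sums to 1
    return cond_idx, props, labels


def problematic_dimensions(cond_idx, props, labels, ci_cutoff=30.0, prop_cutoff=0.90):
    """Step 1: dimensions with CI above the cut-off.
    Step 2: keep those where two or more variables exceed the proportion cut-off."""
    findings = []
    for k, ci in enumerate(cond_idx):
        if ci <= ci_cutoff:
            continue
        involved = [labels[j] for j in range(len(labels)) if props[j, k] > prop_cutoff]
        if len(involved) >= 2:
            findings.append((float(ci), involved))
    return findings


# Simulated data (hypothetical): website is almost a copy of brochures.
rng = np.random.default_rng(0)
brochures = rng.standard_normal(200)
website = brochures + 0.02 * rng.standard_normal(200)
call_centre = rng.standard_normal(200)
X = np.column_stack([brochures, website, call_centre])

ci, props, labels = collinearity_diagnostics(X, ["brochures", "website", "call_centre"])
for ci_value, involved in problematic_dimensions(ci, props, labels):
    print(f"CI={ci_value:.0f} involves {involved}")
```

The condition index belongs to a dimension of the data rather than to a single variable. The variance proportions then show which variables are tied up in that dimension.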
Now we know what multi-collinearity is, why it is important and the conventional wisdom relating to how to measure it. In next month’s instalment, we’ll examine some other important factors that play a role in the multi-collinearity scenario and look at what can be done to address problematic multi-collinearity.
Part II
So far, we have discussed what multi-collinearity is, how to measure it and how to assess the degree of its impact on parameter estimates using single, diagnostic measures (such as the VIF & CI) and their associated rules of thumb.
That completed, we now need to look at possible solutions. Assuming that the multi-collinearity in a particular regression analysis has been identified as problematic, there are various data transformations that can be made. However, results based on transformed variables are harder for management to interpret intuitively.
The most attractive option to address multi-collinearity is simply to remove one of the offending variables, but which one? Empirically, the choice is well prescribed: of the two offending variables, the one that has the weaker relationship with the dependent variable is the one that should be removed. In practice, however, such a numerically driven decision may not be optimal. For example, the weaker variable may have been the subject of ongoing monitoring by the client, or large investments may have been made in marketing efforts relating to it. This would render its removal impractical, in which case the other variable would be the one to remove. In either case, after the removal it may be viable to redirect the investment to the inclusion of a new attribute that, if well chosen, may capture additional, unique explanatory power.
Better quality drafting of questions may also alleviate the problem. Avoiding collective and ill-defined terms like ‘responsiveness’ will enable respondents to be clearer about what they are scoring and therefore remove some of the multi-collinearity. Your R² (explanatory power) will probably fall but your actionability will rise.
Having examined the most commonly used diagnostic measures of, and possible solutions to, multi-collinearity, it is now possible to go one step further and consider its impact in light of other important factors in regression modelling. This is necessary because a primary weakness of the standard approach is that it attempts to examine the impact of multi-collinearity in isolation. Comprehensive research has shown that, instead, the consequences of multi-collinearity are mediated (or exacerbated, as the case may be) by other factors within the regression analysis environment. In brief, these factors are the explanatory power of the overall model (R²) and the sample size (n).
The findings show that where R² is low (0.25 – 0.50) and the sample size is small (n=30), even moderate multi-collinearity (0.5) can cause problems such as increased instability of estimates and reversal of coefficient signs.
Conversely, the potential negative effects of strong multi-collinearity (0.75) can be rendered negligible if the explanatory power of the model is high (R² = 0.75) and the sample size is large enough (n>300).
In addition, the rate of Type II error must also be considered when multi-collinearity is present in regression modelling. Marketers often use t-tests to determine whether predictor variables contribute significantly to the explanatory power of the model. Associated with such t-tests is Type II error: failing to detect a predictor that genuinely contributes. It has been shown that the chance of making a Type II error increases with increasing levels of multi-collinearity. Small sample sizes and low R² further exacerbate this problem.
A worst-case scenario would be high multi-collinearity, a small sample size and low R². In such a situation, the chance of drawing an incorrect inference is likely to be 50% or more! Clearly, no confidence can be placed in inferences drawn from such a regression model, which renders the results effectively useless.
Even when there is little multi-collinearity, the problem of Type II errors and misleading inference is severe if the sample size is small or the R² is low. Therefore, although multi-collinearity can indeed be a potential problem, its importance tends to be over-emphasised in comparison to the other factors that are likely to lead to incorrect inferences. Marketers cannot have confidence in inferences drawn from regression models if any combination of small sample size, low explanatory power or strong multi-collinearity exists.
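The interplay between sample size, R² and multi-collinearity can be explored with a small simulation. The sketch below is not a reproduction of the research referred to above; it is a simplified illustration under stated assumptions: two predictors with equal true coefficients, normally distributed data, and the large-sample critical value of 1.96 used for every sample size (slightly liberal at n=30). The function name and scenario labels are hypothetical.

```python
import numpy as np


def type_ii_error_rate(n, r_squared, rho, reps=500, seed=0):
    """Share of simulated samples in which a genuinely non-zero
    coefficient on X1 fails to reach significance (|t| < 1.96)."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(np.array([[1.0, rho], [rho, 1.0]]))
    signal_var = 2.0 + 2.0 * rho  # variance of X1 + X2 with unit coefficients
    # Noise level chosen so the population R-squared equals r_squared.
    noise_sd = np.sqrt(signal_var * (1.0 - r_squared) / r_squared)
    misses = 0
    for _ in range(reps):
        X = rng.standard_normal((n, 2)) @ chol.T   # predictors correlated at rho
        y = X[:, 0] + X[:, 1] + noise_sd * rng.standard_normal(n)
        design = np.column_stack([np.ones(n), X])
        coef, *_ = np.linalg.lstsq(design, y, rcond=None)
        resid = y - design @ coef
        sigma2 = (resid @ resid) / (n - 3)          # 3 estimated parameters
        cov_b = sigma2 * np.linalg.inv(design.T @ design)
        t_stat = coef[1] / np.sqrt(cov_b[1, 1])
        if abs(t_stat) < 1.96:
            misses += 1                              # real effect not detected
    return misses / reps


scenarios = [
    ("small n, low R2, moderate collinearity", 30, 0.25, 0.50),
    ("small n, low R2, little collinearity", 30, 0.25, 0.10),
    ("large n, high R2, strong collinearity", 300, 0.75, 0.75),
]
for label, n, r2, rho in scenarios:
    rate = type_ii_error_rate(n, r2, rho)
    print(f"{label}: Type II error rate = {rate:.2f}")
```

Comparing the three scenarios shows how much of the inference risk comes from sample size and explanatory power rather than from multi-collinearity alone.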
The upshot of this discussion is not that it is incorrect to use measures such as VIF or CI etc to diagnose the impact of multi-collinearity, but that:
- there are other important factors that have just as serious an impact on the value of the model; and
- multi-collinearity should not be assessed in isolation from these factors.

