Jeromy Anglim's Blog: Psychology and Statistics

Friday, September 18, 2009

Variable Importance and Multiple Regression

Many researchers are interested in questions about the relative importance of a set of predictors in multiple regression. This matters to both consultants and academics. I assume the motivation derives from the assumption (typically wrong, at least to some extent) that the predictors flagged as important have larger causal effects and are therefore better targets for manipulation in an intervention. Some examples include: a) a set of risk factors on clinical symptoms in psychology; b) a set of personality measures on performance; c) a set of beliefs on overall attitude.

There are many issues with the concept of predictor importance in a multiple regression.
  1. Regression is typically based on observational data, where there is no guarantee that any relationship between a predictor and the outcome variable is causal. In fact, on theoretical grounds, most applications I have seen in psychological contexts would suggest that a third variable or a reciprocal relationship is more likely.

  2. Measures of variable importance that depend on the other predictors in the model (e.g., semi-partial correlation, standardised beta), well, they depend on the other predictors in the model. This problem becomes increasingly important as the correlations between predictors (i.e., multicollinearity) increase.
  3. Importance of a predictor from a policy perspective may depend on other factors such as the cost of manipulation.
  4. If predictors vary in their reliability and validity, better prediction by one variable may be partially due to superior measurement rather than to a stronger relationship with the underlying phenomenon.
When I consult with researchers who want to say something about variable importance in their data, I tend to give the following advice:
  • Consider the above-mentioned issues: 1) consider alternative causal explanations; 2) also use a measure of variable importance that is independent of the other predictors in the model (e.g., the zero-order correlation); 3) consider other factors that might be weighted in policy makers' decisions about variable importance; 4) examine the reliability of the data as indicated by standard measures, and consider how well the variables actually measure the latent theoretical concepts. If reliabilities differ substantially, SEM-based approaches that adjust for reliability may be better; having predictors that are all of high reliability and validity in the first place is better still.
  • A quick and easy way to look at variable importance, which I find reasonable, involves examining and reporting the zero-order correlation and the semi-partial correlation for each predictor. The zero-order correlation (i.e., the standard correlation) tells you the degree to which the predictor is related to the outcome variable, ignoring any other predictors. The squared semi-partial correlation tells you the unique percentage of variance in the outcome variable explained by the target predictor over and above the other predictors. If you want to test whether the differences between the zero-order correlations for the different predictors are statistically significant, check out my posts on examining the difference between non-independent correlations. Andy Field describes how to read this information from SPSS output. The zero-order and semi-partial correlations may rank-order the predictors in the same way; if they do not, consider the role of multicollinearity in influencing the semi-partial correlations.
  • For more information about the literature on relative importance have a look at some of the links on Ulrike Grömping's site.
  • For further information on multiple regression, one online reference is this book by Cohen and colleagues. My own material on multiple regression is here.
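The quick-and-easy approach above can be computed in a few lines. The sketch below uses simulated data and an illustrative helper `r_squared` (both are my own names, not anything from the post): each predictor's squared semi-partial correlation is obtained as the drop in R² when that predictor is removed from the full model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data: three predictors (x1 and x2 correlated) and an outcome.
n = 500
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 0.4 * x1 + 0.3 * x2 + 0.2 * x3 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def r_squared(X, y):
    """R^2 from an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1 - resid.var() / y.var()

r2_full = r_squared(X, y)
for j in range(X.shape[1]):
    # Zero-order correlation: predictor vs outcome, ignoring the others.
    zero_order = np.corrcoef(X[:, j], y)[0, 1]
    # Squared semi-partial correlation: unique variance explained by
    # predictor j over and above the remaining predictors.
    sr2 = r2_full - r_squared(np.delete(X, j, axis=1), y)
    print(f"x{j + 1}: r = {zero_order:.3f}, sr^2 = {sr2:.3f}")
```

Comparing the two columns side by side makes discrepancies in rank ordering, and hence possible multicollinearity effects, immediately visible.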

1 comment:

  1. Hmmm, long time, and no one followed up. It's an important topic. I have clients who want to use Shapley Value regression, which decomposes the explained variance by building models with every combination of "leave some predictors out" and applying some weighting.

    Also, I use TreeNet (from Salford Systems), and it assesses the contribution of each variable to a predicted value (for classification), or something similar for regression. TreeNet is inherently non-linear and includes additive interactions; it can also capture more complex interactions because it is based on Stochastic Gradient Boosting, which combines a large number of fairly weak predictors.

    I'm currently working on differences between Shapley Value and the Variable Importance in TreeNet, because I think I'd like clients to use TreeNet.
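For readers curious about what the Shapley Value decomposition mentioned in the comment amounts to, here is a minimal sketch (simulated data, illustrative names, not any particular commercial implementation): each predictor's share is its marginal contribution to R², averaged over every order in which the predictors could enter the model. With p predictors this enumerates p! orderings, so brute force is only feasible for small predictor sets.

```python
import numpy as np
from itertools import permutations

rng = np.random.default_rng(1)

# Simulated data with correlated predictors, as is typical in practice.
n = 400
x1 = rng.normal(size=n)
x2 = 0.6 * x1 + rng.normal(size=n)
x3 = rng.normal(size=n)
y = 0.5 * x1 + 0.3 * x2 + 0.2 * x3 + rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

def r_squared(cols):
    """R^2 of y regressed (with intercept) on the given predictor columns."""
    if not cols:
        return 0.0
    Xs = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
    resid = y - Xs @ beta
    return 1 - resid.var() / y.var()

p = X.shape[1]
shapley = np.zeros(p)
orders = list(permutations(range(p)))
for order in orders:
    included = []
    for j in order:
        # Marginal contribution of j given the predictors entered so far.
        before = r_squared(included)
        included.append(j)
        shapley[j] += r_squared(included) - before
shapley /= len(orders)

print("Shapley shares:", np.round(shapley, 3))
```

A useful property of this decomposition is that the shares sum exactly to the full-model R², which is what makes it attractive for apportioning explained variance among correlated predictors.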