Jeromy Anglim's Blog: Psychology and Statistics

Tuesday, September 29, 2009

Difference Scores | Are They Okay to Use?

A difference score is a variable that has been formed by subtracting one variable from another.
i.e., DIFFSCORE = VAR1 - VAR2.
Some researchers have heard that difference scores are 'bad'. This post discusses some of the issues, provides some additional references, and discusses calculating reliability of difference scores.
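As a minimal sketch of the definition above (the variable names and numbers are hypothetical), computing a difference score in Python with numpy:

```python
import numpy as np

# Hypothetical scores for five respondents on the same scale
var1 = np.array([3.0, 4.5, 2.0, 5.0, 3.5])  # e.g., condition 1 or time 1
var2 = np.array([2.5, 4.0, 3.0, 4.0, 3.5])  # e.g., condition 2 or time 2

# DIFFSCORE = VAR1 - VAR2, computed element-wise for each respondent
diffscore = var1 - var2
print(diffscore)  # differences: 0.5, 0.5, -1.0, 1.0, 0.0
```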


The following are some scenarios where either I have thought about or researchers have asked me about difference scores:

  • Examining change on a variable over two time points
  • Comparing scores in two conditions in a repeated measures experiment (e.g., conscientiousness in an honest versus a job applicant role-play condition)
  • Comparing scores before and after an intervention

General References

  • Jeffrey Edwards (2001) provides a good starting point for learning about difference scores: Ten Difference Score Myths
  • For a discussion in the longitudinal data analysis context, check out Singer and Willett (2003).

My casual observations

  • The appropriateness of difference scores depends on the general concepts of validity and reliability.
  • A difference score is valid to the extent to which it actually measures what you intend to measure.
  • A difference score is reliable to the extent that it estimates whatever it estimates with little error. Reliability can be defined in terms of accuracy (the difference between observed and true scores) or in terms of correlation (the correlation between observed and true scores).
  • If you are interested in the effect of time on a variable, then you should try to measure the dependent variable at more than two time points. The aim of the research should usually be to describe the functional form of the relationship between time and the dependent variable. Thus, designs with just two time points are often inadequate. In short, the difference score is not the best summary of the change process.
  • If you are interested in the differences in scores between two conditions (e.g. honest versus role play) the difference score is a natural measure. An important strategy for increasing the reliability of the difference score is increasing the reliability of the two variables used to form the difference score. For example, I have typically used the 20 items per scale version of the IPIP instead of the 10 items per scale version when looking at difference scores in personality across experimental conditions.
  • Reliability of a difference score also depends on there being actual variability in the difference. If there are no real differences, or if the differences are the same for all individuals, there are no true individual differences in change for the difference score to measure.
  • If the difference score is not reliable, sample correlations between the difference score and other variables will be reduced.
  • At first, the behaviour of difference scores can seem a little strange. For example, low conscientiousness scores in an honest condition tend to be correlated with response distortion (a difference score between conscientiousness in an honest and a job applicant role-play condition). My interpretation of this correlation is mainly that low conscientiousness respondents have greater scope for increasing their score. Another example can be seen in training research. A participant who already knows the content of a training program may learn less (as defined by the difference between pre- and post-training scores) than someone who does not know the material at the start.
  • If you have a multilevel dataset with each person rating a series of objects on two dimensions, and you are computing a difference score based on differences between dimensions, the reliability of the difference score is likely to vary between individuals.
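The points above about variability and attenuation can be illustrated with a small simulation (all the generating values below are illustrative assumptions, not results from any study). When true change is constant across individuals, the difference score is pure measurement error and correlates with nothing; when true change varies and relates to a criterion, the correlation shows through, attenuated by the unreliability of the difference score:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000  # large n so the illustration is not dominated by sampling noise

true_score = rng.normal(0, 1, n)               # trait level at time 1
criterion = true_score + rng.normal(0, 1, n)   # a variable correlated with the trait

noise1 = rng.normal(0, 0.5, n)  # measurement error at each occasion
noise2 = rng.normal(0, 0.5, n)
obs1 = true_score + noise1

# Case 1: everyone changes by the same constant amount,
# so the difference score is just error around that constant
obs2_constant = true_score + 1.0 + noise2
diff_constant = obs2_constant - obs1

# Case 2: true change varies across individuals and relates to the criterion
true_change = 1.0 + 0.5 * criterion + rng.normal(0, 0.5, n)
obs2_varying = true_score + true_change + noise2
diff_varying = obs2_varying - obs1

print(np.corrcoef(diff_constant, criterion)[0, 1])  # near zero
print(np.corrcoef(diff_varying, criterion)[0, 1])   # clearly positive
```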

Calculating reliability

Page 995 of McFarland and Ryan (2006) lists a formula for calculating the reliability of a difference score, citing Rogosa, Brandt, and Zimowski (1982). Marley Watkins has some software to calculate the reliability of a difference score, but I have not used it.
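The specific formula from McFarland and Ryan is not reproduced here, but the standard classical test theory expression for the reliability of a difference score D = X1 − X2 is a sketch worth having on hand; check it against the cited page before relying on it. All the example values below are made up for illustration:

```python
def difference_score_reliability(r11, r22, r12, sd1=1.0, sd2=1.0):
    """Classical test theory reliability of the difference D = X1 - X2.

    r11, r22 -- reliabilities of the two component scores
    r12      -- correlation between the two observed scores
    sd1, sd2 -- standard deviations of the two observed scores
    """
    num = r11 * sd1**2 + r22 * sd2**2 - 2 * r12 * sd1 * sd2
    den = sd1**2 + sd2**2 - 2 * r12 * sd1 * sd2
    return num / den

# With equal variances this simplifies to (mean reliability - r12) / (1 - r12).
# E.g., two scales each with reliability .80 that correlate .50:
print(round(difference_score_reliability(0.8, 0.8, 0.5), 3))  # 0.6
```

Note how the reliability of the difference falls as the correlation between the two components rises, which is why highly correlated measures (such as the same scale at two time points) tend to yield unreliable difference scores.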


  • Edwards, J. R. (2001). Ten difference score myths. Organizational Research Methods, 4, 264-286.
  • McFarland, L., & Ryan, A. (2006). Toward an integrated model of applicant faking behavior. Journal of Applied Social Psychology, 36(4), 979-1016.
  • Rogosa, D., Brandt, D., & Zimowski, M. (1982). A growth curve approach to the measurement of change. Psychological Bulletin, 92(3), 726-748.
  • Singer, J. D., & Willett, J. B. (2003). Applied longitudinal data analysis: Modeling change and event occurrence. New York: Oxford University Press.


  1. You don't list the Edwards (2009) citation in your references. I assume that you are actually referring to his 2001 ORM article, correct?

    1. Thanks for spotting that. I've corrected the post.

  2. I've been studying change in personality variables lately too, specifically well-being (guess I might as well plug my dissertation here too). I've become somewhat concerned with the risk of combining measurement errors across longitudinal observations in ways that might increase kurtosis. To avoid the issue, I've been looking into latent change regression, and thought I'd mention it here. Check out McArdle (2009) if you haven't yet; I'm only about 1/3 through it and already learning a ton!

    McArdle, J. J. (2009). Latent variable modeling of differences and changes with longitudinal data. Annual Review of Psychology, 60, 577–605.

  3. Hi Nick. Thanks for the comment. It's fun for me to look back at a post I wrote four years ago. I find myself using Bayesian hierarchical models a lot these days, for, among other things, the flexibility that they bring to modelling individual differences in change.

  4. Nice post. Curious about an issue I've had with a few different datasets. The difference scores have been between two conditions in a repeated measures experiment. I want to predict the difference across conditions with a continuous variable. What would be the arguments for and against using 1) a repeated measures ANOVA with an interaction effect between the condition variable and the continuous variable, versus 2) a linear regression where the continuous variable predicts the difference scores across the conditions, versus 3) some third alternative I am not aware of?