Jeromy Anglim's Blog: Psychology and Statistics

Wednesday, March 25, 2009

Calculating Composite Scores of Ability and Other Tests in SPSS

Researchers in psychology often have a large number of variables. Science aims for parsimonious explanations of the world. Thus, the challenge is to develop a principled approach to dealing with the multiplicities that arise in psychological research. One common approach is to combine tests that measure similar things into composites. This post looks at how to form composites. The emphasis is on settings where you have multiple ability tests and you want to create a composite ability factor. Also, particular emphasis is given to how to do it in SPSS.

This scenario comes up in many settings, such as the items of a psychological scale, or several scales measuring similar constructs (e.g., personality, symptoms, performance, or ability). The example that I will be talking about here is one where you have a number of ability tests. For example, if you have scores for participants on ten ability tests, it may be useful to form one or more composites. These composites can then be used in subsequent analyses.

Step 1: Decide which variables should form composites

Step 2: Compute composites

Step 3: Use composites in subsequent analyses.

Step 1: Decide which variables should form the composites

Three major sources of information inform which variables should be grouped together to form composites: (1) data, (2) aims, and (3) theory.

1.1) Data: All else being equal it makes more sense to combine variables that are correlated with each other. Thus, if you examine the correlation matrix for the set of tests and see that a subset of tests correlate highly with each other (e.g., r greater than .4 or .5 or .6 or .7), this suggests that this subset is measuring something in common. A more sophisticated approach to this task involves running a factor analysis or principal components analysis.
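As a rough illustration of scanning a correlation matrix for tests that hang together, here is a minimal sketch in Python rather than SPSS. The test names, the six made-up scores per test, and the .5 cut-off are all assumptions chosen for the example, not values from any real data set.

```python
from itertools import combinations
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores for six participants on three tests (made-up data).
tests = {
    "vocab_everyday": [1, 2, 3, 4, 5, 6],
    "vocab_advanced": [2, 1, 4, 3, 6, 5],
    "spatial": [3, 4, 1, 6, 2, 5],
}

THRESHOLD = 0.5  # assumed cut-off; the text suggests anything from .4 to .7

# Correlation for every pair of tests.
correlations = {
    (a, b): pearson(tests[a], tests[b]) for a, b in combinations(tests, 2)
}

# Pairs that correlate strongly enough to be candidates for a composite.
candidate_pairs = [pair for pair, r in correlations.items() if r > THRESHOLD]

for pair, r in correlations.items():
    print(pair, round(r, 2))
print("Candidates:", candidate_pairs)
```

With real data you would normally follow this up with a factor analysis or PCA rather than rely on a single cut-off.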

In the following link I provide notes for a lecture on Factor Analysis and PCA with practice questions. The example in the lecture is based on my own data where I assessed whether nine ability tests could be reduced to three abilities. Much can be said about factor analysis, and I don’t wish to discuss all the issues here (see books like Tabachnick and Fidell or Hair et al.). However, after completing your factor analysis, you should have worked out how many components you want to extract and which variables will be included in each.

1.2) Aims: It is important to think about the purpose of forming composites in relation to your analyses. For example, if you have 10 ability tests, you might only be interested in having a general measure of intelligence, in which case it might be sufficient to create a single composite based on all tests. In other cases, such as in neuropsychology, where particular deficits are theorised, or in settings where you are interested in the differential prediction of classes of ability tests, a fine-grained split may be of interest. In general, there is a trade-off between complexity and parsimony.

1.3) Theory: It is also useful to think about the theory of how the individual tests relate. This is particularly important if you have a small sample (e.g., n less than 50), such that factor analysis might not be possible, or even if it is possible, results might not be especially reliable. Theory and past research may suggest that the tests should be grouped in particular ways.

After thinking about the data, your aims, and theory, you should have decided which tests will be combined to form composites.

Step 2: Compute composites

Two main options for forming composites are ‘factor saved scores’ and creating your own weighted composites.

2a) Factor Saved Scores: In the case of factor saved scores, you let the factor analytic procedure compute its own composites based on the results of the factor analysis. SPSS has a button called “Scores…” which lets you save scores. See Andy Field's Factor Analysis notes for more information.

2b) Your own weighted composite: This typically involves creating a linear composite of the component variables. For example, assume you have three tests called “(EV) everyday vocabulary”, “(AV) advanced vocabulary”, and “(C) comprehension”. As a result of the factor analysis you have decided to combine these three tests into a composite. A simple procedure would be to say that: composite = EV + AV + C. The problem with this approach is that the three tests often have different metrics. One may be percentage correct, another may be the number solved, and so on. The result is that the tests with larger standard deviations will be weighted more in the composite. Generally, we want to weight all the tests equally. At the very least we want to be in control of the weighting; we don’t want to leave the weighting up to some arbitrary consequence of the metrics of the variables.
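To see how the metric of a test can take over a raw sum, here is a small sketch in Python (not SPSS). The two score lists are invented for illustration: EV is treated as a percentage and AV as a count of items solved, so EV has a much larger standard deviation.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def zscores(x):
    """Standardise a list using its own mean and sample SD."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]


# Hypothetical data: EV is percentage correct, AV is number of items solved.
ev = [40, 55, 60, 70, 85, 90]
av = [3, 5, 4, 6, 5, 7]

# Composite 1: add the raw scores. Composite 2: add the z-scores.
raw_sum = [a + b for a, b in zip(ev, av)]
z_sum = [a + b for a, b in zip(zscores(ev), zscores(av))]

# How strongly does each composite track each component test?
r_raw_ev, r_raw_av = pearson(raw_sum, ev), pearson(raw_sum, av)
r_z_ev, r_z_av = pearson(z_sum, ev), pearson(z_sum, av)

print("raw sum:", round(r_raw_ev, 3), round(r_raw_av, 3))
print("z sum:  ", round(r_z_ev, 3), round(r_z_av, 3))
```

The raw sum correlates more strongly with the high-variance test, whereas the sum of z-scores relates equally to both tests.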

Thus, a common procedure is to: 2b.1) convert the raw test scores to z-scores, and 2b.2) add up the z-scores.

2b.1) Convert each raw score to a z-score:

The formula for a z-score is (score – mean) / standard deviation.

For general information about descriptive statistics and using the compute function follow this link.

a) SPSS will do this for you using Analyze - Descriptive Statistics - Descriptives, and then ticking “Save standardized values as variables”.

b) Alternatively you can get the descriptive statistics for your variable (mean and standard deviation) and then use Transform - Compute to create a new variable.

In syntax it would look like this:

compute zvocab = (vocab - 10) / 2.

This assumes that your raw variable for the test is vocab, the mean of the raw score is 10, and the standard deviation of the raw score is 2.
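For readers who want to check the arithmetic outside SPSS, the two options can be sketched in Python as follows. The scores and the function names are hypothetical; the mean of 10 and standard deviation of 2 are the values from the syntax example. The sample standard deviation (n - 1 denominator) is used, which is also what SPSS reports in its descriptives.

```python
from statistics import mean, stdev


def standardise(score, m, s):
    """z-score for one raw score, given a mean and standard deviation."""
    return (score - m) / s


def zscores_from_sample(scores):
    """Option a: standardise using the sample's own mean and SD."""
    m, s = mean(scores), stdev(scores)
    return [standardise(v, m, s) for v in scores]


# Hypothetical raw vocabulary scores.
vocab = [8, 10, 12, 9, 11]

# Option a: let the data supply the mean and SD.
z_sample = zscores_from_sample(vocab)

# Option b: plug in a chosen mean (10) and SD (2), as in the compute example.
z_fixed = [standardise(v, 10, 2) for v in vocab]

print([round(z, 2) for z in z_sample])
print(z_fixed)
```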

Option b above is essential when you have two or more time points. This is because you will want to use a common mean and standard deviation for standardised variables at the two time points. If you standardise within a time point, you will remove any change in scores over time.
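The point about time points can be illustrated with a short Python sketch. The data are invented: every participant improves by exactly 3 raw points between time 1 and time 2. Using the time 1 mean and standard deviation as the common reference is one possible choice, assumed here for simplicity.

```python
from statistics import mean, stdev

# Hypothetical raw scores; everyone improves by 3 points at time 2.
time1 = [8, 10, 12, 9, 11]
time2 = [11, 13, 15, 12, 14]

# Common reference: the time 1 mean and SD, applied to both time points.
m1, s1 = mean(time1), stdev(time1)
z1 = [(v - m1) / s1 for v in time1]
z2_common = [(v - m1) / s1 for v in time2]

# Standardising time 2 on its own mean and SD instead.
m2, s2 = mean(time2), stdev(time2)
z2_within = [(v - m2) / s2 for v in time2]

change_common = mean(z2_common) - mean(z1)   # improvement in time 1 SD units
change_within = mean(z2_within) - mean(z1)   # always zero by construction

print(round(change_common, 2), round(change_within, 2))
```

With the common reference the improvement shows up in time 1 standard deviation units; with within-time standardisation the mean change is zero regardless of what happened to the raw scores.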

2b.2) Add up the z-scores:

In SPSS this can be done using Transform - Compute.

In syntax it might look like this:
compute verbaltot = zvocaba + zvocabe + zcompr.

This assumes that you want your new variable to be called verbaltot, and that you have created three z-score versions of your tests.

Note that you will have to adjust the above approach if you have any reversed tests. For example, on measures of reaction time or error counts, low scores indicate more ability, whereas on measures based on number of items answered correctly, high scores indicate more ability. In these situations, you will need to either reverse the z-scores before forming the composite or place a minus sign, instead of a plus sign, before the test in the compute statement.
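Here is a minimal Python sketch of the minus-sign approach for a reversed test. The variable names and scores are hypothetical: vocabulary is scored so that higher is better, and reaction time so that lower is better.

```python
from statistics import mean, stdev


def zscores(x):
    """Standardise a list using its own mean and sample SD."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]


# Hypothetical data for five participants.
vocab = [12, 15, 9, 18, 11]               # higher = more ability
reaction_ms = [520, 480, 610, 450, 560]   # lower = more ability

z_vocab = zscores(vocab)
z_rt = zscores(reaction_ms)

# Subtract the reversed test so that high composite scores mean more ability.
composite = [zv - zr for zv, zr in zip(z_vocab, z_rt)]

print([round(c, 2) for c in composite])
```

The participant with the best vocabulary score and the fastest reaction time (the fourth one) ends up with the highest composite, which is the behaviour we want.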

Step 3: Use composites in subsequent analyses.

The composites can then be used in subsequent analyses, such as predictors in regressions, dependent variables in group comparisons, and so on. The benefit is that you have simplified the complexity in your data and are able to present a more parsimonious explanation.

When it comes to reporting your decision to combine scales, you will want to give a justification and an explanation. The justification should make reference to data, theory, and your aims. The description can be as simple as what follows in Ackerman & Cianciolo (2000): “To provide more stable measures of the underlying abilities, composites were formed with unit-weighted z scores of constituent tests” (p. 264).



  1. Great and clear information! My dissertation and I thank you for this!

  2. I saw in a study that the researcher created standard scores by using a mean of 0 and a standard deviation of 1 for all variables. Does this mean that he was not using the raw mean and standard deviation of each variable? Is this correct?


  3. @Anonymous. Your interpretation sounds correct.

  4. your link on this paragraph does not work:

    "2a) Factor Saved Scores: In the case of factor saved scores, you let the factor analytic procedure compute its own composites based on the results of the factor analysis. SPSS has a button called “Scores…” which lets you save scores. See the following information if you wish to use this option."

  5. Okay. I've updated the link. Andy Field now appears to have his own domain name with the slightly ominous title:

  6. Thank you so much for this! Clear and concise.

    Just a quick question, if you don't mind. If I'm combining 2 sets of Z-scores into 1 new variable, but 1 Z-score is positive and the other is negative (for the same participant), should I take the mean instead of summing?

  7. Hi Viv,
    If you have two variables z1 and z2.
    Then, sum(z1, z2) = 2 * mean(z1, z2).
    That is to say the sum and the mean are perfectly correlated.
    If one is "negative", then you will have to reverse the negative variable.

    e.g., if I were developing a composite score combining reaction time (test1), where low scores meant more ability, and a vocabulary test (test2), where high scores meant high ability, I could do the following:

    1. convert both test1 and test2 to z-scores (z1 and z2)
    2. reverse z1, by subtracting it from zero.
    i.e., revZ1 = 0 - z1.
    3. take the sum or the mean of revZ1 and z2.
    The result is a measure of combined ability.

  8. Hi Jeromy. Thank you, that makes sense! The 2 measures are in the same direction though. So most of them are positive (z1) and positive (z2), although some participants have positive (z1) and negative (z2) scores. Can I still just sum them?

  9. Hi Viv,
    okay. I understand what you're asking now.
    If some variables in a set contain some negative values, you can still get the sum or the mean of the set of variables.

  10. Suppose you have a questionnaire that measures a construct and consists of 3 subscales.
    Two subscales are measured on a 5-point Likert scale, and in 1 subscale the questions are open-ended and answered with a number.

    If you want to obtain the construct score:
    Do you have to take the average of each subscale first (adding the item scores together and dividing by the number of items), then standardize each subscale score and add the 3 standardized subscales together (and then divide by the number of subscales)? Do you even have to divide by the number of items when you add up standardized scores?

    Or do you have to standardize every single item and add the standardized item scores to obtain the total subscale score?
    And if you add standardized scores, do you have to divide the sum by the number of items? When you add standardized items, doesn't the SD become really high?

    Can you help me? Thx!

  11. @An
    I'm not exactly sure what you are asking, but here are some thoughts.

    If you are creating a composite from a set of variables, x1, x2, x3, etc,
    you generally need to standardise when the variables have different metrics.
    If you don't standardise, the scales with the greater variance will be weighted more in the composite.
    This applies whether you are creating a composite as a sum or as a mean.

    Thus, if you are creating a composite based on a set of items all on the same scale (e.g., 1 to 5 Likert items), then you can just take the sum or mean of those items.

    If you have a test with an overall score and a set of scale scores, based on a set of items, you have options for how to compute the overall score.
    (a) you could get mean values for scale scores and get the overall score is the mean of the scale scores;
    (b) the overall score could be the mean of all items
    Variations on the above could be applied using "sums".
    It would also be possible to use factor saved scores.

    If you are working with a standardised test, you should check what is the standard scoring protocol.
    If the test has been used previously, see what scoring protocol they used.

    If you are working with typical attitude or self-report scales, you may find the following post more relevant.

  12. Dear Jeromy,

    Thx for your reaction.

    My construct (organizational embeddedness) consists of 3 dimensions/subscales: organizational fit (5 questions on a 5-point Likert scale), organizational links (5 open-ended questions) and organizational sacrifices (5 questions on a 5-point Likert scale).

    So my construct, organizational embeddedness, is the composite score of fit, links and sacrifices.

    Therefore, I computed the raw means of every scale.

    I don't understand what you mean with '(a) you could get mean values for scale scores and get the overall score is the mean of the scales scores'

    Your option '(b) the overall score could be the mean of all items' seems weird, because then I have to add the means of the scales that are measured differently? Is that possible?

    In your initial blog post it seems you can add the standardized means of the differently measured subscales... But doesn't the standard deviation get very high then?

    I also thought of doing a confirmatory factor analysis with 1 factor (the construct), but I think that because the construct consists of 3 dimensions (fit, links and sacrifices) the results will not be so great?

    I just don't know how to get from my 3 subscales to my construct.

    The literature dealing with the (new) construct isn't very clear about how the overall score is computed.

    I hope my question is more clear now.

    Thx in advance!

    Grtz, An

  13. Hi An,
    There are several ways that you could create your overall measure of organisational embeddedness. It's perhaps worth reading up some more about the trade-offs of the various options. However, here's one recipe:

    1. Create mean subscale scores.
    meanOrganizationalFit = mean(org fit items)
    meanOrganizationalSacrifice = mean(org sac items)
    meanOrganizationalLinks = however you calculate it. (I'm not clear what you mean by open-ended question and how you convert the open-ended question to a scale score.)

    2. Convert each of the subscale means to z-scores.
    * meanOrganizationalFit
    * meanOrganizationalSacrifice
    * meanOrganizationalLinks

    3. Sum the subscale z-scores.
    Embeddedness = sum the three z-scores.
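The three-step recipe above can be sketched in Python as follows. All numbers are invented, each subscale is cut down to three items for brevity, and the links subscale is assumed to have already been reduced to one number per respondent, since how the open-ended questions are scored is not specified.

```python
from statistics import mean, stdev


def zscores(x):
    """Standardise a list using its own mean and sample SD."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]


# Hypothetical item responses: one row per respondent, one column per item.
fit_items = [[4, 5, 4], [2, 3, 2], [3, 3, 4], [5, 4, 5]]
sacrifice_items = [[3, 4, 3], [2, 2, 1], [4, 4, 5], [5, 5, 4]]
# Assumed: the open-ended links questions are already scored as one number.
links_scores = [12, 3, 7, 20]

# Step 1: mean subscale scores.
mean_fit = [mean(row) for row in fit_items]
mean_sacrifice = [mean(row) for row in sacrifice_items]

# Step 2: convert each subscale score to z-scores.
z_fit = zscores(mean_fit)
z_sacrifice = zscores(mean_sacrifice)
z_links = zscores(links_scores)

# Step 3: sum the subscale z-scores.
embeddedness = [sum(parts) for parts in zip(z_fit, z_sacrifice, z_links)]

print([round(e, 2) for e in embeddedness])
```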

  14. Very clear!

    Thank you very much!

    Grtz, An

  15. Hi. If I make a composite scale from several Likert items using the mean, can I then use that composite score in a logistic regression?

  16. Hi aj,
    In general, yes you could use it as a predictor.
    If you want a more detailed answer, perhaps post it on the helpful Q&A website:

  17. Thanks - step by step procedure saved me a lot of time - I will cite you in my acknowledgments after my defence!

  18. Hi,

    Your explanation is very clear.

    However, I am confused about the difference between the z-score and the average score.

    I read one blog and found that the author used the mean to calculate composite scores.

    What is the best way to calculate a "Unit-Weighted Composite"?

    What is the difference between a "Unit-Weighted Composite" and an "Average-Weighted Composite"?

    I am waiting for your response. Thank you so much.


  19. @Atikah: converting to z-scores first ensures that the variance of each test is the same (i.e., 1). A unit-weighted composite is just the sum. i.e.,
    1 * ztest1 + 1 * ztest2 + 1 * ztest3
    = ztest1 + ztest2 + ztest3.

    You could also get the mean z-score instead of the sum. Means and sums are perfectly correlated, they're just a different metric.

    e.g., the mean would be
    1/3 * ztest1 + 1/3 * ztest2 + 1/3 * ztest3
    = (ztest1 + ztest2 + ztest3) / 3
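A quick numerical check of the claim that sums and means are perfectly correlated, written in Python with made-up z-scores for four participants:

```python
from statistics import mean


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical z-scores on three tests for four participants.
ztest1 = [-1.0, 0.5, 1.2, -0.7]
ztest2 = [-0.4, 0.1, 1.5, -1.2]
ztest3 = [0.3, -0.9, 1.1, -0.5]

# Unit-weighted sum and the corresponding mean.
unit_sum = [a + b + c for a, b, c in zip(ztest1, ztest2, ztest3)]
unit_mean = [s / 3 for s in unit_sum]

r_sum_mean = pearson(unit_sum, unit_mean)
print(round(r_sum_mean, 6))
```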

  20. Thanks Jeromy.

    However, if I would like to use Factor score weight (obtained via AMOS), I could compute the composite score through Compute variable in SPSS with formula:

    MEAN(.072 * B5,.405 * B6,.155 * B7,.132 * B8)

    where .072, .405, .155, .132 are factor score weight
    B5, B6, B7, B8 are indicator variables

    What do you think on my calculation?


  21. @Atikah
    In broad terms that's fine. It really depends on what you want to do. Make sure that you are either combining unstandardised coefficients with unstandardised variables or standardised coefficients with standardised variables.

  22. How can you decide on what makes the most sense to use - either a Factor saved score or your own weighted composite? I've used both as the outcome for a regression I'm doing, and the predictors are different depending on whether I used the factor score or the mean. How do I know which one to use?
    Thanks for your help!

  23. @leslie,
    A few thoughts:
    * See whether it makes a difference by correlating factor saved scores with your own weighted composite.
    * If you adopt your own weighted composite, those weights can be used in future studies, making raw scores and results in general more comparable across studies

  24. How do I make a composite measure for items that need to be reverse coded?

  25. @Anonymous
    Two basic options for calculating measures with reverse coded items:
    1. reverse the items first and then use the above strategies
    2. use a negative weight for the composite
    (e.g., if x2 needed to be reversed: 1 * x1 - 1 * x2 + 1 * x3)

    You may want to read this other post on scoring psychological scales. It discusses item reversal further:

  26. Help!!!! Please!! I need to find out if there is a correlation between different factors and teacher attrition. My survey has 4 questions on personal factors and 3 questions on attrition. I want to combine the 4 personal factors and combine the 3 attrition factors to see if there is a correlation between personal factors and attrition. What test would I use and how do I enter this into SPSS? My professor says to use a Pearson Product Moment but I do not know how to combine the questions together and run this analysis. Thanks in advance!

  27. @Linda
    Google is your friend.

    It sounds like you need a basic SPSS book like the SPSS Survival Manual.
    Or check out my own PDF guide:

    You may find this post useful on scale construction:

    Once you have some specific questions you may find a useful place to ask questions

  28. Hi,
    I am having a bit of trouble with my dissertation.
    I am using factor analysis and then want to use the results in a bivariate correlation of all the factors.

    I already have the factors. But how do I put the results into a bivariate correlation?

  29. @anonymous; perhaps check out this post:

    or type "spss correlation" into Google.

    If that doesn't answer your question, I suggest posting a specific question on

  30. Thanks for all the information and examples.

    I have a few questions about creating and working with composite variables. I am researching the effectiveness of an intervention and I have scores from a bunch of neuropsychological tests before and after. Ultimately, I want to see if test scores improved. Unfortunately, I don't have a control group.

    Is it appropriate to form composite variables using raw data?


    2b.1) Convert each raw score...For option b, if I have two different time points (pretest/posttest), where are the means and standard deviations coming from? Would I combine the scores from both time points to get a common mean and standard deviation and then apply that to each time to to create a z-score?

    Thanks so much!

  31. @Anon 11/3/2011
    With regard to multiple time points, it probably won't make much difference whether you use the mean and standard deviation of time 1, time 2, or a pooled estimate, or even the means and standard deviations from the test manual.
    The core thing is that you use the same mean and standard deviation for both time 1 and time 2 so that the two time points are on a common metric.

    Yes you can create composites using raw data, you just need to think carefully about the weights you give to the variables so that they align with the importance of the individual tests.
    Particular care is required when the individual tests have different standard deviations.

    Also, the lack of a control group won't stop you from knowing whether or not test scores have improved.
    The problem, of course, is whether any observed improvement is due to an intervention that you have applied between time 1 and 2 or whether it is due to some other effect such as regression to the mean, the general effect of time, or some other factor external to any intervention.

  32. Hi Jeromy
    I am examining the effects of high BMI (body mass index) in pregnant women on maternal and foetal outcomes. I want to present the results as composite end points. I did very basic stats, e.g., the % of each incidence in each BMI group. How do I create the composite measure? I have done the basic work in Excel. How can I weight the outcomes, because obviously foetal death will not carry the same significance as a low birth weight infant? Thanks

  33. @anonymous
    I don't completely understand what you are asking.
    Perhaps it would be best to ask the question on where a more interactive discussion can be facilitated.

    If I understand you, you are trying to create an index of foetal outcome where death would be the worst outcome and otherwise outcomes are ranked in some respect relative to birth weight.
    If this is the case, I can imagine that you could form an index. You could possibly ask experts to map various possible outcomes onto a severity index and pool these responses to map your data onto the severity index. It sounds like quite a topic-specific issue, where you'll have to use a lot of subject matter expertise.
    It's also worth considering the relative merits of a composite versus individual measures.

  34. I am planning to use composite scores in my master's thesis. This information was really helpful.
    So, as I understand it, before forming a composite score, I need to check the correlations among variables by using factor analysis or PCA?
    After the factors are extracted, if three items are bundled in one factor, I sum up those items as one composite score. Is that correct?

  35. Interpreting z-Scores
    The z-score for a subject indicates how many standard deviations away from the mean the subject scored. Therefore, a z-score of 1.3 means that the subject scored 1.3 SD's above the mean. Similarly, a z-score of -.70 means that the subject scored .70 SD's below the mean. And, a z-score of 0.00 means that the subject scored zero SD's above or below the mean; in other words, the person scored exactly the same as the mean.

    I found this explanation of how to interpret z scores online. What I'm trying to figure out is how you know what the mean response is. I know that the mean is always going to equal zero, but what is that value? For example, suppose you have a scale with the attributes High, Moderate and Low, and Low was the mean, that is, on average the majority of respondents selected Low. How do you know which attribute is the mean?

  36. @Anonymous/Jen What you said is basically correct in some circumstances, but not all. You may prefer to ask a question on which is more suited to iteratively discerning your question and your proposed solution.

  37. @Nolram You are confusing the mean with the mode. The mode is the most common response.
    When calculating the mean of an ordinal scale, you first have to assign numeric values to the scale points (e.g., Low = 1; Moderate = 2; High = 3).
    You can then use the standard formula for calculating the sample mean (e.g., the sum of the values divided by the number of values).
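The distinction can be shown with a few lines of Python. The six responses are invented, and the numeric coding is the one suggested above (Low = 1, Moderate = 2, High = 3).

```python
from statistics import mean, mode

# Numeric values assigned to the scale points, as suggested above.
codes = {"Low": 1, "Moderate": 2, "High": 3}

# Hypothetical responses from six people.
responses = ["Low", "Low", "Low", "Moderate", "High", "High"]

numeric = [codes[r] for r in responses]

mean_score = mean(numeric)      # sum of the values divided by the number of values
most_common = mode(responses)   # the most frequent response

print(round(mean_score, 2), most_common)
```

Here the most common response is Low, but the mean sits a little below Moderate, so the two need not coincide.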

  38. Hi Jeromy.
    I saw above where someone said they had some participants with positive as well as negative z scores but needed to make a composite of the two. I have the same situation but am still confused on the procedure.

    I basically have about 1000 participants that are broken down into 3 age groups. They each have two academic scores (math and literacy) that are collected at 3 time periods (fall, winter, spring). The math and literacy scores are what I need to combine into a composite score. Therefore, I know I needed to get z scores for them since they are measured on different scales.

    So what I did was I got the mean and SD for each measure (math and literacy) for each age group at each time period, and plugged them into the z score formula in the SPSS Transform --> Compute Variable procedure. For instance, the math scores for group 1 at Time 1 had a mean of 28.99 and a SD of 14.36. So the syntax I used in SPSS was:

    IF (Group = 1) zmath1= (math1 - 28.99)/14.36 .

    When that was computed, I did the same for Time 2 and Time 3 (and plugged in the correct mean and SD for group 1 at those times) so that the syntax was:

    IF (Group = 1) zmath2= (math2 - 43.49)/15.68 .
    IF (Group = 1) zmath3= (math3 - 53.31)/16.08 .

    I repeated this same thing for the literacy values. What resulted were lots of the participants that had positive values on both math and literacy, but many also had one positive z score and one negative z score.

    For example, one participant got a z score of .47 for literacy and a -.30 for math. And you did say that the negatives should be reversed. Do I need to go through all 1000 participants for every time frame and do this individually?

    Also, I was hoping to have larger numbers to work with for easy comparison. I thought the formula for that was:
    15*(zmath1+zliteracy1)/2 + 100

    However, when I do that, the resulting figures make it look like most participants are scoring lower as the academic year goes on. For instance, one child has a time 1 score of 129.20, a time 2 score of 121.80 and a time 3 score of 116.08. This makes no sense to me, so I feel that I am doing something wrong. (For this student, all the z scores were positive).

    Your thoughts?
    Thank you so much.

  39. @Anonymous:

    1. If you are studying changes over time, you have to apply the same standardisation at each time point. For example, you could use the time 1 mean and standard deviation for literacy and math when standardising all three time points. It looks like you did not do this, and this may explain why you are not seeing improvement over time: standardising within each time point sets the mean z-score at every time point to zero.

    2. You do not have negative scales. You have some negative values. Presumably on both math and literacy, higher scores mean more ability (thus, they are both positive scales). A typical example where you need to reverse is something like reaction time, where lower scores (i.e., quicker times) mean more ability, in contrast to typical tests where higher scores mean more ability. Tests/scales need to be oriented in the same direction before summing. Your scales are oriented in the same direction so you don't need to reverse.

    3. You could do 15*(zmath1+zliteracy1)/2 + 100.
    However, you may wish to standardise the sum of zmath and zliteracy. In short, while zmath may have a standard deviation of one, the mean of zmath and zliteracy will have a standard deviation smaller than one (assuming the correlation is imperfect).
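To make the last point concrete: for two z-scored variables with correlation r, the standard deviation of their mean is sqrt((1 + r) / 2), which is below one whenever r is below one. The Python sketch below checks this on invented math and literacy scores and then shows the effect of standardising the composite again before rescaling to a mean of 100 and SD of 15.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def zscores(x):
    """Standardise a list using its own mean and sample SD."""
    m, s = mean(x), stdev(x)
    return [(v - m) / s for v in x]


# Hypothetical raw scores for six children.
math_raw = [20, 35, 28, 50, 41, 15]
literacy_raw = [12, 18, 10, 22, 25, 14]

zmath, zliteracy = zscores(math_raw), zscores(literacy_raw)
r = pearson(zmath, zliteracy)

# Mean of the two z-scores: its SD shrinks below 1.
mean_z = [(a + b) / 2 for a, b in zip(zmath, zliteracy)]
sd_mean_z = stdev(mean_z)
expected_sd = ((1 + r) / 2) ** 0.5

# Rescaling directly gives an SD below 15 ...
scaled = [15 * m + 100 for m in mean_z]
# ... whereas standardising the composite first gives an SD of exactly 15.
scaled_restandardised = [15 * z + 100 for z in zscores(mean_z)]

print(round(sd_mean_z, 3), round(expected_sd, 3))
print(round(stdev(scaled), 2), round(stdev(scaled_restandardised), 2))
```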

  40. Hi Jeromy

    Is it absolutely necessary to perform a statistical procedure like Factor Analysis to calculate the weight of a variable to be added towards the composite variable.

    I am trying to keep things simple and have just given subjective weights to my responses and then I have added them up.

    e.g. on a 3 point scale, I am interpreting "don't agree" as 0 point and "always agree" as 2 points (see below), based on the logic of relative importance (i.e. because "always agree" is most important, it gets the highest score of 2).

    don't agree = 0
    Sometimes agree = 1
    Always agree = 2

    I have ten responses, so I am adding all the response scores together. e.g. if ten people said sometimes agree then the composite score is 10.

    Is this ok?

    Many thanks

  41. Further to my above post, I forgot to mention that I am using a 5 point likert scale and the 3 point scale is just an example.

    My 5 point likert scale weighting is both negative and positive as follows:

    Positive Weight
    Not important = 0
    Somewhat important = 1
    Moderately important = 2
    Very important = 3
    Extremely important =4

    Negative weight
    Strongly disagree = -2
    Disagree = -1
    Neither agree or disagree = 0
    Agree = 1
    Strongly agree =2

    I know likert scale is coded in the way above but I am also using the codes as weights to create my composite score.

    Once I have two composite scores, I want to run correlation analysis, in particular Spearman Rank Correlation because the composite scores are based on rank (e.g. relative importance).

    Many thanks again.

  42. @Anonymous:
    I'm guessing that you have subsequently asked the questions on Stats.SE where I have provided answers:

  43. Just answered another question on Stats.SE on forming ability composites:

  44. Hi, Jeromy,
    I don't know if you can help but I have gotten myself really confused!!
    I have a number of variables from coding data (videos of children interacting) at two time points. The aim of the coding is to look at fear and emotion regulation.
    1. I have calculated average scores for each participant for each coding variable and at each time point.
    2. Running correlations and factor analysis, I have identified variables that I want to combine to create overall scores of 'fear' and 'regulation' at both time points.
    3. As some of the variables were coded slightly differently, I created z-scores of my averaged data for each variable at each time.
    4. I then combined these z-scores for each time point (by adding my z-scores together).
    5. This gives me my scores for 'fear' and 'regulation' at two times and I thought I was all done....
    but then I realised that the mean of these variables is 0 (I guess because they are made from z-scores). So when I tried to assess change over time it came out as nothing. I know you wrote a bit about this above, but I can't get my head around what I've done wrong. I'd really appreciate it if you could help! :)

    Many Thanks,

  45. @Emma A very simple strategy would be to use a consistent mean and standard deviation when calculating the z-scores for your variables (e.g., use the time 1 mean and standard deviation for both time 1 and time 2 scores). You will then be able to see change over time (if there is any).

  46. Hi Jeromy,
    Thanks very much for responding so quickly, your advice has been really helpful.
    I have done as you said and it seems to have worked!
    The only thing I didn't mention is that I am also running regressions on the data from each time point. Obviously the regressions on time 1 data are the same, but when I re-ran the regressions with time 2 variables the results were similar to before (when I standardised time 2 data with its own M and SD) but a little different. I just wanted to check that this is not a problem: is it OK to run regressions on the variables even though the z-scores are calculated using different means/SDs?
    I really hope that makes sense!?
    Many thanks again for your help!


  47. Hi Jeromy,
    I have a small sample of participants (n = 8). This is b/c they are a rare clinical group of patients. I have three different cognitive tests (with different DVs - total time for 2 tests, and total correct for 1 test), and would like to obtain a composite score for these three tests. My questions are as follows -
    1) B/c I cannot run a factor analysis, is there another approach via which I can assign weights?
    2) I do not have normative data or else I could have used them. In this instance do I take the mean and SDs based on this small sample for calculating Z scores?
    How can I use the raw data from these three tests and calculate Z scores?
    Many thanks,

  48. @Emma correlations and standardised betas shouldn't change simply by changing means and sds used for creating z-scores, but unstandardised regression coefficients should change a little bit.
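This can be verified with a small Python sketch. The time 2 scores, the outcome values, and the time 1 mean (10) and SD (2) are all invented. In a regression with a single predictor the standardised beta equals the correlation, so the correlation stands in for it here.

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def slope(x, y):
    """Unstandardised slope from regressing y on a single predictor x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Hypothetical time 2 predictor scores and an outcome variable.
x_time2 = [11, 13, 15, 12, 14]
outcome = [3, 5, 8, 4, 6]

# Assumed time 1 mean and SD used as the common reference.
T1_MEAN, T1_SD = 10, 2

# Predictor standardised on its own mean/SD versus on the time 1 values.
m2, s2 = mean(x_time2), stdev(x_time2)
z_own = [(v - m2) / s2 for v in x_time2]
z_t1 = [(v - T1_MEAN) / T1_SD for v in x_time2]

r_own, r_t1 = pearson(z_own, outcome), pearson(z_t1, outcome)
slope_own, slope_t1 = slope(z_own, outcome), slope(z_t1, outcome)

print(round(r_own, 4), round(r_t1, 4))          # identical
print(round(slope_own, 4), round(slope_t1, 4))  # differ by the ratio of the SDs
```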

  49. @Anon 20 / 8 /11:
    1. You can use z-scores
    2. Yes you could take the mean and sds from your small samples. In general converting to z-scores is a relatively simplistic approach to forming a composite. Bigger samples would yield more accurate estimates of the population means and sds and thus, more accurate estimates of what z-scores would be if population means and sds are applied, and thus more accurate approximations to equal importance as implied by population standard deviations. That said, the desire for equal importance as defined by standard deviations is often approximate. If your main aim is to put the variables on roughly the same scale, then the estimate you get of means and sds from 8 participants may well be adequate.

  50. Thanks Jeromy, your help has been really appreciated, and always clearly written and easy to understand.


  51. Thank you Jeromy for your reply!

  52. Jeromy,

    I don't think anyone in this thread has asked this variation of the question -- I am comfortable that the mean of the combined z-scores is zero, but I'd also like the combined score to have SD 1 like the original z-scores do. When I add or average individual z-scores the combined metric typically has a standard deviation substantially different from one.

    I've been dealing with that by taking the z-score of the combined z-scores, which feels like a fudge. Is that statistical malpractice? If so, is there a better way to do that?

  53. Hi Russ,
    Taking the z-score of the mean of z-scores is fine. I've done this before. By choosing to adopt the mean of z-scores as your scaling, you have adopted a scale whose meaning is defined relative to the distribution in the sample. Thus, standardising a second time just makes interpretation of that composite scale a little simpler.

  54. Hi Jeromy

    I am trying to compute factor scores based on the various factor loadings. I am a bit confused about the calculation. Are there any tips on this matter? I've been able to calculate the summated scales and the related statistics. Thanks for your help.


  55. @Roshan,
    Usually people just use their factor analysis software to save the factor scores.

  56. Dear Jeromy

    Can you please guide me how to go about it in SPSS? I am trying a couple of examples but unable to do so..


  57. @Roshan
    Most introductory treatments of factor analysis in SPSS will touch on saving factor scores.

    Here's an article that explains the process of saving factor scores:

    In SPSS the main button of interest is the "Scores" button within the factor analysis dialog box. Clicking "save factor scores" will then save additional variables to your dataset corresponding to factor scores.

  58. Thanks very much Jeromy Anglim. I was just looking for this.

  59. Once I've created composite scores from z-scores and I want to determine whether there is a significant difference between two groups I am comparing, what type of test can be done with the z-score composite? I certainly can't conduct an independent t-test.

  60. @Anonymous, There's no rule that says you can't run t-tests on composite variables. If the composite variable is normally distributed a t-test is fine. Even if it's a bit skewed, the t-test will generally be fairly robust.

    I can only guess that you are concerned that a "t-test" is not appropriate because your composite is composed of "z-scores" (i.e., because z-tests are often taught in contrast to t-tests).

  61. I'm a psychologist in Norway doing a neuropsych PhD. I'm creating z-scores as you describe above very nicely. But how can I reverse a z-score? I have a reaction time test where higher scores mean bad results. Other tests are quite the opposite: low scores, bad results. I want all the z-scores to be arranged in the standard way: low is bad, high is good. How can this be done?

    Regards, Per-Ola Rike

  62. @Per-Ola

    zero minus raw-z-score

    will reverse the raw-z-score

  63. Hello,

    What about doing second order composite scores (or a composite of composite scores)?
    For example, if I have:
    vocabulary_tot = zvocab_a + zvocab_b + zvocab_c
    comprehension_tot = zcompreh_a + zcompreh_b
    reading_tot = zreading_a + zreading_b + zreading_c
    How can I do a composite score integrating the three total scores?
    Would something like:
    Total_score = zvocabulary_tot + zcomprehension_tot + zreading_tot
    be correct?

  64. Thank you for answering. Another question: I have around 100 patients tested with neuropsychological tests. Some of the tests are validated with norms (DKEFS; WAIS III); others do not have a norm set (healthy controls). My patients have cognitive deficits. The question is how to make z-scores that "communicate" across these different tests (with and without norm sets). Do you recommend making z-scores based only on my population (100 patients)? Or should I make z-scores based on the normative data (for the DKEFS and WAIS tests), and on the other hand make z-scores for the other tests based only on my patient population (since norms do not exist)?

  65. I don't know what you mean by "communicate" across these different tests. If all you are doing is converting to z-scores to enable the creation of a composite, then from an individual differences perspective it probably won't make a big difference whether you use test norms or your sample norms, especially if your sample is a decent size like 100.

    The more your sample standard deviations differ from the norm standard deviations, the greater the difference you are likely to see in the two approaches.

    I suggest you calculate your composite both ways and check out the correlation between the two composites. My guess is that such a correlation will be very high (e.g., > .97 or some such). If this is the case, you will have further evidence that it really doesn't matter which option you choose.
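
A rough Python sketch of that check, with made-up data. The scores, the "published" norm means and SDs, and all names (`test_a`, `norm_a`, `comp_sample`, `comp_norm`, and so on) are hypothetical and are not taken from any real test manual:

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def z(scores, m, s):
    """z-scores relative to a supplied mean and SD."""
    return [(x - m) / s for x in scores]


# Hypothetical raw scores on two tests for six patients.
test_a = [35.0, 42.0, 38.0, 50.0, 45.0, 40.0]
test_b = [8.0, 11.0, 9.0, 14.0, 10.0, 12.0]

# Hypothetical norm (mean, SD) pairs, standing in for a test manual.
norm_a = (50.0, 10.0)
norm_b = (10.0, 3.0)

# Composite 1: z-scores based on the sample's own means and SDs.
comp_sample = [
    a + b
    for a, b in zip(
        z(test_a, mean(test_a), stdev(test_a)),
        z(test_b, mean(test_b), stdev(test_b)),
    )
]

# Composite 2: z-scores based on the norms.
comp_norm = [a + b for a, b in zip(z(test_a, *norm_a), z(test_b, *norm_b))]

r = pearson(comp_sample, comp_norm)
print("correlation between the two composites:", round(r, 4))
```

The two composites differ only in how much weight each test receives, which is why the correlation tends to be high unless the sample and norm SDs diverge sharply.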

  66. Hello Jeromy
    Can you please help me with this? I'm doing prediction research on reading skills. I have three sets of predictors, each of which includes only 2 measures. My regression with factor scores produces good predictive power, but with z-scores the prediction results are really weak. My question is: can I run a factor analysis based on only 2 correlated measures?
    if yes, is there any reference that i can refer to in my dissertation?
    help me please, Many thanks

  67. @Anonymous; If you are getting very different prediction with factor saved scores than with z-score composites, that suggests that there is more to your story than you have told here. It could be that you are using different variables to create your factor saved scores, or that you have issues with negative items, or any number of other reasons. At the very least, it is a red flag suggesting that you should look into your analyses a little more closely, and perhaps check the analyses with someone.

    Factor analysis based on two variables doesn't make much sense. You have issues of identifiability. If you want to make a composite, the factor analysis is not going to tell you how much you should weight each variable. It might automate any item reversal, and it would handle standardisation, but that's about it.

  68. Many thanks Jeromy. I was checking for your reply every few minutes. Do you think the "item response theory" will be helpful for identifying one factor from the 2 measures in each set? and should the measures be correlated as is the case in factor analysis? if i can rely on this method, how i can get a factor from it?
    your response is highly appreciated.

  69. @Anonymous, if you wish to combine two variables to form a composite, you need to decide how you wish to do it. There is not enough information in the two variables on their own, to determine how they should be combined. Neither item response theory nor factor analysis would overcome the identifiability issue.

    Should the measures be correlated? In general, the correlation between two items is used to justify the creation of a composite. The correlation is consistent with the two variables reflecting a common underlying factor.

    In summary, if you have two variables which you want to combine into a composite, you may want to stick with the z-score approach. In the absence of theory telling you otherwise, you probably would just weight each variable equally (perhaps after converting to z-scores, and of course, reversing if required).

  70. Thank you very much for your time and support. Wish you all the best,,

  71. Dear Jeromy,

    Sorry if this is irrelevant, but may I ask you about interaction variables? If factor analysis does not work with only two variables, then is it a possible solution to just multiply the variables together? Given that I want to include this new variable into a regression model as a predictor alongside another predictor which is a factor score? Can I enter them together in one regression?

    Also my hypothesis is that this interaction variable predicts the outcome but dependently on the factor score, so it works as a mediator for the prediction between the factor score used and the outcome? Any idea how to test that? I'm kind of lost on this point!

    I really appreciate your help and support,

    Many thanks.

  72. If it's important to know.. these two variable multiplied together are of two tasks measuring one aspect of a language, and they are significantly correlated..

    Thanks again.

  73. @Anon
    I don't see how the issue of creating a composite of variables X1 and X2 has anything to do with creating an interaction term by multiplying X1 and X2. If you want a composite, then create a composite out of X1 and X2 by making a weighted composite (converting to z-scores first is one way of handling this).

    If you are talking about creating an interaction between X1 and X2, that suggests that you see the two variables as discrete predictors, and that you would most likely want to include X1, X2, and the interaction in a regression model.

    The rest of your question is probably a bit off topic for this thread. I suggest you ask the question at .

  74. sorry you have a mentioned many cases same with my case ...but just wan be sure...
    I want to form a composite of 3 variables. The first is a yes/no test with 14 questions, which I recoded into 5 categories from 1 to 5 (1 means 0 correct and 5 means 14 correct answers). The second and third variables are Likert scales from one to five. Is it possible to just form the composite from the mean of each variable? Thanks a lot.

  75. @abid
    It might be okay. It would depend on the details. You'd have to look to see whether the standard deviations of the three variables are approximately equal to know whether they will be roughly weighted equally in the composite.

    You are standardising all the variables to ensure that they have a consistent min and max (i.e., 1 to 5) rather than a consistent standard deviation, which is what happens with the z-score approach.

    And of course, there is still the issue of whether it makes sense to combine the variables in the first place. Without knowing your research context, I'd be careful with combining Likert and ability based measures. I can think of cases, where it makes sense, but in other cases, such variables can be fairly uncorrelated.

  76. I'm happy you are here, Jeromy. Please, I want to ask something not related to composite scores.

    I need to run a MANOVA in my study via SPSS. I have 7 IVs and 3 DVs. I'm trying, but it is still running and very slow. I've waited 3 hours with no results. Why? Please help me.

  77. @abid
    please ask other types of statistics questions on

    or check out these instructions:

  78. @nokil asked me a question on a different post, but refers to this page, so I've moved the question here.
    The question was:

    "In one of your posts (dated 3/25/2009) regarding adding up z-scores to create a composite you provided the following approach:
    2b.2) add-up the z-scores:

    In SPSS this can be done using Transform >> Compute

    In syntax it might look like this:
    compute verbaltot = zvocaba + zvocabe + zcompr.

    I was just wondering why (zvocaba + zvocabe + zcompr) is NOT divided by 3 to get the average? "

    In these situations, the unit of the variable is generally not of particular interest. The correlation between (zvocaba + zvocabe + zcompr) and (zvocaba + zvocabe + zcompr)/3 is 1.0. Thus, if I'm only going to be using the variable in correlations and other standardised measures of effect, it won't make a difference whether I divide by three or not. If I want an interpretable unit for the composite, I might convert (zvocaba + zvocabe + zcompr) into a new z-score so that it has a z-score interpretation.

    1. Thanks Jeromy for the input. A little unclear: how and why convert (zvocaba + zvocabe + zcompr) into a new z-score and they're already z-score values?

    2. The sum of two or more z-scores is not a z-score. Sums of z-scores will have means equal to zero and standard deviations larger than one. So, if you want your sum of z-scores to be a z-score, then you need to standardise the sum.
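
The points in this exchange can be checked with a short Python sketch. The raw scores are made up, and the variable names simply follow the SPSS syntax quoted above (`zverbaltot` and `verbalmean` are hypothetical additions):

```python
from statistics import mean, stdev


def zscore(scores):
    """Standardise a list using its own mean and sample SD."""
    m, s = mean(scores), stdev(scores)
    return [(x - m) / s for x in scores]


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical raw scores on three verbal tests for six people.
vocaba = [12.0, 15.0, 11.0, 18.0, 14.0, 16.0]
vocabe = [20.0, 24.0, 19.0, 27.0, 25.0, 23.0]
compr = [7.0, 9.0, 6.0, 10.0, 8.0, 9.0]

zvocaba, zvocabe, zcompr = zscore(vocaba), zscore(vocabe), zscore(compr)

# Sum of z-scores, mean of z-scores, and the restandardised sum.
verbaltot = [a + b + c for a, b, c in zip(zvocaba, zvocabe, zcompr)]
verbalmean = [t / 3 for t in verbaltot]
zverbaltot = zscore(verbaltot)

print("r(sum, mean):", round(pearson(verbaltot, verbalmean), 6))
print("SD of sum:", round(stdev(verbaltot), 3))
print("SD of restandardised sum:", round(stdev(zverbaltot), 3))
```

The sum and the mean correlate perfectly, the sum has a mean of zero but an SD above one, and only the restandardised version is itself a z-score.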

  79. Thanks Jeromy for the clarification. Very helpful.

  80. hi..
    I'm using SEM as my analysis tool. My analysis provided model fit but my multivariate kurtosis was a problem. My supervisor suggested that I use composite scores to overcome it. For example, esq consists of 4 indicators (gb, info, wd, tc). Do I have to calculate a z-score for each indicator and then sum them up?


    1. Converting to z-scores and then summing is one way to form a composite. It's not the only way.

  81. Hi Jeromy,

    I am forming standardized composite scores for neuropsychological domains (e.g., verbal learning and memory). Although the means and standard deviations for each of the individual tests within the cognitive domains are zero and one respectively, the standardized cognitive domain composite scores have a mean of zero but the standard deviations are not equal to one (e.g. 0.707). Is there any way to fix this so the standard deviation is one? I want the results to be easily comparable between groups and cognitive domains. Thank you!

    1. You could always make a z-score out of the composite variable. Thus, you are applying standardisation to both the initial tests and the composite.
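
For reference, the standard deviation of such a composite follows from the variance of a sum. A sketch, assuming the composite is the mean of $k$ tests that are each standardised to variance 1, with $r_{ij}$ the correlation between tests $i$ and $j$ (symbols introduced here only for illustration):

$$\operatorname{Var}\left(\frac{1}{k}\sum_{i=1}^{k} z_i\right) = \frac{1}{k^2}\left(k + 2\sum_{i<j} r_{ij}\right)$$

With two tests this gives an SD of $\sqrt{(1 + r_{12})/2}$, so a value of 0.707 would correspond to the mean of two roughly uncorrelated tests, and the SD only reaches 1 when the tests correlate perfectly. Dividing the composite by its own SD (i.e., restandardising) returns the SD to 1.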

  82. Could you possibly tell me the reasons why individual scores might be significant predictors in a regression equation but when a composite score is computed using these scores and entered into the regression equation the regression equation is no longer significant?

    Thanks a bunch.


    1. There are a bunch of possible reasons. If you combine some variables that predict and some others that do not, then the combined variable might be less predictive overall. Thus, you've washed out the effect. This would be particularly likely if there are many more variables that don't predict than do. It could be even worse if you combined (and did not reverse) variables that predicted in opposite directions.

      Generally, there is a rationale for combining predictor variables. This is usually based on a combination of theory (the variables should be combined because in combination they represent a construct of interest) and statistical evidence (e.g., based on the variables correlating; often accompanied by PCA or factor analytic evidence). When the component variables correlate well then the problem you describe is less likely to occur.

      Finally, there is just the general issue of multiple testing. Assume that you have 100 predictors, X1 to X100, and you create a composite C1 which is the mean of X1 to X100. Now assume that none of these predictors are related to the outcome variable at the population level. However, just by chance some of these predictors are likely to be significant (5 on average at the .05 level), yet there's only a 5% chance that the composite will predict significantly.

  83. Hi Jeromy,

    I have a Likert type attitude scale measuring negative to positive attitude towards a treatment administered to 1 large group. All negative responses were coded 1 or 2 and all positive responses were coded 4 or 5. Neutral values were always 3. I want to be able to say that respondents were either positive or negative in their attitude towards each composite. Once i have created composite scores from factor analysis is it fair to say that there is a midpoint that reflects a neutral value in each composite and therefore scores below that are negative and scores above the midpoint reflect positive attitude?

    Thanks, R

    1. Factor saved scores are typically computed as z-scores that generally make it difficult to understand the meaning of the scores relative to the underlying scale (i.e., 1 to 5).
      This is one reason why it is often more intuitive to form a composite that is the mean of the original items (e.g., you know that a mean of 4 on a set of 1 to 5 items means that the person tends to score in the upper part of the scale).

      If you convert to z-scores then the zero value is relative to the distribution in the sample that was used to run the factor analysis. For example, if everyone was scoring 4 or 5 on the original items, then the zero point on the z-score would still reflect a generally positive attitude.

      It is often difficult to say exactly what a set of responses indicate especially when dealing with composites. However, in general, I think that sticking to the original 1 to 5 metric and even exploring accompanying means and frequencies on each item can be informative in understanding what absolute values indicate.

  84. Hello

    I'm in the middle of my final year undergrad psychology dissertation and I am having a bit of difficulty with data analysis.

    I'm looking at accuracy, so I have variables in SPSS relating to the correct responses for each variable and then I have variables referring to the number of false positives.

    I have been told to create corrected scores by computing a new variable and subtracting the false positives from the positive positives. When I do this, though, all the values in the new corrected score variables are negative.

    I just don't understand this corrected score business at all.

    1. It sounds like you might be interested in calculating one of the measures used in signal detection theory. Perhaps you'd be interested in calculating d-prime?

  85. Thanks Jeromy, I think you've developed a great thing here with your blogs. I am trying to figure out stats for research in social psychology of language studies.

    I also found in conjunction with your site, this other site was helpful too: I think your site has really touched on computer assisted learning, and I am really grateful!

    -Dana, USA

  86. Hello,

    I have read in the earlier posts that you can reverse/ invert z-scores. Can I do the same with raw data scores?

    Currently I have two measures (neuropsyc test scores) where scores on Measure1 increase means participants doing better and Measure2 increase is interpreted as participants doing worse.
    Would I have to convert them into Z-scores first before inverting and using the inverted z-scores for analysis?
    or can I just invert the raw scores and use that for analysis.
    Both measures are not measured by likert scales/questionnaire like.

    Please advise. Thanks!


    1. You can reverse any score whether it is a z-score or not.
      E.g., if you have a variable called X that ranges from 1 to 10, you can calculate XNEW = 0 - X and it will have a range of -10 to -1.

      The variance, sd, and range will be the same for X and XNEW. The correlation of X with Y and the correlation of XNEW with Y will be the same magnitude, but with opposite signs.

    2. Hi Jeromy, correct me if I'm wrong but I think that to reverse code X that ranges from 1 to 10, the formula should be:
      XNEW = 10 - X + 1.

    3. @Anonymous I agree that in general when creating a mean for a typical self-report scale on a common scale that you typically reverse items by taking:


      e.g., for 1 to 10 scale

      10 + 1 - SCORE.

      This is a convenient scaling and keeps all the items, reversed or not, on the same scale.

      However, any equation of the form $a + bx$ where $b < 0$, $a$ is a constant, and $x$ is the score would be sufficient to reverse the scoring (i.e., induce a perfect negative correlation between the original and the reversed score).
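
Both reversal formulas from this exchange can be compared in a brief Python sketch. The data and names (`x`, `y`, `xnew_neg`, `xnew_scale`) are made up for illustration:

```python
from statistics import mean, stdev


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


# Hypothetical scores on a 1 to 10 variable X and some other variable Y.
x = [1.0, 3.0, 4.0, 7.0, 10.0]
y = [2.0, 1.0, 4.0, 5.0, 9.0]

# Reversal by negation: XNEW = 0 - X (range becomes -10 to -1).
xnew_neg = [0 - v for v in x]

# Reversal that keeps the original 1 to 10 range: XNEW = 10 + 1 - X.
xnew_scale = [10 + 1 - v for v in x]

print("ranges:", (min(xnew_neg), max(xnew_neg)), (min(xnew_scale), max(xnew_scale)))
print("SDs:", round(stdev(x), 3), round(stdev(xnew_neg), 3), round(stdev(xnew_scale), 3))
print("r with Y:", round(pearson(x, y), 3), round(pearson(xnew_neg, y), 3))
```

Both versions correlate -1 with the original and have the same SD. They differ only in the constant $a$, which matters when reversed and unreversed items need to share a response scale.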

  87. Thanks Jeromy!


  88. Hi, Jeromy. Ok, my factor analysis results are drawn from the same measure. However, I found a three factor solution with a different number of factors loading on each factor. The third factor showed poor fit. I am trying to decide how to construct the composite variables since I have unequal positive loadings. Do I just count loadings over .4? Do I add up the top 5 loading variables? Or do I just follow the factor analysis?

    I don't have the raw data, so I can't do the factor saved route.

    Let me know your thoughts. I am trying to get this analysis done pronto for a presentation....

  89. @Rae It depends. If the test is well established, often researchers will just use the factor solution suggested by the developers of the test. This can make your results more comparable with others. Furthermore, factor solutions can be a little unstable, from sample to sample. So if it is a well developed test and the obtained factor solution is similar, even though not exactly the same, then you might want to go with the standard composite.

    Once you move away from the standard solution, you move into the realm of art and choices. You say that you don't have the raw data, but you could probably do something similar to factor saved scores by doing something like weighting the items by their loadings (assuming they are all on the same response scale and have similar sds).

    Alternatively, as you are considering, you could create a unit weighted composite where you make black and white decisions about which items are included and which are not. If you do that, there are a range of plausible decision rules that you could employ. There is often a trade-off between including enough good items and the quality threshold you set in terms of high loadings and the absence of cross-loadings. Loadings over .4 seems like a reasonable starting point, but it can be a bit of an art form.
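
The two options above (loading-weighted versus unit-weighted) might be sketched in Python as follows. This assumes item-level responses are available for scoring, that all items share a response scale, and that all loadings are positive. The loadings, responses, and function names are hypothetical:

```python
# Hypothetical loadings of five items on one factor.
loadings = {"item1": 0.72, "item2": 0.65, "item3": 0.48, "item4": 0.35, "item5": 0.10}

# Hypothetical responses from one person on a 1 to 5 scale.
responses = {"item1": 4, "item2": 5, "item3": 3, "item4": 2, "item5": 1}


def loading_weighted(responses, loadings):
    """Weighted mean of item responses, with loadings as weights."""
    total_weight = sum(loadings.values())
    return sum(loadings[k] * responses[k] for k in loadings) / total_weight


def unit_weighted(responses, loadings, threshold=0.4):
    """Simple mean of the items whose loading meets the threshold."""
    kept = [k for k, value in loadings.items() if abs(value) >= threshold]
    return sum(responses[k] for k in kept) / len(kept)


lw = loading_weighted(responses, loadings)
uw = unit_weighted(responses, loadings)
print("loading-weighted:", round(lw, 3))
print("unit-weighted (loadings >= .4):", round(uw, 3))
```

The `threshold` argument makes the black-and-white inclusion rule explicit, so alternative cut-offs can be compared easily.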

  90. I looked through the comments to see if this question was answered already, but if I missed it - I'm sorry! Please just direct me to the questioner's name and I'll read.

    I'm doing a meta-analysis and I'm looking at mindfulness and self-compassion. In mindfulness research, there's a measure called FFMQ which has multiple subscales. The standard self-compassion measure also has multiple subscales. The problem is that sometimes authors only report means and standard deviations for the subscales and do not report information on the composite scale or even correlations of the composite scale with depression. I have tried contacting the authors, but only one has answered (and it was to tell me that he was retired). Since I don't have the original data, is there any way I can use the means and standard deviations for subscales reported in these articles to make a composite scale?

    1. I'm starting to understand what you are asking, but I'm still not clear about a few things.

      Perhaps you could post your question either

      here if you think it's more statistical

      or here if you think it's more psychological

      and post back a link to the question.
      We can then engage in a bit more dialogue about what you are asking exactly.

  91. Hi
    This is very concise and easy to follow, thank you so much.

    I have one small question. After making my composite measure (of three factors from my factor analysis), how do I get a reliability alpha for this new measure? I put the z-scores for the 3 scales into a reliability test and an error came up. Now I am confused, as I have to report an alpha...

    Any hint would be much appreciated. I tried going back to the factor analysis and manually created 3 scales as per the saved scores, and I then combined these in a reliability test and got an alpha. However, something tells me this is wrong, as I didn't use the z-scores. Am I doing this right?

    1. Factor saved scores are often designed to be orthogonal (i.e., uncorrelated). Internal consistency measures of reliability such as cronbach's alpha use the correlation between items to estimate reliability. This might be part of the problem.

      Ultimately I don't quite understand what you're trying to do (i.e., what are your items, and how do these relate to the z-scores, and your final scale):
      So you've made three component composite scores out of some items and then you want to create a higher level composite out of the component composites?

      Or are you really wanting to get the reliability of each component scale? This would be the more common situation when doing factor analysis (i.e., you have a measure with three scales and you want to get the reliability of each scale).

      Also, if you are getting an error message, it is generally clearer, if you say what error message you are getting.

  92. big thanks clearly time and effort went into this page, saved me loads of time thanks you

  93. respected sir
    I want to standardize self made ict perception scale. Already i have formulated 32 items under 4 dimensions. responses are ranging from strongly agree to strongly disagree in five point scale. responses have been taken from forty teacher educators. how can i standardize it in spss. please suggest.

    1. I imagine you would want to create four separate scales. In such cases when all items are on the same response scale (e.g., 1 to 5), there is no need to standardise. I have a separate post that relates to typical issues related to scoring a multiple item personality test. It provides an example of computing scores in SPSS:

  94. Dear Jeromy,

    I have to use the scores for a scale and I'm not sure if transforming into z-scores is the right option. It's a scale with four items, and each item must be given a number of points from 1 to 100 according to its importance, such that the points across all the items sum to 100. So 100 points are split between four scales according to their importance. If I want to use just one scale, do I standardize its score and use it, or is this approach wrong?
    Thanks in advance for an answer!

  95. So it sounds like your composite is the weighted sum of four items, where the weights sum to one. Thus, each item is on a 1 to 100 scale and the weighted sum will also be on a 1 to 100 scale.

    In general, this approach is fine. Because each item is on a 1 to 100 scale, each item will most likely have a similar standard deviation. Thus, there won't be much difference between using z-scores and raw items.

    The main thing is to justify the weights.

  96. Dear Jeromy, thanks for your amazing blog.

    I am now doing a longitudinal study. To make the composite, should I use the mean and SD across the 2 or 3 time points (I now have two) for each variable to make the z-score, and then sum it with the other variables' z-scores? Have I understood correctly? Do you have a paper which I can cite regarding this issue?
    thanks in advance,

    1. The key thing is to use a consistent standardisation over the time points. Obviously if you were to standardise within time points, you would remove any effect of time, because the means for each time point would be zero. I think that's sufficiently obvious that a reference is not required.

  97. Hi Jeromy,
    I was wondering if you know any sources to cite that give some indication of how highly variables should correlate to justify forming a composite. I have two variables that are widely assumed to measure the same construct, with a correlation of about .5. I would like to be able to cite something saying a correlation of .5 is generally agreed to be high enough to form a composite. Thanks!

    1. I wouldn't trust such a reference. Any argument for forming a composite should combine information from various sources: theoretical, conceptual, correlational, statistical. That said, I imagine the intercorrelations of many ability subscales on things like the WAIS would be in the .4 to .7 range.

  98. Hi there, thanks for this site!

    I'm trying to make an index based on factor loadings, but I'm turned around due to the complexity of the dimensions. We created 3 separate subscales (A, B, C), but within each is 3 identical components (1, 2, 3) - so 9 subscales in total (A1, A2, A3; B1, B2, B3; etc.)

    Ultimately, we'd like to have one factor score (or maybe two - our first and second subscales are more closely related than I expected, but the third is definitely uncorrelated). If I run a FA and save the scores, I end up with a score for each factor. How do I combine the factor scores? Also, these items are Likert items - is there a way to combine the factor scores so they are meaningful for interpretation?



    1. Hi Stacey, I'm not quite clear on your terminology for items, subscales, scales and factors.
      Perhaps post a question to following this process and submit the link here.

  99. Hi Jeromy ,
    I'm Sanjay from India . Gone through few posts and I must say it's a great job by you. Small doubt , need your help.
    Bit emergency. Will appreciate a quick response.

    I've got a doubt regarding standardizing scores of a psychometric test.
    Firstly, the details of the test are as follows:
    It's a 96 item questionnaire with all items being measured on the same scale(1-5) divided into 3 categories. category 1 has 40 questions , category 2 has 40 questions , category 3 has 16 questions.

    There is no problem with scoring whatsoever. The score sheet is really clear with total scores in all the categories and the ultimate total score which is nothing but adding category1 +cat.2+cat.3 scores.

    To standardize, I've used z-score method that calculates z-score on the total ultimate score , and then took a cue from IQ tests that use mean 100 and std deviation of 15 to arrive at a standardized score for any given individual.

    My doubt is :

    The current situation: when a candidate takes a test , he doesn't know it's of 3 categories. It appears to him as if he's attempting a straight 96 item questionnaire .

    So, to arrive at final standardized score that's done like this in my case: 100+(z-value*15) . I would like to know if this z-value is to be obtained after calculating individually z-scores for all the 3 categories , then sum them up and look a value in z-table for that summed up z-score or else, my current method of obtaining only 1 z-score will suffice ?
    Awaiting your expert views on it .

    1. I don't understand why you need to standardise. As you say, the test manual is clear on how to create subscale and overall scores. In none of these steps do you need to create z-scores. If you want to report normative data on either subscales or overall scales, presumably the test manual provides information. For example, you might want to convert raw-scores to normative percentiles or some such.

      In general, most of the points in this post about converting to z-scores and so on are not especially relevant to creating scale scores for standard self-report psychological scales that use closed ended questions (e.g., 5-point, 7-point, etc.).

  100. Thanks Jeromy for the quick and simple clarification to some one who's just started learning statistics .

    The main aim was to slot a candidate into one of 3 classes: High Performer, Average Performer, Low Performer. The whole approach of calculating z-scores and converting them into a standardized score using 100+(z-value*15) was done by taking a cue from IQ tests, which already employ this method to classify people: scores between 85 and 115 fall into 'average', >115 into 'high performing', and <85 into 'low performing'.

    If that's not the right way of slotting people into the intended categories, then what could be the way out/ what are the things to be taken care of to classify test takers into these 3 categories ?

    Hope I'm not taking away too much of your time .


    1. A few things:

      * You should think about what norms you want to use to define the mean and standard deviation. It sounds like you are using the mean and standard deviation from your sample, which is fine if that is what you want. However, you may in the future get other samples. So, then you'd need to decide if you were going to use the original sample or the newer sample. Alternatively, you could use some normative data from the test manual or some such.

      * With regards to the IQ approach to categorisation, it's equivalent to just looking at z-scores where you are classifying -1 to 1 as average and above 1 as high performing, and below -1 as low performing. Using the standard deviation as the basis for cut-offs seems like a reasonable heuristic, but you certainly lose a lot of information in the categorisation process. For many purposes you would be better using the continuous version of the variable rather than the three category version.
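
The scaling and cut-offs discussed in this exchange can be sketched in a few lines. This is a minimal illustration in Python rather than SPSS; the raw totals are made up, and using the sample's own mean and standard deviation as the norm is an assumption (normative values from a test manual could be passed in instead).

```python
from statistics import mean, stdev

def iq_style_scores(raw_scores, norm_mean=None, norm_sd=None):
    """Rescale raw totals to a metric with mean 100 and SD 15.

    If no norms are supplied, the sample's own mean and SD are used
    (an assumption; norms from a test manual may be more appropriate).
    """
    if norm_mean is None:
        norm_mean = mean(raw_scores)
    if norm_sd is None:
        norm_sd = stdev(raw_scores)
    return [100 + 15 * (x - norm_mean) / norm_sd for x in raw_scores]

def classify(score, low=85, high=115):
    """Three-way split: below 85, 85 to 115, above 115 (i.e. z of -1 and +1)."""
    if score < low:
        return "Low performer"
    if score > high:
        return "High performer"
    return "Average performer"

# Hypothetical raw totals on the 96-item questionnaire
raw_totals = [52, 61, 70, 74, 80, 88, 95]
scaled = iq_style_scores(raw_totals)
labels = [classify(s) for s in scaled]
```

Note that the three labels discard most of the information in `scaled`, which is the point made above about preferring the continuous score.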

  101. Hi Jeromy, thank you for this wonderful blog and the above tute.

    I have a question:

    1. When you compute a composite z score (a z score from 2 or more tests), do you simply add up the z scores as you have suggested above? I read somewhere that one should divide the summed z score by the number of constituent tests. Elsewhere, I read that you have to compute a z score of the composite score!

    Thanks for your advice.

    1. So you are talking about three options: sum, mean, and restandardisation of the sum.

      All three options will be perfectly correlated. If you don't care about the metric of the composite then it doesn't matter.

      However, if you want the composite to be a z-score, then you should standardise the sum.
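
A small check of the claim that the three options are perfectly correlated, written as a Python sketch with made-up scores on two hypothetical tests. Only the restandardised sum ends up with a standard deviation of 1.

```python
from statistics import mean, stdev

def zscores(values):
    """Standardise using the sample mean and sample SD."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical raw scores for six people on two tests
test_a = [12, 15, 9, 20, 17, 11]
test_b = [30, 41, 28, 39, 45, 33]

za, zb = zscores(test_a), zscores(test_b)
z_sum = [a + b for a, b in zip(za, zb)]   # option 1: sum of z-scores
z_mean = [s / 2 for s in z_sum]           # option 2: mean of z-scores
z_restd = zscores(z_sum)                  # option 3: restandardised sum

r_sum_mean = pearson(z_sum, z_mean)
r_sum_restd = pearson(z_sum, z_restd)
```

Both correlations equal 1 up to rounding error, because the mean and the restandardised sum are just linear rescalings of the sum.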

  102. Thank you for the clarification Jeromy.

  103. Jeromy, I have another question:

    If, for example, I have 'x' cases and 'y' controls, and have several test scores on the cases and controls, how should I compute 'z' scores?

    1. for cases and controls separately?
    2. for cases and controls together?

    (i.e, for each group, should I use the groups mean and SD or can I use the mean and SD of the cases and controls together?)

    Thank you in advance for your help.

  104. If you want to compare groups on z-scores, you have to use a common mean and standard deviation across groups; otherwise you will find that your group differences disappear. So, in general, compute z-scores using the entire sample. One slight variant on this is that you could use the pooled within-group standard deviation as opposed to the overall standard deviation.
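
To see why separate standardisation hides the group difference, here is a minimal Python sketch with made-up scores for five cases and five controls. Centring the pooled-SD variant on the overall mean is one possible choice, not something prescribed above.

```python
from statistics import mean, stdev, variance

# Hypothetical test scores
cases = [18, 20, 22, 25, 19]
controls = [26, 28, 24, 30, 27]

def standardise(values, centre, scale):
    return [(v - centre) / scale for v in values]

# Option 2 from the question: one mean and SD from the whole sample
everyone = cases + controls
overall_mean, overall_sd = mean(everyone), stdev(everyone)
z_cases = standardise(cases, overall_mean, overall_sd)
z_controls = standardise(controls, overall_mean, overall_sd)

# Option 1 from the question: each group standardised on itself.
# Both groups end up with mean 0, so the difference vanishes.
z_cases_within = standardise(cases, mean(cases), stdev(cases))
z_controls_within = standardise(controls, mean(controls), stdev(controls))

# Variant: overall mean, but pooled within-group SD as the unit
n1, n2 = len(cases), len(controls)
pooled_sd = (((n1 - 1) * variance(cases) + (n2 - 1) * variance(controls))
             / (n1 + n2 - 2)) ** 0.5
z_cases_pooled = standardise(cases, overall_mean, pooled_sd)
z_controls_pooled = standardise(controls, overall_mean, pooled_sd)

gap_overall = mean(z_controls) - mean(z_cases)
gap_within = mean(z_controls_within) - mean(z_cases_within)
gap_pooled = mean(z_controls_pooled) - mean(z_cases_pooled)
```

When the groups differ in mean, `pooled_sd` is smaller than `overall_sd`, so the same raw gap looks larger in pooled-SD units.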

  105. Hi Jeromy

    This blog has been an amazing help to me so far. Thank you so much.

    Would you mind clarifying why it might be more desirable to reverse code by multiplying original item by -1 rather than reversing the sequence order? Or is multiplying by -1 only done for zscores but actual item scales should be reversed coded the usual way?

    Any advice you can provide would be greatly appreciated! Thanks again!

  106. Hi Jeromy -

    Great blog and thanks for all of your time/effort!

    My problem is not with creating a composite score, but with trying to wrap my head around an existing one and work with it. We are working with a large dataset that uses a composite variable (globcog = global cognition), which is described as a Z-score created by averaging z-scores from 17 different cognitive scales. (So, technically, I'm guessing it is not really a Z-score, but rather an average, as you have described. Mean at baseline is 0.17 and STD is ~0.6.) So far, so good. The data is longitudinal, so I'm assuming that all of the Z-scores are calculated using the baseline values.

    Here is my problem: I am trying to do power calculations using this data, and am being asked to consider the "rate of change" of this variable among various subgroups. More specifically, the question is: What sample size would be necessary to detect a 25% reduction in the rate of change in globcog at 3, 5, and 10 years?

    When first presented with this task, I tried to calculate rates of change as a % as per the standard formula: 100* (Endpoint - baseline)/baseline. I quickly realized that this didn't make sense with the Z-scores, because of a.) negative numbers and b.) zeroes. For example, anyone who started with a negative value and ended with a positive one would show as having a negative change; anyone who ended with a value of 0 would have a 100% change, no matter what they started with; anyone who started with a value of 0 would have an infinite change, etc.

    So I was advised to just use the absolute change instead of the percent. Mathematically, of course, I can do this, but I just don’t feel confident that it is really legitimate. (Maybe the whole percentage thing scared me too much!) So here are my specific questions:
    1. Is it appropriate to compare the absolute change in this composite variable between people, or groups? (E.g., women had a mean decline of 0.36 over 5 years, while men had a mean decline of 0.41). I’m guessing this is okay….
    2. Does it make any difference if we translate this into a “rate” by dividing both numbers by 5, to indicate a “mean annual decline”? Decline is probably not linear, but still probably okay in theory….
    3. Can we take a % of these declines, as per the P.I.’s instructions? That is, if the overall mean change in 5 years was -0.38 with a STD of 0.6, could I then say that reducing that decline by 25% would result in a mean change of (-0.38 * 0.75) = -0.285, and then plug those two values (-0.38 and -0.285) along with a STD of 0.6 (assuming the STD wouldn’t change much) into a Power Calculation calculator to get a sample size estimate?

    My concern here is that by the time we have z-scored 17 things, averaged those z-scores, then measured that variable over time, what we have is at best an “interval” value, but to me, intuitively, it seems even more like just an “ordinal” value. It is certainly NOT a “ratio” value, so does it even make sense to talk about a “25% reduction” in the “rate of change” of this measurement?

    Would love to hear your thoughts on this before I proceed! Thanks very much.

  107. I've just had a quick read. It sounds like you understand what's going on. As you note, percentage reduction does not make sense on a variable that lacks an absolute zero. Cognitive ability is typically conceptualised as a latent variable. It is often assumed to have a normal distribution. In this sense, you could see it as an interval variable.

    You can certainly model the effect of time on the composite z-score measure of cognitive ability. I think your issue is that you want your statements to have meaning. You may find it useful to convert the composite z-score into an actual z-score based on the standard deviation at time 1. Then, change over time will be on a meaningful z-score metric. Another alternative metric would be intelligence scores (i.e., mean = 100, sd=15), although there would be questions of what norm the value of 100 relates to.

    So avoid percentage change statements, and instead talk about change in standard deviation units.

    1. It is fine to compare change by groups: that would be some form of time by group interaction in ANOVA, random-effects modelling, etc.

    2. You can code time in a variety of ways. E.g., if you have observations for baseline plus five years, you could code it 0, 1, 2, 3, 4, 5, or 0, .2, .4, .6, .8, 1. How you code time will influence the meaning of any coefficient. So in short it's fine to try to get an average yearly change.

    3. Don't use % decline. With regards to power calculations, just use mean change. I think G*Power would allow you to examine time by group interaction effects.
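
For a rough sense of scale, the numbers quoted in the question can be put through a textbook normal-approximation formula for comparing two group means. This Python sketch is not the time by group interaction analysis suggested above and is no substitute for G*Power; it assumes two independent groups, a two-sided alpha of .05, 80% power, and that 0.6 is a reasonable standard deviation for the change scores, none of which is established in the thread.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference `delta`
    between two independent groups (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2 * (sd * (z_alpha + z_power) / delta) ** 2)

# Values quoted in the question above
untreated_change = -0.38                     # mean 5-year change, no intervention
treated_change = untreated_change * 0.75     # decline reduced by 25%
difference = treated_change - untreated_change

n = n_per_group(abs(difference), sd=0.6)
```

Because the required sample size grows with the inverse square of `delta`, halving the targeted difference roughly quadruples `n`.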

    1. Thanks so much for your very prompt reply!

      You wrote, "I think your issue is that you want your statements to have meaning." Yes, yes, yes!

      Alas, the same qualities that have always made me very successful as a student seem to cause me grief in the working world, even when that world is academia. :( Thanks for understanding!

    2. Okay, now I've gone beyond the stats question to more of a thought experiment. In fact, I'm playing devil's advocate by now arguing the OPPOSITE side of my prior argument, and suddenly I'm confused.

      I am in agreement that it doesn't make sense to talk about "25%" of a value that doesn't have an absolute zero. For example, using temperature (in degrees F or C) as our scale, we can't talk about 25% of 40 degrees (because what would be 25% of 0 degrees?) Similarly, we can't talk about a CHANGE of 25%: if it's 40F degrees on Tuesday, and 0F degrees on Saturday, has the temp dropped 100%? Obviously not, since if we used Celsius instead we would go from 4.4C to -17.8C, a drop of over 500%.

      BUT: Could we talk about a % change of a change?

      Here’s an example: Say you are concerned about the poor insulation in your attic. You complain to your spouse/roommate/landlord about this, but s/he ignores you. So you first take measurements to show how bad the problem is. Every day for two months in the summer, you measure the temperature in the attic at 8:00 a.m. (which is usually comfortable, close to the temp in the rest of the house), and then you measure it again at 5 p.m., when it is much hotter. After 60 days, you calculate that the mean daily temperature change is 24 degrees F (say, from 80F to 104F here in Philadelphia.) You present this info, and the response is, "Okay, but I can't put in enough insulation to keep the temperature constant all day - that would be way too expensive!" So you say, "I understand. But could you at least insulate it enough to cut that temperature increase in half?" Here, half of 24 degrees F is 12 degrees F; you are saying you could live with that level of average increase (e.g., from 80F to 92F.) Maybe the antique books you store in the attic won’t suffer as a result of that level of daily temperature fluctuation.

      In Melbourne I presume you would present the same scenario as an intolerable mean daily increase of 13.3 degrees C, from 26.7 to 40C, which might be acceptable if it were cut in half: from a daily increase of 13.3 to a daily increase of just 6.65 degrees C. Thus, an average afternoon would top out at 26.7 + 6.65 = 33.35C = 92 F. In other words, the same PERCENT CHANGE OF THE MEAN CHANGE gives the same results no matter which scale you use, C or F.

      This “% change of a change” is really what the PI is asking me to use in doing the power calculations for the “Global-Cog” measurement. The population here is people with Alzheimer’s and other neurodegenerative diseases. There is no known cure for these conditions, which inevitably lead to declines in memory and other measures of cognition.

      But: we want to do a study to see if our proposed intervention will reduce the RATE OF DECLINE (or the mean total decline over Y years) by, say, 25%.

      So now I’m thinking that this actually DOES make sense; that is, if the mean decline in our composite variable GLOBCOG is X points per year, saying we hope to change that to 0.75X is just as legitimate as saying that we want to change it to X-Z.
      Or is it?!?

      Dina (now with a headache…..)

    3. Percentage change of a change on an interval variable sounds fine.

      I've often talked about this in the context of personality faking research.
      For example, the presence of a warning not to fake might reduce the change from an honest to an applicant condition from d = .5 (without warning) to d = .3 (with warning), so the warning reduced faking by 40% (i.e., (.5 - .3) / .5). Even though the dependent variable is interval at best with no absolute zero, this still makes sense.
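
The attic example can be checked numerically. The Python sketch below uses the temperatures from the comment above and shows that a percentage change in the level depends on the scale, while a percentage reduction in the change does not, because the additive constant of the Fahrenheit to Celsius conversion cancels in a difference and the multiplicative constant cancels in the ratio.

```python
def f_to_c(f):
    """Convert degrees Fahrenheit to degrees Celsius."""
    return (f - 32) * 5 / 9

def pct_change(start, end):
    """Percentage change in the level itself (needs a true zero to be meaningful)."""
    return 100 * (end - start) / start

def pct_reduction_in_change(start, end_before, end_after):
    """Percentage by which the change from `start` has been reduced."""
    change_before = end_before - start
    change_after = end_after - start
    return 100 * (change_before - change_after) / change_before

# Attic temperatures from the example: morning, afternoon, afternoon after insulation
morning_f, afternoon_f, insulated_f = 80.0, 104.0, 92.0
morning_c, afternoon_c, insulated_c = (f_to_c(t) for t in (morning_f, afternoon_f, insulated_f))

level_f = pct_change(morning_f, afternoon_f)   # differs from level_c
level_c = pct_change(morning_c, afternoon_c)

reduction_f = pct_reduction_in_change(morning_f, afternoon_f, insulated_f)
reduction_c = pct_reduction_in_change(morning_c, afternoon_c, insulated_c)

# The faking example: d = .5 without a warning, d = .3 with one
faking_reduction = pct_reduction_in_change(0.0, 0.5, 0.3)
```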

  108. Hi Jeromy,

    Happy new year and me again! I tried, however, the response (yours) was on R software, which I wouldn't know from a bar of soap. I am using SPSS and GraphPad Prism.

    Thanks for your help.
    I have composite Z scores for a few cognitive tests. The Z scores are not normally distributed. If I log transform (log 10), the values less than 0 are not transformed. Funnily, the normality plot is nearly Gaussian in appearance; however, the KS, D'Agostino and Shapiro-Wilk tests all show that the data is not normally distributed (p<0.02 to <0.0001). Hence, as a rule of thumb, I used a log transformation rather than SQRT or the other methods. What am I doing wrong? How best do I proceed? Your suggestion to add a constant so the minimum value is '1' makes perfect sense, but I am not sure how to do it in SPSS or GraphPad Prism.

    Please help and a million thanks in advance.
    Regards, Vaidy

    1. In SPSS use "Transform - Compute" to create a new variable. You can do basic arithmetic on existing variables in addition to applying a wide range of functions.
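
The arithmetic involved is simple: subtract the minimum and add 1, then take the log. A minimal sketch in Python with made-up composite z-scores (the same expression could be typed into a compute dialog):

```python
import math

def shifted_log10(values):
    """Add a constant so the smallest value becomes 1, then take log10."""
    shift = 1 - min(values)
    return [math.log10(v + shift) for v in values]

# Hypothetical composite z-scores, including negative values
composite_z = [-1.8, -0.6, -0.1, 0.0, 0.4, 1.1, 2.9]
logged = shifted_log10(composite_z)
```

After the shift every value is at least 1, so the log is defined for all cases and the smallest case maps to approximately 0.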

  109. Hello Jeromy! This post is a great resource and I've learned a lot from it. For creating some basic (i.e., equally weighted) composite variables it was perfect! Now I'm trying to take the OPQ32r and map it to a 6 factor model of personality. I have the sten scores for the 32 components of the OPQ. The publishers manual provides a mapping to the 6 factor model (i.e., the rotated component matrix). I am trying to create composite variables using the components that have weights above .3 (including negative weights; i.e., humility has a negative relationship with extraversion). Once I create the composite variables, however, the means and standard deviations are no longer meaningful. I've seen people mention that you should rescale your composite to make it meaningful/interpretable again, but the only example that I found was in R and I couldn't follow it at all. I'm using SPSS. Any tips or guidance would be much appreciated. Thanks!

  110. There are several options; the simplest would be to convert the composite to a z-score using Analyze - Descriptive Statistics - Descriptives and ticking the option to save standardised values as variables.
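
As a rough illustration of that rescaling step, here is a Python sketch that builds a weighted composite and then standardises it. The scale names, weights and sten scores are placeholders, not values from the OPQ32r manual.

```python
from statistics import mean, stdev

# Placeholder loadings for one factor (hypothetical, not from any manual);
# a negative weight means the scale counts against the factor.
weights = {"scale_a": 0.7, "scale_b": 0.5, "scale_c": -0.4}

# Hypothetical sten scores for five people
people = [
    {"scale_a": 7, "scale_b": 6, "scale_c": 3},
    {"scale_a": 4, "scale_b": 5, "scale_c": 8},
    {"scale_a": 9, "scale_b": 7, "scale_c": 2},
    {"scale_a": 5, "scale_b": 3, "scale_c": 6},
    {"scale_a": 6, "scale_b": 8, "scale_c": 5},
]

def weighted_composite(person, weights):
    """Weighted sum of the scales that define the factor."""
    return sum(weights[name] * person[name] for name in weights)

raw_composite = [weighted_composite(p, weights) for p in people]

# Rescale so the composite has mean 0 and SD 1 in this sample
m, s = mean(raw_composite), stdev(raw_composite)
z_composite = [(r - m) / s for r in raw_composite]
```

The raw weighted sums have no natural unit, but `z_composite` can be read directly as standard deviations above or below the sample mean.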

  111. Just came across this post: Have you seen composite scores based on quartiles? I have seen tests which cut the scores into quartiles and give them points, like 12. What is the advantage of doing it this way? Thank you for your postings!

    1. I'm not really sure what you are talking about. I know that when people get reaction time data with many observations per participant, they sometimes extract individual quantiles to reflect aspects of each individual's reaction time distribution. This gives you more than just the mean, which is particularly useful where the data is non-normal or contains outliers, or where the variance is relevant and differs between people.

  112. I know composites give an overall assessment of multiple tests. But are there any statistical benefits to using composites?

    I have noticed that my composite score turned out to be significant, but the individual components are not. I used the z scores to calculate composites of 3 variables.

    Why is this? Thank you so much

    1. You are going to get a different variable when you take a composite. Often it will be a more reliable variable.

      It's certainly possible to get a significant composite and for all the components to be non-significant.
      There are several explanations for this.
      The general statistical answer is that if several of the components are close to significant and in the same direction (e.g., p=.09, p=.1, p=.15) then the combined variable may be enough to be statistically significant at .05.

      Equally, you could get the opposite where one or more of the components is significant but the composite is not.

      It's interesting to ponder whether the underlying effect size would be larger for composites or components. In general the increased reliability associated with composites and their more general nature, may lead to generally larger effect sizes. But this is certainly not guaranteed.

      Another nice thing about the composite is that it provides an overall test. When you have many component variables, you have issues of multiple testing (i.e., with associated increased risks of Type I errors).

    2. Is it because the standard deviation is similar with z scores? But when you combine scores by adding, the standard deviations could be very different.

      Also, what do you do when someone couldn't do the test or scores a zero (like balance)? This test is time based, so if I give zero it means she did it very, very well. What do you suggest in this case? Thank you so much

    3. The standard deviation of component variables is the same with z-scores. Thus, composites of z-scores effectively weight components equally.

      I'm not sure what you mean by could not do a test. You could mean did not sit the test, which means you have missing data. Or you could mean failed the test, which presumably means you have an issue of how to score a test where people vary on both how well they did something and whether they could do it at all.

      With missing data, there are many approaches. One simple one is just to take the mean score for the data you do have on the person, as long as they still have a certain number of scores. This is equivalent to imputing the person mean to the missing data. That said, missing data analysis is a complex issue.

      If it's an issue of how to score a test where you have some that failed, then that is really going to require some domain specific knowledge. In general, if you want to put all these people on the one scale, then I think at the very least someone who can not do a test at all should generally get a score worse than those who could do it but did it poorly.
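
The simple missing-data rule described above (take the mean of whatever scores a person has, provided enough of them are present) can be sketched as follows. This is Python rather than SPSS; the threshold of two valid scores and the z-score values are arbitrary illustrations.

```python
def person_mean(scores, min_valid):
    """Mean of the non-missing scores, or None if too few are present.

    `scores` uses None for a missing test; `min_valid` is the smallest
    number of observed scores accepted (a judgement call).
    """
    valid = [s for s in scores if s is not None]
    if len(valid) < min_valid:
        return None
    return sum(valid) / len(valid)

# Hypothetical z-scores on three tests for three people
complete = person_mean([0.5, -0.2, 1.2], min_valid=2)
one_missing = person_mean([0.5, None, 1.5], min_valid=2)
too_sparse = person_mean([0.5, None, None], min_valid=2)

# Same result as filling the gap with the person's own mean
imputed = [0.5, one_missing, 1.5]
imputed_mean = sum(imputed) / len(imputed)
```

This covers only tests that were not sat. A test that was attempted and failed is a scoring question, as discussed above, and should not be treated as missing.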

  113. Ahh i see.

    I meant the subject couldn't perform the test (single leg balance). She couldn't even hold it for a second. The actual test gets around this by using a scale which gives points from 1-100, so for an unattempted task you get 0 points. But I don't have their scores, hence the problem.

    And you should write an E book. I will buy it.