Analysis of a Multiple Choice Test

This post discusses how to perform a basic reliability analysis of a multiple choice test.

Analysis of ability test items is a big topic in psychometrics. This post aims to provide some basic advice to get an interested reader started. Common examples include multiple choice knowledge tests used in educational or training settings, and multiple choice cognitive ability tests used in many psychological experiments and in educational and selection settings.

A basic procedure:
1. Score items
2. Get basic item information: percentage choosing each response option; percentage correct; corrected item-total correlations
3. Choose items to retain
4. Get item information on the revised scale and calculate reliability
5. Get total scores for individuals

The example:
I'm going to use a 53-item test that serves as a practice example in a subject that I teach.

1. Score items
I set out in a previous post how to convert raw response variables to variables that indicate whether a response was correct (1) or incorrect (0). The syntax is explained in the earlier post and shown again below. Note that any response that does not match the key, including a missing response, is scored 0.

DO REPEAT  xraw = ability1 to ability53/ 
  xkey = 3, 2, 4, 1, 2, 2, 1, 1, 1, 3, 4, 
         4, 1, 2, 4, 2, 4, 1, 2, 4, 1, 4, 
         2, 1, 2, 1, 2, 2, 4, 2, 1, 2, 4, 
         3, 4, 1, 2, 1, 2, 2, 1, 4, 3, 2, 
         1, 4, 4, 1, 3, 2, 1, 2, 3/
  xscore=score1 to score53.
COMPUTE  xscore = 0.
IF ( xraw = xkey ) xscore = 1.
END REPEAT.
execute.
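
For readers working outside SPSS, the same scoring logic can be sketched in Python. This is only an illustration and not part of the original analysis: the four-item answer key, the responses, and the function name score_items are all made up for the example.

```python
# Hypothetical answer key and raw responses (options coded 1-4), for illustration only.
answer_key = [3, 2, 4, 1]
raw_responses = [
    [3, 2, 4, 1],     # every response matches the key
    [3, 1, 4, 2],     # two responses match the key
    [None, 2, 2, 1],  # one missing response and one wrong response
]


def score_items(responses, key):
    """Return 1 where a response matches the key and 0 otherwise.

    As in the DO REPEAT block, a missing response is scored 0.
    """
    return [1 if resp == correct else 0 for resp, correct in zip(responses, key)]


# One list of 0-1 scores per respondent.
scored = [score_items(person, answer_key) for person in raw_responses]
print(scored)
```

Keeping the key in a single list mirrors the xkey list in the SPSS syntax, so the two can be checked against each other item by item.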

2. Get basic item information:
2(A) percentage answering each response:

DO REPEAT  xraw = ability1 to ability53/ 
optionA = optionA1 to optionA53 .
COMPUTE  optionA = 0.
IF ( xraw = 1) optionA = 1.
END REPEAT.

DO REPEAT  xraw = ability1 to ability53/ 
optionB = optionB1 to optionB53 .
COMPUTE  optionB = 0.
IF ( xraw = 2) optionB = 1.
END REPEAT.

DO REPEAT  xraw = ability1 to ability53/ 
optionC = optionC1 to optionC53 .
COMPUTE  optionC = 0.
IF ( xraw = 3) optionC = 1.
END REPEAT.

DO REPEAT  xraw = ability1 to ability53/ 
optionD = optionD1 to optionD53 .
COMPUTE  optionD = 0.
IF ( xraw = 4) optionD = 1.
END REPEAT.
execute.

DESCRIPTIVES VARIABLES=optionA1 to optionA53
  /STATISTICS=MEAN.
DESCRIPTIVES VARIABLES=optionB1 to optionB53
  /STATISTICS=MEAN.
DESCRIPTIVES VARIABLES=optionC1 to optionC53
  /STATISTICS=MEAN.
DESCRIPTIVES VARIABLES=optionD1 to optionD53
  /STATISTICS=MEAN.

The above code generates a series of 0-1 variables for each of the four options on this test. It then calculates the proportion of respondents who chose each option (i.e., the mean of the 0-1 variable); multiply by 100 to get a percentage. If you had five options, you would add another copy of the DO REPEAT and DESCRIPTIVES syntax with a suitable name such as optionE.
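
The same tabulation can be sketched in Python by counting responses directly rather than building indicator variables. The data and the function name option_proportions below are hypothetical; as in the SPSS approach, missing responses stay in the denominator.

```python
from collections import Counter

# Hypothetical raw responses: one row per respondent, one column per item (options 1-4).
raw_responses = [
    [1, 2, 4],
    [1, 3, 4],
    [2, 2, 4],
    [1, 2, None],  # missing response on the last item
]
OPTIONS = (1, 2, 3, 4)


def option_proportions(rows, item_index, options=OPTIONS):
    """Proportion of all respondents choosing each option on one item.

    Missing responses are counted in the denominator but match no option.
    """
    counts = Counter(row[item_index] for row in rows)
    n = len(rows)
    return {opt: counts[opt] / n for opt in options}


for item in range(len(raw_responses[0])):
    print("item", item + 1, option_proportions(raw_responses, item))
```

A fifth option only requires adding 5 to OPTIONS, which is the analogue of adding an optionE block in the syntax above.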

2(B) Percentage correct:

DESCRIPTIVES VARIABLES=score1 to score53  
  /STATISTICS=MEAN.

2(C) Corrected item-total correlations:

RELIABILITY
  /VARIABLES= score1 to score53
  /SCALE('ALL VARIABLES') ALL
  /MODEL=ALPHA
  /STATISTICS=DESCRIPTIVE SCALE
  /SUMMARY=TOTAL.
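
To make clear what a corrected item-total correlation is, here is a small Python sketch that computes it by hand: each item is correlated with the total of all the other items. The scored data and the function names are made up for illustration, and this is not a description of how SPSS computes its output internally.

```python
from math import sqrt

# Hypothetical 0-1 scored data: one row per respondent, one column per item.
scored = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]


def pearson(x, y):
    """Pearson correlation; returns NaN if either variable has no variance."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def corrected_item_total(rows, item_index):
    """Correlate one item with the total score computed without that item."""
    item = [row[item_index] for row in rows]
    rest_total = [sum(row) - row[item_index] for row in rows]
    return pearson(item, rest_total)


for i in range(len(scored[0])):
    print("item", i + 1, round(corrected_item_total(scored, i), 3))
```

Excluding the focal item from the total matters because otherwise the item is partly correlated with itself, which inflates the correlation, particularly on short tests.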

2(D) Combine the above information into a table:
Excel can be useful for arranging the various pieces of SPSS output into columns, with one row per item.
You can use functions in extra columns to implement the rules chosen in the next section.
For example, if cell A1 held the proportion correct for the first item and you decided that any item with a proportion correct below .3 was a bad item, you could use the following formula.
=if(A1 < .3, "bad", "good")
Then copy and paste the formula down the column.
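
If you would rather assemble the table in a script, a rough Python equivalent is sketched below. The item statistics are invented numbers standing in for values copied from the SPSS output, and build_item_table is a hypothetical helper; the .3 cutoff is the same one used in the Excel formula.

```python
import csv
import io

# Hypothetical item statistics copied from the SPSS output, for illustration only.
p_correct = {"score1": 0.82, "score2": 0.25, "score3": 0.55}
item_total_r = {"score1": 0.31, "score2": 0.02, "score3": 0.44}


def build_item_table(p_correct, item_total_r, min_p=0.30):
    """One row per item, with a flag column like the Excel =if() formula."""
    table = []
    for item in p_correct:
        table.append({
            "item": item,
            "p_correct": p_correct[item],
            "item_total_r": item_total_r[item],
            "flag": "bad" if p_correct[item] < min_p else "good",
        })
    return table


table = build_item_table(p_correct, item_total_r)

# Write the table as CSV text, which Excel can open directly.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["item", "p_correct", "item_total_r", "flag"])
writer.writeheader()
writer.writerows(table)
csv_text = buffer.getvalue()
print(csv_text)
```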

3. Choose items to retain
3(A) Decide on a set of rules for assessing item quality, and on rules for deciding whether an item will be retained given that assessment.
Some simple rules include:

Item is not too difficult: if the probability of getting the item correct is too low, the item will not differentiate between participants on the construct measured. This can also be a sign that one of the distractors is true or appears reasonably true, or that the option thought to be correct is actually incorrect. In general, an item that looks too difficult may be genuinely very difficult or may have some fundamental flaw. Bear in mind that guessing alone yields about .25 on a four-choice test. A simple rule would be that items with a proportion correct less than .20, .30, or .40 on a four-choice test should be considered for deletion.
Item is not too easy: if the probability of getting the item correct is too high, the item will not differentiate between participants on the construct measured. There may be other reasons to retain such an item (e.g., you want to reassure participants, or you plan to apply the test to populations different from the test sample). A simple rule would be that items with a proportion correct greater than .85, .90, or .95 on a four-choice test should be considered for deletion.
Item correlates with the corrected total: the corrected total is the total score participants would get if the focal item were excluded. Good items have high corrected item-total correlations, which means that the item is measuring something similar to the overall scale. A negative item-total correlation suggests that one of the distractors may be a correct response or that the item has been keyed incorrectly. An item-total correlation close to zero suggests one of several problems with the item, including low differentiation or a question without a clear correct answer. A simple rule would be that items with corrected item-total correlations below a cutoff such as 0, .05, .10, .15, or .20 should be considered for deletion.
Qualitative assessment of the item: it is good to think about the item text. Some of the links at the bottom can provide assistance.
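
The three quantitative rules above can be combined into a single check. The Python sketch below is one way to do it; the function name review_item is hypothetical, and the default cutoffs (.30, .90, .10) are just one choice from the ranges listed above and should be adjusted to your own rules.

```python
def review_item(p_correct, item_total_r, min_p=0.30, max_p=0.90, min_r=0.10):
    """Return the list of rules an item fails; an empty list means no flags.

    The cutoffs are illustrative defaults, not recommended values.
    """
    reasons = []
    if p_correct < min_p:
        reasons.append("too difficult")
    if p_correct > max_p:
        reasons.append("too easy")
    if item_total_r < min_r:
        reasons.append("low item-total correlation")
    return reasons


# Hypothetical item statistics: (proportion correct, corrected item-total correlation).
items = {
    "score1": (0.82, 0.31),
    "score2": (0.25, 0.02),
    "score3": (0.96, 0.12),
}

for name, (p, r) in items.items():
    flags = review_item(p, r)
    print(name, "consider deleting: " + ", ".join(flags) if flags else "retain")
```

Returning the reasons rather than a single good or bad label makes it easier to follow up with the qualitative review of the item text.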

3(B) Implement rules on items. There is a trade-off between strict rules, which result in only high quality items being retained, and the need to retain a sufficient number of items. Studies differ. Sometimes the dataset is being used to actually measure the ability of interest (e.g., in an exam situation). Other times the sample is being used purely to develop the test (e.g., as part of a larger test construction process). If ability measurement is of actual interest, you are going to want to retain any item which adds to the overall reliability and validity of the scale. If the study is part of a larger test construction process, you may discard items which are mediocre with the aim of ultimately retaining only excellent items.

4. Get item information and reliability on revised scale
Rerun the reliability analysis above, but include only the retained items.
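
To illustrate what the rerun is checking, the Python sketch below computes Cronbach's alpha with the standard formula, alpha = k/(k-1) * (1 - sum of item variances / variance of total), first on all items and then on a retained subset. The scored data, the choice of retained items, and the function names are made up for illustration.

```python
# Hypothetical 0-1 scored data: one row per respondent, one column per item.
scored = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
]


def variance(values):
    """Sample variance (n - 1 in the denominator)."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(rows, retained):
    """Cronbach's alpha for the items whose column indexes are in `retained`."""
    k = len(retained)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    item_variances = [variance([row[i] for row in rows]) for i in retained]
    totals = [sum(row[i] for i in retained) for row in rows]
    total_variance = variance(totals)
    if total_variance == 0:
        raise ValueError("total scores have no variance")
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


alpha_all = cronbach_alpha(scored, [0, 1, 2])
alpha_retained = cronbach_alpha(scored, [0, 1])  # third item dropped
print(round(alpha_all, 3), round(alpha_retained, 3))
```

In this toy data the third item is unrelated to the other two, so dropping it raises alpha; with real data the change is usually much smaller and should be weighed against the loss of test length.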

5. Get total scores for individuals
See this post on computing scale scores for ability tests.
These can then be used in subsequent analyses (e.g., to correlate with other variables).
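
As a rough illustration of this final step, total scores are simply the sum of the 0-1 scores over the retained items. The participant labels, data, and retained item indexes in this Python sketch are hypothetical.

```python
# Hypothetical 0-1 scored data keyed by participant, one entry per item.
scored = {
    "p01": [1, 1, 0, 1],
    "p02": [1, 0, 1, 1],
    "p03": [0, 0, 1, 0],
}
retained = [0, 1, 3]  # column indexes of the items kept after item analysis


def total_score(item_scores, retained):
    """Number of retained items answered correctly."""
    return sum(item_scores[i] for i in retained)


totals = {person: total_score(items, retained) for person, items in scored.items()}
proportions = {person: total / len(retained) for person, total in totals.items()}
print(totals)
print(proportions)
```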

Additional Resources:

Raynald Levesque's syntax for running reliability analysis in SPSS.
Andy Field has some notes on running reliability analysis in SPSS for a multiple item scale.
SPSS provides a few points on interpreting its reliability analysis output.
Guidelines on writing multiple choice tests from Monash and Special Connections.
Online book and resources on Item Response Theory.
Information on running a basic reliability analysis in R. A more complete outline of options for item analysis in R can be found under the Psychometrics Task View. I have used the score.multiple.choice function from the psych package in the past; it is easy to use and gives the core information with a single function call.
Jim Ramsay's TestGraf is free software for doing item analysis.

Jeromy Anglim's Blog: Psychology and Statistics

Monday, October 5, 2009
