Analysing ordinal variables

Ordinal variables create challenges for analysis. This post discusses: (a) definitions and distinctions related to ordinal variables, (b) theoretical issues related to ordinal variables, and (c) options for analysing ordinal variables.

Definitions:
Ordinal variables can be defined in different ways (e.g., UCLA, Colorado State, Wiktionary). Here I define ordinal variables as ordered categorical variables. In this framework, they are distinguished from unordered categorical variables (i.e., nominal variables like favourite colour, country of birth, first name and so on) and from numeric variables (i.e., variables where the distance between adjacent points on the scale is equal). Ordered categorical variables (along with unordered categorical variables and discrete numeric variables) are also distinguished from continuous variables (e.g., weight, height, time), where there is assumed to be an infinite number of points between any two points on the scale.

Ranks are a form of ordinal variable, but not all ordinal variables are ranks. Examples of ranks include the position of a football team on the ladder, the position of a runner in a race, and the position of a student in a class. Ranks may or may not allow for ties, and where ties are allowed there are several options for how to define the rank of tied cases (see the rank function in R).
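To make the tie-handling options concrete, here is a minimal Python sketch. The function rank_with_ties and the race times are hypothetical, made up for illustration. The three method names mirror three of the options offered by the ties.method argument of R's rank function, but this is not a reimplementation of it.

```python
def rank_with_ties(values, method="average"):
    """Rank values in ascending order (1 = smallest), resolving ties by `method`."""
    if method not in {"average", "min", "max"}:
        raise ValueError(f"unknown method: {method}")
    # Indices of the values, sorted from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Find the end of the run of tied values that begins at `start`.
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        low, high = start + 1, end + 1  # 1-based positions covered by the tie
        if method == "average":
            tied_rank = (low + high) / 2
        elif method == "min":
            tied_rank = low
        else:
            tied_rank = high
        for pos in range(start, end + 1):
            ranks[order[pos]] = float(tied_rank)
        start = end + 1
    return ranks


# Hypothetical race times in seconds; two runners tie for second place.
race_times = [12.1, 11.8, 12.1, 13.0]
print(rank_with_ties(race_times, "average"))  # [2.5, 1.0, 2.5, 4.0]
print(rank_with_ties(race_times, "min"))      # [2.0, 1.0, 2.0, 4.0]
print(rank_with_ties(race_times, "max"))      # [3.0, 1.0, 3.0, 4.0]
```

The choice matters mainly when ties are common: "average" keeps the sum of ranks the same as it would be without ties, whereas "min" matches the way sporting results are usually reported.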

Another major type of ordinal variable is a scale with a limited number of values where it is unreasonable to assume that the distances between adjacent points on the scale are equal.

An interesting side point about rank variables is the case of norm scores. In many psychological applications raw scores (e.g., percentage correct) on a test, such as an intelligence test, are converted to a norm score. To do this, the raw score is often converted into a percentile rank and then projected onto a normal distribution. Whether the conversion of raw score to norm score substantively alters the relative distance between values depends on whether the distributions of raw scores and norm scores differ in shape. Whether the raw score or the norm score is more appropriate will depend on your conception of the phenomenon. In most cases that I have encountered it makes little difference, although all else being equal I tend to prefer the raw score for inclusion in statistical models.
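The conversion described above can be sketched in a few lines of Python. Several things here are assumptions for illustration only: the sample itself stands in for the norm group, percentile ranks use the mid-rank convention, the target metric has a mean of 100 and a standard deviation of 15, and the function names and raw scores are made up.

```python
from statistics import NormalDist


def percentile_ranks(raw_scores):
    """Mid-rank percentile: proportion of scores below, plus half the proportion tied."""
    n = len(raw_scores)
    result = []
    for score in raw_scores:
        below = sum(1 for other in raw_scores if other < score)
        tied = sum(1 for other in raw_scores if other == score)  # includes the score itself
        result.append((below + 0.5 * tied) / n)
    return result


def normalised_scores(raw_scores, mean=100.0, sd=15.0):
    """Project percentile ranks onto a normal distribution with the given mean and sd."""
    standard_normal = NormalDist()
    return [mean + sd * standard_normal.inv_cdf(p) for p in percentile_ranks(raw_scores)]


# Hypothetical raw scores (percentage correct), with one very high scorer.
raw = [40, 55, 60, 62, 95]
norm = normalised_scores(raw)
for r, s in zip(raw, norm):
    print(r, round(s, 1))
```

In this made-up sample the raw gap between the top two scores (62 to 95) is more than twice the gap between the bottom two (40 to 55), yet after the conversion the two gaps are the same size. The order of cases is preserved, but the relative distances are not, because the raw scores are skewed and the norm scores are forced towards a symmetric shape.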

The latent numeric dimension:
When ordinal variables are used, it is often theorised that a latent numeric variable is causing the observed ordinal variable. In the case of a rank variable based on running times, it is clear that the rank is an imperfectly correlated indicator of running time. In the case of a five-point Likert item measuring job satisfaction, the five points on the scale might be theorised to represent some underlying points on a numeric dimension of job satisfaction, where the actual distance between points is not equal.
When such latent numeric variables are assumed, and of interest, it may be better to think about designing the measurement procedure so that the latent numeric variable can be measured more directly and validly. If you are interested in running times, measure running times. If you are interested in job satisfaction, aggregate many items, and hope (and, ideally, validate) that the resulting scale reflects more of the numeric dimension of job satisfaction.
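One common way to formalise the latent-variable idea is a threshold model, in which the observed category simply records which pair of cut points the latent value falls between. The Python sketch below assumes that model. The cut points and latent values are hypothetical and chosen to be unevenly spaced.

```python
from bisect import bisect_right

# Hypothetical cut points on a latent job-satisfaction dimension.
# Four thresholds define five categories of unequal width.
THRESHOLDS = [-1.5, -1.0, 0.0, 2.0]


def observed_category(latent_value, thresholds=THRESHOLDS):
    """Return the 1-based category whose interval contains the latent value."""
    return bisect_right(thresholds, latent_value) + 1


# Made-up latent values for six respondents.
latent = [-2.0, -1.2, -0.5, 0.5, 1.9, 2.5]
observed = [observed_category(v) for v in latent]
print(observed)  # [1, 2, 3, 4, 4, 5]
```

Under these assumed cut points, the respondents at 0.5 and 1.9 give the same response even though they are far apart on the latent dimension, while the respondents at -1.2 and -0.5 give different responses despite being closer together. The observed item preserves order but not distance.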

Treating ordinal variables numerically:
A variable is not ordinal by definition. A variable becomes ordinal when a researcher declares it to be so. A variable stays ordinal when it is treated in ways consistent with the ordinal declaration.
For example, a five-point Likert scale item can be treated numerically. This involves assuming that the distances between adjacent points on the scale are equal: i.e., the distance between 1 and 2 is assumed to be equal to the distance between 4 and 5. This assumption is implicitly made whenever the variable is included linearly in a statistical model.
While the declaration of an observed variable as ordinal is something done by a human, the decision is rarely arbitrary. Some variables are naturally thought of as ordinal, and the decision to label a variable as ordinal is governed by several conventions. In general, the declaration is tied to the theory of what is being measured and the latent variable that is not measured. As set out above, in the case of ordinal variables, it is typically assumed that there is a latent numeric variable that is monotonically related to the observed ordinal variable, and that it is this latent variable that is of theoretical interest.
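A small Python sketch shows why the equal-distance assumption is substantive. The two groups and both scoring schemes are made up for illustration. The second scheme is just one of many order-preserving alternatives to 1 to 5 scoring.

```python
from statistics import mean


def mean_under_scoring(responses, scoring):
    """Mean response after mapping each category to its assigned numeric value."""
    return mean(scoring[r] for r in responses)


# Equal-interval scoring: what is implied by analysing the raw 1-5 codes.
equal_interval = {1: 1, 2: 2, 3: 3, 4: 4, 5: 5}
# A hypothetical alternative that keeps the order but stretches the top category.
stretched_top = {1: 1, 2: 1.5, 3: 2, 4: 3, 5: 6}

# Made-up responses to a five-point item from two groups.
group_a = [3, 3, 3, 4]
group_b = [1, 1, 5, 5]

a_equal = mean_under_scoring(group_a, equal_interval)   # 3.25
b_equal = mean_under_scoring(group_b, equal_interval)   # 3
a_stretched = mean_under_scoring(group_a, stretched_top)  # 2.25
b_stretched = mean_under_scoring(group_b, stretched_top)  # 3.5

print(a_equal > b_equal)          # True: group A is higher under equal intervals
print(a_stretched > b_stretched)  # False: the ordering of group means reverses
```

Both scorings respect the ordinal information in the data, yet they disagree about which group has the higher mean. Any conclusion based on means therefore rests on the choice of scoring, which is why treating the item numerically is an assumption that needs justifying.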

Analysis Options:
The following is a partial list of analysis options for ordinal variables. Optimal scaling is relevant to both ordinal predictor and outcome variables. Ordinal measures of association and polychoric correlations are options when analysing pairs of variables. Ordinal regression is an option when the dependent variable is ordinal. Assuming that the variable is numeric allows you to apply many standard tools such as Pearson's correlation and multiple regression, but such a decision should generally be justified.

Optimal scaling: Optimal scaling optimises the scaling of the variable according to some criterion (e.g., maximising prediction of some outcome variable), subject to the measurement properties assigned to the variable. In particular, ordinal and spline ordinal are two natural choices for ordinal variables. Ordinal allows any scaling that preserves the same rank order as the original scaling. Spline ordinal attempts to draw a smoother function through the original scale. Ordinal scaling makes sense when there are only a few categories, perhaps fewer than 6 to 10. Spline ordinal tends to make more sense when there are more categories and when there is less data. (information on the homals package in R; SPSS resources; Richard Bell presents slides on the topic)
Ordinal measures of association: Kendall's tau, Spearman's rho, etc. (David Garson discusses many of them; R has several built into its correlation function, cor)
Assume variable is numeric: This involves making the assumption that the distances between each consecutive pair of points on the observed variable are equal, or at least that numbers can be assigned to each point. This is often a reasonable simplifying assumption. Optimal scaling or simply looking at a scatter plot can be used as a check that this assumption is reasonable. If the line of best fit is a straight line, then it may be a reasonable assumption.
Examining various polynomial effects: Ordinal variables can be entered as a series of polynomial effects: e.g., linear, quadratic, cubic. The combined effect might be assumed to capture the effect of the variable. Significant quadratic and other higher-order effects might be used as evidence that the equal-distance assumption is invalid. An alternative explanation of such higher-order effects is that the scale points are equally spaced but the effect of the variable is genuinely nonlinear. (Here's an online article on testing for trend in SPSS)
Polychoric correlations: This technique assumes that there are continuous, normally distributed latent variables underlying the observed ordinal variables, and attempts to estimate the correlations between these latent variables (see here for more information).
Ordinal regression: If the dependent variable is ordinal, ordinal regression may be a useful tool (see here).
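As an illustration of the ordinal measures of association listed above, here is a minimal Python sketch of Spearman's rho and Kendall's tau-b using only the standard library. The function names and data are made up, and in practice you would use a tested implementation such as R's correlation function. Spearman's rho is computed as Pearson's correlation on average ranks, and tau-b uses the usual correction for ties.

```python
from itertools import combinations
from math import sqrt


def average_ranks(values):
    """Ascending ranks (1 = smallest), with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson's product-moment correlation."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson's correlation applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def kendall_tau_b(x, y):
    """Kendall's tau-b: (concordant - discordant) pairs, adjusted for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue  # tied on both variables: excluded from both denominator terms
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    return (concordant - discordant) / sqrt((untied + tied_x_only) * (untied + tied_y_only))


# Made-up data: y increases with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
print(spearman_rho(x, y), kendall_tau_b(x, y), pearson(x, y))
```

Because the relationship in this example is perfectly monotonic, both ordinal measures reach their maximum, whereas Pearson's correlation on the raw values falls short of 1. The ordinal measures use only the order of the values, so they are unchanged by any order-preserving rescaling of either variable.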

Additional Resources:

Alan Agresti has several books that discuss the issues involved with analysing ordinal variables.

Jeromy Anglim's Blog: Psychology and Statistics

Monday, October 19, 2009
