Cluster analysis and single dominant factors

I often chat with researchers who want to use cluster analysis to group cases. I want to point out a common scenario where cluster analysis may not be a good way to proceed.

Many researchers have heard the advice not to form median splits (see Howell for a discussion), or other kinds of binary splits for that matter. The same arguments also tend to apply to other forms of abrupt grouping into a small number of categories.

Some arguments FOR running median splits are: 1) it allows you to do an ANOVA or t-test and compare group means; 2) group differences are easier to communicate to a lay audience; 3) it may reflect an important distinction in the underlying continuous variable.

Some arguments AGAINST running median splits are: 1) you can always find an equivalent analysis that respects the continuous nature of the variable (e.g., regression); 2) when creating median splits, you lose a lot of information; 3) the cut-off tends to be relatively arbitrary and varies between samples; 4) the resulting model based on a median split does not reflect the underlying nature of the variable; 5) in most cases a binary split will have less statistical power; 6) if the purpose is to communicate to a scientific audience, respecting the continuous nature of the variable is a necessary complexity.

From the above you can see that there are generally more reasons in favour of maintaining the continuous version of the variable. The two occasions where splits are tolerable are when they make it easier to communicate findings to a lay audience and when the underlying effect of interest occurs in a stepwise fashion. In the latter case, the presence of a stepwise effect can be tested empirically; a quick look at a scatter plot should give some sense of whether there is a point where the effect changes dramatically. Likewise, decisions based on test scores are often made in pass-fail kinds of categories, and there is often a concrete desire to draw inferences about these specific groups.
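The information-loss argument against median splits can be illustrated with a small simulation. The following is only a rough sketch in Python using the standard library; the sample size, the true correlation of 0.5, and the seed are all hypothetical values chosen for illustration. It correlates an outcome with a continuous predictor and then with a median-split version of the same predictor.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation computed from first principles."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


rng = random.Random(1)  # arbitrary seed, for reproducibility only
n = 5000                # hypothetical sample size
rho = 0.5               # hypothetical true correlation between x and y

# Simulate a continuous predictor x and an outcome y correlated rho with it.
x = [rng.gauss(0, 1) for _ in range(n)]
y = [rho * xi + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1) for xi in x]

# Median split: recode x as 0 (at or below the median) or 1 (above it).
cut = statistics.median(x)
x_split = [1 if xi > cut else 0 for xi in x]

r_continuous = pearson(x, y)
r_split = pearson(x_split, y)

print(f"r using continuous x:   {r_continuous:.3f}")
print(f"r using median-split x: {r_split:.3f}")
```

For a normally distributed predictor, the correlation after a median split is expected to shrink to roughly 80% of its original size, which is one way of seeing where the loss of statistical power comes from.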

However, the point of this post is to discuss cluster analysis, and how it can be just a fancy way of performing a median split.

Typical examples that I am thinking of are situations where a researcher wants to form clusters based on a set of variables that have a single common factor explaining the majority of variability: e.g., reaction time measures, perceptions of performance, attitude measures, emotional/affective measures, and so on. In essence, the score for an individual i on variable j is a function of individual i's level on a large latent general factor plus some small specific factors plus error. If you were to run a PCA on such data, the first component would account for three or more times as much variance as the next component. If you run a cluster analysis of cases on these datasets, the result almost always involves creating clusters based on this underlying first component. The result is that if you ask for two clusters, you get a low and a high group, and if you ask for three clusters, you get a low, medium, and high group. Thus, in essence, it's a fancy way of performing a median split.
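To make this concrete, here is a minimal simulation sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the data are generated from a single general factor with a hypothetical loading of 0.8 on each of six variables, the clustering is a bare-bones two-cluster k-means (Lloyd's algorithm) rather than any particular package's routine, and the "median split" is taken on the simple mean of the six variables.

```python
import random
import statistics

rng = random.Random(42)      # arbitrary seed, for reproducibility only
n_cases, n_vars = 400, 6     # hypothetical sample size and number of variables
loading = 0.8                # hypothetical loading on the general factor

# Each score = loading * general factor + unique noise (unit total variance).
# In the population, the correlation matrix then has a first eigenvalue of
# 1 + 5 * 0.64 = 4.2 and five remaining eigenvalues of 0.36.
unique_sd = (1 - loading ** 2) ** 0.5
data = []
for _ in range(n_cases):
    g = rng.gauss(0, 1)  # this case's level on the general factor
    data.append([loading * g + unique_sd * rng.gauss(0, 1) for _ in range(n_vars)])


def sq_dist(a, b):
    """Squared Euclidean distance between two cases."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def two_means(rows, iters=50):
    """Plain Lloyd's algorithm with k = 2, started from two random cases."""
    centres = rng.sample(rows, 2)
    labels = None
    for _ in range(iters):
        new_labels = [
            0 if sq_dist(r, centres[0]) <= sq_dist(r, centres[1]) else 1
            for r in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for k in (0, 1):
            members = [r for r, lab in zip(rows, labels) if lab == k]
            if members:
                centres[k] = [statistics.fmean(col) for col in zip(*members)]
    return labels


labels = two_means(data)

# Median split on the composite (mean of the six variables).
composite = [statistics.fmean(row) for row in data]
cut = statistics.median(composite)
split = [1 if c > cut else 0 for c in composite]

# Cluster numbering is arbitrary, so take the better of the two matchings.
same = sum(lab == s for lab, s in zip(labels, split)) / n_cases
agreement = max(same, 1 - same)

for k in (0, 1):
    members = [c for c, lab in zip(composite, labels) if lab == k]
    print(f"cluster {k}: n = {len(members)}, mean composite = {statistics.fmean(members):.2f}")
print(f"agreement with median split: {agreement:.1%}")
```

Under these assumptions, the two clusters should come out as a low group and a high group on the composite, with most cases classified the same way as by the median split. With real data the agreement will depend on how dominant the first component actually is.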

The motivation for doing a cluster analysis is often similar to that of a marketer trying to profile the marketplace in terms of discrete segments. Marketers often hope to find particular constellations of levels of variables that can represent meaningful segments. In such cases cluster analysis may be a useful way of discovering these naturally occurring groupings. While such an exploratory orientation in social science research can be interesting, it is also worth noting that the dominant scenario that I have encountered is the one mentioned earlier, where a continuous and large first factor drives the cluster creation process.

None of the above is meant to turn researchers off cluster analysis completely. Rather, I just think that researchers should be self-aware so that they can make an informed choice about when cluster analysis is useful, and so that they know when they are using cluster analysis as just a fancy means of doing a median split.

Cluster analysis Resources:

My lecture notes on cluster analysis.

SPSS Resources:

R Resources:

Quick-R sets out how to use kmeans, hclust, and some other cluster analysis options.

Jeromy Anglim's Blog: Psychology and Statistics

Tuesday, September 8, 2009
