Clustered Samples and Assuming Independence of Observations

I sometimes speak to researchers who have a design where units are nested within clusters (e.g., 200 employees nested within 50 stores). While this is often called cluster sampling, the research that this post addresses is usually driven more by convenience than by a rigorous sampling plan. At some point, the researchers discover that this clustering has implications for the assumption of independence of observations, which in turn has implications for the validity of standard statistical techniques that assume independence, such as t-tests and regressions. This post discusses what to do in such a situation and when, if ever, it is appropriate to ignore the clustered nature of the sampling.

The Context:
Common examples that I encounter include:

employees nested within stores,
students nested within schools,
students nested within classrooms, and
participants nested within geographic regions.

The typical examples that I see are largely convenience samples at both the cluster level and the participant level. The clusters are not necessarily a random sample of possible clusters, and the participants are not necessarily a random sample within each cluster. For example, a researcher may have collected data from 300 participants from 80 stores. The distribution of employees per store is skewed such that some stores provide only one or two participants and other stores provide 15 to 20 participants. Such a sampling design is not trying to be a representative sample of a defined population as might be encountered in a national poll. Rather, these social science studies are using the limited pool of available participants to maximise sample size for analyses.

The Problem:
The main issue is that, when observations within a cluster are more similar to each other than to observations in other clusters, statistics that assume independence of observations will yield standard errors that are smaller than they should be. Thus, the researcher is more likely to conclude that an effect (e.g., a difference between group means, a correlation, or a regression coefficient) is statistically significant even when no effect is present in the population.

However, introducing a multilevel modelling framework or some other standard error adjustment procedure often adds complexity to the modelling task. In addition, having a large number of clusters, some of which contain only a few cases, may cause estimation problems. Thus, researchers sometimes want to justify the use of standard techniques that assume independence.
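To make the size of the problem concrete, here is a minimal simulation sketch. It is not taken from any of the sources discussed in this post, and every name and number in it is an assumption chosen for illustration: two groups of 20 clusters with 10 participants each, group membership determined at the cluster level, no true group difference, and a naive two-sample test that treats all 200 observations per group as independent.

```python
import random


def mean_and_variance(values):
    """Return the sample mean and the unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, variance


def naive_t(a, b):
    """Two-sample t statistic that ignores clustering entirely."""
    mean_a, var_a = mean_and_variance(a)
    mean_b, var_b = mean_and_variance(b)
    standard_error = (var_a / len(a) + var_b / len(b)) ** 0.5
    return (mean_a - mean_b) / standard_error


def simulate_group(n_clusters, cluster_size, icc, rng):
    """Simulate one group with total variance 1, of which `icc` is between clusters."""
    sd_cluster = icc ** 0.5
    sd_error = (1 - icc) ** 0.5
    values = []
    for _ in range(n_clusters):
        cluster_effect = rng.gauss(0, sd_cluster)  # shared by everyone in the cluster
        for _ in range(cluster_size):
            values.append(cluster_effect + rng.gauss(0, sd_error))
    return values


def false_positive_rate(icc, n_clusters=20, cluster_size=10, n_sims=1000, seed=1):
    """Share of simulations with |t| > 1.96 when the true group difference is zero."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a = simulate_group(n_clusters, cluster_size, icc, rng)
        b = simulate_group(n_clusters, cluster_size, icc, rng)
        if abs(naive_t(a, b)) > 1.96:
            hits += 1
    return hits / n_sims


if __name__ == "__main__":
    for icc in (0.0, 0.05, 0.2):
        print(f"ICC = {icc}: false positive rate = {false_positive_rate(icc):.3f}")
```

With an ICC of zero the rejection rate should sit near the nominal 5%, and it should climb well above that as the ICC grows. The inflation is this stark because the grouping variable varies only between clusters; predictors that vary mostly within clusters are generally less affected.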

General Discussion:
The following three pages provide useful information on the topic.

UCLA discusses the issue on a page called Analyzing correlated data. The page discusses options for adjusting standard errors. It also shows the effect of sample size and intraclass correlation (ICC) on p values.
Hedges and Hedberg (2007, Intraclass correlation for planning Group Randomized Experiments) discuss the implications of cluster sampling in experiments with and without pretest scores.
Gene Shackman discusses how to calculate the Design Effect from the ICC and the average size of groups.
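As a rough illustration of the design effect calculation, the sketch below uses the standard approximation DEFF = 1 + (m - 1) * ICC, where m is the average cluster size. The function names are my own, the ICC of .05 is a made-up value, and the 300 participants in 80 stores come from the example earlier in this post. The formula assumes roughly equal cluster sizes, so with a skewed distribution of cluster sizes it should be treated as an approximation only.

```python
def design_effect(icc, average_cluster_size):
    """Approximate design effect: 1 + (m - 1) * ICC."""
    return 1 + (average_cluster_size - 1) * icc


def effective_sample_size(n_total, icc, average_cluster_size):
    """Number of independent observations the clustered sample is roughly worth."""
    return n_total / design_effect(icc, average_cluster_size)


if __name__ == "__main__":
    n_total = 300        # participants (example from this post)
    n_clusters = 80      # stores (example from this post)
    icc = 0.05           # hypothetical value, for illustration only
    m = n_total / n_clusters
    print("Average cluster size:", m)
    print("Design effect:", design_effect(icc, m))
    print("Effective sample size:", effective_sample_size(n_total, icc, m))
```

A design effect close to 1 means that little information is lost to clustering. The square root of the design effect indicates roughly how much the naive standard errors would need to be inflated.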

A Few Thoughts on How to Deal with This Situation:

1. Think about the Cluster Effect: Are there reasons to expect participants to be more similar within clusters on the outcome variable? One way to approach this is to think about what correlates with the outcome variables and think about whether the clusters are likely to differ in their mean levels on these predictors. For example, if all stores tend to be fairly similar, the participants do not interact, and the outcome variable has little to do with geography or the workplace itself, then the effect of store may be close to zero.

2. Estimate the Cluster Effect: Assess the extent to which cluster explains variance in the outcome variable. Most assessments are based on first estimating some form of intraclass correlation (ICC). SPSS has the MIXED procedure. R has the multilevel package and the psychometric package, both of which use the nlme package. The typical procedure involves fitting a model with cluster as a random effect on the outcome variable, potentially with additional predictors in the model.
ICC = var(cluster) / [var(cluster) + var(error)] (see Wikipedia or UCLA or Hedges and Hedberg for details)
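The formula above can be sketched in a few lines of code. The version below is not the mixed-model approach used by SPSS MIXED or nlme; it swaps in the simpler one-way ANOVA estimator of the ICC, which handles unequal cluster sizes through an adjusted average cluster size (n0). The function name and the toy data are assumptions for illustration.

```python
def anova_icc(groups):
    """One-way ANOVA estimate of ICC = var(cluster) / [var(cluster) + var(error)].

    `groups` maps a cluster label to a list of outcome values. It needs at
    least two clusters and more observations than clusters.
    """
    k = len(groups)
    sizes = [len(values) for values in groups.values()]
    n_total = sum(sizes)
    grand_mean = sum(sum(values) for values in groups.values()) / n_total

    # Between-cluster and within-cluster sums of squares.
    ss_between = 0.0
    ss_within = 0.0
    for values in groups.values():
        cluster_mean = sum(values) / len(values)
        ss_between += len(values) * (cluster_mean - grand_mean) ** 2
        ss_within += sum((v - cluster_mean) ** 2 for v in values)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n_total - k)

    # Adjusted average cluster size for unbalanced designs.
    n0 = (n_total - sum(size ** 2 for size in sizes) / n_total) / (k - 1)

    var_cluster = (ms_between - ms_within) / n0
    var_error = ms_within
    return var_cluster / (var_cluster + var_error)


if __name__ == "__main__":
    # Made-up outcome scores for three hypothetical stores.
    scores = {
        "store_a": [3.1, 2.8, 3.5],
        "store_b": [4.2, 4.0],
        "store_c": [2.5, 3.0, 2.7, 2.9],
    }
    print("Estimated ICC:", anova_icc(scores))
```

This estimator can return a negative value when clusters differ less than would be expected by chance; such values are usually read as an ICC of zero. A mixed model with cluster as a random effect is the more common route when additional predictors are involved.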

3. Adopt a Procedure:
If researchers choose to ignore the clustering, they need to make a strong argument from theory and from the data that it is appropriate. This argument should include some or all of the following points if they are applicable:

The ICC is close to zero. It should be noted that the UCLA post suggests that even ICCs of .01 are still likely to bias standard errors.
Theory suggests that there is no reason to expect clusters to affect the dependent variable.
Prior research suggests little to no clustering effect.
The intended audience is less likely to comprehend the more sophisticated techniques.
It is standard in the literature to ignore the clustering.
The p values are sufficiently small that results would be robust anyway.
The purpose of the analyses is exploratory.
The number of participants per cluster is small.
A multilevel model or some other more sophisticated procedure was tried and yielded the same substantive results.
A multilevel model or some other more sophisticated procedure was tried and could not be run due to estimation issues.

The most important element of this argument is that the ICC is close to zero. However, even with all of this, a reviewer may still reject the argument and expect a more sophisticated approach to be adopted.

If the ICC does suggest that cluster explains variance in the outcome variable, then the above links (under General Discussion) suggest ways of modelling the data (e.g., multilevel modelling, adjusting standard errors, etc.).
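As one small example of what adjusting standard errors can look like, the sketch below compares the naive standard error of a sample mean with a cluster-robust (sandwich) standard error. It covers only the simplest case of estimating a mean, it is not the procedure described on any of the linked pages, and the G / (G - 1) small-sample correction is just one common choice. The function name and data are hypothetical.

```python
def naive_and_cluster_robust_se(groups):
    """Return (mean, naive SE, cluster-robust SE) for a clustered sample.

    `groups` maps a cluster label to a list of outcome values and must
    contain at least two clusters.
    """
    values = [v for cluster in groups.values() for v in cluster]
    n = len(values)
    g = len(groups)
    mean = sum(values) / n

    # Naive SE: treats all n observations as independent.
    sample_variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    naive_se = (sample_variance / n) ** 0.5

    # Cluster-robust SE: residuals are summed within each cluster first,
    # so that within-cluster correlation is not averaged away.
    cluster_residual_sums = [sum(v - mean for v in cluster) for cluster in groups.values()]
    robust_variance = (g / (g - 1)) * sum(s ** 2 for s in cluster_residual_sums) / n ** 2
    robust_se = robust_variance ** 0.5

    return mean, naive_se, robust_se


if __name__ == "__main__":
    # Made-up data in which the two clusters clearly differ.
    data = {"store_a": [1, 2, 3], "store_b": [4, 5, 6]}
    mean, naive_se, robust_se = naive_and_cluster_robust_se(data)
    print("Mean:", mean)
    print("Naive SE:", naive_se)
    print("Cluster-robust SE:", robust_se)
```

When clusters differ in their means, the cluster-robust standard error will tend to be larger than the naive one. With very few clusters, as in this toy data, cluster-robust estimates are themselves unreliable, which is one reason multilevel models are often preferred.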

Jeromy Anglim's Blog: Psychology and Statistics

Friday, February 26, 2010
