Jeromy Anglim's Blog: Psychology and Statistics


Wednesday, October 7, 2009

Efficient Variable Selection in R

This post sets out my procedure for efficiently and reliably selecting variables from a data.frame in R
THE PROBLEM
For years I have been tormented by the hassle of selecting variables in a data.frame. I often work with psychological datasets with hundred or even thousands of variables. Many analyses in psychology will operate on a hundred or more variables (e.g., factor analysis; analysis of large numbers of survey items, and so on). The task of selecting the variables for analysis is a tedious component of data analysis. I have experimented with a number of systems over the years, but finally I have something that I am fairly happy with.

MY CURRENT SOLUTION
I maintain metadata in a spreadsheet file. There is a separate worksheet for each data frame in R. I also maintain metadata worksheets for several other structures common in psychology including: psychological tests; output; value labels. I will discuss my system for managing metadata more in a future post. For now I want to focus on the task of variable selection.

Step 1) Copy cells of variable names from the metadata spreadsheet (I use Excel, but any spreadsheet is fine)
Step 2) Run the function below V() - this will transform the clipboard contents into a format suitable for inclusion in R (i.e., c("variable1", "variable2", ..., "varaibleK") ).
Step 3) Paste the transformed clipboard into R

This is the function:

V <- function(addC = TRUE) {
    # Typical application: 
    # Step 1: paste a column of variable names from a meta database;   
    # Step 2: Run this function in R: V()
    # Step 3: Press control + V: Paste vector of clipboard
    characterVector <- scan("clipboard", what = "character")
    
    # Format string ready for inclusion in c(...).
    formattedString <-    paste(characterVector , collapse="\", \"") 
    formattedString <- paste("\"", formattedString, "\"", sep="")
    if (addC) formattedString <- paste("c(", formattedString , ")", sep = "")
    writeLines(formattedString, con = "clipboard-128", sep = " ")
}

A SIMPLE CASE STUDY
A simple scenario is examining the means for 10 extraversion items of the IPIP in a data frame called personality. Before I can run these analyses, I need to be able to select the variables in the data frame that correspond to these 10 items. Below I apply the general strategy above to this specific case.

Step 1) Copy cells of variable names from Excel
The following is a table from an Excel worksheet with my meta data for the IPIP. Notice that it has item number, item text, scale, whether the item should be reversed and the name of the variable in my data.frame. It is set up as a a table so that I can quickly and reliably sort and filter the table.


My goal is to extract only the extraversion items. So, I select from the header called scale, the term "extraversion".




I then turn off the filter and paste the variables into a space in the worksheet and then select and copy the variable names again (this ensures that they are a consecutive set of cells).

Step 2 & 3)
I then run the code shown below. The code assumes that the V function above has been defined.
I then paste the variable names and assign them to an identifier. The identifier is then used in the subsequent analysis to select the 10 variables of interest. In this case I present the means for the 10 IPIP extraversion items.



# Run the V() function.
V()
# Read 10 items

# Paste the clipboard and give vector a name (e.g., variableNames).
variableNames <- c("ipip1", "ipip6", "ipip11", "ipip16", 
  "ipip21", "ipip26", "ipip31", "ipip36", "ipip41", "ipip46")

# Run Analysis.
round(cbind(sapply(personality[variableNames], mean)), 2)
#        [,1]
# ipip1  2.84
# ipip6  2.38
# ipip11 3.91
# ipip16 2.74
# ipip21 3.65
# ipip26 2.08
# ipip31 3.23
# ipip36 3.06
# ipip41 3.25
# ipip46 3.21