Jeromy Anglim's Blog: Psychology and Statistics


Tuesday, February 16, 2010

A Case Study in Optimising Code in R

This post presents an experience I had optimising code for a data analysis task in R. I'm not an expert in programming or code optimisation. However, I thought my experience might make an interesting case study for others at a similar level in their R programming development. The post sets out: (1) the context of the problem; (2) the strategies and tools in R that I used to diagnose the problem and optimise the code; (3) some lessons learnt from the experience; and (4) some links to additional resources.

THE CONTEXT
I was analysing data from a skill acquisition experiment looking at how participants edit passages of text using keyboard text-editing keys. Each participant completed a series of trials. Each trial involved the participant interacting with a specialised text editor: participants moved the cursor to a point and then deleted a designated section of text. On each trial, the key presses made by the participant were recorded. These key presses were stored in a data.frame with around 1.3 million rows (i.e., around 30 key presses per trial, 150 individuals, and around 300 trials per individual, leading to around 46,000 trials). Here's a look at the data.frame showing the first few key presses for the first participant on the first trial:

>   head(cKeys[,c("id", "trial", "subjectTrialId", "keyName", "shift", "control")])
        id trial subjectTrialId keyName shift control
1177654  1     1          10001    down FALSE   FALSE
1171094  1     1          10001    down FALSE   FALSE
465784   1     1          10001 control FALSE    TRUE
843042   1     1          10001     end FALSE    TRUE
488773   1     1          10001 control FALSE    TRUE
888996   1     1          10001      up FALSE    TRUE

As part of the analyses I wanted to divide the key presses into those that were concerned with movement of the cursor and those that were concerned with deletion of the text. I'd written a function that processed the keys from one trial and returned a character vector indicating whether each key was part of the movement or the deletion subtask. The function worked.
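
To give a concrete picture of the general idea, here is a minimal, purely illustrative sketch. It is not the real moveOrDeleteSubtaskS2(), which was around 50 lines long and also used the shift and control columns; the deletion key names below are assumptions for the sake of the example:

# Illustrative only: label every key press before the first
# deletion-related key as "move", and the rest as "delete"
moveOrDeleteSketch <- function(keyName) {
    firstDelete <- match(TRUE, keyName %in% c("delete", "backspace"))
    if (is.na(firstDelete)) return(rep("move", length(keyName)))
    c(rep("move", firstDelete - 1),
      rep("delete", length(keyName) - firstDelete + 1))
}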

The only problem was that the function took around a quarter of a second per trial to run. As there were 46,000 trials, it would have taken several hours to run the function over all trials. I would also have needed to run the code multiple times in order to debug it, and to re-run it if the input data changed. This was too slow. Thus, I began a journey to optimise the code.

THE JOURNEY
Rprof: I tried the Rprof() command. This produced an output file that could be summarised with the summaryRprof() command (Rprof(NULL) turns profiling off). Given my level of knowledge at the time, the output was not very meaningful to me. It suggested that the data frame subsetting function, [.data.frame, was consuming most of the time. In hindsight, this did offer some clues.
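
For reference, the basic profiling workflow looks roughly like this; the file name and the runMainAnalysis() call below are placeholders rather than my actual code:

Rprof("profile.out")           # start profiling; timing samples are written to the file
result <- runMainAnalysis()    # placeholder for the code being profiled
Rprof(NULL)                    # turn profiling off
summaryRprof("profile.out")    # summarise the time spent, by function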

system.time and the function: On the basis that the function would be run around 45,000 times, I went through the lines of code in the main function and timed how long each took to run 45,000 times. Below are some artificial examples that are typical of the results.

> system.time(for(i in 1:45000) (
+                         i 
+                         ))
   user  system elapsed 
   0.14    0.00    0.14 
> 
> system.time(for(i in 1:45000) (
+                         seq(i)
+                         ))
   user  system elapsed 
   7.72    0.08    8.46 

Most commands completed in under a second. A few took a little under 10 seconds. Given that there were around 50 lines of code in the function, the program should have run in 10 minutes at most. Yet I had projected that it would take around 3 hours to run the code.

system.time and data splitting: Then it became clear that the function itself was not the problem. Rather, the code was slow because of how the data frame was being split into separate trials. My original code was:
# Split the full data frame by trial, classify each trial's keys,
# then unsplit the results back into the original row order
unsplitSubtask <- unsplit(lapply(
        split(cKeys, cKeys[, "subjectTrialId"]), 
        function(x) moveOrDeleteSubtaskS2(x[, "keyName"], 
              x[, "shift"], x[, "control"])), 
    cKeys$subjectTrialId)

Subsetting a data frame with around 1.3 million rows took around a quarter of a second for a single trial.

> timeFor10 <- system.time(for(i in unique(cKeys$subjectTrialId)[1:10]) cKeys[cKeys$subjectTrialId == i, ]) 
> timeFor10 * 4600 / 60 / 60 
    user   system  elapsed 
3.335000 0.000000 3.347778 

The code above timed the data access for ten of the 46,000 trials. I assumed that timing ten trials would be more accurate than timing one. I then multiplied this 10-trial estimate by 4,600 to estimate how long the full set of trials would take to process, and divided by 3,600 to convert from seconds to hours. The analysis suggested it would take over 3 hours to run this code on all 46,000 trials.

Thus, the problem was diagnosed.

Developing a solution: The code I settled on is shown below. I tried to minimise the number of times that the 1.3 million row data frame (i.e., cKeys) was subset. The code below took around two minutes to run. Perhaps it could be further optimised, but for my purposes two minutes was fast enough.

cKeys <- keys
source("keyProcessing.R")

# Split each required column by trial once, up front, rather than
# subsetting the full data frame separately for every trial
splitKeyNames <- split(cKeys$keyName, cKeys$subjectTrialId)
splitShifts <- split(cKeys$shift, cKeys$subjectTrialId)
splitControls <- split(cKeys$control, cKeys$subjectTrialId)
uniqueSubjectTrialIds <- unique(cKeys$subjectTrialId)

# Classify the key presses for each trial
splitSubtask <- lapply(uniqueSubjectTrialIds, function(i) 
      moveOrDeleteSubtaskS2(  
          splitKeyNames[[as.character(i)]],
          splitShifts[[as.character(i)]],
          splitControls[[as.character(i)]]
      ))

# Reassemble the per-trial results in the original row order
unsplitSubtask <- unsplit(splitSubtask, cKeys$subjectTrialId)
cKeys$subtaskLevel2 <- unsplitSubtask
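
In hindsight, a slightly more compact variant, which I haven't benchmarked against this data, would be to iterate over the three split lists in parallel with mapply(); this also avoids the repeated as.character() lookups:

# The elements of the three lists line up because all three were
# split by the same subjectTrialId variable
splitSubtask <- mapply(moveOrDeleteSubtaskS2,
    splitKeyNames, splitShifts, splitControls,
    SIMPLIFY = FALSE)
unsplitSubtask <- unsplit(splitSubtask, cKeys$subjectTrialId)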

LESSONS LEARNT
  1. R is very fast most of the time. The main function moveOrDeleteSubtaskS2() had around 50 lines of code. The function was called 46,000 times. A data.frame with over one million rows was split and reformed a few times. And this was all done in around 2 minutes on an old 2007-model laptop.
  2. A single slow command can be the cause of a slow analysis. In my case my analyses were going to take 3 hours instead of 2 minutes because of the way that I was subsetting the data. I'd initially assumed that the culprit was the complexity of my 50-line function. This was not the case.
  3. system.time() is a very useful function.
  4. Optimisation can proceed from theory or from experimentation. I have read assorted wisdom over the years about how to ensure that R code is efficient (e.g., remember to vectorise; a small illustration follows this list). However, when doing once-off data analysis, I tend to treat optimisation on a needs-only basis. Thus, experimentation can help to diagnose the slow code. With the tools of experimentation, it becomes possible to gradually acquire an appreciation of the ways to make R code fast.
  5. Optimisation proceeds from diagnosing the cause of the problem to exploring solutions.
  6. For my purposes optimisation is about orders of magnitude. Saving a couple of seconds was not an issue. Saving hours was an issue. There are several implications of this. First, rough estimates based on applying system.time() to a small subset of the data were sufficient. Second, clarity and ease of programming are more important than small improvements in speed.
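
As a small illustration of points 3 and 4, system.time() makes the benefit of vectorisation easy to see. The snippet below is illustrative only; the exact timings will depend on the machine:

x <- rnorm(1e6)

# Loop: square each element one at a time
system.time({
    squaresLoop <- numeric(length(x))
    for (i in seq_along(x)) squaresLoop[i] <- x[i]^2
})

# Vectorised: one operation applied to the whole vector
system.time(squaresVec <- x^2)
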
ADDITIONAL RESOURCES