THE CONTEXT

I was analysing data from a skill acquisition experiment looking at how participants edit passages of text using keyboard text editing keys. Each participant completed a series of trials. Each trial involved the participant interacting with a specialised text editor. Participants moved the cursor to a point and then deleted a designated section of text. On each trial the key presses made by the participant were recorded. These key presses were stored in a data.frame with around 1.3 million rows (i.e., around 30 key presses per trial, 150 participants, and 300 trials per participant, leading to around 46,000 trials). Here's a look at the data.frame showing the first few key presses for the first participant on the first trial:
> head(cKeys[,c("id", "trial", "subjectTrialId", "keyName", "shift", "control")])
        id trial subjectTrialId keyName shift control
1177654  1     1          10001    down FALSE   FALSE
1171094  1     1          10001    down FALSE   FALSE
465784   1     1          10001 control FALSE    TRUE
843042   1     1          10001     end FALSE    TRUE
488773   1     1          10001 control FALSE    TRUE
888996   1     1          10001      up FALSE    TRUE
As part of the analyses I wanted to divide the key presses into those that were concerned with movement of the cursor and those that were concerned with deletion of the text. I'd written a function that processed the keys from one trial and returned a character vector indicating whether each key was part of the movement or the deletion subtask. The function worked.
The only problem was that the function took around a quarter of a second per trial to run. With 46,000 trials, running the function on all of them would have taken several hours. I would also have needed to run the code multiple times in order to debug it, and to re-run it if the input data changed. This was too slow. Thus, I began a journey to try to optimise the code.
THE JOURNEY

Rprof: I tried the Rprof() command. This produced an output file that was summarised by the summaryRprof() command (Rprof(NULL) turns profiling off). Given my knowledge at the time, the output was not very meaningful. It suggested that the function [.data.frame was consuming most of the time. In hindsight this did offer some clues.
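For readers who have not used the profiler, the workflow described above looks roughly like this. This is a minimal sketch with a made-up toy workload, not the actual analysis code:

```r
# Sketch of the Rprof()/summaryRprof() workflow. The loop below is a
# made-up workload that deliberately does repeated data.frame subsetting.
profFile <- tempfile()
Rprof(profFile)                 # start the sampling profiler
for (i in 1:2000) {
  d <- data.frame(x = rnorm(1000))
  d[d$x > 0, ]                  # repeated data.frame subsetting
}
Rprof(NULL)                     # turn profiling off
prof <- summaryRprof(profFile)  # tabulate time by function
head(prof$by.self)              # functions like [.data.frame show up here
```

The $by.self table attributes time to each function directly, which is how a call like [.data.frame can surface as the main time consumer.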
system.time and the function: On the basis that the function would be run around 45,000 times, I went through the lines of code in the main function and timed how long each line took to run 45,000 times. Below are some artificial examples that are typical of the results.
> system.time(for(i in 1:45000) (
+     i
+ ))
   user  system elapsed 
   0.14    0.00    0.14 
> 
> system.time(for(i in 1:45000) (
+     seq(i)
+ ))
   user  system elapsed 
   7.72    0.08    8.46
Most commands completed their 45,000 repetitions in under a second. A few took a little under 10 seconds. Given that the function had around 50 lines of code, it should have run in 10 minutes at most. Instead, I had projected that it would take around 3 hours to run the code.
system.time and data splitting: Then it became clear that it was not the function that was the problem. Rather the code was slow because of how the data frame was split into separate trials. My original code was:
unsplitSubtask <- unsplit(lapply(
    split(cKeys, cKeys[, "subjectTrialId"]),
    function(x) moveOrDeleteSubtaskS2(x[, "keyName"], x[, "shift"], x[, "control"])),
    cKeys$subjectTrialId)
Subsetting a data frame with around 1.3 million rows took around a quarter of a second for a single trial.
> timeFor10 <- system.time(for(i in unique(cKeys$subjectTrialId)[1:10])
+     cKeys[cKeys$subjectTrialId == i, ])
> timeFor10 * 4600 / 60 / 60
    user   system  elapsed 
3.335000 0.000000 3.347778
The code above timed the data access for ten of the 46,000 trials; I assumed that timing ten trials would give a more accurate estimate than timing one. I then multiplied this ten-trial estimate by 4,600 to scale it up to the full set of trials, and divided by 3,600 to convert seconds to hours. The analysis suggested it would take over 3 hours to run this code on all 46,000 trials.
Thus, the problem was diagnosed.
Developing a solution: The code I settled on is shown below. I tried to minimise the number of times that the 1.3 million row data frame (i.e., cKeys) was subset. The code below took around two minutes to run. Perhaps it could be further optimised, but for my purposes two minutes was fast enough.
cKeys <- keys
source("keyProcessing.R")
splitKeyNames <- split(cKeys$keyName, cKeys$subjectTrialId)
splitShifts <- split(cKeys$shift, cKeys$subjectTrialId)
splitControls <- split(cKeys$control, cKeys$subjectTrialId)
uniqueSubjectTrialIds <- unique(cKeys$subjectTrialId)
splitSubtask <- lapply(uniqueSubjectTrialIds, function(i)
    moveOrDeleteSubtaskS2(
        splitKeyNames[[as.character(i)]],
        splitShifts[[as.character(i)]],
        splitControls[[as.character(i)]]
    ))
unsplitSubtask <- unsplit(splitSubtask, cKeys$subjectTrialId)
cKeys$subtaskLevel2 <- unsplitSubtask
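To see why splitting once helps, here is a self-contained toy comparison of the two patterns. The data and sizes are made up (this is not the real cKeys, and a simple column extraction stands in for moveOrDeleteSubtaskS2()), but the check confirms that both approaches produce identical results:

```r
# Toy comparison: subsetting the full data frame once per trial versus
# splitting the needed column once up front. Data below are made up.
set.seed(1)
keys <- data.frame(subjectTrialId = rep(1:500, each = 100),
                   keyName = sample(c("up", "down", "end", "delete"),
                                    50000, replace = TRUE))
trialIds <- unique(keys$subjectTrialId)

# Slow pattern: one logical scan over all 50,000 rows per trial
slow <- system.time(
  slowRes <- lapply(trialIds,
                    function(i) keys[keys$subjectTrialId == i, "keyName"])
)

# Fast pattern: split the column once, then do cheap list lookups
fast <- system.time({
  splitKeyNames <- split(keys$keyName, keys$subjectTrialId)
  fastRes <- lapply(trialIds, function(i) splitKeyNames[[as.character(i)]])
})

identical(slowRes, fastRes)   # TRUE: same output, far fewer full-table scans
```

The slow pattern does one pass over every row of the big data frame per trial, so its cost grows with (rows × trials); the fast pattern pays for a single split and then each lookup is nearly free.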
THE LESSONS

- R is very fast most of the time. The main function moveOrDeleteSubtaskS2() had around 50 lines of code. The function was called 46,000 times. A data.frame with over one million rows was split and reformed a few times. And this was all done in around 2 minutes on an old 2007-model laptop.
- A single slow command can be the cause of a slow analysis. In my case my analyses were going to take 3 hours instead of 2 minutes because of the way that I was subsetting the data. I'd initially assumed that the problem was the complexity of my 50-line function. This was not the case.
- system.time() is a very useful function.
- Optimisation can proceed from theory or from experimentation. I have read assorted wisdom over the years about how to ensure that R code is efficient (e.g., remember to vectorise). However, when doing one-off data analysis, I tend to treat optimisation on a needs-only basis. Experimentation can help to diagnose the slow code, and with the tools of experimentation it becomes possible to gradually acquire an appreciation of the ways to make R code fast.
- Optimisation proceeds from diagnosing the cause of the problem to exploring solutions.
- For my purposes optimisation is about orders of magnitude. Saving a couple of seconds was not an issue; saving hours was. There are several implications of this. First, rough estimates based on applying system.time() to a small subset of the data were sufficient. Second, clarity and ease of programming are more important than small improvements in speed.
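As a concrete instance of the "remember to vectorise" advice mentioned in the lessons above, here is a toy example (not taken from the original analysis) timing the same computation written as an explicit loop and as a single vectorised expression:

```r
# Toy illustration of vectorisation: an element-by-element loop versus
# one vectorised expression computing the same thing.
x <- rnorm(1e6)

loopTime <- system.time({
  out1 <- numeric(length(x))
  for (i in seq_along(x)) out1[i] <- 2 * x[i] + 1
})

vecTime <- system.time(out2 <- 2 * x + 1)

identical(out1, out2)   # TRUE: same answer either way
loopTime["elapsed"]
vecTime["elapsed"]      # the vectorised form is typically much faster
```

Both forms perform exactly the same arithmetic; the vectorised version simply moves the loop into compiled code.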
RELATED RESOURCES

- Tidying and profiling R Code in Writing R Extensions
- Mathew Pratola has some slides on the topic
- Ross Ihaka, Duncan Temple Lang, Brendan McArdle: Writing Efficient Programs in R (and Beyond)
- A discussion on writing efficient code on R Help
- Paul Hsieh has a general discussion of programming optimization (not R specific)
- Other options for code optimisation include Calling C code from R and Parallel Computing