Jeromy Anglim's Blog: Psychology and Statistics<br />
Posts on statistics, study design, statistical computing, R, and more with a focus on research applications in psychology.<br />
<br />
<h1>A Publication Workflow for Organising Files and Directories (2020-08-04)</h1>
The following describes my workflow for publishing journal articles. It defines a set of rules for organising the files and directories associated with writing and publishing a peer-reviewed journal article.<div>It covers issues of file organisation, version control, and collaboration. It embodies a number of lessons that I've learnt while publishing journal articles, and it helps to have a standardised approach.</div><div><br /></div><div><div><div>In this context, the project is the publication of a journal article.</div><div><br /></div><div><b>Short Project Name</b></div><div><div>Every project needs a short name that uniquely identifies the project. This name is used in several settings including the parent directory name, the data analysis directory name, the manuscript name, and when talking to colleagues about the project.</div><div><br /></div><div>A good project name is short, descriptive, and uniquely identifies the project. Two words is usually best, but three words is okay. Eight to 15 characters is usually about right. It's a little bit like thinking of a good running head, but even shorter and more for private consumption.</div><div><br /></div><div><b>Examples: </b>Some recent short project names for my papers include: "hexaco-ei", "hexaco-wellbeing", "employee-facets", "hexaco-applicants", "subtask-learning", and "dynamic-wellbeing".</div><div><br /></div><div>Project names can be bad for a range of reasons.</div><div><ul style="text-align: left;"><li><b>Too long.</b> This makes file names hard to read. It makes it difficult to talk to colleagues about the paper. It makes it more mentally taxing to think about the project by name.</li><li><b>Conflicts with other projects.</b> If you have multiple projects in an area, it's important to distinguish the focal project from other similar projects.</li><li><b>Not descriptive enough.</b> It is best to think about the defining feature of the study. A bad name fails to bring the project to mind.</li></ul></div></div><div><b>Parent Directory Name</b></div><div>Every project has a parent directory. This is the directory that contains all the core files of the project.</div><div>The parent directory is named:</div><div><b>"short-name-year"</b> or <b>"short-name-year-storagemode"</b></div><div><b><br /></b></div><div>By storage mode, I refer to tools like dropbox or onedrive, or "local" for your local computer. E.g.,</div><div><b>"short-name-year-dropbox" or </b><b>"short-name-year-onedrive" or </b><b>"short-name-year-local"</b></div><div><b><br /></b></div><div>Appending the year the project commenced is helpful as an additional identifier for the project. In particular, the short name might be good initially, but may become less identifying over time (e.g., as you do more similar research).
So when you're searching for the project in years to come, the year becomes particularly useful.</div><div><br /></div><div>Appending the storage mode is particularly helpful when you are collaborating with colleagues on a project using a tool like dropbox or onedrive, but you also need to maintain some files on your local computer (for example, confidential data, private brainstorming, files you don't want colleagues interfering with, files that get corrupted on data sharing platforms, etc.). In this case, appending "local" to your local files and "dropbox" to the shared files helps distinguish the two.<br /></div><div><br /></div><div><b>Examples:</b> "hexaco-ei-2017-dropbox", "hexaco-wellbeing-2017-dropbox"</div><div><br /></div><div><b>Directories</b></div><div>The following are the core directories of the project (a sketch of a typical layout follows these lists).<ul style="text-align: left;"><li><b>manuscript: </b>Stores the authoritative version of the manuscript, the reference manager database, and any online supplement files that will be submitted to the journal.</li><li><b>archive: </b>Stores old versions of files (i.e., the manuscript). It provides a simple form of version control.</li><li><b>submissions: </b>This contains one directory for each journal submission, and each journal submission includes folders for each step of the publication process.</li></ul>Additional directories:<br /><ul style="text-align: left;"><li><b>notes:</b> Stores any preliminary analyses, literature reviews, and files that involve analysis or reflection.</li><li><b>resources: </b>Stores any files related to the study (e.g., meta-data, scoring instructions, details about the survey and tasks, raw data, and so on).</li><li><b>analysis:</b> The data analysis files.</li></ul></div>
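<div>To make this concrete, here is a sketch of what a project laid out under these rules might look like (all file names here are hypothetical):</div><pre><code>hexaco-values-2020-dropbox/
    workings-hexaco-values.docx
    manuscript/
        manuscript-hexaco-values-4-aug-2020.docx
        supplement-hexaco-values.docx
        hexaco-values.enl
    archive/
        manuscript-hexaco-values-12-jun-2020.docx
    submissions/
        1-jopy/
            1-initial-submission/
            2-first-revision/
    notes/
    resources/
    analysis/
</code></pre>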
<div><b>Manuscript directory</b></div><div><b><br /></b></div><div>The manuscript file name is in the following format:</div><div><b>manuscript-shortname-date.docx</b></div><div><b><br /></b></div><div><b>Examples:</b> if the date of last editing was 4th August 2020 and the short name was "hexaco-values", the manuscript file would be called:</div><div>"manuscript-hexaco-values-4-aug-2020.docx"</div><div><br /></div><div>Note that the file name never has words like "draft", "rough draft", "final", "absolutely-final", "final2", etc. The date is all that is required to indicate that it is the latest version.</div><div><br /></div><div>The word "manuscript" is placed at the start of the file name for several reasons. First, it clearly denotes this file as the manuscript file as opposed to some other file (e.g., supplemental files, etc.). Second, it is easier to identify as the manuscript file than if the file were called "shortname-manuscript-date". This leads to fewer errors when uploading files to the manuscript submission system.</div><div><br /></div><div><b>When to update the date in the manuscript file name? </b>The general idea is that whenever the manuscript reaches a key stage, a copy of the manuscript is placed in the "archive" directory and the date in the filename is updated. Key stages include: whenever the manuscript moves between authors, when submitting it to a journal, after a revise and resubmit, when it has been a long time since you've touched the manuscript, and when you're about to engage in some substantial edits. This essentially implements a basic form of version control. It enables you to recover any deleted content should you need to. It is also more comfortable to implement edits knowing that things can be restored.</div><div><br /></div><div><b>Other files in the manuscript directory: </b></div><div><b>Files associated with the reference manager.</b> I use Endnote to manage references, and I use a database that is project specific. In theory, Endnote can experience issues if multiple collaborators are trying to use Dropbox to work with the same Endnote folder. That said, often there are no issues. One solution is to designate one person to manage Endnote, with other authors just adding comments about references.</div><div><br /></div><div><b>Other files:</b> Quite often, there are online supplement files that get submitted to the journal. These often provide additional methodological details or additional analyses. It makes sense to keep these in the manuscript folder as they will need to be submitted to the journal.</div><div><br /></div><div><b>Submissions directory</b></div><div><b><br /></b></div><div>The submissions directory is where all the submissions to journals are stored. The general folder structure is that there is a directory for each journal that you submit to, with the prefix 1, 2, 3, etc. Obviously, you only submit to journal 2 after journal 1 has rejected you.</div><div><br /></div><div style="text-align: left;"><b>Example:</b> if the first submission was to Journal of Personality, it would be called "1-jopy"; if that was rejected and we tried Australian Psychologist, the second folder would be called "2-apsych".</div><div><br /></div><div>Within each journal submission directory are numbered directories for each stage of the submission. Here is one example set of folders:</div><div><ul style="text-align: left;"><li>1-initial-submission (cover letter, manuscript with anonymised title page, non-anonymised title page, online supplement, confirmation of submission email, pdf of submission)</li><li>2-first-revision (email with revision requests, updated manuscript/supplement, response notes, confirmation of resubmission email, pdf of resubmission)</li><li>3-acceptance (email confirming acceptance)</li><li>4-licence (copy of copyright agreement)</li><li>5-proofs (files associated with proofing)</li><li>6-formatted-online-first (copy of online first version)</li><li>7-preprint (preparing post-print for psyarxiv)</li><li>8-page-numbers (final journal pdf with page numbers)</li></ul><div>Other common folders include:</div></div><div><ul style="text-align: left;"><li>3-second-revision (same files as first revision, just updated)</li><li>4-third-revision (same files as first revision, just updated)</li><li>5-rejection (copy of rejection email; optionally brainstorming and reflections)</li></ul><div>The general principle is that these submission directories include (a) a read-only copy of the manuscript (often split up into title page and body) and related files (e.g., online supplement, figures, etc.), (b) any journal-specific submission files (e.g., cover letter, responses to reviewer comments, journal-specific information such as highlights), (c) any journal correspondence, and (d) PDFs generated by the submission system.</div></div><div><br /></div><div>An important principle here is that everything has one authoritative source. So, you never edit the actual manuscript in the submissions folder. These edits belong in the "manuscript" folder.
The only edits to the manuscript that occur in the submissions folder are things like anonymising the title page and making the manuscript conform to journal requirements (e.g., putting tables/figures in specific places).</div><div><br /></div><div>That said, things like cover letters and responses to reviewer comments do live in their respective submission directories, and that is their authoritative home.</div><div><br /></div><div><b>Resources and Notes Directories</b></div><div>Journal articles have lots of assorted resources (details on measures and procedure, literature searches, data analysis notes, brainstorming of ideas, etc.). The main point here is that these materials are organised in directories of the project.</div><div><br /></div><div><b>Linked Directories</b></div><div>In some instances, not all files are contained in the project directory. Resources may be relevant to more than one project, or there may be files that need to be stored elsewhere. In this case, I place an alias or shortcut link to these resources in the parent directory.</div><div><br /></div><div><b>Workings File</b></div><div>I often have a file called "workings-shortname.docx" in the root folder of the project. This is used to store all project-related brainstorming and notes.</div><div><b><br /></b></div><div><b>Template of Project</b></div><div>I store a template version of a new project on github:</div><div><a href="https://github.com/jeromyanglim/anglim-manuscript-template/">https://github.com/jeromyanglim/anglim-manuscript-template/</a></div><div><br /></div><div>I have a bookmark in my browser which downloads a zipped-up copy of the template:</div><div><a href="https://github.com/jeromyanglim/anglim-manuscript-template/archive/master.zip">https://github.com/jeromyanglim/anglim-manuscript-template/archive/master.zip</a></div><div><br /></div><div>This makes starting a new project very efficient. I update it from time to time to reflect changing conventions and so on (e.g., APA 7).</div></div></div>
<br />
<h1>Home, End, Page Up, Page Down Keys in OSX (2019-03-05)</h1>
Shortcut keys for navigation are inconsistent in OSX across applications. Notionally, the home and end keys are Fn + Left / Right, and Page Up / Page Down are Fn + Up / Down. However, this does not always achieve the same navigational effect, especially for programs ported from Windows.<br />
<br />
<div>
TextEdit, text boxes in Chrome (and probably other native OSX apps)</div>
<div>
<ul>
<li>Paragraph Up / Down: Alt + Up / Down</li>
<li>Start of line / End of line: Cmd + Left / Right</li>
<li>Start of document / End of document: Cmd + Up / Down</li>
<li>Page Up / Down: Fn + Alt + Up / Down</li>
<li>Previous / Next Word: Alt + Left / Right</li>
</ul>
</div>
Finder<br />
<br />
<ul>
<li>Top / Bottom File: Alt + Up / Down</li>
</ul>
<div>
<br /></div>
<div>
Chrome Browsing</div>
<br />
<div>
<ul>
<li>Start / End of document: Fn + Left / Right OR Cmd + Up / Down</li>
<li>Page Up / Page Down: Fn + Up / Down</li>
</ul>
Skim</div>
<br />
<div>
<ul>
<li>Start / End of document: Fn + Left / Right</li>
<li>Page Up / Page Down: Fn + Up / Down</li>
<li>Previous Page / Next Page: Cmd + Left / Right</li>
</ul>
</div>
<br />
<div>
<div>
Word</div>
</div>
<div>
<ul>
<li>Paragraph Up / Down: Alt + Up / Down</li>
<li>Start of line / End of line: Fn + Left / Right OR Cmd + Left / Right</li>
<li>Start / End of document: Fn + Cmd + Left / Right</li>
<li>Page Up / Down: Fn + Up / Down</li>
<li>Previous / Next Word: Alt + Left / Right</li>
</ul>
<div>
Outlook</div>
</div>
<div>
<ul>
<li>Paragraph Up / Down: Alt + Up / Down</li>
<li>Start of line / End of line: Fn + Left / Right</li>
<li>Start / End of document: Fn + Cmd + Left / Right</li>
<li>Page Up / Down: Fn + Up / Down</li>
<li>Previous / Next Word: Alt + Left / Right</li>
</ul>
</div>
<h1>Ways that closed access academic publishers could improve (2017-11-14)</h1>
Accessing journal articles has certainly got easier over the years. Nonetheless, there are still many issues with how closed-access journals provide access. This hampers the scientific process.<br />
<br />
Accessing a journal article that my institution has paid for should be as simple as going to the publication's standard home page (with one more click to download the PDF).<br />
<br />
From a broader societal perspective, there are many reasons to like open access publishing. However, for working scientists with institutional access to most journals, the day-to-day issue with closed access publishers is usability. Here are a few things that I wish all publishers would do:<br />
<br />
<h3>
Do not add cover pages to PDFs</h3>
<div>
Some publishers still insist on adding an initial page to their PDFs.</div>
<div>
Many have stopped doing this, but I notice that Emerald still does.</div>
<div>
<br /></div>
<h3>
When the user clicks download PDF, download the PDF</h3>
<div>
<ul>
<li>Do not ask the user to consent to terms and conditions (as JSTOR requires)</li>
<li>Do not give a pop-up with further options (e.g., Elsevier asks whether you want to download the article or the whole issue; T&F asks whether you want the PDF or an interactive PDF)</li>
</ul>
<div>
If you want to provide two different types of downloads, then include two different buttons on the main page.</div>
<h3>
Do not try to open the PDF in a publisher-specific proprietary PDF viewer</h3>
</div>
<div>
The user wants to download the PDF to their computer or view it in their normal viewer. Often they want to collate it on their own computer.</div>
<h3>
Facilitate seamless access for those with institutional access from the official landing page</h3>
<div>
Users generally have access to journal articles through their institution. Publishers should do their best to ensure that anyone who has institutional access is able to quickly get access through the official manuscript landing page, and they should err on the side of granting access. Whether this involves simple institutional sign-in, cookies, or something else, the point is that the official page for a manuscript (i.e., the one that the doi directs to) should provide simple access to articles for people with a subscription (institutional or otherwise). Forcing users to jump through hoops using library proxies, links from Google Scholar, and so on is not helpful. This is all compounded by the fact that there are many different publishers. Basically, the experience for someone with institutional access should be seamless whether they are accessing the article via a mobile phone or a computer, and whether they are on campus or not.</div>
<div>
Presumably, there are many solutions to this. Cookies, simple authentication systems, and so on. The point is, it should "just work" on all devices from all locations.</div>
<h3>
Do not throttle downloads</h3>
<div>
I heard recently that someone was prevented from downloading articles because they had reached some kind of short-term maximum. This is not good. At the very least, such a throttle should be triggered only when downloads hit the thousands in a day, not 20 or 30 articles.</div>
<div>
<br /></div>
<h3>
General Principles</h3>
<div>
<ul>
<li>Listen to the usability experts and not the legal department</li>
<li>Your priority is facilitating access to scientific knowledge</li>
</ul>
</div>
<div>
<br /></div>
<div>
<br /></div>
<h1>Generating APA style tables in R: Current challenges (2017-03-31)</h1>
This post reviews some aspects of generating formatted tables using R that are suitable for inclusion in a manuscript conforming to APA style. I review my current workflow, which involves a large amount of manual formatting in Excel. I then discuss what it would take to automate more of these manual steps in R.<br />
<br />
My current workflow for incorporating tables into a journal manuscript involves the following steps:<br />
<ul>
<li>Create a data.frame in R with the core table data, where row and column names carry the row and column headers. This usually includes some rounding of numbers to the desired precision (in order to avoid Excel rounding errors); see the sketch after this list</li>
<li>Export data.frame as a csv e.g., <span style="font-family: "courier new" , "courier" , monospace;">write.csv(mytab, file = "output/mytab.csv")</span><span style="font-family: inherit;">, although sometimes I'll write to Excel to automate bolding.</span></li>
<li><span style="font-family: inherit;">Open csv in Excel and apply manual formatting</span></li>
<li><span style="font-family: inherit;">Paste adjusted table into Word</span></li>
</ul>
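<div>As a minimal sketch of the first two steps, assuming a hypothetical results table called <span style="font-family: "courier new" , "courier" , monospace;">mytab</span>:</div><pre><code># Hypothetical results table (correlations and p-values)
mytab <- data.frame(r = c(0.31245, -0.04562),
                    p = c(0.00123, 0.54321),
                    row.names = c("Age", "Income"))
mytab$r <- round(mytab$r, 2)  # round in R to avoid Excel rounding errors
mytab$p <- round(mytab$p, 3)
write.csv(mytab, file = "output/mytab.csv")
</code></pre>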
<h3>
Pros and cons of manual formatting in Excel</h3>
Benefits of Excel approach<br />
<div>
<ul>
<li>In many respects this approach is fairly efficient. </li>
<li>If you are not updating your table results often, then it is often quicker to do formatting adjustments in Excel. </li>
</ul>
<div>
Problems with Excel approach</div>
<ul>
<li>If the data are updated multiple times, then repeatedly reformatting the table can be time consuming.</li>
<li>There is also the potential for errors to be introduced in the adjustment process, and the more times the data are updated, the more opportunities there are for transcription errors.</li>
<li>The time it takes to manually convert the table discourages making updates that would require this.</li>
<li>There is scope to standardise certain tables (e.g., correlation matrices, tables of descriptives by groups) and thus work spent automating could have benefits for future projects.</li>
</ul>
</div>
<h3>
Review of activities done during Excel formatting</h3>
<div>
The following is influenced by the terminology and formatting requirements of APA style (see Chapter 5 of the APA 6th Edition Manual).</div>
<div>
<ul>
<li><b>Modify fonts</b></li>
<ul>
<li>Change font type and size to align with the manuscript (e.g., 12 point Times New Roman)</li>
<li>Add selective font formats. Bolding certain numbers is quite common (e.g., correlations or factor loadings above a threshold); italicising certain statistical labels (e.g., M and SD, and 1, 2, 3, etc. in correlation tables) is also common.</li>
<li>Superscripts tied to specific table notes</li>
</ul>
<li><b>Add or modify content</b></li>
<ul>
<li>Convert R row and column names to the names used in the table. In particular, variable names are almost always distinct from the labels shown in the table.</li>
<li>Ensure capitalisation meets style requirements</li>
<li>Add consecutive numbers followed by a period to row names. E.g., it is common to number variables in a correlation matrix "1. Age", "2. Income", etc.</li>
<li>Add stub heading. I.e., the column heading for the first column (i.e., row.names) </li>
<li>Adjust numbers: e.g., a p-value less than .001 might be shown as <.001, an adjusted r-squared value less than 0 might be displayed as 0.</li>
<li>Convert p-values to significance stars</li>
</ul>
<li><b>Adjust cell alignment. </b></li>
<ul>
<li>Usually, headers are centred, numbers in the body are centred, and the first column is left aligned.</li>
<li>When row headings are nested, nested row stubs are indented (e.g., 3 spaces)</li>
</ul>
<li><b>Delete cell content</b></li>
<ul>
<li>Deleting the lower or upper triangle from symmetric matrices (e.g., a correlation matrix)</li>
<li>Deleting the diagonal from correlation matrices</li>
</ul>
<li><b>Delete rows or columns</b></li>
<ul>
<li>Ideally, the actual rows or columns of data have been specified correctly in R, but occasionally it is simpler to remove rows or columns at the Excel stage. For example, the R output might list fit statistics for six models, but it is later decided that only five are relevant. By contrast, rearranging the order of rows should be done in R for increased reliability.</li>
</ul>
<li><b>Add lines</b></li>
<ul>
<li>Lines are placed at the top and bottom of the column header row and below the last row</li>
<li>Decked column headings and table spanners require additional lines</li>
</ul>
<li><b>Format numbers</b></li>
<ul>
<li>Common tasks include adjusting the number of decimal places, removing leading zeros (e.g., correlations, multiple R, p-values), putting parentheses around certain numbers, and combining two numbers in some way (e.g., ranges and confidence intervals often have a separator like a comma or hyphen and may be surrounded by brackets). See the sketch following this list.</li>
</ul>
<li><b>Add line breaks in cells</b></li>
<ul>
<li>Some cells have two or more bits of information that should be presented on distinct lines. E.g., a column name might include the sample size on a second line (e.g., "Treatment {line-break} (n = 132)"), or a value might be presented on the first line and a confidence interval on the second. In this case, it is also possible to insert an additional row into the table and include these values in separate cells.</li>
<li>Some text is too long and needs to be split across multiple rows. This is usually done automatically. However, often this should include an indent on the second or subsequent row.</li>
</ul>
<li><b>Adjust column widths</b></li>
<ul>
<li>This is often a manual process in order to get the table to fit on the page and avoid cell wrapping.</li>
</ul>
<li><b>Decked headings: Special requirements</b></li>
<ul>
<li>Decked headings occur where two or more column headings are grouped under a column spanner (e.g., M and SD is shown for two groups where the group name is the spanner). </li>
<li>Merge cells of column spanner (i.e., the heading that groups the two columns)</li>
<li>Insert line below the cells of the column spanner</li>
<li>Insert a small empty column between column spanner and other columns (this ensures that there is a gap between the line underneath the column spanners and makes it easier to see the intended grouping)</li>
</ul>
<li><b>Table spanners: Special requirements</b></li>
<ul>
<li>A table spanner is a centred heading that represents a major subdivision of a table. </li>
<li>It involves inserting a new row with merged cells and centred text and adding a line to the bottom of the table division.</li>
</ul>
<li><b>Table caption, title, and notes: Special requirements</b></li>
<ul>
<li>In general, I specify these things in the manuscript. Mostly this works well. There is just the occasional bit of information that might be data driven. E.g., correlations above a certain value might be flagged as significant and this information might be included in the table note.</li>
</ul>
</ul>
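<div>As a sketch of how some of the number-formatting steps above might be automated, the following base R helpers (the function names are my own) fix decimal places, remove leading zeros, and convert p-values to significance stars:</div><pre><code># Format to a fixed number of decimal places
apa_round <- function(x, digits = 2) formatC(x, format = "f", digits = digits)

# Remove the leading zero from statistics bounded by 1 (correlations, p-values)
drop_leading_zero <- function(x) sub("^(-?)0\\.", "\\1.", x)

# Convert p-values to significance stars
p_stars <- function(p) {
  symnum(p, corr = FALSE, na = FALSE,
         cutpoints = c(0, .001, .01, .05, 1),
         symbols = c("***", "**", "*", ""))
}

drop_leading_zero(apa_round(c(0.3456, -0.072)))  # ".35"  "-.07"
p_stars(c(0.0004, 0.03, 0.2))                    # "***"  "*"  ""
</code></pre>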
<h3>
Reflections on manual formatting</h3>
</div>
<div>
Table formatting is complex. There is a visual quality to formatting tables. While some tables are approximated by a matrix with row and column headers, there are a huge number of common and not so common additional requirements. I often identify refinements to table formatting in an iterative fashion until it looks right.<br />
<br />
While I attempted to document all the tasks that I do, I would not be surprised if there were additional tasks that did not come to mind. And presumably the common requirements of APA style tables in psychology are not the same as those relevant to other style guides and other disciplines.<br />
<br />
It is possible to automate all of the above steps using R and output a table in a suitable format such as rtf, docx, or possibly HTML. However, at this point, this would require a lot of coding for each table.<br />
<br />
There are a few packages of relevance:<br />
<br />
<ul>
<li><a href="https://cran.r-project.org/web/packages/apaTables/index.html">apaTables</a> provides APA tables exported to RTF for a few very specific scenarios. And the author also adopts specific preferences, which while well reasoned, are not always what you want.</li>
<li><a href="https://cran.r-project.org/web/packages/apaStyle/index.html">apaStyle</a> is similar to apaTables in that it exports to Word format, although it seems a little more flexible. It has a generic table function that can handle decked headings, but it still seems a long way from the flexibility required to produce most tables.</li><li><a href="https://rempsyc.remi-theriault.com/articles/table">rempsyc</a> includes functions for outputting APA tables to Word from R.</li>
<li><a href="https://cran.r-project.org/web/packages/xtable/index.html">xtable</a> is one of the best packages for table production but it exports principally to HTML and LaTeX. It also doesn't really seem designed for capturing all the complexities of APA style tables.</li>
<li>htmlTable in gmisc allows for some complexity. <a href="http://timelyportfolio.blogspot.com.au/2013/04/tables-are-like-cockroaches.html">See this example.</a></li>
</ul>
<br />
The challenge is to design a flexible and efficient system that is also reliable (in that it limits the introduction of errors). I think a nice challenge for anyone willing to take this on would be to develop a simple set of functions in R that generate tables in Word or RTF format and that could be used to produce the 16 tables in the APA 6th edition style manual (ideally from hypothetical data, to include the additional challenges of extracting and formatting the numbers, converting variable names, etc.). These tables include a range of the common requirements of APA style that are not well supported in existing packages.<br />
<br />
<b>Update:</b><br />
<br />
<ul>
<li>After posting, I learnt about the <a href="https://github.com/crsh/papaja">papaja package</a>. It seems specifically designed for writing APA style documents with R Markdown. The apa_table function seems like it's designed to capture many of the quirks of APA style, but at present its more advanced table-formatting features are limited to exporting LaTeX (i.e., R Markdown to LaTeX to PDF). There is a lot to love about a fully reproducible workflow, but at present I still find that collaboration and other features make Word my go-to option for manuscript preparation.</li>
<li>huxtable (mentioned in the comments) has quite a lot of formatting flexibility. It exports to HTML and LaTeX format. See <a href="https://cran.r-project.org/web/packages/huxtable/vignettes/introduction-to-huxtable.html">this vignette</a>. It also supports row and column spans, although row spans are handled as separate columns whereas APA style uses indenting. I'm also not clear on how you would go from HTML to Word. My general impression is that HTML is less prescriptive by design.</li>
</ul>
</div>
<h1>Suggestions for how R and RStudio could improve auto-completion and usability of R (2016-09-12)</h1>
RStudio has improved the power of auto-completion in R and generally increased usability. However, there remains the potential to improve discoverability and usability further. There are also coding practices that R package authors can adopt both to work better with auto-complete and to make the features of their packages more discoverable. After using and teaching R for the last ten years, this post outlines what I see as the major areas for potential improvement.<br />
R has a reputation as being efficient once you know how it works, but difficult to learn.<br />
<div>
Auto-completion increases coding productivity.<br />
<div>
<ul>
<li>Users don't have to memorise the precise spelling of the name of every function, argument name, data frame variable, and argument value. It also helps to resolve the issue of the wide range of coding conventions in R (camelCase, dot.names, under_score_names, etc.). </li>
<li>It means that users can focus more on coding and less on looking up help files for the precise phrasing of some low level feature, or constantly typing <span style="font-family: "courier new" , "courier" , monospace;">dput(names(mydata)) </span><span style="font-family: inherit;">to get lists of variable names.</span></li>
<li><span style="font-family: inherit;">New users may also know what they are looking for, but not know how to obtain it. Auto-completion can facilitate this.</span></li>
</ul>
<div>
My general conclusion is that auto-completion needs to be taken more seriously in R. RStudio has done a great job of implementing auto-completion. I also think that the R language and package authors could incorporate features to work better with IDEs that implement auto-complete. </div>
<div>
<h3>
<b>Auto-completion of arguments that take a character variable</b></h3>
</div>
<div>
<span style="font-family: inherit;">Many functions have multi-category options (e.g., method of correlation, missing data procedure for a table, type of factor analysis rotation. It would be good to have auto-completion on these values. </span></div>
<div>
<b><br /></b></div>
Example 1: If I have missing data when computing a correlation matrix, I use the "use" argument to specify what kind of missing data handling should occur. It would be good if code completion operated on the available options. That said, at least RStudio automatically shows the argument instructions, which list the options.<br />
<br />
Example 2: The options for <span style="font-family: "courier new" , "courier" , monospace;">useNA</span> and <span style="font-family: "courier new" , "courier" , monospace;">exclude</span> arguments of <span style="font-family: "courier new" , "courier" , monospace;">table</span><br />
<br />
Example 3: The <span style="font-family: "courier new" , "courier" , monospace;">rotation</span> argument for <span style="font-family: "courier new" , "courier" , monospace;">factanal</span> does not list the available rotations. The help only states that the default argument is <span style="font-family: "courier new" , "courier" , monospace;">"varimax"</span> and that there are other rotations in some other packages, although the help file does show "promax" as another option.<br />
<br />
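<div>To make these concrete, here is a minimal sketch of the three examples above using built-in datasets (the particular option values come from the respective help files):</div><pre><code># "use" has five options; auto-completing them would help
cor(airquality$Ozone, airquality$Wind, use = "pairwise.complete.obs")

# useNA takes "no", "ifany", or "always"
table(airquality$Ozone > 100, useNA = "ifany")

# rotation accepts "varimax" (the default), "promax", and others
factanal(mtcars[, 1:6], factors = 2, rotation = "promax")
</code></pre>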
<b>Recommendation</b>:<br />
<br />
<ul>
<li>Package authors should ensure that the help files list all argument options in the "arguments" section of the help file. If using "see details", at least list the permissible option names in the arguments section and use the details section for explaining what each option means. RStudio displays the argument information in auto-complete. Often a user just wants to be reminded of the precise spelling of an argument option or wishes to get an overview of the choices.</li>
<li>It should be possible to enable auto-completion on the available options. I imagine this would involve the specification of additional language features in R which would then be detected by IDEs like RStudio.</li>
</ul>
<h3>
Auto-completion for nested ellipsis arguments</h3>
<div>
Ellipsis arguments (...) allow for flexibility. However, they also decrease usability because users are less clear on which arguments can be passed to a function. This is particularly true for arguments to methods like print and summary.</div>
<br />
Example 1: I'm running a factor analysis<br />
<span style="font-family: "courier new" , "courier" , monospace;">fit <- factanal(matrix(rnorm(1000), ncol = 10), 2)</span><br />
<span style="font-family: monospace;"><br />
</span> The code for printing the loadings has several arguments, including "sort" and "cutoff", i.e.,<br />
<code><span style="font-family: "courier new" , "courier" , monospace;">print(fit, sort = TRUE, cutoff = .5)</span></code><br />
<code><br />
</code> <br />
But auto-complete doesn't see these arguments. RStudio actually does a pretty good job of finding arguments. It seems that these arguments belong to "print.loadings" as opposed to "print.factanal". Thus, if you go:<br />
<span style="font-family: "courier new" , "courier" , monospace;">loads <- fit$loadings</span><br />
<span style="font-family: inherit;">Then, pressing tab after </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(loads, </span><br />
<span style="font-family: inherit;">will show the </span><span style="font-family: "courier new" , "courier" , monospace;">cutoff</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">sort</span><span style="font-family: inherit;"> arguments.</span><br />
<span style="font-family: inherit;">However, it seems that RStudio is only able to to go one layer deep.</span><br />
<span style="font-family: inherit;"><br /></span>
I imagine that this is a hard one to solve.<br />
<h3>
Auto-completion of variable names in data frames</h3>
<div>
There is limited auto-completion support in RStudio for names in data frames. It has improved. You can type <span style="font-family: "courier new" , "courier" , monospace;">mydata[, {tab}</span> and get the variable names. However, you can't type <span style="font-family: "courier new" , "courier" , monospace;">mydata[,c(" {tab}</span><span style="font-family: inherit;">.</span></div>
<div>
<br /></div>
<div>
Recommendations:</div>
<div>
<ul>
<li>RStudio should also auto-complete variable names after <span style="font-family: "courier new" , "courier" , monospace;">mydata[,c("</span>, i.e., after quotation marks, because presumably that is how the user would be selecting variables when they realise that they can't remember the precise spelling and so need to tab complete.</li>
</ul>
</div>
<h3>
Auto-completion on formulas</h3>
<div>
Many functions in R use formulas. Most notable are model fitting functions like <span style="font-family: "courier new" , "courier" , monospace;">lm</span> and <span style="font-family: "courier new" , "courier" , monospace;">glm</span>. However, there is no support in RStudio for auto-completing variable names in formulas. One impediment is that formulas come before the data.frame argument in most functions (e.g., lm), so if there are multiple data frames in the workspace, it would be a little tricky to know which variables to list.</div>
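<div>For example, in the following call, the formula variables are typed before the IDE has seen which data frame they belong to:</div><pre><code># the IDE only learns about "iris" after the formula has been typed
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
</code></pre>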
<h3>
Auto-completion in the Hadleyverse (e.g., ggplot2) and other functions where a data frame is one argument and variable names are another</h3>
<div>
Hadley Wickham's packages are awesome. However, they have a particular coding style. In particular, a data frame is commonly one argument (e.g., the first) and variable names are specified as a separate argument; often this is done without quotation marks and in a slightly separate context to the specification of the data frame. For example, in the following context:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ggplot(mydata, aes(my_very_long_variable_name))</span></div>
<div>
<span style="font-family: inherit;">There is no auto-completion in RStudio for the variable </span><span style="font-family: "courier new" , "courier" , monospace;">my_very_long_variable_name.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Similar coding rules apply to a wide range of functions where variable names are specified in a separate argument to the data.frame (e.g., see many of the </span><span style="font-family: "courier new" , "courier" , monospace;">dplyr</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">tidyr</span><span style="font-family: inherit;"> functions, but also base R functions like </span><span style="font-family: "courier new" , "courier" , monospace;">subset</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">reshape</span><span style="font-family: inherit;">). These functions would be so much easier to use if there was auto-completion of variable names in these contexts. </span></div>
<div>
<br /></div>
<div>
One approach would just be to show auto-completion of variable names of data.frames in more places. However, this could get noisy. Another approach would require a deeper understanding of the language. Presumably this could be done on an ad hoc basis. For example, RStudio could hard code ggplot2 features to know when auto-completion on variable names should occur. Otherwise, perhaps there could be a convention for how package authors could speak to IDEs that want auto-completion information, and a more general way of indicating that auto-completion software should look at the preceding data.frame for the variables.</div>
<h3>
<span style="font-family: inherit;">Auto-completion for function arguments that take lists</span></h3>
<div>
<span style="font-family: inherit;">There are many functions that have an argument that takes a named list:</span></div>
<div>
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">nls(..., control = list(...))</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">ProjectTemplate::load.project(override.config = list(...))</span></li>
</ul>
<div>
There is no auto-completion for the allowed named elements.</div>
</div>
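<div>For example, the control argument of nls accepts a named list whose permissible elements (maxiter, tol, minFactor, etc.) are documented under nls.control rather than surfaced by auto-complete. A minimal sketch, with illustrative control values:</div><pre><code># self-starting logistic growth model (from the nls examples)
fit <- nls(circumference ~ SSlogis(age, Asym, xmid, scal),
           data = Orange,
           control = list(maxiter = 100, tol = 1e-6))
</code></pre>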
<div>
<br /></div>
<div>
Recommendations:</div>
<div>
<ul>
<li>Package authors: include the list of permissible argument names in the arguments section of the help file so that auto-completion software can quickly show this information.</li>
<li>R language: there should be a way to specify the permissible arguments, which could then be incorporated into some form of auto-complete in RStudio.</li>
</ul>
<h2>
Some other issues</h2>
</div>
<div>
The following are some other issues related to auto-completion.</div>
<h3>
Make more model fit information accessible from the fit object</h3>
<div>
An attractive feature of SPSS and related software is that you get a lot of output and there is often a GUI that allows you to select the output that you want. R model output tends to be brief, and if you want additional output, you need to ask for it. This is also good, but how to obtain the additional output could be more intuitive. For example, there is a lot of different information that you might want to obtain from a multiple regression (influence statistics, standardized coefficients, zero-order correlations between predictors and outcome, and so on). One of the challenges is that modelling in R is often of the form: (1) return a fit object, (2) run a function or method on that fit object. However, for a new user, it is often difficult to discover the available functions and methods required to derive a relevant bit of information from an R fit object.</div>
<div>
<br /></div>
<div>
It would be nice if it was as simple as typing<span style="font-family: "courier new" , "courier" , monospace;"> fit$ {tab} </span><span style="font-family: inherit;">and getting a big list of things that you might want to obtain.</span></div>
<h3>
<b>Avoid printing output to the screen that cannot easily be extracted</b></h3>
<div style="font-size: medium; font-weight: normal;">
R generally makes reproducible analysis easier to perform. A common use case is to take the output of a function and use that output in a subsequent function. This can be as simple as creating a table that combines different elements (e.g., coefficients from multiple models along with fit statistics).</div>
<div style="font-size: medium; font-weight: normal;">
<b><br /></b></div>
<div style="font-size: medium; font-weight: normal;">
However, some functions print the statistics you want to the screen, but these numbers are not readily available. In general, this means that the print function is performing the calculations and printing them to the screen without ever storing the results in an object.</div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
<div style="font-size: medium; font-weight: normal;">
Example 1: The print method for factanal prints proportion variance explained for each factor. This is calculated in the print function but is not accessible. If you didn't know how to calculate this yourself, you would have to know that <span style="font-family: "courier new" , "courier" , monospace;">getAnywhere(print.factanal)</span> is the incantation for seeing how R calculates it, and then you'd have to extract the code that does it.</div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
<div style="font-size: medium; font-weight: normal;">
In contrast, when you run summary on an lm fit, you can explore the object and extract things like adjusted r-squared. E.g.,</div>
<div style="font-size: medium; font-weight: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">fit <- lm(y ~ x, mydata)</span></div>
<div style="font-size: medium; font-weight: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">sfit <- summary(fit)</span></div>
<div style="font-size: medium; font-weight: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">sfit$ (tab)</span></div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
<div style="font-size: medium; font-weight: normal;">
This will show the elements of what has been calculated. Depending on trade-offs in computation time, it might even be simpler if more of these summary statistics were calculated with the fit, so that a user only has to fit the object and can then extract the relevant information with <span style="font-family: "courier new" , "courier" , monospace;">fit$ (tab)</span></div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
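<div>Continuing the hypothetical lm example above, the summary object exposes its computed statistics by name:</div><pre><code>sfit$adj.r.squared   # adjusted R-squared
sfit$coefficients    # estimates, SEs, t values, p values
sfit$sigma           # residual standard error
</code></pre>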
<div style="font-size: medium; font-weight: normal;">
Recommendation</div>
<div style="font-size: medium; font-weight: normal;">
</div>
<ul style="font-size: medium; font-weight: normal;">
<li>Package authors should try to ensure that, for every important bit of output in a print function, there is a standard way of extracting that information into an object. For example, the summary method for lm returns the adjusted r-squared.</li>
</ul>
<div style="font-size: medium; font-weight: normal;">
</div>
<h3>
<span style="font-family: inherit;">Many different object exploration operators</span></h3>
<div>
<span style="font-family: inherit;">There are many different operators for exploring objects</span></div>
<div>
<ul>
<li>$ (dollar) to extract named elements of a list (particularly used for the output of statistical functions, variables in data.frames, and general lists of things).</li>
<li>:: (double colon) to extract functions and other objects in a package (e.g., <span style="font-family: "courier new" , "courier" , monospace;">mypackage::foo()</span>)</li>
<li>::: (triple colon) to extract hidden functions</li>
<li>@ (at symbol) to extract elements of S4 class objects</li>
<li>. (period) which is a notational rule relevant to understanding S3 methods (e.g., print.lm)</li>
</ul>
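<div>A quick illustration of the first four operators (the S3 naming rule is visible in the print.lm example):</div><pre><code>fit <- lm(mpg ~ wt, data = mtcars)
fit$coefficients     # $  : named element of a list-like object
stats::sd            # :: : exported object from a package
stats:::print.lm     # :::: unexported function (print.lm is an S3 method)
# @ extracts slots of S4 objects, e.g., m@Dim after m <- Matrix::Matrix(1:4, 2, 2)
</code></pre>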
<h3>
Many rules for examining source code</h3>
</div>
<div>
Being able to see the source code is a nice feature of R. But equally, you need to know quite a bit to actually look at source code, e.g., getAnywhere, double versus triple colons, and compiled code.</div>
<div>
<br /></div>
</div>
</div>
<h1>Workflow for Completing a Revise and Resubmit of a Journal Article in Psychology (2016-07-07)</h1>
This post discusses my workflow for completing a revise and resubmit.<br />
I have a template document for representing revise and resubmit responses.<br />
See my <a href="https://github.com/jeromyanglim/APAWordTempate">templates page on
github</a> and specifically see
the file <a href="https://github.com/jeromyanglim/APAWordTempate/blob/master/response-to-reviewers.dotx?raw=true">"response-to-reviewers.dotx"</a>.<br />
<br />
<h3>
Setting up the Response Document</h3>
The document has the following core styles:<br />
<ul>
<li>Heading 1: Divides up the major sections of the review (e.g., Editor, Reviewer 1, Reviewer 2)</li>
<li>Heading 2: Summary statement for each reviewer point</li>
<li>Reviewer Comment: Exact quote of a particular reviewer comment</li>
<li>Body text: For recording my response</li>
<li>Quote: For formatting quotes of specifically modified sections of the text</li>
</ul>
Step 1 is to paste the full text of the editor and reviewer comments into a new Word response document and apply the reviewer comment style.<br />
<br />
Step 2 is to set up the response document. Level 1 headings are added that divide up the reviewer sections.<br />
<br />
Reviewer comments are divided into discrete points.
The division of revision points may or may not be clear.
Some reviewers provide numbered points. Others provide a more narrative review
where each paragraph includes multiple points. Some points are interconnected
but involve distinct actions.<br />
For each point that is identified, I add a level 2 heading. The level 2 heading
includes an identifier and a brief summary statement of the requirement.
Identifiers are, for example, "R1.2", which would refer to Reviewer 1's second
point. In some cases, where there are connected points, you get "R1.2.1",
"R1.2.2" and so on.<br />
<br />
There are several benefits to using identifiers. In some cases, multiple
reviewers make the same point. Thus, you can quickly refer the reviewer to
another review point. E.g., "This point was addressed in reviewer point R1.2".
It can also be an efficient way of keeping track of reviewer points when you are
working through a large number of them.<br />
<br />
The summary statements are important. I aim to keep them short. Ideally they'll
fit on one line so that they are easy to quickly understand (i.e., around 50
characters). I try to make them commands. For example:<br />
<ul>
<li>Clarify unique contribution</li>
<li>Improve study motivation in introduction</li>
<li>Describe x more clearly</li>
<li>Add references to ...</li>
<li>Justify statistical method</li>
<li>Consider using ... method</li>
<li>Include ... Table 1</li>
</ul>
In some cases, the required action is not explicitly stated by the reviewer. For
example, if a reviewer critiques a methodological decision, there are various
possible actions, including justifying your choice, adding a limitation, and so
on.<br />
<br />
Benefits of the above approach<br />
<ul>
<li>Using formal headings in MS Word allows you to view a document map in the sidebar, which lets you quickly navigate between reviewer points.</li>
<li>Another benefit of the above process is that reviewer comments start to appear
more manageable. When you first receive a few pages of reviewer comments, it can
feel overwhelming. The above process begins to divide up each point into a more
manageable task. The act of providing a summary statement also forces you to
read and understand what action is required to respond to the reviewer comment.</li>
</ul>
<h3>
Record initial reflections</h3>
<div>
Above, I show how the first reading is used to parse reviewer comments into discrete points and give descriptive titles. In the second reading, I add comments to each reviewer point using the comment feature in the word processor. This is an opportunity to record some initial reflections on (a) how easy it will be to satisfy the revision, (b) whether a change to the manuscript is required, and (c) what should be done. After I've added these, I often circulate the response document to collaborators to allow them to add comments.</div>
<h3>
Sequencing the Revisions</h3>
The next task is to determine a sequence for working through the revisions.
This involves keeping track of which points still need to be addressed and
deciding on an order to work through the points.<br />
At a basic level, I place an asterisk at the start of each heading that has not
yet been addressed. This is removed once the point has been adequately
addressed.<br />
<br />
A more challenging issue is deciding on how to work through the
changes. Some changes are interdependent. However, major revisions often
have to be worked through first as they can have broader structural implications
for the manuscript.<br />
A few useful steps for thinking about sequencing include:<br />
<ul>
<li>Organise the points into categories</li>
<li>Read through each point, and make some tentative notes about what to do (e.g., using comments in Word).</li>
<li>Decide on an explicit sequence to work on the points. This often requires you
to brainstorm the pros and cons of working on one point versus another first.</li>
</ul>
In some cases, sequencing will raise some more meta-issues about the paper that
transcend any given review point. I mostly find it easiest to work through points in the following order: analyses, results, method, introduction, discussion. The rationale is that any new analyses that you run and incorporate into your paper will change your results. And these may further require changes to the method, which in turn influence the framing and discussion. Likewise, if the introduction is changed, this may have implications for how the discussion integrates topics raised in the introduction.<br />
<br />
Logistically, I generate a table of contents in MS Word. This lists all the reviewer point titles (i.e., the IDs and the titles such as "R1.1 Update method to include ..."). This works because all the reviewer points are formatted using heading styles. I then copy and paste this as plain text into a working document. These points are then organised thematically under headings and into an appropriate sequential order.<br />
<h3>
Addressing Revision Points</h3>
Once sequencing issues have been resolved, it is a matter of working through each
revision point. I have a few guiding principles:<br />
<ul>
<li>Write in a manner which focuses on the scientific issue.</li>
<li>Treat the reviewer with respect.</li>
<li>If a reviewer has misunderstood something in the manuscript, take responsibility
for making the manuscript clearer.</li>
</ul>
Another point is that the response document should be self-contained. Ideally,
the reviewer should not need to look at the actual manuscript to judge whether
you have effectively responded to their requested changes. This makes the experience of the
reviewer much more pleasant. From a strategic perspective, they may also be less
inclined to read through the entire manuscript again and come up with all new
concerns.<br />
<ul>
<li>If a table or figure is updated, then paste a screenshot of the updated table
or figure.</li>
<li>If a new paragraph has been added, include a copy of that paragraph.</li>
<li>If a sentence or two has been added to a paragraph, include a copy of the
whole paragraph and bold the section that has been added.</li>
<li>Only if the point is very basic is it sufficient to say, "this change was
made". Examples of this might be adding a reference, fixing up typos, and so
on.</li>
</ul>
Another useful strategy is to indicate new text in the manuscript with a different colour font (e.g., purple).<br />
<h3>
Collaborations and Revisions</h3>
It is often easiest if one person leads the revisions. The lead person can also allocate specific revision tasks to co-authors. There is the issue of how to synchronise the revisions in the manuscript with the response document. If the changes are particularly complex, or the collaborators are likely to make substantial additional changes to the manuscript, then it may be worth waiting a little before completing the response document. Alternatively, treat the response document as an initial draft to be returned to once the manuscript has been finalised.<br />
<br />
<h3>
Track Changes</h3>
<div>
Some journals require that you include a version of the manuscript with track changes. In other cases, it can just be a useful addition to the submission. If you are using MS Word, then the compare documents feature is ideal for generating this document. This feature allows you to anonymise the changes because you can label them with "Author" rather than your actual name.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTiZgOFN3AyKdUwX15o3hJe7pb6N6hRkFmXhj4osAPDxf7am6dwyqpO6SR00WaFM6m3vQ60Q732_vgzvlXCsO2xoj0WNl-rRalcDDz5urv59DqX4ydKxIMyTmEOKOnzDmBgqmr65vvmw/s1600/Screen+Shot+2017-04-11+at+5.17.25+pm.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTiZgOFN3AyKdUwX15o3hJe7pb6N6hRkFmXhj4osAPDxf7am6dwyqpO6SR00WaFM6m3vQ60Q732_vgzvlXCsO2xoj0WNl-rRalcDDz5urv59DqX4ydKxIMyTmEOKOnzDmBgqmr65vvmw/s320/Screen+Shot+2017-04-11+at+5.17.25+pm.png" width="320" /></a></div>
<div>
<br /></div>
<h1>Managing Timeframes from Initial Submission to Final Acceptance of Journal Articles in Psychology (2016-07-07)</h1>
This post discusses issues related to managing the timeframe from an initial submission to a journal through to final acceptance at that or another journal. These are personal notes that pertain to my experiences in psychology. I post them here in case they might be useful for others.<br />
<br />
<h1>
Timeframe</h1>
<h2>
Overview of time frame</h2>
The following are very rough rules of thumb for timelines for various decisions
along with ranges that cover the majority of cases.<br />
<ul>
<li>Submission decision
<ul>
<li>Desk reject: 1 month (0 - 2)</li>
<li>Sent out for review: 3 months (1 - 5)</li>
</ul>
</li>
<li>Preparing new submission
<ul>
<li>Make changes: 1 month (0 - 2)</li>
<li>Wait to find time (self or others), acquire more data, get new skills:
highly variable length</li>
</ul>
</li>
<li>Revisions:
<ul>
<li>Prepare revisions: 1 month (0 - 2)</li>
<li>Review of revisions: 2 months (0 - 4)</li>
</ul>
</li>
</ul>
Thus, a basic formula is:<br />
<pre><code>months = number_of_reviewed_submissions * 3 +
         number_of_desk_rejects * 1 +
         number_of_major_edits_on_new_submissions * 1 +
         3 (make revisions and get accepted) +
         gap_time (i.e., sum of all gaps)
</code></pre>
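As a rough illustration, this formula can be written as a small R function (the function and argument names here are mine, purely illustrative):<br />
<pre><code># rough sketch of the timeline formula above (names are illustrative)
estimate_months <- function(reviewed_submissions, desk_rejects = 0,
                            major_edits = 0, gap_time = 0) {
    reviewed_submissions * 3 +  # each reviewed submission: ~3 months
    desk_rejects * 1 +          # each desk reject: ~1 month
    major_edits * 1 +           # each round of major edits: ~1 month
    3 +                         # final revise-and-accept cycle
    gap_time                    # sum of all waiting periods
}
estimate_months(reviewed_submissions = 2, major_edits = 1)  # 10 months
</code></pre>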
Thus, in summary, desk rejects don't add a lot of time and gap time can be
avoided if the manuscript is high priority.<br />
<ul>
<li>Accepted on first journal review (Submit, Revise, Accept): 6 months</li>
<li>Accepted on second journal review: (Submit, New Revise, Submit, Revise, Accept) 10 months</li>
<li>Accepted on third journal review: 14 months</li>
</ul>
It is natural and appropriate to aim high on the first submission.
You also typically get good feedback that can be used to improve the
manuscript, although acting on it can take a bit of time.<br />
<h2>
Submission decision</h2>
<h3>
Desk rejection</h3>
Desk rejection occurs when an editor (chief or possibly action editor) reviews
the manuscript and decides that the paper is not worth sending out for review.
This commonly occurs when the editor thinks the paper is an inappropriate fit
for the journal. Alternatively, the editor may feel that it is not up to the
standard of the journal. This can be either on novelty-interest grounds or on
pure scientific grounds.<br />
Desk rejection is typically quick. However, you typically don't get a lot of
feedback about what is wrong with the manuscript. Sometimes
you will get some feedback about areas that can be improved. At the least, a
desk rejection can help to refine your understanding of what is on topic at
particular journals.<br />
<h3>
Sent out for review</h3>
If a paper is sent out for review, you will typically receive some detailed
feedback. Given the steps involved, it is not surprising that the process often takes around 3 or 4 months:<br />
<ul>
<li>Editor has to appraise the manuscript and determine if it should be sent out
for review (1-2 weeks)</li>
<li>Editor has to contact reviewers and get agreement to review manuscript (1-4
weeks)</li>
<li>Enough of the reviewers have to have completed their reviews (8 weeks is
common)</li>
<li>If insufficient reviews, then ask for more reviewers (can add another 8 weeks)</li>
<li>Editor needs to go through reviews and possibly add own review and make
decision (1 to 4 weeks)</li>
</ul>
A range of decisions can be provided but the two most common are (1) rejection
and (2) request for revisions (with a distinction between minor and major; and
sometimes between submitting revisions and resubmitting the manuscript).<br />
<h2>
Preparing new submission</h2>
If you get a desk rejection or a rejection after peer review, this is an
opportunity to revise the manuscript for a new submission.
I discuss strategies for doing this later, but from a time frame perspective,
there is (a) the time to make the changes, and (b) the time where the manuscript
is waiting to be attended to.<br />
<h2>
Revisions</h2>
Journals often have a deadline for submission of revisions (e.g., 2 months).
This ensures that revisions are prioritised.<br />
<h1>
Responding to rejection</h1>
When a submitted manuscript is rejected by a journal, the manuscript in some
sense returns to the status of a good draft.<br />
<h2>
Explicit reasons for rejection</h2>
<ul>
<li>It did not meet the requirements of the journal
<ul>
<li>e.g., cross-sectional self-report generally not published; not
a multi-study paper; topic not really of interest to the journal; they
don't publish student/non-clinical/non-industry samples</li>
</ul>
</li>
<li>Not important enough
<ul>
<li>It is common to be told that the manuscript is just not
interesting enough, often with a hint of what would be required to make
it more interesting</li>
<li>Examples: not novel enough; sample size not large or representative enough</li>
</ul>
</li>
<li>A list of substantive criticisms
<ul>
<li>Is the criticism valid?</li>
<li>Does it reflect a misunderstanding by the reviewer?</li>
<li>Can the criticism be addressed? If so, how easily?</li>
<li>Are there small errors of expression, unclear sentences, or typos?</li>
</ul>
</li>
</ul>
<h2>
Understanding the rejection</h2>
<ul>
<li>Obviously, substantive criticisms can be clear. But even with these, it is
useful to get a sense of which were the major reasons for rejection and
which are just suggestions for improvement.</li>
<li>Other times it is possible to read into the rejection. The rejection may
imply that the reviewers did not follow the argument, or did not see the
novelty, or focused too much on limitations.</li>
<li>Typically the reviewer will not state everything they object to.</li>
<li>One response can be to reframe the paper.</li>
</ul>
<h2>
Basic options</h2>
<ul>
<li>Appeal rejection</li>
<li>Discard manuscript</li>
<li>Submit manuscript elsewhere
<ul>
<li>no changes</li>
<li>minimal changes</li>
<li>substantive changes</li>
</ul>
</li>
</ul>
<h3>
Appeal rejection</h3>
In the majority of cases, appealing a rejection is a bad idea.
There are many journals out there on a given topic. If the paper is good, try
another one. Use the rejection to improve the manuscript.
There are also many reasons why an editor will reject. It rarely comes down to
a single issue that can be refuted. And if the contribution of the paper is not
clear, then that is the author's fault.<br />
<h3>
Discard manuscript</h3>
Discarding a manuscript is an option. A similar approach is just to put it on
hold because it has weaknesses and the effort required to fix them (if
they can be fixed) is not worth the likely payoff.<br />
With experience, it becomes easier to judge earlier in the project life cycle
where a paper might end up. If it clearly has flaws that cannot be fixed,
then drop the project early.<br />
<h3>
Submit manuscript elsewhere</h3>
This is the standard option.<br />
<h2>
Whether to make changes for resubmission?</h2>
<ul>
<li>Benefits of making changes
<ul>
<li>Systematically considering each change generally increases the chance of
future manuscript acceptance</li>
<li>Some reviewers see it as bad faith to not make changes identified in
the review process</li>
<li>Considering each change generally makes the paper better</li>
</ul>
</li>
<li>Benefits of not making changes
<ul>
<li>Making changes takes time</li>
<li>One school of thought is that if a journal is truly interested in the
work, then they will give you a "revise and resubmit" where the editor
and reviewers will have their own particular changes that they want
made. While there is some truth to this, I still think that systematically
working through reviews gets papers closer to something that is appealing
to reviewers and makes for a better paper.</li>
</ul>
</li>
</ul>
My general approach is to treat a resubmission to a new journal like a revise and resubmit.
I engage in the same process of responding to each point made by the reviewers.
The main difference is that you don't have to be as polite to the reviewers.<br />
<h2>
Additional references on responding to rejection</h2>
<ul>
<li><a href="http://expertedge.journalexperts.com/2014/02/24/your-paper-was-rejected-what-next/">ExpertEdge</a></li>
<li><a href="http://careers.ucsc.edu/grad/get_published.html">Santa Cruz Career Centre</a></li>
<li><a href="http://www.insidehighered.com/advice/2009/04/27/belcher">Inside HigherEd</a></li>
<li><a href="http://www.diabetologia-journal.org/rejection.html">Diabetologia</a></li>
<li><a href="https://medium.com/advice-and-help-in-authoring-a-phd-or-non-fiction/seven-upgrade-strategies-for-a-problematic-article-or-chapter-3c6b81be9aa2">Patrick Dunleavy</a></li>
</ul>
<h1>
Principles for minimising time to acceptance</h1>
<h2>
I am leading</h2>
It is first important to distinguish papers in terms of who is leading.
If you are leading a paper, then you have much more control over the following
things.<br />
<ul>
<li>Focus on core research area
<ul>
<li>This enables better journal selection</li>
<li>There are fewer gaps in the first submission.</li>
<li>The revisions are easier to write.</li>
</ul>
</li>
<li>Make initial submission strong
<ul>
<li>Don't pursue weak projects</li>
<li>Appraise potential fatal flaws early</li>
</ul>
</li>
<li>Select appropriate journals
<ul>
<li>Pick an appropriate level of prestige and impact; lower is easier and quicker,
but it is still important to aim high (perhaps if you think a paper has a
10% chance of being accepted in a great journal, it's probably worth a shot)</li>
</ul>
</li>
<li>Learn from rejection</li>
<li>Make revisions flawless: Getting a revise and resubmit is an excellent
opportunity. If you present a perfect and respectful response to every
reviewer comment, then the manuscript has a very good chance of being accepted.</li>
<li>Desk rejects don't take up much time</li>
</ul>
<h2>
Professional collaborator leading</h2>
The first rule is to pick good collaborators.
Good collaborators should know how to write a good paper, select an appropriate
journal, and be willing and able to persist with revisions and resubmissions in
order to find a home for the paper.<br />
Collaborations can also take you out of your core area, leaving you in
less of a position to make judgements about where the article should be sent or
how it should be reframed.<br />
<h2>
Student leading</h2>
When a student is leading, this creates particular challenges.
In general, there is a difference between doctoral students, who are learning to
be independent scholars, and students completing smaller thesis projects (for me this
includes fourth year and masters by course work projects).jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-20147107645008349902014-05-28T20:57:00.003+10:002017-07-12T10:13:21.939+10:00Customising ProjectTemplate in RThis post talks about my workflow for getting started with a new data analysis project using the <code>ProjectTemplate</code> package. <br />
<a name='more'></a><h3>Update (24th August 2016)</h3><div>Over the last two years, I have been refining this customised version of ProjectTemplate.</div><div>I have <a href="https://github.com/jeromyanglim/AnglimModifiedProjectTemplate">more detailed information about the latest version here</a>.<br />
<br />
Video at Melbourne R Users July 4th 2017<br />
<iframe width="560" height="315" src="https://www.youtube.com/embed/pKwXOo4Kkiw" frameborder="0" allowfullscreen></iframe><br />
</div><h3>Overview of ProjectTemplate</h3>ProjectTemplate is an R Package which facilitates data analysis, encourages good data analysis habits, and standardises many data analytic steps. After many years of refining a data analysis workflow in R, I realised that I'd basically converged on something similar to ProjectTemplate anyway. However, my approach was not quite as systematic, and it took more effort than necessary to get started on a new project. Thus, since late 2013, I've been using ProjectTemplate to organise my R data analysis projects.<br />
While I have found ProjectTemplate to be an excellent tool, I realised that when I created a new data analysis project based on ProjectTemplate, I was repeatedly making a large number of customisations to the initial set of files and folders. Thus, I've now set up a repository to store these customisations so that I can get started on a new data analysis project more efficiently. The purpose of this post is to document these modifications.<br />
This post assumes a reasonable knowledge of R and ProjectTemplate. If you're not familiar with ProjectTemplate, you could check out the <a href="http://projecttemplate.net/">ProjectTemplate website</a> focusing particularly on the <a href="http://projecttemplate.net/getting_started.html">Getting Started section</a>. If you're really keen you could also watch an hour long <a href="https://www.youtube.com/watch?v=I9YNIi-QmR0">video on ProjectTemplate, RStudio, and GitHub</a><br />
<h3>General setup</h3>I have a copy of <a href="https://github.com/jeromyanglim/AnglimModifiedProjectTemplate">my customised version of the ProjectTemplate directory and file structure on github in the AnglimModifiedProjectTemplate repository</a>. Specifically, it has:<br />
<ol><li>Modifications to <code>global.dcf</code> as described below, </li>
<li>a blank <code>readme.md</code> </li>
<li>a couple of directories removed that I don't use (e.g., <code>diagnostics</code>, <code>logs</code>, <code>profiling</code>)</li>
<li>an initial <code>rmd</code> file with the customisations mentioned below in the <code>reports</code> directory</li>
<li>An <code>.Rproj</code> RStudio project file to enable easy launching of RStudio. </li>
<li>An additional <code>output</code> directory for storing tabular, text, and other output</li>
</ol>Thus, whenever I want to start a new data analysis project I can download and extract the <a href="https://github.com/jeromyanglim/AnglimModifiedProjectTemplate/archive/master.zip">zip file of the repository on github</a>.<br />
After creating a project folder, the following steps can then be skipped when using my customised template.<br />
<ul><li>Open RStudio and create RStudio Project in existing directory</li>
<li>Create <code>ProjectTemplate</code> folder structure with <code>library(ProjectTemplate); create.project()</code></li>
<li>Move ProjectTemplate files into folder</li>
<li>Modify <code>global.dcf</code></li>
<li>Setup rmd reports</li>
</ul>I also document below a few additional points about subsequent steps including:<br />
<ul><li>Setting up the data directory</li>
<li>Updating the readme file</li>
<li>Setting up git repository</li>
</ul><h3>Modifying global.dcf</h3>My preferred starting <code>global.dcf</code> settings are<br />
<pre><code>data_loading: on
cache_loading: off
munging: on
logging: off
load_libraries: on
libraries: psych, lattice, Hmisc
as_factors: off
data_tables: off
</code></pre>A little explanation:<br />
<ul><li><code>as_factors:</code> I do quite a bit of string processing, particularly on meta data and on output tables. I find the automatic conversion of strings into factors to be a really annoying feature. Thus, setting this to <code>off</code> is my preferred setting.</li>
<li><code>load_libraries:</code> I always have additional libraries so it makes sense to have this <code>on</code>. </li>
<li><code>libraries:</code> There are many common packages that I use, but I almost always make use of the above comma-separated list of packages.</li>
</ul><h3>Setup rmd files</h3><h4>Basics of such files</h4>The first line in the first chunk is always:<br />
<pre><code>```{r}
library(ProjectTemplate); load.project()
```
</code></pre>This loads everything required to get started with the project. <br />
<h4>Setup data folder</h4>ProjectTemplate automatically names resulting data.frames with a name based on the file name. This is convenient. However, it is often the case that the file names need to be changed from some raw data supplied or it may be that the original data format is not perfectly suited for importing. In that case, I store the raw data in a separate folder called <code>raw-data</code> and then export or create a copy in the desired format with the desired name in the <code>data</code> folder.<br />
<h4>Overriding default data import options</h4>Some data files cannot be imported using the default data import rules. Of course, you can change the file to comply with the rules. Alternatively, I think the standard solution is to add a file in the <code>lib</code> directory (e.g., <code>data-override.r</code>) that imports the data files. Give the imported data file the same name that ProjectTemplate would.<br />
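As a minimal sketch (the file name and import options here are hypothetical), such a file might look like this:<br />
<pre><code># lib/data-override.r (hypothetical example)
# import a semicolon-delimited file that the default rules mishandle,
# giving the data.frame the name ProjectTemplate would have used
survey <- read.csv("data/survey.csv", sep = ";",
    stringsAsFactors = FALSE)
</code></pre>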
<h3>Update readme</h3>I rename the file to README.md to make it clear that it is a markdown-formatted file. I can then add a little information about the project.<br />
<h3>Setup git repository</h3>If using github, I create a new repository on github. <br />
<h3>Output folder</h3>A common workflow for me is to generate tables, text, and figure output from the script which is then incorporated into a manuscript document. While I really like Sweave and RMarkdown, I often find it more practical to write a manuscript in Microsoft Word. I use the <code>output</code> folder to store tabular output, standard text output, and figures.<br />
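As a small illustration (the file name is hypothetical, and the <code>output</code> directory is assumed to exist), tabular output can be written out like this:<br />
<pre><code># illustrative: write a rounded correlation table to the output folder
correlations <- round(cor(mtcars[, c("mpg", "hp", "wt")]), 2)
write.csv(correlations, file = "output/correlation-table.csv")
</code></pre>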
In the case of tabular output, there is the task of ensuring the table is formatted appropriately (e.g., desired number of decimal places, cell alignment, cell borders, font, cell merging, etc.). I typically find this easiest to do in Excel. Thus, I have a file called <code>output-processing.xlsx</code>. I import the tabular data into this file and apply relevant formatting. This can then be incorporated into the manuscript. <a href="http://jeromyanglim.blogspot.com.au/2009/09/formatting-table-in-word-r-to-tab.html">Here are a few more notes about Table conversion in MS Word</a>.jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-51538873055800558412013-12-04T17:22:00.002+11:002017-04-10T10:52:39.035+10:00Using R to replicate common SPSS multiple regression output The following post replicates some of the standard output you might get from a multiple regression analysis in SPSS. A copy of the <a href="https://github.com/jeromyanglim/rcars">code in RMarkdown format is available on github</a>. The post was motivated by <a href="http://jeromyanglim.blogspot.com.au/2013/07/evaluating-potential-incorporation-of-r.html">this previous post that discussed using R to teach psychology students statistics</a>. <br />
<a name='more'></a><pre><code class="r">library(foreign) # read.spss
library(psych) # describe
library(Hmisc) # rcorr
library(QuantPsyc) # lm.beta
library(car) # vif, durbinWatsonTest
library(MASS) # studres
library(lmSupport) #lm.sumSquares
library(perturb) # colldiag
</code></pre>
In order to emulate SPSS output, it is necessary to install several add-on packages. The above <code>library</code> commands load the packages into your R workspace. I've highlighted in the comments the names of the functions that are used in this script. <br />
You may not have the above packages installed.
If not, run commands like:<br />
<ul>
<li><code>install.packages('foreign')</code></li>
<li><code>install.packages('psych')</code></li>
<li>etc.</li>
</ul>
for each of the above packages not installed or use the “packages” tab in RStudio to install.<br />
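Alternatively, here is a small convenience sketch (not part of the original script) that installs whichever of the required packages are missing:<br />
<pre><code class="r"># install any of the required packages that are not yet installed
pkgs <- c("foreign", "psych", "Hmisc", "QuantPsyc",
    "car", "MASS", "lmSupport", "perturb")
install.packages(setdiff(pkgs, rownames(installed.packages())))
</code></pre>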
Note also that much of this analysis could be performed with <a href="http://www.rcommander.com/">Rcommander</a>, which provides a more SPSS-style GUI environment.<br />
<h1>
Import and prepare data</h1>
<pre><code class="r">cars_raw <- read.spss("cars.sav", to.data.frame = TRUE)
# get rid of missing data listwise
cars <- na.omit(cars_raw[, c("accel", "mpg", "engine", "horse", "weight")])
</code></pre>
Ensure that <code>cars.sav</code> is in the working directory.<br />
<h1>
Quick look at data</h1>
<pre><code class="r"># note the need to deal with missing data
psych::describe(cars_raw)
</code></pre>
<pre><code>## var n mean sd median trimmed mad min max
## mpg 1 398 23.51 7.82 23.00 23.06 8.90 9.00 46.60
## engine 2 406 194.04 105.21 148.50 183.75 86.73 4.00 455.00
## horse 3 400 104.83 38.52 95.00 100.36 29.65 46.00 230.00
## weight 4 406 2969.56 849.83 2811.00 2913.97 947.38 732.00 5140.00
## accel 5 406 15.50 2.82 15.50 15.45 2.59 8.00 24.80
## year* 6 405 6.94 3.74 7.00 6.93 4.45 1.00 13.00
## origin* 7 405 1.57 0.80 1.00 1.46 0.00 1.00 3.00
## cylinder* 8 405 3.20 1.33 2.00 3.14 0.00 1.00 5.00
## filter_.* 9 398 1.73 0.44 2.00 1.79 0.00 1.00 2.00
## weightKG 10 406 1346.97 385.48 1275.05 1321.75 429.72 332.03 2331.46
## engineLitre 11 406 3.19 1.73 2.44 3.02 1.42 0.07 7.47
## range skew kurtosis se
## mpg 37.60 0.45 -0.53 0.39
## engine 451.00 0.69 -0.81 5.22
## horse 184.00 1.04 0.55 1.93
## weight 4408.00 0.46 -0.77 42.18
## accel 16.80 0.21 0.35 0.14
## year* 12.00 0.02 -1.21 0.19
## origin* 2.00 0.92 -0.81 0.04
## cylinder* 4.00 0.27 -1.69 0.07
## filter_.* 1.00 -1.04 -0.92 0.02
## weightKG 1999.43 0.46 -0.77 19.13
## engineLitre 7.41 0.69 -0.81 0.09
</code></pre>
<pre><code class="r">
dim(cars)
</code></pre>
<pre><code>## [1] 392 5
</code></pre>
<pre><code class="r">head(cars)
</code></pre>
<pre><code>## accel mpg engine horse weight
## 1 12.0 18 307 130 3504
## 2 11.5 15 350 165 3693
## 3 11.0 18 318 150 3436
## 4 12.0 16 304 150 3433
## 5 10.5 17 302 140 3449
## 6 10.0 15 429 198 4341
</code></pre>
<pre><code class="r">str(cars)
</code></pre>
<pre><code>## 'data.frame': 392 obs. of 5 variables:
## $ accel : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ engine: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horse : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight: num 3504 3693 3436 3433 3449 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:14] 11 12 13 14 15 18 39 40 134 338 ...
## .. ..- attr(*, "names")= chr [1:14] "11" "12" "13" "14" ...
</code></pre>
<h1>
Fit model</h1>
<pre><code class="r">fit <- lm(accel ~ mpg + engine + horse + weight, data = cars)
</code></pre>
<h2>
Descriptive Statistics</h2>
<pre><code class="r"># Descriptive statistics
psych::describe(cars)
</code></pre>
<pre><code>## var n mean sd median trimmed mad min max range
## accel 1 392 15.52 2.78 15.50 15.46 2.52 8 24.8 16.8
## mpg 2 392 23.45 7.81 22.75 22.99 8.60 9 46.6 37.6
## engine 3 392 193.65 104.94 148.50 183.15 86.73 4 455.0 451.0
## horse 4 392 104.21 38.23 93.00 99.61 28.17 46 230.0 184.0
## weight 5 392 2967.38 852.29 2797.50 2909.64 945.90 732 5140.0 4408.0
## skew kurtosis se
## accel 0.27 0.43 0.14
## mpg 0.45 -0.54 0.39
## engine 0.69 -0.77 5.30
## horse 1.09 0.71 1.93
## weight 0.48 -0.76 43.05
</code></pre>
<pre><code class="r">
# correlations
cor(cars)
</code></pre>
<pre><code>## accel mpg engine horse weight
## accel 1.0000 0.4375 -0.5298 -0.6936 -0.4013
## mpg 0.4375 1.0000 -0.7893 -0.7713 -0.8072
## engine -0.5298 -0.7893 1.0000 0.8959 0.9339
## horse -0.6936 -0.7713 0.8959 1.0000 0.8572
## weight -0.4013 -0.8072 0.9339 0.8572 1.0000
</code></pre>
<pre><code class="r">rcorr(as.matrix(cars)) # include sig test for all correlations
</code></pre>
<pre><code>## accel mpg engine horse weight
## accel 1.00 0.44 -0.53 -0.69 -0.40
## mpg 0.44 1.00 -0.79 -0.77 -0.81
## engine -0.53 -0.79 1.00 0.90 0.93
## horse -0.69 -0.77 0.90 1.00 0.86
## weight -0.40 -0.81 0.93 0.86 1.00
##
## n= 392
##
##
## P
## accel mpg engine horse weight
## accel 0 0 0 0
## mpg 0 0 0 0
## engine 0 0 0 0
## horse 0 0 0 0
## weight 0 0 0 0
</code></pre>
<pre><code class="r"># scatterplot matrix if you want
pairs.panels(cars)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBygm_WP_jDnBQhIpNjcAFBQeUN2bEjNPWMyEzKAmmk8eICmAHbc7HjSLuRyo-fpq7WbbkalNz-bjBtfwWvgf8Pv1oOI_7nuvwoj-mUEOeD8MXdqSFvlHmPEZxFtwn9HC02c4wn_nbMw/s1600/Screen+Shot+2017-04-10+at+10.49.53+am.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBygm_WP_jDnBQhIpNjcAFBQeUN2bEjNPWMyEzKAmmk8eICmAHbc7HjSLuRyo-fpq7WbbkalNz-bjBtfwWvgf8Pv1oOI_7nuvwoj-mUEOeD8MXdqSFvlHmPEZxFtwn9HC02c4wn_nbMw/s320/Screen+Shot+2017-04-10+at+10.49.53+am.png" width="320" /></a></div>
<br />
<h2>
Summary of model</h2>
<pre><code class="r"># r-square, adjusted r-square, std. error of estimate, overall ANOVA, df, p,
# unstandardised coefficients, sig tests
summary(fit)
</code></pre>
<pre><code>##
## Call:
## lm(formula = accel ~ mpg + engine + horse + weight, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.177 -1.023 -0.184 0.936 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.980778 0.977425 17.37 <2e-16 ***
## mpg 0.007476 0.019298 0.39 0.6987
## engine -0.008230 0.002674 -3.08 0.0022 **
## horse -0.087169 0.005204 -16.75 <2e-16 ***
## weight 0.003046 0.000297 10.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.7 on 387 degrees of freedom
## Multiple R-squared: 0.631, Adjusted R-squared: 0.627
## F-statistic: 166 on 4 and 387 DF, p-value: <2e-16
</code></pre>
<pre><code class="r">### additional info in terms of sums of squares
anova(fit)
</code></pre>
<pre><code>## Analysis of Variance Table
##
## Response: accel
## Df Sum Sq Mean Sq F value Pr(>F)
## mpg 1 577 577 200.8 <2e-16 ***
## engine 1 272 272 94.7 <2e-16 ***
## horse 1 753 753 261.8 <2e-16 ***
## weight 1 302 302 104.9 <2e-16 ***
## Residuals 387 1113 3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
</code></pre>
<pre><code class="r">
# 95% confidence intervals (defaults to 95%)
confint(fit)
</code></pre>
<pre><code>## 2.5 % 97.5 %
## (Intercept) 15.059049 18.902506
## mpg -0.030466 0.045418
## engine -0.013488 -0.002972
## horse -0.097401 -0.076938
## weight 0.002461 0.003630
</code></pre>
<pre><code class="r"># but can specify different confidence intervals
confint(fit, level = 0.99)
</code></pre>
<pre><code>## 0.5 % 99.5 %
## (Intercept) 14.450621 19.510934
## mpg -0.042478 0.057430
## engine -0.015153 -0.001308
## horse -0.100641 -0.073698
## weight 0.002276 0.003816
</code></pre>
<pre><code class="r">
# standardised coefficients
lm.beta(fit)
</code></pre>
<pre><code>## mpg engine horse weight
## 0.02101 -0.31093 -1.19988 0.93456
</code></pre>
<pre><code class="r">
# or you could do it manually
zcars <- data.frame(scale(cars)) # make all variables z-scores
zfit <- lm(accel ~ mpg + engine + horse + weight, data = zcars)
coef(zfit)[-1]
</code></pre>
<pre><code>## mpg engine horse weight
## 0.02101 -0.31093 -1.19988 0.93456
</code></pre>
<pre><code class="r">
# correlations: zero-order, semi-partial, partial; an obscure function
# seems to do it
sqrt(lm.sumSquares(fit)[, c(2, 3)])
</code></pre>
<pre><code>## dR-sqr pEta-sqr
## (Intercept) 0.53638 0.6620
## mpg 0.01000 0.0200
## engine 0.09487 0.1546
## horse 0.51711 0.6483
## weight 0.31623 0.4617
## Error (SSE) NA NA
## Total (SST) NA NA
</code></pre>
<pre><code class="r">
# or use own function
cor_lm <- function(fit) {
    # computes zero-order, partial, and semi-partial correlations
    # for each predictor in a fitted lm model
    dv <- names(fit$model)[1]
    dv_data <- fit$model[, dv]
    ivs <- names(fit$model)[-1]
    iv_data <- fit$model[, ivs]
    x <- fit$model
    # for each IV, regress the DV on the remaining IVs and keep the residuals
    x_omit <- lapply(ivs, function(X) x[, c(dv, setdiff(ivs, X))])
    names(x_omit) <- ivs
    fits_omit <- lapply(x_omit, function(X) lm(as.formula(paste(dv, "~ .")),
        data = X))
    resid_omit <- sapply(fits_omit, resid)
    # for each IV, regress it on the remaining IVs and keep the residuals
    iv_omit <- lapply(ivs, function(X) lm(as.formula(paste(X, "~ .")), data = iv_data))
    resid_iv_omit <- sapply(iv_omit, resid)
    # correlate raw and residualised variables to obtain each statistic
    results <- sapply(seq(ivs), function(i) c(zeroorder = cor(iv_data[, i],
        dv_data), partial = cor(resid_iv_omit[, i], resid_omit[, i]),
        semipartial = cor(resid_iv_omit[, i], dv_data)))
    results <- data.frame(results)
    names(results) <- ivs
    results <- data.frame(t(results))
    results
}
round(cor_lm(fit), 3)
</code></pre>
<pre><code>## zeroorder partial semipartial
## mpg 0.438 0.020 0.012
## engine -0.530 -0.155 -0.095
## horse -0.694 -0.648 -0.517
## weight -0.401 0.462 0.316
</code></pre>
<h2>
Assumption testing</h2>
<pre><code class="r"># Durbin Watson test
durbinWatsonTest(fit)
</code></pre>
<pre><code>## lag Autocorrelation D-W Statistic p-value
## 1 0.136 1.721 0.004
## Alternative hypothesis: rho != 0
</code></pre>
<pre><code class="r">
# vif
vif(fit)
</code></pre>
<pre><code>## mpg engine horse weight
## 3.085 10.709 5.383 8.736
</code></pre>
<pre><code class="r">
# tolerance
1/vif(fit)
</code></pre>
<pre><code>## mpg engine horse weight
## 0.32415 0.09338 0.18576 0.11447
</code></pre>
<pre><code class="r">
# collinearity diagnostics
colldiag(fit)
</code></pre>
<pre><code>## Condition
## Index Variance Decomposition Proportions
## intercept mpg engine horse weight
## 1 1.000 0.000 0.001 0.001 0.001 0.000
## 2 3.623 0.002 0.051 0.016 0.005 0.001
## 3 16.214 0.006 0.066 0.365 0.763 0.019
## 4 18.519 0.127 0.431 0.243 0.152 0.227
## 5 32.892 0.865 0.451 0.375 0.079 0.753
</code></pre>
<pre><code class="r">
# residual statistics
rfit <- data.frame(predicted = predict(fit), residuals = resid(fit), studentised_residuals = studres(fit))
psych::describe(rfit)
</code></pre>
<pre><code>## var n mean sd median trimmed mad min max
## predicted 1 392 15.52 2.21 16.11 15.80 1.40 3.13 20.06
## residuals 2 392 0.00 1.69 -0.18 -0.11 1.39 -4.18 6.87
## studentised_residuals 3 392 0.00 1.01 -0.11 -0.07 0.82 -2.49 4.47
## range skew kurtosis se
## predicted 16.93 -1.61 4.10 0.11
## residuals 11.05 0.75 1.10 0.09
## studentised_residuals 6.95 0.81 1.38 0.05
</code></pre>
<pre><code class="r">
# distribution of standardised residuals
zresid <- scale(resid(fit))
hist(zresid)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivt2kFCtWTkTYMYC653DgGgOnT5HFRFUO0f33nFhCOaJIMsLtIC8FAjOzolFE2_6fRB9kma3theKoPD37AiTuHZJpvfU1ylL2wy1h_YgOcGj4D_p9DSJrwIi1mnfxeKamw3fU5G-9n2Q/s1600/Screen+Shot+2017-04-10+at+10.50.18+am.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivt2kFCtWTkTYMYC653DgGgOnT5HFRFUO0f33nFhCOaJIMsLtIC8FAjOzolFE2_6fRB9kma3theKoPD37AiTuHZJpvfU1ylL2wy1h_YgOcGj4D_p9DSJrwIi1mnfxeKamw3fU5G-9n2Q/s320/Screen+Shot+2017-04-10+at+10.50.18+am.png" width="318" /></a></div>
<br />
<br />
<pre><code class="r"># or add normal curve http://www.statmethods.net/graphs/density.html
hist_with_normal_curve <- function(x, breaks = 24) {
    h <- hist(x, breaks = breaks, col = "lightblue")
    xfit <- seq(min(x), max(x), length = 40)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
    yfit <- yfit * diff(h$mids[1:2]) * length(x)  # rescale density to counts
    lines(xfit, yfit, lwd = 2)
}
hist_with_normal_curve(zresid)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFyjyTdpIVh7LOETOH_O2GIZg6Ejk1AjgxAuVegUhBHO8MqjJVD8qC_E3tyPGq5VBhQ1IA7-gLdu0vZgFmHCzChq6kb1l9R1tfOgYCgZb_Tvz_dQy5WRAjBMwXRGHVUdrDUUNCSgC-xA/s1600/Screen+Shot+2017-04-10+at+10.50.46+am.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFyjyTdpIVh7LOETOH_O2GIZg6Ejk1AjgxAuVegUhBHO8MqjJVD8qC_E3tyPGq5VBhQ1IA7-gLdu0vZgFmHCzChq6kb1l9R1tfOgYCgZb_Tvz_dQy5WRAjBMwXRGHVUdrDUUNCSgC-xA/s320/Screen+Shot+2017-04-10+at+10.50.46+am.png" width="320" /></a></div>
<br />
<pre><code class="r">
# normality of residuals
qqnorm(zresid)
abline(a = 0, b = 1)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW4DjqQjZHHp9bXkUNuQu0uz8LKJkJpVVi73AsoTo5Oj7CWYNylResC_0u7ueIhwYboMlXOcln_0yqXPUW52QCxtJ5YwtIcDOQ-sWCDUv9wBWAaIHzK6PawFPfmBH7XbHuL_Bzniws3Q/s1600/download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW4DjqQjZHHp9bXkUNuQu0uz8LKJkJpVVi73AsoTo5Oj7CWYNylResC_0u7ueIhwYboMlXOcln_0yqXPUW52QCxtJ5YwtIcDOQ-sWCDUv9wBWAaIHzK6PawFPfmBH7XbHuL_Bzniws3Q/s320/download.png" width="320" /></a></div>
<br />
<pre><code class="r">
# plot predicted by residual
plot(predict(fit), resid(fit))
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD7FkC_RqnAA2acceX2U2dYliccjTJKQ_HWMe6BlVcZZZilOt421vZQ3g9K_CClYPSf6YoTyNs1oSLFQPNd3zjUsFtaRBNM5302sZ_zeya4b_QZ2DkqF1YCrEQAGWZIWHyf9k5wVWtaQ/s1600/nqqplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD7FkC_RqnAA2acceX2U2dYliccjTJKQ_HWMe6BlVcZZZilOt421vZQ3g9K_CClYPSf6YoTyNs1oSLFQPNd3zjUsFtaRBNM5302sZ_zeya4b_QZ2DkqF1YCrEQAGWZIWHyf9k5wVWtaQ/s320/nqqplot.png" width="320" /></a></div>
<br />
<pre><code class="r">
# plot dependent by residual
plot(cars$accel, resid(fit))
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXiEXcAqbtAPpgwlI359CzpJnwLJ6laB6Ks-v71KPG1YELqlwYlfHlC2_WHiCRjkZV_r0AKYpvoI_7IgH87ErGJybWOFliJWDcOfYcFwQ2u044C2tmrO2508dJedgAfDjYiTlC8dCzxg/s1600/plotdv.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXiEXcAqbtAPpgwlI359CzpJnwLJ6laB6Ks-v71KPG1YELqlwYlfHlC2_WHiCRjkZV_r0AKYpvoI_7IgH87ErGJybWOFliJWDcOfYcFwQ2u044C2tmrO2508dJedgAfDjYiTlC8dCzxg/s320/plotdv.png" width="320" /></a></div>
<br />jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-42571798100434844742013-11-19T14:46:00.000+11:002013-11-19T15:23:35.666+11:00Writing a Concise Introduction to a Psychology Journal Article: An Article Deconstruction<p>I often talk about article deconstruction as a useful method for extracting
principles for writing journal articles. The following is an <a href="http://jeromyanglim.blogspot.com.au/2009/09/introduction-to-journal-article.html">article
deconstruction</a>
of the introduction section of Fujita and Diener (2005). The writing principles
extracted may be relevant to others writing introductions to journal articles in
psychology.</p>
<a name='more'></a>
<h1>The article</h1>
<p>The article analyses longitudinal data on the stability of life
satisfaction. Both authors and especially Ed Diener are major figures in
well-being research. As I was reading the article, I found the writing style to
be particularly engaging. Thus, I thought it would be a good article to
deconstruct in order to identify relevant writing principles.
<a href="http://academic.udayton.edu/jackbauer/Readings%20595/Fujita%2005%20happy%20set%20point%20copy.pdf">Here is a PDF of the
article</a></p>
<blockquote><p>Fujita, F., & Diener, E. (2005). Life satisfaction set point: Stability and
change. Journal of Personality and Social Psychology, 88(1), 158.</p></blockquote>
<h1>Paragraph Descriptions</h1>
<p>This section analyses each paragraph of the introduction to extract writing
principles.</p>
<h2>1. Overview of study</h2>
<ul>
<li>Description:
<ul>
<li>The opening sentence of the first paragraph states the purpose of the
journal article. i.e., "The purpose of this study was ..."</li>
<li>The second sentence elaborates briefly on the core concept</li>
<li>The third sentence states the empirical method used in the study.</li>
</ul>
</li>
<li>Analysis:
<ul>
<li>This structure really gets to the point quickly about what the study
is about both theoretically and methodologically.</li>
<li>Explicit discussion of the importance of the topic is delayed until
Paragraph 2. This differs from some other papers which commence with
a more general opening paragraph that alludes to the importance of
the problem. It also differs from the hourglass model of lab report
writing (which I dislike quite a bit) which often does not get to the
point of the paper until the final paragraph.</li>
</ul>
</li>
</ul>
<h2>2. Importance of research question</h2>
<ul>
<li>Description:
<ul>
<li>The paragraph is concerned with the importance of the research question.</li>
<li>Theoretical argument for importance: essential to understanding
relationship between core variables in conceptual space.</li>
<li>Applied argument: relevant to societal interventions to improve
people's lives</li>
</ul>
</li>
<li>Analysis
<ul>
<li>I often think of framing the importance in terms of overcoming a gap in
the literature. Instead, this paragraph focuses on the importance of the
research question in absolute terms. Importance of the study is reserved
for later after a critique of the literature is presented. This approach
overcomes some of the challenge of trying to fit too much in the opening
few paragraphs (i.e., trying to summarise a complex critique of the
literature in order to justify the research).</li>
</ul>
</li>
</ul>
<h2>3. Historical development of set-point theory</h2>
<ul>
<li>Description:
<ul>
<li>The first sentence highlights the literature on the broader topic (i.e.,
SWB) and uses a book length treatment as a general citation.</li>
<li>Subsequent sentences present the contributions of several key authors
who have discussed the core idea of the paper. "Headey and Wearing ...
proposed ...", etc.</li>
<li>There are also elaboration sentences.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>The paragraph starts with the general research context, and then
immediately moves to present historical development of the core idea.</li>
<li>One of the challenges of writing a literature review is working out what
to cover and how much general introduction should be provided. In this
case, only a single sentence is provided to link with the general
literature. So, for example, the paper does not provide general
definitions of life satisfaction or subjective well-being.</li>
</ul>
</li>
</ul>
<h2>4. The evidence for stability</h2>
<ul>
<li><p>Description</p>
<ul>
<li>It starts with a general claim: "lines of data suggest ...".</li>
<li>It then presents two pieces of evidence: I.e., "First, ..."; "Second, ..."</li>
<li>In relation to the first point made, there are two sentences. The first
sentence states a general empirical relationship (e.g., "X is related to
Y"). The second sentence provides an illustrative example of a study that
shows the relationship and reports the specific findings. "For example,
Eid and Diener (2004) found ..."</li>
</ul>
</li>
<li><p>Analysis</p>
<ul>
<li>This paragraph links closely with the next one. The paper is about
stability and change. Thus, devoting a paragraph each to stability and
change provides a way of presenting both perspectives. This model would
work in many literature reviews where two contrasting ideas are presented.</li>
<li>The use of the general claim followed by an illustrative study is useful
for making discussion of the literature more concrete.</li>
</ul>
</li>
</ul>
<h2>5. The evidence for change</h2>
<ul>
<li>Description:
<ul>
<li>The first sentence links with the previous paragraph and states the topic
of the paragraph. E.g., "Despite evidence for <code>{claim in previous
paragraph}</code>... there are also indications that <code>{alternative claim
elaborated in this paragraph}</code>..."</li>
<li>Then a series of findings are presented from the literature. Various
transitional words are used to link ideas: "There are ..."; "Also, ...";
"Further, ..."; "Thus, ..."</li>
</ul>
</li>
</ul>
<h2>6. Critique of existing studies</h2>
<ul>
<li>Description:
<ul>
<li>The opening paragraph acknowledges that the central claim made in the
paper has been made before, but that the literature has not yet "fully
explored <code>{the idea}</code>".</li>
<li>Then three major critiques are presented using the linking words: "For one thing ...";
"Furthermore, ..."; "Another limitation ..."</li>
</ul>
</li>
<li>Analysis
<ul>
<li>This paragraph serves to highlight the gap in the literature and
justify the present study.</li>
<li>Interestingly, the critique does not cite any particular studies. Rather
it just states the limitations of the existing literature.</li>
<li>The points made in the critique constitute limitations of previous
research and not fundamental flaws.</li>
<li>Both not citing specific studies and framing issues as limitations helps
to create a positive respectful tone while at the same time justifying the
current study.</li>
<li>Given that the idea itself is not new, this study provides a good example
of showing how the evidence and the methodology can be used to argue for
the unique contribution.</li>
</ul>
</li>
</ul>
<h2>7. Description of current study</h2>
<ul>
<li>Description
<ul>
<li>The paragraph describes the method and sample.</li>
<li>It is stated how this sample helps answer the research question. I.e., "By
using <code>{aspect of method in current study}</code>, we overcome one of the
limitations of past research <code>{state limitation}</code>.</li>
<li>It is stated how the method/design helps answer the research question.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>The benefits of the method of the current study are summarised and
contrasted with the previous literature. Thus, the paragraph also
highlights the gap in the previous literature and the importance of the
current study.</li>
</ul>
</li>
</ul>
<h2>8. Hypotheses</h2>
<ul>
<li>Description
<ul>
<li>Two hypotheses are presented.</li>
<li>Assorted justification for hypotheses are interspersed:
<ul>
<li>"on the basis of past findings"</li>
<li>"because ..."</li>
<li>whole sentences of the form: "people are likely ..."</li>
</ul>
</li>
</ul>
</li>
<li>Analysis
<ul>
<li>The words "hypothesized" and "predicted" seem to be used interchangeably.</li>
<li>The hypotheses are expressed in a fairly verbal manner, still leaving some
scope for how they will be converted into a numeric test.</li>
<li>In contrast to some studies, hypotheses are not numbered in any formal way.
This creates a more flexible, concise and informal approach to specifying
expectations. In many respects I prefer this given that ultimately the
analyses are the evidence.</li>
</ul>
</li>
</ul>
<h2>9. Additional questions examined</h2>
<ul>
<li>Description
<ul>
<li>This paragraph introduces additional, ancillary research questions.</li>
<li>The first sentence links the main research question to the additional
research questions: "In addition to <code>{main research question}</code>, we
examined some related questions."</li>
<li>The remaining sentences are: (2) description of first additional question; (3)
basic expectations and justification for first additional question; (4)
description of second additional question.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>This provides an interesting approach for how to handle additional
questions that will be addressed by the analyses in a study. There may not
be space to address the full literature on these additional questions.
Even if space was available, extensive discussion of the literature on the
additional questions may distract the reader from the core research
question.</li>
<li>In general the paragraph frames and justifies the questions as being
related to the primary question, but not much time is spent discussing the
specific issues.</li>
</ul>
</li>
</ul>
<h1>General reflection on the introduction:</h1>
<h2>Focus and length</h2>
<ul>
<li>It is a relatively short and focussed introduction. Nine paragraphs is not
long. It's about one APA journal page of double column text. Only about three
paragraphs represent a literature review with references and the like, plus
one more for a critique.</li>
<li>In its focus it implies that certain topics don't need to be discussed. It
is assumed that the reader is familiar with them, or at least that it
would be distracting from the goals of the paper to have to address
such issues. For example:
<ul>
<li>It does not define SWB or discuss the relationship between SWB and
personality.</li>
<li>It does not attempt to be a complete review of all studies that have
been conducted on the stability or change of SWB.</li>
</ul>
</li>
<li>There are no subheadings. This makes sense given the length. However, it could
easily have had two subheadings, one for the literature review and one for the
current study.</li>
</ul>
<h2>Citation practice</h2>
<p>In a previous post I describe the importance of <a href="http://jeromyanglim.blogspot.com.au/2009/09/how-to-write-literature-review-in.html">making the link between
citations and argument clear</a>.
So for example, it should be clear whether any claim made is a proposal, empirical finding, or something else.
The paper uses the following strategies:</p>
<ul>
<li>Specific words showing link
<ul>
<li>"<code>{Author}</code> proposed"; "<code>{Author}</code> found"; "<code>{Author}</code> suggested"</li>
<li>"Advanced by <code>{one author and colleagues}</code> <code>{insert multiple references}</code>"</li>
</ul>
</li>
<li>Implied that the reference provides empirical support for the claim asserted:
<ul>
<li>"research indicates that <code>{finding}</code> <code>{reference}</code>";</li>
<li>"There are <code>{finding}</code> <code>{reference}</code>";</li>
</ul>
</li>
<li>Illustrative examples with implied empirical citation:
<ul>
<li>"It has been found that <code>{finding}</code> ...such as <code>{one domain}</code>
<code>{reference}</code> another domain <code>{reference}</code>"</li>
</ul>
</li>
</ul>
<h2>Pronouns</h2>
<ul>
<li>The first person pronoun "we" is used quite a lot. Rather than "it was
hypothesized that", the authors write "we hypothesized that". This creates
a fairly engaging tone.</li>
</ul>
<h1>Comparison to my previous writing on introductions</h1>
<p>I've discussed previously about <a href="http://jeromyanglim.blogspot.com.au/2009/12/how-to-write-introduction-section-in.html">how to write an introduction in psychology</a>.
In that post I present one structure for an introduction roughly as follows:</p>
<ul>
<li>Opening (Aim, Gap, Importance, Method)</li>
<li>Literature review</li>
<li>Current study (Study description, Expectations)</li>
</ul>
<p>While fairly similar, this paper differed to that structure in a few ways:</p>
<ul>
<li><strong>Aim</strong> and <strong>method</strong> were placed in the first paragraph. A more fluffy but engaging
opening paragraph was omitted.</li>
<li><strong>Importance</strong>: The second paragraph presents importance.</li>
<li><strong>Gap</strong>: Gap is presented at the end of the literature review component as
a presentation of common limitations with the existing literature. Gap also is
addressed when articulating the current study. Design aspects that overcome
past limitations are articulated.</li>
<li><strong>Study description</strong> and <strong>expectations</strong> are basically as described in the
post.</li>
</ul>
jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-54932764880536991832013-11-11T18:44:00.002+11:002013-11-12T09:54:42.716+11:00How to convert manual APA references to Endnote references in Word<p>When collaborating on a journal article with colleagues I sometimes get
Word documents that have manually formatted references.
I often want to convert the manual references to Endnote references.
The following post discusses a workflow for doing this.</p>
<a name='more'></a>
<h3>Context</h3>
<p>The reality is that most journals in psychology require or prefer submissions to
be in Word format. Endnote works reasonably well for formatting APA style
references in Microsoft Word. Furthermore, most of my students and collaborators
are familiar with Word.</p>
<p>So, I sometimes end up with Word documents written with manual references.
I could just continue on with manual references, but that has a whole host of
problems: (1) it takes about 20 minutes to do a check that citations match the
references, and every time the document is edited, this check needs to be
performed again; (2) there are so many rules when it comes to APA style that it
is better to get an automated tool like Endnote to apply them.</p>
<p>So, I want to convert the manual referencing to Endnote format.
Here's one way to do it.</p>
<h3>Getting references into Endnote</h3>
<p>The first step is to get the references into Endnote.
I asked <a href="http://academia.stackexchange.com/questions/13923/how-to-automatically-import-apa-references-into-reference-manager">here</a> about automatic import of lists of references into reference databases, but I have not found a solution.
So for now, one option is just to copy and paste each reference into <a href="http://scholar.google.com">Google
Scholar</a>.</p>
<p>Before beginning the process:</p>
<ol>
<li>Create and open the Endnote database for the journal article</li>
<li>Configure Google Scholar to use Endnote as its default reference manager
(this should then show an "import into Endnote" button for each reference).</li>
<li>Configure your browser to automatically open the "enw" file that results when
you click "import into endnote"</li>
</ol>
<p>To perform the search for each reference, sometimes pasting the whole reference
into the search box will work, other times you need to only provide a portion of
the search such as the title or author and year.
For extra speed, I have a <a href="http://jeromyanglim.tumblr.com/post/33632282039/alfred-more-than-just-a-keyboard-shortcut-to-open-a">Google Scholar search shortcut</a> that searches the highlighted text using Alfred.</p>
<p>Also for each reference, it is often necessary to add additional information
missed by Google Scholar. Scholar does a good job with journal articles, but
misses information in books, book chapters, and so on.</p>
<h3>Modifying citations</h3>
<p>So the existing Word document has manually written APA citations that need to be
modified to Endnote format. One approach to doing this is to convert all the
manually written citations into temporary Endnote citations. Then running format
citations in Endnote will lead Endnote to attempt to match each citation to
a reference in the Endnote database.</p>
<p>The simplest way to do this is to just convert the parentheses around citations to
curly braces. I.e., <code>(Smith, 2009)</code> becomes <code>{Smith, 2009}</code>. This works fairly
well, but there are a few things to keep in mind.</p>
<ul>
<li>Removing the "and" symbol between references will improve Endnote matching. So
make <code>(Smith & Jones, 2000)</code> into <code>{Smith Jones, 2000}</code></li>
<li>If you have prefix or postfix text then add the <a href="http://jeromyanglim.tumblr.com/post/54999395670/notes-on-temporary-citations-in-endnote-for-apa-style">temporary symbols described
here</a>:
e.g., <code>(e.g., Smith, 2000)</code> becomes <code>(e.g., \Smith, 2000)</code></li>
<li>If the author's name appears in the text outside the parentheses, then either put
a comma before the reference to just show the year, or add the <code>@@author-year</code>
code to include the author in text but generated by Endnote.</li>
</ul>
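<p>If a document has many citations, the first two rules can be roughed out as find-and-replace patterns. The following R sketch is illustrative only: the regular expression is mine, it handles only simple author-year citations, and the ampersand replacement crudely affects every "& " in the text:</p>
<pre><code class="r">txt <- "Earlier work (Smith & Jones, 2000) showed ..."
txt <- gsub("\\(([A-Z][^()]*, [0-9]{4})\\)", "{\\1}", txt)  # parentheses to braces
txt <- gsub("& ", "", txt)  # drop ampersands to improve Endnote matching
txt
#> "Earlier work {Smith Jones, 2000} showed ..."
</code></pre>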
<p>So once this is all done, running format temporary citations should lead Endnote
to take you through each citation, asking you to link it to the
corresponding reference. And if you've done a good job of the preceding steps,
the first match should correspond to the correct article in most cases.</p>
<h3>Checking</h3>
<p>At this point, the Endnote generated references may need to be moved into their
appropriate location and formatted as required.</p>
<p>The final step is to check all the references and citations.</p>
<ul>
<li>Common errors for Google Scholar References:
<ul>
<li>No page range (missing end page number)</li>
<li>Case issues in title</li>
<li>Case issues in journal name</li>
<li>Issues with hyphens</li>
</ul>
</li>
</ul>
<p>There can also be some issues with the citations. When diagnosing problems, one
strategy is to convert all citations to unformatted citations in Endnote.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-15503700521840698112013-10-10T11:01:00.000+11:002013-10-10T11:06:11.104+11:00Tutorials, Answers, and Data Files for Multivariate Research Methods Course using SPSS and Amos<p>I recently developed a set of tutorials for teaching research
methods using SPSS and Amos to I/O psychology students. I thought they might be useful for other instructors teaching intermediate multivariate research methods to social and behavioural science students, or for people learning such methods themselves. Thus, I have made the resources available as a downloadable repository.</p>
<a name='more'></a>
<p>Each tutorial includes a set of exercises, data, and extensive answers. A particular emphasis is on using syntax, reproducible workflow in SPSS, managing metadata, and scale construction.</p>
<p>It contains six tutorial exercises.</p>
<ul>
<li>Introduction to data analysis</li>
<li>Correlation and regression</li>
<li>Group differences</li>
<li>Moderators and mediators</li>
<li>Exploratory factor analysis</li>
<li>Confirmatory factor analysis</li>
</ul>
<p>For example, <a href="https://github.com/jeromyanglim/spss-research-methods-tutorials/blob/master/06-sem-confirmatory-factor-analysis/instructions/cfa-exercise-1.docx?raw=true">here is the tutorial on confirmatory factor analysis with Amos in docx format</a>. The repository also includes related data files.</p>
<p><strong>GITHUB Repository Address:</strong> <a href="https://github.com/jeromyanglim/spss-research-methods-tutorials">https://github.com/jeromyanglim/spss-research-methods-tutorials</a></p>
<p>Each exercise includes several folders</p>
<ul>
<li><strong>Instructions</strong>: This folder includes one or more Word documents with the
exercise and answers. These files should be your starting point for getting an understanding of the tutorials.</li>
<li><strong>data</strong>: This folder includes raw data and meta data used in the tutorial
exercises. There are often raw csv files as well as various SPSS sav files.
The exercises are designed to teach students how to import and process csv files
in SPSS.</li>
<li><strong>output</strong>: These folders often include a copy of much of the SPSS output in
PDF form as well as some syntax files.</li>
</ul>
<p>To use the repository it is recommended that you download the <a href="https://github.com/jeromyanglim/spss-research-methods-tutorials/archive/master.zip">ZIP
file</a>.</p>
<p>Earlier versions of the corresponding lectures can basically be found in the <a href="http://jeromyanglim.blogspot.com.au/2009/09/teaching-resources.html">teaching resources
section of my website</a>
under multivariate methods.</p>
<p><strong>Author:</strong> Dr Jeromy Anglim, Deakin University</p>
<p><strong>Licence:</strong> Tutorial exercises are given a creative commons licence <a href="http://creativecommons.org/licenses/by/3.0/">CC BY
3.0</a>. Raw data files and data
descriptions retain whatever licence they had previously.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-56621649282617143592013-07-17T18:41:00.000+10:002013-07-17T18:42:15.756+10:00Evaluating the Potential Incorporation of R into Research Methods Education in Psychology<p>I was recently completing some professional development activities that required
me to write a report on a self-chosen topic related to diversity in student
backgrounds. I chose to use the opportunity to reflect on the potential for
using R to teach psychology students research methods. I thought I'd share the
report in case it interests anyone.</p>
<a name='more'></a>
<h2>Abstract</h2>
<p>Research methods is fundamental to psychology education at university. Recently, open source software called R has become a compelling alternative to the traditionally used proprietary software called SPSS for teaching research methods. However, despite many strong equity and pedagogical arguments for the use of R, there are also many risks associated with its use. This report reviews the literature on the role of technology in research methods university education. It then reviews literature on the diversity of psychology students in terms of motivations, mathematical backgrounds, and career goals. These reviews are then integrated with a pedagogical assessment of the pros and cons of SPSS and R. Finally, recommendations are made regarding how R could be best implemented in psychology research methods teaching.</p>
<h2>Introduction</h2>
<p>Training in research methods is a fundamental component of university education in psychology. However, for many reasons subjects in research methods are challenging to teach. Students have diverse mathematical, statistical, and computational backgrounds; students often lack motivation as they struggle to see the relevance of statistics. These issues are compounded by undergraduate majors in psychology that typically have several compulsory research methods subjects. Given the competition for entry into fourth year and post-graduate programs, such research methods subjects can be threatening to struggling students.</p>
<p>As with many other universities, research methods in psychology at Deakin University has largely been taught using software called SPSS. This software is typically taught as a menu-driven program that is used to analyse data enabling standard data manipulation, analyses, and plotting. While SPSS is relatively user-friendly for standard analyses, there are several problems with teaching students how to use it. In particular, it is very expensive; thus, students cannot be assumed to have access to it either from home for doing assignments or in future jobs. In addition, while SPSS makes it easy to perform standard analyses, it is very difficult to alter what SPSS does to perform novel analyses. Thus, for many reasons some lecturers are seeking alternative statistical software for teaching research methods.</p>
<p>While there are many programs for performing statistical analysis, one particularly promising program, known simply as "R", has emerged as a viable alternative to SPSS. R is open source, so it is free for students and staff. Thus, students can use R from home when completing assignments and can use it in any future job. It has a vast array of statistical functionality. Despite these benefits, R presents several challenges to incorporation into psychology: analyses are typically performed using scripts; it is often less clear how to run certain analyses; and the program often assumes the mental model of a statistician rather than an applied researcher.</p>
<p>Thus, the current report had the following aims. The first aim was to evaluate the pros and cons of using R to teach psychology students research methods. The second aim was to evaluate how best R could be incorporated. In order to achieve these aims, the report is structured into several parts. First, the general literature on software in statistics education is reviewed, with a particular focus on diversity in student backgrounds in applied fields. Second, the backgrounds and career goals of psychology students are presented with reference to the literature and practical experience. Third, the pros and cons of using R versus SPSS are presented. Finally, ideas about how best to incorporate R into statistics education are reviewed.</p>
<h2>Statistics education and the role of software</h2>
<p>There is a substantial literature on statistics education and the role of statistical software within it. Tishkovskaya and Lancaster (2012) provide one review of the challenges in statistics education. Their review is structured around teaching and learning, statistical literacy, and statistics as a profession. Of particular relevance to teaching statistics in psychology, they outline several problems and provide relevant references. With references taken from their paper, these issues include: inability to apply mathematics to real world problems (e.g., Garfield, 1995); mathematics and statistics anxiety and motivation issues in students (e.g., Gal & Ginsburg, 1994); inherent difficulty in students understanding probability and statistics (e.g., Garfield & Ben-Zvi, 2008); problems with background mathematical and statistical knowledge (e.g., Batanero et al 1994); the need to develop statistical literacy which translates into everyday life (e.g., Gal, 2002); and the need to develop assessment tools to evaluate statistical literacy. Tishkovskaya and Lancaster (2012) also reviewed potential statistics teaching reforms. They note that there is a need to provide contextualised practice, foster statistical literacy, and create an active learning environment.</p>
<p>Of particular relevance to the current review of statistical software, Tishkovskaya and Lancaster (2012) discuss the role of technology in statistics education. The importance of technology has increased as computers have become more powerful. This has enabled students to run powerful statistical programs on their computer. Some teachers have used this power to focus instruction on interpretation of statistical results rather than computational mechanics. Chance et al (2007) further note the value of using interactive applets to explore statistical concepts and taking advantage of internet resources in teaching.</p>
<p>Chance et al's (2007) review also summarises several useful suggestions for incorporating technology in statistics education. Moore (1997) notes the importance of balancing the use of technology as a tool with remembering that the aim is to teach statistics, not the tool per se. Chance et al (2007) note that particularly valuable uses of technology include analysing real datasets, exploring data graphically, and performing simulations. Chance et al (2007) also review statistical software packages for statistics education, noting both the advantages and disadvantages of menu-driven applications such as SPSS.</p>
<p>Chance et al (2007) offer several recommendations for incorporating technology into statistical education. First, they highlight the importance of getting students practicing not just performing analyses, but also focusing on interpretation. Second, they recommend that tasks be carefully structured around exploration so that students see the bigger picture and do not get overwhelmed with software implementation issues. Third, collaborative exercises can force students to justify to their fellow students their reasoning. Fourth, they encourage the use of cycles of prediction and testing, which technology can facilitate (e.g., proposing a hypothesis for a simulation and then testing it).</p>
<p>Chance et al (2007) summarise the GAISE report by Franklin and Garfield (2006) on issues to consider when choosing software to teach statistics. These include (a) "ease of data entry, ability to import data in multiple formats, (b) Interactive capabilities, (c) Dynamic linking between data, graphical, and numerical analyses, (d) Ease of use for particular audiences, and (f) Availability to students, portability" (p.19). Franklin and Garfield (2006) also discuss a range of other implementation issues, such as the amount of time to allocate to software exploration, how much the software will be used in the course, and how accessible the software will be outside class. Garfield (1995) suggests that computers should be used to encourage students to explore data using analysis and visualisation tools. Running simulations and exploring the resulting properties is also particularly useful. Thus, overall these general considerations regarding statistics education can inform the choice of statistical software. However, the above review also highlights that the choice of software is only a small part of the overall unit design process.</p>
<h2>Psychology students and the role of statistics</h2>
<p>Pathways of psychology study in Australian universities typically involve completing a three-year undergraduate major in psychology, then a fourth year, followed by postgraduate professional or research degrees at masters or doctoral level. As a result of student interest, specialisation, and competition for places, student numbers decline across year levels. From my experience at both Melbourne University and Deakin University, a ballpark estimate of student numbers as a percentage of first-year load would be 40% at second year, 35% at third year, 10% at fourth year, and 3% at postgraduate level, starting from first-year enrolments of one to two thousand students. Of course these are just rough estimates, but the point is that huge numbers of students receive a basic undergraduate education in psychology; in contrast, the few who go on to fourth year have both a high skill level in psychology and different needs regarding research methods.</p>
<p>Psychology students are taught using the scientist-practitioner model. A big part of science in psychology is research methods and statistics. Students typically complete two or three research methods subjects at undergraduate level, another unit in fourth year, and potentially further units at postgraduate level. The diverse nature of psychology student backgrounds, motivations, and career outcomes can make research methods a difficult subject to design and teach.
Psychology undergraduate students also have diverse career goals and outcomes. Many go on to some form of further study. Those who exit at the end of third year have diverse employment outcomes. For example, Borden and Rajecki (2000) describe one US sample in which income was lower than for many other majors and roles included administrative support (17.6%), social worker (12.6%), and counsellor (7.6%), along with a diverse range of other jobs. Of those who go on, some will continue with research, but others will go into some form of applied practice.</p>
<p>In terms of research methods in psychology, there is a diverse range of goals. First, research methods is meant to help all students learn to reason about the scientific literature in psychology. Second, for students who continue with psychology, research methods should give them the skills to complete a quantitative fourth-year and postgraduate thesis. For a subset of students, quantitative skills are part of a marketable skillset that they can take into future employment. Furthermore, for the small group of students who go on to do a PhD and then join academia, research methods skills are fundamental to the continuation of good research and the vitality of the discipline.</p>
<p>In addition to these diverse aims, psychology students have diverse backgrounds. In particular, there are typically no mathematics pre-requisites. By casual observation, many students seem motivated to find work in the helping professions, and particularly as clinical psychologists. Many studies have discussed the challenges of teaching statistics to psychology students. For example, Lalonde and Gardner (1993) proposed and tested a model of statistics achievement that combined mathematical aptitude and effort with anxiety and motivation as predictors.</p>
<p>Thus, in combination, this diversity in backgrounds and student goals introduces several challenges when teaching research methods. For some students the main goal is to develop a moderate degree of statistical literacy. For others, it is essential that they are at least able to analyse their thesis data in a basic way. A final group of advanced students needs skills that will allow them to model their data in a sophisticated way and contribute to the research literature. Thus, there is a tension between presenting ideas in an accessible way for all students and tailoring the material for advanced students so they can truly excel.</p>
<p>This tension exists in many different aspects of the research methods curriculum. Research methods can be taught with varying degrees of mathematical rigour and abstraction. Teaching can emphasise interpreting output or it can emphasise computational processes. It can also vary in the prominence of software versus ideas. In particular, the choice of statistical software interacts substantially with this balance between rigour and accessibility. For example, tools like SPSS are more limited than R, but such limits can make standard analyses easier.</p>
<p>Aiken et al (2008) reviewed doctoral education in statistics and found that most surveyed programs primarily used SAS or SPSS. They described curricular innovation as a process in which novel topics emerge and are then taken up through initiatives from substantive researchers. Textbooks and software that make techniques accessible to psychology graduates also facilitate the teaching process. In some respects, as R has become more accessible through usability innovation and as the needs of data analysts have become more advanced, the argument for R has become more compelling.</p>
<h2>Whether to use R in psychology research methods</h2>
<h3>Pros and cons of R</h3>
<p>The above review thus provides a background for understanding both statistics education in general and the diversity in the background and goals of psychological students. The following analysis compares and contrasts R and SPSS as software for teaching research methods in psychology. This initial comparison focuses on price, features, usability and other considerations.</p>
<p>In terms of price, an initial benefit of R is that it is free. It is developed under the GNU open-source licence and is free to both the university and students. In contrast, a student licence to SPSS for a year is around $200; a professional licence is around $2,000; and SPSS charges expensive licensing fees to the university. R would make it easier to get students to complete analyses from home. Requiring students to purchase SPSS creates equity issues and may even encourage some students to engage in software piracy. If, as Devlin et al (2008) suggest, essential textbooks create economic hardship, then even more expensive statistical software would compound the problem.</p>
<p>In terms of features, SPSS and R both run on Windows, OS X, and Linux. They both support most standard analyses that students may wish to run. However, R has a larger array of contributed packages. SPSS has several features that R lacks, including a data entry tool, a menu-driven GUI, and an output management system for tables and plots. R makes it a lot easier to customise analyses, perform reproducible research, and run simulations.</p>
<p>In terms of flexibility SPSS and R both have options for performing flexible analyses. However, R makes it a lot easier to gradually introduce customisation by building on standard analyses. It is also flexible in how it can be used because of the open source licence. R is particularly suited to advanced students who can benefit from the easier pathway it provides for growing statistical sophistication.</p>
<p>In terms of usability, R and SPSS are quite different. R assumes greater knowledge about statistics. SPSS has an interface that is more familiar to users of standard Windows programs, whereas R is a programming language whose mental model is less consistent with such programs. R has a steeper initial learning curve but a shallower intermediate curve, and it encourages students to gradually develop statistical skills. In particular, R has several quirks which create difficulties for novices (e.g., learning details of syntax, escaping spaces in file paths, treating strings as factors versus character variables, etc.). There are also many things that are easy in SPSS that are difficult in R. Some examples include: variable labels and modifying meta-data, editing and browsing loaded data, producing tables of output, viewing and browsing statistical output, generating all the possible bits of output for an analysis, importing data, standard analyses that SPSS already does, and interactive plotting.</p>
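<p>To give one example of such a quirk, R of this era silently converts strings to factors on import. The following is a minimal illustration, assuming a hypothetical CSV file with a text column called group:</p>
<pre><code># strings become factors by default, which often surprises novices
dat <- read.csv("scores.csv")
class(dat$group)   # "factor"

# overriding the default gives the behaviour novices usually expect
dat <- read.csv("scores.csv", stringsAsFactors = FALSE)
class(dat$group)   # "character"
</code></pre>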
<p>R and SPSS can also be compared in terms of existing resources. There are many online resources for both R and SPSS. Psychology-specific R resources exist but are less plentiful than for SPSS. Furthermore, existing psychology supervisors, research methods staff, and tutors are probably more familiar with SPSS which may cause issues when transitioning teaching to R. That said, many supervisors either train their students directly in the software that they want their students to use or they let the student handle details of implementation.</p>
<h3>Mental Models</h3>
<p>When choosing between SPSS and R it is worth considering the mental models required to use SPSS and R. These mental models both guide what needs to be trained and also may suggest the gap that needs to be closed between students' initial mental models and that which is required by the software.</p>
<p>The SPSS mental model is centred around a dataset. The typical workflow is as follows: (a) import or create data; (b) define meta data; (c) menus guide analysis choice; (d) dialog boxes guide choices within analyses; (e) large amounts of output are produced; (f) instructional material facilitates interpretation of output; (g) output can be copied and pasted into Word or another program for a final report. Writing custom statistical functions, or taking SPSS output and using it as input to subsequent analyses, is not encouraged for regular users. Thus, overall the system guides the user in the analysis.</p>
<p>In contrast, R requires that the user guides the software. Thus, the R workflow is as follows: (a) Setup raw data in another program; (b) import data where often the user will have multiple datasets, meta datasets, and other data objects (e.g., vectors, tables of output); (c) transform data as required using a range of commands; (d) perform analyses, where command identification may involve a Google search or looking up a book, and understanding arguments in a command can be facilitated by internal documentation and online tools; (e) because the resulting output is minimal, the user often has to ask for specific output using additional commands; (f) much of what is standard in SPSS requires a custom command in R, but also much of which does not exist in SPSS can be readily created by an intermediate user; it is much easier to extract out particular statistical results and use that as input for subsequent functions; (g) while output can be incorporated into Word or Excel, users are encouraged to engage in various workflows that emphasise reproducible research.</p>
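<p>To make this workflow concrete, the following is a minimal sketch of such a command-driven session; the file name and variables are hypothetical:</p>
<pre><code>dat <- read.csv("survey.csv")        # (b) import data
dat$total <- rowSums(dat[, 2:11])    # (c) transform data as required
fit <- lm(total ~ age, data = dat)   # (d) perform an analysis
summary(fit)                         # (e) ask for specific output
coef(fit)["age"]                     # (f) extract a result for further use
</code></pre>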
<h3>Summary</h3>
<p>Thus, overall SPSS is well suited to a menu-driven standardised analysis workflow which meets the needs of many psychology students. R is particularly suited to statisticians who need to perform a diverse range of analyses and are more comfortable with computer programming and statistics in general. R requires greater statistical knowledge and it encourages students to have a plan for their analyses. R also requires students to learn more about computing, including programming, the command-line, file formats, and advanced file management. The emphasis on commands creates a greater demand on declarative memory, which in turn makes R more suited to students who will perform statistical analysis regularly. However, the flexibility and nature of R means that it can be used in many more contexts than SPSS, such as demonstrating statistical ideas through simulation.</p>
<p>Overall, there are clearly pros and cons to both SPSS and R. R is particularly suited to more advanced students. Occasional users may be more productive initially with SPSS. That said, the many students who never go on with any data analysis work may learn as much or more by using R. It also remains an empirical question how different psychology students would handle R. Thus, the remainder of this report focuses on how R could be implemented most effectively.</p>
<h2>How to use R in psychology research methods</h2>
<p>When considering implementation of R in psychology, it is useful to look at existing textbooks and course implementations. When considering textbooks, it is important to note that psychology tends to use a particular subset of statistical analyses. It also often has analysis goals that differ from other fields. For example, there is a greater emphasis on theoretical meaning, effect sizes, complex experimental designs, test reliability, and causal interpretation. While there are many textbooks that teach statistics using R, only recently have books emerged that are specifically designed to teach R to psychology students. The two main books are Andy Field's "Discovering Statistics Using R" and Dan Navarro's "Learning Statistics with R". An alternative model is to take a more generic R textbook or online resource and combine it with a more traditional psychology textbook such as David Howell's "Statistical Methods for Psychology". In particular, there are many user friendly online resources for learning R such as http://www.statmethods.net/ or Venables, Smith and the R Core Team's "An Introduction to R". Whatever textbook option is chosen, an important part of learning R involves learning how to get help. Thus, training should include learning how to navigate online learning resources and internet question-and-answer sites, which are very effective in the case of R (e.g., stackoverflow.com).</p>
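<p>For example, the built-in help facilities that such training would cover include standard base R commands like the following:</p>
<pre><code>?t.test                          # open the help page for a known function
help.search("factor analysis")   # search installed documentation
apropos("test")                  # list objects whose names contain "test"
RSiteSearch("multilevel model")  # search R websites (opens a browser)
</code></pre>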
<p>Dan Navarro (2013) has written a textbook that teaches statistics to psychology students using R. Navarro (2013) presents several arguments for using R instead of a commercial statistics package. These include: (1) the benefits of the software being free and not locking yourself into expensive proprietary software; (2) that R is highly extensible and has many cutting edge statistical techniques; and (3) that R is a programming language and learning to program is a good thing. He also observes that while R has its problems and challenges, overall it provides the best currently available option. Thus, overall, his approach is to inspire the student to see the bigger picture about why they are learning R. Navarro then spends two chapters introducing the R programming language. Starting with simple calculations, many basic concepts of variables, assignment, extracting data, and functions are introduced. Then, standard statistical techniques such as ANOVA and regression are presented with R implementations.</p>
<p>Overall, both these textbooks provide insight into how R could be implemented. Teaching with R provides some opportunity to teach statistics in a slightly deeper way. However, various recipes can be provided to perform standard analyses. Teaching R also requires taking a little extra time to teach the language. The menu-driven interface to R called R Commander also provides a way of introducing R more accessibly. The infrastructure provided by R also provides the opportunity to introduce many important topics such as bootstrapping, simulation, power analysis, and customised formulas. Weekly analysis homework of a kind not easily possible with SPSS could consolidate R-specific skills.</p>
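<p>As a hedged illustration of the kind of simulation exercise that R makes easy (the values chosen here are arbitrary), a few lines suffice to demonstrate the sampling distribution of the mean:</p>
<pre><code>set.seed(123)
# means of 10,000 samples of 30 IQ-like scores
means <- replicate(10000, mean(rnorm(30, mean = 100, sd = 15)))
hist(means, main = "Sampling distribution of the mean (n = 30)")
sd(means)       # empirical standard error
15 / sqrt(30)   # theoretical standard error
</code></pre>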
<p>An additional issue of implementation relates to when R should be introduced. Fourth year provides one such opportunity: students who remain at this level tend to be more capable and have some initial experience in statistics. Fourth-year research methods is a very important subject. It is often designed to prepare students to analyse multivariate data. It is also designed to prepare students to analyse data on their own, including preliminary analyses, data cleaning, and transformations. R supports all the standard multivariate techniques that are currently taught at fourth-year level. These include PCA, factor analysis, logistic regression, DFA, multiple regression, multilevel modelling, CFA, and SEM. R also makes it easier to explore more advanced methods such as bootstrapping and simulations.</p>
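<p>To give a flavour, the following sketch shows a few of these techniques using built-in datasets; multilevel modelling and CFA/SEM would typically draw on contributed packages such as lme4 and lavaan:</p>
<pre><code>pca <- princomp(USArrests)                   # principal components analysis
print(pca$loadings)
fa <- factanal(mtcars[, 1:6], factors = 2)   # maximum likelihood factor analysis
lr <- glm(am ~ hp + wt, data = mtcars,
          family = binomial)                 # logistic regression
summary(lr)
</code></pre>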
<h2>Conclusion</h2>
<p>Ultimately, it is an empirical question whether using R would provide a more effective tool for research methods education in psychology. It may be useful to explore the idea with some low-stakes optional postgraduate training modules in R. Such programs may give a sense of the kinds of practical issues that arise when students learn to use R. Rolling R out to all of fourth-year psychology, in contrast, would be a high-risk exercise. It would be important to evaluate the student learning outcomes in a broad way. In particular, it would be important to examine any effect on analysis performance in fourth-year theses.</p>
<h2>References</h2>
<ul>
<li>Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. The American Psychologist, 63(1), 32-50.</li>
<li>Batanero, C., Godino, J., Green, D., and Holmes, P. (1994). Errors and Difficulties in Understanding Introductory Statistical Concepts. International Journal of Mathematical Education in Science and Technology, 25 (4), 527–547.</li>
<li>Borden, V. M., & Rajecki, D. W. (2000). First-year employment outcomes of psychology baccalaureates: Relatedness, preparedness, and prospects. Teaching of Psychology, 27(3), 164-168.</li>
<li>Chance, B., Ben-Zvi, D., Garfield, J., and Medina, E. (2007). The Role of Technology in Improving Student Learning of Statistics. Technology Innovations in Statistics Education, 1(1). url: http://www.escholarship.org/uc/item/8sd2t4rr</li>
<li>Devlin, M., James, R., & Grigg, G. (2008). Studying and working: A national study of student finances and student engagement. Tertiary Education and Management, 14(2), 111-122.</li>
<li>Franklin, C. & Garfield, J. (2006). The GAISE (Guidelines for Assessment and Instruction in Statistics Education) project: Developing statistics education guidelines for pre K-12 and college courses. In G. Burrill (Ed.), 2006 NCTM Yearbook: Thinking and reasoning with data and chance. Reston, VA: National Council of Teachers of Mathematics.</li>
<li>Gal, I. (2002). Adults' Statistical Literacy: Meanings, Components, Responsibilities. With Discussion. International Statistical Review, 70(1), 1-51.</li>
<li>Gal, I. & Ginsburg, L. (1994). The Role of Beliefs and Attitudes in Learning Statistics: Towards an Assessment Framework. Journal of Statistics Education, 2(2). url: http://www.amstat.org/publications/jse/v2n2/gal.html</li>
<li>Garfield, J. (1995). How Students Learn Statistics. International Statistical Review, 63(1), 25-34.</li>
<li>Garfield, J. and Ben-Zvi, D. (2008). Developing Students' Statistical Reasoning: Connecting Research and Teaching Practice, Springer.</li>
<li>Lalonde, R. N., & Gardner, R. C. (1993). Statistics as a second language? A model for predicting performance in psychology students. Canadian Journal of Behavioural Science, 25, 108-125.</li>
<li>Moore, D.S. (1997). New pedagogy and new content: the case of statistics. International Statistical Review, 65, 123-165.</li>
<li>Navarro, D. (2013). Learning statistics with R: A tutorial for psychology students and other beginners (Version 0.3) url: http://ua.edu.au/ccs/teaching/lsr</li>
<li>Tishkovskaya, S., & Lancaster, G. (2012). Statistical education in the 21st century: a review of challenges, teaching innovations and strategies for reform. Journal of Statistics Education, 20(2), 1-55.</li>
<li>Wiberg, M. (2009). Teaching statistics in integration with psychology. Journal of Statistics Education, 17(1), 1-16.</li>
</ul>
jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-29568897848487660902013-03-15T18:55:00.000+11:002013-03-15T19:07:42.940+11:00Google Reader Replacements: Feedly and The Old ReaderThis post discusses the impending demise of Google Reader and configuring Feedly as a replacement.<br />
<a name='more'></a><br />
<!-- break -->
<br />
<h3>
Demise of Google Reader</h3>
I was very disappointed to read about Google terminating its <a href="http://googleblog.blogspot.com.au/2013/03/a-second-spring-of-cleaning.html">Google Reader</a> service.
<br />
Google Reader provided me with a great tool for following hundreds of blogs, journals, and assorted feeds.
The interface was clean and efficient.
I liked the keyboard shortcuts for navigation.
<br />
Some people are saying that Google+, Twitter, Facebook, Reddit, email newsletters, and so on are a substitute for Google Reader. This is rubbish. Google Reader is an efficient way of consuming and scanning new content based on providers that I care about. None of these other tools provide anything like this.
<br />
<br />
For bloggers, the concern about the end of Google Reader is that this is one of the major ways that people consume blog content. Even my own small blog has around a thousand RSS subscribers. The most popular RSS reader is Google Reader, and thus there is a concern that its closure may damage this connection between blogs and subscribers. As a consequence we might see fewer subscribers, then weaker incentives to blog, and ultimately less great blog content. Thus, I really hope that one or more high quality, trustworthy, multi-device, free web services emerge that continue to provide a great RSS reading experience. Hopefully, this is an opportunity for a service to emerge that is even better than Google Reader.
<br />
<h3>
Feedly</h3>
After an initial exploration I am having a good experience with <a href="http://www.feedly.com/">Feedly</a>.
When you log into Feedly with your Google Account, it immediately synchronises with your Google Reader account.
Supposedly Feedly will switch to their own backend when the Google Reader service ends.
Nonetheless, I have still exported my feeds directly from Google Reader using the Google Reader export facilities.<br />
<br />
I must admit that my first impressions of Feedly were worrying. However, a little persistence showed that I could replicate the Google Reader workflow.
<br />
<br />
First, Feedly runs both in the browser and on various mobile devices.
One drawback is that it does require the installation of a browser plugin and an app on mobile devices. But given that I have admin privileges, this wasn't a major issue.
<br />
<br />
To configure Feedly like Google Reader, see this <a href="http://blog.feedly.com/2013/03/14/tips-for-google-reader-users-migrating-to-feedly/">blog post for a few tips</a>.
<br />
<br />
After a few customisation steps I'm very happy.
In particular: (1) I set tile view for each of my categories; (2) I saved a browser bookmark that opens Feedly at a particular category. I keep my main feeds in a category called "core", so the default view when I click the bookmark is like what I was used to in Google Reader (I find the default Feedly homepage annoying); (3) I learnt the keyboard shortcuts, in particular j and k for navigating between posts (I had to set an exception in Vimium). This was something I really liked in Google Reader and it's great to see it still available in Feedly. Pressing the question mark key brings up the available shortcuts.
<br />
That said, it is early days and there is a lot of discussion about which service offers the best Google Reader replacement. I also need to build up trust when it comes to a provider of RSS services. I still need to see whether the migration from the Google Reader backend will be effective. I also don't yet understand Feedly's business model and therefore wonder how they will provide the service in the longer term.
<br />
<h3>
Alternatives</h3>
There's a discussion here of some of the <a href="http://webapps.stackexchange.com/questions/41591/alternatives-for-google-reader">alternatives</a>.
<br />
<br />
The Old Reader appears to be a popular choice. It offers an interface nearly identical to Google Reader. It doesn't require a browser plug-in. The <a href="http://blog.theoldreader.com/">developers seem friendly</a>. It also did a better job of rendering a few posts with mathematics (e.g., posts from the <a href="http://normaldeviate.wordpress.com/">Normal Deviate</a>, which Feedly struggled with).
<br />
<br />
Anyway, it's nice that at the moment there are at least two reasonable replacements for Google Reader. Presumably the preferred option will become clearer over the coming weeks and months.
jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-37013501705397852402012-07-23T23:20:00.000+10:002012-07-23T23:20:42.143+10:00Beamer presentations using pandoc, markdown, LaTeX, and a makefile<p>This post discusses the creation of beamer presentations using a combination of
markdown, pandoc, and LaTeX. This workflow offers the potential to reduce typing
and increase readability of beamer presentation source code. Source code for an example
presentation is provided containing markdown and LaTeX source code along with
a makefile for building the beamer PDF. </p>
<a name='more'></a>
<h3>Motivation</h3>
<p>I've used beamer quite a lot to prepare slides for both research and teaching
purposes (e.g.,
<a href="https://github.com/jeromyanglim/RMeetup_Workflow">this 2010 presentation on R Workflow</a>).
I've also written up a
<a href="http://jeromyanglim.blogspot.com.au/2010/08/getting-started-with-beamer-tips-and.html">guide to getting started with beamer</a> and
<a href="http://jeromyanglim.blogspot.com.au/2010/08/simple-beamer-template-for-getting.html">a simple beamer template</a>.</p>
<p>Nonetheless, for some time I've been concerned about the high ratio of markup to
content in beamer presentations. I even asked a question on TeX.SE on <a href="http://tex.stackexchange.com/questions/1264/typing-and-editing-beamer-presentations">strategies
for dealing with this
issue</a>.
I find that beamer markup is a burden. It interferes with content creation.
Creating, editing, and re-arranging slides is more difficult than it needs to
be. The high quantity of markup also interferes with readability.</p>
<p>There are several ways of dealing with this:</p>
<ul>
<li><a href="http://tex.stackexchange.com/questions/4106/what-are-a-good-set-of-macros-for-writing-beamer-presentations?lq=1">Use LaTeX macros</a>: I.e., to shorten common environments.
However, this reduces readability if it is ad hoc.</li>
<li><a href="http://tex.stackexchange.com/a/1303/151">Org Mode in Emacs</a>. This sounds
good, but I'm more experienced with Vim.</li>
<li>Code Snippets. Code snippets partially solve the typing issue, but they don't
solve the readability issue.</li>
</ul>
<p>In the end, I decided to explore the combination of pandoc, markdown, and LaTeX
to create a beamer presentation.
The reasons for this included the following:</p>
<ul>
<li><a href="http://daringfireball.net/projects/markdown/">Markdown</a> is a really intuitive
markup format that is easy to read and easy to modify.</li>
<li>When pandoc converts markdown to LaTeX, any LaTeX is passed straight through. Thus,
it is possible to obtain customisation beyond the basic options provided by
markdown.</li>
</ul>
<h3>Existing resources on combining Beamer, markdown and pandoc</h3>
<p>John MacFarlane, author of pandoc, has some <a href="http://johnmacfarlane.net/pandoc/demo/example9/producing-slide-shows-with-pandoc.html">relevant documentation on slide
production</a>.
Important points:</p>
<ul>
<li>The basic compilation command is: <code>pandoc -t beamer my_source.md -o my_beamer.pdf</code></li>
<li>The post explains the slide separation rules. </li>
<li>You can have incremental lists by pre-pending dot points with the greater than symbol </li>
<li>Beamer Themes can be used via the <code>-V</code> option e.g., (<code>-V theme:Warsaw</code>)</li>
<li>It shows how the first few lines of the file pre-pended by <code>%</code> are incorporated into the title slide</li>
</ul>
<p>Stephen Sinclair <a href="http://www.music.mcgill.ca/~sinclair/content/blog/using_markdown_for_beamer_presentations">has a tutorial</a>.
Relevant points include:</p>
<ul>
<li>LaTeX gets passed directly through
<ul>
<li>Equations can be passed directly through</li>
<li>Image size and placement can be controlled in detail with LaTeX, e.g., <code>\centerline{\includegraphics[height=2in]{my_image.pdf}}</code></li>
<li>You can use bibtex for citations</li>
</ul></li>
<li>He also mentions a number of other options for compiling the document
involving templates, regular expressions, and so on. </li>
</ul>
<h3>My approach</h3>
<p>My approach involved running a makefile which converted a markdown file into
a tex file, which was then incorporated into another tex file and then converted
into a pdf. </p>
<ul>
<li>The repository containing all files is available on github:
<a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/tree/master/talk">https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/tree/master/talk</a></li>
<li>The <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/main.pdf?raw=true">presentation PDF</a></li>
</ul>
<p>To use the approach you would need the following software:</p>
<ul>
<li>LaTeX distribution with beamer package</li>
<li><a href="http://johnmacfarlane.net/pandoc/">pandoc</a> </li>
<li>support for <code>make</code>: On Linux, make is installed by default; on Windows, you may
need something like <a href="http://www.cygwin.com/">cygwin</a> or
<a href="http://cran.r-project.org/bin/windows/Rtools/">Rtools</a>.</li>
</ul>
<h4>makefile</h4>
<p>The makefile was as follows:</p>
<pre><code>pdf:
	pandoc talk.md --slide-level 2 -t beamer -o talk.tex
	pdflatex main.tex
	pdflatex main.tex
	-xdg-open main.pdf
</code></pre>
<ul>
<li><code>pandoc</code> converted <code>talk.md</code> into a beamer latex file <code>talk.tex</code></li>
<li><code>--slide-level 2</code> meant that level 1 markdown headings (i.e., lines preceded
with a single hash: <code># Section name</code>) represented section headings used in the
presentation, and level 2 headings (i.e., lines preceded with two hashes <code>##
Slide Title</code>) represented new slides and their title.</li>
<li>The line <code>-xdg-open main.pdf</code> opens the resulting pdf file on Linux, but
<code>xdg-open</code>
could be replaced by the name of a PDF viewer (e.g., on a different operating system).</li>
</ul>
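<p>With this makefile in the talk directory (note that make requires the recipe lines to be indented with tabs), rebuilding and viewing the presentation is then a single command:</p>
<pre><code>make    # runs pandoc, compiles main.tex twice, and opens main.pdf
</code></pre>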
<h4>Preamble LaTeX file: <code>main.tex</code></h4>
<p>I had a main LaTeX file (<code>main.tex</code>) as follows: </p>
<pre><code>\documentclass[t]{beamer}
\usetheme{Berkeley}
\setbeamertemplate{navigation symbols}{}
\title{MY_TALK_TITLE}
\subtitle{MY_TALK_SUBTITLE}
\author{MY_NAME}
\institute{MY_INSTITUTION}
\date{DATE_OF_MY_TALK}
% more preamble...
\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}
\frametitle{Outline}
\tableofcontents
\end{frame}
\input{"talk.tex"}
\end{document}
</code></pre>
<ul>
<li>The file is entirely LaTeX and includes the preamble, the document
environment, some opening slides with particular features, and the input
command which reads in the file <code>talk.tex</code>.</li>
<li><code>talk.tex</code> is generated by pandoc from <code>talk.md</code> and contains all the
individual content slides.</li>
<li>I prefer to exclude navigation symbols.</li>
<li>Naturally, <code>usetheme</code> could be altered to some other theme (see the <a href="http://www.hartwork.org/beamer-theme-matrix/">beamer
theme matrix</a>), such as <code>default</code>. </li>
</ul>
<h4>Body markdown slide file: <code>talk.md</code></h4>
<p><code>talk.md</code> contained all the individual markdown slides.</p>
<p>For example a basic slide might look as follows:</p>
<pre><code># NAME OF A SECTION

## SLIDE TITLE

* Some point to make
    * Another point
    * Another point
* Some point to make
    * Another point
    * Another point
</code></pre>
<ul>
<li>The first line adds a section title. This is not part of the slide, but can be
used to generate table of contents, and in slide navigation.</li>
<li>The second line starts a new slide with the content to the right of the double
hash constituting the slide title.</li>
<li>And then subsequent lines generate a two-level list represented in LaTeX using
the <code>itemize</code> environment.</li>
</ul>
<p>In general, markdown is converted by pandoc into sensible beamer content. See the <a href="https://raw.github.com/jeromyanglim/rmarkdown-rmeetup-2012/master/talk/talk.md">actual
markdown file talk.md</a>
and resulting <a href="https://raw.github.com/jeromyanglim/rmarkdown-rmeetup-2012/master/talk/talk.tex">tex file talk.tex</a>.
However, pandoc passes any LaTeX through as is, and this is sometimes required.</p>
<p>For incorporating images, I found the default markdown image command led to an
excessively large image.
Thus, I used LaTeX for images.
I'd like to think that there is a way of making default images work well, but
I didn't work it out.
Thus, instead, I used commands like the following:</p>
<pre><code>## SLIDE TITLE
\includegraphics[width=4in]{FIGURE_FILE_NAME.PNG}
</code></pre>
<ul>
<li>I often had to tweak the image width to get it the right size.</li>
<li>I also read about some other options, which I <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/issues/4#issuecomment-6952495">discuss
here</a>.</li>
</ul>
<p>Obviously there are many reasons that you might want to fall back to LaTeX.
In my talk, I tried to keep things simple, so the main instances were:</p>
<ul>
<li>Images as shown above</li>
<li>Small text for footnotes often with a url inside: e.g., <code>\tiny{some text and
a link: \url{http://jeromyanglim.blogspot.com}}</code></li>
<li>Large text at the end of the talk: e.g., <code>\begin{center} \LARGE{Questions?} \end{center}</code></li>
</ul>
<h3>Conclusion</h3>
<p>Overall, there were pros and cons to the approach I adopted. </p>
<ul>
<li>By incorporating markdown and pandoc, there was an extra layer of complexity
to think about to ensure that the conversion process had the desired effect.
Error messages were sometimes more difficult to diagnose. </li>
<li>There were a lot of situations where you might want to have more control
over slide content than what you get by default with Markdown. </li>
<li>There is an argument to suggest that slide creation is best done in a WYSIWYG
environment where you can manually tweak image positioning and layout. </li>
</ul>
<p>Nonetheless, I really liked how easy it was to create, edit, and read dot points,
nested dot points, frames, sections, and basic formatting, and in general it was
relatively easy to incorporate LaTeX when required.
I also like the idea of where possible using open plain-text file formats to
take advantage of easier programmability, version control, incorporating into
a powerful text editor, simpler conversion, and so on.</p>
<h3>Other aspects</h3>
<p>The following are a few other aspects that might interest some readers,
particularly Vim users.</p>
<h4>Syntax highlighting of markdown+LaTeX in Vim</h4>
<p>There is a Vim plugin for pandoc that provides many features including syntax
highlighting for documents that combine multiple markups including markdown and
LaTeX. I found it best to install the latest version available on github:
<a href="https://github.com/vim-pandoc/vim-pandoc">https://github.com/vim-pandoc/vim-pandoc</a></p>
<h4>Code folding of markdown-Beamer</h4>
<p>I also have the following script in my <code>.vimrc</code> file.
The great thing about it is that it allows code folding if you use hash-style
markdown headings.
It is set up to fold only on headings 1 and 2, which corresponds to sections and
slides in my pandoc setting for beamer markdown documents.
To increase the level, change it to <code>MarkdownLevel(3)</code>, etc.</p>
<pre><code>function! MarkdownLevel(maxlevel)
    if a:maxlevel >= 1 && getline(v:lnum) =~ '^# .*$'
        return ">1"
    endif
    if a:maxlevel >= 2 && getline(v:lnum) =~ '^## .*$'
        return ">2"
    endif
    if a:maxlevel >= 3 && getline(v:lnum) =~ '^### .*$'
        return ">3"
    endif
    if a:maxlevel >= 4 && getline(v:lnum) =~ '^#### .*$'
        return ">4"
    endif
    if a:maxlevel >= 5 && getline(v:lnum) =~ '^##### .*$'
        return ">5"
    endif
    if a:maxlevel >= 6 && getline(v:lnum) =~ '^###### .*$'
        return ">6"
    endif
    return "="
endfunction
au BufEnter *.md setlocal foldexpr=MarkdownLevel(2)
au BufEnter *.md setlocal foldmethod=expr
au BufEnter *.md setlocal autoindent
</code></pre>
<p>Then the following Vim commands in normal mode make folding, navigation, and
getting a sense of structure really easy:</p>
<ul>
<li><code>zx</code> show current line and necessary headings; close other headings</li>
<li><code>zc</code> close heading</li>
<li><code>zj</code> and <code>zk</code> to move down and up headings</li>
</ul>
<h4>Showing backticks and single quotes properly in code</h4>
<p>I often need to show code, and backticks and single quotes weren't showing
properly.
The following code in my LaTeX preamble drawn from <a href="http://tex.stackexchange.com/questions/63353/">this TeX.SE
question</a> solved the problem:</p>
<pre><code>% enables straight single quote
\makeatletter
\let \@sverbatim \@verbatim
\def \@verbatim {\@sverbatim \verbatimplus}
{\catcode`'=13 \gdef \verbatimplus{\catcode`'=13 \chardef '=13 }}
\makeatother
% enables backticks in verbatim
\makeatletter
{\catcode`\`=13
\xdef\@verbatim{\unexpanded\expandafter{\@verbatim}\chardef\noexpand`=18 }
}
\makeatother
</code></pre>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-43950542864702512052012-07-19T16:55:00.000+10:002012-07-19T16:55:52.300+10:00Video: knitr, R Markdown, and R Studio: Introduction to Reproducible Analysis<p>This post presents the video of a talk that I presented in July 2012 at
Melbourne R Users on using knitr, R Markdown, and R Studio to perform
reproducible analysis. I also provide links to a github repository where the
R markdown examples can be examined and the slides can be downloaded.</p>
<a name='more'></a>
<h3>Talk Overview</h3>
<p>Reproducible analysis represents a process for transforming text, code, and data
to produce reproducible artefacts including reports, journal articles,
slideshows, theses, and books. Reproducible analysis is important in both
industry and academic settings for ensuring a high quality product. R has
always provided a powerful platform for reproducible analysis. However, in the
first half of 2012, several new tools have emerged that have substantially
increased the ease with which reproducible analysis can be performed. In
particular, knitr, R Markdown, and RStudio combine to create a user-friendly and
powerful set of open source tools for reproducible analysis.</p>
<p>Specifically, in the talk I discuss caching slow analyses, producing attractive plots and
tables, and using RStudio as an IDE. I present three live examples of using
R Markdown. I also show how the markdown package on CRAN can be
used to work with other R development environments and workflows for report
production. </p>
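<p>As a sketch of that markdown-package workflow (the file names here are hypothetical):</p>
<pre><code>library(knitr)
library(markdown)
knit("report.Rmd")                           # produces report.md
markdownToHTML("report.md", "report.html")   # convert markdown to HTML
browseURL("report.html")                     # view the result
</code></pre>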
<p>There is a <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012">github repository called rmarkdown-rmeetup-2012</a>
that contains: </p>
<ol>
<li>the slides and source code for the slides (I used a combination of beamer, markdown, and pandoc)</li>
<li>the source code for the R Markdown examples presented in the talk</li>
<li>and assorted brainstorming that recorded some of my thinking as I developed the slides
(see <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/issues?state=closed">the issue tracker</a>)</li>
</ol>
<p>Follow this <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/main.pdf?raw=true">link to download the slides directly</a>.</p>
<h3>Video of Talk</h3>
<p>The talk is split over two parts.</p>
<iframe width="420" height="315" src="http://www.youtube.com/embed/XqzHnYLr5BE"
frameborder="0" allowfullscreen></iframe>
<iframe width="420" height="315" src="http://www.youtube.com/embed/fNmMgHmjU2w" frameborder="0" allowfullscreen></iframe>
<h3>More Videos from Melbourne R Users</h3>
<p>We are gradually building up a fairly large back catalogue of videos about R all
presented at <a href="http://www.meetup.com/MelbURN-Melbourne-Users-of-R-Network/">Melbourne R Users</a>.</p>
<p>The <a href="http://www.youtube.com/playlist?list=PL2E4B515A6ED513B0">playlist of Melbourne R Users Videos can be viewed here</a>.</p>
<h3>Relevant links:</h3>
<p>The following links were either presented in the talk or are otherwise relevant to reproducible analysis.</p>
<ul>
<li>My post on <a href="http://jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html">getting started with R Markdown</a></li>
<li>My thoughts on <a href="http://stats.stackexchange.com/a/15006/183">definitions of reproducible data analysis</a></li>
<li>My thoughts on <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/issues/11">degrees of reproducible data analysis</a></li>
<li><a href="http://cran.r-project.org/web/views/ReproducibleResearch.html">Reproducible Research Task View on CRAN</a></li>
<li>Software used in talk: <a href="http://www.r-project.org/">R</a>, <a href="http://rstudio.org/">R Studio</a>, <a href="http://johnmacfarlane.net/pandoc/">pandoc</a>, and
<a href="http://www.latex-project.org/ftp.html">TeX distributions</a></li>
<li><a href="http://daringfireball.net/projects/markdown/">Overview of markdown</a></li>
<li><a href="http://jeromyanglim.blogspot.com/2010/10/getting-started-with-writing.html">Getting started with writing LaTeX equations</a></li>
<li><a href="http://yihui.name/slides/2012-knitr-RStudio.html">Slide show on benefits of knitr and Rstudio by Yihui Xie and JJ Allaire</a></li>
<li><a href="http://yihui.name/knitr/options">knitr options home page</a> and <a href="http://yihui.name/knitr/">knitr home page</a></li>
<li><a href="http://rstudio.org/docs/authoring/using_markdown">Documentation on using R Markdown with R Studio</a></li>
<li><a href="http://jeromyanglim.blogspot.com/search/label/reproducible%20research">My existing posts on reproducible analysis</a></li>
<li>Places to ask questions: <a href="http://stackoverflow.com/questions/tagged/r">R on StackOverflow</a>,
<a href="http://tex.stackexchange.com/">LaTeX on TeX.SE</a>, and <a href="https://github.com/yihui/knitr/issues">knitr on github</a>.</li>
<li>Extensive <a href="http://www.youtube.com/user/victoriastodden/videos?view=0">set of YouTube videos on reproducible analysis</a> largely
drawn from a workshop on "Reproducible Research: Tools and Strategies for Scientific Computing".</li>
</ul>
<p>If viewing through syndication, feel free to <a href="http://feeds.feedburner.com/jeromyanglim">subscribe to my blog on psychology and statistics here</a>.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-32595438946563806362012-06-10T00:20:00.000+10:002012-07-17T16:44:40.637+10:00Converting Sweave LaTeX to knitr LaTeX: A case study<p>The following post documents the steps I needed to take in order to convert a
project using Sweave LaTeX into one using knitr LaTeX.</p>
<a name='more'></a>
<h3>Resources on the knitr website</h3>
<p>It is fairly straightforward to convert a document from Sweave LaTeX to knitr
LaTeX. <a href="http://yihui.name/">Yihui Xie</a> on the knitr website provides the
following useful resources:</p>
<ul>
<li><a href="http://yihui.name/knitr/demo/sweave/">Transition from Sweave to knitr</a>: This
document describes knitr specifically from the perspective of what is the same
as Sweave and what is different from Sweave.</li>
<li><a href="http://yihui.name/knitr/options">knitr options</a>: This includes discussion of
the many R code chunk options in knitr. Many are the same as Sweave, but there
are some new ones, and some modifications.</li>
<li><a href="http://yihui.name/knitr/demo/minimal/">knitr minimal examples</a>: These are
useful for getting started with different types of knitr document including
LaTeX.</li>
</ul>
<h3>My conversion from Sweave to knitr</h3>
<p>The following documents the steps I needed to take in order to convert a journal
article from Sweave LaTeX into a knitr LaTeX document.
Most of this is documented in the above-mentioned links on the knitr website,
but there were still a few little surprises.</p>
<ul>
<li><strong>Rnw to tex conversion</strong>: Convert <code>R CMD Sweave myfile.rnw</code> to <code>Rscript -e
"library(knitr); knit('myfile.rnw')"</code> in the makefile (<a href="http://stackoverflow.com/a/10943794/180892">see this SO
question</a>).</li>
<li><strong>global options</strong>: Replace <code>\SweaveOpts{echo=FALSE}</code> with
<code>\Sexpr{opts_chunk$set(echo=FALSE)}</code>; this needs to appear before the first
R code chunk in order to affect all code chunks in the file (a combined setup chunk is sketched after this list).</li>
<li><strong>case on R code chunk options</strong>: Update <code>true</code> and <code>false</code> to <code>TRUE</code> and
<code>FALSE</code> in r code chunk options. </li>
<li><strong>results option</strong>: Update <code>results=tex</code> to <code>results='asis'</code> and in general ensure that text
values in R code chunks are surrounded by quotation marks.</li>
<li><strong>message option</strong>: I needed to prevent the display of messages when certain
packages were loaded using <code>\Sexpr{opts_chunk$set(message=FALSE)}</code>.
These messages did not previously display under Sweave.</li>
<li><strong>hiding output</strong>: I had some R code chunks with options <code>print=FALSE,
term=FALSE</code>; I replaced this with <code>results='hide'</code>.</li>
<li><strong>methods package</strong>: I had a <code>densityplot()</code> (i.e., a lattice plot) that
didn't display properly. It instead showed an error: <code>Error using packet 1
could not find function "hasArg"</code>; apparently this is caused by the fact that
the methods package doesn't load by default when using <code>Rscript</code>; thus I
needed to put <code>require(methods)</code> in the first R code chunk. </li>
<li><strong>Sweave.sty</strong>: I removed <code>Sweave.sty</code> from my project directory and removed
the line <code>\usepackage{Sweave}</code> from my rnw file, as neither is needed
in knitr.</li>
<li><strong>caching</strong>: Although there are packages for enabling caching, I'd never
adopted any of them. knitr makes caching very simple. I just added
<code>cache=TRUE</code> to the global chunk options (i.e.,
<code>\Sexpr{opts_chunk$set(echo=FALSE, message=FALSE, cache=TRUE)}</code>). This reduced
the time to build the PDF from around 5 seconds to 1 second. I'm also planning
to incorporate some Bayesian analyses with JAGS and rjags, where I'm expecting
analyses will take several minutes or longer to run. At that point, I'll
really appreciate the speed benefits of caching.</li>
<li><strong>to make or not to make</strong>: I had a custom makefile on the project that kept
everything neat and tidy, copying source files into a build directory, running
all necessary commands to convert from rnw to tex and then to pdf, and then
opening the pdf in a viewer. This still works well. However, the default
"Compile to PDF" option in RStudio was also quite good (after setting tools -
options - Sweave - Weave Rnw files using knitr). In particular, I liked the
synctex support for Sweave that allows you to move from a position in the source
to the corresponding position in the PDF viewer. Also, RStudio in combination
with knitr seems to do a reasonable job of keeping the main project directory
tidy. A few auxiliary files are added, but not too many. I also appreciate the
simplicity that a simple button brings to getting started with analyses.
However, a makefile does make things more portable.</li>
</ul>
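<p>Putting several of these changes together, a knitr setup chunk near the top of the rnw file might look like the following; this is a sketch with an illustrative chunk label rather than the exact chunk from my project:</p>
<pre><code><<setup, include=FALSE, cache=FALSE>>=
library(knitr)
opts_chunk$set(echo = FALSE, message = FALSE, cache = TRUE)
require(methods)  # avoids the hasArg error for lattice plots under Rscript
@
</code></pre>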
<p>My main conclusion from this process is that converting an ongoing Sweave LaTeX
document to knitr LaTeX is fairly straightforward, and there are a number of
useful benefits that arise. In particular, I really appreciate simple caching
and not having to worry about Sweave.sty. Great work Yihui Xie!</p>
<h3>Additional Resources</h3>
<ul>
<li><a href="http://feeds.feedburner.com/jeromyanglim">RSS Subscription options</a> </li>
<li><a href="http://jeromyanglim.blogspot.com.au/2012/06/how-to-convert-sweave-latex-to-knitr-r.html">Convert Sweave LaTeX to knitr R
Markdown</a></li>
<li><a href="
http://jeromyanglim.blogspot.com.au/2012/05/getting-started-with-r-markdown-knitr.html">Getting started with R Markdown</a></li>
<li><a href="http://jeromyanglim.blogspot.com.au/2009/06/learning-r-for-researchers-in.html">Getting started with R</a></li>
<li><a href="http://jeromyanglim.blogspot.com.au/2010/05/videos-on-data-analysis-with-r.html">R Videos</a></li>
<li><a href="http://jeromyanglim.blogspot.com.au/2010/11/makefiles-for-sweave-r-and-latex-using.html">Sweave and
makefiles</a></li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-11917665398702979512012-06-04T22:19:00.001+10:002012-06-04T22:19:51.511+10:00How to Convert Sweave LaTeX to knitr R Markdown: Winter Olympic Medals Example<p>The following post shows how to manually convert a Sweave LaTeX document into a knitr R Markdown document. The post (1) reviews many of the required changes; (2) provides an example of a document converted to R Markdown format based on an analysis of Winter Olympic Medal data up to and including 2006; and (3) discusses the pros and cons of LaTeX and Markdown for performing analyses.</p>
<a name='more'></a>
<h1>Overview</h1>
<p>The following analyses of Winter Olympic Medals data have gone through several iterations:</p>
<ol>
<li><strong>R Script</strong>: I originally performed <a href="http://jeromyanglim.blogspot.com.au/2010/02/analysis-of-winter-olympic-medal-data.html">similar analyses in February 2010</a>. It was a simple set of commands where you could see the console output and view the plots. </li>
<li><strong>LaTeX Sweave</strong>: In February 2011 I adapted the example to make it a Sweave LaTeX document. The <a href="https://github.com/jeromyanglim/Sweave_Winter_Olympics">source for this is available on github</a>. With Sweave, I was able to create a document that weaved together text, commands, console input, console output, and figures.</li>
<li><strong>R Markdown</strong>: Now in June 2012 I'm using the example to review the process of converting a document from Sweave LaTeX to R Markdown. The <a href="https://github.com/jeromyanglim/Winter_Olympic_Medals_R_Markdown">source code is available here on github</a> (see the <code>*.rmd</code> file). </li>
</ol>
<h1>Converting from Sweave to R Markdown</h1>
<p>The following changes were required in order to convert my LaTeX Sweave document into an R Markdown document suitable for processing with <code>knitr</code> and <code>RStudio</code>. Many of these changes are fairly obvious if you understand LaTeX and Markdown; but a few are less obvious. And obviously there are many additional changes that might be required on other documents.</p>
<h2>R code chunks</h2>
<ul>
<li><strong>R code chunk delimiters:</strong> Update from <code><< ... >>=</code> and <code>@</code> to R markdown format <code>```{r ...}</code> and <code>```</code></li>
<li><strong>Inline code chunks:</strong> Update from <code>\Sexpr{...}</code> to either <code>`r ...`</code> or <code>`r I(...)`</code> format.</li>
<li><strong>results=tex</strong>: Any <code>results=tex</code> needs to either be removed or converted to <code>results='asis'</code>. Note that string values of knitr options need to be quoted.</li>
<li><strong>Boolean options</strong>: Sweave tolerates lower case <code>true</code> and <code>false</code> for code chunk options, <code>knitr</code> requires <code>TRUE</code> and <code>FALSE</code>.</li>
</ul>
<h2>Figures and Tables</h2>
<ul>
<li><strong>Floats</strong>: Remove figure and table floats (e.g., <code>\begin{table}...\end{table}</code>, <code>\begin{figure}...\end{figure}</code>). In R Markdown and HTML, there are no pages and thus content is just placed immediately in the document.</li>
<li><strong>Figure captions</strong>: Extract content from within the <code>\caption{}</code> command. When using R Markdown, it is often easiest to add captions to the plot itself (e.g., using the <code>main</code> argument in base graphics). </li>
<li><strong>Table captions:</strong> extract content from within the <code>\caption{}</code> command; Table captions can be included in a <code>caption</code> argument using the <code>caption</code> argument to the <code>xtable</code> function (e.g., <code>print(xtable(MY_DAT_FRAME), "html", caption="MY CAPTION", caption.placement="top")</code> ). Caption placement defaults to <code>"bottom"</code> of table but can be optinally specified as <code>"top"</code> either as a global option or in <code>print.xtable</code>. Alternatively table titles can just be included as Markdown text.</li>
<li><strong>References:</strong> Delete table and figure lables (e.g., <code>\label{...}</code>). Replace table and figure references (e.g., <code>\ref{...}</code> with actual numbers or other descriptive terminology. It would also be possible to implement something simple in R that stored table and figure numbers (e.g., initialise table and figure numbers at the start of the document; increment table counter each time a table is created and likewise for figures; store the value of counter in variable; include variable in caption text using <code>paste()</code> or something similar. Include counter in text using inline R code chunks.</li>
<li><strong>Table content</strong>: Markdown supports HTML; so one option is to convert LaTeX tables to HTML tables using a function like <code>print(xtable(MY_DATA_FRAME), type="html")</code>. This is combined with the <code>results='asis'</code> R code chunk option.</li>
</ul>
<h2>Basic formatting</h2>
<ul>
<li><strong>Headings</strong>: if we assume <code>section</code> is the top level: then <code>\section{...}</code> becomes <code># ...</code>, <code>\subsection{...}</code> becomes <code>## ...</code> and <code>\subsubsection{...}</code> becomes <code>### ...</code></li>
<li><strong>Mathematics</strong>: Update latex mathematics to <code>$</code><code>latex ...</code> and <code>$$</code><code>latex ... $$</code> notation if using RStudio.</li>
<li><strong>Paragraph delimiters</strong>: If using RStudio then remove single line breaks that were not intended to be paragraph breaks.</li>
<li><strong>Hyperlinks</strong>: Convert LaTeX Hyperlinks from <code>\href</code> or <code>url</code> to <code>[text](url)</code> format.</li>
</ul>
<h2>LaTeX things</h2>
<ul>
<li><strong>Comments</strong>: Remove any LaTeX comments or switch from <code>% comment</code> to <code><!-- comment --></code></li>
<li><strong>LaTeX escaped characters</strong>: Remove unnecessary escape characters (e.g., <code>\%</code> is just <code>%</code>).</li>
<li><strong>R Markdown escaped characters</strong>: Writing about the R Markdown language in R Markdown sometimes requires the use of HTML codes for special characters such as backticks (<code>&#96;</code>) and backslashes (<code>&#92;</code>) to prevent the text from being interpreted; see <a href="http://www.ascii.cl/htmlcodes.htm">here for a list of HTML character codes</a>.</li>
<li><strong>Header</strong>: Remove the LaTeX header information up to and including <code>\begin{document}</code>; extract any incorporate any relevant content such as title, abstract, author, date, etc.</li>
</ul>
<h1>R Markdown Analysis of Winter Olympic Medal Data</h1>
<p>The following shows the output of the actual analysis after running the rmd source through <code>Knit HTML</code> in Rstudio. If you're curious, you may wish to view the <a href="https://github.com/jeromyanglim/Winter_Olympic_Medals_R_Markdown/blob/master/Winter_Olympics.rmd">rmd source code on GitHub side by side this point at this point</a>.</p>
<h2>Import Dataset</h2>
<pre><code class="r">library(xtable)
options(stringsAsFactors = FALSE)
medals <- read.csv("data/medals.csv")
medals$Year <- as.numeric(medals$Year)
medals <- medals[!is.na(medals$Year), ]
</code></pre>
<p>The Olympic Medals data frame includes <code>2311</code> medals from <code>1924</code> to <code>2006</code>. The data was sourced from <a href="http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country">The Guardian Data Blog</a>.</p>
<h2>Total Medals by Year</h2>
<pre><code class="r"># http://www.math.mcmaster.ca/~bolker/emdbook/chap3A.pdf
x <- aggregate(medals$Year, list(Year = medals$Year), length)
names(x) <- c("year", "medals")
x$pos <- seq(x$year)
fit <- nls(medals ~ a * pos^b + c, x, start = list(a = 10, b = 1,
c = 50))
</code></pre>
<p>In general over the years the number of Winter Olympic medals awarded has increased. In order to model this relationship, year was converted to ordinal position. A three parameter power function seemed plausible, \( y = ax^b + c \), where \( y \) is total medals awarded and \( x \) is the ordinal position of the olympics starting at one. The best fitting parameters by least-squares were</p>
<p>\[
0.202
x^{2.297 + 50.987}.
\]</p>
<p>The figure displays the data and the line of best fit for the model. The model predicts that 2010, 2014, and 2018 would have <code>271</code>, <code>295</code>, and <code>322</code> medals respectively.</p>
<pre><code class="r">plot(medals ~ pos, x, las = 1,
ylab = "Total Medals Awarded",
xlab = "Ordinal Position of Olympics",
main="Total medals awarded
by ordinal position of Olympics with
predicted three parameter power function fit displayed.",
las = 1,
bty="l")
lines(x$pos, predict(fit))
</code></pre>
<p><img src="http://i.imgur.com/atTmh.png" alt="plot of chunk figure_of_medals"/> </p>
<h1>Gender Ratio by Year</h1>
<pre><code class="r">medalsByYearByGender <- aggregate(medals$Year, list(Year = medals$Year,
Event.gender = medals$Event.gender), length)
medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender !=
"X", ]
propf <- list()
propf$prop <- medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "x"]/(medalsByYearByGender[medalsByYearByGender$Event.gender == "W",
"x"] + medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
propf$year <- medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "Year"]
propf$propF <- format(round(propf$prop, 2))
propf$table <- with(propf, cbind(year, propF))
colnames(propf$table) <- c("Year", "Prop. Female")
</code></pre>
<p>The figure shows the number of medals won by males and females by year. The table shows the proportion of medals awarded to females by year. It shows a generally similar pattern for males and females. Medals increase gradually until around the late 1980s after which the rate of increase accelerates. However, females started from a much smaller base. Thus, both the absolute difference and the percentage difference has decreased over time to the point where in 2006 <code>46</code> of medals were won by females.</p>
<pre><code class="r">plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender ==
"M", ], ylim = c(0, max(x)), pch = "m", col = "blue", las = 1, ylab = "Total Medals Awarded",
bty = "l", main = "Total Medals Won by Gender and Year")
points(medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "Year"], medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "x"], col = "red", pch = "f")
</code></pre>
<p><img src="http://i.imgur.com/idGC7.png" alt="plot of chunk fgenderRatioByYear_figure"/> </p>
<pre><code class="r">print(xtable(propf$table,
caption="Proportion of Medals that were awarded to Females by Year"),
type="html",
caption.placement="top",
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:14:27 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> Proportion of Medals that were awarded to Females by Year </CAPTION>
<TR> <TH> </TH> <TH> Year </TH> <TH> Prop. Female </TH> </TR>
<TR> <TD align="right"> 1 </TD> <TD> 1924 </TD> <TD> 0.07 </TD> </TR>
<TR> <TD align="right"> 2 </TD> <TD> 1928 </TD> <TD> 0.08 </TD> </TR>
<TR> <TD align="right"> 3 </TD> <TD> 1932 </TD> <TD> 0.08 </TD> </TR>
<TR> <TD align="right"> 4 </TD> <TD> 1936 </TD> <TD> 0.12 </TD> </TR>
<TR> <TD align="right"> 5 </TD> <TD> 1948 </TD> <TD> 0.18 </TD> </TR>
<TR> <TD align="right"> 6 </TD> <TD> 1952 </TD> <TD> 0.23 </TD> </TR>
<TR> <TD align="right"> 7 </TD> <TD> 1956 </TD> <TD> 0.26 </TD> </TR>
<TR> <TD align="right"> 8 </TD> <TD> 1960 </TD> <TD> 0.38 </TD> </TR>
<TR> <TD align="right"> 9 </TD> <TD> 1964 </TD> <TD> 0.37 </TD> </TR>
<TR> <TD align="right"> 10 </TD> <TD> 1968 </TD> <TD> 0.37 </TD> </TR>
<TR> <TD align="right"> 11 </TD> <TD> 1972 </TD> <TD> 0.36 </TD> </TR>
<TR> <TD align="right"> 12 </TD> <TD> 1976 </TD> <TD> 0.35 </TD> </TR>
<TR> <TD align="right"> 13 </TD> <TD> 1980 </TD> <TD> 0.34 </TD> </TR>
<TR> <TD align="right"> 14 </TD> <TD> 1984 </TD> <TD> 0.36 </TD> </TR>
<TR> <TD align="right"> 15 </TD> <TD> 1988 </TD> <TD> 0.37 </TD> </TR>
<TR> <TD align="right"> 16 </TD> <TD> 1992 </TD> <TD> 0.43 </TD> </TR>
<TR> <TD align="right"> 17 </TD> <TD> 1994 </TD> <TD> 0.43 </TD> </TR>
<TR> <TD align="right"> 18 </TD> <TD> 1998 </TD> <TD> 0.44 </TD> </TR>
<TR> <TD align="right"> 19 </TD> <TD> 2002 </TD> <TD> 0.45 </TD> </TR>
<TR> <TD align="right"> 20 </TD> <TD> 2006 </TD> <TD> 0.46 </TD> </TR>
</TABLE>
<h1>Countries with the Most Medals</h1>
<pre><code class="r">cmm <- list()
cmm$medals <- sort(table(medals$NOC), dec = TRUE)
cmm$country <- names(cmm$medals)
cmm$prop <- cmm$medals/sum(cmm$medals)
cmm$propF <- paste(round(cmm$prop * 100, 2), "%", sep = "")
cmm$row1 <- c("Rank", "Country", "Total", "%")
cmm$rank <- seq(cmm$medals)
cmm$include <- 1:10
cmm$table <- with(cmm, rbind(cbind(rank[include], country[include],
medals[include], propF[include])))
colnames(cmm$table) <- cmm$row1
</code></pre>
<p>Norway has won the most medals with <code>280</code> (<code>12.12</code>%). The table shows the top 10. Russia, USSR, and EUN (Unified Team in 1992 Olympics) have a combined total of <code>293</code>. Germany, GDR, and FRG have a combined medal total of <code>309</code>.</p>
<pre><code class="r">print(xtable(cmm$table, caption="Rankings of Medals Won by Country"),
"html", include.rownames=FALSE, caption.placement='top',
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:14:27 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> Rankings of Medals Won by Country </CAPTION>
<TR> <TH> Rank </TH> <TH> Country </TH> <TH> Total </TH> <TH> % </TH> </TR>
<TR> <TD> 1 </TD> <TD> NOR </TD> <TD> 280 </TD> <TD> 12.12% </TD> </TR>
<TR> <TD> 2 </TD> <TD> USA </TD> <TD> 216 </TD> <TD> 9.35% </TD> </TR>
<TR> <TD> 3 </TD> <TD> URS </TD> <TD> 194 </TD> <TD> 8.39% </TD> </TR>
<TR> <TD> 4 </TD> <TD> AUT </TD> <TD> 185 </TD> <TD> 8.01% </TD> </TR>
<TR> <TD> 5 </TD> <TD> GER </TD> <TD> 158 </TD> <TD> 6.84% </TD> </TR>
<TR> <TD> 6 </TD> <TD> FIN </TD> <TD> 151 </TD> <TD> 6.53% </TD> </TR>
<TR> <TD> 7 </TD> <TD> CAN </TD> <TD> 119 </TD> <TD> 5.15% </TD> </TR>
<TR> <TD> 8 </TD> <TD> SUI </TD> <TD> 118 </TD> <TD> 5.11% </TD> </TR>
<TR> <TD> 9 </TD> <TD> SWE </TD> <TD> 118 </TD> <TD> 5.11% </TD> </TR>
<TR> <TD> 10 </TD> <TD> GDR </TD> <TD> 110 </TD> <TD> 4.76% </TD> </TR>
</TABLE>
<h1>Proportion of Gold Medals by Country</h1>
<p>Looking only at countries that have won more than 50 medals in the dataset, the figure shows that the proportion of medals won that were gold, silver, or bronze.</p>
<pre><code class="r">NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal),
margin = 1)
medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"],
decreasing = TRUE), c("Gold", "Silver", "Bronze")]
barplot(round(t(medalsByMedalByNOC), 2), horiz = TRUE, las = 1,
col=c("gold", "grey71", "chocolate4"),
xlab = "Proportion of Medals",
main="Proportion of medals won that were gold, silver or bronze.")
</code></pre>
<p><img src="http://i.imgur.com/L7f1C.png" alt="plot of chunk proportion_gold"/> </p>
<h1>How many different countries have won medals by year?</h1>
<pre><code class="r">listOfYears <- unique(medals$Year)
names(listOfYears) <- unique(medals$Year)
totalNocByYear <- sapply(listOfYears, function(X) length(table(medals[medals$Year ==
X, "NOC"])))
</code></pre>
<p>The figure shows the total number of countries winning medals by year.</p>
<pre><code class="r">plot(x = names(totalNocByYear), totalNocByYear, ylim = c(0, max(totalNocByYear)),
las = 1, xlab = "Year", main = "Total Number of Countries Winning Medals By Year",
ylab = "Total Number of Countries", bty = "l")
</code></pre>
<p><img src="http://i.imgur.com/VKzmi.png" alt="plot of chunk figure_total_medals"/> </p>
<h1>Australia at the Winter Olympics</h1>
<pre><code class="r">ausmedals <- list()
ausmedals$data <- medals[medals$NOC == "AUS", ]
ausmedals$data <- ausmedals$data[, c("Year", "City", "Discipline",
"Event", "Medal")]
ausmedals$table <- ausmedals$data
</code></pre>
<p>Given that I am an Australian I decided to have a look at the Australian medal count. Australia does not get a lot of snow. Up to and including 2006, Australia has won <code>6</code> medals. It won its first medal in <code>1994</code>. Of the <code>6</code> medals, <code>3</code> were bronze, <code>0</code> were silver, and <code>3</code> were gold. The table lists each of these medals.</p>
<pre><code class="r">print(xtable(ausmedals$table,
caption='List of Australian Medals',
digits=0),
type='html',
caption.placement='top',
include.rownames=FALSE,
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:15:10 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> List of Australian Medals </CAPTION>
<TR> <TH> Year </TH> <TH> City </TH> <TH> Discipline </TH> <TH> Event </TH> <TH> Medal </TH> </TR>
<TR> <TD align="right"> 1994 </TD> <TD> Lillehammer </TD> <TD> Short Track S. </TD> <TD> 5000m relay </TD> <TD> Bronze </TD> </TR>
<TR> <TD align="right"> 1998 </TD> <TD> Nagano </TD> <TD> Alpine Skiing </TD> <TD> slalom </TD> <TD> Bronze </TD> </TR>
<TR> <TD align="right"> 2002 </TD> <TD> Salt Lake City </TD> <TD> Short Track S. </TD> <TD> 1000m </TD> <TD> Gold </TD> </TR>
<TR> <TD align="right"> 2002 </TD> <TD> Salt Lake City </TD> <TD> Freestyle Ski. </TD> <TD> aerials </TD> <TD> Gold </TD> </TR>
<TR> <TD align="right"> 2006 </TD> <TD> Turin </TD> <TD> Freestyle Ski. </TD> <TD> aerials </TD> <TD> Bronze </TD> </TR>
<TR> <TD align="right"> 2006 </TD> <TD> Turin </TD> <TD> Freestyle Ski. </TD> <TD> moguls </TD> <TD> Gold </TD> </TR>
</TABLE>
<h1>Ice Hockey</h1>
<pre><code class="r">icehockey <- medals[medals$Sport == "Ice Hockey" & medals$Event.gender ==
"M" & medals$Medal == "Gold", ]
icehockeyf <- medals[medals$Sport == "Ice Hockey" & medals$Event.gender ==
"W" & medals$Medal == "Gold", ]
# names(table(icehockey$NOC)[table(icehockey$NOC) > 1])
</code></pre>
<p>The following are some statistics about Winter Olympics Ice Hockey up to and including the 2006 Winter Olympics. </p>
<ul>
<li>Out of the <code>20</code> Winter Olympics that have been staged, Mens Ice Hockey has been held in <code>20</code> and the Womens in <code>3</code>.</li>
<li>The USSR has won the most mens gold medals with <code>7</code> golds. It goes up to <code>8</code> if the 1992 Unified Team is included. </li>
<li>Canada has the second most golds with <code>6</code>. </li>
<li>After that the only two nations to win more than one gold are Sweden (<code>2</code> golds) and the United States (<code>2</code> golds).</li>
<li> The table shows the countries who won gold and silver medals by year.</li>
<li>In the case of the Women's Ice Hockey, Canada has won <code>2</code> and the United States has won <code>1</code>.</li>
</ul>
<pre><code class="r">icehockeygs <- medals[medals$Sport == "Ice Hockey" &
medals$Event.gender == "M" &
medals$Medal %in% c("Silver", "Gold"), c("Year", "Medal", "NOC")]
icetab <- list()
icetab$data <- reshape(icehockeygs, idvar="Year", timevar="Medal",
direction="wide")
names(icetab$data) <- c("Year", "Gold", "Silver")
print(xtable(icetab$data,
caption ="Country Winning Gold and Silver Medals by Year in Mens Ice Hockey",
digits=0),
type="html",
include.rownames=FALSE,
caption.placement="top",
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:15:10 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> Country Winning Gold and Silver Medals by Year in Mens Ice Hockey </CAPTION>
<TR> <TH> Year </TH> <TH> Gold </TH> <TH> Silver </TH> </TR>
<TR> <TD align="right"> 1924 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1928 </TD> <TD> CAN </TD> <TD> SWE </TD> </TR>
<TR> <TD align="right"> 1932 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1936 </TD> <TD> GBR </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1948 </TD> <TD> CAN </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1952 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1956 </TD> <TD> URS </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1960 </TD> <TD> USA </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1964 </TD> <TD> URS </TD> <TD> SWE </TD> </TR>
<TR> <TD align="right"> 1968 </TD> <TD> URS </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1972 </TD> <TD> URS </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1976 </TD> <TD> URS </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1980 </TD> <TD> USA </TD> <TD> URS </TD> </TR>
<TR> <TD align="right"> 1984 </TD> <TD> URS </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1988 </TD> <TD> URS </TD> <TD> FIN </TD> </TR>
<TR> <TD align="right"> 1992 </TD> <TD> EUN </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1994 </TD> <TD> SWE </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1998 </TD> <TD> CZE </TD> <TD> RUS </TD> </TR>
<TR> <TD align="right"> 2002 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 2006 </TD> <TD> SWE </TD> <TD> FIN </TD> </TR>
</TABLE>
<h1>Reflections on the Conversion Process</h1>
<ul>
<li>Markdown versus LaTeX:
<ul>
<li>I prefer performing analyses with Markdown than I do with LateX. </li>
<li>Markdown is easier to type than LaTeX. </li>
<li>Markdown is easier to read than LaTeX.</li>
<li>It is easier with Markdown to get started with analyses.</li>
<li>Many analyses are only presented on the screen and as such page breaks in LaTeX are a nuisance. This extends to many features of LaTeX such as headers, figure and table placement, margins, table formatting, partiuclarly for long or wide tables, and so on.</li>
<li>That said, journal articles, books, and other artefacts that are bound to the model of a printed page are not going anywhere. </li>
<li>Furthermore, bibliographies, cross-references, elaborate control of table appearance, and more are all features which LaTeX makes easier than Markdown.</li>
</ul></li>
<li>R Markdown to Sweave LaTeX:
<ul>
<li>The more common conversion task that I can imagine is taking some simple analyses in R Markdown and having to convert them into knitr LaTeX in order to include the content in a journal article.</li>
<li>The first time I converted between the formats, it was good to do it in a relatively manual way to get a sense of all the required changes; however, if I had a large document or was doing the task on subsequent occasions, I would look at more automated solutions using string replacement tools (e.g., sed, or even just replacement commands in a text editor such as Vim), and markup conversion tools (e.g., pandoc).</li>
<li>Perhaps if the formats get popular enough, developers will start to build dedicated conversion tools.</li>
</ul></li>
</ul>
<h1>Additional Resources</h1>
<p>If you liked this post, you may want to subscribe to the <a href="http://feeds.feedburner.com/jeromyanglim">RSS feed of my blog</a>. Also see:</p>
<ul>
<li>This post on <a href="http://jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html">Getting Started with R Markdown, knitr, and Rstudio 0.96</a></li>
<li>This post for another <a href="http://jeromyanglim.blogspot.com/2012/05/example-reproducile-report-using-r.html">Example Reproducible Report using R Markdown which analyses California Schools Test Data</a></li>
<li>These <a href="http://jeromyanglim.blogspot.com.au/search/label/Sweave">Assorted posts using Sweave</a></li>
<li>The <a href="http://yihui.name/knitr/">knitr</a> home page and <a href="http://yihui.name/knitr/options">knitr options page</a>.</li>
<li>the <a href="http://cran.r-project.org/web/packages/xtable/vignettes/xtableGallery.pdf">xtable LaTeX table gallery</a> which can also be used to generate HTML tables for inclusion in Markdown.</li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-28168641275618013542012-05-18T21:22:00.000+10:002012-05-19T16:18:11.662+10:00Example Reproducible Report using R Markdown: Analysis of California Schools Test Data<p>This is a quick set of analyses of the California Test Score dataset. The post was produced using R Markdown in RStudio 0.96. The main purpose of this post is to provide a case study of using R Markdown to prepare a quick reproducible report. It provides examples of using plots, output, in-line R code, and markdown. The post is designed to be read along side the <a href="https://gist.github.com/2724711">R Markdown source code, which is available as a gist on github</a>. </p>
<a name='more'></a>
<h3>Preliminaries</h3>
<ul>
<li>This post builds on my earlier post which provided a guide for <a href="http://jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html">Getting Started with R Markdown, knitr, and RStudio 0.96</a></li>
<li>The dataset analysed comes from the <code>AER</code> package which is an accompaniment to the book <a href="http://www.amazon.com/Applied-Econometrics-R-Use/dp/0387773169">Applied Econometrics with R</a> written by <a href="http://wwz.unibas.ch/personen/profil/person/kleiber/">Christian Kleiber</a> and <a href="http://eeecon.uibk.ac.at/%7Ezeileis/">Achim Zeileis</a>.</li>
</ul>
<h3>Load packages and data</h3>
<pre><code class="r"># if necessary uncomment and install packages. install.packages('AER')
# install.packages('psych') install.packages('Hmisc')
# install.packages('ggplot2') install.packages('relaimpo')
library(AER) # interesting datasets
library(psych) # describe and psych.panels
library(Hmisc) # describe
library(ggplot2) # plots: ggplot and qplot
library(relaimpo) # relative importance in regression
</code></pre>
<pre><code class="r"># load the California Schools Dataset and give the dataset a shorter name
data(CASchools)
cas <- CASchools
# Convert grade to numeric
# table(cas$grades)
cas$gradesN <- cas$grades == "KK-08"
# Get the set of numeric variables
v <- setdiff(names(cas), c("district", "school", "county", "grades"))
</code></pre>
<h3>Q 1 What does the CASchools dataset involve?</h3>
<p>Quoting the help (i.e., <code>?CASchools</code>), the data is “from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999” and the variables are:</p>
<pre><code class="no-highlight">* district: character. District code.
* school: character. School name.
* county: factor indicating county.
* grades: factor indicating grade span of district.
* students: Total enrollment.
* teachers: Number of teachers.
* calworks: Percent qualifying for CalWorks (income assistance).
* lunch: Percent qualifying for reduced-price lunch.
* computer: Number of computers.
* expenditure: Expenditure per student.
* income: District average income (in USD 1,000).
* english: Percent of English learners.
* read: Average reading score.
* math: Average math score.
</code></pre>
<p>Let's look at the basic structure of the data frame. i.e., the number of observations and the types of values:</p>
<pre><code class="r">str(cas)
</code></pre>
<pre><code class="no-highlight">## 'data.frame': 420 obs. of 15 variables:
## $ district : chr "75119" "61499" "61549" "61457" ...
## $ school : chr "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ grades : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ students : num 195 240 1550 243 1335 ...
## $ teachers : num 10.9 11.1 82.9 14 71.5 ...
## $ calworks : num 0.51 15.42 55.03 36.48 33.11 ...
## $ lunch : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer : num 67 101 169 85 171 25 28 66 35 0 ...
## $ expenditure: num 6385 5099 5502 7102 5236 ...
## $ income : num 22.69 9.82 8.98 8.98 9.08 ...
## $ english : num 0 4.58 30 0 13.86 ...
## $ read : num 692 660 636 652 642 ...
## $ math : num 690 662 651 644 640 ...
## $ gradesN : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
</code></pre>
<pre><code class="r"># Hmisc::describe(cas) # For more extensive summary statistics
</code></pre>
<h3>Q. 2 To what extent does expenditure per student vary?</h3>
<pre><code class="r">qplot(expenditure, data = cas) + xlim(0, 8000) + xlab("Money spent per student ($)") +
ylab("Count of schools")
</code></pre>
<p><img src="http://i.imgur.com/EVAg2.png" alt="plot of chunk cas2"/> </p>
<pre><code class="r">
round(t(psych::describe(cas$expenditure)), 1)
</code></pre>
<pre><code class="no-highlight">## [,1]
## var 1.0
## n 420.0
## mean 5312.4
## sd 633.9
## median 5214.5
## trimmed 5252.9
## mad 487.2
## min 3926.1
## max 7711.5
## range 3785.4
## skew 1.1
## kurtosis 1.9
## se 30.9
</code></pre>
<p>The greatest expenditure per student is around double that of the least expenditure per student.</p>
<h3>Q. 3a What predicts expenditure per student?</h3>
<pre><code class="r"># Compute and format set of correlations
corExp <- cor(cas["expenditure"], cas[setdiff(v, "expenditure")])
corExp <- round(t(corExp), 2)
corExp[order(corExp[, 1], decreasing = TRUE), , drop = FALSE]
</code></pre>
<pre><code class="no-highlight">## expenditure
## income 0.31
## read 0.22
## math 0.15
## calworks 0.07
## lunch -0.06
## computer -0.07
## english -0.07
## teachers -0.10
## students -0.11
## gradesN -0.17
</code></pre>
<p>More is spent per student in schools :</p>
<ol>
<li>where people with greater incomes live</li>
<li>reading scores are higher</li>
<li>that are K-6</li>
</ol>
<h3>Q. 4 what is the relationship between district level maths and reading scores?</h3>
<pre><code class="r">ggplot(cas, aes(read, math)) + geom_point() + geom_smooth()
</code></pre>
<p><img src="http://i.imgur.com/RDniX.png" alt="plot of chunk cas4"/> </p>
<p>At the district level, the correlation is very strong (r = The correlation is <code>0.92</code>). From prior experience I'd expect correlations at the individual-level in the .3 to .6 range. Thus, these results are consistent with group-level relationships being much larger than individual-level relationships.</p>
<h3>Q. 5 What is the relationship between maths and reading after partialling out other effects?</h3>
<pre><code class="r"># command has strange syntax requiring column numbers rather than variable
# names
partial.r(cas[v], c(which(names(cas[v]) == "read"), which(names(cas[v]) ==
"math")), which(!names(cas[v]) %in% c("read", "math")))
</code></pre>
<pre><code class="no-highlight">## partial correlations
## read math
## read 1.00 0.72
## math 0.72 1.00
</code></pre>
<p>The partial correlation is still very strong but is substantially reduced.</p>
<h3>Q. 6 What fraction of a computer does each student have?</h3>
<pre><code class="r">cas$compstud <- cas$computer/cas$students
describe(cas$compstud)
</code></pre>
<pre><code class="no-highlight">## cas$compstud
## n missing unique Mean .05 .10 .25 .50 .75
## 420 0 412 0.1359 0.05471 0.06654 0.09377 0.12546 0.16447
## .90 .95
## 0.22494 0.24906
##
## lowest : 0.00000 0.01455 0.02266 0.02548 0.04167
## highest: 0.32770 0.34359 0.34979 0.35897 0.42083
</code></pre>
<pre><code class="r">qplot(compstud, data = cas)
</code></pre>
<pre><code class="no-highlight">## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
</code></pre>
<p><img src="http://i.imgur.com/SEbj0.png" alt="plot of chunk unnamed-chunk-4"/> </p>
<p>The mean number of computers per student is <code>0.136</code>.</p>
<h3>Q. 7 What is a good model of the combined effect of other variables on academic performance (i.e., math and read)?</h3>
<pre><code class="r"># Examine correlations between variables
psych::pairs.panels(cas[v])
</code></pre>
<p><img src="http://i.imgur.com/FeULR.png" alt="plot of chunk cas7"/> </p>
<p><code>pairs.panels</code> shows correlations in the upper triangle, scatterplots in the lower triangle, and variable names and distributions on the main diagonal.<br/>
After examining the plot several ideas emerge.</p>
<pre><code class="r"># (a) students is a count and could be log transformed
cas$studentsLog <- log(cas$students)
# (b) teachers is not the variable of interest:
# it is the number of students per teacher
cas$studteach <- cas$students /cas$teachers
# (c) computers is not the variable of interest:
# it is the ratio of computers to students
# table(cas$computer==0)
# Note some schools have no computers so ratio would be problematic.
# Take percentage of a computer instead
cas$compstud <- cas$computer / cas$students
# (d) math and reading are correlated highly, reduce to one variable
cas$performance <- as.numeric(
scale(scale(cas$read) + scale(cas$math)))
</code></pre>
<p>Normally, I'd add all these transformations to an initial data transformation file that I call in the first block, but for the sake of the narrative, I'll leave them here.</p>
<p>Let's examine correlations between predictors and outcome.</p>
<pre><code class="r">m1cor <- cor(cas$performance, cas[c("studentsLog", "studteach", "calworks",
"lunch", "compstud", "income", "expenditure", "gradesN")])
t(round(m1cor, 2))
</code></pre>
<pre><code class="no-highlight">## [,1]
## studentsLog -0.12
## studteach -0.23
## calworks -0.63
## lunch -0.87
## compstud 0.27
## income 0.71
## expenditure 0.19
## gradesN -0.16
</code></pre>
<p>Let's examine the multiple regression.</p>
<pre><code class="r">m1 <- lm(performance ~ studentsLog + studteach + calworks + lunch +
compstud + income + expenditure + grades, data = cas)
summary(m1)
</code></pre>
<pre><code class="no-highlight">##
## Call:
## lm(formula = performance ~ studentsLog + studteach + calworks +
## lunch + compstud + income + expenditure + grades, data = cas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8107 -0.2963 -0.0118 0.2712 1.5662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.99e-01 4.98e-01 1.80 0.072 .
## studentsLog -3.83e-02 1.91e-02 -2.01 0.045 *
## studteach -1.11e-02 1.59e-02 -0.70 0.487
## calworks 1.96e-03 2.96e-03 0.66 0.508
## lunch -2.65e-02 1.48e-03 -17.97 < 2e-16 ***
## compstud 7.88e-01 3.86e-01 2.04 0.042 *
## income 2.82e-02 4.89e-03 5.77 1.6e-08 ***
## expenditure 5.87e-05 4.90e-05 1.20 0.232
## gradesKK-08 -1.21e-01 6.49e-02 -1.87 0.062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.457 on 411 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.791
## F-statistic: 199 on 8 and 411 DF, p-value: <2e-16
##
</code></pre>
<p>And some indicators of predictor relative importance.</p>
<pre><code class="r"># calc.relimp from relaimpo package.
(m1relaimpo <- calc.relimp(m1, type = "lmg", rela = TRUE))
</code></pre>
<pre><code class="no-highlight">## Response variable: performance
## Total response variance: 1
## Analysis based on 420 observations
##
## 8 Regressors:
## studentsLog studteach calworks lunch compstud income expenditure grades
## Proportion of variance explained by model: 79.48%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## studentsLog 0.009973
## studteach 0.016695
## calworks 0.177666
## lunch 0.492866
## compstud 0.025815
## income 0.251769
## expenditure 0.014785
## grades 0.010432
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs 5Xs
## studentsLog -0.08771 -0.0650133 -0.0558756 -0.0519312 -4.926e-02
## studteach -0.11918 -0.0861199 -0.0629499 -0.0462155 -3.372e-02
## calworks -0.05473 -0.0427576 -0.0324658 -0.0233760 -1.535e-02
## lunch -0.03199 -0.0310310 -0.0301497 -0.0293300 -2.856e-02
## compstud 4.15870 3.0673338 2.2639604 1.6844348 1.287e+00
## income 0.09860 0.0850555 0.0726892 0.0614726 5.140e-02
## expenditure 0.00030 0.0001986 0.0001374 0.0001013 8.061e-05
## grades -0.45677 -0.3345683 -0.2529014 -0.1981200 -1.628e-01
## 6Xs 7Xs 8Xs
## studentsLog -4.626e-02 -4.252e-02 -3.833e-02
## studteach -2.418e-02 -1.687e-02 -1.109e-02
## calworks -8.399e-03 -2.612e-03 1.962e-03
## lunch -2.785e-02 -2.718e-02 -2.654e-02
## compstud 1.034e+00 8.828e-01 7.884e-01
## income 4.250e-02 3.477e-02 2.821e-02
## expenditure 6.882e-05 6.206e-05 5.871e-05
## grades -1.414e-01 -1.291e-01 -1.215e-01
</code></pre>
<p>Thus, we can conclude that:</p>
<ol>
<li>Income and indicators of income (e.g., low levels of lunch vouchers) are the two main predictors. Thus, schools with greater average income tend to have better student performance.</li>
<li>Schools with more computers per student have better student performance.</li>
<li>Schools with fewer students per teacher have better student performance.</li>
</ol>
<p>For more information about relative importance and the <code>relaimpo</code> package measures check out <a href="http://prof.beuth-hochschule.de/groemping/relaimpo/">Ulrike Grömping's website</a>.<br/>
Of course this is all observational data with the usual caveats regarding causal interpretation.</p>
<h2>Now, let's look at some weird stuff.</h2>
<h3>Q. 8.1 What are common words in Californian School names?</h3>
<pre><code class="r"># create a vector of the words that occur in school names
lw <- unlist(strsplit(cas$school, split = " "))
# create a table of the frequency of words in school names
tlw <- table(lw)
# extract cells of table with count greater than 3
tlw2 <- tlw[tlw > 3]
# sorted in decreasing order
tlw2 <- sort(tlw2, decreasing = TRUE)
# values as proportions
tlw2p <- round(tlw2/nrow(cas), 3)
# show this in a bar graph
tlw2pdf <- data.frame(word = names(tlw2p), prop = as.numeric(tlw2p),
stringsAsFactors = FALSE)
ggplot(tlw2pdf, aes(word, prop)) + geom_bar() + coord_flip()
</code></pre>
<p><img src="http://i.imgur.com/eqxKN.png" alt="plot of chunk unnamed-chunk-8"/> </p>
<pre><code class="r"># make it log counts
ggplot(tlw2pdf, aes(word, log(prop * nrow(cas)))) + geom_bar() +
coord_flip()
</code></pre>
<p><img src="http://i.imgur.com/NJPiK.png" alt="plot of chunk unnamed-chunk-9"/> </p>
<p>The word “Elementary” appears in almost all school names (<code>98.3</code>%). The word “Union” appears in around half (<code>43.3</code>%).</p>
<p>Other common words pertain to:</p>
<ul>
<li>Directions (e.g., South, West), </li>
<li>Features of the environment
(e.g., Creek, Vista, View, Valley)</li>
<li>Spanish words (e.g., rio for river; san for saint)</li>
</ul>
<h3>Q. 8.2 Is the number of letters in the school's name related to academic performance?</h3>
<pre><code class="r">cas$namelen <- nchar(cas$school)
table(cas$namelen)
</code></pre>
<pre><code class="no-highlight">##
## 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39
## 1 4 9 26 28 31 33 27 30 45 38 28 36 30 18 10 5 4 6 3 1 2 2 2 1
</code></pre>
<pre><code class="r">round(cor(cas$namelen, cas[, c("read", "math")]), 2)
</code></pre>
<pre><code class="no-highlight">## read math
## [1,] 0.03 0
</code></pre>
<p>The answer appears to be “no”.</p>
<h3>Q. 8.3 Is the number of words in the school name related to academic performance?</h3>
<pre><code class="r">cas$nameWordCount <- sapply(strsplit(cas$school, " "), length)
table(cas$nameWordCount)
</code></pre>
<pre><code class="no-highlight">##
## 2 3 4 5
## 140 202 72 6
</code></pre>
<pre><code class="r">round(cor(cas$nameWordCount, cas[, c("read", "math")]), 2)
</code></pre>
<pre><code class="no-highlight">## read math
## [1,] 0.05 0.01
</code></pre>
<p>The answer appears to be “no”.</p>
<h3>Q. 8.4 Are schools with nice popular nature words in their name doing better academically?</h3>
<pre><code class="r">tlw2p #recall the list of popular names
</code></pre>
<pre><code class="no-highlight">## lw
## Elementary Union City Valley Joint View
## 0.983 0.433 0.060 0.040 0.031 0.019
## Pleasant San Creek Oak Santa Lake
## 0.017 0.017 0.014 0.014 0.014 0.012
## Mountain Park Rio Vista Grove Lakeside
## 0.012 0.012 0.012 0.012 0.010 0.010
## South Unified West
## 0.010 0.010 0.010
</code></pre>
<pre><code class="r"># Create a quick and dirty list of popular nature names
naturenames <- c("Valley", "View", "Creek", "Lake", "Mountain", "Park",
"Rio", "Vista", "Grove", "Lakeside")
# work out whether the word is in the school name
schsplit <- strsplit(cas$school, " ")
cas$hasNature <- sapply(schsplit, function(X) length(intersect(X,
naturenames)) > 0)
round(cor(cas$hasNature, cas[, c("read", "math")]), 2)
</code></pre>
<pre><code class="no-highlight">## read math
## [1,] 0.09 0.08
</code></pre>
<p>So we've found a small correlation.<br/><br/>
Let's graph the data to see what it means:</p>
<pre><code class="r">ggplot(cas, aes(hasNature, read)) + geom_boxplot() + geom_jitter(position = position_jitter(width = 0.1)) +
xlab("Has a nature name") + ylab("Mean student reading score")
</code></pre>
<p><img src="http://i.imgur.com/TyyL3.png" alt="plot of chunk unnamed-chunk-14"/> </p>
<p>So in the sample nature schools have slightly better reading score (and if we were to graph it, maths scores). However, the number of schools having nature names is actually somewhat small (n= <code>61</code>) despite the overall quite large sample size.</p>
<p>But is it statistically significant?</p>
<pre><code class="r">t.read <- t.test(cas[cas$hasNature, "read"], cas[!cas$hasNature,
"read"])
t.math <- t.test(cas[cas$hasNature, "math"], cas[!cas$hasNature,
"math"])
</code></pre>
<p>So, the p-value is less than .05 for reading (p = <code>0.046</code>) but not quite for maths (p = <code>0.083</code>). Bingo! After a little bit of data fishing we have found that reading scores are “significantly” greater for those schools with the listed nature names.</p>
<p><strong>But wait</strong>: I've asked three separate exploratory questions or perhaps six if we take maths into account.</p>
<ul>
<li>$\frac{.05}{3} =$ <code>0.0167</code></li>
<li>$\frac{.05}{6} =$ <code>0.0083</code></li>
</ul>
<p>At these Bonferonni corrected p-values, the result is non-significant. Oh well…</p>
<h2>Review</h2>
<p>Anyway, the aim of this post was not to make profound statements about California schools. Rather the aim was to show how easy it is to produce quick reproducible reports with R Markdown. If you haven't already, you may want to open up <a href="https://gist.github.com/2724711">the R Markdown file used to produce this post</a> in RStudio, and compile the report yourself.</p>
<p>In particular, I can see R Markdown being my tool of choice for:</p>
<ul>
<li>Blog posts</li>
<li>Posts to StackExchange sites</li>
<li>Materials for training workshops</li>
<li>Short consulting reports, and</li>
<li>Exploratory analyses as part of a larger project.</li>
</ul>
<p>The real question is how far I can push Markdown before I start to miss the control of LaTeX. Markdown does permit arbitrary HTML. Anyway, if you have any thoughts about the scope of R Markdown, feel free to add a comment.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-91418556557499428922012-05-17T14:31:00.000+10:002012-05-17T23:42:31.496+10:00Getting Started with R Markdown, knitr, and Rstudio 0.96<p>This post examines the features of <a href="http://www.rstudio.org/docs/authoring/using_markdown">R Markdown</a>
using <a href="http://yihui.name/knitr/">knitr</a> in Rstudio 0.96.
This combination of tools provides an exciting improvement in usability for
<a href="http://stats.stackexchange.com/a/15006/183">reproducible analysis</a>.
Specifically, this post
(1) discusses getting started with R Markdown and <code>knitr</code> in Rstudio 0.96;
(2) provides a basic example of producing console output and plots using R Markdown;
(3) highlights several code chunk options such as caching and controlling how input and output is displayed;
(4) demonstrates use of standard Markdown notation as well as the extended features of formulas and tables; and
(5) discusses the implications of R Markdown.
This post was produced with R Markdown. The <a href="https://gist.github.com/2716336">source code is available here as a gist</a>.
The post may be most useful if the source code and displayed post are viewed side by side.
In some instances, I include a copy of the R Markdown in the displayed HTML, but most of the time I assume you are reading the source and post side by side.</p>
<a name='more'></a>
<h2>Getting started</h2>
<p>To work with R Markdown, if necessary:</p>
<ul>
<li>Install <a href="http://www.r-project.org/">R</a></li>
<li>Install the lastest version of <a href="http://rstudio.org/download/">RStudio</a> (at time of posting, this is 0.96)</li>
<li>Install the latest version of the <code>knitr</code> package: <code>install.packages("knitr")</code></li>
</ul>
<p>To run the basic working example that produced this blog post:</p>
<ul>
<li>Open R Studio, and go to File - New - R Markdown</li>
<li>If necessary install <code>ggplot2</code> and <code>lattice</code> packages: <code>install.packages("ggplot2"); install.packages("lattice")</code></li>
<li>Paste in the contents of <a href="https://gist.github.com/2716336">the gist (which contains the R Markdown file used to produce this post)</a> and save the file with an <code>.rmd</code> extension</li>
<li>Click Knit HTML</li>
</ul>
<pre><code class="r">opts_knit$set(upload.fun = imgur_upload) # upload all images to imgur.com
</code></pre>
<h2>Prepare for analyses</h2>
<pre><code class="r">set.seed(1234)
library(ggplot2)
library(lattice)
</code></pre>
<h2>Basic console output</h2>
<p>To insert an R code chunk, you can type it manually or just press <code>Chunks - Insert chunks</code> or use the shortcut key. This will produce the following code chunk:</p>
<pre><code class="no-highlight">```{r}
```
</code></pre>
<p>Pressing tab when inside the braces will bring up code chunk options.</p>
<p>The following R code chunk labelled <code>basicconsole</code> is as follows:</p>
<pre><code class="no-highlight">```{r basicconsole}
x <- 1:10
y <- round(rnorm(10, x, 1), 2)
df <- data.frame(x, y)
df
```
</code></pre>
<p>The code chunk input and output is then displayed as follows:</p>
<pre><code class="r">x <- 1:10
y <- round(rnorm(10, x, 1), 2)
df <- data.frame(x, y)
df
</code></pre>
<pre><code class="no-highlight">## x y
## 1 1 1.31
## 2 2 2.31
## 3 3 3.36
## 4 4 3.27
## 5 5 5.04
## 6 6 6.11
## 7 7 8.43
## 8 8 8.98
## 9 9 8.38
## 10 10 9.27
</code></pre>
<h2>Plots</h2>
<p>Images generated by <code>knitr</code> are saved in a figures folder. However, they also appear to be represented in the HTML output using a <a href="http://en.wikipedia.org/wiki/Data_URI_scheme">data URI scheme</a>. This means that you can paste the HTML into a blog post or discussion forum and you don't have to worry about finding a place to store the images; they're embedded in the HTML.</p>
<h3>Simple plot</h3>
<p>Here is a basic plot using base graphics:</p>
<pre><code class="no-highlight">```{r simpleplot}
plot(x)
```
</code></pre>
<pre><code class="r">plot(x)
</code></pre>
<p><img src="http://i.imgur.com/JRrm8.png" alt="plot of chunk simpleplot"/> </p>
<p>Note that unlike traditional Sweave, there is no need to write <code>fig=TRUE</code>.</p>
<h3>Multiple plots</h3>
<p>Also, unlike traditional Sweave, you can include multiple plots in one code chunk:</p>
<pre><code class="no-highlight">```{r multipleplots}
boxplot(1:10~rep(1:2,5))
plot(x, y)
```
</code></pre>
<pre><code class="r">boxplot(1:10 ~ rep(1:2, 5))
</code></pre>
<p><img src="http://i.imgur.com/TW0G1.png" alt="plot of chunk multipleplots"/> </p>
<pre><code class="r">plot(x, y)
</code></pre>
<p><img src="http://i.imgur.com/36WWn.png" alt="plot of chunk multipleplots"/> </p>
<h3><code>ggplot2</code> plot</h3>
<p>Ggplot2 plots work well:</p>
<pre><code class="r">qplot(x, y, data = df)
</code></pre>
<p><img src="http://i.imgur.com/s5mct.png" alt="plot of chunk ggplot2ex"/> </p>
<h3><code>lattice</code> plot</h3>
<p>As do lattice plots:</p>
<pre><code class="r">xyplot(y ~ x)
</code></pre>
<p><img src="http://i.imgur.com/qXKUO.png" alt="plot of chunk latticeex"/> </p>
<p>Note that unlike traditional Sweave, there is no need to print lattice plots directly.</p>
<h2>R Code chunk features</h2>
<h3>Create Markdown code from R</h3>
<p>The following code hides the command input (i.e., <code>echo=FALSE</code>), and outputs the content directly as code (i.e., <code>results=asis</code>, which is similar to <code>results=tex</code> in Sweave).</p>
<pre><code class="no-highlight">```{r dotpointprint, results='asis', echo=FALSE}
cat("Here are some dot points\n\n")
cat(paste("* The value of y[", 1:3, "] is ", y[1:3], sep="", collapse="\n"))
```
</code></pre>
<p>Here are some dot points</p>
<ul>
<li>The value of y[1] is 1.31</li>
<li>The value of y[2] is 2.31</li>
<li>The value of y[3] is 3.36</li>
</ul>
<h3>Create Markdown table code from R</h3>
<pre><code class="no-highlight">```{r createtable, results='asis', echo=FALSE}
cat("x | y", "--- | ---", sep="\n")
cat(apply(df, 1, function(X) paste(X, collapse=" | ")), sep = "\n")
```
</code></pre>
<table><thead>
<tr>
<th>x</th>
<th>y</th>
</tr>
</thead><tbody>
<tr>
<td>1</td>
<td>1.31</td>
</tr>
<tr>
<td>2</td>
<td>2.31</td>
</tr>
<tr>
<td>3</td>
<td>3.36</td>
</tr>
<tr>
<td>4</td>
<td>3.27</td>
</tr>
<tr>
<td>5</td>
<td>5.04</td>
</tr>
<tr>
<td>6</td>
<td>6.11</td>
</tr>
<tr>
<td>7</td>
<td>8.43</td>
</tr>
<tr>
<td>8</td>
<td>8.98</td>
</tr>
<tr>
<td>9</td>
<td>8.38</td>
</tr>
<tr>
<td>10</td>
<td>9.27</td>
</tr>
</tbody></table>
<h3>Control output display</h3>
<p>The folllowing code supresses display of R input commands (i.e., <code>echo=FALSE</code>)
and removes any preceding text from console output (<code>comment=""</code>; the default is <code>comment="##"</code>).</p>
<pre><code class="no-highlight">```{r echo=FALSE, comment="", echo=FALSE}
head(df)
```
</code></pre>
<pre><code class="no-highlight"> x y
1 1 1.31
2 2 2.31
3 3 3.36
4 4 3.27
5 5 5.04
6 6 6.11
</code></pre>
<h3>Control figure size</h3>
<p>The following is an example of a smaller figure using <code>fig.width</code> and <code>fig.height</code> options.</p>
<pre><code class="no-highlight">```{r smallplot, fig.width=3, fig.height=3}
plot(x)
```
</code></pre>
<pre><code class="r">plot(x)
</code></pre>
<p><img src="http://i.imgur.com/fg18e.png" alt="plot of chunk smallplot"/> </p>
<h3>Cache analysis</h3>
<p>Caching analyses is straightforward.
Here's example code.
On the first run on my computer, this took about 10 seconds.
On subsequent runs, this code was not run. </p>
<p>If you want to rerun cached code chunks, just <a href="http://stackoverflow.com/a/10629121/180892">delete the contents of the <code>cache</code> folder</a></p>
<pre><code class="no-highlight">```{r longanalysis, cache=TRUE}
for (i in 1:5000) {
lm((i+1)~i)
}
```
</code></pre>
<h2>Basic markdown functionality</h2>
<p>For those not familiar with standard <a href="http://daringfireball.net/projects/markdown/">Markdown</a>, the following may be useful.
See the source code for how to produce such points. However, RStudio does include a Markdown quick reference button that adequately covers this material.</p>
<h3>Dot Points</h3>
<p>Simple dot points:</p>
<ul>
<li>Point 1</li>
<li>Point 2</li>
<li>Point 3</li>
</ul>
<p>and numeric dot points:</p>
<ol>
<li>Number 1</li>
<li>Number 2</li>
<li>Number 3</li>
</ol>
<p>and nested dot points:</p>
<ul>
<li>A
<ul>
<li>A.1</li>
<li>A.2</li>
</ul></li>
<li>B
<ul>
<li>B.1</li>
<li>B.2</li>
</ul></li>
</ul>
<h3>Equations</h3>
<p>Equations are included by using LaTeX notation and including them either between single dollar signs (inline equations) or double dollar signs (displayed equations).
If you hang around the Q&A site <a href="http://stats.stackexchange.com">CrossValidated</a>, you'll be familiar with this idea.</p>
<p>There are inline equations such as $y_i = \alpha + \beta x_i + e_i$.</p>
<p>And displayed formulas:</p>
<p>$$\frac{1}{1+\exp(-x)}$$</p>
<p>knitr provides self-contained HTML code that calls a Mathjax script to display formulas.
However, in order to include the script in my blog posts I <a href="https://gist.github.com/2716053">took the script</a> and incorporated it into my blogger template.
If you are viewing this post through syndication or an RSS reader, this may not work.
You may need to view this post on my website. </p>
<h3>Tables</h3>
<p>Tables can be included using the following notation</p>
<table><thead>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead><tbody>
<tr>
<td>1</td>
<td>Male</td>
<td>Blue</td>
</tr>
<tr>
<td>2</td>
<td>Female</td>
<td>Pink</td>
</tr>
</tbody></table>
<h3>Hyperlinks</h3>
<ul>
<li>If you like this post, you may wish to subscribe to <a href="http://feeds.feedburner.com/jeromyanglim">my RSS feed</a>.</li>
</ul>
<h3>Images</h3>
<p>Here's an example image:</p>
<p><img src="http://i.imgur.com/RVNmr.jpg" alt="image from redmond barry building unimelb"/></p>
<h3>Code</h3>
<p>Here is Markdown R code chunk displayed as code:</p>
<pre><code class="no-highlight">```{r}
x <- 1:10
x
```
</code></pre>
<p>And then there's inline code such as <code>x <- 1:10</code>.</p>
<h3>Quote</h3>
<p>Let's quote some stuff:</p>
<blockquote>
<p>To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,</p>
</blockquote>
<h2>Conclusion</h2>
<ul>
<li>R Markdown is awesome.
<ul>
<li>The ratio of markup to content is excellent. </li>
<li>For exploratory analyses, blog posts, and the like R Markdown will be a powerful productivity booster. </li>
<li>For journal articles, LaTeX will presumably still be required.</li>
</ul></li>
<li>The RStudio team have made the whole process very user friendly.
<ul>
<li>RStudio provides useful shortcut keys for compiling to HTML, and running code chunks. These shortcut keys are presented in a clear way.</li>
<li>The incorporated extensions to Markdown, particularly formula and table support, are particularly useful.</li>
<li>Jump-to-chunk feature facilitates navigation. It helps if your code chunks have informative names.</li>
<li>Code completion on R code chunk options is really helpful. See also <a href="http://yihui.name/knitr/options">chunk options documentation on the knitr website</a>.</li>
</ul></li>
<li>Other recent posts on R markdown include those by :
<ul>
<li><a href="http://christophergandrud.blogspot.com.au/2012/05/dynamic-content-with-rstudio-markdown.html">Christopher Gandrud</a></li>
<li><a href="http://lamages.blogspot.com.au/2012/05/interactive-reports-in-r-with-knitr-and.html">Markcus Gesmann</a></li>
<li><a href="http://rstudio.org/docs/authoring/using_markdown">Rstudio on R Markdown</a></li>
<li><a href="http://yihui.name/knitr/">Yihui Xie</a>: I really want to thank him for developing <code>knitr</code>.
He has also posted <a href="https://github.com/yihui/knitr/blob/master/inst/examples/knitr-minimal.Rmd">this example of R Markdown</a>.</li>
</ul></li>
</ul>
<h2>Questions</h2>
<p>The following are a few questions I encountered along the way that might interest others.</p>
<h3>Annoying <code><br/></code>'s</h3>
<p><strong>Question:</strong> I asked on the Rstudio discussion site:
<a href="http://support.rstudio.org/help/discussions/problems/2329-why-does-r-markdown-to-html-insert-br-when-there-is-a-new-line-of-text">Why does Markdown to HTML insert <code><br/></code> on new lines?</a></p>
<p><strong>Answer:</strong> I just do a find and delete on this text for now.
Specifically, I have a sed command that extracts just the content between the <code>body</code> tags and removes <code>br</code> tags.
I can then, readily incorporate the result into my blogposts.</p>
<pre><code>sed -i -e '1,/<body>/d' -e'/^<\/body>/,$d' -e 's/<br\/>$//' filename.html
</code></pre>
<h3>Temporarily disable caching</h3>
<p><strong>Question:</strong> I asked on StackOverflow about
<a href="http://stackoverflow.com/q/10628665/180892">How to set cache=FALSE for a knitr markdown document and override code chunk settings?</a></p>
<p><strong>Answer:</strong> Delete the cache folder. But there are other possible workflows.</p>
<h3>Equivalent of Sexpr</h3>
<p><strong>Question:</strong> I asked on Stack Overvlow about <a href="http://stackoverflow.com/q/10629416/180892">whether there an R Markdown equivalent to Sexpr in Sweave?</a>.</p>
<p><strong>Answer:</strong> Include the code between brackets of “backtick r space” and “backtick”.
E.g., in the source code I have calculated 2 + 2 = <code>4</code> .</p>
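<p>Written out literally, that inline chunk looks like this:</p>
<pre><code class="no-highlight">`r 2 + 2`
</code></pre>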
<h3>Image format</h3>
<p><strong>Question:</strong> When using the URI scheme images don't appear to display in RSS feeds of my blog.
What's a good strategy?</p>
<p><strong>Answer:</strong> One strategy is to upload to imgur.
The <a href="https://raw.github.com/yihui/knitr/master/inst/examples/knitr-upload.Rmd">following provides</a> an example of exporting to imgur.</p>
<p>Add the following lines of code near the top of the file:</p>
<pre><code class="no-highlight">``` {r optsknit}
opts_knit$set(upload.fun = imgur_upload) # upload all images to imgur.com
```
</code></pre>
<p>I found that the function failed when I was at work behind a firewall, but worked at home.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-46935181122898193342012-05-03T22:31:00.000+10:002012-05-04T09:19:38.367+10:00How to plot three categorical variables and one continuous variable using ggplot2<p>This post shows how to produce a plot involving three categorical variables
and one continuous variable using <code>ggplot2</code> in R.</p>
<a name='more'></a>
<p>The <a href="https://gist.github.com/2585249">following code is also available as a gist on github</a>.</p>
<h5>1. Create Data</h5>
<p>First, let's load <code>ggplot2</code> and create some data to work with:</p>
<pre><code>library(ggplot2)
set.seed(4444)
Data <- expand.grid(group = c("Apples", "Bananas", "Carrots", "Durians",
                              "Eggplants"),
                    year = c("2000", "2001", "2002"),
                    quality = c("Grade A", "Grade B", "Grade C", "Grade D",
                                "Grade E"))
Group.Weight <- data.frame(
    group = c("Apples", "Bananas", "Carrots", "Durians", "Eggplants"),
    group.weight = c(1, 1, -1, 0.5, 0))
Quality.Weight <- data.frame(
    quality = c("Grade A", "Grade B", "Grade C", "Grade D", "Grade E"),
    quality.weight = c(1, 0.5, 0, -0.5, -1))
Data <- merge(Data, Group.Weight)
Data <- merge(Data, Quality.Weight)
Data$score <- Data$group.weight + Data$quality.weight +
    rnorm(nrow(Data), 0, 0.2)
Data$proportion.tasty <- exp(Data$score) / (1 + exp(Data$score))
</code></pre>
<h5>2. Produce Plot</h5>
<p>And here's the code to produce the plot.</p>
<pre><code>ggplot(data = Data,
       aes(x = factor(year), y = proportion.tasty,
           group = group,
           shape = group,
           color = group)) +
    geom_line() +
    geom_point() +
    # the original post used opts(title = ...), which later versions of
    # ggplot2 deprecated; labs(title = ...) is the current equivalent
    labs(title = "Proportion Tasty by Year, Quality, and Group") +
    scale_x_discrete("Year") +
    scale_y_continuous("Proportion Tasty") +
    facet_grid(. ~ quality)
</code></pre>
<p>And here's what it looks like:</p>
<p><a href="http://imgur.com/UWZgd"><img src="http://i.imgur.com/UWZgd.png"
title="three categorical variables ggplot2" width=520 alt="three categorical variables ggplot2" /></a></p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-21619095752847848542012-04-11T15:50:00.029+10:002022-11-03T20:29:59.323+11:00Getting Started with JAGS, rjags, and Bayesian Modelling<p>This post provides links to various resources on getting started with Bayesian modelling using JAGS and R. It discusses: (1) what is JAGS; (2) why you might want to perform Bayesian modelling using JAGS; (3) how to install JAGS; (4) where to find further information on JAGS; (5) where to find examples of JAGS scripts in action; (6) where to ask questions; and (7) some interesting psychological applications of Bayesian modelling.</p>
<a name='more'></a>
<h3>What is JAGS?</h3>
<p>JAGS stands for Just Another Gibbs Sampler. To quote the program author, Martyn Plummer, "It is a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation..." It uses a dialect of the BUGS language that is similar to, but a little different from, the dialects used by OpenBUGS and WinBUGS.</p>
<h3>Why JAGS?</h3>
<p>The question of why you might want to use JAGS can be approached in several different ways:</p>
<ul>
<li><p><strong>Why Bayesian rather than Null Hypothesis Significance Testing approaches?</strong></p>
<ul>
<li>To quote John D. Cook quoting Anthony O'Hagan, the benefits of the Bayesian approach are that it is "1. fundamentally sound, 2. very flexible, 3. produces clear and direct inferences, and 4. makes use of all available information" (see <a href="http://www.johndcook.com/blog/2009/04/28/reasons-to-use-bayesian-inference/">John's blog post for elaboration</a>).</li>
<li>John K. Kruschke made a similar argument in an Open Letter extolling the benefits of the Bayesian approach, summarised as: "(1) Scientific disciplines from astronomy to zoology are moving to Bayesian data analysis. We should be leaders of the move, not followers. (2) Modern Bayesian methods provide richer information, with greater flexibility and broader applicability than 20th century methods. Bayesian methods are intellectually coherent and intuitive. Bayesian analyses are readily computed with modern software and hardware. (3) Null-hypothesis significance testing (NHST), with its reliance on p values, has many problems. There is little reason to persist with NHST now that Bayesian methods are accessible to everyone."</li>
</ul>
</li>
<li><p><strong>Why JAGS/BUGS rather than coding in a low-level language?</strong></p>
<ul>
<li>It's simpler; for models that BUGS can handle, BUGS can shield you from some of the thorny details related to numeric integration.</li>
<li>There are simple interfaces with R.</li>
</ul>
</li>
<li><p><strong>Why JAGS rather than WinBUGS or OpenBUGS?</strong></p>
<ul>
<li>I'm using JAGS because it works well on Ubuntu. WinBUGS is largely Windows-specific, although I've read that it may run under the compatibility layer Wine.</li>
<li>JAGS interfaces well with R. I'm comfortable writing scripts. Thus, I don't personally see the benefits of using a dedicated GUI like WinBUGS. I can leverage what I know about R.</li>
<li>However, ultimately converting code between different flavours of BUGS is not that difficult.</li>
<li>For further discussion of the issue, see <a href="http://stats.stackexchange.com/questions/9202/openbugs-vs-jags">this discussion on CrossValidated</a>.</li>
</ul>
</li>
</ul>
<p>More than anything I found that JAGS provided a useful entry point into the world of Bayesian modelling. This in turn appealed to me for several reasons:</p>
<ol>
<li> Even when I perform analyses using an NHST approach, I often intuitively think of empirical research questions in terms of a probability density on a parameter of interest that changes as empirical and theoretical evidence accumulates. See, for example, Thompson's (2002) concept of meta-analytic thinking. Bayesian analysis provides tools for formalising this orientation.</li>
<li> More broadly, I appreciate the explicitness that a Bayesian approach requires and encourages. E.g., specifying the distribution of the error term, specifying a prior, specifying the distribution of parameters in a mixed effects model, and so on.</li>
<li> There are several modelling challenges that I'm currently working through where a Bayesian approach offers substantial flexibility and applicability. In particular, I'm interested in modelling individual differences in the effect of practice on strategy use and task performance and then relating these individual differences to factors like intelligence, prior experience, and personality.</li>
</ol>
<h3>JAGS Installation</h3>
<p>JAGS runs on Linux, Mac, and Windows. I run JAGS on Ubuntu through an interface with R called <code>rjags</code>.</p>
<p>The following sets out a basic installation process:</p>
<ol>
<li> If necessary <a href="http://www.r-project.org/">Download and install R</a> and potentially a user interface to R like <a href="http://rstudio.org/">R Studio</a> (see <a href="http://jeromyanglim.blogspot.com.au/2009/06/learning-r-for-researchers-in.html">here for tips on getting started with R</a>).</li>
<li> <a href="http://mcmc-jags.sourceforge.net/">Download and install JAGS</a> as per operating system requriements.</li>
<li> Install additional R packages: e.g., in R, run <code>install.packages("rjags")</code>. In particular, I use the packages <code>rjags</code> to interface with JAGS and <code>coda</code> to process MCMC output. (A minimal script for checking the installation follows this list.)</li>
</ol>
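<p>To check that JAGS, <code>rjags</code>, and <code>coda</code> are all talking to each other, a minimal sketch along the following lines fits a normal distribution to simulated data (the model string and object names are just illustrative):</p>
<pre><code>library(rjags)   # depends on, and loads, coda

# A tiny JAGS model: normal likelihood with vague priors.
# Note that JAGS parameterises dnorm() by precision (tau = 1/variance),
# not by standard deviation.
model.string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu  ~ dnorm(0, 1.0E-4)
  tau ~ dgamma(0.001, 0.001)
}
"

set.seed(1234)
y <- rnorm(50, mean = 100, sd = 15)

m <- jags.model(textConnection(model.string),
                data = list(y = y, N = length(y)),
                n.chains = 3)
update(m, 1000)                         # burn-in
samples <- coda.samples(m, variable.names = c("mu", "tau"),
                        n.iter = 5000)
summary(samples)                        # posterior summaries for mu and tau
plot(samples)                           # trace and density plots via coda
</code></pre>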
<h3>Information on JAGS</h3>
<ul>
<li><p>The <a href="http://sourceforge.net/projects/mcmc-jags/files/Manuals/">manual for different versions of JAGS is located here</a>. Several particularly relevant sections include:</p>
<ul>
<li>the list of supported distributions and how they are parameterised. This is often important given that the code looks similar to R but often uses different parameterisation (e.g., precision is used instead of standard deviation for a normal distribution).</li>
<li>It summarises differences between WinBUGS and JAGS.</li>
<li>It sets out available functions and operators.</li>
</ul>
</li>
<li><p>The <a href="http://cran.r-project.org/web/packages/rjags/rjags.pdf"><code>rjags</code> help pdf</a> for information about how to interface with JAGS from R.</p></li>
<li><a href="http://martynplummer.wordpress.com/">Martin Plummer has a blog called JAGS NEWS</a></li>
<li>The <a href="">Bayesian Task View on CRAN</a> lists and briefly describes the many R packages related to Bayesian statistics.</li>
<li>Lunn and colleagues have a 2009 article in <em>Statistics in Medicine</em> called <a href="">The BUGS project: Evolution, critique and future directions</a>. It provides a useful historical perspective on the broader BUGS project, although it does not mention much about JAGS specifically.</li>
</ul>
<h3>Example JAGS Scripts</h3>
<p>I find it easier to pick up a new language by playing with examples. The following provides links to example JAGS code, often with accompanying explanations:</p>
<ul style="text-align: left;">
<li>Justin Esarey
<ul>
<li>An entire course on <a href="http://jee3.web.rice.edu/teaching.htm">Bayesian Statistics</a> with examples in R and JAGS. It includes 10 lectures and each lecture lasts around 2 hours. The content is designed for a social science audience and it includes a syllabus linking with Simon Jackman's text. The videos are linked from above or available directly on <a href="http://www.youtube.com/playlist?list=PLAFC5F02F224FA59F">YouTube</a>.</li>
</ul>
</li>
<li><p>John Myles White</p>
<ul>
<li>A course on statistical models that is under development with <a href="https://github.com/johnmyleswhite/JAGSExamples">JAGS scripts on github</a></li>
<li>A <a href="http://www.johnmyleswhite.com/notebook/2011/03/16/canabalt-revisited-gamma-distributions-multinomial-distributions-and-more-jags-goodness/">model of Cannabalt scores using a gamma distribution</a></li>
<li><a href="http://www.johnmyleswhite.com/notebook/2010/08/20/using-jags-in-r-with-the-rjags-package/">Simple introductory examples of fitting a normal distribution, linear regression, and logistic regression</a></li>
<li>A follow-up post demonstrating the use of the <code>coda</code> package with <code>rjags</code> to <a href="http://www.johnmyleswhite.com/notebook/2010/08/29/mcmc-diagnostics-in-r-with-the-coda-package/">perform MCMC diagnostics</a>.</li>
</ul>
</li>
<li><p>John K. Kruschke</p>
<ul>
<li>John Kruschke wrote a book called <em>Doing Bayesian Data Analysis: A Tutorial with R and BUGS</em>. It's an excellent entry point into the world of Bayesian statistics for the social and behavioural scientist who has reasonable quantitative training, but is not necessarily ready to absorb the kinds of books that are used in graduate-level statistics courses.</li>
<li> See this <a href="http://doingbayesiandataanalysis.blogspot.com.au/2012/01/complete-steps-for-installing-software.html">blog post for a link to the zip file containing the JAGS code</a>.</li>
</ul>
</li>
<li><p>BUGS Project</p>
<ul>
<li>BUGS is well known for the large set of examples that accompany the project.</li>
<li>You can see the <a href="http://sourceforge.net/projects/mcmc-jags/files/Examples/2.x/">JAGS code used to run these examples here</a>.</li>
</ul>
</li>
<li><p>Patrick J Mineault</p>
<ul>
<li>An <a href="http://xcorr.net/2011/07/13/gibbs-sampling-made-easy-jags-rkward-coda/">example from Gelman et al examining the effect of training programs on SAT scores</a></li>
</ul>
</li>
<li><p>Simon Jackman wrote the book <em>Bayesian Analysis for the Social Sciences</em> that has accompanying JAGS code.</p></li>
</ul>
<p>More broadly, examples and tutorials designed for WinBUGS can generally be adapted to be useful for JAGS. So for example, you can explore these WinBUGS examples:</p>
<ul>
<li>Michael Lee and Eric-Jan Wagenmakers have a free online book called <em>A Course in Bayesian Graphical Modeling for Cognitive Science</em>.</li>
<li>The website for the book <a href="http://www.dme.ufrj.br/mcmc/">Markov Chain Monte Carlo</a> has several WinBUGS examples.</li>
<li>There is an <a href="http://www.mrc-bsu.cam.ac.uk/bugs/weblinks/webresource.shtml">extensive list of BUGS resources</a> on the BUGS project website.</li>
</ul>
<h3>Asking questions</h3>
<p>There are several places to ask questions about JAGS, R, and Bayesian statistics.</p>
<ul>
<li><a href="http://stats.stackexchange.com/questions/tagged/jags">JAGS</a>, <a href="http://stats.stackexchange.com/questions/tagged/bugs">BUGS</a>, and <a href="http://stats.stackexchange.com/questions/tagged/bayesian">bayesian</a> questions on <a href="http://stats.stackexchange.com/">stats.stackexchange.com</a> (aka CrossValidated).</li>
<li><a href="http://sourceforge.net/projects/mcmc-jags/forums/">JAGS discussion forum</a></li>
<li>There's also a <a href="http://www.mrc-bsu.cam.ac.uk/bugs/overview/list.shtml">BUGS discussion list</a></li>
</ul>
<p>In general, I prefer the Stack Exchange model for asking and answering questions on the internet, although the most important issue is typically where the experts are located.</p>
<h3>Interesting Psychological Applications of Bayesian Modelling</h3>
<p>If you want to see some examples of Bayesian modelling applied to psychological data, I found the following articles quite interesting. PDFs are available online.</p>
<ul>
<li>Shiffrin, Lee, Kim, and Wagenmakers (2008) present a tutorial on hierarchical bayesian methods in the context of cognitive science.</li>
<li>Michael Lee (2011) in Journal of Mathematical Psychology discusses the benefits of hierarchical Bayesian methods for modelling psychological data and provides several example applications.</li>
<li>Lee Averell and Andrew Heathcote (2010) in Journal of Mathematical Psychology analyse individual differences in the forgetting curve using a hierarchical Bayesian approach.</li>
</ul>
<p>If you know of any other interesting JAGS resources or have any comments about my choice of software for Bayesian data analysis, feel free to post a comment.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-33248605629426381582012-02-17T21:21:00.000+11:002012-02-17T21:21:56.072+11:00New Psychology and Cognitive Science Question and Answer Site: COGSCI.SE<p>There is now a new website for researchers to ask and answer
questions on topics related to psychology and cognitive science.
The site is <a href="http://cogsci.stackexchange.com/">cogsci.stackexchange.com</a>.
Given the success of earlier-released <a href="http://stackexchange.com/sites">sites in the Stack Exchange
network</a> such as those on
<a href="http://stackoverflow.com/">programming</a>,
<a href="http://stats.stackexchange.com/">statistics</a>,
and <a href="http://tex.stackexchange.com/">latex</a>,
the site for psychology and cognitive science has the potential to be a great
resource for researchers.
<a href="http://cogsci.stackexchange.com/users/52/jeromy-anglim">I'm actively
contributing</a> on the site.
So, if you are a researcher in psychology, I hope you'll <a href="http://cogsci.stackexchange.com/">check it
out</a>.
The rest of this post sets out
(a) a little history of Stack Exchange question and answer sites as they relate to psychology and statistics;
(b) why I think this <a href="http://cogsci.stackexchange.com/">new site for psychology and cognitive science</a>
has so much potential; and
(c) why, if you are a professional or student researcher in psychology, you
might want to get involved.</p>
<a name='more'></a>
<h3>A little history</h3>
<p>I first learnt about the Stack Exchange network back in 2009.
While I was busy learning R, a number of people in the online data science world
such as <a href="http://www.cerebralmastication.com/">JD Long</a>,
<a href="http://www.oscon.com/oscon2009/public/schedule/detail/10432">Michael Driscoll</a>,
<a href="http://www.drewconway.com/zia/?p=1172">Drew Conway</a>, and <a href="https://twitter.com/#!/rstatsmob/following">many
more</a> were promoting a programmer's
question and answer site called Stack Overflow as a place to ask and answer R
related questions.
It was a site pitched at overcoming the many problems of discussion boards,
mailing lists, and the like: e.g., off-topic threads, spam, extended discussions,
difficulty finding the correct answers, poor indexing by Google, etc.
As of February 2012, it has over <a href="http://stackoverflow.com/questions/tagged/r?sort=votes&pagesize=50">10,000 questions with the R
tag</a>.</p>
<p>Shortly afterwards, the Stack Exchange Network developed <a href="http://area51.stackexchange.com">Area51</a>.
The idea was to take the question and answer infrastructure that made Stack
Overflow a success in the programming world, and extend it to all sorts of other
domains.
Instead of following the model of Quora or, shudder to think, Yahoo Answers,
Stack Exchange did not permit the creation of a site until a sufficient
community of active users existed to maintain the site at a high standard.
Thus, my main interests, statistics and psychology, were going to have to wait.</p>
<p>A site for statistics questions was the first to join the network
(<a href="http://stats.stackexchange.com/">stats.stackexchange.com</a>).
<a href="http://robjhyndman.com/researchtips/crossvalidated/">Professor Rob Hyndman proposed the
site</a>, and perhaps given the
overlapping worlds of programmers and data analysts, the
site launched a few months later in July 2010.
At the time of posting it has over <a href="http://stats.stackexchange.com/questions?sort=votes">7,000
questions</a>.
<a href="http://stats.stackexchange.com/users/183/jeromy-anglim">I've been actively involved</a> in the site.
I've used it to get advice on my own research.
I've also used it extensively in various statistical consulting roles.
In particular, I've <a href="http://jeromyanglim.blogspot.com.au/2011/03/how-to-ask-me-statistics-question.html">encouraged others who would otherwise send me an email about
statistics, to post the question on
stats.se</a>
so that any answer can be an ongoing resource for others.</p>
<p>In the case of psychology and cognitive science, I've had to wait a lot longer.
The overlap between programming and psychology communities is much smaller,
and site proposals were split over separate cognitive science, psychology, and
psychiatry proposals.
Finally, in December 2011 these three proposals were merged and on January 19th 2012 the
site was launched in Beta under the title Cognitive Sciences at the url
<a href="http://cogsci.stackexchange.com/">cogsci.stackexchange.com</a>.
Although the initial name suggests a focus on "cognitive" science, the history of
the merging of site proposals, the plural "sciences" in the title, and the
attitude of current participants all admit the full range of questions in
cognitive science, psychology, and psychiatry.</p>
<p>At the time of posting the site is growing at a healthy rate.
Most questions are getting good answers, and the community norms around question
quality, references, scope and so on are being clarified on the <a href="http://meta.cogsci.stackexchange.com/">meta
site</a>.
However, there is also the challenge of getting the word out about the site to
academics, researchers, and graduate students who are not otherwise familiar
with the Stack Exchange network of sites.
In my opinion, Stack Exchange provides the best currently available
infrastructure for building a high quality question and answer site.
However, it still relies on a community of expert contributors.</p>
<p>So, if you're a researcher in psychology or cognitive science, why might you
want to get involved? And why might you want to tell fellow researchers about
the site?</p>
<h3>Reasons to participate as an academic</h3>
<p>If you are an academic, Lecturer, or Post Doc, there are many reasons why you
might want to participate:</p>
<ul>
<li>Answering questions is a way of facilitating knowledge transfer to the
broader community; this can be intrinsically enjoyable especially when you get direct
feedback on the number of people who read your answers.</li>
<li>If you use your own name, as many people do, the voting and reputation system,
and various other mechanisms provide a means for your contribution to be
recognised.</li>
<li>You get immediate feedback on what others think of your answers; thus, it
creates an environment conducive to learning.</li>
<li>I see sites like Stack Exchange as part of a broader model of open science.
As you create and develop knowledge, you encounter challenges. The idea is to
record these challenges as questions and then add the resolutions as answers.
Thus, when others encounter the same problems, good answers are only a Google
search away.
I'm not saying that question and answer sites replace journal articles, but
they can fill a bridging role linking the language of questions to the answers
provided in journal articles.</li>
<li>Furthermore, the content on Stack Exchange is licenced under creative commons,
so even if the site disappeared the content would still be available on other
sites that reproduce the material. This is much better
than the policy of almost all journals, which copyright your (typically state-sponsored)
research and lock it up behind a pay wall, thereby frustrating the
process of knowledge dissemination.</li>
<li>While contributing to Wikipedia is another great way to improve the sum of all
knowledge, on Stack Exchange your answers generally stay as you wrote them;
on Wikipedia, in contrast, your contributions can be, and often are, completely
rewritten or removed by other editors.</li>
</ul>
<h3>Reasons to participate as a student researcher</h3>
<p>If you are doing a thesis in psychology or cognitive science, or possibly even if
you are just studying a few subjects, many of the above reasons for
participating will also apply.</p>
<p>However, you may also find that the capacity to ask questions will be
particularly useful. As a side point, if it is early days in your career, you
may or may not want to use your real name.</p>
<p>In particular, I'd encourage students completing a thesis to incorporate asking
and answering questions into their scholarly process.
You might encounter questions like:</p>
<ul>
<li>Is there a meta analysis on X?</li>
<li>What are the main theories about Y?</li>
<li>What is the best measure of Z?</li>
<li>What is the empirical support for theory W?</li>
</ul>
<p>These kinds of questions come up all the time when doing research.
Of course, as researchers we have strategies for finding answers ourselves.
However, the stack exchange model encourages you to learn from the answers of
others and also to "leave crumbs" so that others can follow in your footsteps
more easily. The idea is not to be shy. Post questions frequently. If you're able
to answer your question, contribute a self-answer.</p>
<p>Thus, even if only a handful of people ever read a thesis, by asking and
answering many questions along the way, resources will be left that thousands of
people will learn from and discover through Google searches in the years to
come. Even if you don't have answers, your question can be the trigger for an
expert to share knowledge to create a valuable Internet artefact.</p>
<h3>Getting Started</h3>
<p>If you want to learn more or give the site a go:</p>
<ul>
<li>Have a read through the <a href="http://cogsci.stackexchange.com/faq">FAQ on cogsci.se</a></li>
<li><a href="http://cogsci.stackexchange.com/questions/ask">Ask a question</a></li>
<li>See if you can answer one of the <a href="http://cogsci.stackexchange.com/unanswered">currently unanswered
questions</a></li>
</ul>
<p>I'm <a href="http://cogsci.stackexchange.com/users/52/jeromy-anglim">floating around on the
site</a>, so if you're a
researcher in psychology, I hope to see you
<a href="http://cogsci.stackexchange.com/">there</a>.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-52561059035694041782011-07-24T21:40:00.000+10:002013-11-14T15:59:39.298+11:00Tips for Undergraduates Interested in a Career in Organisational Psychology: Australian Perspective<p>Undergraduate psychology students often ask me about careers in organisational
psychology.
This post aims to provide a few links and resources to assist such students to
learn about the profession and the career pathways.
The post includes (a) a basic description of organisational psychology, (b)
links to Australian educational and professional society resources, (c)
discussion of PhD and academic options, and (d) additional resources to learn
more about the profession.</p>
<a name='more'></a>
<h3>Overview of Organisational Psychology</h3>
<h4>What is the profession called?</h4>
<p>Before discussing the profession, some consideration should be given to what to
call it.
'Organisational psychology' goes by various names and abbreviations:</p>
<ul>
<li>Organisational Psychology (Org Psych) </li>
<li>Industrial/Organisational Psychology (I/O or I/O Psych)</li>
<li>Work Psychology</li>
<li>Occupational Psychology </li>
</ul>
<p>Different names imply both historical and present differences in focus.
However, such terms are also often used interchangeably.
See <a href="
http://www.siop.org/userfiles/file/What's%20In%20A%20Name.pdf">SIOP's 'What's in a Name?'</a>
article for an overview of various job titles.</p>
<p>Names vary by region.
In the United States, "I/O" is preferred.
In Australia, "Organisational Psychology" is arguably the more common term,
consistent with the APS college name and many course names.
Thus, I'll tend to use this term in this post.</p>
<h4>What is organisational psychology?</h4>
<p>Here are a few descriptions:</p>
<ul>
<li>"Organisational Psychology is the science of people at work. Organisational
psychologists specialise in analysing organisations and their people, and
devising strategies to recruit, motivate, develop, change and inspire." -
<a href="
http://www.groups.psychology.org.au/Assets/Files/COP%20AGM%202008%20Reports.pdf
">prize winning elevator pitch (APS COP) </a></li>
<li>Industrial / Organisational psychologists "Apply principles of psychology to
personnel, administration, management, sales, and marketing problems.
Activities may include policy planning; employee screening, training and
development; and organizational development and analysis. May work with
management to reorganize the work setting to improve worker productivity." -
<a href="
http://online.onetcenter.org/link/summary/19-3032.00">Industrial/Organisational Psychologist job description on O*NET</a></li>
<li>"Industrial-organizational (I-O) psychology is the scientific study of the
workplace. Rigor and methods of psychology are applied to issues of critical
relevance to business, including talent management, coaching, assessment,
selection, training, organizational development, performance, and work-life
balance." - <a href="http://www.siop.org/studentdefault.aspx">SIOP: Student Section</a></li>
</ul>
<p>The APS College of Organisational Psychology has a page that describes <a href="
http://www.groups.psychology.org.au/cop/about_us/org_psychologists/">"What is an
organisational psychologist" and "Areas of Specialisation"</a>.</p>
<h4>Learning more about the profession:</h4>
<p>A good strategy for learning more about the profession is to browse the various
society pages:</p>
<ul>
<li><a href="
http://www.siop.org/default.aspx">SIOP - Division of the American Psychological Association</a>: The United States is huge; and I/O
is huge in the United States. The SIOP web page has heaps of
useful online resources.</li>
<li><a href="
http://www.bps.org.uk/dop/">Division of Occupational Psychology: British Psychological Society</a></li>
<li><a href="
http://www.groups.psychology.org.au/cop/">Australian Psychological Society: College of Organisational Psychologists</a></li>
</ul>
<h3>Organisational Psychology in Australia</h3>
<h4>Registration</h4>
<ul>
<li>"Psychologist" is a regulated term in Australia.
It is illegal to call yourself a psychologist, if you are not appropriately registered.</li>
<li>Pathways to registration are set out by the <a href="
http://www.psychologyboard.gov.au/">Psychology Board of Australia</a>.</li>
<li>The traditional pathway for registration has involved first completing a four
year accredited undergraduate psychology sequence, followed by either two
years of supervised practice or the completion of an accredited post-graduate
program (e.g., Masters, Doctorate, Masters / PhD).
Over recent years, rules for registration have been changing.
So, make sure you do your own research.</li>
<li>I should also mention that even if you can't call yourself a "psychologist",
completing an undergraduate major in psychology, particularly one with honours
in psychology (and perhaps also an undergraduate
subject in organisational psychology) can open doors to many roles related to
organisational psychology (e.g., HR, selection and recruitment,
marketing research, etc.).</li>
</ul>
<h4>Finding organisational psychology university programs in Australia:</h4>
<ul>
<li>The <a href="http://www.psychologycouncil.org.au/">APAC accreditation site</a> lists
approved postgraduate psychology programs. </li>
<li>To find organisational psychology courses, last I checked, the following
worked
<ul>
<li>Click <a href="http://www.psychologycouncil.org.au/course-search/australia/">Search for courses - Australia</a></li>
<li>Click on the State you want to search for</li>
<li>Search for "<code>org</code>"</li>
</ul></li>
</ul>
<h4>Groups and networking opportunities</h4>
<p><a href="
http://www.groups.psychology.org.au/cop/">The Australian Psychological Society: College of Organisational Psychologists</a> is the main group representing
organisational psychologists in Australia. <br />
It is made up of various state branches. <br />
The society sometimes runs sessions suited to students wanting to
learn more (e.g., careers fairs).</p>
<p>A few informal online groups are also good places to learn more about the
profession in Australia. Both welcome professionals and students:</p>
<ul>
<li>Facebook group: <a href="http://www.facebook.com/group.php?gid=2355243978">Organisational Psychology in
Australia</a> </li>
<li>LinkedIn group: <a href="http://www.linkedin.com/groups?home=&gid=147918&trk=anet_ug_hm">Organisational Psychology in
Australia</a></li>
</ul>
<h4>Salary Surveys</h4>
<p>There are many reasons to find a career in organisational psychology
intellectually stimulating and meaningful.
There have also often been financial reasons to find it attractive:</p>
<ul>
<li><a href="
http://www.groups.psychology.org.au/Assets/Files/Salary_Survey_exec_summary.pdf">Australia: A slightly dated 2006 APS COP Salary survey</a></li>
<li><a href="http://www.siop.org/surveys.aspx">United States: SIOP Salary surveys</a></li>
</ul>
<h3>A career in academia</h3>
<h4>PhD on a topic related to organisational psychology or related area</h4>
<ul>
<li>Doing a PhD on a topic related to organisational psychology can create
many opportunities.
Such a PhD can open up doors to academic positions in a wide range of
departments including psychology, HRM, management, business, and so on.
The solid background in statistics and research methods provides a
particular advantage for an academic career.
Of course, academic positions are competitive and generally require a good
publication track record.</li>
<li>Choosing a good PhD supervisor is important.
In addition to supervisors in departments that offer organisational psychology
programs, it's possible to look at supervisors in departments and universities
that don't offer such programs.</li>
<li>The skills learnt can also readily be applied in many social science research
related roles in industry.</li>
</ul>
<h4>Examples of eminent organisational psychology academics</h4>
<p>For those considering pursuing an academic career related to organisational
psychology, the <a href="
http://www.siop.org/awardwinners.aspx">past SIOP award recipients</a>,
particularly in the categories Distinguished Scientific Contribution, and
Distinguished Early Career Contribution, provide motivating examples of
successful I/O psychology researchers.</p>
<h4>Example academic websites</h4>
<p>The following links point to examples of successful academics in I/O psychology. <br />
I also selected these particular pages because each one provides PDFs for many
of the respective academic's publications.
This can give a flavour of the kind of work, focus, and specialisation that an
academic in I/O might engage in.</p>
<ul>
<li><a href="http://users.ugent.be/~flievens/">Filip Lievens</a></li>
<li><a href="http://www.bsos.umd.edu/psyc/gelfand/research.html">Michele Gelfand</a></li>
<li><a href="http://iopsych.msu.edu/koz/main.htm">Steve Kozlowski</a></li>
<li><a href="http://www.krannert.purdue.edu/directory/publications.asp?id=7090">Michael Campion</a></li>
<li><a href="http://www.management.wharton.upenn.edu/grant/">Adam Grant</a></li>
<li><a href="http://people.tamu.edu/~mbarrick/pubs.htm">Murray Barrick</a></li>
</ul>
<h4>Journals to read</h4>
<p>Further understanding of the research done in organisational psychology and
related disciplines can be gained from reading some of the core journals.
A good starting point can be gained by perusing
<a href="
http://www.siop.org/tip/backissues/TipApr01/03Zicker.aspx">the following ranked list of journals generated by Michael Zickar and Scott
Highhouse</a> back in
2001 based on a survey of SIOP members:</p>
<ol>
<li>Journal of Applied Psychology</li>
<li>Personnel Psychology</li>
<li>Academy of Management Journal</li>
<li>Academy of Management Review</li>
<li>Organizational Behavior and Human Decision Processes</li>
<li>Administrative Science Quarterly</li>
<li>Journal of Management</li>
<li>Journal of Organizational Behavior</li>
<li>Organizational Research Methods</li>
<li>Journal of Vocational Behavior</li>
</ol>
<h3>Additional Resources</h3>
<ul>
<li>Richard Landers has a series of posts providing advice on pursuing a career in
I/O psychology from the U.S. perspective:
<ul>
<li><a href="
http://neoacademic.com/2011/06/14/grad-school-should-i-get-a-ph-d-or-masters-in-io-psychology/">PhD or Masters in I/O</a> </li>
<li><a href="http://neoacademic.com/2011/07/19/grad-school-prepping-for-the-gre/">Prepping for the GRE</a> </li>
<li><a href="http://neoacademic.com/io-blogosphere/">He also lists other I/O Blogs</a></li>
</ul></li>
<li><a href="http://www.siop.org/tip/tip.aspx">TIP</a> is the official newsletter of SIOP.
Current and back issues are available online and provide a good insight into
the profession including the interface between professional practice and
scientific research.</li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-44315509213499130342011-07-17T22:30:00.000+10:002011-07-17T22:30:26.567+10:00Correlation Resources: SPSS, R, Causality, Interpretation, and APA Style Reporting<p>This post provides links to a range of resources related to the use and
interpretation of correlations.
I wanted to provide a page with links to a number of additional resources that
would be useful both for those of my students who might be keen to learn more
and for anyone else who might be interested.
Specifically, this post provides links to:
(a) introductory book-style chapters on correlation,
(b) resources related to assorted issues in correlation (i.e., discussion of
causal inference, correlation with various variable types, range restriction,
statistical power, correlation interpretation, and significance testing),
(c) tutorials on computing correlations using SPSS and R, and
(d) tips for reporting correlations in APA Style.</p>
<a name='more'></a>
<h3>Introductions to correlation</h3>
<p>The following provide general textbook style overviews of correlation:</p>
<ul>
<li><a href="
http://davidakenny.net/statbook/chapter_16.pdf">David Kenny's Chapter 16 Testing Measures of Association</a> provides a textbook overview
of correlation designed for psychology undergraduate students.
It also includes several practice questions.
David Kenny has kindly made his <a href="
http://davidakenny.net/statbook/">entire textbook 'Statistics for the Social
and Behavioral Sciences' available online for free</a> as either an <a href="http://davidakenny.net/statbook/kenny87.pdf">overall pdf</a> or
<a href="http://davidakenny.net/statbook/">individual chapters</a>.</li>
<li><a href="
http://www.psychstat.missouristate.edu/introbook/sbk17m.htm">David Stockburger's Introductory Statistics chapter on Correlation</a></li>
<li><a href="
http://web.psych.unimelb.edu.au/jkanglim/correlationandreggression.pdf">My own slides and notes on correlation</a> </li>
</ul>
<h3>Assorted Issues</h3>
<h4>Correlation and Causation</h4>
<p>Knowing how to reason about causality in the behavioural and social sciences is
a really important skill.</p>
<ul>
<li>Check out <a href="
http://jeromyanglim.blogspot.com/2009/10/how-to-reason-about-causes-in.html">this earlier post on correlation and causation</a>
which includes links to PDFs of important journal articles on the topic.</li>
<li><a href="http://www.youtube.com/watch?v=6RzDMEW5omc">Joy of Stats on Correlation</a>
provides a 4 minute video with a few entertaining examples of correlations and
their connection with causal inference.</li>
</ul>
<h4>Types of variables</h4>
<p>The prototypical correlation example is based on two continuous, normally
distributed variables.
However, in practice there are many other types of variables that you might
wish to correlate.
The following pages provide links to suggestions for how to analyse some
other common scenarios:</p>
<ul>
<li><a href="
http://stats.stackexchange.com/questions/3730/pearsons-or-spearmans-correlation-with-non-normal-data">What to do when one of the variables is non-normal?</a></li>
<li><a href="
http://stats.stackexchange.com/questions/8956/spearmans-or-pearsons-correlation-with-likert-scales-where-linearity-and-homosc">What to do when one of the variables is a Likert item?</a></li>
<li><a href="
http://jeromyanglim.blogspot.com/2009/10/analysing-ordinal-variables.html">What to do if you want to treat a variable as ordinal?</a></li>
</ul>
<h4>Range restriction</h4>
<ul>
<li>HyperStat has a general discussion of <a href="http://davidmlane.com/hyperstat/A68809.html">range
restriction</a></li>
<li>See this <a href="http://cnx.org/content/m11196/latest/">simulation on connexions showing the effect of range
restriction</a></li>
</ul>
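<p>A small simulation sketch (with arbitrary values) shows the attenuation: restricting the sample to cases more than one standard deviation above the mean on X substantially reduces the observed correlation.</p>
<pre><code>set.seed(2011)
x <- rnorm(10000)
y <- 0.6 * x + rnorm(10000, sd = 0.8)
cor(x, y)                          # full-range correlation (around .6)

restricted <- x > 1                # keep only high scorers on x
cor(x[restricted], y[restricted])  # attenuated correlation
</code></pre>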
<h4>Statistical Power</h4>
<p>Statistical power within the context of correlation is the probability of
obtaining a statistically significant correlation in a study, given a true
population correlation of a particular size and a particular sample size
(a small example using the <code>pwr</code> package follows the list below).</p>
<ul>
<li><a href="
http://jeromyanglim.blogspot.com/2010/05/statistical-power-analysis-in-gpower-3.html">This earlier post</a>
provides (a) some simple rules of thumb for power analysis for correlations,
(b) how to calculate statistical power using free software called G-Power,
and (c) links to additional reading on the important topic of statistical
power.</li>
</ul>
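<p>As a quick illustration, the <code>pwr</code> package in R (assuming it is installed) will solve for whichever quantity is left unspecified, e.g., the sample size needed to detect a correlation of a given size:</p>
<pre><code>library(pwr)

# sample size needed to detect r = .3 with 80% power at alpha = .05
pwr.r.test(r = 0.3, power = 0.80, sig.level = 0.05)
</code></pre>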
<h4>Interpretation</h4>
<p>When I first learnt about the correlation coefficient, I found it
challenging to truly grok what a particular value meant.
Learning the standard interpretation was easy.
The challenging part was understanding the practical and theoretical
implications for a correlation of a given size.</p>
<ul>
<li><p>The following are some of the <strong>standard interpretations</strong> of a correlation:</p>
<ul>
<li>Pearson's correlation is an index of the direction and strength of linear
association between two variables.</li>
<li>The square of the correlation between X and Y is the percentage of
variance shared between X and Y (e.g., if <code>r = .50</code>, then the two variables
share <code>.50 * .50 = 25%</code> of variance).</li>
<li>If X and Y were standardised (i.e., made so that the mean of both
variables was zero and the standard deviation was one) then, the
correlation would be the same as the regression coefficient of X
predicting Y or Y predicting X.
Thus, for example, if <code>r = .25</code> you could say that "a value one standard deviation
greater on X predicts a .25 standard deviation greater value on Y" (this
equivalence is checked in the short R sketch following this list).</li>
</ul></li>
<li><p>Strategies for <strong>building an intuition</strong> of what a correlation means:</p>
<ul>
<li>Play with the <a href="
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/">Regression by Eye</a> simulation.
The simulation generates a scatterplot, and you are asked to indicate which of
a set of correlations corresponds to the scatterplot.
It helps to build a mapping between the graphical intuitiveness of a
scatterplot and the numeric summary of the linear association in the
scatterplot (i.e., the correlation coefficient).</li>
<li>Memorise some of the rules of thumb for describing correlation effect sizes
(see this <a href="http://www.statisticshell.com/effectsizes.pdf">discussion by Andy
Field</a>), but don't take the
rules of thumb too seriously.</li>
<li>Try to build up a frame of reference for correlations in different contexts by
reading results sections. Meta analyses can also be particularly useful in
this regard.</li>
<li>Read the article 'Meyer, G. J., et al (2001). Psychological Testing and Psychological
Assessment: A Review of Evidence and Issues. <em>American Psychologist, 56</em>(2),
128-165.' (<a href="https://mywebspace.wisc.edu/hmarleau/web/edwards/psychometrics/myers.pdf">PDF</a>)
which provides large tables of meta-analytic correlations for a wide range of
medical and psychological domains sorted by the size of the correlation.
Studying these tables can help build an intuition and a context for
interpretation of correlations.</li>
</ul></li>
</ul>
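<p>The equivalence between the correlation and the regression coefficient for standardised variables is easy to check in R (a sketch with simulated data):</p>
<pre><code>set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

cor(x, y)                      # Pearson correlation
coef(lm(scale(y) ~ scale(x)))  # the slope equals the correlation
</code></pre>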
<h4>Graphical approaches</h4>
<p>As with most statistical techniques, there are various ways of representing the
data.
The correlation coefficient provides a very brief summary of the association
between two variables.
However, graphical representations of association are much richer.</p>
<p>The following are some general heuristics that I find useful when plotting data
that might also be represented as a correlation (a few of these are sketched in
code after the list):</p>
<ul>
<li>Use scatterplots to explore features of the association (e.g., presence of
outliers, linearity, distributional properties, spread of data around any
trend line, etc.);</li>
<li>If one of the variables is positively skewed, consider plotting the
corresponding axis on a log scale;</li>
<li>If there are a lot of data points (e.g., <code>n > 1000</code>), adopt a different strategy
such as using some form of partial transparency (e.g., see use of the <a href="http://had.co.nz/ggplot2/geom_point.html">alpha
property in ggplot2</a>), or sampling
the data;</li>
<li>If one of the variables takes on a limited number of discrete categories,
consider using a jitter or a sunflower plot;</li>
<li>If there are three or more variables, consider using a scatterplot matrix;</li>
<li>Fitting some form of trend line is often useful;</li>
<li>Adjust the size of the plotting character to the sample size (for bigger n,
use a smaller plotting character).</li>
</ul>
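<p>A quick sketch of a few of these heuristics using <code>ggplot2</code> (simulated data; the alpha level and choice of trend line are purely illustrative):</p>
<pre><code>library(ggplot2)

# simulated data: many points and a positively skewed x variable
set.seed(123)
d <- data.frame(x = rlnorm(5000))
d$y <- log(d$x) + rnorm(5000)

ggplot(d, aes(x = x, y = y)) +
    geom_point(alpha = 0.2) +      # partial transparency for overplotting
    scale_x_log10() +              # log scale for the skewed variable
    geom_smooth(method = "lm")     # fitted trend line
</code></pre>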
<!--http://stats.stackexchange.com/questions/13148/good-online-resource-with-tips-on-graphing-association-between-two-numeric-variab-->
<h4>Significance tests on correlations</h4>
<p>There are a wide range of possible significance tests that can be performed on
correlations.
The following resources provide suggestions for different scenarios (a minimal
sketch for one common case follows the list).</p>
<ul>
<li><a href="
http://jeromyanglim.blogspot.com/2009/09/significance-tests-on-correlations.html">General post on comparing significance of two correlations</a>
under various conditions.</li>
<li><a href="
http://www.une.edu.au/WebStat/unit_materials/c6_common_statistical_tests/test_signif_pearson.html">Significance of correlation using Pearson's table</a></li>
</ul>
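<p>For the common case of comparing two correlations from independent samples, a minimal sketch of the Fisher r-to-z approach (with made-up values) looks like this:</p>
<pre><code># Fisher r-to-z test for two correlations from independent samples
r1 <- 0.50; n1 <- 100
r2 <- 0.30; n2 <- 120

z1 <- atanh(r1)                 # atanh() is the Fisher z transform
z2 <- atanh(r2)
se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z  <- (z1 - z2) / se
2 * pnorm(-abs(z))              # two-tailed p-value
</code></pre>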
<h3>Statistical Software</h3>
<p>Calculating a correlation coefficient and its associated statistical
significance is a standard task that almost any statistical package can perform.
Many psychology students are taught to use SPSS. It is a proprietary data
analysis system (i.e., you can't run it at home without a paid licence) with a
strong emphasis on a GUI and on making it easy to perform various standardised
analyses common in the social sciences.</p>
<p>My preferred tool for performing data analysis is R.
It is open source (thus, you can run it at home for free) and is often described
as the lingua franca of statistics. It generally requires a more sophisticated
understanding of statistics and computing to use effectively.
Thus, for the interested psychology student or researcher I have this
<a href="
http://jeromyanglim.blogspot.com/2009/06/learning-r-for-researchers-in.html">introduction to R for researchers in psychology</a>.</p>
<p>Below I list resources for performing correlation analysis in SPSS and R.</p>
<h4>SPSS</h4>
<ul>
<li><a href="http://www.statisticshell.com/correlation.pdf">Andy Field has a chapter on correlation</a>
which discusses correlation using SPSS. </li>
<li><a href="http://www.youtube.com/watch?v=loFLqZmvfzU">This video tutorial on running and interpreting a correlation analysis using
SPSS</a> goes for about 7 minutes
and is elementary.</li>
</ul>
<h4>R</h4>
<p>R makes it easy to perform correlations on datasets.
Specifically, the following links provide example syntax, and a minimal
illustration follows the list:
<ul>
<li><a href="http://www.statmethods.net/stats/correlations.html">Quick-R on correlations</a></li>
<li><a href="http://www.statmethods.net/graphs/scatterplot.html">Quick-R on scatterplots</a></li>
<li>More generally, William Revelle has some great resources on <a href="http://personality-project.org/r/r.guide.html">R for
psychology</a>.</li>
</ul>
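<p>As a minimal illustration using the built-in <code>mtcars</code> data:</p>
<pre><code># correlation between two variables, with a significance test
cor(mtcars$wt, mtcars$mpg)
cor.test(mtcars$wt, mtcars$mpg)

# Spearman's rho as a non-parametric alternative
cor(mtcars$wt, mtcars$mpg, method = "spearman")

# correlation matrix for several variables, rounded for readability
round(cor(mtcars[, c("mpg", "wt", "hp", "disp")]), 2)

# quick scatterplot matrix
pairs(mtcars[, c("mpg", "wt", "hp", "disp")])
</code></pre>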
<h3>Reporting Correlations in APA Style</h3>
<ul>
<li><strong>APA Style Manual:</strong> When required to report results using APA style, the
authoritative source is the <a href="http://www.apastyle.org/">Publication Manual of the
APA</a>.</li>
<li><strong>Article Deconstruction:</strong> Another general strategy is to find a journal
article that (a) reports a similar statistical test as you require, and (b)
that is published in an APA journal
or at least is in a journal that uses APA style.
<ul>
<li><a href="http://www.apa.org/pubs/journals/">APA journals are listed here</a></li>
<li>A quick search on <a href="http://scholar.google.com.au/">Google Scholar</a> will
often be sufficient and quicker, although PsycInfo (a subscription
service) is more reliable if you have access to it (many universities do).
E.g., a quick search for <a href="http://scholar.google.com.au/scholar?hl=en&q=apa+%22significant+correlation+between%22+psychology&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0">apa "significant correlation between"
psychology</a>
revealed several relevant articles and some with immediate PDF access.</li>
<li>I also have a separate post on this general approach of <a href="
http://jeromyanglim.blogspot.com/2009/09/introduction-to-journal-article.html">deconstructing
journal articles</a>
to discern writing principles.</li>
</ul></li>
<li><strong>Correlation Matrices:</strong> Many psychological studies, particularly those based on
correlational/observational designs, involve the measurement of a range of
numeric variables.
It is particularly useful, and common, in such cases to report a correlation
matrix between sets of variables.
I have a <a href="
http://jeromyanglim.blogspot.com/2009/02/formatting-correlation-matrices-in.html">post with instructions on formatting a correlation matrix</a>
in APA style using a combination of SPSS, Excel, and Word.
The post also includes links to examples of correlation matrices being
reported.</li>
<li><a href="http://huberb.people.cofc.edu/Guide/Reporting_Statistics%20in%20Psychology.pdfs">General overview of reporting statistics including correlations using APA
style</a></li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.com