Jeromy Anglim's Blog: Psychology and Statistics<br />
Posts on statistics, study design, statistical computing, R, and more with a focus on research applications in psychology.<br />
<br />
<h1>A Publication Workflow for Organising Files and Directories (2020-08-04)</h1>
The following describes my workflow for publishing journal articles. It defines a set of rules for organising the files and directories associated with writing and publishing a peer-reviewed journal article.<div>It covers issues of file organisation, version control, and collaboration. It embodies a number of lessons that I've learnt while publishing journal articles, and it helps to have a standardised approach.</div><div><br /></div><div><div><div>In this context, the project is the publication of a journal article.</div><div><br /></div><div><b>Short Project Name</b></div><div><div>Every project needs a short name that uniquely identifies the project. This name is used in several settings including the parent directory name, the data analysis directory name, the manuscript name, and when talking to colleagues about the project.</div><div><br /></div><div>A good project name is short, descriptive, and uniquely identifies the project. Two words is usually best, but three words is okay. Eight to 15 characters is usually about right. It's a little bit like thinking of a good running head, but even shorter and more for private consumption.</div><div><br /></div><div><b>Examples: </b>Some recent short project names for my papers include: "hexaco-ei", "hexaco-wellbeing", "employee-facets", "hexaco-applicants", "subtask-learning", and "dynamic-wellbeing".</div><div><br /></div><div>Project names can be bad for a range of reasons.</div><div><ul style="text-align: left;"><li><b>Too long.</b> This makes file names hard to read. It makes it difficult to talk to colleagues about the paper. It makes it more mentally taxing to think about the project by name.</li><li><b>Conflicts with other projects.</b> If you have multiple projects in an area, it's important to distinguish the focal project from other similar projects.</li><li><b>Not descriptive enough.</b> It is best to think about the defining feature of the study. A bad name fails to bring the project to mind.</li></ul></div></div><div><b>Parent Directory Name</b></div><div>Every project has a parent directory. This is the directory that contains all the core files of the project.</div><div>The parent directory is named:</div><div><b>"short-name-year"</b> or <b>"short-name-year-storagemode"</b></div><div><b><br /></b></div><div>By storage mode, I refer to tools like dropbox or onedrive, or "local" for your local computer. E.g.,</div><div><b>"short-name-year-dropbox" or </b><b>"short-name-year-onedrive" or </b><b>"short-name-year-local"</b></div><div><b><br /></b></div><div>Appending the year the project commenced is helpful as an additional identifier for the project. In particular, the short name might be good initially, but may become less identifying over time (e.g., as you do more similar research).
So when you're searching for the project in years to come, the year becomes particularly useful.</div><div><br /></div><div>Appending the storage mode is particularly helpful when you are collaborating with colleagues on a project using a tool like dropbox or onedrive, but you also need to maintain some files on your local computer (for example, confidential data, private brainstorming, files you don't want colleagues interfering with, files that get corrupted on data sharing platforms, etc.). In this case, appending "local" to your local files and "dropbox" to the shared files helps distinguish the two.<br /></div><div><br /></div><div><b>Examples:</b> "hexaco-ei-2017-dropbox", "hexaco-wellbeing-2017-dropbox"</div><div><br /></div><div><b>Directories</b></div><div>The following are the core directories of the project (a sketch of a typical layout follows these lists).<ul style="text-align: left;"><li><b>manuscript: </b>Stores the authoritative version of the manuscript, the reference manager database, and any online supplement files that will be submitted to the journal.</li><li><b>archive: </b>Stores old versions of files (i.e., the manuscript). It provides a simple form of version control.</li><li><b>submissions: </b>This contains one directory for each journal submission, and each journal submission includes folders for each step of the publication process.</li></ul>Additional directories:<br /><ul style="text-align: left;"><li><b>notes:</b> Stores any preliminary analyses, literature reviews, and files that involve analysis or reflection.</li><li><b>resources: </b>Stores any files related to the study (e.g., meta-data, scoring instructions, details about the survey and tasks, raw data, and so on).</li><li><b>analysis:</b> The data analysis files.</li></ul></div>
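<div>To make this concrete, here is a sketch of what a project laid out under these rules might look like (all file names here are hypothetical):</div><pre><code>hexaco-values-2020-dropbox/
    workings-hexaco-values.docx
    manuscript/
        manuscript-hexaco-values-4-aug-2020.docx
        supplement-hexaco-values.docx
        hexaco-values.enl
    archive/
        manuscript-hexaco-values-12-jun-2020.docx
    submissions/
        1-jopy/
            1-initial-submission/
            2-first-revision/
    notes/
    resources/
    analysis/
</code></pre>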
<div><b>Manuscript directory</b></div><div><b><br /></b></div><div>The manuscript file name is in the following format:</div><div><b>manuscript-shortname-date.docx</b></div><div><b><br /></b></div><div><b>Examples:</b> if the date of last editing was 4th August 2020 and the short name was "hexaco-values", the manuscript file would be called:</div><div>"manuscript-hexaco-values-4-aug-2020.docx"</div><div><br /></div><div>Note that the file name never has words like "draft", "rough draft", "final", "absolutely-final", "final2", etc. The date is all that is required to indicate that it is the latest version.</div><div><br /></div><div>The word "manuscript" is placed at the start of the file name for several reasons. First, it clearly denotes this file as the manuscript file as opposed to some other file (e.g., supplemental files, etc.). Second, it is easier to identify as the manuscript file than if the file were called "shortname-manuscript-date". This leads to fewer errors when uploading files to the manuscript submission system.</div><div><br /></div><div><b>When to update the date in the manuscript file name? </b>The general idea is that whenever the manuscript reaches a key stage, a copy of the manuscript is placed in the "archive" directory and the date in the filename is updated. Key stages include: whenever the manuscript moves between authors, when submitting it to a journal, after a revise and resubmit, when it has been a long time since you've touched the manuscript, and when you're about to engage in some substantial edits. This essentially implements a basic form of version control. It enables you to recover any deleted content should you need to. It is also more comfortable to implement edits knowing that things can be restored.</div><div><br /></div><div><b>Other files in the manuscript directory: </b></div><div><b>Files associated with the reference manager.</b> I use Endnote to manage references, and I use a database that is project specific. In theory, Endnote can experience issues if multiple collaborators are trying to use Dropbox to work with the same Endnote folder. That said, often there are no issues. One solution is to designate one person to manage Endnote, with other authors just adding comments about references.</div><div><br /></div><div><b>Other files:</b> Quite often, there are online supplement files that get submitted to the journal. These often provide additional methodological details or additional analyses. It makes sense to keep these in the manuscript folder as they will need to be submitted to the journal.</div><div><br /></div><div><b>Submissions directory</b></div><div><b><br /></b></div><div>The submissions directory is where all the submissions to journals are stored. The general folder structure is that there is a directory for each journal that you submit to, with the prefix 1, 2, 3, etc. Obviously, you only submit to journal 2 after journal 1 has rejected you.</div><div><br /></div><div style="text-align: left;"><b>Example:</b> if the first submission was to Journal of Personality, it would be called "1-jopy"; if that was rejected and we tried Australian Psychologist, the second folder would be called "2-apsych".</div><div><br /></div><div>Within each journal submission directory are numbered directories for each stage of the submission. Here is one example set of folders:</div><div><ul style="text-align: left;"><li>1-initial-submission (cover letter, manuscript with anonymised title page, non-anonymised title page, online supplement, confirmation of submission email, pdf of submission)</li><li>2-first-revision (email with revision requests, updated manuscript/supplement, response notes, confirmation of resubmission email, pdf of resubmission)</li><li>3-acceptance (email confirming acceptance)</li><li>4-licence (copy of copyright agreement)</li><li>5-proofs (files associated with proofing)</li><li>6-formatted-online-first (copy of online first version)</li><li>7-preprint (preparing post-print for psyarxiv)</li><li>8-page-numbers (final journal pdf with page numbers)</li></ul><div>Other common folders include:</div></div><div><ul style="text-align: left;"><li>3-second-revision (same files as first revision, just updated)</li><li>4-third-revision (same files as first revision, just updated)</li><li>5-rejection (copy of rejection email; optionally brainstorming and reflections)</li></ul><div>The general principle is that these submission directories include (a) a read-only copy of the manuscript (often split up into title page and body) and related files (e.g., online supplement, figures, etc.), (b) any journal-specific submission files (e.g., cover letter, responses to reviewer comments, journal-specific information such as highlights), (c) any journal correspondence, and (d) PDFs generated by the submission system.</div></div><div><br /></div><div>An important principle here is that everything has one authoritative source. So, you never edit the actual manuscript in the submissions folder. These edits belong in the "manuscript" folder.
The only edits to the manuscript that occur in the submissions folder are things like anonymising the title page and making the manuscript conform to journal requirements (e.g., putting tables/figures in specific places).</div><div><br /></div><div>That said, things like cover letters and responses to reviewer comments do live in their respective submission directories, and that is their authoritative home.</div><div><br /></div><div><b>Resources and Notes Directories</b></div><div>Journal articles have lots of assorted resources (details on measures and procedure, literature searches, data analysis notes, brainstorming of ideas, etc.). The main point here is that these materials are organised in directories of the project.</div><div><br /></div><div><b>Linked Directories</b></div><div>In some instances, not all files are contained in the project directory. Resources may be relevant to more than one project, or there may be files that need to be stored elsewhere. In this case, I place an alias or shortcut link to these resources in the parent directory.</div><div><br /></div><div><b>Workings File</b></div><div>I often have a file called "workings-shortname.docx" in the root folder of the project. This is used to store all project-related brainstorming and notes.</div><div><b><br /></b></div><div><b>Template of Project</b></div><div>I store a template version of a new project on github:</div><div><a href="https://github.com/jeromyanglim/anglim-manuscript-template/">https://github.com/jeromyanglim/anglim-manuscript-template/</a></div><div><br /></div><div>I have a bookmark in my browser which downloads a zipped-up copy of the template:</div><div><a href="https://github.com/jeromyanglim/anglim-manuscript-template/archive/master.zip">https://github.com/jeromyanglim/anglim-manuscript-template/archive/master.zip</a></div><div><br /></div><div>This makes starting a new project very efficient. I update it from time to time to reflect changing conventions and so on (e.g., APA 7).</div></div></div>
<br />
<h1>Home, End, Page Up, Page Down Keys in OSX (2019-03-05)</h1>
Shortcut keys for navigation are inconsistent in OSX across applications. Notionally, the home and end keys are Fn + Left / Right, and Page Up / Page Down are Fn + Up / Down. However, this does not always achieve the same navigational effect, especially for programs ported from Windows.<br />
<br />
<div>
TextEdit, text boxes in Chrome (and probably other native OSX apps)</div>
<div>
<ul>
<li>Paragraph Up / Down: Alt + Up / Down</li>
<li>Start of line / End of line: Cmd + Left / Right</li>
<li>Start of document / End of document: Cmd + Up / Down</li>
<li>Page Up / Down: Fn + Alt + Up / Down</li>
<li>Previous / Next Word: Alt + Left / Right</li>
</ul>
</div>
Finder<br />
<br />
<ul>
<li>Top / Bottom File: Alt + Up / Down</li>
</ul>
<div>
<br /></div>
<div>
Chrome Browsing</div>
<br />
<div>
<ul>
<li>Start / End of document: Fn + Left / Right OR Cmd + Up / Down</li>
<li>Page Up / Page Down: Fn + Up / Down</li>
</ul>
Skim</div>
<br />
<div>
<ul>
<li>Start / End of document: Fn + Left / Right</li>
<li>Page Up / Page Down: Fn + Up / Down</li>
<li>Previous Page / Next Page: Cmd + Left / Right</li>
</ul>
</div>
<br />
<div>
<div>
Word</div>
</div>
<div>
<ul>
<li>Paragraph Up / Down: Alt + Up / Down</li>
<li>Start of line / End of line: Fn + Left / Right OR Cmd + Left / Right</li>
<li>Start / End of document: Fn + Cmd + Left / Right</li>
<li>Page Up / Down: Fn + Up / Down</li>
<li>Previous / Next Word: Alt + Left / Right</li>
</ul>
<div>
Outlook</div>
</div>
<div>
<ul>
<li>Paragraph Up / Down: Alt + Up / Down</li>
<li>Start of line / End of line: Fn + Left / Right</li>
<li>Start / End of document: Fn + Cmd + Left / Right</li>
<li>Page Up / Down: Fn + Up / Down</li>
<li>Previous / Next Word: Alt + Left / Right</li>
</ul>
</div>
<h1>Ways that closed access academic publishers could improve (2017-11-14)</h1>
Accessing journal articles has certainly got easier over the years. Nonetheless, there are still many issues with how closed-access journals provide access. This hampers the scientific process.<br />
<br />
Accessing a journal article that my institution has paid for should be as simple as going to the publication's standard home page (with one more click to download the PDF).<br />
<br />
From a broader societal perspective, there are many reasons to like open access publishing. However, for working scientists with institutional access to most journals, the day-to-day issue with closed access publishers is usability. Here are a few things that I wish all publishers would do:<br />
<br />
<h3>
Do not add cover pages to PDFs</h3>
<div>
Some publishers still insist on adding an initial page to their PDFs.</div>
<div>
Many have stopped doing this, but I notice that Emerald still does.</div>
<div>
<br /></div>
<h3>
When the user clicks download PDF, download the PDF</h3>
<div>
<ul>
<li>Do not ask the user to consent to terms and conditions (as JSTOR requires)</li>
<li>Do not give a pop-up with further options (e.g., Elsevier asks whether you want to download the article or the whole issue; T&F asks whether you want the PDF or an interactive PDF)</li>
</ul>
<div>
If you want to provide two different types of downloads, then include two different buttons on the main page.</div>
<h3>
Do not try to open the PDF in a publisher-specific proprietary PDF viewer</h3>
</div>
<div>
The user wants to download the PDF to their computer or view it in their normal viewer. Often they want to collate it on their own computer.</div>
<h3>
Facilitate seamless access for those with institutional access from the official landing page</h3>
<div>
Users generally have access to journal articles through their institution. Publishers should do their best to ensure that anyone who has institutional access is able to quickly get access through the official manuscript landing page, and they should err on the side of granting access. Whether this involves simple institutional sign-in, cookies, or something else, the point is that the official page for a manuscript (i.e., the one that the doi directs to) should provide simple access to articles for people with a subscription (institutional or otherwise). Forcing users to jump through hoops using library proxies, links from Google Scholar, and so on is not helpful. This is all compounded by the fact that there are many different publishers. Basically, the experience for someone with institutional access should be seamless whether they are accessing the article via a mobile phone or a computer, and whether they are on campus or not.</div>
<div>
Presumably, there are many solutions to this. Cookies, simple authentication systems, and so on. The point is, it should "just work" on all devices from all locations.</div>
<h3>
Do not throttle downloads</h3>
<div>
I heard recently that someone was prevented from downloading articles because they had reached some kind of short-term maximum. This is not good. At the very least, such a throttle should be triggered only when downloads hit the thousands in a day, not 20 or 30 articles.</div>
<div>
<br /></div>
<h3>
General Principles</h3>
<div>
<ul>
<li>Listen to the usability experts and not the legal department</li>
<li>Your priority is facilitating access to scientific knowledge</li>
</ul>
</div>
<div>
<br /></div>
<div>
<br /></div>
<h1>Generating APA style tables in R: Current challenges (2017-03-31)</h1>
This post reviews some aspects of generating formatted tables using R that are suitable for inclusion in a manuscript conforming to APA style. I review my current workflow, which involves a large amount of manual formatting in Excel. I then discuss what it would take to automate more of these manual steps in R.<br />
<br />
My current workflow for incorporating tables into a journal manuscript involves the following steps:<br />
<ul>
<li>Create a data.frame in R with the core table data, where row and column names carry the row and column headers. This usually includes some rounding of numbers to the desired precision (in order to avoid Excel rounding errors); see the sketch after this list</li>
<li>Export data.frame as a csv e.g., <span style="font-family: "courier new" , "courier" , monospace;">write.csv(mytab, file = "output/mytab.csv")</span><span style="font-family: inherit;">, although sometimes I'll write to Excel to automate bolding.</span></li>
<li><span style="font-family: inherit;">Open csv in Excel and apply manual formatting</span></li>
<li><span style="font-family: inherit;">Paste adjusted table into Word</span></li>
</ul>
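<div>As a minimal sketch of the first two steps, assuming a hypothetical results table called <span style="font-family: "courier new" , "courier" , monospace;">mytab</span>:</div><pre><code># Hypothetical results table (correlations and p-values)
mytab <- data.frame(r = c(0.31245, -0.04562),
                    p = c(0.00123, 0.54321),
                    row.names = c("Age", "Income"))
mytab$r <- round(mytab$r, 2)  # round in R to avoid Excel rounding errors
mytab$p <- round(mytab$p, 3)
write.csv(mytab, file = "output/mytab.csv")
</code></pre>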
<h3>
Pros and cons of manual formatting in Excel</h3>
Benefits of Excel approach<br />
<div>
<ul>
<li>In many respects this approach is fairly efficient. </li>
<li>If you are not updating your table results often, then it is often quicker to do formatting adjustments in Excel. </li>
</ul>
<div>
Problems with Excel approach</div>
<ul>
<li>If the data are updated multiple times, then repeatedly reformatting the table can be time consuming.</li>
<li>There is also the potential for errors to be introduced in the adjustment process, and the more times the data are updated, the more opportunities there are for transcription errors.</li>
<li>The time it takes to manually convert the table discourages making updates that would require this.</li>
<li>There is scope to standardise certain tables (e.g., correlation matrices, tables of descriptives by groups) and thus work spent automating could have benefits for future projects.</li>
</ul>
</div>
<h3>
Review of activities done during Excel formatting</h3>
<div>
The following is influenced by the terminology and formatting requirements of APA style (see Chapter 5 of the APA 6th Edition Manual).</div>
<div>
<ul>
<li><b>Modify fonts</b></li>
<ul>
<li>Change font type and size to align with the manuscript (e.g., 12 point Times New Roman)</li>
<li>Add selective font formats. Bolding certain numbers is quite common (e.g., correlations or factor loadings above a threshold); italicising certain statistical labels (e.g., M and SD, and 1, 2, 3, etc. in correlation tables) is also common.</li>
<li>Superscripts tied to specific table notes</li>
</ul>
<li><b>Add or modify content</b></li>
<ul>
<li>Convert R row and column names to the names used in the table. In particular, variable names are almost always distinct from the labels shown in the table.</li>
<li>Ensure capitalisation meets style requirements</li>
<li>Add consecutive numbers followed by a period to row names. E.g., it is common to number variables in a correlation matrix "1. Age", "2. Income", etc.</li>
<li>Add stub heading. I.e., the column heading for the first column (i.e., row.names) </li>
<li>Adjust numbers: e.g., a p-value less than .001 might be shown as <.001, an adjusted r-squared value less than 0 might be displayed as 0.</li>
<li>Convert p-values to significance stars</li>
</ul>
<li><b>Adjust cell alignment. </b></li>
<ul>
<li>Usually, headers are centred, numbers in the body are centred, and the first column is left aligned.</li>
<li>When row headings are nested, nested row stubs are indented (e.g., 3 spaces)</li>
</ul>
<li><b>Delete cell content</b></li>
<ul>
<li>Deleting the lower or upper triangle from symmetric matrices (e.g., a correlation matrix)</li>
<li>Deleting the diagonal from correlation matrices</li>
</ul>
<li><b>Delete rows or columns</b></li>
<ul>
<li>Ideally, the actual rows or columns of data have been specified correctly in R, but occasionally it is simpler to remove rows or columns at the Excel stage. For example, the R output might list fit statistics for six models, but it is later decided that only five are relevant. By contrast, rearranging the order of rows should be done in R for increased reliability.</li>
</ul>
<li><b>Add lines</b></li>
<ul>
<li>Lines are placed at the top and bottom of the column header row and below the last row</li>
<li>Decked column headings and table spanners require additional lines</li>
</ul>
<li><b>Format numbers</b></li>
<ul>
<li>Common tasks include adjusting the number of decimal places, removing leading zeros (e.g., correlations, multiple R, p-values), putting parentheses around certain numbers, and combining two numbers in some way (e.g., ranges and confidence intervals often have a separator like a comma or hyphen and may be surrounded by brackets). See the sketch following this list.</li>
</ul>
<li><b>Add line breaks in cells</b></li>
<ul>
<li>Some cells have two or more bits of information that should be presented on distinct lines. E.g., a column name might include the sample size on a second line (e.g., "Treatment {line-break} (n = 132)"), or a value might be presented on the first line and a confidence interval on the second. In this case, it is also possible to insert an additional row into the table and include these values in separate cells.</li>
<li>Some text is too long and needs to be split across multiple rows. This is usually done automatically. However, often this should include an indent on the second or subsequent row.</li>
</ul>
<li><b>Adjust column widths</b></li>
<ul>
<li>This is often a manual process in order to get the table to fit on the page and avoid cell wrapping.</li>
</ul>
<li><b>Decked headings: Special requirements</b></li>
<ul>
<li>Decked headings occur where two or more column headings are grouped under a column spanner (e.g., M and SD is shown for two groups where the group name is the spanner). </li>
<li>Merge cells of column spanner (i.e., the heading that groups the two columns)</li>
<li>Insert line below the cells of the column spanner</li>
<li>Insert a small empty column between column spanner and other columns (this ensures that there is a gap between the line underneath the column spanners and makes it easier to see the intended grouping)</li>
</ul>
<li><b>Table spanners: Special requirements</b></li>
<ul>
<li>A table spanner is a centred heading that represents a major subdivision of a table. </li>
<li>It involves inserting a new row with merged cells and centred text and adding a line to the bottom of the table division.</li>
</ul>
<li><b>Table caption, title, and notes: Special requirements</b></li>
<ul>
<li>In general, I specify these things in the manuscript. Mostly this works well. There is just the occasional bit of information that might be data driven. E.g., correlations above a certain value might be flagged as significant and this information might be included in the table note.</li>
</ul>
</ul>
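<div>As a sketch of how some of the number-formatting steps above might be automated, the following base R helpers (the function names are my own) fix decimal places, remove leading zeros, and convert p-values to significance stars:</div><pre><code># Format to a fixed number of decimal places
apa_round <- function(x, digits = 2) formatC(x, format = "f", digits = digits)

# Remove the leading zero from statistics bounded by 1 (correlations, p-values)
drop_leading_zero <- function(x) sub("^(-?)0\\.", "\\1.", x)

# Convert p-values to significance stars
p_stars <- function(p) {
  symnum(p, corr = FALSE, na = FALSE,
         cutpoints = c(0, .001, .01, .05, 1),
         symbols = c("***", "**", "*", ""))
}

drop_leading_zero(apa_round(c(0.3456, -0.072)))  # ".35"  "-.07"
p_stars(c(0.0004, 0.03, 0.2))                    # "***"  "*"  ""
</code></pre>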
<h3>
Reflections on manual formatting</h3>
</div>
<div>
Table formatting is complex. There is a visual quality to formatting tables. While some tables are approximated by a matrix with row and column headers, there are a huge number of common and not so common additional requirements. I often identify refinements to table formatting in an iterative fashion until it looks right.<br />
<br />
While I attempted to document all the tasks that I do, I would not be surprised if there were additional tasks that did not come to mind. And presumably the common requirements of APA style tables in psychology are not the same as those relevant to other style guides and other disciplines.<br />
<br />
It is possible to automate all of the above steps using R and output a table in a suitable format such as rtf, docx, or possibly HTML. However, at this point, this would require a lot of coding for each table.<br />
<br />
There are a few packages of relevance:<br />
<br />
<ul>
<li><a href="https://cran.r-project.org/web/packages/apaTables/index.html">apaTables</a> provides APA tables exported to RTF for a few very specific scenarios. And the author also adopts specific preferences, which while well reasoned, are not always what you want.</li>
<li><a href="https://cran.r-project.org/web/packages/apaStyle/index.html">apaStyle</a> is similar to apaTables in that it exports to Word format, although it seems a little more flexible. It has a generic table function that can handle decked headings, but it still seems a long way from the flexibility required to produce most tables.</li><li><a href="https://rempsyc.remi-theriault.com/articles/table">rempsyc</a> includes functions for outputting APA tables to Word from R.</li>
<li><a href="https://cran.r-project.org/web/packages/xtable/index.html">xtable</a> is one of the best packages for table production but it exports principally to HTML and LaTeX. It also doesn't really seem designed for capturing all the complexities of APA style tables.</li>
<li>htmlTable in gmisc allows for some complexity. <a href="http://timelyportfolio.blogspot.com.au/2013/04/tables-are-like-cockroaches.html">See this example.</a></li>
</ul>
<br />
The challenge is to design a flexible and efficient system that is also reliable (in that it limits the introduction of errors). I think a nice challenge for anyone willing to take this on would be to develop a simple set of functions in R that generate tables in Word or RTF format and that could be used to produce the 16 tables in the APA 6th edition style manual (ideally from hypothetical data, to include the additional challenges of extracting and formatting the numbers, converting variable names, etc.). These tables include a range of the common requirements of APA style that are not well supported in existing packages.<br />
<br />
<b>Update:</b><br />
<br />
<ul>
<li>After posting, I learnt about the <a href="https://github.com/crsh/papaja">papaja package</a>. It seems specifically designed for writing APA style documents with R Markdown. The apa_table function seems like it's designed to capture many of the quirks of APA style, but at present its more advanced table-formatting features are limited to exporting LaTeX (i.e., R Markdown to LaTeX to PDF). There is a lot to love about a fully reproducible workflow, but at present I still find that collaboration and other features make Word my go-to option for manuscript preparation.</li>
<li>huxtable (mentioned in the comments) has quite a lot of formatting flexibility. It exports to HTML and LaTeX format. See <a href="https://cran.r-project.org/web/packages/huxtable/vignettes/introduction-to-huxtable.html">this vignette</a>. It also supports row and column spans, although row spans are handled as separate columns whereas APA style uses indenting. I'm also not clear on how you would go from HTML to Word. My general impression is that HTML is less prescriptive by design.</li>
</ul>
</div>
<h1>Suggestions for how R and RStudio could improve auto-completion and usability of R (2016-09-12)</h1>
RStudio has improved the power of auto-completion in R and generally increased usability. However, there remains the potential to improve discoverability and usability further. There are also coding practices that R package authors can adopt both to work better with auto-complete and to make the features of their packages more discoverable. After using and teaching R for the last ten years, this post outlines what I see as the major areas for potential improvement.<br />
R has a reputation as being efficient once you know how it works, but difficult to learn.<br />
<div>
Auto-completion increases coding productivity.<br />
<div>
<ul>
<li>Users don't have to memorise the precise spelling of the name of every function, argument name, data frame variable, and argument value. It also helps to resolve the issue of the wide range of coding conventions in R (camelCase, dot.names, under_score_names, etc.). </li>
<li>It means that users can focus more on coding and less on looking up help files for the precise phrasing of some low level feature, or constantly typing <span style="font-family: "courier new" , "courier" , monospace;">dput(names(mydata)) </span><span style="font-family: inherit;">to get lists of variable names.</span></li>
<li><span style="font-family: inherit;">New users may also know what they are looking for, but not know how to obtain it. Auto-completion can facilitate this.</span></li>
</ul>
<div>
My general conclusion is that auto-completion needs to be taken more seriously in R. RStudio has done a great job of implementing auto-completion. I also think that the R language and package authors could incorporate features to work better with IDEs that implement auto-complete. </div>
<div>
<h3>
<b>Auto-completion of arguments that take a character variable</b></h3>
</div>
<div>
<span style="font-family: inherit;">Many functions have multi-category options (e.g., method of correlation, missing data procedure for a table, type of factor analysis rotation. It would be good to have auto-completion on these values. </span></div>
<div>
<b><br /></b></div>
Example 1: If I have missing data when computing a correlation matrix, I use the "use" argument to specify what kind of missing data handling should occur. It would be good if code completion operated on the available options. That said, at least RStudio automatically shows the argument instructions, which list the options.<br />
<br />
Example 2: The options for <span style="font-family: "courier new" , "courier" , monospace;">useNA</span> and <span style="font-family: "courier new" , "courier" , monospace;">exclude</span> arguments of <span style="font-family: "courier new" , "courier" , monospace;">table</span><br />
<br />
Example 3: The <span style="font-family: "courier new" , "courier" , monospace;">rotation</span> argument for <span style="font-family: "courier new" , "courier" , monospace;">factanal</span> does not list the available rotations. The help only states that the default argument is <span style="font-family: "courier new" , "courier" , monospace;">"varimax"</span> and that there are other rotations in some other packages, although the help file does show "promax" as another option.<br />
<br />
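<div>To make these concrete, here is a minimal sketch of the three examples above using built-in datasets (the particular option values come from the respective help files):</div><pre><code># "use" has five options; auto-completing them would help
cor(airquality$Ozone, airquality$Wind, use = "pairwise.complete.obs")

# useNA takes "no", "ifany", or "always"
table(airquality$Ozone > 100, useNA = "ifany")

# rotation accepts "varimax" (the default), "promax", and others
factanal(mtcars[, 1:6], factors = 2, rotation = "promax")
</code></pre>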
<b>Recommendation</b>:<br />
<br />
<ul>
<li>Package authors should ensure that the help files list all argument options in the "arguments" section of the help file. If using "see details", at least list the permissible option names in the arguments section and use the details section for explaining what each option means. RStudio displays the argument information in auto-complete. Often a user just wants to be reminded of the precise spelling of an argument option or wishes to get an overview of the choices.</li>
<li>It should be possible to enable auto-completion on the available options. I imagine this would involve the specification of additional language features in R which would then be detected by IDEs like RStudio.</li>
</ul>
<h3>
Auto-completion for nested ellipsis arguments</h3>
<div>
Ellipsis arguments (...) allow for flexibility. However, they also decrease usability because users are less clear on which arguments can be passed to a function. This is particularly true for arguments to methods like print and summary.</div>
<br />
Example 1: I'm running a factor analysis<br />
<span style="font-family: "courier new" , "courier" , monospace;">fit <- factanal(matrix(rnorm(1000), ncol = 10), 2)</span><br />
<span style="font-family: monospace;"><br />
</span> The code for printing the loadings has several arguments, including "sort" and "cutoff", i.e.,<br />
<code><span style="font-family: "courier new" , "courier" , monospace;">print(fit, sort = TRUE, cutoff = .5)</span></code><br />
<code><br />
</code> <br />
But auto-complete doesn't see these arguments. RStudio actually does a pretty good job of finding arguments. It seems that these arguments belong to "print.loadings" as opposed to "print.factanal". Thus, if you go:<br />
<span style="font-family: "courier new" , "courier" , monospace;">loads <- fit$loadings</span><br />
<span style="font-family: inherit;">Then, pressing tab after </span><br />
<span style="font-family: "courier new" , "courier" , monospace;">print(loads, </span><br />
<span style="font-family: inherit;">will show the </span><span style="font-family: "courier new" , "courier" , monospace;">cutoff</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">sort</span><span style="font-family: inherit;"> arguments.</span><br />
<span style="font-family: inherit;">However, it seems that RStudio is only able to to go one layer deep.</span><br />
<span style="font-family: inherit;"><br /></span>
I imagine that this is a hard one to solve.<br />
<h3>
Auto-completion of variable names in data frames</h3>
<div>
There is limited auto-completion support in RStudio for names in data frames. It has improved. You can type <span style="font-family: "courier new" , "courier" , monospace;">mydata[, {tab}</span> and get the variable names. However, you can't type <span style="font-family: "courier new" , "courier" , monospace;">mydata[,c(" {tab}</span><span style="font-family: inherit;">.</span></div>
<div>
<br /></div>
<div>
Recommendations:</div>
<div>
<ul>
<li>RStudio should also auto-complete variable names after <span style="font-family: "courier new" , "courier" , monospace;">mydata[,c("</span>, i.e., after quotation marks, because presumably that is how the user would be selecting variables when they realise that they can't remember the precise spelling and so need to tab complete.</li>
</ul>
</div>
<h3>
Auto-completion on formulas</h3>
<div>
Many functions in R use formulas. Most notable are model fitting functions like <span style="font-family: "courier new" , "courier" , monospace;">lm</span> and <span style="font-family: "courier new" , "courier" , monospace;">glm</span>. However, there is no support in RStudio for auto-completing variable names in formulas. One impediment is that formulas come before the data.frame argument in most functions (e.g., lm), so if there are multiple data frames in the workspace, it would be a little tricky to know which variables to list.</div>
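<div>For example, in the following call, the formula variables are typed before the IDE has seen which data frame they belong to:</div><pre><code># the IDE only learns about "iris" after the formula has been typed
fit <- lm(Sepal.Length ~ Sepal.Width + Species, data = iris)
</code></pre>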
<h3>
Auto-completion in the Hadleyverse (e.g., ggplot2) and other functions where a data frame is one argument and variable names are another</h3>
<div>
Hadley Wickham's packages are awesome. However, they have a particular coding style. In particular, a data frame is commonly one argument (e.g., the first) and variable names are specified as a separate argument; often this is done without quotation marks and in a slightly separate context to the specification of the data frame. For example, in the following context:</div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;">ggplot(mydata, aes(my_very_long_variable_name))</span></div>
<div>
<span style="font-family: inherit;">There is no auto-completion in RStudio for the variable </span><span style="font-family: "courier new" , "courier" , monospace;">my_very_long_variable_name.</span></div>
<div>
<span style="font-family: "courier new" , "courier" , monospace;"><br /></span></div>
<div>
<span style="font-family: inherit;">Similar coding rules apply to a wide range of functions where variable names are specified in a separate argument to the data.frame (e.g., see many of the </span><span style="font-family: "courier new" , "courier" , monospace;">dplyr</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">tidyr</span><span style="font-family: inherit;"> functions, but also base R functions like </span><span style="font-family: "courier new" , "courier" , monospace;">subset</span><span style="font-family: inherit;"> and </span><span style="font-family: "courier new" , "courier" , monospace;">reshape</span><span style="font-family: inherit;">). These functions would be so much easier to use if there was auto-completion of variable names in these contexts. </span></div>
<div>
<br /></div>
<div>
One approach would just be to show auto-completion of variable names of data.frames in more places. However, this could get noisy. Another approach would require a deeper understanding of the language. Presumably this could be done on an ad hoc basis. For example, RStudio could hard code ggplot2 features to know when auto-completion on variable names should occur. Otherwise, perhaps there could be a convention for how package authors could speak to IDEs that want auto-completion information, and a more general way of indicating that auto-completion software should look at the preceding data.frame for the variables.</div>
<h3>
<span style="font-family: inherit;">Auto-completion for function arguments that take lists</span></h3>
<div>
<span style="font-family: inherit;">There are many functions that have an argument that takes a named list:</span></div>
<div>
<ul>
<li><span style="font-family: "courier new" , "courier" , monospace;">nls(..., control = list(...))</span></li>
<li><span style="font-family: "courier new" , "courier" , monospace;">ProjectTemplate::load.project(override.config = list(...))</span></li>
</ul>
<div>
There is no auto-completion for the allowed named elements.</div>
</div>
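<div>For example, the control argument of nls accepts a named list whose permissible elements (maxiter, tol, minFactor, etc.) are documented under nls.control rather than surfaced by auto-complete. A minimal sketch, with illustrative control values:</div><pre><code># self-starting logistic growth model (from the nls examples)
fit <- nls(circumference ~ SSlogis(age, Asym, xmid, scal),
           data = Orange,
           control = list(maxiter = 100, tol = 1e-6))
</code></pre>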
<div>
<br /></div>
<div>
Recommendations:</div>
<div>
<ul>
<li>Package authors: include the list of permissible argument names in the arguments section of the help file so that auto-completion software can quickly show this information.</li>
<li>R language: there should be a way to specify the permissible arguments, which could then be incorporated into some form of auto-complete in RStudio.</li>
</ul>
<h2>
Some other issues</h2>
</div>
<div>
The following are some other issues related to auto-completion.</div>
<h3>
Make more model fit information accessible from the fit object</h3>
<div>
An attractive feature of SPSS and related software is that you get a lot of output and there is often a GUI that allows you to select the output that you want. R model output tends to be brief, and if you want additional output, you need to ask for it. This is also good, but how to obtain the additional output could be more intuitive. For example, there is a lot of different information that you might want to obtain from a multiple regression (influence statistics, standardized coefficients, zero-order correlations between predictors and outcome, and so on). One of the challenges is that modelling in R is often of the form: (1) return a fit object, (2) run a function or method on that fit object. However, for a new user, it is often difficult to discover the available functions and methods required to derive a relevant bit of information from an R fit object.</div>
<div>
<br /></div>
<div>
It would be nice if it was as simple as typing<span style="font-family: "courier new" , "courier" , monospace;"> fit$ {tab} </span><span style="font-family: inherit;">and getting a big list of things that you might want to obtain.</span></div>
<h3>
<b>Avoid printing output to the screen that cannot easily be extracted</b></h3>
<div style="font-size: medium; font-weight: normal;">
R generally makes reproducible analysis easier to perform. A common use case is to take the output of a function and use that output in a subsequent function. This can be as simple as creating a table that combines different elements (e.g., coefficients from multiple models along with fit statistics).</div>
<div style="font-size: medium; font-weight: normal;">
<b><br /></b></div>
<div style="font-size: medium; font-weight: normal;">
However, some functions print the statistics you want to the screen, but these numbers are not readily available. In general, this means that the print function is performing the calculations and printing them to the screen without ever storing the results in an object.</div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
<div style="font-size: medium; font-weight: normal;">
Example 1: The print method for factanal prints proportion variance explained for each factor. This is calculated in the print function but is not accessible. If you didn't know how to calculate this yourself, you would have to know that <span style="font-family: "courier new" , "courier" , monospace;">getAnywhere(print.factanal)</span> is the incantation for seeing how R calculates it, and then you'd have to extract the code that does it.</div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
<div style="font-size: medium; font-weight: normal;">
In contrast, when you run summary on an lm fit, you can explore the object and extract things like adjusted r-squared. E.g.,</div>
<div style="font-size: medium; font-weight: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">fit <- lm(y ~ x, mydata)</span></div>
<div style="font-size: medium; font-weight: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">sfit <- summary(fit)</span></div>
<div style="font-size: medium; font-weight: normal;">
<span style="font-family: "courier new" , "courier" , monospace;">sfit$ (tab)</span></div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
<div style="font-size: medium; font-weight: normal;">
This will show the elements of what has been calculated. Depending on trade-offs in computation time, it might even be simpler if more of these summary statistics were calculated with the fit, so that a user only has to fit the object and can then extract the relevant information with <span style="font-family: "courier new" , "courier" , monospace;">fit$ (tab)</span></div>
<div style="font-size: medium; font-weight: normal;">
<br /></div>
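<div>Continuing the hypothetical lm example above, the summary object exposes its computed statistics by name:</div><pre><code>sfit$adj.r.squared   # adjusted R-squared
sfit$coefficients    # estimates, SEs, t values, p values
sfit$sigma           # residual standard error
</code></pre>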
<div style="font-size: medium; font-weight: normal;">
Recommendation</div>
<div style="font-size: medium; font-weight: normal;">
</div>
<ul style="font-size: medium; font-weight: normal;">
<li>Package authors should try to ensure that, for every important bit of output in a print function, there is a standard way of extracting that information into an object. For example, the summary method for lm returns the adjusted r-squared.</li>
</ul>
<div style="font-size: medium; font-weight: normal;">
</div>
<h3>
<span style="font-family: inherit;">Many different object exploration operators</span></h3>
<div>
<span style="font-family: inherit;">There are many different operators for exploring objects</span></div>
<div>
<ul>
<li>$ (dollar) to extract named elements of a list (particularly used for the output of statistical functions, variables in data.frames, and general lists of things).</li>
<li>:: (double colon) to extract functions and other objects in a package (e.g., <span style="font-family: "courier new" , "courier" , monospace;">mypackage::foo()</span>)</li>
<li>::: (triple colon) to extract hidden functions</li>
<li>@ (at symbol) to extract elements of S4 class objects</li>
<li>. (period) which is a notational rule relevant to understanding S3 methods (e.g., print.lm)</li>
</ul>
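<div>A quick illustration of the first four operators (the S3 naming rule is visible in the print.lm example):</div><pre><code>fit <- lm(mpg ~ wt, data = mtcars)
fit$coefficients     # $  : named element of a list-like object
stats::sd            # :: : exported object from a package
stats:::print.lm     # :::: unexported function (print.lm is an S3 method)
# @ extracts slots of S4 objects, e.g., m@Dim after m <- Matrix::Matrix(1:4, 2, 2)
</code></pre>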
<h3>
Many rules for examining source code</h3>
</div>
<div>
Being able to see the source code is a nice feature of R. But equally, you need to know quite a bit to actually look at source code, e.g., getAnywhere, double versus triple colons, and compiled code.</div>
<div>
<br /></div>
</div>
</div>
<h1>Workflow for Completing a Revise and Resubmit of a Journal Article in Psychology (2016-07-07)</h1>
This post discusses my workflow for completing a revise and resubmit.<br />
I have a template document for representing revise and resubmit responses.<br />
See my <a href="https://github.com/jeromyanglim/APAWordTempate">templates page on
github</a> and specifically see
the file <a href="https://github.com/jeromyanglim/APAWordTempate/blob/master/response-to-reviewers.dotx?raw=true">"response-to-reviewers.dotx"</a>.<br />
<br />
<h3>
Setting up the Response Document</h3>
The document has the following core styles:<br />
<ul>
<li>Heading 1: Divides up the major sections of the review (e.g., Editor, Reviewer 1, Reviewer 2)</li>
<li>Heading 2: Summary statement for each reviewer point</li>
<li>Reviewer Comment: Exact quote of a particular reviewer comment</li>
<li>Body text: For recording my response</li>
<li>Quote: For formatting quotes of specifically modified sections of the text</li>
</ul>
Step 1 is to paste the full text of the editor and reviewer comments into a new Word response document and apply the reviewer comment style.<br />
<br />
Step 2 is to set up the response document. Level 1 headings are added that divide up the reviewer sections.<br />
<br />
Reviewer comments are divided into discrete points.
The division of revision points may or may not be clear.
Some reviewers provide numbered points. Others provide a more narrative review
where each paragraph includes multiple points. Some points are interconnected
but involve distinct actions.<br />
For each point that is identified, I add a level 2 heading. The level 2 heading
includes an identifier and a brief summary statement of the requirement.
Identifiers are, for example, "R1.2", which would refer to Reviewer 1's second
point. In some cases, where there are connected points, you get "R1.2.1",
"R1.2.2" and so on.<br />
<br />
There are several benefits to using identifiers. In some cases, multiple
reviewers make the same point. Thus, you can quickly refer the reviewer to
another review point. E.g., "This point was addressed in reviewer point R1.2".
It can also be an efficient way of keeping track of reviewer points when you are
working through a large number of them.<br />
<br />
The summary statements are important. I aim to keep them short. Ideally they'll
fit on one line so that they are easy to quickly understand (i.e., around 50
characters). I try to make them commands. For example:<br />
<ul>
<li>Clarify unique contribution</li>
<li>Improve study motivation in introduction</li>
<li>Describe x more clearly</li>
<li>Add references to ...</li>
<li>Justify statistical method</li>
<li>Consider using ... method</li>
<li>Include ... Table 1</li>
</ul>
In some cases, the required action is not explicitly stated by the reviewer. For
example, if a reviewer critiques a methodological decision, there are various
possible actions, including justifying your choice, adding a limitation, and so
on.<br />
<br />
Benefits of the above approach<br />
<ul>
<li>Using formal headings in MS Word allows you to view a document map in the sidebar, which lets you quickly navigate between reviewer points.</li>
<li>Another benefit of the above process is that reviewer comments start to appear
more manageable. When you first receive a few pages of reviewer comments, it can
feel overwhelming. The above process begins to divide up each point into a more
manageable task. The act of providing a summary statement also forces you to
read and understand what action is required to respond to the reviewer comment.</li>
</ul>
<h3>
Record initial reflections</h3>
<div>
Above, I show how the first reading is used to parse reviewer comments into discrete points and give descriptive titles. In the second reading, I add comments to each reviewer point using the comment feature in the word processor. This is an opportunity to record some initial reflections on (a) how easy it will be to satisfy the revision, (b) whether a change to the manuscript is required, and (c) what should be done. After I've added these, I often circulate the response document to collaborators to allow them to add comments.</div>
<h3>
Sequencing the Revisions</h3>
The next task is to determine a sequence for working through the revisions.
This involves keeping track of which points still need to be addressed and
deciding on an order to work through the points.<br />
At a basic level, I place an asterisk at the start of each heading that has not
yet been addressed. This is removed once the point has been adequately
addressed.<br />
<br />
A more challenging issue is deciding on how to work through the
changes. Some changes are interdependent. However, major revisions often
have to be worked through first as they can have broader structural implications
for the manuscript.<br />
A few useful steps for thinking about sequencing include:<br />
<ul>
<li>Organise the points into categories</li>
<li>Read through each point, and make some tentative notes about what to do (e.g., using comments in Word).</li>
<li>Decide on an explicit sequence to work on the points. This often requires you
to brainstorm the pros and cons of working on one point versus another first.</li>
</ul>
In some cases, sequencing will raise some more meta-issues about the paper that
transcend any given review point. I mostly find it easiest to work through points in the following order: analyses, results, method, introduction, discussion. The rationale is that any new analyses that you run and incorporate into your paper will change your results. And these may further require changes to the method, which in turn influence the framing and discussion. Likewise, if the introduction is changed, this may have implications for how the discussion integrates topics raised in the introduction.<br />
<br />
Logistically, I generate a table of contents in MS Word. This lists all the reviewer point titles (i.e., the IDs and the titles such as "R1.1 Update method to include ..."). This works because all the reviewer points are formatted using heading styles. I then copy and paste this as plain text into a working document. These points are then organised thematically under headings and into an appropriate sequential order.<br />
<h3>
Addressing Revision Points</h3>
Once sequencing issues have been resolved, it is a matter of working through each
revision point. I have a few guiding principles:<br />
<ul>
<li>Write in a manner which focuses on the scientific issue.</li>
<li>Treat the reviewer with respect.</li>
<li>If a reviewer has misunderstood something in the manuscript, take responsibility
for making the manuscript clearer.</li>
</ul>
Another point is that the response document should be self-contained. Ideally,
the reviewer should not need to look at the actual manuscript to judge whether
you have effectively responded to their requested changes. This makes the experience of the
reviewer much more pleasant. From a strategic perspective, they may also be less
inclined to read through the entire manuscript again and come up with all new
concerns.<br />
<ul>
<li>If a table or figure is updated, then paste a screenshot of the updated table
or figure.</li>
<li>If a new paragraph has been added, include a copy of that paragraph.</li>
<li>If a sentence or two has been added to a paragraph, include a copy of the
whole paragraph and bold the section that has been added.</li>
<li>Only if the point is very basic is it sufficient to say, "this change was
made". Examples of this might be adding a reference, fixing up typos, and so
on.</li>
</ul>
Another useful strategy is to indicate new text in the manuscript with a different colour font (e.g., purple).<br />
<h3>
Collaborations and Revisions</h3>
It is often easiest if one person leads the revisions. The lead person can also allocate specific revision tasks to co-authors. There is the issue of how to synchronise the revisions in the manuscript with the response document. If the changes are particularly complex, or the collaborators are likely to make substantial additional changes to the manuscript, then it may be worth waiting a little before completing the response document. Alternatively, treat the response document as an initial draft to be returned to once the manuscript has been finalised.<br />
<br />
<h3>
Track Changes</h3>
<div>
Some journals require that you include a version of the manuscript with track changes. In other cases, it can just be a useful addition to the submission. If you are using MS Word, then the compare documents feature is ideal for generating this document. This feature allows you to anonymise the changes because you can label them with "Author" rather than your actual name.</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTiZgOFN3AyKdUwX15o3hJe7pb6N6hRkFmXhj4osAPDxf7am6dwyqpO6SR00WaFM6m3vQ60Q732_vgzvlXCsO2xoj0WNl-rRalcDDz5urv59DqX4ydKxIMyTmEOKOnzDmBgqmr65vvmw/s1600/Screen+Shot+2017-04-11+at+5.17.25+pm.png" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="215" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjTiZgOFN3AyKdUwX15o3hJe7pb6N6hRkFmXhj4osAPDxf7am6dwyqpO6SR00WaFM6m3vQ60Q732_vgzvlXCsO2xoj0WNl-rRalcDDz5urv59DqX4ydKxIMyTmEOKOnzDmBgqmr65vvmw/s320/Screen+Shot+2017-04-11+at+5.17.25+pm.png" width="320" /></a></div>
<div>
<br /></div>
<h1>Managing Timeframes from Initial Submission to Final Acceptance of Journal Articles in Psychology (2016-07-07)</h1>
This post discusses issues related to managing the timeframe from an initial submission to a journal through to final acceptance at that or another journal. These are personal notes that pertain to my experiences in psychology. I post them here in case they might be useful for others.<br />
<br />
<h1>
Timeframe</h1>
<h2>
Overview of time frame</h2>
The following are very rough rules of thumb for timelines for various decisions
along with ranges that cover the majority of cases.<br />
<ul>
<li>Submission decision
<ul>
<li>Desk reject: 1 month (0 - 2)</li>
<li>Sent out for review: 3 months (1 - 5)</li>
</ul>
</li>
<li>Preparing new submission
<ul>
<li>Make changes: 1 month (0 - 2)</li>
<li>Wait to find time (self or others), acquire more data, get new skills:
highly variable length</li>
</ul>
</li>
<li>Revisions:
<ul>
<li>Prepare revisions: 1 month (0 - 2)</li>
<li>Review of revisions: 2 months (0 - 4)</li>
</ul>
</li>
</ul>
Thus, a basic formula is:<br />
<pre><code>months = number_of_reviewed_submissions * 3 +
         number_of_desk_rejects * 1 +
         number_of_major_edits_on_new_submissions * 1 +
         3 (make revisions and get accepted) +
         gap_time (i.e., sum of all gaps)
</code></pre>
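As a rough illustration, this formula can be written as a small R function (the function and argument names here are mine, purely illustrative):<br />
<pre><code># rough sketch of the timeline formula above (names are illustrative)
estimate_months <- function(reviewed_submissions, desk_rejects = 0,
                            major_edits = 0, gap_time = 0) {
    reviewed_submissions * 3 +  # each reviewed submission: ~3 months
    desk_rejects * 1 +          # each desk reject: ~1 month
    major_edits * 1 +           # each round of major edits: ~1 month
    3 +                         # final revise-and-accept cycle
    gap_time                    # sum of all waiting periods
}
estimate_months(reviewed_submissions = 2, major_edits = 1)  # 10 months
</code></pre>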
Thus, in summary, desk rejects don't add a lot of time and gap time can be
avoided if the manuscript is high priority.<br />
<ul>
<li>Accepted on first journal review (Submit, Revise, Accept): 6 months</li>
<li>Accepted on second journal review: (Submit, New Revise, Submit, Revise, Accept) 10 months</li>
<li>Accepted on third journal review: 14 months</li>
</ul>
It is natural and appropriate to aim high on the first submission.
You also typically get good feedback that can be used to improve the
manuscript, although acting on it can take a bit of time.<br />
<h2>
Submission decision</h2>
<h3>
Desk rejection</h3>
Desk rejection occurs when an editor (chief or possibly action editor) reviews
the manuscript and decides that the paper is not worth sending out for review.
This commonly occurs when the editor thinks the paper is an inappropriate fit
for the journal. Alternatively, the editor may feel that it is not up to the
standard of the journal. This can be either on novelty-interest grounds or on
pure scientific grounds.<br />
Desk rejection is typically quick. However, you typically don't get a lot of
feedback about what is wrong with the manuscript. Sometimes
you will get some feedback about areas that can be improved. At the least, a
desk rejection can help to refine your understanding of what is on topic at
particular journals.<br />
<h3>
Sent out for review</h3>
If a paper is sent out for review, you will typically receive some detailed
feedback. Given the steps involved, it is not surprising that the process often takes around 3 or 4 months:<br />
<ul>
<li>Editor has to appraise the manuscript and determine if it should be sent out
for review (1-2 weeks)</li>
<li>Editor has to contact reviewers and get agreement to review manuscript (1-4
weeks)</li>
<li>Enough of the reviewers have to have completed their reviews (8 weeks is
common)</li>
<li>If insufficient reviews, then ask for more reviewers (can add another 8 weeks)</li>
<li>Editor needs to go through reviews and possibly add own review and make
decision (1 to 4 weeks)</li>
</ul>
A range of decisions can be provided but the two most common are (1) rejection
and (2) request for revisions (with a distinction between minor and major; and
sometimes between submitting revisions and resubmitting the manuscript).<br />
<h2>
Preparing new submission</h2>
If you get a desk rejection or a rejection after peer review, this is an
opportunity to revise the manuscript for a new submission.
I discuss strategies for doing this later, but from a time frame perspective,
there is (a) the time to make the changes, and (b) the time where the manuscript
is waiting to be attended to.<br />
<h2>
Revisions</h2>
Journals often have a deadline for submission of revisions (e.g., 2 months).
This ensures that revisions are prioritised.<br />
<h1>
Responding to rejection</h1>
When a submitted manuscript is rejected by a journal, the manuscript in some
sense returns to the status of a good draft.<br />
<h2>
Explicit reasons for rejection</h2>
<ul>
<li>It did not meet the requirements of the journal
<ul>
<li>e.g., cross-sectional self-report generally not published; not
a multi-study paper; topic not really of interest to the journal; they
don't publish student/non-clinical/non-industry samples</li>
</ul>
</li>
<li>Not important enough
<ul>
<li>It is common to be told that the manuscript is just not
interesting enough, often with a hint of what would be required to make
it more interesting</li>
<li>Examples: not novel enough; sample size not large or representative enough</li>
</ul>
</li>
<li>A list of substantive criticisms
<ul>
<li>Is the criticism valid?</li>
<li>Does it reflect a misunderstanding by the reviewer?</li>
<li>Can the criticism be addressed? If so, how easily?</li>
<li>Are there small errors of expression, unclear sentences, or typos?</li>
</ul>
</li>
</ul>
<h2>
Understanding the rejection</h2>
<ul>
<li>Obviously, substantive criticisms can be clear. But even with these, it is
useful to get a sense of which were the major reasons for rejection and
which are just suggestions for improvement.</li>
<li>Other times it is possible to read into the rejection. The rejection may
imply that the reviewers did not follow the argument, or did not see the
novelty, or focused too much on limitations.</li>
<li>Typically the reviewer will not state everything they object to.</li>
<li>One response can be to reframe the paper.</li>
</ul>
<h2>
Basic options</h2>
<ul>
<li>Appeal rejection</li>
<li>Discard manuscript</li>
<li>Submit manuscript elsewhere
<ul>
<li>no changes</li>
<li>minimal changes</li>
<li>substantive changes</li>
</ul>
</li>
</ul>
<h3>
Appeal rejection</h3>
In the majority of cases, appealing a rejection is a bad idea.
There are many journals out there on a given topic. If the paper is good, try
another one. Use the rejection to improve the manuscript.
There are also many reasons why an editor will reject. It rarely comes down to
a single issue that can be refuted. And if the contribution of the paper is not
clear, then that is the author's fault.<br />
<h3>
Discard manuscript</h3>
Discarding a manuscript is an option. A similar approach is just to put it on
hold because it has weaknesses and the effort required to fix them (if
they can be fixed) is not worth the likely payoff.<br />
With experience, it becomes easier to judge earlier in the project life cycle
where a paper might end up. If it clearly has flaws that cannot be fixed,
then drop the project early.<br />
<h3>
Submit manuscript elsewhere</h3>
This is the standard option.<br />
<h2>
Whether to make changes for resubmission?</h2>
<ul>
<li>Benefits of making changes
<ul>
<li>Systematically considering each change generally increases the chance of
future manuscript acceptance</li>
<li>Some reviewers see it as bad faith to not make changes identified in
the review process</li>
<li>Considering each change generally makes the paper better</li>
</ul>
</li>
<li>Benefits of not making changes
<ul>
<li>Making changes takes time</li>
<li>One school of thought is that if a journal is truly interested in the
work, then they will give you a "revise and resubmit" where the editor
and reviewers will have their own particular changes that they want
made. While there is some truth to this, I still think that systematically
working through reviews gets papers closer to something that is appealing
to reviewers and makes for a better paper.</li>
</ul>
</li>
</ul>
My general approach is to treat a resubmission to a new journal like a revise and resubmit.
I engage in the same process of responding to each point made by the reviewers.
The main difference is that you don't have to be as polite to the reviewers.<br />
<h2>
Additional references on responding to rejection</h2>
<ul>
<li><a href="http://expertedge.journalexperts.com/2014/02/24/your-paper-was-rejected-what-next/">ExpertEdge</a></li>
<li><a href="http://careers.ucsc.edu/grad/get_published.html">Santa Cruz Career Centre</a></li>
<li><a href="http://www.insidehighered.com/advice/2009/04/27/belcher">Inside HigherEd</a></li>
<li><a href="http://www.diabetologia-journal.org/rejection.html">Diabetologia</a></li>
<li><a href="https://medium.com/advice-and-help-in-authoring-a-phd-or-non-fiction/seven-upgrade-strategies-for-a-problematic-article-or-chapter-3c6b81be9aa2">Patrick Dunleavy</a></li>
</ul>
<h1>
Principles for minimising time to acceptance</h1>
<h2>
I am leading</h2>
It is first important to distinguish papers in terms of who is leading.
If you are leading a paper, then you have much more control over the following
things.<br />
<ul>
<li>Focus on core research area
<ul>
<li>This enables better journal selection</li>
<li>There are fewer gaps in the first submission.</li>
<li>The revisions are easier to write.</li>
</ul>
</li>
<li>Make initial submission strong
<ul>
<li>Don't pursue weak projects</li>
<li>Appraise potential fatal flaws early</li>
</ul>
</li>
<li>Select appropriate journals
<ul>
<li>Pick an appropriate level of prestige and impact; lower is easier and quicker,
but it is still important to aim high (perhaps if you think a paper has a
10% chance of being accepted in a great journal, it's probably worth a shot)</li>
</ul>
</li>
<li>Learn from rejection</li>
<li>Make revisions flawless: Getting a revise and resubmit is an excellent
opportunity. If you present a perfect and respectful response to every
reviewer comment, then the manuscript has a very good chance of being accepted.</li>
<li>Desk rejects don't take up much time</li>
</ul>
<h2>
Professional collaborator leading</h2>
The first rule is to pick good collaborators.
Good collaborators should know how to write a good paper, select an appropriate
journal, and be willing and able to persist with revisions and resubmissions in
order to find a home for the paper.<br />
Collaborations can also take you out of your core area, leaving you in
less of a position to make judgements about where the article should be sent or
how it should be reframed.<br />
<h2>
Student leading</h2>
When a student is leading, this creates particular challenges.
In general, there is a difference between doctoral students, who are learning to
be independent scholars, and students completing smaller thesis projects (for me this
includes fourth year and masters by course work projects).jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-20147107645008349902014-05-28T20:57:00.003+10:002017-07-12T10:13:21.939+10:00Customising ProjectTemplate in RThis post talks about my workflow for getting started with a new data analysis project using the <code>ProjectTemplate</code> package. <br />
<a name='more'></a><h3>Update (24th August 2016)</h3><div>Over the last two years, I have been refining this customised version of ProjectTemplate.</div><div>I have <a href="https://github.com/jeromyanglim/AnglimModifiedProjectTemplate">more detailed information about the latest version here</a>.<br />
<br />
Video at Melbourne R Users July 4th 2017<br />
<iframe width="560" height="315" src="https://www.youtube.com/embed/pKwXOo4Kkiw" frameborder="0" allowfullscreen></iframe><br />
</div><h3>Overview of ProjectTemplate</h3>ProjectTemplate is an R Package which facilitates data analysis, encourages good data analysis habits, and standardises many data analytic steps. After many years of refining a data analysis workflow in R, I realised that I'd basically converged on something similar to ProjectTemplate anyway. However, my approach was not quite as systematic, and it took more effort than necessary to get started on a new project. Thus, since late 2013, I've been using ProjectTemplate to organise my R data analysis projects.<br />
While I have found ProjectTemplate to be an excellent tool, I realised that when I created a new data analysis project based on ProjectTemplate, I was repeatedly making a large number of customisations to the initial set of files and folders. Thus, I've now set up a repository to store these customisations so that I can get started on a new data analysis project more efficiently. The purpose of this post is to document these modifications.<br />
This post assumes a reasonable knowledge of R and ProjectTemplate. If you're not familiar with ProjectTemplate, you could check out the <a href="http://projecttemplate.net/">ProjectTemplate website</a> focusing particularly on the <a href="http://projecttemplate.net/getting_started.html">Getting Started section</a>. If you're really keen you could also watch an hour long <a href="https://www.youtube.com/watch?v=I9YNIi-QmR0">video on ProjectTemplate, RStudio, and GitHub</a><br />
<h3>General setup</h3>I have a copy of <a href="https://github.com/jeromyanglim/AnglimModifiedProjectTemplate">my customised version of the ProjectTemplate directory and file structure on github in the AnglimModifiedProjectTemplate repository</a>. Specifically, it has:<br />
<ol><li>Modifications to <code>global.dcf</code> as described below, </li>
<li>a blank <code>readme.md</code> </li>
<li>a couple of directories removed that I don't use (e.g., <code>diagnostics</code>, <code>logs</code>, <code>profiling</code>)</li>
<li>an initial <code>rmd</code> file with the customisations mentioned below in the <code>reports</code> directory</li>
<li>An <code>.Rproj</code> RStudio project file to enable easy launching of RStudio. </li>
<li>An additional <code>output</code> directory for storing tabular, text, and other output</li>
</ol>Thus, whenever I want to start a new data analysis project I can download and extract the <a href="https://github.com/jeromyanglim/AnglimModifiedProjectTemplate/archive/master.zip">zip file of the repository on github</a>.<br />
After creating a project folder, the following steps can then be skipped when using my customised template.<br />
<ul><li>Open RStudio and create RStudio Project in existing directory</li>
<li>Create <code>ProjectTemplate</code> folder structure with <code>library(ProjectTemplate); create.project()</code></li>
<li>Move ProjectTemplate files into folder</li>
<li>Modify <code>global.dcf</code></li>
<li>Setup rmd reports</li>
</ul>I also document below a few additional points about subsequent steps including:<br />
<ul><li>Setting up the data directory</li>
<li>Updating the readme file</li>
<li>Setting up git repository</li>
</ul><h3>Modifying global.dcf</h3>My preferred starting <code>global.dcf</code> settings are<br />
<pre><code>data_loading: on
cache_loading: off
munging: on
logging: off
load_libraries: on
libraries: psych, lattice, Hmisc
as_factors: off
data_tables: off
</code></pre>A little explanation:<br />
<ul><li><code>as_factors:</code> I do quite a bit of string processing, particularly on meta data and on output tables. I find the automatic conversion of strings into factors to be a really annoying feature. Thus, setting this to <code>off</code> is my preferred setting.</li>
<li><code>load_libraries:</code> I always have additional libraries so it makes sense to have this <code>on</code>. </li>
<li><code>libraries:</code> There are many common packages that I use, but I almost always make use of the above comma-separated list of packages.</li>
</ul><h3>Setup rmd files</h3><h4>Basics of such files</h4>The first line in the first chunk is always:<br />
<pre><code>```{r}
library(ProjectTemplate); load.project()
```
</code></pre>This loads everything required to get started with the project. <br />
<h4>Setup data folder</h4>ProjectTemplate automatically names resulting data.frames with a name based on the file name. This is convenient. However, it is often the case that the file names need to be changed from some raw data supplied or it may be that the original data format is not perfectly suited for importing. In that case, I store the raw data in a separate folder called <code>raw-data</code> and then export or create a copy in the desired format with the desired name in the <code>data</code> folder.<br />
<h4>Overriding default data import options</h4>Some data files cannot be imported using the default data import rules. Of course, you can change the file to comply with the rules. Alternatively, I think the standard solution is to add a file in the <code>lib</code> directory (e.g., <code>data-override.r</code>) that imports the data files. Give the imported data file the same name that ProjectTemplate would.<br />
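As a minimal sketch (the file name and import options here are hypothetical), such a file might look like this:<br />
<pre><code># lib/data-override.r (hypothetical example)
# import a semicolon-delimited file that the default rules mishandle,
# giving the data.frame the name ProjectTemplate would have used
survey <- read.csv("data/survey.csv", sep = ";",
    stringsAsFactors = FALSE)
</code></pre>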
<h3>Update readme</h3>I rename the file to README.md to make it clear that it is a markdown-formatted file. I can then add a little information about the project.<br />
<h3>Setup git repository</h3>If using github, I create a new repository on github. <br />
<h3>Output folder</h3>A common workflow for me is to generate tables, text, and figure output from the script which is then incorporated into a manuscript document. While I really like Sweave and RMarkdown, I often find it more practical to write a manuscript in Microsoft Word. I use the <code>output</code> folder to store tabular output, standard text output, and figures.<br />
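As a small illustration (the file name is hypothetical, and the <code>output</code> directory is assumed to exist), tabular output can be written out like this:<br />
<pre><code># illustrative: write a rounded correlation table to the output folder
correlations <- round(cor(mtcars[, c("mpg", "hp", "wt")]), 2)
write.csv(correlations, file = "output/correlation-table.csv")
</code></pre>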
In the case of tabular output, there is the task of ensuring the table is formatted appropriately (e.g., desired number of decimal places, cell alignment, cell borders, font, cell merging, etc.). I typically find this easiest to do in Excel. Thus, I have a file called <code>output-processing.xlsx</code>. I import the tabular data into this file and apply relevant formatting. This can then be incorporated into the manuscript. <a href="http://jeromyanglim.blogspot.com.au/2009/09/formatting-table-in-word-r-to-tab.html">Here are a few more notes about Table conversion in MS Word</a>.jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-51538873055800558412013-12-04T17:22:00.002+11:002017-04-10T10:52:39.035+10:00Using R to replicate common SPSS multiple regression output The following post replicates some of the standard output you might get from a multiple regression analysis in SPSS. A copy of the <a href="https://github.com/jeromyanglim/rcars">code in RMarkdown format is available on github</a>. The post was motivated by <a href="http://jeromyanglim.blogspot.com.au/2013/07/evaluating-potential-incorporation-of-r.html">this previous post that discussed using R to teach psychology students statistics</a>. <br />
<a name='more'></a><pre><code class="r">library(foreign) # read.spss
library(psych) # describe
library(Hmisc) # rcorr
library(QuantPsyc) # lm.beta
library(car) # vif, durbinWatsonTest
library(MASS) # studres
library(lmSupport) #lm.sumSquares
library(perturb) # colldiag
</code></pre>
In order to emulate SPSS output, it is necessary to install several add-on packages. The above <code>library</code> commands load the packages into your R workspace. I've highlighted in the comments the names of the functions that are used in this script. <br />
You may not have the above packages installed.
If not, run commands like:<br />
<ul>
<li><code>install.packages('foreign')</code></li>
<li><code>install.packages('psych')</code></li>
<li>etc.</li>
</ul>
for each of the above packages not installed or use the “packages” tab in RStudio to install.<br />
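Alternatively, here is a small convenience sketch (not part of the original script) that installs whichever of the required packages are missing:<br />
<pre><code class="r"># install any of the required packages that are not yet installed
pkgs <- c("foreign", "psych", "Hmisc", "QuantPsyc",
    "car", "MASS", "lmSupport", "perturb")
install.packages(setdiff(pkgs, rownames(installed.packages())))
</code></pre>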
Note also that much of this analysis could be performed with <a href="http://www.rcommander.com/">Rcommander</a>, which provides a more SPSS-style GUI environment.<br />
<h1>
Import and prepare data</h1>
<pre><code class="r">cars_raw <- read.spss("cars.sav", to.data.frame = TRUE)
# get rid of missing data listwise
cars <- na.omit(cars_raw[, c("accel", "mpg", "engine", "horse", "weight")])
</code></pre>
Ensure that <code>cars.sav</code> is in the working directory.<br />
<h1>
Quick look at data</h1>
<pre><code class="r"># note the need to deal with missing data
psych::describe(cars_raw)
</code></pre>
<pre><code>## var n mean sd median trimmed mad min max
## mpg 1 398 23.51 7.82 23.00 23.06 8.90 9.00 46.60
## engine 2 406 194.04 105.21 148.50 183.75 86.73 4.00 455.00
## horse 3 400 104.83 38.52 95.00 100.36 29.65 46.00 230.00
## weight 4 406 2969.56 849.83 2811.00 2913.97 947.38 732.00 5140.00
## accel 5 406 15.50 2.82 15.50 15.45 2.59 8.00 24.80
## year* 6 405 6.94 3.74 7.00 6.93 4.45 1.00 13.00
## origin* 7 405 1.57 0.80 1.00 1.46 0.00 1.00 3.00
## cylinder* 8 405 3.20 1.33 2.00 3.14 0.00 1.00 5.00
## filter_.* 9 398 1.73 0.44 2.00 1.79 0.00 1.00 2.00
## weightKG 10 406 1346.97 385.48 1275.05 1321.75 429.72 332.03 2331.46
## engineLitre 11 406 3.19 1.73 2.44 3.02 1.42 0.07 7.47
## range skew kurtosis se
## mpg 37.60 0.45 -0.53 0.39
## engine 451.00 0.69 -0.81 5.22
## horse 184.00 1.04 0.55 1.93
## weight 4408.00 0.46 -0.77 42.18
## accel 16.80 0.21 0.35 0.14
## year* 12.00 0.02 -1.21 0.19
## origin* 2.00 0.92 -0.81 0.04
## cylinder* 4.00 0.27 -1.69 0.07
## filter_.* 1.00 -1.04 -0.92 0.02
## weightKG 1999.43 0.46 -0.77 19.13
## engineLitre 7.41 0.69 -0.81 0.09
</code></pre>
<pre><code class="r">
dim(cars)
</code></pre>
<pre><code>## [1] 392 5
</code></pre>
<pre><code class="r">head(cars)
</code></pre>
<pre><code>## accel mpg engine horse weight
## 1 12.0 18 307 130 3504
## 2 11.5 15 350 165 3693
## 3 11.0 18 318 150 3436
## 4 12.0 16 304 150 3433
## 5 10.5 17 302 140 3449
## 6 10.0 15 429 198 4341
</code></pre>
<pre><code class="r">str(cars)
</code></pre>
<pre><code>## 'data.frame': 392 obs. of 5 variables:
## $ accel : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ engine: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horse : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight: num 3504 3693 3436 3433 3449 ...
## - attr(*, "na.action")=Class 'omit' Named int [1:14] 11 12 13 14 15 18 39 40 134 338 ...
## .. ..- attr(*, "names")= chr [1:14] "11" "12" "13" "14" ...
</code></pre>
<h1>
Fit model</h1>
<pre><code class="r">fit <- lm(accel ~ mpg + engine + horse + weight, data = cars)
</code></pre>
<h2>
Descriptive Statistics</h2>
<pre><code class="r"># Descriptive statistics
psych::describe(cars)
</code></pre>
<pre><code>## var n mean sd median trimmed mad min max range
## accel 1 392 15.52 2.78 15.50 15.46 2.52 8 24.8 16.8
## mpg 2 392 23.45 7.81 22.75 22.99 8.60 9 46.6 37.6
## engine 3 392 193.65 104.94 148.50 183.15 86.73 4 455.0 451.0
## horse 4 392 104.21 38.23 93.00 99.61 28.17 46 230.0 184.0
## weight 5 392 2967.38 852.29 2797.50 2909.64 945.90 732 5140.0 4408.0
## skew kurtosis se
## accel 0.27 0.43 0.14
## mpg 0.45 -0.54 0.39
## engine 0.69 -0.77 5.30
## horse 1.09 0.71 1.93
## weight 0.48 -0.76 43.05
</code></pre>
<pre><code class="r">
# correlations
cor(cars)
</code></pre>
<pre><code>## accel mpg engine horse weight
## accel 1.0000 0.4375 -0.5298 -0.6936 -0.4013
## mpg 0.4375 1.0000 -0.7893 -0.7713 -0.8072
## engine -0.5298 -0.7893 1.0000 0.8959 0.9339
## horse -0.6936 -0.7713 0.8959 1.0000 0.8572
## weight -0.4013 -0.8072 0.9339 0.8572 1.0000
</code></pre>
<pre><code class="r">rcorr(as.matrix(cars)) # include sig test for all correlations
</code></pre>
<pre><code>## accel mpg engine horse weight
## accel 1.00 0.44 -0.53 -0.69 -0.40
## mpg 0.44 1.00 -0.79 -0.77 -0.81
## engine -0.53 -0.79 1.00 0.90 0.93
## horse -0.69 -0.77 0.90 1.00 0.86
## weight -0.40 -0.81 0.93 0.86 1.00
##
## n= 392
##
##
## P
## accel mpg engine horse weight
## accel 0 0 0 0
## mpg 0 0 0 0
## engine 0 0 0 0
## horse 0 0 0 0
## weight 0 0 0 0
</code></pre>
<pre><code class="r"># scatterplot matrix if you want
pairs.panels(cars)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBygm_WP_jDnBQhIpNjcAFBQeUN2bEjNPWMyEzKAmmk8eICmAHbc7HjSLuRyo-fpq7WbbkalNz-bjBtfwWvgf8Pv1oOI_7nuvwoj-mUEOeD8MXdqSFvlHmPEZxFtwn9HC02c4wn_nbMw/s1600/Screen+Shot+2017-04-10+at+10.49.53+am.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="319" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjBygm_WP_jDnBQhIpNjcAFBQeUN2bEjNPWMyEzKAmmk8eICmAHbc7HjSLuRyo-fpq7WbbkalNz-bjBtfwWvgf8Pv1oOI_7nuvwoj-mUEOeD8MXdqSFvlHmPEZxFtwn9HC02c4wn_nbMw/s320/Screen+Shot+2017-04-10+at+10.49.53+am.png" width="320" /></a></div>
<br />
<h2>
Summary of model</h2>
<pre><code class="r"># r-square, adjusted r-square, std. error of estimate, overall ANOVA, df, p,
# unstandardised coefficients, sig tests
summary(fit)
</code></pre>
<pre><code>##
## Call:
## lm(formula = accel ~ mpg + engine + horse + weight, data = cars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.177 -1.023 -0.184 0.936 6.873
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 16.980778 0.977425 17.37 <2e-16 ***
## mpg 0.007476 0.019298 0.39 0.6987
## engine -0.008230 0.002674 -3.08 0.0022 **
## horse -0.087169 0.005204 -16.75 <2e-16 ***
## weight 0.003046 0.000297 10.24 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.7 on 387 degrees of freedom
## Multiple R-squared: 0.631, Adjusted R-squared: 0.627
## F-statistic: 166 on 4 and 387 DF, p-value: <2e-16
</code></pre>
<pre><code class="r">### additional info in terms of sums of squares
anova(fit)
</code></pre>
<pre><code>## Analysis of Variance Table
##
## Response: accel
## Df Sum Sq Mean Sq F value Pr(>F)
## mpg 1 577 577 200.8 <2e-16 ***
## engine 1 272 272 94.7 <2e-16 ***
## horse 1 753 753 261.8 <2e-16 ***
## weight 1 302 302 104.9 <2e-16 ***
## Residuals 387 1113 3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
</code></pre>
<pre><code class="r">
# 95% confidence intervals (defaults to 95%)
confint(fit)
</code></pre>
<pre><code>## 2.5 % 97.5 %
## (Intercept) 15.059049 18.902506
## mpg -0.030466 0.045418
## engine -0.013488 -0.002972
## horse -0.097401 -0.076938
## weight 0.002461 0.003630
</code></pre>
<pre><code class="r"># but can specify different confidence intervals
confint(fit, level = 0.99)
</code></pre>
<pre><code>## 0.5 % 99.5 %
## (Intercept) 14.450621 19.510934
## mpg -0.042478 0.057430
## engine -0.015153 -0.001308
## horse -0.100641 -0.073698
## weight 0.002276 0.003816
</code></pre>
<pre><code class="r">
# standardised coefficients
lm.beta(fit)
</code></pre>
<pre><code>## mpg engine horse weight
## 0.02101 -0.31093 -1.19988 0.93456
</code></pre>
<pre><code class="r">
# or you could do it manually
zcars <- data.frame(scale(cars)) # make all variables z-scores
zfit <- lm(accel ~ mpg + engine + horse + weight, data = zcars)
coef(zfit)[-1]
</code></pre>
<pre><code>## mpg engine horse weight
## 0.02101 -0.31093 -1.19988 0.93456
</code></pre>
<pre><code class="r">
# correlations: zero-order, semi-partial, partial; an obscure function
# seems to do it
sqrt(lm.sumSquares(fit)[, c(2, 3)])
</code></pre>
<pre><code>## dR-sqr pEta-sqr
## (Intercept) 0.53638 0.6620
## mpg 0.01000 0.0200
## engine 0.09487 0.1546
## horse 0.51711 0.6483
## weight 0.31623 0.4617
## Error (SSE) NA NA
## Total (SST) NA NA
</code></pre>
<pre><code class="r">
# or use own function
cor_lm <- function(fit) {
    # computes zero-order, partial, and semi-partial correlations
    # for each predictor in a fitted lm model
    dv <- names(fit$model)[1]
    dv_data <- fit$model[, dv]
    ivs <- names(fit$model)[-1]
    iv_data <- fit$model[, ivs]
    x <- fit$model
    # for each IV, regress the DV on the remaining IVs and keep the residuals
    x_omit <- lapply(ivs, function(X) x[, c(dv, setdiff(ivs, X))])
    names(x_omit) <- ivs
    fits_omit <- lapply(x_omit, function(X) lm(as.formula(paste(dv, "~ .")),
        data = X))
    resid_omit <- sapply(fits_omit, resid)
    # for each IV, regress it on the remaining IVs and keep the residuals
    iv_omit <- lapply(ivs, function(X) lm(as.formula(paste(X, "~ .")), data = iv_data))
    resid_iv_omit <- sapply(iv_omit, resid)
    # correlate raw and residualised variables to obtain each statistic
    results <- sapply(seq(ivs), function(i) c(zeroorder = cor(iv_data[, i],
        dv_data), partial = cor(resid_iv_omit[, i], resid_omit[, i]),
        semipartial = cor(resid_iv_omit[, i], dv_data)))
    results <- data.frame(results)
    names(results) <- ivs
    results <- data.frame(t(results))
    results
}
round(cor_lm(fit), 3)
</code></pre>
<pre><code>## zeroorder partial semipartial
## mpg 0.438 0.020 0.012
## engine -0.530 -0.155 -0.095
## horse -0.694 -0.648 -0.517
## weight -0.401 0.462 0.316
</code></pre>
<h2>
Assumption testing</h2>
<pre><code class="r"># Durbin Watson test
durbinWatsonTest(fit)
</code></pre>
<pre><code>## lag Autocorrelation D-W Statistic p-value
## 1 0.136 1.721 0.004
## Alternative hypothesis: rho != 0
</code></pre>
<pre><code class="r">
# vif
vif(fit)
</code></pre>
<pre><code>## mpg engine horse weight
## 3.085 10.709 5.383 8.736
</code></pre>
<pre><code class="r">
# tolerance
1/vif(fit)
</code></pre>
<pre><code>## mpg engine horse weight
## 0.32415 0.09338 0.18576 0.11447
</code></pre>
<pre><code class="r">
# collinearity diagnostics
colldiag(fit)
</code></pre>
<pre><code>## Condition
## Index Variance Decomposition Proportions
## intercept mpg engine horse weight
## 1 1.000 0.000 0.001 0.001 0.001 0.000
## 2 3.623 0.002 0.051 0.016 0.005 0.001
## 3 16.214 0.006 0.066 0.365 0.763 0.019
## 4 18.519 0.127 0.431 0.243 0.152 0.227
## 5 32.892 0.865 0.451 0.375 0.079 0.753
</code></pre>
<pre><code class="r">
# residual statistics
rfit <- data.frame(predicted = predict(fit), residuals = resid(fit), studentised_residuals = studres(fit))
psych::describe(rfit)
</code></pre>
<pre><code>## var n mean sd median trimmed mad min max
## predicted 1 392 15.52 2.21 16.11 15.80 1.40 3.13 20.06
## residuals 2 392 0.00 1.69 -0.18 -0.11 1.39 -4.18 6.87
## studentised_residuals 3 392 0.00 1.01 -0.11 -0.07 0.82 -2.49 4.47
## range skew kurtosis se
## predicted 16.93 -1.61 4.10 0.11
## residuals 11.05 0.75 1.10 0.09
## studentised_residuals 6.95 0.81 1.38 0.05
</code></pre>
<pre><code class="r">
# distribution of standardised residuals
zresid <- scale(resid(fit))
hist(zresid)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivt2kFCtWTkTYMYC653DgGgOnT5HFRFUO0f33nFhCOaJIMsLtIC8FAjOzolFE2_6fRB9kma3theKoPD37AiTuHZJpvfU1ylL2wy1h_YgOcGj4D_p9DSJrwIi1mnfxeKamw3fU5G-9n2Q/s1600/Screen+Shot+2017-04-10+at+10.50.18+am.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEivt2kFCtWTkTYMYC653DgGgOnT5HFRFUO0f33nFhCOaJIMsLtIC8FAjOzolFE2_6fRB9kma3theKoPD37AiTuHZJpvfU1ylL2wy1h_YgOcGj4D_p9DSJrwIi1mnfxeKamw3fU5G-9n2Q/s320/Screen+Shot+2017-04-10+at+10.50.18+am.png" width="318" /></a></div>
<br />
<br />
<pre><code class="r"># or add normal curve http://www.statmethods.net/graphs/density.html
hist_with_normal_curve <- function(x, breaks = 24) {
    h <- hist(x, breaks = breaks, col = "lightblue")
    xfit <- seq(min(x), max(x), length = 40)
    yfit <- dnorm(xfit, mean = mean(x), sd = sd(x))
    yfit <- yfit * diff(h$mids[1:2]) * length(x)  # rescale density to counts
    lines(xfit, yfit, lwd = 2)
}
hist_with_normal_curve(zresid)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFyjyTdpIVh7LOETOH_O2GIZg6Ejk1AjgxAuVegUhBHO8MqjJVD8qC_E3tyPGq5VBhQ1IA7-gLdu0vZgFmHCzChq6kb1l9R1tfOgYCgZb_Tvz_dQy5WRAjBMwXRGHVUdrDUUNCSgC-xA/s1600/Screen+Shot+2017-04-10+at+10.50.46+am.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="312" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjFyjyTdpIVh7LOETOH_O2GIZg6Ejk1AjgxAuVegUhBHO8MqjJVD8qC_E3tyPGq5VBhQ1IA7-gLdu0vZgFmHCzChq6kb1l9R1tfOgYCgZb_Tvz_dQy5WRAjBMwXRGHVUdrDUUNCSgC-xA/s320/Screen+Shot+2017-04-10+at+10.50.46+am.png" width="320" /></a></div>
<br />
<pre><code class="r">
# normality of residuals
qqnorm(zresid)
abline(a = 0, b = 1)
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW4DjqQjZHHp9bXkUNuQu0uz8LKJkJpVVi73AsoTo5Oj7CWYNylResC_0u7ueIhwYboMlXOcln_0yqXPUW52QCxtJ5YwtIcDOQ-sWCDUv9wBWAaIHzK6PawFPfmBH7XbHuL_Bzniws3Q/s1600/download.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjW4DjqQjZHHp9bXkUNuQu0uz8LKJkJpVVi73AsoTo5Oj7CWYNylResC_0u7ueIhwYboMlXOcln_0yqXPUW52QCxtJ5YwtIcDOQ-sWCDUv9wBWAaIHzK6PawFPfmBH7XbHuL_Bzniws3Q/s320/download.png" width="320" /></a></div>
<br />
<pre><code class="r">
# plot predicted by residual
plot(predict(fit), resid(fit))
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD7FkC_RqnAA2acceX2U2dYliccjTJKQ_HWMe6BlVcZZZilOt421vZQ3g9K_CClYPSf6YoTyNs1oSLFQPNd3zjUsFtaRBNM5302sZ_zeya4b_QZ2DkqF1YCrEQAGWZIWHyf9k5wVWtaQ/s1600/nqqplot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhD7FkC_RqnAA2acceX2U2dYliccjTJKQ_HWMe6BlVcZZZilOt421vZQ3g9K_CClYPSf6YoTyNs1oSLFQPNd3zjUsFtaRBNM5302sZ_zeya4b_QZ2DkqF1YCrEQAGWZIWHyf9k5wVWtaQ/s320/nqqplot.png" width="320" /></a></div>
<br />
<pre><code class="r">
# plot dependent by residual
plot(cars$accel, resid(fit))
</code></pre>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXiEXcAqbtAPpgwlI359CzpJnwLJ6laB6Ks-v71KPG1YELqlwYlfHlC2_WHiCRjkZV_r0AKYpvoI_7IgH87ErGJybWOFliJWDcOfYcFwQ2u044C2tmrO2508dJedgAfDjYiTlC8dCzxg/s1600/plotdv.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjXiEXcAqbtAPpgwlI359CzpJnwLJ6laB6Ks-v71KPG1YELqlwYlfHlC2_WHiCRjkZV_r0AKYpvoI_7IgH87ErGJybWOFliJWDcOfYcFwQ2u044C2tmrO2508dJedgAfDjYiTlC8dCzxg/s320/plotdv.png" width="320" /></a></div>
<br />jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-42571798100434844742013-11-19T14:46:00.000+11:002013-11-19T15:23:35.666+11:00Writing a Concise Introduction to a Psychology Journal Article: An Article Deconstruction<p>I often talk about article deconstruction as a useful method for extracting
principles for writing journal articles. The following is an <a href="http://jeromyanglim.blogspot.com.au/2009/09/introduction-to-journal-article.html">article
deconstruction</a>
of the introduction section of Fujita and Diener (2005). The writing principles
extracted may be relevant to others writing introductions to journal articles in
psychology.</p>
<a name='more'></a>
<h1>The article</h1>
<p>The article analyses longitudinal data on the stability of life
satisfaction. Both authors and especially Ed Diener are major figures in
well-being research. As I was reading the article, I found the writing style to
be particularly engaging. Thus, I thought it would be a good article to
deconstruct in order to identify relevant writing principles.
<a href="http://academic.udayton.edu/jackbauer/Readings%20595/Fujita%2005%20happy%20set%20point%20copy.pdf">Here is a PDF of the
article</a></p>
<blockquote><p>Fujita, F., & Diener, E. (2005). Life satisfaction set point: Stability and
change. Journal of Personality and Social Psychology, 88(1), 158.</p></blockquote>
<h1>Paragraph Descriptions</h1>
<p>This section analyses each paragraph of the introduction to extract writing
principles.</p>
<h2>1. Overview of study</h2>
<ul>
<li>Description:
<ul>
<li>The opening sentence of the first paragraph states the purpose of the
journal article. i.e., "The purpose of this study was ..."</li>
<li>The second sentence elaborates briefly on the core concept</li>
<li>The third sentence states the empirical method used in the study.</li>
</ul>
</li>
<li>Analysis:
<ul>
<li>This structure really gets to the point quickly about what the study
is about both theoretically and methodologically.</li>
<li>Explicit discussion of the importance of the topic is delayed until
Paragraph 2. This differs from some other papers which commence with
a more general opening paragraph that alludes to the importance of
the problem. It also differs from the hourglass model of lab report
writing (which I dislike quite a bit) which often does not get to the
point of the paper until the final paragraph.</li>
</ul>
</li>
</ul>
<h2>2. Importance of research question</h2>
<ul>
<li>Description:
<ul>
<li>The paragraph is concerned with the importance of the research question.</li>
<li>Theoretical argument for importance: essential to understanding
relationship between core variables in conceptual space.</li>
<li>Applied argument: relevant to societal interventions to improve
people's lives</li>
</ul>
</li>
<li>Analysis
<ul>
<li>I often think of framing the importance in terms of overcoming a gap in
the literature. Instead, this paragraph focuses on the importance of the
research question in absolute terms. Importance of the study is reserved
for later after a critique of the literature is presented. This approach
overcomes some of the challenge of trying to fit too much in the opening
few paragraphs (i.e., trying to summarise a complex critique of the
literature in order to justify the research).</li>
</ul>
</li>
</ul>
<h2>3. Historical development of set-point theory</h2>
<ul>
<li>Description:
<ul>
<li>The first sentence highlights the literature on the broader topic (i.e.,
SWB) and uses a book length treatment as a general citation.</li>
<li>Subsequent sentences present the contributions of several key authors
who have discussed the core idea of the paper. "Headey and Wearing ...
proposed ...", etc.</li>
<li>There are also elaboration sentences.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>The paragraph starts with the general research context, and then
immediately moves to present historical development of the core idea.</li>
<li>One of the challenges of writing a literature review is working out what
to cover and how much general introduction should be provided. In this
case, only a single sentence is provided to link with the general
literature. So, for example, the paper does not provide general
definitions of life satisfaction or subjective well-being.</li>
</ul>
</li>
</ul>
<h2>4. The evidence for stability</h2>
<ul>
<li><p>Description</p>
<ul>
<li>It starts with a general claim: "lines of data suggest ...".</li>
<li>It then presents two pieces of evidence: I.e., "First, ..."; "Second, ..."</li>
<li>In relation to the first point made, there are two sentences. The first
sentence states a general empirical relationship (e.g., "X is related to
Y"). The second sentence provides an illustrative example of a study that
shows the relationship and reports the specific findings. "For example,
Eid and Diener (2004) found ..."</li>
</ul>
</li>
<li><p>Analysis</p>
<ul>
<li>This paragraph links closely with the next one. The paper is about
stability and change. Thus, devoting a paragraph each to stability and
change provides a way of presenting both perspectives. This model would
work in many literature reviews where two contrasting ideas are presented.</li>
<li>The use of the general claim followed by an illustrative study is useful
for making discussion of the literature more concrete.</li>
</ul>
</li>
</ul>
<h2>5. The evidence for change</h2>
<ul>
<li>Description:
<ul>
<li>The first sentence links with the previous paragraph and states the topic
of the paragraph. E.g., "Despite evidence for <code>{claim in previous
paragraph}</code>... there are also indications that <code>{alternative claim
elaborated in this paragraph}</code>..."</li>
<li>Then a series of findings are presented from the literature. Various
transitional words are used to link ideas: "There are ..."; "Also, ...";
"Further, ..."; "Thus, ..."</li>
</ul>
</li>
</ul>
<h2>6. Critique of existing studies</h2>
<ul>
<li>Description:
<ul>
<li>The opening paragraph acknowledges that the central claim made in the
paper has been made before, but that the literature has not yet "fully
explored <code>{the idea}</code>".</li>
<li>Then three major critiques are presented using the linking words: "For one thing ...";
"Furthermore, ..."; "Another limitation ..."</li>
</ul>
</li>
<li>Analysis
<ul>
<li>This paragraph serves to highlight the gap in the literature and
justify the present study.</li>
<li>Interestingly, the critique does not cite any particular studies. Rather
it just states the limitations of the existing literature.</li>
<li>The points made in the critique constitute limitations of previous
research and not fundamental flaws.</li>
<li>Both not citing specific studies and framing issues as limitations helps
to create a positive respectful tone while at the same time justifying the
current study.</li>
<li>Given that the idea itself is not new, this study provides a good example
of showing how the evidence and the methodology can be used to argue for
the unique contribution.</li>
</ul>
</li>
</ul>
<h2>7. Description of current study</h2>
<ul>
<li>Description
<ul>
<li>The paragraph describes the method and sample.</li>
<li>It is stated how this sample helps answer the research question. I.e., "By
using <code>{aspect of method in current study}</code>, we overcome one of the
limitations of past research <code>{state limitation}</code>.</li>
<li>It is stated how the method/design helps answer the research question.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>The benefits of the method of the current study are summarised and
contrasted with the previous literature. Thus, the paragraph also
highlights the gap in the previous literature and the importance of the
current study.</li>
</ul>
</li>
</ul>
<h2>8. Hypotheses</h2>
<ul>
<li>Description
<ul>
<li>Two hypotheses are presented.</li>
<li>Assorted justification for hypotheses are interspersed:
<ul>
<li>"on the basis of past findings"</li>
<li>"because ..."</li>
<li>whole sentences of the form: "people are likely ..."</li>
</ul>
</li>
</ul>
</li>
<li>Analysis
<ul>
<li>The words "hypothesized" and "predicted" seem to be used interchangeably.</li>
<li>The hypotheses are expressed in a fairly verbal manner, still leaving some
scope for how they will be converted into a numeric test.</li>
<li>In contrast to some studies, hypotheses are not numbered in any formal way.
This creates a more flexible, concise and informal approach to specifying
expectations. In many respects I prefer this given that ultimately the
analyses are the evidence.</li>
</ul>
</li>
</ul>
<h2>9. Additional questions examined</h2>
<ul>
<li>Description
<ul>
<li>This paragraph introduces additional, ancillary research questions.</li>
<li>The first sentence links the main research question to the additional
research questions: "In addition to <code>{main research question}</code>, we
examined some related questions."</li>
<li>The remaining sentences are: (2) description of first additional question; (3)
basic expectations and justification for first additional question; (4)
description of second additional question.</li>
</ul>
</li>
<li>Analysis
<ul>
<li>This provides an interesting approach for how to handle additional
questions that will be addressed by the analyses in a study. There may not
be space to address the full literature on these additional questions.
Even if space was available, extensive discussion of the literature on the
additional questions may distract the reader from the core research
question.</li>
<li>In general the paragraph frames and justifies the questions as being
related to the primary question, but not much time is spent discussing the
specific issues.</li>
</ul>
</li>
</ul>
<h1>General reflection on the introduction:</h1>
<h2>Focus and length</h2>
<ul>
<li>It is a relatively short and focussed introduction. Nine paragraphs is not
long. It's about one APA journal page of double column text. Only about three
paragraphs represent a literature review with references and the like, plus
one more for a critique.</li>
<li>In its focus it implies that certain topics don't need to be discussed. It
is assumed that the reader is familiar with them, or at least that it
would be distracting from the goals of the paper to have to address
such issues. For example:
<ul>
<li>It does not define SWB or discuss the relationship between SWB and
personality.</li>
<li>It does not attempt to be a complete review of all studies that have
been conducted on the stability or change of SWB.</li>
</ul>
</li>
<li>There are no subheadings. This makes sense given the length. However, it could
easily have had two subheadings, one for the literature review and one for the
current study.</li>
</ul>
<h2>Citation practice</h2>
<p>In a previous post I describe the importance of <a href="http://jeromyanglim.blogspot.com.au/2009/09/how-to-write-literature-review-in.html">making the link between
citations and argument clear</a>.
So for example, it should be clear whether any claim made is a proposal, empirical finding, or something else.
The paper uses the following strategies:</p>
<ul>
<li>Specific words showing link
<ul>
<li>"<code>{Author}</code> proposed"; "<code>{Author}</code> found"; "<code>{Author}</code> suggested"</li>
<li>"Advanced by <code>{one author and colleagues}</code> <code>{insert multiple references}</code>"</li>
</ul>
</li>
<li>Implied that the reference provides empirical support for the claim asserted:
<ul>
<li>"research indicates that <code>{finding}</code> <code>{reference}</code>";</li>
<li>"There are <code>{finding}</code> <code>{reference}</code>";</li>
</ul>
</li>
<li>Illustrative examples with implied empirical citation:
<ul>
<li>"It has been found that <code>{finding}</code> ...such as <code>{one domain}</code>
<code>{reference}</code> another domain <code>{reference}</code>"</li>
</ul>
</li>
</ul>
<h2>Pronouns</h2>
<ul>
<li>The first person pronoun "we" is used quite a lot. Rather than "it was
hypothesized that", the authors write "we hypothesized that". This creates
a fairly engaging tone.</li>
</ul>
<h1>Comparison to my previous writing on introductions</h1>
<p>I've discussed previously about <a href="http://jeromyanglim.blogspot.com.au/2009/12/how-to-write-introduction-section-in.html">how to write an introduction in psychology</a>.
In that post I present one structure for an introduction roughly as follows:</p>
<ul>
<li>Opening (Aim, Gap, Importance, Method)</li>
<li>Literature review</li>
<li>Current study (Study description, Expectations)</li>
</ul>
<p>While fairly similar, this paper differed to that structure in a few ways:</p>
<ul>
<li><strong>Aim</strong> and <strong>method</strong> were placed in the first paragraph. A more fluffy but engaging
opening paragraph was omitted.</li>
<li><strong>Importance</strong>: The second paragraph presents importance.</li>
<li><strong>Gap</strong>: Gap is presented at the end of the literature review component as
a presentation of common limitations with the existing literature. Gap also is
addressed when articulating the current study. Design aspects that overcome
past limitations are articulated.</li>
<li><strong>Study description</strong> and <strong>expectations</strong> are basically as described in the
post.</li>
</ul>
jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-54932764880536991832013-11-11T18:44:00.002+11:002013-11-12T09:54:42.716+11:00How to convert manual APA references to Endnote references in Word<p>When collaborating on a journal article with colleagues I sometimes get
Word documents that have manually formatted references.
I often want to convert the manual references to Endnote references.
The following post discusses a workflow for doing this.</p>
<a name='more'></a>
<h3>Context</h3>
<p>The reality is that most journals in psychology require or prefer submissions to
be in Word format. Endnote works reasonably well for formatting APA style
references in Microsoft Word. Furthermore, most of my students and collaborators
are familiar with Word.</p>
<p>So, I sometimes end up with Word documents written with manual references.
I could just continue on with manual references, but that has a whole host of
problems: (1) it takes about 20 minutes to do a check that citations match the
references, and every time the document is edited, this check needs to be
performed again; (2) there are so many rules when it comes to APA style that it
is better to get an automated tool like Endnote to apply them.</p>
<p>So, I want to convert the manual referencing to Endnote format.
Here's one way to do it.</p>
<h3>Getting references into Endnote</h3>
<p>The first step is to get the references into Endnote.
I asked <a href="http://academia.stackexchange.com/questions/13923/how-to-automatically-import-apa-references-into-reference-manager">here</a> about automatic import of lists of references into reference databases, but I have not found a solution.
So for now, one option is just to copy and paste each reference into <a href="http://scholar.google.com">Google
Scholar</a>.</p>
<p>Before beginning the process:</p>
<ol>
<li>Create and open the Endnote database for the journal article</li>
<li>Configure Google Scholar to use Endnote as its default reference manager
(this should then show an "import into Endnote" button for each reference).</li>
<li>Configure your browser to automatically open the "enw" file that results when
you click "import into endnote"</li>
</ol>
<p>To perform the search for each reference, sometimes pasting the whole reference
into the search box will work, other times you need to only provide a portion of
the search such as the title or author and year.
For extra speed, I have a <a href="http://jeromyanglim.tumblr.com/post/33632282039/alfred-more-than-just-a-keyboard-shortcut-to-open-a">Google Scholar search shortcut</a> that searches the highlighted text using Alfred.</p>
<p>Also for each reference, it is often necessary to add additional information
missed by Google Scholar. Scholar does a good job with journal articles, but
misses information in books, book chapters, and so on.</p>
<h3>Modifying citations</h3>
<p>So the existing Word document has manually written APA citations that need to be
modified to Endnote format. One approach to doing this is to convert all the
manually written citations into temporary Endnote citations. Then running format
citations in Endnote will lead Endnote to attempt to match each citation to
a reference in the Endnote database.</p>
<p>The simplest way to do this is to just convert the parentheses around citations to
curly braces. I.e., <code>(Smith, 2009)</code> becomes <code>{Smith, 2009}</code>. This works fairly
well, but there are a few things to keep in mind.</p>
<ul>
<li>Removing the "and" symbol between references will improve Endnote matching. So
make <code>(Smith & Jones, 2000)</code> into <code>{Smith Jones, 2000}</code></li>
<li>If you have prefix or postfix text then add the <a href="http://jeromyanglim.tumblr.com/post/54999395670/notes-on-temporary-citations-in-endnote-for-apa-style">temporary symbols described
here</a>:
e.g., <code>(e.g., Smith, 2000)</code> becomes <code>(e.g., \Smith, 2000)</code></li>
<li>If the author's name appears in the text outside the parentheses, then either put
a comma before the reference to just show the year, or add the <code>@@author-year</code>
code to include the author in text but generated by Endnote.</li>
</ul>
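<p>If a document has many citations, the first two rules can be roughed out as find-and-replace patterns. The following R sketch is illustrative only: the regular expression is mine, it handles only simple author-year citations, and the ampersand replacement crudely affects every "& " in the text:</p>
<pre><code class="r">txt <- "Earlier work (Smith & Jones, 2000) showed ..."
txt <- gsub("\\(([A-Z][^()]*, [0-9]{4})\\)", "{\\1}", txt)  # parentheses to braces
txt <- gsub("& ", "", txt)  # drop ampersands to improve Endnote matching
txt
#> "Earlier work {Smith Jones, 2000} showed ..."
</code></pre>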
<p>So once this is all done, running format temporary citations should lead Endnote
to take you through each citation, asking you to link it to the
corresponding reference. And if you've done a good job of the preceding steps,
the first match should correspond to the correct article in most cases.</p>
<h3>Checking</h3>
<p>At this point, the Endnote generated references may need to be moved into their
appropriate location and formatted as required.</p>
<p>The final step is to check all the references and citations.</p>
<ul>
<li>Common errors for Google Scholar References:
<ul>
<li>No page range (missing end page number)</li>
<li>Case issues in title</li>
<li>Case issues in journal name</li>
<li>Issues with hyphens</li>
</ul>
</li>
</ul>
<p>There can also be some issues with the citations. When diagnosing problems, one
strategy is to convert all citations to unformatted citations in Endnote.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-15503700521840698112013-10-10T11:01:00.000+11:002013-10-10T11:06:11.104+11:00Tutorials, Answers, and Data Files for Multivariate Research Methods Course using SPSS and Amos<p>I recently developed a set of tutorials for teaching research
methods using SPSS and Amos to I/O psychology students. I thought they might be useful for other instructors teaching intermediate multivariate research methods to social and behavioural science students, or for people learning such methods themselves. Thus, I have made the resources available as a downloadable repository.</p>
<a name='more'></a>
<p>Each tutorial includes a set of exercises, data, and extensive answers. A particular emphasis is on using syntax, reproducible workflow in SPSS, managing metadata, and scale construction.</p>
<p>It contains six tutorial exercises.</p>
<ul>
<li>Introduction to data analysis</li>
<li>Correlation and regression</li>
<li>Group differences</li>
<li>Moderators and mediators</li>
<li>Exploratory factor analysis</li>
<li>Confirmatory factor analysis</li>
</ul>
<p>For example, <a href="https://github.com/jeromyanglim/spss-research-methods-tutorials/blob/master/06-sem-confirmatory-factor-analysis/instructions/cfa-exercise-1.docx?raw=true">here is the tutorial on confirmatory factor analysis with Amos in docx format</a>. The repository also includes related data files.</p>
<p><strong>GITHUB Repository Address:</strong> <a href="https://github.com/jeromyanglim/spss-research-methods-tutorials">https://github.com/jeromyanglim/spss-research-methods-tutorials</a></p>
<p>Each exercise includes several folders</p>
<ul>
<li><strong>Instructions</strong>: This folder includes one or more Word documents with the
exercise and answers. These files should be your starting point for getting an understanding of the tutorials.</li>
<li><strong>data</strong>: This folder includes raw data and meta data used in the tutorial
exercises. There are often raw csv files as well as various SPSS sav files.
The exercises are designed to teach students how to import and process csv files
in SPSS.</li>
<li><strong>output</strong>: These folders often include a copy of much of the SPSS output in
PDF form as well as some syntax files.</li>
</ul>
<p>To use the repository it is recommended that you download the <a href="https://github.com/jeromyanglim/spss-research-methods-tutorials/archive/master.zip">ZIP
file</a>.</p>
<p>Earlier versions of the corresponding lectures can basically be found in the <a href="http://jeromyanglim.blogspot.com.au/2009/09/teaching-resources.html">teaching resources
section of my website</a>
under multivariate methods.</p>
<p><strong>Author:</strong> Dr Jeromy Anglim, Deakin University</p>
<p><strong>Licence:</strong> Tutorial exercises are given a creative commons licence <a href="http://creativecommons.org/licenses/by/3.0/">CC BY
3.0</a>. Raw data files and data
descriptions retain whatever licence they had previously.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-56621649282617143592013-07-17T18:41:00.000+10:002013-07-17T18:42:15.756+10:00Evaluating the Potential Incorporation of R into Research Methods Education in Psychology<p>I was recently completing some professional development activities that required
me to write a report on a self-chosen topic related to diversity in student
backgrounds. I chose to use the opportunity to reflect on the potential for
using R to teach psychology students research methods. I thought I'd share the
report in case it interests anyone.</p>
<a name='more'></a>
<h2>Abstract</h2>
<p>Research methods is fundamental to psychology education at university. Recently, open source software called R has become a compelling alternative to the traditionally used proprietary software called SPSS for teaching research methods. However, despite many strong equity and pedagogical arguments for the use of R, there are also many risks associated with its use. This report reviews the literature on the role of technology in research methods university education. It then reviews literature on the diversity of psychology students in terms of motivations, mathematical backgrounds, and career goals. These reviews are then integrated with a pedagogical assessment of the pros and cons of SPSS and R. Finally, recommendations are made regarding how R could be best implemented in psychology research methods teaching.</p>
<h2>Introduction</h2>
<p>Training in research methods is a fundamental component of university education in psychology. However, for many reasons subjects in research methods are challenging to teach. Students have diverse mathematical, statistical, and computational backgrounds; students often lack motivation as they struggle to see the relevance of statistics. These issues are compounded by undergraduate majors in psychology that typically have several compulsory research methods subjects. Given the competition for entry into fourth year and post-graduate programs, such research methods subjects can be threatening to struggling students.</p>
<p>As with many other universities, research methods in psychology at Deakin University has largely been taught using software called SPSS. This software is typically taught as a menu-driven program that is used to analyse data enabling standard data manipulation, analyses, and plotting. While SPSS is relatively user-friendly for standard analyses, there are several problems with teaching students how to use it. In particular, it is very expensive; thus, students cannot be assumed to have access to it either from home for doing assignments or in future jobs. In addition, while SPSS makes it easy to perform standard analyses, it is very difficult to alter what SPSS does to perform novel analyses. Thus, for many reasons some lecturers are seeking alternative statistical software for teaching research methods.</p>
<p>While there are many programs for performing statistical analysis, one particularly promising program, known simply as "R", has emerged as a viable alternative to SPSS. R is open source, so it is free for students and staff. Thus, students can use R from home when completing assignments and can use it in any future job. It has a vast array of statistical functionality. Despite these benefits, R presents several challenges to incorporation into psychology: analyses are typically performed using scripts; it is often less clear how to run certain analyses; and the program often assumes the mental model of a statistician rather than an applied researcher.</p>
<p>Thus, the current report had the following aims. The first aim was to evaluate the pros and cons of using R to teach psychology students research methods. The second aim was to evaluate how best R could be incorporated. In order to achieve these aims, the report is structured into several parts. First, the general literature on software in statistics education is reviewed, with a particular focus on diversity in student backgrounds in applied fields. Second, the backgrounds and career goals of psychology students are presented with reference to the literature and practical experience. Third, the pros and cons of using R versus SPSS are presented. Finally, ideas about how best to incorporate R into statistics education are reviewed.</p>
<h2>Statistics education and the role of software</h2>
<p>There is a substantial literature on statistics education and the role of statistical software within it. Tishkovskaya and Lancaster (2012) provide one review of the challenges in statistics education. Their review is structured around teaching and learning, statistical literacy, and statistics as a profession. Of particular relevance to teaching statistics in psychology, they outline several problems and provide relevant references. With references taken from their paper, these issues include: inability to apply mathematics to real world problems (e.g., Garfield, 1995); mathematics and statistics anxiety and motivation issues in students (e.g., Gal & Ginsburg, 1994); inherent difficulty in students understanding probability and statistics (e.g., Garfield & Ben-Zvi, 2008); problems with background mathematical and statistical knowledge (e.g., Batanero et al 1994); the need to develop statistical literacy which translates into everyday life (e.g., Gal, 2002); and the need to develop assessment tools to evaluate statistical literacy. Tishkovskaya and Lancaster (2012) also reviewed potential statistics teaching reforms. They note that there is a need to provide contextualised practice, foster statistical literacy, and create an active learning environment.</p>
<p>Of particular relevance to the current review of statistical software, Tishkovskaya and Lancaster (2012) discuss the role of technology in statistics education. The importance of technology has increased as computers have become more powerful. This has enabled students to run powerful statistical programs on their computer. Some teachers have used this power to focus instruction on interpretation of statistical results rather than computational mechanics. Chance et al (2007) further note the value of using interactive applets to explore statistical concepts and taking advantage of internet resources in teaching.</p>
<p>Chance et al's (2007) review also summarises several useful suggestions for incorporating technology in statistics education. Moore (1997) notes the importance of balancing the use of technology as a tool with remembering that the aim is to teach statistics, not the tool per se. Chance et al (2007) note that particularly valuable uses of technology include analysing real datasets, exploring data graphically, and performing simulations. Chance et al (2007) also review statistical software packages for statistics education, noting both the advantages and disadvantages of menu-driven applications such as SPSS.</p>
<p>Chance et al (2007) offer several recommendations for incorporating technology into statistical education. First, they highlight the importance of getting students practicing not just performing analyses, but also focusing on interpretation. Second, they recommend that tasks be carefully structured around exploration so that students see the bigger picture and do not get overwhelmed with software implementation issues. Third, collaborative exercises can force students to justify to their fellow students their reasoning. Fourth, they encourage the use of cycles of prediction and testing, which technology can facilitate (e.g., proposing a hypothesis for a simulation and then testing it).</p>
<p>Chance et al (2007) summarise the GAISE report by Franklin and Garfield (2006) on issues to consider when choosing software to teach statistics. These include (a) "ease of data entry, ability to import data in multiple formats, (b) Interactive capabilities, (c) Dynamic linking between data, graphical, and numerical analyses, (d) Ease of use for particular audiences, and (f) Availability to students, portability" (p.19). Franklin and Garfield (2006) also discuss a range of other implementation issues, such as the amount of time to allocate to software exploration, how much the software will be used in the course, and how accessible the software will be outside class. Garfield (1995) suggests that computers should be used to encourage students to explore data using analysis and visualisation tools. Running simulations and exploring the resulting properties is also particularly useful. Thus, overall these general considerations regarding statistics education can inform the choice of statistical software. However, the above review also highlights that the choice of software is only a small part of the overall unit design process.</p>
<h2>Psychology students and the role of statistics</h2>
<p>Pathways of psychology study in Australian universities typically involve completing a three-year undergraduate major in psychology, then a fourth year, followed by postgraduate professional or research degrees at masters or doctoral level. As a result of student interest, specialisation, and competition for places, student numbers decline across year levels. From my experience at both Melbourne University and Deakin University, a ballpark estimate of student numbers as a percentage of first-year load would be 40% at second year, 35% at third year, 10% at fourth year, and 3% at postgraduate level, starting from first-year enrolments of one to two thousand students. Of course these are just rough estimates, but the point is that huge numbers of students receive a basic undergraduate education in psychology; in contrast, the few who go on to fourth year have both a high skill level in psychology and different needs regarding research methods.</p>
<p>Psychology students are taught using the scientist-practitioner model. A big part of science in psychology is research methods and statistics. Students typically complete two or three research methods subjects at undergraduate level, another unit in fourth year, and potentially further units at postgraduate level. The diverse nature of psychology student backgrounds, motivations, and career outcomes can make research methods a difficult subject to design and teach.
Psychology undergraduate students also have diverse career goals and outcomes. Many go on to some form of further study. Those who exit at the end of third year have diverse employment outcomes. For example, Borden and Rajecki (2000) describe one US sample in which income was lower than for many other majors and roles included administrative support (17.6%), social worker (12.6%), and counsellor (7.6%), along with a diverse range of other jobs. Of those who go on, some will continue with research, but others will go into some form of applied practice.</p>
<p>In terms of research methods in psychology, there is a diverse range of goals. First, research methods is meant to help all students learn to reason about the scientific literature in psychology. Second, for students who continue with psychology, research methods should give them the skills to complete a quantitative fourth-year and postgraduate thesis. For a subset of students, quantitative skills are part of a marketable skillset that they can take into future employment. Furthermore, for the small group of students who go on to do a PhD and then join academia, research methods skills are fundamental to the continuation of good research and the vitality of the discipline.</p>
<p>In addition to these diverse aims, psychology students have diverse backgrounds. In particular, there are typically no mathematics pre-requisites. By casual observation, many students seem motivated to find work in the helping professions, and particularly as clinical psychologists. Many studies have discussed the challenges of teaching statistics to psychology students. For example, Lalonde and Gardner (1993) proposed and tested a model of statistics achievement that combined mathematical aptitude and effort with anxiety and motivation as predictors.</p>
<p>Thus, in combination, this diversity in backgrounds and student goals introduces several challenges when teaching research methods. For some students the main goal is to develop a moderate degree of statistical literacy. For others, it is essential that they are at least able to analyse their thesis data in a basic way. A final group of advanced students needs skills that will allow them to model their data in a sophisticated way and contribute to the research literature. Thus, there is a tension between presenting ideas in an accessible way for all students and tailoring the material for advanced students so they can truly excel.</p>
<p>This tension exists in many different aspects of the research methods curriculum. Research methods can be taught with varying degrees of mathematical rigour and abstraction. Teaching can emphasise interpreting output or it can emphasise computational processes. It can also vary in the prominence of software versus ideas. In particular, the choice of statistical software interacts substantially with this balance between rigour and accessibility. For example, tools like SPSS are more limited than R, but such limits can make standard analyses easier.</p>
<p>Aiken et al (2008) reviewed doctoral education in statistics and found that most surveyed programs primarily used SAS or SPSS. They described curricular innovation as a process in which novel topics emerge and are then taken up through initiatives from substantive researchers. Textbooks and software that make techniques accessible to psychology graduates also facilitate the teaching process. In some respects, as R has become more accessible through usability innovation and as the needs of data analysts have become more advanced, the argument for R has become more compelling.</p>
<h2>Whether to use R in psychology research methods</h2>
<h3>Pros and cons of R</h3>
<p>The above review thus provides a background for understanding both statistics education in general and the diversity in the background and goals of psychological students. The following analysis compares and contrasts R and SPSS as software for teaching research methods in psychology. This initial comparison focuses on price, features, usability and other considerations.</p>
<p>In terms of price, an initial benefit of R is that it is free. It is developed under the GNU open-source licence and is free to both the university and students. In contrast, a student licence to SPSS for a year is around $200; a professional licence is around $2,000; and SPSS charges expensive licensing fees to the university. R would make it easier to get students to complete analyses from home. Requiring students to purchase SPSS creates equity issues and may even encourage some students to engage in software piracy. If, as Devlin et al (2008) suggest, essential textbooks create economic hardship, then even more expensive statistical software would compound the problem.</p>
<p>In terms of features, SPSS and R both run on Windows, OS X, and Linux. They both support most standard analyses that students may wish to run. However, R has a larger array of contributed packages. SPSS has several features that R lacks, including a data entry tool, a menu-driven GUI, and an output management system for tables and plots. R makes it a lot easier to customise analyses, perform reproducible research, and run simulations.</p>
<p>In terms of flexibility SPSS and R both have options for performing flexible analyses. However, R makes it a lot easier to gradually introduce customisation by building on standard analyses. It is also flexible in how it can be used because of the open source licence. R is particularly suited to advanced students who can benefit from the easier pathway it provides for growing statistical sophistication.</p>
<p>In terms of usability, R and SPSS are quite different. R assumes greater knowledge about statistics. SPSS has an interface that is more familiar to users of standard Windows programs, whereas R is a programming language whose mental model is less consistent with such programs. R has a steeper initial learning curve but a shallower intermediate curve, and it encourages students to gradually develop statistical skills. In particular, R has several quirks which create difficulties for novices (e.g., learning details of syntax, escaping spaces in file paths, treating strings as factors versus character variables, etc.). There are also many things that are easy in SPSS that are difficult in R. Some examples include: variable labels and modifying meta-data, editing and browsing loaded data, producing tables of output, viewing and browsing statistical output, generating all the possible bits of output for an analysis, importing data, standard analyses that SPSS already does, and interactive plotting.</p>
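<p>To give one example of such a quirk, R of this era silently converts strings to factors on import. The following is a minimal illustration, assuming a hypothetical CSV file with a text column called group:</p>
<pre><code># strings become factors by default, which often surprises novices
dat <- read.csv("scores.csv")
class(dat$group)   # "factor"

# overriding the default gives the behaviour novices usually expect
dat <- read.csv("scores.csv", stringsAsFactors = FALSE)
class(dat$group)   # "character"
</code></pre>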
<p>R and SPSS can also be compared in terms of existing resources. There are many online resources for both R and SPSS. Psychology-specific R resources exist but are less plentiful than for SPSS. Furthermore, existing psychology supervisors, research methods staff, and tutors are probably more familiar with SPSS which may cause issues when transitioning teaching to R. That said, many supervisors either train their students directly in the software that they want their students to use or they let the student handle details of implementation.</p>
<h3>Mental Models</h3>
<p>When choosing between SPSS and R it is worth considering the mental models required to use SPSS and R. These mental models both guide what needs to be trained and also may suggest the gap that needs to be closed between students' initial mental models and that which is required by the software.</p>
<p>The SPSS mental model is centred around a dataset. The typical workflow is as follows: (a) import or create data; (b) define meta data; (c) menus guide analysis choice; (d) dialog boxes guide choices within analyses; (e) large amounts of output are produced; (f) instructional material facilitates interpretation of output; (g) output can be copied and pasted into Word or another program for a final report. Writing custom statistical functions, or taking SPSS output and using it as input to subsequent analyses, is not encouraged for regular users. Thus, overall the system guides the user in the analysis.</p>
<p>In contrast, R requires that the user guides the software. Thus, the R workflow is as follows: (a) Setup raw data in another program; (b) import data where often the user will have multiple datasets, meta datasets, and other data objects (e.g., vectors, tables of output); (c) transform data as required using a range of commands; (d) perform analyses, where command identification may involve a Google search or looking up a book, and understanding arguments in a command can be facilitated by internal documentation and online tools; (e) because the resulting output is minimal, the user often has to ask for specific output using additional commands; (f) much of what is standard in SPSS requires a custom command in R, but also much of which does not exist in SPSS can be readily created by an intermediate user; it is much easier to extract out particular statistical results and use that as input for subsequent functions; (g) while output can be incorporated into Word or Excel, users are encouraged to engage in various workflows that emphasise reproducible research.</p>
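<p>To make this workflow concrete, the following is a minimal sketch of such a command-driven session; the file name and variables are hypothetical:</p>
<pre><code>dat <- read.csv("survey.csv")        # (b) import data
dat$total <- rowSums(dat[, 2:11])    # (c) transform data as required
fit <- lm(total ~ age, data = dat)   # (d) perform an analysis
summary(fit)                         # (e) ask for specific output
coef(fit)["age"]                     # (f) extract a result for further use
</code></pre>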
<h3>Summary</h3>
<p>Thus, overall SPSS is well suited to a menu-driven standardised analysis workflow which meets the needs of many psychology students. R is particularly suited to statisticians who need to perform a diverse range of analyses and are more comfortable with computer programming and statistics in general. R requires greater statistical knowledge and it encourages students to have a plan for their analyses. R also requires students to learn more about computing, including programming, the command-line, file formats, and advanced file management. The emphasis on commands creates a greater demand on declarative memory, which in turn makes R more suited to students who will perform statistical analysis regularly. However, the flexibility and nature of R means that it can be used in many more contexts than SPSS, such as demonstrating statistical ideas through simulation.</p>
<p>Overall, there are clearly pros and cons to both SPSS and R. R is particularly suited to more advanced students. Occasional users may be more productive initially with SPSS. That said, the many students who never go on with any data analysis work may learn as much or more by using R. It also remains an empirical question how different psychology students would handle R. Thus, the remainder of this report focuses on how R could be implemented most effectively.</p>
<h2>How to use R in psychology research methods</h2>
<p>When considering implementation of R in psychology, it is useful to look at existing textbooks and course implementations. When considering textbooks, it is important to note that psychology tends to use a particular subset of statistical analyses. It also often has analysis goals that differ from other fields. For example, there is a greater emphasis on theoretical meaning, effect sizes, complex experimental designs, test reliability, and causal interpretation. While there are many textbooks that teach statistics using R, only recently have books emerged that are specifically designed to teach R to psychology students. The two main books are Andy Field's "Discovering Statistics Using R" and Dan Navarro's "Learning Statistics with R". An alternative model is to take a more generic R textbook or online resource and combine it with a more traditional psychology textbook such as David Howell's "Statistical Methods for Psychology". In particular, there are many user friendly online resources for learning R such as http://www.statmethods.net/ or Venables, Smith and the R Core Team's "An Introduction to R". Whatever textbook option is chosen, an important part of learning R involves learning how to get help. Thus, training should include learning how to navigate online learning resources and internet question-and-answer sites, which are very effective in the case of R (e.g., stackoverflow.com).</p>
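<p>For example, the built-in help facilities that such training would cover include standard base R commands like the following:</p>
<pre><code>?t.test                          # open the help page for a known function
help.search("factor analysis")   # search installed documentation
apropos("test")                  # list objects whose names contain "test"
RSiteSearch("multilevel model")  # search R websites (opens a browser)
</code></pre>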
<p>Dan Navarro (2013) has written a textbook that teaches statistics to psychology students using R. Navarro (2013) presents several arguments for using R instead of a commercial statistics package. These include: (1) the benefits of the software being free and not locking yourself into expensive proprietary software; (2) that R is highly extensible and has many cutting edge statistical techniques; and (3) that R is a programming language and learning to program is a good thing. He also observes that while R has its problems and challenges, overall it provides the best currently available option. Thus, overall, his approach is to inspire the student to see the bigger picture about why they are learning R. Navarro then spends two chapters introducing the R programming language. Starting with simple calculations, many basic concepts of variables, assignment, extracting data, and functions are introduced. Then, standard statistical techniques such as ANOVA and regression are presented with R implementations.</p>
<p>Overall, both these textbooks provide insight into how R could be implemented. Teaching with R provides some opportunity to teach statistics in a slightly deeper way. However, various recipes can be provided to perform standard analyses. Teaching R also requires taking a little extra time to teach the language. The menu-driven interface to R called R Commander also provides a way of introducing R more accessibly. The infrastructure provided by R also provides the opportunity to introduce many important topics such as bootstrapping, simulation, power analysis, and customised formulas. Weekly analysis homework of a kind not easily possible with SPSS could consolidate R-specific skills.</p>
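<p>As a hedged illustration of the kind of simulation exercise that R makes easy (the values chosen here are arbitrary), a few lines suffice to demonstrate the sampling distribution of the mean:</p>
<pre><code>set.seed(123)
# means of 10,000 samples of 30 IQ-like scores
means <- replicate(10000, mean(rnorm(30, mean = 100, sd = 15)))
hist(means, main = "Sampling distribution of the mean (n = 30)")
sd(means)       # empirical standard error
15 / sqrt(30)   # theoretical standard error
</code></pre>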
<p>An additional issue of implementation relates to when R should be introduced. Fourth year provides one such opportunity: students who remain at this level tend to be more capable and have some initial experience in statistics. Fourth-year research methods is a very important subject. It is often designed to prepare students to analyse multivariate data. It is also designed to prepare students to analyse data on their own, including preliminary analyses, data cleaning, and transformations. R supports all the standard multivariate techniques that are currently taught at fourth-year level. These include PCA, factor analysis, logistic regression, DFA, multiple regression, multilevel modelling, CFA, and SEM. R also makes it easier to explore more advanced methods such as bootstrapping and simulations.</p>
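<p>To give a flavour, the following sketch shows a few of these techniques using built-in datasets; multilevel modelling and CFA/SEM would typically draw on contributed packages such as lme4 and lavaan:</p>
<pre><code>pca <- princomp(USArrests)                   # principal components analysis
print(pca$loadings)
fa <- factanal(mtcars[, 1:6], factors = 2)   # maximum likelihood factor analysis
lr <- glm(am ~ hp + wt, data = mtcars,
          family = binomial)                 # logistic regression
summary(lr)
</code></pre>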
<h2>Conclusion</h2>
<p>Ultimately, it is an empirical question whether using R would provide a more effective tool for research methods education in psychology. It may be useful to explore the idea with some low-stakes optional postgraduate training modules in R. Such programs may give a sense of the kinds of practical issues that arise when students learn to use R. Rolling R out to all of fourth-year psychology, in contrast, would be a high-risk exercise. It would be important to evaluate the student learning outcomes in a broad way. In particular, it would be important to examine any effect on analysis performance in fourth-year theses.</p>
<h2>References</h2>
<ul>
<li>Aiken, L. S., West, S. G., & Millsap, R. E. (2008). Doctoral training in statistics, measurement, and methodology in psychology: Replication and extension of Aiken, West, Sechrest, and Reno's (1990) survey of PhD programs in North America. The American Psychologist, 63(1), 32-50.</li>
<li>Batanero, C., Godino, J., Green, D., and Holmes, P. (1994). Errors and Difficulties in Understanding Introductory Statistical Concepts. International Journal of Mathematical Education in Science and Technology, 25 (4), 527–547.</li>
<li>Borden, V. M., & Rajecki, D. W. (2000). First-year employment outcomes of psychology baccalaureates: Relatedness, preparedness, and prospects. Teaching of Psychology, 27(3), 164-168.</li>
<li>Chance, B., Ben-Zvi, D., Garfield, J., and Medina, E. (2007). The Role of Technology in Improving Student Learning of Statistics. Technology Innovations in Statistics Education, 1(1). url: http://www.escholarship.org/uc/item/8sd2t4rr</li>
<li>Devlin, M., James, R., & Grigg, G. (2008). Studying and working: A national study of student finances and student engagement. Tertiary Education and Management, 14(2), 111-122.</li>
<li>Franklin, C. & Garfield, J. (2006). The GAISE (Guidelines for Assessment and Instruction in Statistics Education) project: Developing statistics education guidelines for pre K-12 and college courses. In G. Burrill (Ed.), 2006 NCTM Yearbook: Thinking and reasoning with data and chance. Reston, VA: National Council of Teachers of Mathematics.</li>
<li>Gal, I. (2002). Adults' Statistical Literacy: Meanings, Components, Responsibilities. With Discussion. International Statistical Review, 70(1), 1-51.</li>
<li>Gal, I. & Ginsburg, L. (1994). The Role of Beliefs and Attitudes in Learning Statistics: Towards an Assessment Framework. Journal of Statistics Education, 2(2). url: http://www.amstat.org/publications/jse/v2n2/gal.html</li>
<li>Garfield, J. (1995). How Students Learn Statistics. International Statistical Review, 63(1), 25-34.</li>
<li>Garfield, J. and Ben-Zvi, D. (2008). Developing Students' Statistical Reasoning: Connecting Research and Teaching Practice, Springer.</li>
<li>Lalonde, R. N., & Gardner, R. C. (1993). Statistics as a second language? A model for predicting performance in psychology students. Canadian Journal of Behavioural Science, 25, 108-125.</li>
<li>Moore, D.S. (1997). New pedagogy and new content: the case of statistics. International Statistical Review, 65, 123-165.</li>
<li>Navarro, D. (2013). Learning statistics with R: A tutorial for psychology students and other beginners (Version 0.3) url: http://ua.edu.au/ccs/teaching/lsr</li>
<li>Tishkovskaya, S., & Lancaster, G. (2012). Statistical education in the 21st century: a review of challenges, teaching innovations and strategies for reform. Journal of Statistics Education, 20(2), 1-55.</li>
<li>Wiberg, M. (2009). Teaching statistics in integration with psychology. Journal of Statistics Education, 17(1), 1-16.</li>
</ul>
jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-29568897848487660902013-03-15T18:55:00.000+11:002013-03-15T19:07:42.940+11:00Google Reader Replacements: Feedly and The Old ReaderThis post discusses the impending demise of Google Reader and configuring Feedly as a replacement.<br />
<a name='more'></a><br />
<!-- break -->
<br />
<h3>
Demise of Google Reader</h3>
I was very disappointed to read about Google terminating its <a href="http://googleblog.blogspot.com.au/2013/03/a-second-spring-of-cleaning.html">Google Reader</a> service.
<br />
Google Reader provided me with a great tool for following hundreds of blogs, journals, and assorted feeds.
The interface was clean and efficient.
I liked the keyboard shortcuts for navigation.
<br />
Some people are saying that Google+, Twitter, Facebook, Reddit, email newsletters, and so on are a substitute for Google Reader. This is rubbish. Google Reader is an efficient way of consuming and scanning new content based on providers that I care about. None of these other tools provide anything like this.
<br />
<br />
For bloggers, the concern about the end of Google Reader is that this is one of the major ways that people consume blog content. Even my own small blog has around a thousand RSS subscribers. The most popular RSS reader is Google Reader, and thus there is a concern that its closure may damage this connection between blogs and subscribers. As a consequence we might see fewer subscribers, then weaker incentives to blog, and ultimately less great blog content. Thus, I really hope that one or more high quality, trustworthy, multi-device, free web services emerge that continue to provide a great RSS reading experience. Hopefully, this is an opportunity for a service to emerge that is even better than Google Reader.
<br />
<h3>
Feedly</h3>
After an initial exploration I am having a good experience with <a href="http://www.feedly.com/">Feedly</a>.
When you log into Feedly with your Google Account, it immediately synchronises with your Google Reader account.
Supposedly Feedly will switch to their own backend when the Google Reader service ends.
Nonetheless, I have still exported my feeds directly from Google Reader using the Google Reader export facilities.<br />
<br />
I must admit that my first impressions of Feedly were worrying. However, a little persistence showed that I could replicate the Google Reader workflow.
<br />
<br />
First, Feedly runs both in the browser and on various mobile devices.
One drawback is that it does require the installation of a browser plugin and an app on mobile devices. But given that I have admin privileges, this wasn't a major issue.
<br />
<br />
To configure Feedly like Google Reader, see this <a href="http://blog.feedly.com/2013/03/14/tips-for-google-reader-users-migrating-to-feedly/">blog post for a few tips</a>.
<br />
<br />
After a few customisation steps I'm very happy.
In particular: (1) I set tile view for each of my categories; (2) I saved a browser bookmark that opens Feedly at a particular category. I keep my main feeds in a category called "core", so the default view when I click the bookmark is like what I was used to in Google Reader (I find the default Feedly homepage annoying); (3) I learnt the keyboard shortcuts, in particular j and k for navigating between posts (I had to set an exception in Vimium). This was something I really liked in Google Reader and it's great to see it still available in Feedly. Pressing the question mark key brings up the available shortcuts.
<br />
That said, it is early days and there is a lot of discussion about which service offers the best Google Reader replacement. I also need to build up trust when it comes to a provider of RSS services. I still need to see whether the migration from the Google Reader backend will be effective. I also don't yet understand Feedly's business model and therefore wonder how they will provide the service in the longer term.
<br />
<h3>
Alternatives</h3>
There's a discussion here of some of the <a href="http://webapps.stackexchange.com/questions/41591/alternatives-for-google-reader">alternatives</a>.
<br />
<br />
The Old Reader appears to be a popular choice. It offers an interface nearly identical to Google Reader. It doesn't require a browser plug-in. The <a href="http://blog.theoldreader.com/">developers seem friendly</a>. It also did a better job of rendering a few posts with mathematics (e.g., posts from the <a href="http://normaldeviate.wordpress.com/">Normal Deviate</a>, which Feedly struggled with).
<br />
<br />
Anyway, it's nice that at the moment there are at least two reasonable replacements for Google Reader. Presumably the preferred option will become clearer over the coming weeks and months.
jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-37013501705397852402012-07-23T23:20:00.000+10:002012-07-23T23:20:42.143+10:00Beamer presentations using pandoc, markdown, LaTeX, and a makefile<p>This post discusses the creation of beamer presentations using a combination of
markdown, pandoc, and LaTeX. This workflow offers the potential to reduce typing
and increase readability of beamer presentation source code. Source code for an example
presentation is provided containing markdown and LaTeX source code along with
a makefile for building the beamer PDF. </p>
<a name='more'></a>
<h3>Motivation</h3>
<p>I've used beamer quite a lot to prepare slides for both research and teaching
purposes (e.g.,
<a href="https://github.com/jeromyanglim/RMeetup_Workflow">this 2010 presentation on R Workflow</a>).
I've also written up a
<a href="http://jeromyanglim.blogspot.com.au/2010/08/getting-started-with-beamer-tips-and.html">guide to getting started with beamer</a> and
<a href="http://jeromyanglim.blogspot.com.au/2010/08/simple-beamer-template-for-getting.html">a simple beamer template</a>.</p>
<p>Nonetheless, for some time I've been concerned about the high ratio of markup to
content in beamer presentations. I even asked a question on TeX.SE on <a href="http://tex.stackexchange.com/questions/1264/typing-and-editing-beamer-presentations">strategies
for dealing with this
issue</a>.
I find that beamer markup is a burden. It interferes with content creation.
Creating, editing, and re-arranging slides is more difficult than it needs to
be. The high quantity of markup also interferes with readability.</p>
<p>There are several ways of dealing with this:</p>
<ul>
<li><a href="http://tex.stackexchange.com/questions/4106/what-are-a-good-set-of-macros-for-writing-beamer-presentations?lq=1">Use LaTeX macros</a>: I.e., to shorten common environments.
However, this reduces readability if it is ad hoc.</li>
<li><a href="http://tex.stackexchange.com/a/1303/151">Org Mode in Emacs</a>. This sounds
good, but I'm more experienced with Vim.</li>
<li>Code Snippets. Code snippets partially solve the typing issue, but they don't
solve the readability issue.</li>
</ul>
<p>In the end, I decided to explore the combination of pandoc, markdown, and LaTeX
to create a beamer presentation.
The reasons for this included the following:</p>
<ul>
<li><a href="http://daringfireball.net/projects/markdown/">Markdown</a> is a really intuitive
markup format that is easy to read and easy to modify.</li>
<li>When pandoc converts markdown to LaTeX, any LaTeX is passed straight through. Thus,
it is possible to obtain customisation beyond the basic options provided by
markdown.</li>
</ul>
<h3>Existing resources on combining Beamer, markdown and pandoc</h3>
<p>John MacFarlane, author of pandoc, has some <a href="http://johnmacfarlane.net/pandoc/demo/example9/producing-slide-shows-with-pandoc.html">relevant documentation on slide
production</a>.
Important points:</p>
<ul>
<li>The basic compilation command is: <code>pandoc -t beamer my_source.md -o my_beamer.pdf</code></li>
<li>The post explains the slide separation rules. </li>
<li>You can have incremental lists by pre-pending dot points with the greater than symbol </li>
<li>Beamer Themes can be used via the <code>-V</code> option e.g., (<code>-V theme:Warsaw</code>)</li>
<li>It shows how the first few lines of the file pre-pended by <code>%</code> are incorporated into the title slide</li>
</ul>
<p>Stephen Sinclair <a href="http://www.music.mcgill.ca/~sinclair/content/blog/using_markdown_for_beamer_presentations">has a tutorial</a>.
Relevant points include:</p>
<ul>
<li>LaTeX gets passed directly through
<ul>
<li>Equations can be passed directly through</li>
<li>Image size and placement can be controlled in detail with LaTeX, e.g., <code>\centerline{\includegraphics[height=2in]{my_image.pdf}}</code></li>
<li>You can use bibtex for citations</li>
</ul></li>
<li>He also mentions a number of other options for compiling the document
involving templates, regular expressions, and so on. </li>
</ul>
<h3>My approach</h3>
<p>My approach involved running a makefile which converted a markdown file into
a tex file, which was then incorporated into another tex file and then converted
into a pdf. </p>
<ul>
<li>The repository containing all files is available on github:
<a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/tree/master/talk">https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/tree/master/talk</a></li>
<li>The <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/main.pdf?raw=true">presentation PDF</a></li>
</ul>
<p>To use the approach you would need the following software:</p>
<ul>
<li>LaTeX distribution with beamer package</li>
<li><a href="http://johnmacfarlane.net/pandoc/">pandoc</a> </li>
<li>support for <code>make</code>: On Linux, make is installed by default; on Windows, you may
need something like <a href="http://www.cygwin.com/">cygwin</a> or
<a href="http://cran.r-project.org/bin/windows/Rtools/">Rtools</a>.</li>
</ul>
<h4>makefile</h4>
<p>The makefile was as follows:</p>
<pre><code>pdf:
	pandoc talk.md --slide-level 2 -t beamer -o talk.tex
	pdflatex main.tex
	pdflatex main.tex
	-xdg-open main.pdf
</code></pre>
<ul>
<li><code>pandoc</code> converted <code>talk.md</code> into a beamer latex file <code>talk.tex</code></li>
<li><code>--slide-level 2</code> meant that level 1 markdown headings (i.e., lines preceded
with a single hash: <code># Section name</code>) represented section headings used in the
presentation, and level 2 headings (i.e., lines preceded with two hashes <code>##
Slide Title</code>) represented new slides and their title.</li>
<li>The line <code>-xdg-open main.pdf</code> opens the resulting pdf file on Linux, but
<code>xdg-open</code>
could be replaced by the name of a PDF viewer (e.g., on a different operating system).</li>
</ul>
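<p>With this makefile in the talk directory (note that make requires the recipe lines to be indented with tabs), rebuilding and viewing the presentation is then a single command:</p>
<pre><code>make    # runs pandoc, compiles main.tex twice, and opens main.pdf
</code></pre>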
<h4>Preamble LaTeX file: <code>main.tex</code></h4>
<p>I had a main LaTeX file (<code>main.tex</code>) as follows: </p>
<pre><code>\documentclass[t]{beamer}
\usetheme{Berkeley}
\setbeamertemplate{navigation symbols}{}
\title{MY_TALK_TITLE}
\subtitle{MY_TALK_SUBTITLE}
\author{MY_NAME}
\institute{MY_INSTITUTION}
\date{DATE_OF_MY_TALK}
% more preamble...
\begin{document}
\begin{frame}
\titlepage
\end{frame}
\begin{frame}
\frametitle{Outline}
\tableofcontents
\end{frame}
\input{"talk.tex"}
\end{document}
</code></pre>
<ul>
<li>The file is entirely LaTeX and includes the preamble, the document
environment, some opening slides with particular features, and the input
command which reads in the file <code>talk.tex</code>.</li>
<li><code>talk.tex</code> is generated by pandoc from <code>talk.md</code> and contains all the
individual content slides.</li>
<li>I prefer to exclude navigation symbols.</li>
<li>Naturally, <code>usetheme</code> could be altered to some other theme (see the <a href="http://www.hartwork.org/beamer-theme-matrix/">beamer
theme matrix</a>), such as <code>default</code>. </li>
</ul>
<h4>Body markdown slide file: <code>talk.md</code></h4>
<p><code>talk.md</code> contained all the individual markdown slides.</p>
<p>For example a basic slide might look as follows:</p>
<pre><code># NAME OF A SECTION

## SLIDE TITLE

* Some point to make
    * Another point
    * Another point
* Some point to make
    * Another point
    * Another point
</code></pre>
<ul>
<li>The first line adds a section title. This is not part of the slide, but can be
used to generate table of contents, and in slide navigation.</li>
<li>The second line starts a new slide with the content to the right of the double
hash constituting the slide title.</li>
<li>And then subsequent lines generate a two-level list represented in LaTeX using
the <code>itemize</code> environment.</li>
</ul>
<p>In general, markdown is converted by pandoc into sensible beamer content. See the <a href="https://raw.github.com/jeromyanglim/rmarkdown-rmeetup-2012/master/talk/talk.md">actual
markdown file talk.md</a>
and resulting <a href="https://raw.github.com/jeromyanglim/rmarkdown-rmeetup-2012/master/talk/talk.tex">tex file talk.tex</a>.
However, pandoc passes any LaTeX through as is, and this is sometimes required.</p>
<p>For incorporating images, I found the default markdown image command led to an
excessively large image.
Thus, I used LaTeX for images.
I'd like to think that there is a way of making default images work well, but
I didn't work it out.
Thus, instead, I used commands like the following:</p>
<pre><code>## SLIDE TITLE
\includegraphics[width=4in]{FIGURE_FILE_NAME.PNG}
</code></pre>
<ul>
<li>I often had to tweak the image width to get it the right size.</li>
<li>I also read about some other options, which I <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/issues/4#issuecomment-6952495">discuss
here</a>.</li>
</ul>
<p>Obviously there are many reasons that you might want to fall back to LaTeX.
In my talk, I tried to keep things simple, so the main instances were:</p>
<ul>
<li>Images as shown above</li>
<li>Small text for footnotes often with a url inside: e.g., <code>\tiny{some text and
a link: \url{http://jeromyanglim.blogspot.com}}</code></li>
<li>Large text at the end of the talk: e.g., <code>\begin{center} \LARGE{Questions?} \end{center}</code></li>
</ul>
<h3>Conclusion</h3>
<p>Overall, there were pros and cons to the approach I adopted. </p>
<ul>
<li>By incorporating markdown and pandoc, there was an extra layer of complexity
to think about to ensure that the conversion process had the desired effect.
Error messages were sometimes more difficult to diagnose. </li>
<li>There were a lot of situations where you might want to have more control
over slide content than what you get by default with Markdown. </li>
<li>There is an argument to suggest that slide creation is best done in a WYSIWYG
environment where you can manually tweak image positioning and layout. </li>
</ul>
<p>Nonetheless, I really liked how easy it was to create, edit, and read dot points,
nested dot points, frames, sections, and basic formatting, and in general it was
relatively easy to incorporate LaTeX when required.
I also like the idea of where possible using open plain-text file formats to
take advantage of easier programmability, version control, incorporating into
a powerful text editor, simpler conversion, and so on.</p>
<h3>Other aspects</h3>
<p>The following are a few other aspects that might interest some readers,
particularly Vim users.</p>
<h4>Syntax highlighting of markdown+LaTeX in Vim</h4>
<p>There is a Vim plugin for pandoc that provides many features including syntax
highlighting for documents that combine multiple markups including markdown and
LaTeX. I found it best to install the latest version available on github:
<a href="https://github.com/vim-pandoc/vim-pandoc">https://github.com/vim-pandoc/vim-pandoc</a></p>
<h4>Code folding of markdown-Beamer</h4>
<p>I also have the following script in my <code>.vimrc</code> file.
The great thing about it is that it allows code folding if you use hash-style
markdown headings.
It is set up to fold only on headings 1 and 2, which corresponds to sections and
slides in my pandoc setting for beamer markdown documents.
To increase the level, change it to <code>MarkdownLevel(3)</code>, etc.</p>
<pre><code>function! MarkdownLevel(maxlevel)
    if a:maxlevel >= 1 && getline(v:lnum) =~ '^# .*$'
        return ">1"
    endif
    if a:maxlevel >= 2 && getline(v:lnum) =~ '^## .*$'
        return ">2"
    endif
    if a:maxlevel >= 3 && getline(v:lnum) =~ '^### .*$'
        return ">3"
    endif
    if a:maxlevel >= 4 && getline(v:lnum) =~ '^#### .*$'
        return ">4"
    endif
    if a:maxlevel >= 5 && getline(v:lnum) =~ '^##### .*$'
        return ">5"
    endif
    if a:maxlevel >= 6 && getline(v:lnum) =~ '^###### .*$'
        return ">6"
    endif
    return "="
endfunction
au BufEnter *.md setlocal foldexpr=MarkdownLevel(2)
au BufEnter *.md setlocal foldmethod=expr
au BufEnter *.md setlocal autoindent
</code></pre>
<p>Then the following Vim commands in normal mode make folding, navigation, and
getting a sense of structure really easy:</p>
<ul>
<li><code>zx</code> show current line and necessary headings; close other headings</li>
<li><code>zc</code> close heading</li>
<li><code>zj</code> and <code>zk</code> to move down and up headings</li>
</ul>
<h4>Showing backticks and single quotes properly in code</h4>
<p>I often need to show code, and backticks and single quotes weren't showing
properly.
The following code in my LaTeX preamble drawn from <a href="http://tex.stackexchange.com/questions/63353/">this TeX.SE
question</a> solved the problem:</p>
<pre><code>% enables straight single quote
\makeatletter
\let \@sverbatim \@verbatim
\def \@verbatim {\@sverbatim \verbatimplus}
{\catcode`'=13 \gdef \verbatimplus{\catcode`'=13 \chardef '=13 }}
\makeatother
% enables backticks in verbatim
\makeatletter
{\catcode`\`=13
\xdef\@verbatim{\unexpanded\expandafter{\@verbatim}\chardef\noexpand`=18 }
}
\makeatother
</code></pre>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-43950542864702512052012-07-19T16:55:00.000+10:002012-07-19T16:55:52.300+10:00Video: knitr, R Markdown, and R Studio: Introduction to Reproducible Analysis<p>This post presents the video of a talk that I presented in July 2012 at
Melbourne R Users on using knitr, R Markdown, and R Studio to perform
reproducible analysis. I also provide links to a github repository where the
R markdown examples can be examined and the slides can be downloaded.</p>
<a name='more'></a>
<h3>Talk Overview</h3>
<p>Reproducible analysis represents a process for transforming text, code, and data
to produce reproducible artefacts including reports, journal articles,
slideshows, theses, and books. Reproducible analysis is important in both
industry and academic settings for ensuring a high quality product. R has
always provided a powerful platform for reproducible analysis. However, in the
first half of 2012, several new tools have emerged that have substantially
increased the ease with which reproducible analysis can be performed. In
particular, knitr, R Markdown, and RStudio combine to create a user-friendly and
powerful set of open source tools for reproducible analysis.</p>
<p>Specifically, in the talk I discuss caching slow analyses, producing attractive plots and
tables, and using RStudio as an IDE. I present three live examples of using
R Markdown. I also show how the markdown package on CRAN can be
used to work with other R development environments and workflows for report
production. </p>
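<p>As a sketch of that markdown-package workflow (the file names here are hypothetical):</p>
<pre><code>library(knitr)
library(markdown)
knit("report.Rmd")                           # produces report.md
markdownToHTML("report.md", "report.html")   # convert markdown to HTML
browseURL("report.html")                     # view the result
</code></pre>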
<p>There is a <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012">github repository called rmarkdown-rmeetup-2012</a>
that contains: </p>
<ol>
<li>the slides and source code for the slides (I used a combination of beamer, markdown, and pandoc)</li>
<li>the source code for the R Markdown examples presented in the talk</li>
<li>and assorted brainstorming that recorded some of my thinking as I developed the slides
(see <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/issues?state=closed">the issue tracker</a>)</li>
</ol>
<p>Follow this <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/blob/master/talk/main.pdf?raw=true">link to download the slides directly</a>.</p>
<h3>Video of Talk</h3>
<p>The talk is split over two parts.</p>
<iframe width="420" height="315" src="http://www.youtube.com/embed/XqzHnYLr5BE"
frameborder="0" allowfullscreen></iframe>
<iframe width="420" height="315" src="http://www.youtube.com/embed/fNmMgHmjU2w" frameborder="0" allowfullscreen></iframe>
<h3>More Videos from Melbourne R Users</h3>
<p>We are gradually building up a fairly large back catalogue of videos about R all
presented at <a href="http://www.meetup.com/MelbURN-Melbourne-Users-of-R-Network/">Melbourne R Users</a>.</p>
<p>The <a href="http://www.youtube.com/playlist?list=PL2E4B515A6ED513B0">playlist of Melbourne R Users Videos can be viewed here</a>.</p>
<h3>Relevant links:</h3>
<p>The following links were either presented in the talk or are otherwise relevant to reproducible analysis.</p>
<ul>
<li>My post on <a href="http://jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html">getting started with R Markdown</a></li>
<li>My thoughts on <a href="http://stats.stackexchange.com/a/15006/183">definitions of reproducible data analysis</a></li>
<li>My thoughts on <a href="https://github.com/jeromyanglim/rmarkdown-rmeetup-2012/issues/11">degrees of reproducible data analysis</a></li>
<li><a href="http://cran.r-project.org/web/views/ReproducibleResearch.html">Reproducible Research Task View on CRAN</a></li>
<li>Software used in talk: <a href="http://www.r-project.org/">R</a>, <a href="http://rstudio.org/">R Studio</a>, <a href="http://johnmacfarlane.net/pandoc/">pandoc</a>, and
<a href="http://www.latex-project.org/ftp.html">TeX distributions</a></li>
<li><a href="http://daringfireball.net/projects/markdown/">Overview of markdown</a></li>
<li><a href="http://jeromyanglim.blogspot.com/2010/10/getting-started-with-writing.html">Getting started with writing LaTeX equations</a></li>
<li><a href="http://yihui.name/slides/2012-knitr-RStudio.html">Slide show on benefits of knitr and Rstudio by Yihui Xie and JJ Allaire</a></li>
<li><a href="http://yihui.name/knitr/options">knitr options home page</a> and <a href="http://yihui.name/knitr/">knitr home page</a></li>
<li><a href="http://rstudio.org/docs/authoring/using_markdown">Documentation on using R Markdown with R Studio</a></li>
<li><a href="http://jeromyanglim.blogspot.com/search/label/reproducible%20research">My existing posts on reproducible analysis</a></li>
<li>Places to ask questions: <a href="http://stackoverflow.com/questions/tagged/r">R on StackOverflow</a>,
<a href="http://tex.stackexchange.com/">LaTeX on TeX.SE</a>, and <a href="https://github.com/yihui/knitr/issues">knitr on github</a>.</li>
<li>Extensive <a href="http://www.youtube.com/user/victoriastodden/videos?view=0">set of YouTube videos on reproducible analysis</a> largely
drawn from a workshop on "Reproducible Research: Tools and Strategies for Scientific Computing".</li>
</ul>
<p>If viewing through syndication, feel free to <a href="http://feeds.feedburner.com/jeromyanglim">subscribe to my blog on psychology and statistics here</a>.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-32595438946563806362012-06-10T00:20:00.000+10:002012-07-17T16:44:40.637+10:00Converting Sweave LaTeX to knitr LaTeX: A case study<p>The following post documents the steps I needed to take in order to convert a
project using Sweave LaTeX into one using knitr LaTeX.</p>
<a name='more'></a>
<h3>Resources on the knitr website</h3>
<p>It is fairly straightforward to convert a document from Sweave LaTeX to knitr
LaTeX. <a href="http://yihui.name/">Yihui Xie</a> on the knitr website provides the
following useful resources:</p>
<ul>
<li><a href="http://yihui.name/knitr/demo/sweave/">Transition from Sweave to knitr</a>: This
document describes knitr specifically from the perspective of what is the same
as Sweave and what is different from Sweave.</li>
<li><a href="http://yihui.name/knitr/options">knitr options</a>: This includes discussion of
the many R code chunk options in knitr. Many are the same as Sweave, but there
are some new ones, and some modifications.</li>
<li><a href="http://yihui.name/knitr/demo/minimal/">knitr minimal examples</a>: These are
useful for getting started with different types of knitr document including
LaTeX.</li>
</ul>
<h3>My conversion from Sweave to knitr</h3>
<p>The following documents the steps I needed to take in order to convert a journal
article from Sweave LaTeX into a knitr LaTeX document.
Most of this is documented in the above-mentioned links on the knitr website,
but there were still a few little surprises.</p>
<ul>
<li><strong>Rnw to tex conversion</strong>: Convert <code>R CMD Sweave myfile.rnw</code> to <code>Rscript -e
"library(knitr); knit('myfile.rnw')"</code> in the makefile (<a href="http://stackoverflow.com/a/10943794/180892">see this SO
question</a>).</li>
<li><strong>global options</strong>: Replace <code>\SweaveOpts{echo=FALSE}</code> with
<code>\Sexpr{opts_chunk$set(echo=FALSE)}</code>; this needs to appear before the first
R code chunk in order to affect all code chunks in the file (a combined setup chunk is sketched after this list).</li>
<li><strong>case on R code chunk options</strong>: Update <code>true</code> and <code>false</code> to <code>TRUE</code> and
<code>FALSE</code> in r code chunk options. </li>
<li><strong>results option</strong>: Update <code>results=tex</code> to <code>results='asis'</code> and in general ensure that text
values in R code chunks are surrounded by quotation marks.</li>
<li><strong>message option</strong>: I needed to prevent the display of messages when certain
packages were loaded using <code>\Sexpr{opts_chunk$set(message=FALSE)}</code>.
These messages did not previously display under Sweave.</li>
<li><strong>hiding output</strong>: I had some R code chunks with options <code>print=FALSE,
term=FALSE</code>; I replaced this with <code>results='hide'</code>.</li>
<li><strong>methods package</strong>: I had a <code>densityplot()</code> (i.e., a lattice plot) that
didn't display properly. It instead showed an error: <code>Error using packet 1
could not find function "hasArg"</code>; apparently this is caused by the fact that
the methods package doesn't load by default when using <code>Rscript</code>; thus I
needed to put <code>require(methods)</code> in the first R code chunk. </li>
<li><strong>Sweave.sty</strong>: I removed <code>Sweave.sty</code> from my project directory and removed
the line <code>\usepackage{Sweave}</code> from my rnw file, as neither is needed
in knitr.</li>
<li><strong>caching</strong>: Although there are packages for enabling caching, I'd never
adopted any of them. knitr makes caching very simple. I just added
<code>cache=TRUE</code> to the global chunk options (i.e.,
<code>\Sexpr{opts_chunk$set(echo=FALSE, message=FALSE, cache=TRUE)}</code>). This reduced
the time to build the PDF from around 5 seconds to 1 second. I'm also planning
to incorporate some Bayesian analyses with JAGS and rjags, where I'm expecting
analyses will take several minutes or longer to run. At that point, I'll
really appreciate the speed benefits of caching.</li>
<li><strong>to make or not to make</strong>: I had a custom makefile on the project that kept
everything neat and tidy, copying source files into a build directory, running
all necessary commands to convert from rnw to tex and then to pdf, and then
opening the pdf in a viewer. This still works well. However, the default
"Compile to PDF" option in RStudio was also quite good (after setting tools -
options - Sweave - Weave Rnw files using knitr). In particular, I liked the
synctex support for Sweave that allows you to move from a position in the source
to the corresponding position in the PDF viewer. Also, RStudio in combination
with knitr seems to do a reasonable job of keeping the main project directory
tidy. A few auxiliary files are added, but not too many. I also appreciate the
simplicity that a simple button brings to getting started with analyses.
However, a makefile does make things more portable.</li>
</ul>
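<p>Putting several of these changes together, a knitr setup chunk near the top of the rnw file might look like the following; this is a sketch with an illustrative chunk label rather than the exact chunk from my project:</p>
<pre><code><<setup, include=FALSE, cache=FALSE>>=
library(knitr)
opts_chunk$set(echo = FALSE, message = FALSE, cache = TRUE)
require(methods)  # avoids the hasArg error for lattice plots under Rscript
@
</code></pre>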
<p>My main conclusion from this process is that converting an ongoing Sweave LaTeX
document to knitr LaTeX is fairly straightforward, and there are a number of
useful benefits that arise. In particular, I really appreciate simple caching
and not having to worry about Sweave.sty. Great work Yihui Xie!</p>
<h3>Additional Resources</h3>
<ul>
<li><a href="http://feeds.feedburner.com/jeromyanglim">RSS Subscription options</a> </li>
<li><a href="http://jeromyanglim.blogspot.com.au/2012/06/how-to-convert-sweave-latex-to-knitr-r.html">Convert Sweave LaTeX to knitr R
Markdown</a></li>
<li><a href="
http://jeromyanglim.blogspot.com.au/2012/05/getting-started-with-r-markdown-knitr.html">Getting started with R Markdown</a></li>
<li><a href="http://jeromyanglim.blogspot.com.au/2009/06/learning-r-for-researchers-in.html">Getting started with R</a></li>
<li><a href="http://jeromyanglim.blogspot.com.au/2010/05/videos-on-data-analysis-with-r.html">R Videos</a></li>
<li><a href="http://jeromyanglim.blogspot.com.au/2010/11/makefiles-for-sweave-r-and-latex-using.html">Sweave and
makefiles</a></li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-11917665398702979512012-06-04T22:19:00.001+10:002012-06-04T22:19:51.511+10:00How to Convert Sweave LaTeX to knitr R Markdown: Winter Olympic Medals Example<p>The following post shows how to manually convert a Sweave LaTeX document into a knitr R Markdown document. The post (1) reviews many of the required changes; (2) provides an example of a document converted to R Markdown format based on an analysis of Winter Olympic Medal data up to and including 2006; and (3) discusses the pros and cons of LaTeX and Markdown for performing analyses.</p>
<a name='more'></a>
<h1>Overview</h1>
<p>The following analyses of Winter Olympic Medals data have gone through several iterations:</p>
<ol>
<li><strong>R Script</strong>: I originally performed <a href="http://jeromyanglim.blogspot.com.au/2010/02/analysis-of-winter-olympic-medal-data.html">similar analyses in February 2010</a>. It was a simple set of commands where you could see the console output and view the plots. </li>
<li><strong>LaTeX Sweave</strong>: In February 2011 I adapted the example to make it a Sweave LaTeX document. The <a href="https://github.com/jeromyanglim/Sweave_Winter_Olympics">source for this is available on github</a>. With Sweave, I was able to create a document that weaved together text, commands, console input, console output, and figures.</li>
<li><strong>R Markdown</strong>: Now in June 2012 I'm using the example to review the process of converting a document from Sweave LaTeX to R Markdown. The <a href="https://github.com/jeromyanglim/Winter_Olympic_Medals_R_Markdown">source code is available here on github</a> (see the <code>*.rmd</code> file). </li>
</ol>
<h1>Converting from Sweave to R Markdown</h1>
<p>The following changes were required in order to convert my LaTeX Sweave document into an R Markdown document suitable for processing with <code>knitr</code> and <code>RStudio</code>. Many of these changes are fairly obvious if you understand LaTeX and Markdown; but a few are less obvious. And obviously there are many additional changes that might be required on other documents.</p>
<h2>R code chunks</h2>
<ul>
<li><strong>R code chunk delimiters:</strong> Update from <code><< ... >>=</code> and <code>@</code> to R markdown format <code>```{r ...}</code> and <code>```</code></li>
<li><strong>Inline code chunks:</strong> Update from <code>\Sexpr{...}</code> to either <code>`r ...`</code> or <code>`r I(...)`</code> format.</li>
<li><strong>results=tex</strong>: Any <code>results=tex</code> needs to either be removed or converted to <code>results='asis'</code>. Note that string values of knitr options need to be quoted.</li>
<li><strong>Boolean options</strong>: Sweave tolerates lower case <code>true</code> and <code>false</code> for code chunk options, <code>knitr</code> requires <code>TRUE</code> and <code>FALSE</code>.</li>
</ul>
<h2>Figures and Tables</h2>
<ul>
<li><strong>Floats</strong>: Remove figure and table floats (e.g., <code>\begin{table}...\end{table}</code>, <code>\begin{figure}...\end{figure}</code>). In R Markdown and HTML, there are no pages and thus content is just placed immediately in the document.</li>
<li><strong>Figure captions</strong>: Extract content from within the <code>\caption{}</code> command. When using R Markdown, it is often easiest to add captions to the plot itself (e.g., using the <code>main</code> argument in base graphics). </li>
<li><strong>Table captions:</strong> extract content from within the <code>\caption{}</code> command; Table captions can be included in a <code>caption</code> argument using the <code>caption</code> argument to the <code>xtable</code> function (e.g., <code>print(xtable(MY_DAT_FRAME), "html", caption="MY CAPTION", caption.placement="top")</code> ). Caption placement defaults to <code>"bottom"</code> of table but can be optinally specified as <code>"top"</code> either as a global option or in <code>print.xtable</code>. Alternatively table titles can just be included as Markdown text.</li>
<li><strong>References:</strong> Delete table and figure lables (e.g., <code>\label{...}</code>). Replace table and figure references (e.g., <code>\ref{...}</code> with actual numbers or other descriptive terminology. It would also be possible to implement something simple in R that stored table and figure numbers (e.g., initialise table and figure numbers at the start of the document; increment table counter each time a table is created and likewise for figures; store the value of counter in variable; include variable in caption text using <code>paste()</code> or something similar. Include counter in text using inline R code chunks.</li>
<li><strong>Table content</strong>: Markdown supports HTML; so one option is to convert LaTeX tables to HTML tables using a function like <code>print(xtable(MY_DATA_FRAME), type="html")</code>. This is combined with the <code>results='asis'</code> R code chunk option.</li>
</ul>
<h2>Basic formatting</h2>
<ul>
<li><strong>Headings</strong>: if we assume <code>section</code> is the top level: then <code>\section{...}</code> becomes <code># ...</code>, <code>\subsection{...}</code> becomes <code>## ...</code> and <code>\subsubsection{...}</code> becomes <code>### ...</code></li>
<li><strong>Mathematics</strong>: Update latex mathematics to <code>$</code><code>latex ...</code> and <code>$$</code><code>latex ... $$</code> notation if using RStudio.</li>
<li><strong>Paragraph delimiters</strong>: If using RStudio then remove single line breaks that were not intended to be paragraph breaks.</li>
<li><strong>Hyperlinks</strong>: Convert LaTeX Hyperlinks from <code>\href</code> or <code>url</code> to <code>[text](url)</code> format.</li>
</ul>
<h2>LaTeX things</h2>
<ul>
<li><strong>Comments</strong>: Remove any LaTeX comments or switch from <code>% comment</code> to <code><!-- comment --></code></li>
<li><strong>LaTeX escaped characters</strong>: Remove unnecessary escape characters (e.g., <code>\%</code> is just <code>%</code>).</li>
<li><strong>R Markdown escaped characters</strong>: Writing about the R Markdown language in R Markdown sometimes requires the use of HTML codes for special characters such as backticks (<code>&#96;</code>) and backslashes (<code>&#92;</code>) to prevent the text from being interpreted; see <a href="http://www.ascii.cl/htmlcodes.htm">here for a list of HTML character codes</a>.</li>
<li><strong>Header</strong>: Remove the LaTeX header information up to and including <code>\begin{document}</code>; extract any incorporate any relevant content such as title, abstract, author, date, etc.</li>
</ul>
<h1>R Markdown Analysis of Winter Olympic Medal Data</h1>
<p>The following shows the output of the actual analysis after running the rmd source through <code>Knit HTML</code> in Rstudio. If you're curious, you may wish to view the <a href="https://github.com/jeromyanglim/Winter_Olympic_Medals_R_Markdown/blob/master/Winter_Olympics.rmd">rmd source code on GitHub side by side this point at this point</a>.</p>
<h2>Import Dataset</h2>
<pre><code class="r">library(xtable)
options(stringsAsFactors = FALSE)
medals <- read.csv("data/medals.csv")
medals$Year <- as.numeric(medals$Year)
medals <- medals[!is.na(medals$Year), ]
</code></pre>
<p>The Olympic Medals data frame includes <code>2311</code> medals from <code>1924</code> to <code>2006</code>. The data was sourced from <a href="http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country">The Guardian Data Blog</a>.</p>
<h2>Total Medals by Year</h2>
<pre><code class="r"># http://www.math.mcmaster.ca/~bolker/emdbook/chap3A.pdf
x <- aggregate(medals$Year, list(Year = medals$Year), length)
names(x) <- c("year", "medals")
x$pos <- seq(x$year)
fit <- nls(medals ~ a * pos^b + c, x, start = list(a = 10, b = 1,
c = 50))
</code></pre>
<p>In general over the years the number of Winter Olympic medals awarded has increased. In order to model this relationship, year was converted to ordinal position. A three parameter power function seemed plausible, \( y = ax^b + c \), where \( y \) is total medals awarded and \( x \) is the ordinal position of the olympics starting at one. The best fitting parameters by least-squares were</p>
<p>\[
0.202
x^{2.297 + 50.987}.
\]</p>
<p>The figure displays the data and the line of best fit for the model. The model predicts that 2010, 2014, and 2018 would have <code>271</code>, <code>295</code>, and <code>322</code> medals respectively.</p>
<pre><code class="r">plot(medals ~ pos, x, las = 1,
ylab = "Total Medals Awarded",
xlab = "Ordinal Position of Olympics",
main="Total medals awarded
by ordinal position of Olympics with
predicted three parameter power function fit displayed.",
las = 1,
bty="l")
lines(x$pos, predict(fit))
</code></pre>
<p><img src="http://i.imgur.com/atTmh.png" alt="plot of chunk figure_of_medals"/> </p>
<h1>Gender Ratio by Year</h1>
<pre><code class="r">medalsByYearByGender <- aggregate(medals$Year, list(Year = medals$Year,
Event.gender = medals$Event.gender), length)
medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender !=
"X", ]
propf <- list()
propf$prop <- medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "x"]/(medalsByYearByGender[medalsByYearByGender$Event.gender == "W",
"x"] + medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
propf$year <- medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "Year"]
propf$propF <- format(round(propf$prop, 2))
propf$table <- with(propf, cbind(year, propF))
colnames(propf$table) <- c("Year", "Prop. Female")
</code></pre>
<p>The figure shows the number of medals won by males and females by year. The table shows the proportion of medals awarded to females by year. It shows a generally similar pattern for males and females. Medals increase gradually until around the late 1980s after which the rate of increase accelerates. However, females started from a much smaller base. Thus, both the absolute difference and the percentage difference has decreased over time to the point where in 2006 <code>46</code> of medals were won by females.</p>
<pre><code class="r">plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender ==
"M", ], ylim = c(0, max(x)), pch = "m", col = "blue", las = 1, ylab = "Total Medals Awarded",
bty = "l", main = "Total Medals Won by Gender and Year")
points(medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "Year"], medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "x"], col = "red", pch = "f")
</code></pre>
<p><img src="http://i.imgur.com/idGC7.png" alt="plot of chunk fgenderRatioByYear_figure"/> </p>
<pre><code class="r">print(xtable(propf$table,
caption="Proportion of Medals that were awarded to Females by Year"),
type="html",
caption.placement="top",
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:14:27 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> Proportion of Medals that were awarded to Females by Year </CAPTION>
<TR> <TH> </TH> <TH> Year </TH> <TH> Prop. Female </TH> </TR>
<TR> <TD align="right"> 1 </TD> <TD> 1924 </TD> <TD> 0.07 </TD> </TR>
<TR> <TD align="right"> 2 </TD> <TD> 1928 </TD> <TD> 0.08 </TD> </TR>
<TR> <TD align="right"> 3 </TD> <TD> 1932 </TD> <TD> 0.08 </TD> </TR>
<TR> <TD align="right"> 4 </TD> <TD> 1936 </TD> <TD> 0.12 </TD> </TR>
<TR> <TD align="right"> 5 </TD> <TD> 1948 </TD> <TD> 0.18 </TD> </TR>
<TR> <TD align="right"> 6 </TD> <TD> 1952 </TD> <TD> 0.23 </TD> </TR>
<TR> <TD align="right"> 7 </TD> <TD> 1956 </TD> <TD> 0.26 </TD> </TR>
<TR> <TD align="right"> 8 </TD> <TD> 1960 </TD> <TD> 0.38 </TD> </TR>
<TR> <TD align="right"> 9 </TD> <TD> 1964 </TD> <TD> 0.37 </TD> </TR>
<TR> <TD align="right"> 10 </TD> <TD> 1968 </TD> <TD> 0.37 </TD> </TR>
<TR> <TD align="right"> 11 </TD> <TD> 1972 </TD> <TD> 0.36 </TD> </TR>
<TR> <TD align="right"> 12 </TD> <TD> 1976 </TD> <TD> 0.35 </TD> </TR>
<TR> <TD align="right"> 13 </TD> <TD> 1980 </TD> <TD> 0.34 </TD> </TR>
<TR> <TD align="right"> 14 </TD> <TD> 1984 </TD> <TD> 0.36 </TD> </TR>
<TR> <TD align="right"> 15 </TD> <TD> 1988 </TD> <TD> 0.37 </TD> </TR>
<TR> <TD align="right"> 16 </TD> <TD> 1992 </TD> <TD> 0.43 </TD> </TR>
<TR> <TD align="right"> 17 </TD> <TD> 1994 </TD> <TD> 0.43 </TD> </TR>
<TR> <TD align="right"> 18 </TD> <TD> 1998 </TD> <TD> 0.44 </TD> </TR>
<TR> <TD align="right"> 19 </TD> <TD> 2002 </TD> <TD> 0.45 </TD> </TR>
<TR> <TD align="right"> 20 </TD> <TD> 2006 </TD> <TD> 0.46 </TD> </TR>
</TABLE>
<h1>Countries with the Most Medals</h1>
<pre><code class="r">cmm <- list()
cmm$medals <- sort(table(medals$NOC), dec = TRUE)
cmm$country <- names(cmm$medals)
cmm$prop <- cmm$medals/sum(cmm$medals)
cmm$propF <- paste(round(cmm$prop * 100, 2), "%", sep = "")
cmm$row1 <- c("Rank", "Country", "Total", "%")
cmm$rank <- seq(cmm$medals)
cmm$include <- 1:10
cmm$table <- with(cmm, rbind(cbind(rank[include], country[include],
medals[include], propF[include])))
colnames(cmm$table) <- cmm$row1
</code></pre>
<p>Norway has won the most medals with <code>280</code> (<code>12.12</code>%). The table shows the top 10. Russia, USSR, and EUN (Unified Team in 1992 Olympics) have a combined total of <code>293</code>. Germany, GDR, and FRG have a combined medal total of <code>309</code>.</p>
<pre><code class="r">print(xtable(cmm$table, caption="Rankings of Medals Won by Country"),
"html", include.rownames=FALSE, caption.placement='top',
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:14:27 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> Rankings of Medals Won by Country </CAPTION>
<TR> <TH> Rank </TH> <TH> Country </TH> <TH> Total </TH> <TH> % </TH> </TR>
<TR> <TD> 1 </TD> <TD> NOR </TD> <TD> 280 </TD> <TD> 12.12% </TD> </TR>
<TR> <TD> 2 </TD> <TD> USA </TD> <TD> 216 </TD> <TD> 9.35% </TD> </TR>
<TR> <TD> 3 </TD> <TD> URS </TD> <TD> 194 </TD> <TD> 8.39% </TD> </TR>
<TR> <TD> 4 </TD> <TD> AUT </TD> <TD> 185 </TD> <TD> 8.01% </TD> </TR>
<TR> <TD> 5 </TD> <TD> GER </TD> <TD> 158 </TD> <TD> 6.84% </TD> </TR>
<TR> <TD> 6 </TD> <TD> FIN </TD> <TD> 151 </TD> <TD> 6.53% </TD> </TR>
<TR> <TD> 7 </TD> <TD> CAN </TD> <TD> 119 </TD> <TD> 5.15% </TD> </TR>
<TR> <TD> 8 </TD> <TD> SUI </TD> <TD> 118 </TD> <TD> 5.11% </TD> </TR>
<TR> <TD> 9 </TD> <TD> SWE </TD> <TD> 118 </TD> <TD> 5.11% </TD> </TR>
<TR> <TD> 10 </TD> <TD> GDR </TD> <TD> 110 </TD> <TD> 4.76% </TD> </TR>
</TABLE>
<h1>Proportion of Gold Medals by Country</h1>
<p>Looking only at countries that have won more than 50 medals in the dataset, the figure shows that the proportion of medals won that were gold, silver, or bronze.</p>
<pre><code class="r">NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal),
margin = 1)
medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"],
decreasing = TRUE), c("Gold", "Silver", "Bronze")]
barplot(round(t(medalsByMedalByNOC), 2), horiz = TRUE, las = 1,
col=c("gold", "grey71", "chocolate4"),
xlab = "Proportion of Medals",
main="Proportion of medals won that were gold, silver or bronze.")
</code></pre>
<p><img src="http://i.imgur.com/L7f1C.png" alt="plot of chunk proportion_gold"/> </p>
<h1>How many different countries have won medals by year?</h1>
<pre><code class="r">listOfYears <- unique(medals$Year)
names(listOfYears) <- unique(medals$Year)
totalNocByYear <- sapply(listOfYears, function(X) length(table(medals[medals$Year ==
X, "NOC"])))
</code></pre>
<p>The figure shows the total number of countries winning medals by year.</p>
<pre><code class="r">plot(x = names(totalNocByYear), totalNocByYear, ylim = c(0, max(totalNocByYear)),
las = 1, xlab = "Year", main = "Total Number of Countries Winning Medals By Year",
ylab = "Total Number of Countries", bty = "l")
</code></pre>
<p><img src="http://i.imgur.com/VKzmi.png" alt="plot of chunk figure_total_medals"/> </p>
<h1>Australia at the Winter Olympics</h1>
<pre><code class="r">ausmedals <- list()
ausmedals$data <- medals[medals$NOC == "AUS", ]
ausmedals$data <- ausmedals$data[, c("Year", "City", "Discipline",
"Event", "Medal")]
ausmedals$table <- ausmedals$data
</code></pre>
<p>Given that I am an Australian I decided to have a look at the Australian medal count. Australia does not get a lot of snow. Up to and including 2006, Australia has won <code>6</code> medals. It won its first medal in <code>1994</code>. Of the <code>6</code> medals, <code>3</code> were bronze, <code>0</code> were silver, and <code>3</code> were gold. The table lists each of these medals.</p>
<pre><code class="r">print(xtable(ausmedals$table,
caption='List of Australian Medals',
digits=0),
type='html',
caption.placement='top',
include.rownames=FALSE,
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:15:10 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> List of Australian Medals </CAPTION>
<TR> <TH> Year </TH> <TH> City </TH> <TH> Discipline </TH> <TH> Event </TH> <TH> Medal </TH> </TR>
<TR> <TD align="right"> 1994 </TD> <TD> Lillehammer </TD> <TD> Short Track S. </TD> <TD> 5000m relay </TD> <TD> Bronze </TD> </TR>
<TR> <TD align="right"> 1998 </TD> <TD> Nagano </TD> <TD> Alpine Skiing </TD> <TD> slalom </TD> <TD> Bronze </TD> </TR>
<TR> <TD align="right"> 2002 </TD> <TD> Salt Lake City </TD> <TD> Short Track S. </TD> <TD> 1000m </TD> <TD> Gold </TD> </TR>
<TR> <TD align="right"> 2002 </TD> <TD> Salt Lake City </TD> <TD> Freestyle Ski. </TD> <TD> aerials </TD> <TD> Gold </TD> </TR>
<TR> <TD align="right"> 2006 </TD> <TD> Turin </TD> <TD> Freestyle Ski. </TD> <TD> aerials </TD> <TD> Bronze </TD> </TR>
<TR> <TD align="right"> 2006 </TD> <TD> Turin </TD> <TD> Freestyle Ski. </TD> <TD> moguls </TD> <TD> Gold </TD> </TR>
</TABLE>
<h1>Ice Hockey</h1>
<pre><code class="r">icehockey <- medals[medals$Sport == "Ice Hockey" & medals$Event.gender ==
"M" & medals$Medal == "Gold", ]
icehockeyf <- medals[medals$Sport == "Ice Hockey" & medals$Event.gender ==
"W" & medals$Medal == "Gold", ]
# names(table(icehockey$NOC)[table(icehockey$NOC) > 1])
</code></pre>
<p>The following are some statistics about Winter Olympics Ice Hockey up to and including the 2006 Winter Olympics. </p>
<ul>
<li>Out of the <code>20</code> Winter Olympics that have been staged, Mens Ice Hockey has been held in <code>20</code> and the Womens in <code>3</code>.</li>
<li>The USSR has won the most mens gold medals with <code>7</code> golds. It goes up to <code>8</code> if the 1992 Unified Team is included. </li>
<li>Canada has the second most golds with <code>6</code>. </li>
<li>After that the only two nations to win more than one gold are Sweden (<code>2</code> golds) and the United States (<code>2</code> golds).</li>
<li> The table shows the countries who won gold and silver medals by year.</li>
<li>In the case of the Women's Ice Hockey, Canada has won <code>2</code> and the United States has won <code>1</code>.</li>
</ul>
<pre><code class="r">icehockeygs <- medals[medals$Sport == "Ice Hockey" &
medals$Event.gender == "M" &
medals$Medal %in% c("Silver", "Gold"), c("Year", "Medal", "NOC")]
icetab <- list()
icetab$data <- reshape(icehockeygs, idvar="Year", timevar="Medal",
direction="wide")
names(icetab$data) <- c("Year", "Gold", "Silver")
print(xtable(icetab$data,
caption ="Country Winning Gold and Silver Medals by Year in Mens Ice Hockey",
digits=0),
type="html",
include.rownames=FALSE,
caption.placement="top",
html.table.attributes='align="center"')
</code></pre>
<!-- html table generated in R 2.15.0 by xtable 1.7-0 package -->
<!-- Mon Jun 4 22:15:10 2012 -->
<TABLE align="center">
<CAPTION ALIGN="top"> Country Winning Gold and Silver Medals by Year in Mens Ice Hockey </CAPTION>
<TR> <TH> Year </TH> <TH> Gold </TH> <TH> Silver </TH> </TR>
<TR> <TD align="right"> 1924 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1928 </TD> <TD> CAN </TD> <TD> SWE </TD> </TR>
<TR> <TD align="right"> 1932 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1936 </TD> <TD> GBR </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1948 </TD> <TD> CAN </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1952 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1956 </TD> <TD> URS </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1960 </TD> <TD> USA </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1964 </TD> <TD> URS </TD> <TD> SWE </TD> </TR>
<TR> <TD align="right"> 1968 </TD> <TD> URS </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1972 </TD> <TD> URS </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 1976 </TD> <TD> URS </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1980 </TD> <TD> USA </TD> <TD> URS </TD> </TR>
<TR> <TD align="right"> 1984 </TD> <TD> URS </TD> <TD> TCH </TD> </TR>
<TR> <TD align="right"> 1988 </TD> <TD> URS </TD> <TD> FIN </TD> </TR>
<TR> <TD align="right"> 1992 </TD> <TD> EUN </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1994 </TD> <TD> SWE </TD> <TD> CAN </TD> </TR>
<TR> <TD align="right"> 1998 </TD> <TD> CZE </TD> <TD> RUS </TD> </TR>
<TR> <TD align="right"> 2002 </TD> <TD> CAN </TD> <TD> USA </TD> </TR>
<TR> <TD align="right"> 2006 </TD> <TD> SWE </TD> <TD> FIN </TD> </TR>
</TABLE>
<h1>Reflections on the Conversion Process</h1>
<ul>
<li>Markdown versus LaTeX:
<ul>
<li>I prefer performing analyses with Markdown than I do with LateX. </li>
<li>Markdown is easier to type than LaTeX. </li>
<li>Markdown is easier to read than LaTeX.</li>
<li>It is easier with Markdown to get started with analyses.</li>
<li>Many analyses are only presented on the screen and as such page breaks in LaTeX are a nuisance. This extends to many features of LaTeX such as headers, figure and table placement, margins, table formatting, partiuclarly for long or wide tables, and so on.</li>
<li>That said, journal articles, books, and other artefacts that are bound to the model of a printed page are not going anywhere. </li>
<li>Furthermore, bibliographies, cross-references, elaborate control of table appearance, and more are all features which LaTeX makes easier than Markdown.</li>
</ul></li>
<li>R Markdown to Sweave LaTeX:
<ul>
<li>The more common conversion task that I can imagine is taking some simple analyses in R Markdown and having to convert them into knitr LaTeX in order to include the content in a journal article.</li>
<li>The first time I converted between the formats, it was good to do it in a relatively manual way to get a sense of all the required changes; however, if I had a large document or was doing the task on subsequent occasions, I would look at more automated solutions using string replacement tools (e.g., sed, or even just replacement commands in a text editor such as Vim), and markup conversion tools (e.g., pandoc).</li>
<li>Perhaps if the formats get popular enough, developers will start to build dedicated conversion tools.</li>
</ul></li>
</ul>
<h1>Additional Resources</h1>
<p>If you liked this post, you may want to subscribe to the <a href="http://feeds.feedburner.com/jeromyanglim">RSS feed of my blog</a>. Also see:</p>
<ul>
<li>This post on <a href="http://jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html">Getting Started with R Markdown, knitr, and Rstudio 0.96</a></li>
<li>This post for another <a href="http://jeromyanglim.blogspot.com/2012/05/example-reproducile-report-using-r.html">Example Reproducible Report using R Markdown which analyses California Schools Test Data</a></li>
<li>These <a href="http://jeromyanglim.blogspot.com.au/search/label/Sweave">Assorted posts using Sweave</a></li>
<li>The <a href="http://yihui.name/knitr/">knitr</a> home page and <a href="http://yihui.name/knitr/options">knitr options page</a>.</li>
<li>the <a href="http://cran.r-project.org/web/packages/xtable/vignettes/xtableGallery.pdf">xtable LaTeX table gallery</a> which can also be used to generate HTML tables for inclusion in Markdown.</li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-28168641275618013542012-05-18T21:22:00.000+10:002012-05-19T16:18:11.662+10:00Example Reproducible Report using R Markdown: Analysis of California Schools Test Data<p>This is a quick set of analyses of the California Test Score dataset. The post was produced using R Markdown in RStudio 0.96. The main purpose of this post is to provide a case study of using R Markdown to prepare a quick reproducible report. It provides examples of using plots, output, in-line R code, and markdown. The post is designed to be read along side the <a href="https://gist.github.com/2724711">R Markdown source code, which is available as a gist on github</a>. </p>
<a name='more'></a>
<h3>Preliminaries</h3>
<ul>
<li>This post builds on my earlier post which provided a guide for <a href="http://jeromyanglim.blogspot.com/2012/05/getting-started-with-r-markdown-knitr.html">Getting Started with R Markdown, knitr, and RStudio 0.96</a></li>
<li>The dataset analysed comes from the <code>AER</code> package which is an accompaniment to the book <a href="http://www.amazon.com/Applied-Econometrics-R-Use/dp/0387773169">Applied Econometrics with R</a> written by <a href="http://wwz.unibas.ch/personen/profil/person/kleiber/">Christian Kleiber</a> and <a href="http://eeecon.uibk.ac.at/%7Ezeileis/">Achim Zeileis</a>.</li>
</ul>
<h3>Load packages and data</h3>
<pre><code class="r"># if necessary uncomment and install packages. install.packages('AER')
# install.packages('psych') install.packages('Hmisc')
# install.packages('ggplot2') install.packages('relaimpo')
library(AER) # interesting datasets
library(psych) # describe and psych.panels
library(Hmisc) # describe
library(ggplot2) # plots: ggplot and qplot
library(relaimpo) # relative importance in regression
</code></pre>
<pre><code class="r"># load the California Schools Dataset and give the dataset a shorter name
data(CASchools)
cas <- CASchools
# Convert grade to numeric
# table(cas$grades)
cas$gradesN <- cas$grades == "KK-08"
# Get the set of numeric variables
v <- setdiff(names(cas), c("district", "school", "county", "grades"))
</code></pre>
<h3>Q 1 What does the CASchools dataset involve?</h3>
<p>Quoting the help (i.e., <code>?CASchools</code>), the data is “from all 420 K-6 and K-8 districts in California with data available for 1998 and 1999” and the variables are:</p>
<pre><code class="no-highlight">* district: character. District code.
* school: character. School name.
* county: factor indicating county.
* grades: factor indicating grade span of district.
* students: Total enrollment.
* teachers: Number of teachers.
* calworks: Percent qualifying for CalWorks (income assistance).
* lunch: Percent qualifying for reduced-price lunch.
* computer: Number of computers.
* expenditure: Expenditure per student.
* income: District average income (in USD 1,000).
* english: Percent of English learners.
* read: Average reading score.
* math: Average math score.
</code></pre>
<p>Let's look at the basic structure of the data frame. i.e., the number of observations and the types of values:</p>
<pre><code class="r">str(cas)
</code></pre>
<pre><code class="no-highlight">## 'data.frame': 420 obs. of 15 variables:
## $ district : chr "75119" "61499" "61549" "61457" ...
## $ school : chr "Sunol Glen Unified" "Manzanita Elementary" "Thermalito Union Elementary" "Golden Feather Union Elementary" ...
## $ county : Factor w/ 45 levels "Alameda","Butte",..: 1 2 2 2 2 6 29 11 6 25 ...
## $ grades : Factor w/ 2 levels "KK-06","KK-08": 2 2 2 2 2 2 2 2 2 1 ...
## $ students : num 195 240 1550 243 1335 ...
## $ teachers : num 10.9 11.1 82.9 14 71.5 ...
## $ calworks : num 0.51 15.42 55.03 36.48 33.11 ...
## $ lunch : num 2.04 47.92 76.32 77.05 78.43 ...
## $ computer : num 67 101 169 85 171 25 28 66 35 0 ...
## $ expenditure: num 6385 5099 5502 7102 5236 ...
## $ income : num 22.69 9.82 8.98 8.98 9.08 ...
## $ english : num 0 4.58 30 0 13.86 ...
## $ read : num 692 660 636 652 642 ...
## $ math : num 690 662 651 644 640 ...
## $ gradesN : logi TRUE TRUE TRUE TRUE TRUE TRUE ...
</code></pre>
<pre><code class="r"># Hmisc::describe(cas) # For more extensive summary statistics
</code></pre>
<h3>Q. 2 To what extent does expenditure per student vary?</h3>
<pre><code class="r">qplot(expenditure, data = cas) + xlim(0, 8000) + xlab("Money spent per student ($)") +
ylab("Count of schools")
</code></pre>
<p><img src="http://i.imgur.com/EVAg2.png" alt="plot of chunk cas2"/> </p>
<pre><code class="r">
round(t(psych::describe(cas$expenditure)), 1)
</code></pre>
<pre><code class="no-highlight">## [,1]
## var 1.0
## n 420.0
## mean 5312.4
## sd 633.9
## median 5214.5
## trimmed 5252.9
## mad 487.2
## min 3926.1
## max 7711.5
## range 3785.4
## skew 1.1
## kurtosis 1.9
## se 30.9
</code></pre>
<p>The greatest expenditure per student is around double that of the least expenditure per student.</p>
<h3>Q. 3a What predicts expenditure per student?</h3>
<pre><code class="r"># Compute and format set of correlations
corExp <- cor(cas["expenditure"], cas[setdiff(v, "expenditure")])
corExp <- round(t(corExp), 2)
corExp[order(corExp[, 1], decreasing = TRUE), , drop = FALSE]
</code></pre>
<pre><code class="no-highlight">## expenditure
## income 0.31
## read 0.22
## math 0.15
## calworks 0.07
## lunch -0.06
## computer -0.07
## english -0.07
## teachers -0.10
## students -0.11
## gradesN -0.17
</code></pre>
<p>More is spent per student in schools :</p>
<ol>
<li>where people with greater incomes live</li>
<li>reading scores are higher</li>
<li>that are K-6</li>
</ol>
<h3>Q. 4 what is the relationship between district level maths and reading scores?</h3>
<pre><code class="r">ggplot(cas, aes(read, math)) + geom_point() + geom_smooth()
</code></pre>
<p><img src="http://i.imgur.com/RDniX.png" alt="plot of chunk cas4"/> </p>
<p>At the district level, the correlation is very strong (r = The correlation is <code>0.92</code>). From prior experience I'd expect correlations at the individual-level in the .3 to .6 range. Thus, these results are consistent with group-level relationships being much larger than individual-level relationships.</p>
<h3>Q. 5 What is the relationship between maths and reading after partialling out other effects?</h3>
<pre><code class="r"># command has strange syntax requiring column numbers rather than variable
# names
partial.r(cas[v], c(which(names(cas[v]) == "read"), which(names(cas[v]) ==
"math")), which(!names(cas[v]) %in% c("read", "math")))
</code></pre>
<pre><code class="no-highlight">## partial correlations
## read math
## read 1.00 0.72
## math 0.72 1.00
</code></pre>
<p>The partial correlation is still very strong but is substantially reduced.</p>
<h3>Q. 6 What fraction of a computer does each student have?</h3>
<pre><code class="r">cas$compstud <- cas$computer/cas$students
describe(cas$compstud)
</code></pre>
<pre><code class="no-highlight">## cas$compstud
## n missing unique Mean .05 .10 .25 .50 .75
## 420 0 412 0.1359 0.05471 0.06654 0.09377 0.12546 0.16447
## .90 .95
## 0.22494 0.24906
##
## lowest : 0.00000 0.01455 0.02266 0.02548 0.04167
## highest: 0.32770 0.34359 0.34979 0.35897 0.42083
</code></pre>
<pre><code class="r">qplot(compstud, data = cas)
</code></pre>
<pre><code class="no-highlight">## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
</code></pre>
<p><img src="http://i.imgur.com/SEbj0.png" alt="plot of chunk unnamed-chunk-4"/> </p>
<p>The mean number of computers per student is <code>0.136</code>.</p>
<h3>Q. 7 What is a good model of the combined effect of other variables on academic performance (i.e., math and read)?</h3>
<pre><code class="r"># Examine correlations between variables
psych::pairs.panels(cas[v])
</code></pre>
<p><img src="http://i.imgur.com/FeULR.png" alt="plot of chunk cas7"/> </p>
<p><code>pairs.panels</code> shows correlations in the upper triangle, scatterplots in the lower triangle, and variable names and distributions on the main diagonal.<br/>
After examining the plot several ideas emerge.</p>
<pre><code class="r"># (a) students is a count and could be log transformed
cas$studentsLog <- log(cas$students)
# (b) teachers is not the variable of interest:
# it is the number of students per teacher
cas$studteach <- cas$students /cas$teachers
# (c) computers is not the variable of interest:
# it is the ratio of computers to students
# table(cas$computer==0)
# Note some schools have no computers so ratio would be problematic.
# Take percentage of a computer instead
cas$compstud <- cas$computer / cas$students
# (d) math and reading are correlated highly, reduce to one variable
cas$performance <- as.numeric(
scale(scale(cas$read) + scale(cas$math)))
</code></pre>
<p>Normally, I'd add all these transformations to an initial data transformation file that I call in the first block, but for the sake of the narrative, I'll leave them here.</p>
<p>Let's examine correlations between predictors and outcome.</p>
<pre><code class="r">m1cor <- cor(cas$performance, cas[c("studentsLog", "studteach", "calworks",
"lunch", "compstud", "income", "expenditure", "gradesN")])
t(round(m1cor, 2))
</code></pre>
<pre><code class="no-highlight">## [,1]
## studentsLog -0.12
## studteach -0.23
## calworks -0.63
## lunch -0.87
## compstud 0.27
## income 0.71
## expenditure 0.19
## gradesN -0.16
</code></pre>
<p>Let's examine the multiple regression.</p>
<pre><code class="r">m1 <- lm(performance ~ studentsLog + studteach + calworks + lunch +
compstud + income + expenditure + grades, data = cas)
summary(m1)
</code></pre>
<pre><code class="no-highlight">##
## Call:
## lm(formula = performance ~ studentsLog + studteach + calworks +
## lunch + compstud + income + expenditure + grades, data = cas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.8107 -0.2963 -0.0118 0.2712 1.5662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.99e-01 4.98e-01 1.80 0.072 .
## studentsLog -3.83e-02 1.91e-02 -2.01 0.045 *
## studteach -1.11e-02 1.59e-02 -0.70 0.487
## calworks 1.96e-03 2.96e-03 0.66 0.508
## lunch -2.65e-02 1.48e-03 -17.97 < 2e-16 ***
## compstud 7.88e-01 3.86e-01 2.04 0.042 *
## income 2.82e-02 4.89e-03 5.77 1.6e-08 ***
## expenditure 5.87e-05 4.90e-05 1.20 0.232
## gradesKK-08 -1.21e-01 6.49e-02 -1.87 0.062 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.457 on 411 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.791
## F-statistic: 199 on 8 and 411 DF, p-value: <2e-16
##
</code></pre>
<p>And some indicators of predictor relative importance.</p>
<pre><code class="r"># calc.relimp from relaimpo package.
(m1relaimpo <- calc.relimp(m1, type = "lmg", rela = TRUE))
</code></pre>
<pre><code class="no-highlight">## Response variable: performance
## Total response variance: 1
## Analysis based on 420 observations
##
## 8 Regressors:
## studentsLog studteach calworks lunch compstud income expenditure grades
## Proportion of variance explained by model: 79.48%
## Metrics are normalized to sum to 100% (rela=TRUE).
##
## Relative importance metrics:
##
## lmg
## studentsLog 0.009973
## studteach 0.016695
## calworks 0.177666
## lunch 0.492866
## compstud 0.025815
## income 0.251769
## expenditure 0.014785
## grades 0.010432
##
## Average coefficients for different model sizes:
##
## 1X 2Xs 3Xs 4Xs 5Xs
## studentsLog -0.08771 -0.0650133 -0.0558756 -0.0519312 -4.926e-02
## studteach -0.11918 -0.0861199 -0.0629499 -0.0462155 -3.372e-02
## calworks -0.05473 -0.0427576 -0.0324658 -0.0233760 -1.535e-02
## lunch -0.03199 -0.0310310 -0.0301497 -0.0293300 -2.856e-02
## compstud 4.15870 3.0673338 2.2639604 1.6844348 1.287e+00
## income 0.09860 0.0850555 0.0726892 0.0614726 5.140e-02
## expenditure 0.00030 0.0001986 0.0001374 0.0001013 8.061e-05
## grades -0.45677 -0.3345683 -0.2529014 -0.1981200 -1.628e-01
## 6Xs 7Xs 8Xs
## studentsLog -4.626e-02 -4.252e-02 -3.833e-02
## studteach -2.418e-02 -1.687e-02 -1.109e-02
## calworks -8.399e-03 -2.612e-03 1.962e-03
## lunch -2.785e-02 -2.718e-02 -2.654e-02
## compstud 1.034e+00 8.828e-01 7.884e-01
## income 4.250e-02 3.477e-02 2.821e-02
## expenditure 6.882e-05 6.206e-05 5.871e-05
## grades -1.414e-01 -1.291e-01 -1.215e-01
</code></pre>
<p>Thus, we can conclude that:</p>
<ol>
<li>Income and indicators of income (e.g., low levels of lunch vouchers) are the two main predictors. Thus, schools with greater average income tend to have better student performance.</li>
<li>Schools with more computers per student have better student performance.</li>
<li>Schools with fewer students per teacher have better student performance.</li>
</ol>
<p>For more information about relative importance and the <code>relaimpo</code> package measures check out <a href="http://prof.beuth-hochschule.de/groemping/relaimpo/">Ulrike Grömping's website</a>.<br/>
Of course this is all observational data with the usual caveats regarding causal interpretation.</p>
<h2>Now, let's look at some weird stuff.</h2>
<h3>Q. 8.1 What are common words in Californian School names?</h3>
<pre><code class="r"># create a vector of the words that occur in school names
lw <- unlist(strsplit(cas$school, split = " "))
# create a table of the frequency of words in school names
tlw <- table(lw)
# extract cells of table with count greater than 3
tlw2 <- tlw[tlw > 3]
# sorted in decreasing order
tlw2 <- sort(tlw2, decreasing = TRUE)
# values as proportions
tlw2p <- round(tlw2/nrow(cas), 3)
# show this in a bar graph
tlw2pdf <- data.frame(word = names(tlw2p), prop = as.numeric(tlw2p),
stringsAsFactors = FALSE)
ggplot(tlw2pdf, aes(word, prop)) + geom_bar() + coord_flip()
</code></pre>
<p><img src="http://i.imgur.com/eqxKN.png" alt="plot of chunk unnamed-chunk-8"/> </p>
<pre><code class="r"># make it log counts
ggplot(tlw2pdf, aes(word, log(prop * nrow(cas)))) + geom_bar() +
coord_flip()
</code></pre>
<p><img src="http://i.imgur.com/NJPiK.png" alt="plot of chunk unnamed-chunk-9"/> </p>
<p>The word “Elementary” appears in almost all school names (<code>98.3</code>%). The word “Union” appears in around half (<code>43.3</code>%).</p>
<p>Other common words pertain to:</p>
<ul>
<li>Directions (e.g., South, West), </li>
<li>Features of the environment
(e.g., Creek, Vista, View, Valley)</li>
<li>Spanish words (e.g., rio for river; san for saint)</li>
</ul>
<h3>Q. 8.2 Is the number of letters in the school's name related to academic performance?</h3>
<pre><code class="r">cas$namelen <- nchar(cas$school)
table(cas$namelen)
</code></pre>
<pre><code class="no-highlight">##
## 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 37 38 39
## 1 4 9 26 28 31 33 27 30 45 38 28 36 30 18 10 5 4 6 3 1 2 2 2 1
</code></pre>
<pre><code class="r">round(cor(cas$namelen, cas[, c("read", "math")]), 2)
</code></pre>
<pre><code class="no-highlight">## read math
## [1,] 0.03 0
</code></pre>
<p>The answer appears to be “no”.</p>
<h3>Q. 8.3 Is the number of words in the school name related to academic performance?</h3>
<pre><code class="r">cas$nameWordCount <- sapply(strsplit(cas$school, " "), length)
table(cas$nameWordCount)
</code></pre>
<pre><code class="no-highlight">##
## 2 3 4 5
## 140 202 72 6
</code></pre>
<pre><code class="r">round(cor(cas$nameWordCount, cas[, c("read", "math")]), 2)
</code></pre>
<pre><code class="no-highlight">## read math
## [1,] 0.05 0.01
</code></pre>
<p>The answer appears to be “no”.</p>
<h3>Q. 8.4 Are schools with nice popular nature words in their name doing better academically?</h3>
<pre><code class="r">tlw2p #recall the list of popular names
</code></pre>
<pre><code class="no-highlight">## lw
## Elementary Union City Valley Joint View
## 0.983 0.433 0.060 0.040 0.031 0.019
## Pleasant San Creek Oak Santa Lake
## 0.017 0.017 0.014 0.014 0.014 0.012
## Mountain Park Rio Vista Grove Lakeside
## 0.012 0.012 0.012 0.012 0.010 0.010
## South Unified West
## 0.010 0.010 0.010
</code></pre>
<pre><code class="r"># Create a quick and dirty list of popular nature names
naturenames <- c("Valley", "View", "Creek", "Lake", "Mountain", "Park",
"Rio", "Vista", "Grove", "Lakeside")
# work out whether the word is in the school name
schsplit <- strsplit(cas$school, " ")
cas$hasNature <- sapply(schsplit, function(X) length(intersect(X,
naturenames)) > 0)
round(cor(cas$hasNature, cas[, c("read", "math")]), 2)
</code></pre>
<pre><code class="no-highlight">## read math
## [1,] 0.09 0.08
</code></pre>
<p>So we've found a small correlation.<br/><br/>
Let's graph the data to see what it means:</p>
<pre><code class="r">ggplot(cas, aes(hasNature, read)) + geom_boxplot() + geom_jitter(position = position_jitter(width = 0.1)) +
xlab("Has a nature name") + ylab("Mean student reading score")
</code></pre>
<p><img src="http://i.imgur.com/TyyL3.png" alt="plot of chunk unnamed-chunk-14"/> </p>
<p>So in the sample nature schools have slightly better reading score (and if we were to graph it, maths scores). However, the number of schools having nature names is actually somewhat small (n= <code>61</code>) despite the overall quite large sample size.</p>
<p>But is it statistically significant?</p>
<pre><code class="r">t.read <- t.test(cas[cas$hasNature, "read"], cas[!cas$hasNature,
"read"])
t.math <- t.test(cas[cas$hasNature, "math"], cas[!cas$hasNature,
"math"])
</code></pre>
<p>So, the p-value is less than .05 for reading (p = <code>0.046</code>) but not quite for maths (p = <code>0.083</code>). Bingo! After a little bit of data fishing we have found that reading scores are “significantly” greater for those schools with the listed nature names.</p>
<p><strong>But wait</strong>: I've asked three separate exploratory questions or perhaps six if we take maths into account.</p>
<ul>
<li>$\frac{.05}{3} =$ <code>0.0167</code></li>
<li>$\frac{.05}{6} =$ <code>0.0083</code></li>
</ul>
<p>At these Bonferonni corrected p-values, the result is non-significant. Oh well…</p>
<h2>Review</h2>
<p>Anyway, the aim of this post was not to make profound statements about California schools. Rather the aim was to show how easy it is to produce quick reproducible reports with R Markdown. If you haven't already, you may want to open up <a href="https://gist.github.com/2724711">the R Markdown file used to produce this post</a> in RStudio, and compile the report yourself.</p>
<p>In particular, I can see R Markdown being my tool of choice for:</p>
<ul>
<li>Blog posts</li>
<li>Posts to StackExchange sites</li>
<li>Materials for training workshops</li>
<li>Short consulting reports, and</li>
<li>Exploratory analyses as part of a larger project.</li>
</ul>
<p>The real question is how far I can push Markdown before I start to miss the control of LaTeX. Markdown does permit arbitrary HTML. Anyway, if you have any thoughts about the scope of R Markdown, feel free to add a comment.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-91418556557499428922012-05-17T14:31:00.000+10:002012-05-17T23:42:31.496+10:00Getting Started with R Markdown, knitr, and Rstudio 0.96<p>This post examines the features of <a href="http://www.rstudio.org/docs/authoring/using_markdown">R Markdown</a>
using <a href="http://yihui.name/knitr/">knitr</a> in Rstudio 0.96.
This combination of tools provides an exciting improvement in usability for
<a href="http://stats.stackexchange.com/a/15006/183">reproducible analysis</a>.
Specifically, this post
(1) discusses getting started with R Markdown and <code>knitr</code> in Rstudio 0.96;
(2) provides a basic example of producing console output and plots using R Markdown;
(3) highlights several code chunk options such as caching and controlling how input and output is displayed;
(4) demonstrates use of standard Markdown notation as well as the extended features of formulas and tables; and
(5) discusses the implications of R Markdown.
This post was produced with R Markdown. The <a href="https://gist.github.com/2716336">source code is available here as a gist</a>.
The post may be most useful if the source code and displayed post are viewed side by side.
In some instances, I include a copy of the R Markdown in the displayed HTML, but most of the time I assume you are reading the source and post side by side.</p>
<a name='more'></a>
<h2>Getting started</h2>
<p>To work with R Markdown, if necessary:</p>
<ul>
<li>Install <a href="http://www.r-project.org/">R</a></li>
<li>Install the lastest version of <a href="http://rstudio.org/download/">RStudio</a> (at time of posting, this is 0.96)</li>
<li>Install the latest version of the <code>knitr</code> package: <code>install.packages("knitr")</code></li>
</ul>
<p>To run the basic working example that produced this blog post:</p>
<ul>
<li>Open R Studio, and go to File - New - R Markdown</li>
<li>If necessary install <code>ggplot2</code> and <code>lattice</code> packages: <code>install.packages("ggplot2"); install.packages("lattice")</code></li>
<li>Paste in the contents of <a href="https://gist.github.com/2716336">the gist (which contains the R Markdown file used to produce this post)</a> and save the file with an <code>.rmd</code> extension</li>
<li>Click Knit HTML</li>
</ul>
<pre><code class="r">opts_knit$set(upload.fun = imgur_upload) # upload all images to imgur.com
</code></pre>
<h2>Prepare for analyses</h2>
<pre><code class="r">set.seed(1234)
library(ggplot2)
library(lattice)
</code></pre>
<h2>Basic console output</h2>
<p>To insert an R code chunk, you can type it manually or just press <code>Chunks - Insert chunks</code> or use the shortcut key. This will produce the following code chunk:</p>
<pre><code class="no-highlight">```{r}
```
</code></pre>
<p>Pressing tab when inside the braces will bring up code chunk options.</p>
<p>The following R code chunk labelled <code>basicconsole</code> is as follows:</p>
<pre><code class="no-highlight">```{r basicconsole}
x <- 1:10
y <- round(rnorm(10, x, 1), 2)
df <- data.frame(x, y)
df
```
</code></pre>
<p>The code chunk input and output is then displayed as follows:</p>
<pre><code class="r">x <- 1:10
y <- round(rnorm(10, x, 1), 2)
df <- data.frame(x, y)
df
</code></pre>
<pre><code class="no-highlight">## x y
## 1 1 1.31
## 2 2 2.31
## 3 3 3.36
## 4 4 3.27
## 5 5 5.04
## 6 6 6.11
## 7 7 8.43
## 8 8 8.98
## 9 9 8.38
## 10 10 9.27
</code></pre>
<h2>Plots</h2>
<p>Images generated by <code>knitr</code> are saved in a figures folder. However, they also appear to be represented in the HTML output using a <a href="http://en.wikipedia.org/wiki/Data_URI_scheme">data URI scheme</a>. This means that you can paste the HTML into a blog post or discussion forum and you don't have to worry about finding a place to store the images; they're embedded in the HTML.</p>
<h3>Simple plot</h3>
<p>Here is a basic plot using base graphics:</p>
<pre><code class="no-highlight">```{r simpleplot}
plot(x)
```
</code></pre>
<pre><code class="r">plot(x)
</code></pre>
<p><img src="http://i.imgur.com/JRrm8.png" alt="plot of chunk simpleplot"/> </p>
<p>Note that unlike traditional Sweave, there is no need to write <code>fig=TRUE</code>.</p>
<h3>Multiple plots</h3>
<p>Also, unlike traditional Sweave, you can include multiple plots in one code chunk:</p>
<pre><code class="no-highlight">```{r multipleplots}
boxplot(1:10~rep(1:2,5))
plot(x, y)
```
</code></pre>
<pre><code class="r">boxplot(1:10 ~ rep(1:2, 5))
</code></pre>
<p><img src="http://i.imgur.com/TW0G1.png" alt="plot of chunk multipleplots"/> </p>
<pre><code class="r">plot(x, y)
</code></pre>
<p><img src="http://i.imgur.com/36WWn.png" alt="plot of chunk multipleplots"/> </p>
<h3><code>ggplot2</code> plot</h3>
<p>Ggplot2 plots work well:</p>
<pre><code class="r">qplot(x, y, data = df)
</code></pre>
<p><img src="http://i.imgur.com/s5mct.png" alt="plot of chunk ggplot2ex"/> </p>
<h3><code>lattice</code> plot</h3>
<p>As do lattice plots:</p>
<pre><code class="r">xyplot(y ~ x)
</code></pre>
<p><img src="http://i.imgur.com/qXKUO.png" alt="plot of chunk latticeex"/> </p>
<p>Note that unlike traditional Sweave, there is no need to print lattice plots directly.</p>
<h2>R Code chunk features</h2>
<h3>Create Markdown code from R</h3>
<p>The following code hides the command input (i.e., <code>echo=FALSE</code>), and outputs the content directly as code (i.e., <code>results=asis</code>, which is similar to <code>results=tex</code> in Sweave).</p>
<pre><code class="no-highlight">```{r dotpointprint, results='asis', echo=FALSE}
cat("Here are some dot points\n\n")
cat(paste("* The value of y[", 1:3, "] is ", y[1:3], sep="", collapse="\n"))
```
</code></pre>
<p>Here are some dot points</p>
<ul>
<li>The value of y[1] is 1.31</li>
<li>The value of y[2] is 2.31</li>
<li>The value of y[3] is 3.36</li>
</ul>
<h3>Create Markdown table code from R</h3>
<pre><code class="no-highlight">```{r createtable, results='asis', echo=FALSE}
cat("x | y", "--- | ---", sep="\n")
cat(apply(df, 1, function(X) paste(X, collapse=" | ")), sep = "\n")
```
</code></pre>
<table><thead>
<tr>
<th>x</th>
<th>y</th>
</tr>
</thead><tbody>
<tr>
<td>1</td>
<td>1.31</td>
</tr>
<tr>
<td>2</td>
<td>2.31</td>
</tr>
<tr>
<td>3</td>
<td>3.36</td>
</tr>
<tr>
<td>4</td>
<td>3.27</td>
</tr>
<tr>
<td>5</td>
<td>5.04</td>
</tr>
<tr>
<td>6</td>
<td>6.11</td>
</tr>
<tr>
<td>7</td>
<td>8.43</td>
</tr>
<tr>
<td>8</td>
<td>8.98</td>
</tr>
<tr>
<td>9</td>
<td>8.38</td>
</tr>
<tr>
<td>10</td>
<td>9.27</td>
</tr>
</tbody></table>
<h3>Control output display</h3>
<p>The folllowing code supresses display of R input commands (i.e., <code>echo=FALSE</code>)
and removes any preceding text from console output (<code>comment=""</code>; the default is <code>comment="##"</code>).</p>
<pre><code class="no-highlight">```{r echo=FALSE, comment="", echo=FALSE}
head(df)
```
</code></pre>
<pre><code class="no-highlight"> x y
1 1 1.31
2 2 2.31
3 3 3.36
4 4 3.27
5 5 5.04
6 6 6.11
</code></pre>
<h3>Control figure size</h3>
<p>The following is an example of a smaller figure using <code>fig.width</code> and <code>fig.height</code> options.</p>
<pre><code class="no-highlight">```{r smallplot, fig.width=3, fig.height=3}
plot(x)
```
</code></pre>
<pre><code class="r">plot(x)
</code></pre>
<p><img src="http://i.imgur.com/fg18e.png" alt="plot of chunk smallplot"/> </p>
<h3>Cache analysis</h3>
<p>Caching analyses is straightforward.
Here's example code.
On the first run on my computer, this took about 10 seconds.
On subsequent runs, this code was not run. </p>
<p>If you want to rerun cached code chunks, just <a href="http://stackoverflow.com/a/10629121/180892">delete the contents of the <code>cache</code> folder</a></p>
<pre><code class="no-highlight">```{r longanalysis, cache=TRUE}
for (i in 1:5000) {
lm((i+1)~i)
}
```
</code></pre>
<h2>Basic markdown functionality</h2>
<p>For those not familiar with standard <a href="http://daringfireball.net/projects/markdown/">Markdown</a>, the following may be useful.
See the source code for how to produce such points. However, RStudio does include a Markdown quick reference button that adequately covers this material.</p>
<h3>Dot Points</h3>
<p>Simple dot points:</p>
<ul>
<li>Point 1</li>
<li>Point 2</li>
<li>Point 3</li>
</ul>
<p>and numeric dot points:</p>
<ol>
<li>Number 1</li>
<li>Number 2</li>
<li>Number 3</li>
</ol>
<p>and nested dot points:</p>
<ul>
<li>A
<ul>
<li>A.1</li>
<li>A.2</li>
</ul></li>
<li>B
<ul>
<li>B.1</li>
<li>B.2</li>
</ul></li>
</ul>
<h3>Equations</h3>
<p>Equations are included by using LaTeX notation and including them either between single dollar signs (inline equations) or double dollar signs (displayed equations).
If you hang around the Q&A site <a href="http://stats.stackexchange.com">CrossValidated</a>, you'll be familiar with this idea.</p>
<p>There are inline equations such as $y_i = \alpha + \beta x_i + e_i$.</p>
<p>And displayed formulas:</p>
<p>$$\frac{1}{1+\exp(-x)}$$</p>
<p>knitr provides self-contained HTML code that calls a Mathjax script to display formulas.
However, in order to include the script in my blog posts I <a href="https://gist.github.com/2716053">took the script</a> and incorporated it into my blogger template.
If you are viewing this post through syndication or an RSS reader, this may not work.
You may need to view this post on my website. </p>
<h3>Tables</h3>
<p>Tables can be included using the following notation</p>
<table><thead>
<tr>
<th>A</th>
<th>B</th>
<th>C</th>
</tr>
</thead><tbody>
<tr>
<td>1</td>
<td>Male</td>
<td>Blue</td>
</tr>
<tr>
<td>2</td>
<td>Female</td>
<td>Pink</td>
</tr>
</tbody></table>
<h3>Hyperlinks</h3>
<ul>
<li>If you like this post, you may wish to subscribe to <a href="http://feeds.feedburner.com/jeromyanglim">my RSS feed</a>.</li>
</ul>
<h3>Images</h3>
<p>Here's an example image:</p>
<p><img src="http://i.imgur.com/RVNmr.jpg" alt="image from redmond barry building unimelb"/></p>
<h3>Code</h3>
<p>Here is Markdown R code chunk displayed as code:</p>
<pre><code class="no-highlight">```{r}
x <- 1:10
x
```
</code></pre>
<p>And then there's inline code such as <code>x <- 1:10</code>.</p>
<h3>Quote</h3>
<p>Let's quote some stuff:</p>
<blockquote>
<p>To be, or not to be, that is the question:
Whether 'tis nobler in the mind to suffer
The slings and arrows of outrageous fortune,</p>
</blockquote>
<h2>Conclusion</h2>
<ul>
<li>R Markdown is awesome.
<ul>
<li>The ratio of markup to content is excellent. </li>
<li>For exploratory analyses, blog posts, and the like R Markdown will be a powerful productivity booster. </li>
<li>For journal articles, LaTeX will presumably still be required.</li>
</ul></li>
<li>The RStudio team have made the whole process very user friendly.
<ul>
<li>RStudio provides useful shortcut keys for compiling to HTML, and running code chunks. These shortcut keys are presented in a clear way.</li>
<li>The incorporated extensions to Markdown, particularly formula and table support, are particularly useful.</li>
<li>Jump-to-chunk feature facilitates navigation. It helps if your code chunks have informative names.</li>
<li>Code completion on R code chunk options is really helpful. See also <a href="http://yihui.name/knitr/options">chunk options documentation on the knitr website</a>.</li>
</ul></li>
<li>Other recent posts on R markdown include those by :
<ul>
<li><a href="http://christophergandrud.blogspot.com.au/2012/05/dynamic-content-with-rstudio-markdown.html">Christopher Gandrud</a></li>
<li><a href="http://lamages.blogspot.com.au/2012/05/interactive-reports-in-r-with-knitr-and.html">Markcus Gesmann</a></li>
<li><a href="http://rstudio.org/docs/authoring/using_markdown">Rstudio on R Markdown</a></li>
<li><a href="http://yihui.name/knitr/">Yihui Xie</a>: I really want to thank him for developing <code>knitr</code>.
He has also posted <a href="https://github.com/yihui/knitr/blob/master/inst/examples/knitr-minimal.Rmd">this example of R Markdown</a>.</li>
</ul></li>
</ul>
<h2>Questions</h2>
<p>The following are a few questions I encountered along the way that might interest others.</p>
<h3>Annoying <code><br/></code>'s</h3>
<p><strong>Question:</strong> I asked on the Rstudio discussion site:
<a href="http://support.rstudio.org/help/discussions/problems/2329-why-does-r-markdown-to-html-insert-br-when-there-is-a-new-line-of-text">Why does Markdown to HTML insert <code><br/></code> on new lines?</a></p>
<p><strong>Answer:</strong> I just do a find and delete on this text for now.
Specifically, I have a sed command that extracts just the content between the <code>body</code> tags and removes <code>br</code> tags.
I can then, readily incorporate the result into my blogposts.</p>
<pre><code>sed -i -e '1,/<body>/d' -e'/^<\/body>/,$d' -e 's/<br\/>$//' filename.html
</code></pre>
<h3>Temporarily disable caching</h3>
<p><strong>Question:</strong> I asked on StackOverflow about
<a href="http://stackoverflow.com/q/10628665/180892">How to set cache=FALSE for a knitr markdown document and override code chunk settings?</a></p>
<p><strong>Answer:</strong> Delete the cache folder. But there are other possible workflows.</p>
<h3>Equivalent of Sexpr</h3>
<p><strong>Question:</strong> I asked on Stack Overvlow about <a href="http://stackoverflow.com/q/10629416/180892">whether there an R Markdown equivalent to Sexpr in Sweave?</a>.</p>
<p><strong>Answer:</strong> Include the code between brackets of “backtick r space” and “backtick”.
E.g., in the source code I have calculated 2 + 2 = <code>4</code> .</p>
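<p>Written out literally, that inline chunk looks like this:</p>
<pre><code class="no-highlight">`r 2 + 2`
</code></pre>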
<h3>Image format</h3>
<p><strong>Question:</strong> When using the URI scheme images don't appear to display in RSS feeds of my blog.
What's a good strategy?</p>
<p><strong>Answer:</strong> One strategy is to upload to imgur.
The <a href="https://raw.github.com/yihui/knitr/master/inst/examples/knitr-upload.Rmd">following provides</a> an example of exporting to imgur.</p>
<p>Add the following lines of code near the top of the file:</p>
<pre><code class="no-highlight">``` {r optsknit}
opts_knit$set(upload.fun = imgur_upload) # upload all images to imgur.com
```
</code></pre>
<p>I found that the function failed when I was at work behind a firewall, but worked at home.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-46935181122898193342012-05-03T22:31:00.000+10:002012-05-04T09:19:38.367+10:00How to plot three categorical variables and one continuous variable using ggplot2<p>This post shows how to produce a plot involving three categorical variables
and one continuous variable using <code>ggplot2</code> in R.</p>
<a name='more'></a>
<p>The <a href="https://gist.github.com/2585249">following code is also available as a gist on github</a>.</p>
<h5>1. Create Data</h5>
<p>First, let's load <code>ggplot2</code> and create some data to work with:</p>
<pre><code>library(ggplot2)
set.seed(4444)
Data <- expand.grid(group = c("Apples", "Bananas", "Carrots", "Durians",
                              "Eggplants"),
                    year = c("2000", "2001", "2002"),
                    quality = c("Grade A", "Grade B", "Grade C", "Grade D",
                                "Grade E"))
Group.Weight <- data.frame(
    group = c("Apples", "Bananas", "Carrots", "Durians", "Eggplants"),
    group.weight = c(1, 1, -1, 0.5, 0))
Quality.Weight <- data.frame(
    quality = c("Grade A", "Grade B", "Grade C", "Grade D", "Grade E"),
    quality.weight = c(1, 0.5, 0, -0.5, -1))
Data <- merge(Data, Group.Weight)
Data <- merge(Data, Quality.Weight)
Data$score <- Data$group.weight + Data$quality.weight +
    rnorm(nrow(Data), 0, 0.2)
Data$proportion.tasty <- exp(Data$score) / (1 + exp(Data$score))
</code></pre>
<h5>2. Produce Plot</h5>
<p>And here's the code to produce the plot.</p>
<pre><code>ggplot(data = Data,
       aes(x = factor(year), y = proportion.tasty,
           group = group,
           shape = group,
           color = group)) +
    geom_line() +
    geom_point() +
    # the original post used opts(title = ...), which later versions of
    # ggplot2 deprecated; labs(title = ...) is the current equivalent
    labs(title = "Proportion Tasty by Year, Quality, and Group") +
    scale_x_discrete("Year") +
    scale_y_continuous("Proportion Tasty") +
    facet_grid(. ~ quality)
</code></pre>
<p>And here's what it looks like:</p>
<p><a href="http://imgur.com/UWZgd"><img src="http://i.imgur.com/UWZgd.png"
title="three categorical variables ggplot2" width=520 alt="three categorical variables ggplot2" /></a></p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-21619095752847848542012-04-11T15:50:00.029+10:002022-11-03T20:29:59.323+11:00Getting Started with JAGS, rjags, and Bayesian Modelling<p>This post provides links to various resources on getting started with Bayesian modelling using JAGS and R. It discusses: (1) what is JAGS; (2) why you might want to perform Bayesian modelling using JAGS; (3) how to install JAGS; (4) where to find further information on JAGS; (5) where to find examples of JAGS scripts in action; (6) where to ask questions; and (7) some interesting psychological applications of Bayesian modelling.</p>
<a name='more'></a>
<h3>What is JAGS?</h3>
<p>JAGS stands for Just Another Gibbs Sampler. To quote the program author, Martyn Plummer, "It is a program for analysis of Bayesian hierarchical models using Markov Chain Monte Carlo (MCMC) simulation..." It uses a dialect of the BUGS language that is similar to, but a little different from, the dialects used by OpenBUGS and WinBUGS.</p>
<h3>Why JAGS?</h3>
<p>The question of why you might want to use JAGS can be approached in several different ways:</p>
<ul>
<li><p><strong>Why Bayesian rather than Null Hypothesis Significance Testing approaches?</strong></p>
<ul>
<li>To quote John D. Cook quoting Anthony O'Hagan, the benefits of the Bayesian approach are that it is "1. fundamentally sound, 2. very flexible, 3. produces clear and direct inferences, and 4. makes use of all available information" (see <a href="http://www.johndcook.com/blog/2009/04/28/reasons-to-use-bayesian-inference/">John's blog post for elaboration</a>).</li>
<li>John K. Kruschke made a similar argument in an Open Letter extolling the benefits of the Bayesian approach, summarised as: "(1) Scientific disciplines from astronomy to zoology are moving to Bayesian data analysis. We should be leaders of the move, not followers. (2) Modern Bayesian methods provide richer information, with greater flexibility and broader applicability than 20th century methods. Bayesian methods are intellectually coherent and intuitive. Bayesian analyses are readily computed with modern software and hardware. (3) Null-hypothesis significance testing (NHST), with its reliance on p values, has many problems. There is little reason to persist with NHST now that Bayesian methods are accessible to everyone."</li>
</ul>
</li>
<li><p><strong>Why JAGS/BUGS rather than coding in a low-level language?</strong></p>
<ul>
<li>It's simpler; for models that BUGS can handle, BUGS can shield you from some of the thorny details related to numeric integration.</li>
<li>There are simple interfaces with R.</li>
</ul>
</li>
<li><p><strong>Why JAGS rather than WinBUGS or OpenBUGS?</strong></p>
<ul>
<li>I'm using JAGS because it works well on Ubuntu. WinBUGS is largely Windows-specific, although I've read that it may run under the compatibility layer Wine.</li>
<li>JAGS interfaces well with R. I'm comfortable writing scripts. Thus, I don't personally see the benefits of using a dedicated GUI like WinBUGS. I can leverage what I know about R.</li>
<li>However, ultimately converting code between different flavours of BUGS is not that difficult.</li>
<li>For further discussion of the issue, see <a href="http://stats.stackexchange.com/questions/9202/openbugs-vs-jags">this discussion on CrossValidated</a>.</li>
</ul>
</li>
</ul>
<p>More than anything I found that JAGS provided a useful entry point into the world of Bayesian modelling. This in turn appealed to me for several reasons:</p>
<ol>
<li> Even when I perform analyses using an NHST approach, I often intuitively think of empirical research questions in terms of a probability density on a parameter of interest that changes as empirical and theoretical evidence accumulates. See, for example, Thompson's (2002) concept of meta-analytic thinking. Bayesian analysis provides tools for formalising this orientation.</li>
<li> More broadly, I appreciate the explicitness that a Bayesian approach requires and encourages. E.g., specifying the distribution of the error term, specifying a prior, specifying the distribution of parameters in a mixed effects model, and so on.</li>
<li> There are several modelling challenges that I'm currently working through where a Bayesian approach offers substantial flexibility and applicability. In particular, I'm interested in modelling individual differences in the effect of practice on strategy use and task performance and then relating these individual differences to factors like intelligence, prior experience, and personality.</li>
</ol>
<h3>JAGS Installation</h3>
<p>JAGS runs on Linux, Mac, and Windows. I run JAGS on Ubuntu through an interface with R called <code>rjags</code>.</p>
<p>The following sets out a basic installation process:</p>
<ol>
<li> If necessary <a href="http://www.r-project.org/">Download and install R</a> and potentially a user interface to R like <a href="http://rstudio.org/">R Studio</a> (see <a href="http://jeromyanglim.blogspot.com.au/2009/06/learning-r-for-researchers-in.html">here for tips on getting started with R</a>).</li>
<li> <a href="http://mcmc-jags.sourceforge.net/">Download and install JAGS</a> as per operating system requriements.</li>
<li> Install additional R packages: e.g., in R, run <code>install.packages("rjags")</code>. In particular, I use the packages <code>rjags</code> to interface with JAGS and <code>coda</code> to process MCMC output. (A minimal script for checking the installation follows this list.)</li>
</ol>
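<p>To check that JAGS, <code>rjags</code>, and <code>coda</code> are all talking to each other, a minimal sketch along the following lines fits a normal distribution to simulated data (the model string and object names are just illustrative):</p>
<pre><code>library(rjags)   # depends on, and loads, coda

# A tiny JAGS model: normal likelihood with vague priors.
# Note that JAGS parameterises dnorm() by precision (tau = 1/variance),
# not by standard deviation.
model.string <- "
model {
  for (i in 1:N) {
    y[i] ~ dnorm(mu, tau)
  }
  mu  ~ dnorm(0, 1.0E-4)
  tau ~ dgamma(0.001, 0.001)
}
"

set.seed(1234)
y <- rnorm(50, mean = 100, sd = 15)

m <- jags.model(textConnection(model.string),
                data = list(y = y, N = length(y)),
                n.chains = 3)
update(m, 1000)                         # burn-in
samples <- coda.samples(m, variable.names = c("mu", "tau"),
                        n.iter = 5000)
summary(samples)                        # posterior summaries for mu and tau
plot(samples)                           # trace and density plots via coda
</code></pre>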
<h3>Information on JAGS</h3>
<ul>
<li><p>The <a href="http://sourceforge.net/projects/mcmc-jags/files/Manuals/">manual for different versions of JAGS is located here</a>. Several particularly relevant sections include:</p>
<ul>
<li>the list of supported distributions and how they are parameterised. This is often important given that the code looks similar to R but often uses different parameterisation (e.g., precision is used instead of standard deviation for a normal distribution).</li>
<li>It summarises differences between WinBUGS and JAGS.</li>
<li>It sets out available functions and operators.</li>
</ul>
</li>
<li><p>The <a href="http://cran.r-project.org/web/packages/rjags/rjags.pdf"><code>rjags</code> help pdf</a> for information about how to interface with JAGS from R.</p></li>
<li><a href="http://martynplummer.wordpress.com/">Martin Plummer has a blog called JAGS NEWS</a></li>
<li>The <a href="">Bayesian Task View on CRAN</a> lists and briefly describes the many R packages related to Bayesian statistics.</li>
<li>Lunn and colleagues have a 2009 article in <em>Statistics in Medicine</em> called <a href="">The BUGS project: Evolution, critique and future directions</a>. It provides a useful historical perspective on the broader BUGS project, although it does not mention much about JAGS specifically.</li>
</ul>
<h3>Example JAGS Scripts</h3>
<p>I find it easier to pick up a new language by playing with examples. The following provides links to example JAGS code, often with accompanying explanations:</p>
<ul style="text-align: left;">
<li>Justin Esarey
<ul>
<li>An entire course on <a href="http://jee3.web.rice.edu/teaching.htm">Bayesian Statistics</a> with examples in R and JAGS. It includes 10 lectures and each lecture lasts around 2 hours. The content is designed for a social science audience and it includes a syllabus linking with Simon Jackman's text. The videos are linked from above or available directly on <a href="http://www.youtube.com/playlist?list=PLAFC5F02F224FA59F">YouTube</a>.</li>
</ul>
</li>
<li><p>John Myles White</p>
<ul>
<li>A course on statistical models that is under development with <a href="https://github.com/johnmyleswhite/JAGSExamples">JAGS scripts on github</a></li>
<li>A <a href="http://www.johnmyleswhite.com/notebook/2011/03/16/canabalt-revisited-gamma-distributions-multinomial-distributions-and-more-jags-goodness/">model of Cannabalt scores using a gamma distribution</a></li>
<li><a href="http://www.johnmyleswhite.com/notebook/2010/08/20/using-jags-in-r-with-the-rjags-package/">Simple introductory examples of fitting a normal distribution, linear regression, and logistic regression</a></li>
<li>A follow-up post demonstrating the use of the <code>coda</code> package with <code>rjags</code> to <a href="http://www.johnmyleswhite.com/notebook/2010/08/29/mcmc-diagnostics-in-r-with-the-coda-package/">perform MCMC diagnostics</a>.</li>
</ul>
</li>
<li><p>John K. Kruschke</p>
<ul>
<li>John Kruschke wrote a book called <em>Doing Bayesian Data Analysis: A Tutorial with R and BUGS</em>. It's an excellent entry point into the world of Bayesian statistics for the social and behavioural scientist who has reasonable quantitative training, but is not necessarily ready to absorb the kinds of books that are used in graduate-level statistics courses.</li>
<li> See this <a href="http://doingbayesiandataanalysis.blogspot.com.au/2012/01/complete-steps-for-installing-software.html">blog post for a link to the zip file containing the JAGS code</a>.</li>
</ul>
</li>
<li><p>BUGS Project</p>
<ul>
<li>BUGS is well known for the large set of examples that accompany the project.</li>
<li>You can see the <a href="http://sourceforge.net/projects/mcmc-jags/files/Examples/2.x/">JAGS code used to run these examples here</a>.</li>
</ul>
</li>
<li><p>Patrick J Mineault</p>
<ul>
<li>An <a href="http://xcorr.net/2011/07/13/gibbs-sampling-made-easy-jags-rkward-coda/">example from Gelman et al examining the effect of training programs on SAT scores</a></li>
</ul>
</li>
<li><p>Simon Jackman wrote the book <em>Bayesian Analysis for the Social Sciences</em> that has accompanying JAGS code.</p></li>
</ul>
<p>More broadly, examples and tutorials designed for WinBUGS can generally be adapted to be useful for JAGS. So for example, you can explore these WinBUGS examples:</p>
<ul>
<li>Michael Lee and Eric-Jan Wagenmakers have a free online book called <em>A Course in Bayesian Graphical Modeling for Cognitive Science</em>.</li>
<li>The website for the book <a href="http://www.dme.ufrj.br/mcmc/">Markov Chain Monte Carlo</a> has several WinBUGS examples.</li>
<li>There is an <a href="http://www.mrc-bsu.cam.ac.uk/bugs/weblinks/webresource.shtml">extensive list of BUGS resources</a> on the BUGS project website.</li>
</ul>
<h3>Asking questions</h3>
<p>There are several places to ask questions about JAGS, R, and Bayesian statistics.</p>
<ul>
<li><a href="http://stats.stackexchange.com/questions/tagged/jags">JAGS</a>, <a href="http://stats.stackexchange.com/questions/tagged/bugs">BUGS</a>, and <a href="http://stats.stackexchange.com/questions/tagged/bayesian">bayesian</a> questions on <a href="http://stats.stackexchange.com/">stats.stackexchange.com</a> (aka CrossValidated).</li>
<li><a href="http://sourceforge.net/projects/mcmc-jags/forums/">JAGS discussion forum</a></li>
<li>There's also a <a href="http://www.mrc-bsu.cam.ac.uk/bugs/overview/list.shtml">BUGS discussion list</a></li>
</ul>
<p>In general, I prefer the Stack Exchange model for asking and answering questions on the internet, although the most important issue is typically where the experts are located.</p>
<h3>Interesting Psychological Applications of Bayesian Modelling</h3>
<p>If you want to see some examples of Bayesian modelling applied to psychological data, I found the following articles quite interesting. PDFs are available online.</p>
<ul>
<li>Shiffrin, Lee, Kim, and Wagenmakers (2008) present a tutorial on hierarchical bayesian methods in the context of cognitive science.</li>
<li>Michael Lee (2011) in Journal of Mathematical Psychology discusses the benefits of hierarchical Bayesian methods for modelling psychological data and provides several example applications.</li>
<li>Lee Averell and Andrew Heathcote (2010) in Journal of Mathematical Psychology analyse individual differences in the forgetting curve using a hierarchical Bayesian approach.</li>
</ul>
<p>If you know of any other interesting JAGS resources or have any comments about my choice of software for Bayesian data analysis, feel free to post a comment.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-33248605629426381582012-02-17T21:21:00.000+11:002012-02-17T21:21:56.072+11:00New Psychology and Cognitive Science Question and Answer Site: COGSCI.SE<p>There is now a new website for researchers to ask and answer
questions on topics related to psychology and cognitive science.
The site is <a href="http://cogsci.stackexchange.com/">cogsci.stackexchange.com</a>.
Given the success of earlier-released <a href="http://stackexchange.com/sites">sites in the Stack Exchange
network</a> such as those on
<a href="http://stackoverflow.com/">programming</a>,
<a href="http://stats.stackexchange.com/">statistics</a>,
and <a href="http://tex.stackexchange.com/">latex</a>,
the site for psychology and cognitive science has the potential to be a great
resource for researchers.
<a href="http://cogsci.stackexchange.com/users/52/jeromy-anglim">I'm actively
contributing</a> on the site.
So, if you are a researcher in psychology, I hope you'll <a href="http://cogsci.stackexchange.com/">check it
out</a>.
The rest of this post sets out
(a) a little history of Stack Exchange question and answer sites as they relate to psychology and statistics;
(b) why I think this <a href="http://cogsci.stackexchange.com/">new site for psychology and cognitive science</a>
has so much potential; and
(c) why, if you are a professional or student researcher in psychology, you
might want to get involved.</p>
<a name='more'></a>
<h3>A little history</h3>
<p>I first learnt about the Stack Exchange network back in 2009.
While I was busy learning R, a number of people in the online data science world
such as <a href="http://www.cerebralmastication.com/">JD Long</a>,
<a href="http://www.oscon.com/oscon2009/public/schedule/detail/10432">Michael Driscoll</a>,
<a href="http://www.drewconway.com/zia/?p=1172">Drew Conway</a>, and <a href="https://twitter.com/#!/rstatsmob/following">many
more</a> were promoting a programmer's
question and answer site called Stack Overflow as a place to ask and answer R
related questions.
It was a site pitched at overcoming the many problems of discussion boards,
mailing lists, and the like: e.g., off-topic threads, spam, extended discussions,
difficulty finding the correct answers, poor indexing by Google, etc.
As of February 2012, it has over <a href="http://stackoverflow.com/questions/tagged/r?sort=votes&pagesize=50">10,000 questions with the R
tag</a>.</p>
<p>Shortly afterwards, the Stack Exchange Network developed <a href="http://area51.stackexchange.com">Area51</a>.
The idea was to take the question and answer infrastructure that made Stack
Overflow a success in the programming world, and extend it to all sorts of other
domains.
Instead of following the model of Quora or, shudder to think, Yahoo Answers,
Stack Exchange did not permit the creation of a site until a sufficient
community of active users existed to maintain the site at a high standard.
Thus, my main interests, statistics and psychology, were going to have to wait.</p>
<p>A site for statistics questions was the first to join the network
(<a href="http://stats.stackexchange.com/">stats.stackexchange.com</a>).
<a href="http://robjhyndman.com/researchtips/crossvalidated/">Professor Rob Hyndman proposed the
site</a>, and perhaps given the
overlapping worlds of programmers and data analysts, the
site launched a few months later in July 2010.
At the time of posting it has over <a href="http://stats.stackexchange.com/questions?sort=votes">7,000
questions</a>.
<a href="http://stats.stackexchange.com/users/183/jeromy-anglim">I've been actively involved</a> in the site.
I've used it to get advice on my own research.
I've also used it extensively in various statistical consulting roles.
In particular, I've <a href="http://jeromyanglim.blogspot.com.au/2011/03/how-to-ask-me-statistics-question.html">encouraged others who would otherwise send me an email about
statistics, to post the question on
stats.se</a>
so that any answer can be an ongoing resource for others.</p>
<p>In the case of psychology and cognitive science, I've had to wait a lot longer.
The overlap between programming and psychology communities is much smaller,
and site proposals were split over separate cognitive science, psychology, and
psychiatry proposals.
Finally, in December 2011 these three proposals were merged and on January 19th 2012 the
site was launched in Beta under the title Cognitive Sciences at the url
<a href="http://cogsci.stackexchange.com/">cogsci.stackexchange.com</a>.
Although the initial name suggests a focus on "cognitive" science, the history of
the merging of site proposals, the plural "sciences" in the title, and the
attitude of current participants all admit the full range of questions in
cognitive science, psychology, and psychiatry.</p>
<p>At the time of posting the site is growing at a healthy rate.
Most questions are getting good answers, and the community norms around question
quality, references, scope and so on are being clarified on the <a href="http://meta.cogsci.stackexchange.com/">meta
site</a>.
However, there is also the challenge of getting the word out about the site to
academics, researchers, and graduate students who are not otherwise familiar
with the Stack Exchange network of sites.
In my opinion, Stack Exchange provides the best currently available
infrastructure for building a high quality question and answer site.
However, it still relies on a community of expert contributors.</p>
<p>So, if you're a researcher in psychology or cognitive science, why might you
want to get involved? And why might you want to tell fellow researchers about
the site?</p>
<h3>Reasons to participate as an academic</h3>
<p>If you are an academic, Lecturer, or Post Doc, there are many reasons why you
might want to participate:</p>
<ul>
<li>Answering questions is a way of facilitating knowledge transfer to the
broader community; this can be intrinsically enjoyable especially when you get direct
feedback on the number of people who read your answers.</li>
<li>If you use your own name, as many people do, the voting and reputation system,
and various other mechanisms provide a means for your contribution to be
recognised.</li>
<li>You get immediate feedback on what others think of your answers; thus, it
creates an environment conducive to learning.</li>
<li>I see sites like Stack Exchange as part of a broader model of open science.
As you create and develop knowledge, you encounter challenges. The idea is to
record these challenges as questions and then add the resolutions as answers.
Thus, when others encounter the same problems, good answers are only a Google
search away.
I'm not saying that question and answer sites replace journal articles, but
they can fill a bridging role linking the language of questions to the answers
provided in journal articles.</li>
<li>Furthermore, the content on Stack Exchange is licenced under creative commons,
so even if the site disappeared the content would still be available on other
sites that reproduce the material. This is much better
than the policy of almost all journals, which copyright your (typically state-sponsored)
research and lock it up behind a pay wall, thereby frustrating the
process of knowledge dissemination.</li>
<li>While contributing to Wikipedia is another great way to improve the sum of all
knowledge, on Stack Exchange your answers generally stay as you wrote them;
on Wikipedia, in contrast, your contributions can be, and often are, completely
rewritten or removed by other editors.</li>
</ul>
<h3>Reasons to participate as a student researcher</h3>
<p>If you are doing a thesis in psychology or cognitive science, or possibly even if
you are just studying a few subjects, many of the above reasons for
participating will also apply.</p>
<p>However, you may also find that the capacity to ask questions will be
particularly useful. As a side point, if it is early days in your career, you
may or may not want to use your real name.</p>
<p>In particular, I'd encourage students completing a thesis to incorporate asking
and answering questions into their scholarly process.
You might encounter questions like:</p>
<ul>
<li>Is there a meta analysis on X?</li>
<li>What are the main theories about Y?</li>
<li>What is the best measure of Z?</li>
<li>What is the empirical support for theory W?</li>
</ul>
<p>These kinds of questions come up all the time when doing research.
Of course, as researchers we have strategies for finding answers ourselves.
However, the stack exchange model encourages you to learn from the answers of
others and also to "leave crumbs" so that others can follow in your footsteps
more easily. The idea is not to be shy. Post questions frequently. If you're able
to answer your question, contribute a self-answer.</p>
<p>Thus, even if only a handful of people ever read a thesis, by asking and
answering many questions along the way, resources will be left that thousands of
people will learn from and discover through Google searches in the years to
come. Even if you don't have answers, your question can be the trigger for an
expert to share knowledge to create a valuable Internet artefact.</p>
<h3>Getting Started</h3>
<p>If you want to learn more or give the site a go:</p>
<ul>
<li>Have a read through the <a href="http://cogsci.stackexchange.com/faq">FAQ on cogsci.se</a></li>
<li><a href="http://cogsci.stackexchange.com/questions/ask">Ask a question</a></li>
<li>See if you can answer one of the <a href="http://cogsci.stackexchange.com/unanswered">currently unanswered
questions</a></li>
</ul>
<p>I'm <a href="http://cogsci.stackexchange.com/users/52/jeromy-anglim">floating around on the
site</a>, so if you're a
researcher in psychology, I hope to see you
<a href="http://cogsci.stackexchange.com/">there</a>.</p>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-52561059035694041782011-07-24T21:40:00.000+10:002013-11-14T15:59:39.298+11:00Tips for Undergraduates Interested in a Career in Organisational Psychology: Australian Perspective<p>Undergraduate psychology students often ask me about careers in organisational
psychology.
This post aims to provide a few links and resources to assist such students to
learn about the profession and the career pathways.
The post includes (a) a basic description of organisational psychology, (b)
links to Australian educational and professional society resources, (c)
discussion of PhD and academic options, and (d) additional resources to learn
more about the profession.</p>
<a name='more'></a>
<h3>Overview of Organisational Psychology</h3>
<h4>What is the profession called?</h4>
<p>Before discussing the profession, some consideration should be given to what to
call it.
'Organisational psychology' goes by various names and abbreviations:</p>
<ul>
<li>Organisational Psychology (Org Psych) </li>
<li>Industrial/Organisational Psychology (I/O or I/O Psych)</li>
<li>Work Psychology</li>
<li>Occupational Psychology </li>
</ul>
<p>Different names imply both historical and present differences in focus.
However, such terms are also often used interchangeably.
See <a href="
http://www.siop.org/userfiles/file/What's%20In%20A%20Name.pdf">SIOP's 'What's in a Name?'</a>
article for an overview of various job titles.</p>
<p>Names vary by region.
In the United States, "I/O" is preferred.
In Australia, "Organisational Psychology" is arguably the more common term,
consistent with the APS college name and many course names.
Thus, I'll tend to use this term in this post.</p>
<h4>What is organisational psychology?</h4>
<p>Here are a few descriptions:</p>
<ul>
<li>"Organisational Psychology is the science of people at work. Organisational
psychologists specialise in analysing organisations and their people, and
devising strategies to recruit, motivate, develop, change and inspire." -
<a href="
http://www.groups.psychology.org.au/Assets/Files/COP%20AGM%202008%20Reports.pdf
">prize winning elevator pitch (APS COP) </a></li>
<li>Industrial / Organisational psychologists "Apply principles of psychology to
personnel, administration, management, sales, and marketing problems.
Activities may include policy planning; employee screening, training and
development; and organizational development and analysis. May work with
management to reorganize the work setting to improve worker productivity." -
<a href="
http://online.onetcenter.org/link/summary/19-3032.00">Industrial/Organisational Psychologist job description on O*NET</a></li>
<li>"Industrial-organizational (I-O) psychology is the scientific study of the
workplace. Rigor and methods of psychology are applied to issues of critical
relevance to business, including talent management, coaching, assessment,
selection, training, organizational development, performance, and work-life
balance." - <a href="http://www.siop.org/studentdefault.aspx">SIOP: Student Section</a></li>
</ul>
<p>The APS College of Organisational Psychology has a page that describes <a href="
http://www.groups.psychology.org.au/cop/about_us/org_psychologists/">"What is an
organisational psychologist" and "Areas of Specialisation"</a>.</p>
<h4>Learning more about the profession:</h4>
<p>A good strategy for learning more about the profession is to browse the various
society pages:</p>
<ul>
<li><a href="
http://www.siop.org/default.aspx">SIOP - Division of the American Psychological Association</a>: The United States is huge; and I/O
is huge in the United States. The SIOP web page has heaps of
useful online resources.</li>
<li><a href="
http://www.bps.org.uk/dop/">Division of Occupational Psychology: British Psychological Society</a></li>
<li><a href="
http://www.groups.psychology.org.au/cop/">Australian Psychological Society: College of Organisational Psychologists</a></li>
</ul>
<h3>Organisational Psychology in Australia</h3>
<h4>Registration</h4>
<ul>
<li>"Psychologist" is a regulated term in Australia.
It is illegal to call yourself a psychologist, if you are not appropriately registered.</li>
<li>Pathways to registration are set out by the <a href="
http://www.psychologyboard.gov.au/">Psychology Board of Australia</a>.</li>
<li>The traditional pathway for registration has involved first completing a four
year accredited undergraduate psychology sequence, followed by either two
years of supervised practice or the completion of an accredited post-graduate
program (e.g., Masters, Doctorate, Masters / PhD).
Over recent years, rules for registration have been changing.
So, make sure you do your own research.</li>
<li>I should also mention that even if you can't call yourself a "psychologist",
completing an undergraduate major in psychology, particularly one with honours
in psychology (and perhaps also an undergraduate
subject in organisational psychology) can open doors to many roles related to
organisational psychology (e.g., HR, selection and recruitment,
marketing research, etc.).</li>
</ul>
<h4>Finding organisational psychology university programs in Australia:</h4>
<ul>
<li>The <a href="http://www.psychologycouncil.org.au/">APAC accreditation site</a> lists
approved postgraduate psychology programs. </li>
<li>To find organisational psychology courses, last I checked, the following
worked
<ul>
<li>Click <a href="http://www.psychologycouncil.org.au/course-search/australia/">Search for courses - Australia</a></li>
<li>Click on the State you want to search for</li>
<li>Search for "<code>org</code>"</li>
</ul></li>
</ul>
<h4>Groups and networking opportunities</h4>
<p><a href="
http://www.groups.psychology.org.au/cop/">The Australian Psychological Society: College of Organisational Psychologists</a> is the main group representing
organisational psychologists in Australia. <br />
It is made up of various state branches. <br />
The society sometimes runs sessions suited to students wanting to
learn more (e.g., careers fairs).</p>
<p>A few informal online groups are also good places to learn more about the
profession in Australia. Both welcome professionals and students:</p>
<ul>
<li>Facebook group: <a href="http://www.facebook.com/group.php?gid=2355243978">Organisational Psychology in
Australia</a> </li>
<li>LinkedIn group: <a href="http://www.linkedin.com/groups?home=&gid=147918&trk=anet_ug_hm">Organisational Psychology in
Australia</a></li>
</ul>
<h4>Salary Surveys</h4>
<p>There are many reasons to find a career in organisational psychology
intellectually stimulating and meaningful.
There have also often been financial reasons to find it attractive:</p>
<ul>
<li><a href="
http://www.groups.psychology.org.au/Assets/Files/Salary_Survey_exec_summary.pdf">Australia: A slightly dated 2006 APS COP Salary survey</a></li>
<li><a href="http://www.siop.org/surveys.aspx">United States: SIOP Salary surveys</a></li>
</ul>
<h3>A career in academia</h3>
<h4>PhD on a topic related to organisational psychology or related area</h4>
<ul>
<li>Doing a PhD on a topic related to organisational psychology can create
many opportunities.
Such a PhD can open up doors to academic positions in a wide range of
departments including psychology, HRM, management, business, and so on.
The solid background in statistics and research methods provides a
particular advantage for an academic career.
Of course, academic positions are competitive and generally require a good
publication track record.</li>
<li>Choosing a good PhD supervisor is important.
In addition to supervisors in departments that offer organisational psychology
programs, it's possible to look at supervisors in departments and universities
that don't offer such programs.</li>
<li>The skills learnt can also readily be applied in many social science research
related roles in industry.</li>
</ul>
<h4>Examples of eminent organisational psychology academics</h4>
<p>For those considering pursuing an academic career related to organisational
psychology, the <a href="
http://www.siop.org/awardwinners.aspx">past SIOP award recipients</a>,
particularly in the categories Distinguished Scientific Contribution, and
Distinguished Early Career Contribution, provide motivating examples of
successful I/O psychology researchers.</p>
<h4>Example academic websites</h4>
<p>The following links point to examples of successful academics in I/O psychology. <br />
I also selected these particular pages because each one provides PDFs for many
of the respective academic's publications.
This can give a flavour of the kind of work, focus, and specialisation that an
academic in I/O might engage in.</p>
<ul>
<li><a href="http://users.ugent.be/~flievens/">Filip Lievens</a></li>
<li><a href="http://www.bsos.umd.edu/psyc/gelfand/research.html">Michele Gelfand</a></li>
<li><a href="http://iopsych.msu.edu/koz/main.htm">Steve Kozlowski</a></li>
<li><a href="http://www.krannert.purdue.edu/directory/publications.asp?id=7090">Michael Campion</a></li>
<li><a href="http://www.management.wharton.upenn.edu/grant/">Adam Grant</a></li>
<li><a href="http://people.tamu.edu/~mbarrick/pubs.htm">Murray Barrick</a></li>
</ul>
<h4>Journals to read</h4>
<p>Further understanding of the research done in organisational psychology and
related disciplines can be gained from reading some of the core journals.
A good starting point can be gained by perusing
<a href="
http://www.siop.org/tip/backissues/TipApr01/03Zicker.aspx">the following ranked list of journals generated by Michael Zickar and Scott
Highhouse</a> back in
2001 based on a survey of SIOP members:</p>
<ol>
<li>Journal of Applied Psychology</li>
<li>Personnel Psychology</li>
<li>Academy of Management Journal</li>
<li>Academy of Management Review</li>
<li>Organizational Behavior and Human Decision Processes</li>
<li>Administrative Science Quarterly</li>
<li>Journal of Management</li>
<li>Journal of Organizational Behavior</li>
<li>Organizational Research Methods</li>
<li>Journal of Vocational Behavior</li>
</ol>
<h3>Additional Resources</h3>
<ul>
<li>Richard Landers has a series of posts providing advice on pursuing a career in
I/O psychology from the U.S. perspective:
<ul>
<li><a href="
http://neoacademic.com/2011/06/14/grad-school-should-i-get-a-ph-d-or-masters-in-io-psychology/">PhD or Masters in I/O</a> </li>
<li><a href="http://neoacademic.com/2011/07/19/grad-school-prepping-for-the-gre/">Prepping for the GRE</a> </li>
<li><a href="http://neoacademic.com/io-blogosphere/">He also lists other I/O Blogs</a></li>
</ul></li>
<li><a href="http://www.siop.org/tip/tip.aspx">TIP</a> is the official newsletter of SIOP.
Current and back issues are available online and provide a good insight into
the profession including the interface between professional practice and
scientific research.</li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.comtag:blogger.com,1999:blog-8909074830238091680.post-44315509213499130342011-07-17T22:30:00.000+10:002011-07-17T22:30:26.567+10:00Correlation Resources: SPSS, R, Causality, Interpretation, and APA Style Reporting<p>This post provides links to a range of resources related to the use and
interpretation of correlations.
I wanted to provide a page with links to a number of additional resources that
would be useful both for those of my students who might be keen to learn more
and for anyone else who might be interested.
Specifically, this post provides links to:
(a) introductory book-style chapters on correlation,
(b) resources related to assorted issues in correlation (i.e., discussion of
causal inference, correlation with various variable types, range restriction,
statistical power, correlation interpretation, and significance testing),
(c) tutorials on computing correlations using SPSS and R, and
(d) tips for reporting correlations in APA Style.</p>
<a name='more'></a>
<h3>Introductions to correlation</h3>
<p>The following provide general textbook style overviews of correlation:</p>
<ul>
<li><a href="
http://davidakenny.net/statbook/chapter_16.pdf">David Kenny's Chapter 16 Testing Measures of Association</a> provides a textbook overview
of correlation designed for psychology undergraduate students.
It also includes several practice questions.
David Kenny has kindly made his <a href="
http://davidakenny.net/statbook/">entire textbook 'Statistics for the Social
and Behavioral Sciences' available online for free</a> as either an <a href="http://davidakenny.net/statbook/kenny87.pdf">overall pdf</a> or
<a href="http://davidakenny.net/statbook/">individual chapters</a>.</li>
<li><a href="
http://www.psychstat.missouristate.edu/introbook/sbk17m.htm">David Stockburger's Introductory Statistics chapter on Correlation</a></li>
<li><a href="
http://web.psych.unimelb.edu.au/jkanglim/correlationandreggression.pdf">My own slides and notes on correlation</a> </li>
</ul>
<h3>Assorted Issues</h3>
<h4>Correlation and Causation</h4>
<p>Knowing how to reason about causality in the behavioural and social sciences is
a really important skill.</p>
<ul>
<li>Check out <a href="
http://jeromyanglim.blogspot.com/2009/10/how-to-reason-about-causes-in.html">this earlier post on correlation and causation</a>
which includes links to PDFs of important journal articles on the topic.</li>
<li><a href="http://www.youtube.com/watch?v=6RzDMEW5omc">Joy of Stats on Correlation</a>
provides a 4 minute video with a few entertaining examples of correlations and
their connection with causal inference.</li>
</ul>
<h4>Types of variables</h4>
<p>The prototypical correlation example is based on two continuous, normally
distributed variables.
However, in practice there are many other types of variables that you might
wish to correlate.
The following pages provide links to suggestions for how to analyse some
other common scenarios:</p>
<ul>
<li><a href="
http://stats.stackexchange.com/questions/3730/pearsons-or-spearmans-correlation-with-non-normal-data">What to do when one of the variables is non-normal?</a></li>
<li><a href="
http://stats.stackexchange.com/questions/8956/spearmans-or-pearsons-correlation-with-likert-scales-where-linearity-and-homosc">What to do when one of the variables is a Likert item?</a></li>
<li><a href="
http://jeromyanglim.blogspot.com/2009/10/analysing-ordinal-variables.html">What to do if you want to treat a variable as ordinal?</a></li>
</ul>
<h4>Range restriction</h4>
<ul>
<li>HyperStat has a general discussion of <a href="http://davidmlane.com/hyperstat/A68809.html">range
restriction</a></li>
<li>See this <a href="http://cnx.org/content/m11196/latest/">simulation on connexions showing the effect of range
restriction</a></li>
</ul>
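<p>A small simulation sketch (with arbitrary values) shows the attenuation: restricting the sample to cases more than one standard deviation above the mean on X substantially reduces the observed correlation.</p>
<pre><code>set.seed(2011)
x <- rnorm(10000)
y <- 0.6 * x + rnorm(10000, sd = 0.8)
cor(x, y)                          # full-range correlation (around .6)

restricted <- x > 1                # keep only high scorers on x
cor(x[restricted], y[restricted])  # attenuated correlation
</code></pre>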
<h4>Statistical Power</h4>
<p>Statistical power within the context of correlation is the probability of
obtaining a statistically significant correlation in a study, given a true
population correlation of a particular size and a particular sample size
(a small example using the <code>pwr</code> package follows the list below).</p>
<ul>
<li><a href="
http://jeromyanglim.blogspot.com/2010/05/statistical-power-analysis-in-gpower-3.html">This earlier post</a>
provides (a) some simple rules of thumb for power analysis for correlations,
(b) how to calculate statistical power using free software called G-Power,
and (c) links to additional reading on the important topic of statistical
power.</li>
</ul>
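<p>As a quick illustration, the <code>pwr</code> package in R (assuming it is installed) will solve for whichever quantity is left unspecified, e.g., the sample size needed to detect a correlation of a given size:</p>
<pre><code>library(pwr)

# sample size needed to detect r = .3 with 80% power at alpha = .05
pwr.r.test(r = 0.3, power = 0.80, sig.level = 0.05)
</code></pre>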
<h4>Interpretation</h4>
<p>When I first learnt about the correlation coefficient, I found it
challenging to truly grok what a particular value meant.
Learning the standard interpretation was easy.
The challenging part was understanding the practical and theoretical
implications for a correlation of a given size.</p>
<ul>
<li><p>The following are some of the <strong>standard interpretations</strong> of a correlation:</p>
<ul>
<li>Pearson's correlation is an index of the direction and strength of linear
association between two variables.</li>
<li>The square of the correlation between X and Y is the percentage of
variance shared between X and Y (e.g., if <code>r = .50</code>, then the two variables
share <code>.50 * .50 = 25%</code> of variance).</li>
<li>If X and Y were standardised (i.e., made so that the mean of both
variables was zero and the standard deviation was one) then, the
correlation would be the same as the regression coefficient of X
predicting Y or Y predicting X.
Thus, for example, if <code>r = .25</code> you could say that "a value one standard deviation
greater on X predicts a .25 standard deviation greater value on Y" (this
equivalence is checked in the short R sketch following this list).</li>
</ul></li>
<li><p>Strategies for <strong>building an intuition</strong> of what a correlation means:</p>
<ul>
<li>Play with the <a href="
http://www.ruf.rice.edu/~lane/stat_sim/reg_by_eye/">Regression by Eye</a> simulation.
The simulation generates a scatterplot, and you are asked to indicate which of
a set of correlations corresponds to the scatterplot.
It helps to build a mapping between the graphical intuitiveness of a
scatterplot and the numeric summary of the linear association in the
scatterplot (i.e., the correlation coefficient).</li>
<li>Memorise some of the rules of thumb for describing correlation effect sizes
(see this <a href="http://www.statisticshell.com/effectsizes.pdf">discussion by Andy
Field</a>), but don't take the
rules of thumb too seriously.</li>
<li>Try to build up a frame of reference for correlations in different contexts by
reading results sections. Meta analyses can also be particularly useful in
this regard.</li>
<li>Read the article 'Meyer, G. J., et al (2001). Psychological Testing and Psychological
Assessment: A Review of Evidence and Issues. <em>American Psychologist, 56</em>(2),
128-165.' (<a href="https://mywebspace.wisc.edu/hmarleau/web/edwards/psychometrics/myers.pdf">PDF</a>)
which provides large tables of meta-analytic correlations for a wide range of
medical and psychological domains sorted by the size of the correlation.
Studying these tables can help build an intuition and a context for
interpretation of correlations.</li>
</ul></li>
</ul>
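<p>The equivalence between the correlation and the regression coefficient for standardised variables is easy to check in R (a sketch with simulated data):</p>
<pre><code>set.seed(42)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)

cor(x, y)                      # Pearson correlation
coef(lm(scale(y) ~ scale(x)))  # the slope equals the correlation
</code></pre>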
<h4>Graphical approaches</h4>
<p>As with most statistical techniques, there are various ways of representing the
data.
The correlation coefficient provides a very brief summary of the association
between two variables.
However, graphical representations of association are much richer.</p>
<p>The following are some general heuristics that I find useful when plotting data
that might also be represented as a correlation (a few of these are sketched in
code after the list):</p>
<ul>
<li>Use scatterplots to explore features of the association (e.g., presence of
outliers, linearity, distributional properties, spread of data around any
trend line, etc.);</li>
<li>If one of the variables is positively skewed, consider plotting the
corresponding axis on a log scale;</li>
<li>If there are a lot of data points (e.g., <code>n > 1000</code>), adopt a different strategy
such as using some form of partial transparency (e.g., see use of the <a href="http://had.co.nz/ggplot2/geom_point.html">alpha
property in ggplot2</a>), or sampling
the data;</li>
<li>If one of the variables takes on a limited number of discrete categories,
consider using a jitter or a sunflower plot;</li>
<li>If there are three or more variables, consider using a scatterplot matrix;</li>
<li>Fitting some form of trend line is often useful;</li>
<li>Adjust the size of the plotting character to the sample size (for bigger n,
use a smaller plotting character).</li>
</ul>
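<p>A quick sketch of a few of these heuristics using <code>ggplot2</code> (simulated data; the alpha level and choice of trend line are purely illustrative):</p>
<pre><code>library(ggplot2)

# simulated data: many points and a positively skewed x variable
set.seed(123)
d <- data.frame(x = rlnorm(5000))
d$y <- log(d$x) + rnorm(5000)

ggplot(d, aes(x = x, y = y)) +
    geom_point(alpha = 0.2) +      # partial transparency for overplotting
    scale_x_log10() +              # log scale for the skewed variable
    geom_smooth(method = "lm")     # fitted trend line
</code></pre>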
<!--http://stats.stackexchange.com/questions/13148/good-online-resource-with-tips-on-graphing-association-between-two-numeric-variab-->
<h4>Significance tests on correlations</h4>
<p>There are a wide range of possible significance tests that can be performed on
correlations.
The following resources provide suggestions for different scenarios (a minimal
sketch for one common case follows the list).</p>
<ul>
<li><a href="
http://jeromyanglim.blogspot.com/2009/09/significance-tests-on-correlations.html">General post on comparing significance of two correlations</a>
under various conditions.</li>
<li><a href="
http://www.une.edu.au/WebStat/unit_materials/c6_common_statistical_tests/test_signif_pearson.html">Significance of correlation using Pearson's table</a></li>
</ul>
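<p>For the common case of comparing two correlations from independent samples, a minimal sketch of the Fisher r-to-z approach (with made-up values) looks like this:</p>
<pre><code># Fisher r-to-z test for two correlations from independent samples
r1 <- 0.50; n1 <- 100
r2 <- 0.30; n2 <- 120

z1 <- atanh(r1)                 # atanh() is the Fisher z transform
z2 <- atanh(r2)
se <- sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
z  <- (z1 - z2) / se
2 * pnorm(-abs(z))              # two-tailed p-value
</code></pre>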
<h3>Statistical Software</h3>
<p>Calculating a correlation coefficient and its associated statistical
significance is a standard task that almost any statistical package can perform.
Many psychology students are taught to use SPSS. It is a proprietary data
analysis system (i.e., you can't run it at home without a paid licence) with a
strong emphasis on a GUI and on making it easy to perform various standardised
analyses common in the social sciences.</p>
<p>My preferred tool for performing data analysis is R.
It is open source (thus, you can run it at home for free) and is often described
as the lingua franca of statistics. It generally requires a more sophisticated
understanding of statistics and computing to use effectively.
Thus, for the interested psychology student or researcher I have this
<a href="
http://jeromyanglim.blogspot.com/2009/06/learning-r-for-researchers-in.html">introduction to R for researchers in psychology</a>.</p>
<p>Below I list resources for performing correlation analysis in SPSS and R.</p>
<h4>SPSS</h4>
<ul>
<li><a href="http://www.statisticshell.com/correlation.pdf">Andy Field has a chapter on correlation</a>
which discusses correlation using SPSS. </li>
<li><a href="http://www.youtube.com/watch?v=loFLqZmvfzU">This video tutorial on running and interpreting a correlation analysis using
SPSS</a> goes for about 7 minutes
and is elementary.</li>
</ul>
<h4>R</h4>
<p>R makes it easy to perform correlations on datasets.
Specifically, the following links provide example syntax, and a minimal
illustration follows the list:
<ul>
<li><a href="http://www.statmethods.net/stats/correlations.html">Quick-R on correlations</a></li>
<li><a href="http://www.statmethods.net/graphs/scatterplot.html">Quick-R on scatterplots</a></li>
<li>More generally, William Revelle has some great resources on <a href="http://personality-project.org/r/r.guide.html">R for
psychology</a>.</li>
</ul>
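<p>As a minimal illustration using the built-in <code>mtcars</code> data:</p>
<pre><code># correlation between two variables, with a significance test
cor(mtcars$wt, mtcars$mpg)
cor.test(mtcars$wt, mtcars$mpg)

# Spearman's rho as a non-parametric alternative
cor(mtcars$wt, mtcars$mpg, method = "spearman")

# correlation matrix for several variables, rounded for readability
round(cor(mtcars[, c("mpg", "wt", "hp", "disp")]), 2)

# quick scatterplot matrix
pairs(mtcars[, c("mpg", "wt", "hp", "disp")])
</code></pre>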
<h3>Reporting Correlations in APA Style</h3>
<ul>
<li><strong>APA Style Manual:</strong> When required to report results using APA style, the
authoritative source is the <a href="http://www.apastyle.org/">Publication Manual of the
APA</a>.</li>
<li><strong>Article Deconstruction:</strong> Another general strategy is to find a journal
article that (a) reports a similar statistical test as you require, and (b)
that is published in an APA journal
or at least is in a journal that uses APA style.
<ul>
<li><a href="http://www.apa.org/pubs/journals/">APA journals are listed here</a></li>
<li>A quick search on <a href="http://scholar.google.com.au/">Google Scholar</a> will
often be sufficient and quicker, although PsycInfo (a subscription
service) is more reliable if you have access to it (many universities do).
E.g., a quick search for <a href="http://scholar.google.com.au/scholar?hl=en&q=apa+%22significant+correlation+between%22+psychology&btnG=Search&as_sdt=0%2C5&as_ylo=&as_vis=0">apa "significant correlation between"
psychology</a>
revealed several relevant articles and some with immediate PDF access.</li>
<li>I also have a separate post on this general approach of <a href="
http://jeromyanglim.blogspot.com/2009/09/introduction-to-journal-article.html">deconstructing
journal articles</a>
to discern writing principles.</li>
</ul></li>
<li><strong>Correlation Matrices:</strong> Many psychological studies, particularly those based on
correlational/observational designs, involve the measurement of a range of
numeric variables.
It is particularly useful, and common, in such cases to report a correlation
matrix between sets of variables.
I have a <a href="
http://jeromyanglim.blogspot.com/2009/02/formatting-correlation-matrices-in.html">post with instructions on formatting a correlation matrix</a>
in APA style using a combination of SPSS, Excel, and Word.
The post also includes links to examples of correlation matrices being
reported.</li>
<li><a href="http://huberb.people.cofc.edu/Guide/Reporting_Statistics%20in%20Psychology.pdfs">General overview of reporting statistics including correlations using APA
style</a></li>
</ul>jeromyanglimhttp://www.blogger.com/profile/12949204812496382042noreply@blogger.com