The following post shows how to manually convert a Sweave LaTeX document into a knitr R Markdown document. The post (1) reviews many of the required changes; (2) provides an example of a document converted to R Markdown format based on an analysis of Winter Olympic Medal data up to and including 2006; and (3) discusses the pros and cons of LaTeX and Markdown for performing analyses.
Overview
The following analyses of Winter Olympic Medals data have gone through several iterations:
- R Script: I originally performed similar analyses in February 2010. It was a simple set of commands where you could see the console output and view the plots.
- LaTeX Sweave: In February 2011 I adapted the example to make it a Sweave LaTeX document. The source for this is available on GitHub. With Sweave, I was able to create a document that wove together text, commands, console input, console output, and figures.
- R Markdown: Now in June 2012 I'm using the example to review the process of converting a document from Sweave LaTeX to R Markdown. The source code is available here on GitHub (see the `*.rmd` file).
Converting from Sweave to R Markdown
The following changes were required to convert my LaTeX Sweave document into an R Markdown document suitable for processing with knitr and RStudio. Many of these changes are fairly obvious if you understand LaTeX and Markdown, but a few are less so. Other documents may well require additional changes.
R code chunks
- R code chunk delimiters: Update from `<< ... >>=` and `@` to the R Markdown format ```` ```{r ...} ```` and ```` ``` ````.
- Inline code chunks: Update from `\Sexpr{...}` to either `` `r ...` `` or `` `r I(...)` `` format.
- results=tex: Any `results=tex` needs to be either removed or converted to `results='asis'`. Note that string values of knitr options need to be quoted.
- Boolean options: Sweave tolerates lower case `true` and `false` for code chunk options; knitr requires `TRUE` and `FALSE`.
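The chunk-related changes above are mechanical enough to script. The following is a minimal sketch in base R, assuming chunk headers and `@` terminators each sit on their own line and that `\Sexpr{}` calls contain no nested braces. The function name `sweave_to_rmd_chunks` is hypothetical and is not part of knitr or Sweave.

```r
# Hypothetical helper (not part of knitr): rewrites Sweave chunk syntax
# line by line. It covers only the simple patterns in the list above.
sweave_to_rmd_chunks <- function(lines) {
  # Chunk header: <<label, opts>>=  becomes  ```{r label, opts}
  lines <- sub("^<<(.*)>>=\\s*$", "```{r \\1}", lines, perl = TRUE)
  # Tidy the header produced by an empty <<>>=
  lines <- sub("^```\\{r \\}$", "```{r}", lines, perl = TRUE)
  # Chunk terminator: a line holding only @ becomes a closing fence
  lines <- sub("^@\\s*$", "```", lines, perl = TRUE)
  # Inline code: \Sexpr{expr} becomes `r expr` (assumes no nested braces)
  lines <- gsub("\\\\Sexpr\\{([^}]*)\\}", "`r \\1`", lines, perl = TRUE)
  # Chunk options are only rewritten on converted header lines
  hdr <- grepl("^```\\{r", lines, perl = TRUE)
  lines[hdr] <- gsub("results\\s*=\\s*tex\\b", "results='asis'", lines[hdr], perl = TRUE)
  lines[hdr] <- gsub("=\\s*true\\b", "=TRUE", lines[hdr], perl = TRUE)
  lines[hdr] <- gsub("=\\s*false\\b", "=FALSE", lines[hdr], perl = TRUE)
  lines
}

# Example input, invented for illustration
sweave_lines <- c("<<summary, echo=false>>=", "summary(x)", "@",
                  "The mean is \\Sexpr{mean(x)}.")
rmd_lines <- sweave_to_rmd_chunks(sweave_lines)
writeLines(rmd_lines)
```

A line-by-line approach like this will not catch every case (for example, other string-valued options that need quoting), so the result still needs a manual check.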
Figures and Tables
- Floats: Remove figure and table floats (e.g., `\begin{table}...\end{table}`, `\begin{figure}...\end{figure}`). In R Markdown and HTML there are no pages, so content is simply placed where it occurs in the document.
- Figure captions: Extract the content from within the `\caption{}` command. When using R Markdown, it is often easiest to add captions to the plot itself (e.g., using the `main` argument in base graphics).
- Table captions: Extract the content from within the `\caption{}` command. Table captions can be supplied through the `caption` argument to the `xtable` function (e.g., `print(xtable(MY_DATA_FRAME, caption="MY CAPTION"), type="html", caption.placement="top")`). Caption placement defaults to `"bottom"` of the table but can optionally be set to `"top"`, either as a global option or in `print.xtable`. Alternatively, table titles can just be included as Markdown text.
- References: Delete table and figure labels (e.g., `\label{...}`). Replace table and figure references (e.g., `\ref{...}`) with actual numbers or other descriptive terminology. It would also be possible to implement something simple in R that stored table and figure numbers (e.g., initialise table and figure counters at the start of the document; increment the table counter each time a table is created, and likewise for figures; store the value of the counter in a variable; include the variable in caption text using `paste()` or something similar). The counter can then be included in the text using inline R code chunks.
- Table content: Markdown supports HTML, so one option is to convert LaTeX tables to HTML tables using a function like `print(xtable(MY_DATA_FRAME), type="html")`. This is combined with the `results='asis'` R code chunk option.
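The counter idea for table and figure numbers can be sketched with a closure in base R. This is only one possible approach. The names `make_counter`, `next_table`, `next_figure`, and `tab_medals` are hypothetical and chosen for illustration.

```r
# Hypothetical helper: returns a function that increments and returns
# its own private counter each time it is called.
make_counter <- function() {
  n <- 0
  function() {
    n <<- n + 1
    n
  }
}

# Initialise one counter for tables and one for figures at the start
# of the document.
next_table  <- make_counter()
next_figure <- make_counter()

# Call once per table, store the number, and reuse it in the caption.
tab_medals <- next_table()
caption_medals <- paste0("Table ", tab_medals,
                         ": Rankings of Medals Won by Country")
caption_medals
```

The stored number can then be referenced in the text with an inline chunk such as `` `r tab_medals` ``, and the caption string passed to the `caption` argument of `xtable`.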
Basic formatting
- Headings: If we assume `section` is the top level, then `\section{...}` becomes `# ...`, `\subsection{...}` becomes `## ...`, and `\subsubsection{...}` becomes `### ...`.
- Mathematics: Update LaTeX mathematics to `$latex ... $` and `$$latex ... $$` notation if using RStudio.
- Paragraph delimiters: If using RStudio, remove single line breaks that were not intended to be paragraph breaks.
- Hyperlinks: Convert LaTeX hyperlinks from `\href` or `\url` to `[text](url)` format.
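The heading and hyperlink conversions can also be expressed as regular-expression replacements. The sketch below is base R and assumes each heading command occupies a whole line and that link arguments contain no nested braces. The function name `latex_to_md_basic` and the example URL are hypothetical. Mathematics and paragraph breaks are not handled.

```r
# Hypothetical helper covering only the simple, single-line cases
# from the list above.
latex_to_md_basic <- function(lines) {
  # Headings, assuming \section is the top level
  lines <- sub("^\\\\section\\{(.*)\\}\\s*$", "# \\1", lines, perl = TRUE)
  lines <- sub("^\\\\subsection\\{(.*)\\}\\s*$", "## \\1", lines, perl = TRUE)
  lines <- sub("^\\\\subsubsection\\{(.*)\\}\\s*$", "### \\1", lines, perl = TRUE)
  # \href{url}{text} becomes [text](url)
  lines <- gsub("\\\\href\\{([^}]*)\\}\\{([^}]*)\\}", "[\\2](\\1)", lines, perl = TRUE)
  # \url{url} becomes [url](url)
  lines <- gsub("\\\\url\\{([^}]*)\\}", "[\\1](\\1)", lines, perl = TRUE)
  lines
}

# Example input, invented for illustration
md_lines <- latex_to_md_basic(c("\\subsection{Import Dataset}",
                                "See \\href{http://example.com}{the data}."))
writeLines(md_lines)
```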
LaTeX things
- Comments: Remove any LaTeX comments or switch from `% comment` to `<!-- comment -->`.
- LaTeX escaped characters: Remove unnecessary escape characters (e.g., `\%` is just `%`).
- R Markdown escaped characters: Writing about the R Markdown language in R Markdown sometimes requires the use of HTML codes for special characters such as backticks (`` ` ``) and backslashes (`\`) to prevent the text from being interpreted; see here for a list of HTML character codes.
- Header: Remove the LaTeX header information up to and including `\begin{document}`; extract and incorporate any relevant content such as title, abstract, author, date, etc.
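As a rough illustration of the clean-up steps above, the following base R sketch drops the preamble, converts whole-line comments, and unescapes `\%` and `\&`. It assumes comments occupy their own lines, and treating `\&` the same way as `\%` is my assumption rather than something covered above. The function names `strip_preamble` and `latex_cleanup` are hypothetical. Title, author, and similar metadata still have to be copied out of the preamble by hand.

```r
# Hypothetical helper: drop everything up to and including \begin{document}.
strip_preamble <- function(lines) {
  i <- grep("\\begin{document}", lines, fixed = TRUE)
  if (length(i) == 0) return(lines)
  lines[-seq_len(i[1])]
}

# Hypothetical helper: convert comments first, then unescape, so that a
# line starting with \% is not mistaken for a comment.
latex_cleanup <- function(lines) {
  # Whole-line comments: % comment becomes <!-- comment -->
  lines <- sub("^\\s*%\\s?(.*)$", "<!-- \\1 -->", lines, perl = TRUE)
  # \% and \& need no escaping in Markdown
  lines <- gsub("\\\\([%&])", "\\1", lines, perl = TRUE)
  lines
}

# Example input, invented for illustration
tex_lines <- c("\\documentclass{article}", "\\begin{document}",
               "Hello", "% note", "50\\% done")
cleaned <- latex_cleanup(strip_preamble(tex_lines))
writeLines(cleaned)
```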
R Markdown Analysis of Winter Olympic Medal Data
The following shows the output of the actual analysis after running the rmd source through Knit HTML in RStudio. If you're curious, you may wish to view the rmd source code on GitHub side by side with this post at this point.
Import Dataset
library(xtable)
options(stringsAsFactors = FALSE)
medals <- read.csv("data/medals.csv")
medals$Year <- as.numeric(medals$Year)
medals <- medals[!is.na(medals$Year), ]
The Olympic Medals data frame includes 2311 medals from 1924 to 2006. The data was sourced from The Guardian Data Blog.
Total Medals by Year
# http://www.math.mcmaster.ca/~bolker/emdbook/chap3A.pdf
x <- aggregate(medals$Year, list(Year = medals$Year), length)
names(x) <- c("year", "medals")
x$pos <- seq(x$year)
fit <- nls(medals ~ a * pos^b + c, x, start = list(a = 10, b = 1,
c = 50))
In general, the number of Winter Olympic medals awarded has increased over the years. To model this relationship, year was converted to ordinal position. A three-parameter power function seemed plausible, \( y = ax^b + c \), where \( y \) is total medals awarded and \( x \) is the ordinal position of the Olympics starting at one. The best fitting parameters by least squares gave
\[ y = 0.202 x^{2.297} + 50.987. \]
The figure displays the data and the line of best fit for the model. The model predicts that 2010, 2014, and 2018 would have 271, 295, and 322 medals respectively.
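These predictions can be roughly reproduced by hand from the model \( y = ax^b + c \) with the reported parameters. The sketch below uses the coefficients as rounded in the text, so a result may differ by about one medal from the figures above, which came from the unrounded fit. The function name `power_fit` is hypothetical.

```r
# Coefficients as rounded in the text; the fitted values carry more digits.
power_fit <- function(pos, a = 0.202, b = 2.297, c = 50.987) {
  a * pos^b + c
}

# 2006 was the 20th Winter Olympics in the data, so 2010, 2014, and 2018
# correspond to ordinal positions 21, 22, and 23.
predicted <- power_fit(21:23)
round(predicted)
```

In the analysis itself the same values would come from `predict(fit, newdata = data.frame(pos = 21:23))`, which uses the full-precision estimates.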
# las is supplied once only; repeating a named argument can raise an error
plot(medals ~ pos, x, las = 1,
     ylab = "Total Medals Awarded",
     xlab = "Ordinal Position of Olympics",
     main="Total medals awarded
by ordinal position of Olympics with
predicted three parameter power function fit displayed.",
     bty="l")
lines(x$pos, predict(fit))
Gender Ratio by Year
medalsByYearByGender <- aggregate(medals$Year, list(Year = medals$Year,
Event.gender = medals$Event.gender), length)
medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender !=
"X", ]
propf <- list()
propf$prop <- medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "x"]/(medalsByYearByGender[medalsByYearByGender$Event.gender == "W",
"x"] + medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
propf$year <- medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "Year"]
propf$propF <- format(round(propf$prop, 2))
propf$table <- with(propf, cbind(year, propF))
colnames(propf$table) <- c("Year", "Prop. Female")
The figure shows the number of medals won by males and females by year, and the table shows the proportion of medals awarded to females by year. The pattern is generally similar for males and females: medals increase gradually until around the late 1980s, after which the rate of increase accelerates. However, females started from a much smaller base. Thus, both the absolute difference and the percentage difference have decreased over time, to the point where in 2006 46% of medals were won by females.
plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender ==
"M", ], ylim = c(0, max(x)), pch = "m", col = "blue", las = 1, ylab = "Total Medals Awarded",
bty = "l", main = "Total Medals Won by Gender and Year")
points(medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "Year"], medalsByYearByGender[medalsByYearByGender$Event.gender ==
"W", "x"], col = "red", pch = "f")
print(xtable(propf$table,
caption="Proportion of Medals that were awarded to Females by Year"),
type="html",
caption.placement="top",
html.table.attributes='align="center"')
| | Year | Prop. Female |
---|---|---|
1 | 1924 | 0.07 |
2 | 1928 | 0.08 |
3 | 1932 | 0.08 |
4 | 1936 | 0.12 |
5 | 1948 | 0.18 |
6 | 1952 | 0.23 |
7 | 1956 | 0.26 |
8 | 1960 | 0.38 |
9 | 1964 | 0.37 |
10 | 1968 | 0.37 |
11 | 1972 | 0.36 |
12 | 1976 | 0.35 |
13 | 1980 | 0.34 |
14 | 1984 | 0.36 |
15 | 1988 | 0.37 |
16 | 1992 | 0.43 |
17 | 1994 | 0.43 |
18 | 1998 | 0.44 |
19 | 2002 | 0.45 |
20 | 2006 | 0.46 |
Countries with the Most Medals
cmm <- list()
cmm$medals <- sort(table(medals$NOC), decreasing = TRUE)
cmm$country <- names(cmm$medals)
cmm$prop <- cmm$medals/sum(cmm$medals)
cmm$propF <- paste(round(cmm$prop * 100, 2), "%", sep = "")
cmm$row1 <- c("Rank", "Country", "Total", "%")
cmm$rank <- seq(cmm$medals)
cmm$include <- 1:10
cmm$table <- with(cmm, rbind(cbind(rank[include], country[include],
medals[include], propF[include])))
colnames(cmm$table) <- cmm$row1
Norway has won the most medals with 280 (12.12%). The table shows the top 10. Russia, USSR, and EUN (Unified Team in 1992 Olympics) have a combined total of 293. Germany, GDR, and FRG have a combined medal total of 309.
print(xtable(cmm$table, caption="Rankings of Medals Won by Country"),
"html", include.rownames=FALSE, caption.placement='top',
html.table.attributes='align="center"')
Rank | Country | Total | % |
---|---|---|---|
1 | NOR | 280 | 12.12% |
2 | USA | 216 | 9.35% |
3 | URS | 194 | 8.39% |
4 | AUT | 185 | 8.01% |
5 | GER | 158 | 6.84% |
6 | FIN | 151 | 6.53% |
7 | CAN | 119 | 5.15% |
8 | SUI | 118 | 5.11% |
9 | SWE | 118 | 5.11% |
10 | GDR | 110 | 4.76% |
Proportion of Gold Medals by Country
Looking only at countries that have won more than 50 medals in the dataset, the figure shows the proportion of medals won that were gold, silver, or bronze.
NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal),
margin = 1)
medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"],
decreasing = TRUE), c("Gold", "Silver", "Bronze")]
barplot(round(t(medalsByMedalByNOC), 2), horiz = TRUE, las = 1,
col=c("gold", "grey71", "chocolate4"),
xlab = "Proportion of Medals",
main="Proportion of medals won that were gold, silver or bronze.")
How many different countries have won medals by year?
listOfYears <- unique(medals$Year)
names(listOfYears) <- unique(medals$Year)
totalNocByYear <- sapply(listOfYears, function(X) length(table(medals[medals$Year ==
X, "NOC"])))
The figure shows the total number of countries winning medals by year.
plot(x = names(totalNocByYear), totalNocByYear, ylim = c(0, max(totalNocByYear)),
las = 1, xlab = "Year", main = "Total Number of Countries Winning Medals By Year",
ylab = "Total Number of Countries", bty = "l")
Australia at the Winter Olympics
ausmedals <- list()
ausmedals$data <- medals[medals$NOC == "AUS", ]
ausmedals$data <- ausmedals$data[, c("Year", "City", "Discipline",
"Event", "Medal")]
ausmedals$table <- ausmedals$data
Given that I am an Australian, I decided to have a look at the Australian medal count. Australia does not get a lot of snow. Up to and including 2006, Australia has won 6 medals. It won its first medal in 1994. Of the 6 medals, 3 were bronze, 0 were silver, and 3 were gold. The table lists each of these medals.
print(xtable(ausmedals$table,
caption='List of Australian Medals',
digits=0),
type='html',
caption.placement='top',
include.rownames=FALSE,
html.table.attributes='align="center"')
Year | City | Discipline | Event | Medal |
---|---|---|---|---|
1994 | Lillehammer | Short Track S. | 5000m relay | Bronze |
1998 | Nagano | Alpine Skiing | slalom | Bronze |
2002 | Salt Lake City | Short Track S. | 1000m | Gold |
2002 | Salt Lake City | Freestyle Ski. | aerials | Gold |
2006 | Turin | Freestyle Ski. | aerials | Bronze |
2006 | Turin | Freestyle Ski. | moguls | Gold |
Ice Hockey
icehockey <- medals[medals$Sport == "Ice Hockey" & medals$Event.gender ==
"M" & medals$Medal == "Gold", ]
icehockeyf <- medals[medals$Sport == "Ice Hockey" & medals$Event.gender ==
"W" & medals$Medal == "Gold", ]
# names(table(icehockey$NOC)[table(icehockey$NOC) > 1])
The following are some statistics about Winter Olympics Ice Hockey up to and including the 2006 Winter Olympics.
- Out of the 20 Winter Olympics that have been staged, men's ice hockey has been held in 20 and women's in 3.
- The USSR has won the most men's gold medals with 7 golds. It goes up to 8 if the 1992 Unified Team is included.
- Canada has the second most golds with 6.
- After that, the only two nations to win more than one gold are Sweden (2 golds) and the United States (2 golds).
- The table shows the countries that won gold and silver medals by year.
- In women's ice hockey, Canada has won 2 and the United States has won 1.
icehockeygs <- medals[medals$Sport == "Ice Hockey" &
medals$Event.gender == "M" &
medals$Medal %in% c("Silver", "Gold"), c("Year", "Medal", "NOC")]
icetab <- list()
icetab$data <- reshape(icehockeygs, idvar="Year", timevar="Medal",
direction="wide")
names(icetab$data) <- c("Year", "Gold", "Silver")
print(xtable(icetab$data,
caption ="Country Winning Gold and Silver Medals by Year in Mens Ice Hockey",
digits=0),
type="html",
include.rownames=FALSE,
caption.placement="top",
html.table.attributes='align="center"')
Year | Gold | Silver |
---|---|---|
1924 | CAN | USA |
1928 | CAN | SWE |
1932 | CAN | USA |
1936 | GBR | CAN |
1948 | CAN | TCH |
1952 | CAN | USA |
1956 | URS | USA |
1960 | USA | CAN |
1964 | URS | SWE |
1968 | URS | TCH |
1972 | URS | USA |
1976 | URS | TCH |
1980 | USA | URS |
1984 | URS | TCH |
1988 | URS | FIN |
1992 | EUN | CAN |
1994 | SWE | CAN |
1998 | CZE | RUS |
2002 | CAN | USA |
2006 | SWE | FIN |
Reflections on the Conversion Process
- Markdown versus LaTeX:
  - I prefer performing analyses with Markdown to performing them with LaTeX.
  - Markdown is easier to type than LaTeX.
  - Markdown is easier to read than LaTeX.
  - It is easier with Markdown to get started with analyses.
  - Many analyses are only presented on the screen, and as such page breaks in LaTeX are a nuisance. This extends to many features of LaTeX such as headers, figure and table placement, margins, and table formatting, particularly for long or wide tables.
  - That said, journal articles, books, and other artefacts that are bound to the model of a printed page are not going anywhere.
  - Furthermore, bibliographies, cross-references, elaborate control of table appearance, and more are all features that LaTeX makes easier than Markdown.
- R Markdown to Sweave LaTeX:
  - The more common conversion task that I can imagine is taking some simple analyses in R Markdown and having to convert them into knitr LaTeX in order to include the content in a journal article.
  - The first time I converted between the formats, it was good to do it in a relatively manual way to get a sense of all the required changes. However, if I had a large document or was repeating the task, I would look at more automated solutions using string replacement tools (e.g., sed, or even just replacement commands in a text editor such as Vim) and markup conversion tools (e.g., pandoc).
  - Perhaps if the formats get popular enough, developers will start to build dedicated conversion tools.
Additional Resources
Also see:
- This post on Getting Started with R Markdown, knitr, and RStudio 0.96
- This post for another Example Reproducible Report using R Markdown which analyses California Schools Test Data
- These Assorted posts using Sweave
- The knitr home page and knitr options page.
- The xtable LaTeX table gallery, which can also be used to generate HTML tables for inclusion in Markdown.