Jeromy Anglim's Blog: Psychology and Statistics


Thursday, February 18, 2010

Analysis of Winter Olympic Medal Data Using R

The Winter Olympics are on. The Guardian's DataBlog has graciously compiled a database on Winter Olympic Medals. Thus, I thought I'd run a few quick analyses on the data in R. In this post I was hoping to show how one could quickly churn out some basic analyses (and answer some interesting questions) using R.

First, a disclaimer: I ran these analyses in about 45 minutes. Thus, I make no claims of perfect accuracy or in the source data provided by the Guardian. The data also does not include 2010 medals.

Below you will see:
  1. The R Console input and output
  2. The plots
  3. The source code on its own


CONSOLE INPUT AND OUTPUT
> # tips on reading a Google Spreadsheet:
> # http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html
> # Data taken from:"https://spreadsheets.google.com/ccc?key=0AgdO92JOXxAOdDVlaUpkNlB2WERtV3l1ZVFYbzllQWc"
> # http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country
> 
> googleLink <- "http://spreadsheets.google.com/pub?key=tsddww6vOYePkhPSxRpDeYw&single=true&gid=1&output=csv"
> medals <- read.csv(googleLink, stringsAsFactors = FALSE)
> savePlot <- TRUE # optional variable used to save or not save plots in code
> 
> # remove rows that do not contain data
> medals$Year <- as.numeric(medals$Year)
Warning message:
NAs introduced by coercion 
> medals <- medals[!is.na(medals$Year), ]
> 
> 
> # Quick look at data
> head(medals)
  Year     City      Sport     Discipline NOC           Event Event.gender
1 1924 Chamonix    Skating Figure skating AUT      individual            M
2 1924 Chamonix    Skating Figure skating AUT      individual            W
3 1924 Chamonix    Skating Figure skating AUT           pairs            X
4 1924 Chamonix  Bobsleigh      Bobsleigh BEL        four-man            M
5 1924 Chamonix Ice Hockey     Ice Hockey CAN      ice hockey            M
6 1924 Chamonix   Biathlon       Biathlon FIN military patrol            M
   Medal
1 Silver
2   Gold
3   Gold
4 Bronze
5   Gold
6 Silver
> sapply(medals, function(x) cbind(sort(table(x), decreasing = TRUE)))
$Year
     [,1]
2006  252
2002  234
1998  205
1994  183
1992  171
1988  138
1984  117
1980  115
1976  111
1968  106
1972  105
1964  103
1960   81
1956   72
1948   68
1952   67
1936   51
1924   49
1932   42
1928   41

$City
                       [,1]
Turin                   252
Salt Lake City          234
Innsbruck               214
Nagano                  205
Lillehammer             183
Albertville             171
Lake Placid             157
Calgary                 138
Sarajevo                117
St. Moritz              109
Grenoble                106
Sapporo                 105
Squaw Valley             81
Cortina d'Ampezzo        72
Oslo                     67
Garmisch-Partenkirchen   51
Chamonix                 49

$Sport
           [,1]
Skiing     1060
Skating     758
Biathlon    162
Bobsleigh   133
Luge        108
Ice Hockey   69
Curling      21

$Discipline
                [,1]
Speed skating    455
Cross Country S  399
Alpine Skiing    367
Figure skating   207
Biathlon         162
Bobsleigh        115
Ski Jumping      114
Luge             108
Short Track S.    96
Nordic Combined   84
Ice Hockey        69
Freestyle Ski.    54
Snowboard         42
Curling           21
Skeleton          18

$NOC
    [,1]
NOR  280
USA  216
URS  194
AUT  185
GER  158
FIN  151
CAN  119
SUI  118
SWE  118
GDR  110
ITA  101
FRA   83
NED   78
RUS   76
FRG   41
CHN   33
JPN   32
KOR   31
TCH   25
EUN   23
GBR   21
EUA   19
CZE   10
LIE    9
POL    8
CRO    7
AUS    6
BLR    6
BUL    6
EST    6
HUN    6
BEL    5
KAZ    5
UKR    5
SLO    4
YUG    4
ESP    2
LUX    2
PRK    2
DEN    1
LAT    1
NZL    1
ROU    1
SVK    1
UZB    1

$Event
                                [,1]
individual                       195
500m                             133
1500m                            111
downhill                          97
slalom                            96
1000m                             94
giant slalom                      90
5000m                             78
singles                           72
ice hockey                        69
10km                              60
50km                              60
K90 individual (70m)              60
pairs                             60
10000m                            57
two-man                           57
four-man                          55
4x10km relay                      51
15km                              48
20km                              45
4x7.5km relay                     42
alpine combined                   42
3000m                             39
30km mass start                   39
doubles                           36
K120 individual (90m)             36
super-G                           36
5km                               30
moguls                            30
4x5km relay                       27
ice dancing                       27
aerials                           24
curling                           21
10km pursuit                      18
18km                              18
Half-pipe                         18
K120 team (90m)                   18
Team                              18
15km mass start                   15
3000m relay                       15
30km                              15
3x5km relay                       15
5000m relay                       15
7.5km                             15
Giant parallel slalom             12
Combined 10km + 15km pursuit       9
Combined 5km + 10km pursuit        9
12.5km pursuit                     6
Alpine combined                    6
giant-slalom                       6
Snowboard Cross                    6
Sprint 1,5km                       6
sprint 1.5km                       6
Team pursuit                       6
Team sprint                        6
12,5km mass start                  3
3x7.5km relay                      3
4x6km relay                        3
5km pursuit                        3
combined (4 events)                3
Combined 15 + 15km mass start      3
Combined 7.5 + 7.5km mass start    3
five-man                           3
Individual                         3
Individual sprint                  3
military patrol                    3
sprint                             3

$Event.gender
  [,1]
M 1386
W  802
X  123

$Medal
       [,1]
Gold    774
Silver  773
Bronze  764

> 
> 
> # How many medals have been awarded in each Olympics?
> medalsByYear <- aggregate(medals$Year, list(Year = medals$Year), length)
> if (savePlot == TRUE)  png("fig1.png")
> plot(x ~ Year, medalsByYear, ylim = c(0,max(x)), 
+     ylab = "Total Medals Awarded", bty="l",
+     main = "Total Medals Awarded in Winter Olympics by Year")
> if (savePlot == TRUE) dev.off()
windows 
      2 
> 
> # How has the amount of medals awarded to males and females changed over the years?
> # Get data.
> medalsByYearByGender <- aggregate(medals$Year, 
+     list(Year = medals$Year, Event.gender = medals$Event.gender), length)
> medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender != "X", ]
> 
> # Plot results.
> if (savePlot == TRUE)  png("fig2.png")
> plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ], 
+     ylim = c(0,max(x)), pch = "m", col = "blue", 
+     ylab = "Total Medals Awarded", bty="l",
+     main = "Total Medals Awarded in Winter Olympics\n by Gender and by Year")
> points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
+     medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"],
+     col = "red", pch = "f")
> if (savePlot == TRUE) dev.off()
windows 
      2 
> 
> # Table of proportion female
> propFemalePerYear <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / (
+       medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] +
+       medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
> propFemalePerYear <- round(propFemalePerYear, 2)
> cbind(Year = medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
+     PropFemale = propFemalePerYear)
      Year PropFemale
 [1,] 1924       0.07
 [2,] 1928       0.08
 [3,] 1932       0.08
 [4,] 1936       0.12
 [5,] 1948       0.18
 [6,] 1952       0.23
 [7,] 1956       0.26
 [8,] 1960       0.38
 [9,] 1964       0.37
[10,] 1968       0.37
[11,] 1972       0.36
[12,] 1976       0.35
[13,] 1980       0.34
[14,] 1984       0.36
[15,] 1988       0.37
[16,] 1992       0.43
[17,] 1994       0.43
[18,] 1998       0.44
[19,] 2002       0.45
[20,] 2006       0.46
>         
> 
> # Which countries have won the most medals?
> sort(table(medals$NOC), dec = TRUE)

NOR USA URS AUT GER FIN CAN SUI SWE GDR ITA FRA NED RUS FRG CHN JPN KOR TCH EUN 
280 216 194 185 158 151 119 118 118 110 101  83  78  76  41  33  32  31  25  23 
GBR EUA CZE LIE POL CRO AUS BLR BUL EST HUN BEL KAZ UKR SLO YUG ESP LUX PRK DEN 
 21  19  10   9   8   7   6   6   6   6   6   5   5   5   4   4   2   2   2   1 
LAT NZL ROU SVK UZB 
  1   1   1   1   1 
> 
> 
> # Of the countries that have won more than 50 medals,
> # which have the highest percentage of gold medals?
> NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
> medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
> medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), margin = 1)
> medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"], 
+         decreasing = TRUE), c("Gold", "Silver", "Bronze")]
> round(medalsByMedalByNOC, 2)
     
      Gold Silver Bronze
  RUS 0.43   0.32   0.25
  URS 0.40   0.29   0.30
  GER 0.37   0.37   0.26
  SWE 0.36   0.26   0.37
  USA 0.36   0.37   0.27
  ITA 0.36   0.31   0.34
  GDR 0.35   0.33   0.32
  NOR 0.35   0.35   0.30
  SUI 0.32   0.31   0.36
  NED 0.32   0.38   0.29
  CAN 0.32   0.32   0.36
  FRA 0.30   0.29   0.41
  AUT 0.28   0.35   0.38
  FIN 0.27   0.38   0.34
> 
> 
> # How many different countries have won medals by year?
> listOfYears <- unique(medals$Year)
> names(listOfYears) <- unique(medals$Year)
> totalNocByYear <- sapply(listOfYears,  function(X) 
+       length(table(medals[medals$Year == X, "NOC"])))
> 
> # Table
> totalNocByYear  
1924 1928 1932 1936 1948 1952 1956 1960 1964 1968 1972 1976 1980 1984 1988 1992 
  10   12   10   11   13   14   13   14   14   15   17   16   19   17   17   20 
1994 1998 2002 2006 
  22   24   24   26 
> 
> # Plot
> if (savePlot == TRUE)  png("fig3.png")
> plot(x= names(totalNocByYear), totalNocByYear, 
+     ylim = c(0, max(totalNocByYear)),
+     xlab = "Year",
+     ylab = "Total Number of Countries",
+     bty = "l", 
+     main = "Total Number of Countries\n Winning Medals By Year")
> if (savePlot == TRUE) dev.off()
windows 
      2 
> 
> # Which Countries have won a medal at every Olympics? 
> propYearsOnePlusMedals <- apply(table(medals$NOC, medals$Year) > 0, 1, mean)
> 
> #Answer
> names(propYearsOnePlusMedals[propYearsOnePlusMedals == 1.0]) 
[1] "AUT" "CAN" "FIN" "NOR" "SWE" "USA"
> 
> # Table Sorted by Proportion of Olympics with a Medal
> cbind(sort(propYearsOnePlusMedals, decreasing = TRUE)) 
    [,1]
AUT 1.00
CAN 1.00
FIN 1.00
NOR 1.00
SWE 1.00
USA 1.00
FRA 0.95
SUI 0.95
ITA 0.80
GBR 0.65
NED 0.65
TCH 0.55
JPN 0.50
GER 0.45
URS 0.45
FRG 0.35
GDR 0.30
HUN 0.30
CHN 0.25
KOR 0.25
POL 0.25
AUS 0.20
BEL 0.20
BLR 0.20
BUL 0.20
LIE 0.20
RUS 0.20
CZE 0.15
EUA 0.15
UKR 0.15
CRO 0.10
ESP 0.10
EST 0.10
KAZ 0.10
PRK 0.10
SLO 0.10
YUG 0.10
DEN 0.05
EUN 0.05
LAT 0.05
LUX 0.05
NZL 0.05
ROU 0.05
SVK 0.05
UZB 0.05


THE PLOTS






THE R SOURCE CODE
# tips on reading a Google Spreadsheet:
# http://blog.revolution-computing.com/2009/09/how-to-use-a-google-spreadsheet-as-data-in-r.html
# Data taken from:"https://spreadsheets.google.com/ccc?key=0AgdO92JOXxAOdDVlaUpkNlB2WERtV3l1ZVFYbzllQWc"
# http://www.guardian.co.uk/news/datablog/2010/feb/11/winter-olympics-medals-by-country

googleLink <- "http://spreadsheets.google.com/pub?key=tsddww6vOYePkhPSxRpDeYw&single=true&gid=1&output=csv"
medals <- read.csv(googleLink, stringsAsFactors = FALSE)
savePlot <- TRUE # optional variable used to save or not save plots in code

# remove rows that do not contain data
medals$Year <- as.numeric(medals$Year)
medals <- medals[!is.na(medals$Year), ]


# Quick look at data
head(medals)
sapply(medals, function(x) cbind(sort(table(x), decreasing = TRUE)))


# How many medals have been awarded in each Olympics?
medalsByYear <- aggregate(medals$Year, list(Year = medals$Year), length)
if (savePlot == TRUE)  png("fig1.png")
plot(x ~ Year, medalsByYear, ylim = c(0,max(x)), 
    ylab = "Total Medals Awarded", bty="l",
    main = "Total Medals Awarded in Winter Olympics by Year")
if (savePlot == TRUE) dev.off()

# How has the amount of medals awarded to males and females changed over the years?
# Get data.
medalsByYearByGender <- aggregate(medals$Year, 
    list(Year = medals$Year, Event.gender = medals$Event.gender), length)
medalsByYearByGender <- medalsByYearByGender[medalsByYearByGender$Event.gender != "X", ]

# Plot results.
if (savePlot == TRUE)  png("fig2.png")
plot(x ~ Year, medalsByYearByGender[medalsByYearByGender$Event.gender == "M", ], 
    ylim = c(0,max(x)), pch = "m", col = "blue", 
    ylab = "Total Medals Awarded", bty="l",
    main = "Total Medals Awarded in Winter Olympics\n by Gender and by Year")
points(medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
    medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"],
    col = "red", pch = "f")
if (savePlot == TRUE) dev.off()

# Table of proportion female
propFemalePerYear <- medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] / (
      medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "x"] +
      medalsByYearByGender[medalsByYearByGender$Event.gender == "M", "x"])
propFemalePerYear <- round(propFemalePerYear, 2)
cbind(Year = medalsByYearByGender[medalsByYearByGender$Event.gender == "W", "Year"],
    PropFemale = propFemalePerYear)
        

# Which countries have won the most medals?
sort(table(medals$NOC), dec = TRUE)


# Of the countries that have won more than 50 medals,
# which have the highest percentage of gold medals?
NOC50Plus <- names(table(medals$NOC)[table(medals$NOC) > 50])
medalsSubset <- medals[medals$NOC %in% NOC50Plus, ]
medalsByMedalByNOC <- prop.table(table(medalsSubset$NOC, medalsSubset$Medal), margin = 1)
medalsByMedalByNOC <- medalsByMedalByNOC[order(medalsByMedalByNOC[, "Gold"], 
        decreasing = TRUE), c("Gold", "Silver", "Bronze")]
round(medalsByMedalByNOC, 2)


# How many different countries have won medals by year?
listOfYears <- unique(medals$Year)
names(listOfYears) <- unique(medals$Year)
totalNocByYear <- sapply(listOfYears,  function(X) 
      length(table(medals[medals$Year == X, "NOC"])))

# Table
totalNocByYear  

# Plot
if (savePlot == TRUE)  png("fig3.png")
plot(x= names(totalNocByYear), totalNocByYear, 
    ylim = c(0, max(totalNocByYear)),
    xlab = "Year",
    ylab = "Total Number of Countries",
    bty = "l", 
    main = "Total Number of Countries\n Winning Medals By Year")
if (savePlot == TRUE) dev.off()

# Which Countries have won a medal at every Olympics? 
propYearsOnePlusMedals <- apply(table(medals$NOC, medals$Year) > 0, 1, mean)

#Answer
names(propYearsOnePlusMedals[propYearsOnePlusMedals == 1.0]) 

# Table Sorted by Proportion of Olympics with a Medal
cbind(sort(propYearsOnePlusMedals, decreasing = TRUE)) 

2 comments:

  1. Hi,
    Thank you for all of these and for the videos!
    the input data from Google is no more available.
    Could you please put also the data in a txt format or something. Thank you!
    Marius

    ReplyDelete
  2. @Anonymous
    I have a github project that you can check out here:

    https://github.com/jeromyanglim/Sweave_Winter_Olympics

    The data file is here:

    https://github.com/jeromyanglim/Sweave_Winter_Olympics/blob/master/data/medals.csv

    ReplyDelete