Wednesday, March 13, 2013

Using maps and ggplot2 to visualize college hockey championships

Short:
I plot the frequency of college hockey championships by state using the maps and ggplot2 packages.

Note: this example is based heavily on the example provided at
http://www.dataincolour.com/2011/07/maps-with-ggplot2/

Question of interest
As a good Minnesotan, I've believed for quite some time that the colder, Northern states enjoy a competitive advantage when it comes to college hockey. Does this advantage exist? How strong is it?

I first downloaded data from Wikipedia on past winners of hockey championships and saved the short list as a CSV file.

After saving the file, here's how the data look in R:

``````# Visualizing College Hockey Champions by State

# Author: Mark T Patterson Date: March 13, 2013

# Libraries:
library(ggplot2)
library(maps)

# Changing the working directory:
rm(list = ls())  # Clearing the workspace
setwd("C:/Users/Mark/Desktop/Blog/Data")

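# Loading the data. The read step is missing from the original listing;
# "champs.csv" is a placeholder file name, not the author's.
dat.state = read.csv("champs.csv", stringsAsFactors = FALSE)

dat.state$state = tolower(dat.state$state)  # lower case, to match the map data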
``````
``````##           state titles
## 1      michigan     19
## 2 massachusetts     11
## 4  north dakota      7
## 5     minnesota      6
## 6     wisconsin      6
``````

Now that we've loaded the information about hockey championships by state, we just need to load the mapping data. map_data("state") is a ggplot2 function call that returns the state outlines from the maps package as a data frame. Here, we'll use the region column, which lists state names in lower case, to match our state championship data.

``````# Creating mapping dataframe:
us.state = map_data("state")
``````
``````##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>
``````
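
Before merging, it can help to confirm that every state name in the championship data has a counterpart in the region column. The following is a minimal sketch, not part of the original analysis: dat.state is rebuilt here from only the five rows visible in the excerpt above, and the name unmatched is introduced purely for illustration.

```r
library(ggplot2)  # provides map_data()
library(maps)     # state outlines that map_data("state") reads

# Assumption: only the five rows visible in the excerpt, already lower-cased.
dat.state <- data.frame(
  state  = c("michigan", "massachusetts", "north dakota", "minnesota", "wisconsin"),
  titles = c(19, 11, 7, 6, 6),
  stringsAsFactors = FALSE
)

us.state <- map_data("state")

# State names in the championship data that have no matching map region.
# Any name listed here would come out of the merge without coordinates.
unmatched <- setdiff(dat.state$state, unique(us.state$region))
print(unmatched)
```

An empty result means every state in the championship data will pick up its outline in the merge; a non-empty one usually points to a spelling or capitalization difference.
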
``````
# Merging the two datasets:

dat.champs = merge(us.state, dat.state, by.x = "region", by.y = "state",
all = TRUE)

dat.champs <- dat.champs[order(dat.champs$order), ]
# mapping requires the same order of observations that appear in us.state

``````
``````##    region   long   lat group order subregion titles
## 1 alabama -87.46 30.39     1     1      <NA>     NA
## 2 alabama -87.48 30.37     1     2      <NA>     NA
## 3 alabama -87.53 30.37     1     3      <NA>     NA
## 4 alabama -87.53 30.33     1     4      <NA>     NA
## 5 alabama -87.57 30.33     1     5      <NA>     NA
## 6 alabama -87.59 30.33     1     6      <NA>     NA
``````
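
A quick sanity check on the merged frame can catch the two things most likely to go wrong here: rows gained or lost in the merge, and polygon vertices left out of drawing order. This is only a sketch under the assumption that dat.state holds the five rows visible in the excerpt above; same.rows, same.order and with.titles are illustrative names.

```r
library(ggplot2)
library(maps)

# Assumption: only the five rows visible in the excerpt, already lower-cased.
dat.state <- data.frame(
  state  = c("michigan", "massachusetts", "north dakota", "minnesota", "wisconsin"),
  titles = c(19, 11, 7, 6, 6),
  stringsAsFactors = FALSE
)
us.state <- map_data("state")

dat.champs <- merge(us.state, dat.state, by.x = "region", by.y = "state",
                    all = TRUE)
dat.champs <- dat.champs[order(dat.champs$order), ]

# With all = TRUE, a state name with no map match would add an extra row.
same.rows <- nrow(dat.champs) == nrow(us.state)

# After re-sorting, the vertices should be back in their original drawing order.
same.order <- same.rows && all(dat.champs$order == us.state$order)

# Regions that actually carry a title count (all others are NA).
with.titles <- unique(dat.champs$region[!is.na(dat.champs$titles)])

print(c(same.rows = same.rows, same.order = same.order))
print(with.titles)
```
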

With the dat.champs frame created, we're ready to plot:

``````# Plotting

# opts() has been removed from ggplot2; labs() and theme() now do its job
ggplot(dat.champs, aes(x = long, y = lat, group = group, fill = titles)) +
  geom_polygon() +
  theme_bw() +
  labs(x = "", y = "", fill = "", title = "College Hockey Championships By State") +
  scale_fill_gradient(low = "#EEEEEE", high = "darkgreen") +
  theme(legend.position = "bottom", legend.direction = "horizontal")
``````

Having plotted the data, it's easy to see the effect of the Great Lakes region on hockey championships. With the exception of Colorado, only colder, Northern states have won titles.

Ways to improve this analysis
While we observe that college title champions are clustered in the upper Midwest and the Northeast, several variables could explain the distribution. We might consider examining 1) state temperature (we might expect that colder temperatures lead to better performance, since teams in colder states get to practice more), 2) distance from the Great Lakes (this might be a proxy for the availability of ice), and 3) distance from Canadian hockey cities (it's possible that hockey culture follows from Canadian or European immigration).
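
As a first, very rough step toward the temperature question, one could put each state's latitude next to its title count. The sketch below is an assumption-laden illustration rather than a finished analysis: it uses the mean latitude of the outline vertices from map_data("state") as a crude stand-in for "how far north" (not a true centroid and not a temperature measurement), and dat.state contains only the five rows visible in the excerpt above. The names state.lat and lat.titles are made up for this example.

```r
library(ggplot2)
library(maps)

# Assumption: only the five rows visible in the excerpt, already lower-cased.
dat.state <- data.frame(
  state  = c("michigan", "massachusetts", "north dakota", "minnesota", "wisconsin"),
  titles = c(19, 11, 7, 6, 6),
  stringsAsFactors = FALSE
)
us.state <- map_data("state")

# Mean latitude of each state's outline vertices: a crude proxy for northerliness.
state.lat <- aggregate(lat ~ region, data = us.state, FUN = mean)

# Keep every mapped state; states absent from dat.state get NA titles.
lat.titles <- merge(state.lat, dat.state, by.x = "region", by.y = "state",
                    all.x = TRUE)

# Most northerly states first.
lat.titles <- lat.titles[order(-lat.titles$lat), ]
head(lat.titles)
```

With the full championship list in place of the five-row excerpt, this table could feed a scatter plot of titles against latitude, or be swapped for real temperature data.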

Beyond examining these possible factors, it'd be interesting to try other color presentations. I've adopted the same color scheme presented at http://www.dataincolour.com/2011/07/maps-with-ggplot2/ , but it would be good to have some familiarity with other schemes.
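
For anyone who wants to experiment, here is a minimal sketch of two alternative fill scales that ship with ggplot2: a sequential ColorBrewer palette via scale_fill_distiller() and the viridis scale via scale_fill_viridis_c(). It assumes a reasonably recent ggplot2 and, to stay self-contained, rebuilds dat.state from only the five rows visible in the excerpt above; base.map, map.brewer and map.viridis are illustrative names.

```r
library(ggplot2)
library(maps)

# Assumption: only the five rows visible in the excerpt, already lower-cased.
dat.state <- data.frame(
  state  = c("michigan", "massachusetts", "north dakota", "minnesota", "wisconsin"),
  titles = c(19, 11, 7, 6, 6),
  stringsAsFactors = FALSE
)
us.state <- map_data("state")
dat.champs <- merge(us.state, dat.state, by.x = "region", by.y = "state",
                    all = TRUE)
dat.champs <- dat.champs[order(dat.champs$order), ]

# Everything except the fill scale, so that scales can be swapped freely.
base.map <- ggplot(dat.champs,
                   aes(x = long, y = lat, group = group, fill = titles)) +
  geom_polygon() +
  theme_bw() +
  labs(x = "", y = "", fill = "",
       title = "College Hockey Championships By State") +
  theme(legend.position = "bottom", legend.direction = "horizontal")

# Option 1: sequential ColorBrewer palette; direction = 1 makes larger counts darker.
map.brewer <- base.map +
  scale_fill_distiller(palette = "Blues", direction = 1, na.value = "grey90")

# Option 2: viridis, which stays readable in greyscale.
map.viridis <- base.map +
  scale_fill_viridis_c(na.value = "grey90")

print(map.brewer)
```

Setting na.value explicitly controls how states without a title are drawn, which otherwise default to a mid grey.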