Tuesday, March 19, 2013

Behavioral Economics and Beer... highly correlated

Short:
I plot the frequency of Wikipedia searches for “Behavioral Economics” and “Beer” – who knew the correlation would be 0.7!

Data reference:
Daily search frequencies for any Wikipedia page (back to 2007) are available at http://glimmer.rstudio.com/pssguy/wikiSearchRates/. The site lets you download hits per day as a CSV, which is what I've done here.

# Behavioral Economics and Beer:

# Author: Mark T Patterson Date: March 18, 2013

# Clear Workbench:
rm(list = ls())

# libraries:
library(lubridate)
library(ggplot2)
## Find out what's changed in ggplot2 with
## news(Version == "0.9.1", package = "ggplot2")

# data:
curr.wd = getwd()
setwd("C:/Users/Mark/Desktop/Blog/Data")
ts = read.csv("BehavEconBeer.csv", header = TRUE)
setwd(curr.wd)

# cleaning the dataset:
# str(ts)  # inspect the structure first
ts$date = as.character(ts$date)
ts$date = mdy(ts$date)
## Using date format %m/%d/%Y.
ts = ts[, -1]

Note: the mdy function is in the lubridate package, which cleanly handles time/date data. I've also dropped the first column, which just gives row names inherited from Excel.
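To see what mdy is doing, here's a minimal illustration (toy date, not the actual data):

```r
library(lubridate)

# mdy() parses month/day/year strings into Date objects:
d <- mdy("3/18/2013")
d  # "2013-03-18"

# Once parsed, lubridate accessors work directly:
wday(d, label = TRUE)  # day of the week
```

This is handy below, since day-of-week effects turn out to drive much of the pattern.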

p = ggplot(ts, aes(x = date, y = count)) + geom_line(aes(color = factor(name)), 
    size = 2)
p

[Plot: daily Wikipedia search counts for “Behavioral Economics” and “Beer”]

It turns out the pattern we observe isn't at all unique – many variables follow (predictable) patterns of variation through the week. This doesn't necessarily mean, though, that the correlation between beer and behavioral economics is entirely spurious!
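This is easy to demonstrate on made-up data. The sketch below (toy numbers, not the actual search counts; the long format with name/date/count columns mirrors the plot code above) builds two series that share only a weekly cycle, and shows the sizable correlation that cycle induces on its own:

```r
set.seed(1)

# Two toy series sharing a day-of-week cycle, in the same long
# format (name, date, count) used in the plot above:
dates  <- seq(as.Date("2013-01-01"), by = "day", length.out = 70)
weekly <- rep(c(5, 6, 7, 8, 9, 4, 3), 10)  # shared weekly cycle
toy <- data.frame(
  name  = rep(c("Behavioral Economics", "Beer"), each = 70),
  date  = rep(dates, 2),
  count = c(weekly + rnorm(70), 3 * weekly + rnorm(70))
)

# Reshape to wide and correlate the two series:
wide <- reshape(toy, idvar = "date", timevar = "name", direction = "wide")
cor(wide[, 2], wide[, 3])  # large, driven entirely by the shared cycle
</imports>
```

Neither series "causes" the other here; the weekly rhythm alone produces the correlation.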

Wednesday, March 13, 2013

Using maps and ggplot2 to visualize college hockey championships

Short:
I plot the frequency of college hockey championships by state using the maps package and ggplot2.

Note: this example is based heavily on the example provided at
http://www.dataincolour.com/2011/07/maps-with-ggplot2/

Data reference:
http://en.wikipedia.org/wiki/NCAA_Men%27s_Ice_Hockey_Championship

Question of interest
As a good Minnesotan, I've believed for quite some time that the colder, Northern states enjoy a competitive advantage when it comes to college hockey. Does this advantage exist? How strong is it?

I first downloaded data from Wikipedia on past winners of hockey championships, and saved the short list as a CSV file.

After saving the file, here's how the data look in R:

# Visualizing College Hockey Champions by State

# Author: Mark T Patterson Date: March 13, 2013


# Libraries:
library(ggplot2)
library(maps)

# Setup:
rm(list = ls())  # clearing the workspace
setwd("C:/Users/Mark/Desktop/Blog/Data")

# Loading Data:


# Loading state championships data:
dat.state = read.csv("HockeyChampsByState.csv", header = TRUE)
dat.state$state = tolower(dat.state$state)
head(dat.state)
##           state titles
## 1      michigan     19
## 2 massachusetts     11
## 3      colorado      9
## 4  north dakota      7
## 5     minnesota      6
## 6     wisconsin      6

Now that we've loaded the information about hockey championships by state, we just need to load the mapping data. map_data("state") returns a dataframe of state boundary coordinates, drawn from the maps package. Here, we'll use the region column, which lists state names, to match our state championship data.

# Creating mapping dataframe:
us.state = map_data("state")
head(us.state)
##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>

# Merging the two datasets:

dat.champs = merge(us.state, dat.state, by.x = "region", by.y = "state", 
    all = TRUE)

dat.champs <- dat.champs[order(dat.champs$order), ]
# plotting requires the observations in the same order as in us.state

head(dat.champs)
##    region   long   lat group order subregion titles
## 1 alabama -87.46 30.39     1     1      <NA>     NA
## 2 alabama -87.48 30.37     1     2      <NA>     NA
## 3 alabama -87.53 30.37     1     3      <NA>     NA
## 4 alabama -87.53 30.33     1     4      <NA>     NA
## 5 alabama -87.57 30.33     1     5      <NA>     NA
## 6 alabama -87.59 30.33     1     6      <NA>     NA
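One optional tweak before plotting: states that never won carry NA in titles after the merge, so they draw in the default gray. If you'd rather show them as zero, the NA replacement is one line (illustrated on a toy vector here, not applied to dat.champs):

```r
# Replace NA with 0 so title-less states shade as zero rather than gray:
titles <- c(19, NA, 6, NA)  # toy values standing in for dat.champs$titles
titles[is.na(titles)] <- 0
titles  # 19 0 6 0
```

Whether NA-as-gray or zero-as-pale reads better is a judgment call; gray makes "never won" visually distinct from "won rarely."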

With the dat.champs frame created, we're ready to plot.

# Plotting

(qplot(long, lat, data = dat.champs, geom = "polygon", group = group, 
    fill = titles) + theme_bw() + labs(x = "", y = "", fill = "", 
    title = "College Hockey Championships By State") + 
    scale_fill_gradient(low = "#EEEEEE", high = "darkgreen") + 
    theme(legend.position = "bottom", legend.direction = "horizontal"))

[Plot: choropleth map of college hockey championships by state]

Having plotted the data, it's easy to see the effect of the Great Lakes region on hockey championships. With the exception of Colorado, only Northern, colder states have won titles.

Ways to improve this analysis
While we observe that college title champions are clustered in the Northern Midwest and Northern East, it's possible that several variables could explain the distribution. We might consider examining 1) state temperature (we might expect that colder temperatures lead to better performance, since teams in colder states get to practice more), 2) distance from great lakes (this might be a proxy for the availability of ice), 3) distance from Canadian hockey cities (it's possible that hockey culture follows from Canadian or other European immigration).

Beyond examining these possible factors, it'd be interesting to try color presentations – I've adopted the same color scheme presented at http://www.dataincolour.com/2011/07/maps-with-ggplot2/ , but it would be good to have some familiarity with other schemes.

Thursday, March 7, 2013

ddply in action

Top Batting Averages Over Time

reference:
http://www.baseball-databank.org/

Short
I'm going to use plyr and ggplot2 to look at how top batting averages have changed over time

First load the data:

options(width = 100)
library(ggplot2)
library(plyr)

data(baseball)
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA

It looks like we've loaded the data successfully.

Next, we'll add something close to batting average: total hits divided by total at-bats:

baseball$ba = baseball$h/baseball$ab
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp     ba
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA 0.3250
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA 0.2778
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA 0.2697
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA 0.3602
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA 0.3516
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA 0.3219

Finally, we can use the plyr package to look at how batting averages have changed over time. We'll only consider players with more than 100 at-bats in a season.

Note: ddply splits the dataset into groups based on the year variable, applies the same function to each subset (here, the topBA calculation), and then collects the per-group results into a new data frame.
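The split-apply-combine pattern ddply implements can be sketched in base R, which makes each step explicit (toy numbers below, not the real baseball data):

```r
# Toy data mimicking the baseball columns used in this post:
toy <- data.frame(
  year = c(1871, 1871, 1872, 1872),
  ab   = c(120, 162, 150, 90),
  h    = c(39, 45, 60, 30)
)
toy$ba <- toy$h / toy$ab

# Split by year, apply the max to each piece, combine into a frame:
pieces <- split(toy, toy$year)
out <- data.frame(
  year  = as.numeric(names(pieces)),
  topBA = sapply(pieces, function(d) max(d$ba[d$ab > 100], na.rm = TRUE))
)
out  # 1871 -> 0.325, 1872 -> 0.400 (the 90-at-bat row is filtered out)
```

ddply bundles all three steps into a single call, which is why the code below is one line.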


BA.dat = ddply(baseball, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE))
head(BA.dat, 10)
##    year  topBA
## 1  1871 0.3602
## 2  1872 0.4147
## 3  1873 0.3976
## 4  1874 0.3359
## 5  1875 0.3666
## 6  1876 0.3560
## 7  1877 0.3872
## 8  1878 0.3580
## 9  1879 0.3570
## 10 1880 0.3602

Now, we're ready to use ggplot2 to visually examine the data:

p = ggplot(BA.dat, aes(x = year, y = topBA)) + geom_point()
p

[Plot: top batting average by year]

While it's only an eyeball judgment at this point, there is a pretty clear downward trend over time.
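To move beyond eyeballing, one could fit a simple linear trend. The sketch below uses toy data shaped like BA.dat (year and topBA columns), not the real series:

```r
set.seed(42)

# Toy series shaped like BA.dat, with a built-in downward drift:
toy <- data.frame(
  year  = 1871:1890,
  topBA = 0.40 - 0.001 * (0:19) + rnorm(20, sd = 0.005)
)

# A linear fit quantifies the trend the plot suggests:
fit <- lm(topBA ~ year, data = toy)
coef(fit)["year"]  # negative slope = declining top batting average
summary(fit)$coefficients["year", "Pr(>|t|)"]  # and its p-value
```

On the real data one would also want to account for changes in season length and at-bat counts before reading too much into the slope.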