Wednesday, April 17, 2013

CrossFit weights: gender matters less than you'd think

Exploring Gaussian Mixture Models

This week in the Empirical Research Methods course, we've been talking a lot about measurement error. The idea of having some latent variable of interest, coupled with 'flawed' measures, reminded me of a section of Cosma's course I really enjoyed but haven't gotten a chance to go back to – mixture models.

The rough idea here is that we have some observed measure for a population, but suspect the population is composed of a number of distinct types, which themselves have different distributions on our measure. Cosma has a nice example using precipitation data – here, we might suspect that snowy days result in a different amount of precipitation than rainy days. The overall distribution of precipitation, then, is a mixture of a few distributions – one for each type.
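To make the idea concrete, here's a small simulated example (not from the original post – the parameter values are invented): draw each day's precipitation from one of two normals, depending on a latent 'type', and look at the pooled density.

# Simulated two-component mixture: a 'snow' type and a 'rain' type
# (illustrative only -- the parameter values are made up)
set.seed(42)
n = 1000
type = rbinom(n, size = 1, prob = 0.4)       # latent type indicator
precip = ifelse(type == 1,
                rnorm(n, mean = 5, sd = 1),   # e.g. snowy days
                rnorm(n, mean = 12, sd = 3))  # e.g. rainy days
plot(density(precip), main = "Density of a two-component mixture")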

Interested in trying to fit my own mixture models, I thought modeling weight by gender might be a reasonable domain for application.

Looking around a bit for open datasets, I found CrossFit data, which needed only a bit of cleaning before I had a data frame composed of weights and genders for about 14000 CrossFit athletes.

Before fitting my model, I wanted to make sure the data were fairly bimodal, so I decided to create a density plot of weight:

[Figure: density plot of athlete weight]

This turned out to be not nearly as bimodal as I'd hoped. Inspecting the gender distribution shows that men make up the vast majority of CrossFit athletes:

## 
##    F    M 
## 0.27 0.73

Despite the comparatively small share of women, I was optimistic that a mixture model would still place Gaussian components reasonably near the mean weights for each gender. Before fitting the model, I again plotted the weight densities, this time including gender information:

[Figure: weight density by gender]

This looked much more promising. Finally, before running the model, I used the plyr package to find the mean and standard deviation of weight for each gender:

##   gender mean.weight sd.weight
## 1      F       135.7     17.64
## 2      M       184.4     24.26

Next, I used the mixtools package to fit two Gaussians to the data. Note that the EM algorithm will not yield the same estimates on every run – normalmixEM chooses its starting values at random when none are supplied, so repeated fits can land in different local optima. Here are a few runs of the two-component model:

## number of iterations= 210
## summary of normalmixEM object:
##            comp 1      comp 2
## lambda   0.983262   0.0167383
## mu     170.985903 197.1966562
## sigma   30.039475  68.0108365
## loglik at estimate:  -69972
## number of iterations= 256
## summary of normalmixEM object:
##             comp 1     comp 2
## lambda   0.0167324   0.983268
## mu     197.1994160 170.986019
## sigma   68.0174677  30.039619
## loglik at estimate:  -69972
## number of iterations= 109
## summary of normalmixEM object:
##            comp 1      comp 2
## lambda   0.913756   0.0862443
## mu     170.277126 183.5844313
## sigma   32.336239   8.7492152
## loglik at estimate:  -70004
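
For reference, here's a minimal sketch of how repeated fits like these can be produced (it assumes the dat data frame built in the script at the end of the post):

# Refit the two-component model a few times; each call starts from
# different random initial values, so runs can end up in different optima
library(mixtools)
set.seed(1)
fits = lapply(1:3, function(i) normalmixEM(dat$weight, k = 2, maxit = 1000, epsilon = 0.001))
lapply(fits, summary)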

Here, I was quite surprised. From the plot of the weight densities by gender, I was fairly sure the mixture model would recover the two groups. The first few times I ran the model, though, the EM algorithm settled on a mixture of two Gaussians, one around 170 and another around 195. (For a bit of background on this algorithm, and on mixture models in general, I highly recommend Cosma's notes.)

When I instead moved to a model with three components, I started reliably picking up a Gaussian fairly close to the mean female weight:

## number of iterations= 403
## summary of normalmixEM object:
##             comp 1     comp 2     comp 3
## lambda   0.0669021   0.144703   0.788395
## mu     191.9951136 128.521504 177.554028
## sigma   50.1322568  10.480415  24.355337
## loglik at estimate:  -69672
## number of iterations= 294
## summary of normalmixEM object:
##           comp 1     comp 2      comp 3
## lambda   0.14469   0.788445   0.0668648
## mu     128.52087 177.553775 191.9985466
## sigma   10.48005  24.356546  50.1385850
## loglik at estimate:  -69672
## number of iterations= 125
## summary of normalmixEM object:
##            comp 1      comp 2      comp 3
## lambda   0.911272   0.0719816   0.0167468
## mu     170.170622 181.9974482 194.2480171
## sigma   32.328408   7.6615752   8.4118644
## loglik at estimate:  -70003

It's a bit difficult to know what to make of all this.

The most obvious question in my mind is why the model fails to pick up the types (i.e. gender) that I expected.

Taking a look at my assumptions, though, I've got to admit that I know essentially nothing about CrossFit. From a bit of browsing around, I found a few blog posts that suggested certain body types would be better for different exercises.

Returning to the model fits, it might make sense to think of different subpopulations based on weight – burly athletes who have a competitive advantage in strength-related events, and thinner athletes who have an advantage in speed-related events.
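
To see where those fitted subpopulations sit relative to the data, one could overlay the fitted mixture components on the observed weight density (a sketch, assuming the dat and weight.k2 objects from the script at the end of the post):

# Overlay the fitted mixture components on the observed weight density
grid = seq(min(dat$weight), max(dat$weight), length.out = 200)
plot(density(dat$weight), main = "Observed density and fitted components",
     xlab = "weight (lbs)")
for (j in seq_along(weight.k2$lambda)) {
  lines(grid,
        weight.k2$lambda[j] * dnorm(grid, weight.k2$mu[j], weight.k2$sigma[j]),
        col = j + 1, lty = 2)
}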

I should also mention that I originally chose two components based on my expectations about weight distribution following from gender. Cosma has a nice section in the notes on using cross-validation to choose the number of components, and that might be a reasonable way to extend the analysis.
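
Here's a rough sketch of what that could look like: hold out part of the data, fit mixtures with different numbers of components on the rest, and compare held-out log-likelihoods (again assuming the dat data frame from the script below).

# Cross-validation sketch for choosing the number of components:
# fit on a training split, score the log-likelihood on a held-out split
library(mixtools)
set.seed(2)
n = nrow(dat)
train = sample(n, size = floor(0.8 * n))
x.train = dat$weight[train]
x.test = dat$weight[-train]

heldout.loglik = function(fit, x) {
  dens = sapply(seq_along(fit$lambda), function(j) {
    fit$lambda[j] * dnorm(x, mean = fit$mu[j], sd = fit$sigma[j])
  })
  sum(log(rowSums(dens)))
}

for (k in 2:4) {
  fit = normalmixEM(x.train, k = k, maxit = 1000, epsilon = 0.001)
  cat("k =", k, "held-out log-likelihood:", heldout.loglik(fit, x.test), "\n")
}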

As a final note, I've got a voice ringing in my head that dissuades reification of the result – the 'types' our model uncovers are only useful insofar as they relate to other variables of interest. It would be ambitious to conclude here that these two subpopulations actually exist; instead, we might do better to use this information to think heuristically about CrossFit athletes.

Here's the code I used for this post:

rm(list = ls())
 
library(mixtools)
library(ggplot2)
library(plyr)  # needed for ddply below
 
# Note: data source: 
# http://dl.dropbox.com/u/23802677/xfit2011.csv
dat = read.csv("http://dl.dropbox.com/u/23802677/xfit2011.csv", 
               stringsAsFactors = FALSE, 
               header = FALSE)
 
# Creating gender variable:
dat$gender = substr(dat$V5,1,1)
correct.gender.codes = (dat$gender %in% c("F","M"))
dat$gender[which(correct.gender.codes == FALSE)] = NA
#table(dat$gender)
 
#Eyeballing the data, it looks like weights are reported in either kilograms or pounds.
 
kilog.vec = grep(dat$V7, pattern = "Kilograms")
Lb.vec = grep(dat$V7, pattern = "Pounds")
 
# Defining function to excise number from string:
excise.number = function(string){
  # Note: assumes number starts the string
  as.numeric(unlist(strsplit(string,split = " "))[[1]][1])
}
 
dat$weight = rep(NA,nrow(dat))
dat$weight[Lb.vec] = sapply(dat$V7[Lb.vec],excise.number)
dat$weight[kilog.vec] = sapply(dat$V7[kilog.vec],excise.number)
 #note: weight in kilograms still needs to be converted:
kilog.to.lb = function(x){round(x*2.20462,0)}
 
dat$weight[kilog.vec] = sapply(dat$weight[kilog.vec],kilog.to.lb)
 
# Final (Clean) data:
dat = na.omit(dat[,c("weight","gender")])
 
 
# Weight density:
p = ggplot(dat,aes(x = weight)) + geom_density()
p
 
# Gender Distribution:
round(prop.table(table(dat$gender)),2)
 
# Weight density by gender:
p = ggplot(dat,aes(x = weight, color = factor(gender))) + geom_density()
p
 
# mean and sd for weight by gender:
ddply(dat,.(gender),summarise, mean.weight = mean(weight), sd.weight = sd(weight))
 
 
# Mixture model with 2 components:
weight.k2 = normalmixEM(dat$weight,k=2,maxit = 1000, epsilon = 0.001)
summary(weight.k2)
 
 
# Mixture model with 3 components:
weight.k3 = normalmixEM(dat$weight, k = 3, maxit = 1000, epsilon = 0.001)
summary(weight.k3)


Tuesday, March 19, 2013

Behavioral Economics and Beer... highly correlated

Short:
I plot the frequency of Wikipedia searches for “Behavioral Economics” and “Beer” – who knew the correlation would be 0.7!

Data reference:
Data on Wikipedia searches for any term (back to 2007) are available at http://glimmer.rstudio.com/pssguy/wikiSearchRates/. The site lets you download the hits per day as a csv, which is what I've done here.

# Behavioral Economics and Beer:

# Author: Mark T Patterson Date: March 18, 2013

# Clear Workbench:
rm(list = ls())

# libraries:
library(lubridate)
library(ggplot2)

# data:
curr.wd = getwd()
setwd("C:/Users/Mark/Desktop/Blog/Data")
ts = read.csv("BehavEconBeer.csv", header = TRUE)
setwd(curr.wd)

# cleaning the dataset:
# str(ts)
ts$date = as.character(ts$date)
ts$date = mdy(ts$date)
## Using date format %m/%d/%Y.
ts = ts[, -1]

Note: the mdy function is in the lubridate package, which cleanly handles time/date data. I've also eliminated the first column of the data, which just gives row names inherited from Excel.
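
As a quick illustration (not from the original post), mdy parses month/day/year strings directly, regardless of the separator:

library(lubridate)
mdy("3/18/2013")   # parses month/day/year
mdy("03-18-2013")  # separators can vary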

p = ggplot(ts, aes(x = date, y = count)) + geom_line(aes(color = factor(name)), 
    size = 2)
p

[Figure: daily Wikipedia search counts for “Behavioral Economics” and “Beer”, colored by term]
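
To check the headline correlation, one can reshape the data to one column per term and correlate the daily counts (a sketch; it assumes the reshape2 package is available and that ts has one row per term per day):

# One column per search term, then correlate the two daily series
library(reshape2)
wide = dcast(ts, date ~ name, value.var = "count")
cor(wide[[2]], wide[[3]], use = "complete.obs")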

It turns out the pattern we observe isn't at all unique – many variables follow (predictable) patterns of variation through the week. This doesn't necessarily mean, though, that the correlation between beer and behavioral economics is entirely spurious!
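
One quick way to see that weekly structure is to average the counts by day of week (a sketch, using lubridate's wday on the ts data frame above; plyr is assumed to be available):

# Average search counts by day of week, separately for each term
library(lubridate)
library(plyr)
ts$dow = wday(ts$date, label = TRUE)
ddply(ts, .(name, dow), summarise, mean.count = mean(count))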

Wednesday, March 13, 2013

Using maps and ggplot2 to visualize college hockey championships

Short:
I plot the frequency of college hockey championships by state using the maps package, and ggplot2

Note: this example is based heavily on the example provided at
http://www.dataincolour.com/2011/07/maps-with-ggplot2/

data reference:
http://en.wikipedia.org/wiki/NCAA_Men%27s_Ice_Hockey_Championship

Question of interest
As a good Minnesotan, I've believed for quite some time that the colder, Northern states enjoy a competitive advantage when it comes to college hockey. Does this advantage exist? How strong is it?

I first downloaded data from Wikipedia on past winners of hockey championships, and saved the short list as a csv file in Excel.

After saving the file, here's how the data look in R:

# Visualizing College Hockey Champions by State

# Author: Mark T Patterson Date: March 13, 2013


# Libraries:
library(ggplot2)
library(maps)

# Clearing the workspace and setting the working directory:
rm(list = ls())
setwd("C:/Users/Mark/Desktop/Blog/Data")

# Loading Data:


# Loading state championships data:
dat.state = read.csv("HockeyChampsByState.csv", header = TRUE)
dat.state$state = tolower(dat.state$state)
head(dat.state)
##           state titles
## 1      michigan     19
## 2 massachusetts     11
## 3      colorado      9
## 4  north dakota      7
## 5     minnesota      6
## 6     wisconsin      6

Now that we've loaded the information about hockey championships by state, we just need to load the mapping data. map_data("state") returns a data frame of state outlines built from the maps package. Here, we'll use its region column, which lists state names, to match our state championship data.

# Creating mapping dataframe:
us.state = map_data("state")
head(us.state)
##     long   lat group order  region subregion
## 1 -87.46 30.39     1     1 alabama      <NA>
## 2 -87.48 30.37     1     2 alabama      <NA>
## 3 -87.53 30.37     1     3 alabama      <NA>
## 4 -87.53 30.33     1     4 alabama      <NA>
## 5 -87.57 30.33     1     5 alabama      <NA>
## 6 -87.59 30.33     1     6 alabama      <NA>
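
Before merging, a quick sanity check (not in the original post) that every championship state appears in the map data's region column:

# Championship states that don't match any map region (ideally none)
setdiff(dat.state$state, unique(us.state$region))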

# Merging the two datasets:

dat.champs = merge(us.state, dat.state, by.x = "region", by.y = "state", 
    all = TRUE)

dat.champs <- dat.champs[order(dat.champs$order), ]
# mapping requires the same order of observations that appear in us.state

head(dat.champs)
##    region   long   lat group order subregion titles
## 1 alabama -87.46 30.39     1     1      <NA>     NA
## 2 alabama -87.48 30.37     1     2      <NA>     NA
## 3 alabama -87.53 30.37     1     3      <NA>     NA
## 4 alabama -87.53 30.33     1     4      <NA>     NA
## 5 alabama -87.57 30.33     1     5      <NA>     NA
## 6 alabama -87.59 30.33     1     6      <NA>     NA

With the dat.champs frame created, we're ready to plot:

# Plotting

(qplot(long, lat, data = dat.champs, geom = "polygon", group = group, 
    fill = titles) + theme_bw() + labs(x = "", y = "", fill = "") + scale_fill_gradient(low = "#EEEEEE", 
    high = "darkgreen") + opts(title = "College Hockey Championships By State", 
    legend.position = "bottom", legend.direction = "horizontal"))

[Figure: choropleth map of college hockey championships by state]

Having plotted the data, we can easily see the effect of the Great Lakes region on hockey championships. With the exception of Colorado, only northern, colder states have won titles.

Ways to improve this analysis
While we observe that college title winners are clustered in the northern Midwest and the Northeast, it's possible that several variables could explain the distribution. We might consider examining 1) state temperature (we might expect that colder temperatures lead to better performance, since teams in colder states get more practice), 2) distance from the Great Lakes (a possible proxy for the availability of ice), and 3) distance from Canadian hockey cities (it's possible that hockey culture follows from Canadian or European immigration).

Beyond examining these possible factors, it'd be interesting to try other color schemes – I've adopted the scheme presented at http://www.dataincolour.com/2011/07/maps-with-ggplot2/, but it would be good to get some familiarity with alternatives.

Thursday, March 7, 2013

ddply in action

Top Batting Averages Over Time

reference:
http://www.baseball-databank.org/

Short:
I'm going to use plyr and ggplot2 to look at how top batting averages have changed over time.

First load the data:

options(width = 100)
library(ggplot2)
library(plyr)

data(baseball)
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA

It looks like we've loaded the data successfully.

Next, we'll add something that is close to batting average: total hits divided by total at-bats:

baseball$ba = baseball$h/baseball$ab
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp     ba
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA 0.3250
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA 0.2778
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA 0.2697
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA 0.3602
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA 0.3516
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA 0.3219

Finally, we can use the plyr package to look at how top batting averages have changed over time. We'll only consider players with more than 100 at-bats in a season.

Note: ddply essentially splits the dataset into groups based on the year variable, applies the same function to each of the subsets (here, computing topBA, the maximum batting average among players with more than 100 at-bats), and then collects all of the output into a new data frame.


BA.dat = ddply(baseball, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE))
head(BA.dat, 10)
##    year  topBA
## 1  1871 0.3602
## 2  1872 0.4147
## 3  1873 0.3976
## 4  1874 0.3359
## 5  1875 0.3666
## 6  1876 0.3560
## 7  1877 0.3872
## 8  1878 0.3580
## 9  1879 0.3570
## 10 1880 0.3602
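
For intuition, here's roughly the same split-apply-combine computation written with base R's split and sapply (a sketch):

# Split by year, compute the max batting average among players with
# more than 100 at-bats, and combine back into a data frame
by.year = split(baseball, baseball$year)
top.ba = sapply(by.year, function(d) max(d$ba[d$ab > 100], na.rm = TRUE))
BA.base = data.frame(year = as.numeric(names(by.year)), topBA = top.ba)
head(BA.base, 10)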

Now, we're ready to use ggplot2 to visually examine the data:

p = ggplot(BA.dat, aes(x = year, y = topBA)) + geom_point()
p

[Figure: top batting average by year]

While it's only a heuristic judgment at this point, it's pretty clear that we have a downward trend over time.
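
To make the trend explicit, one could overlay a smoother on the scatterplot (a sketch; the loess method is just a default choice):

# Overlay a loess smoother on the yearly top batting averages
p + geom_smooth(method = "loess", se = TRUE)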