Thursday, March 7, 2013

ddply in action

Top Batting Averages Over Time

Top Batting Averages Over Time

reference:
http://www.baseball-databank.org/

Short
I'm going to use plyr and ggplot2 to look at how top batting averages have changed over time

First load the data:

options(width = 100)
library(ggplot2)
## Warning message: package 'ggplot2' was built under R version 2.14.2
library(plyr)

data(baseball)
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA

It looks like we've loaded the data successfully.

Next, We'll add something that is close to batting average: total hits divided by total at-bats:

baseball$ba = baseball$h/baseball$ab
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp     ba
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA 0.3250
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA 0.2778
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA 0.2697
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA 0.3602
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA 0.3516
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA 0.3219

Finally, we can use the plyr package to look at how batting averages have changed over time. We'll only consider players who have at least 100 at-bats in a season.

Note: ddply essentially splits the dataset into groups based on the year variable, and then performs the same function on each of the subsets (here, we're executing the topBA function). With the calculation performed on each of the subsets, ddply then collects all of the output into a new data frame.


BA.dat = ddply(baseball, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE))
head(BA.dat, 10)
##    year  topBA
## 1  1871 0.3602
## 2  1872 0.4147
## 3  1873 0.3976
## 4  1874 0.3359
## 5  1875 0.3666
## 6  1876 0.3560
## 7  1877 0.3872
## 8  1878 0.3580
## 9  1879 0.3570
## 10 1880 0.3602

Now, we're ready to use ggplot2 to visually examine the data:

p = ggplot(BA.dat, aes(x = year, y = topBA)) + geom_point()
p

plot of chunk unnamed-chunk-4

While it's only a heuristic judgment at this point, it's pretty clear that we have a downward trend over time.

1 comment:

  1. If needed, you can easily add a regression line by adding +geom_smooth(method = "lm") to the end of your plotting line. It may not be the most awesome modeling but it's good for a quick eyeball.

    ReplyDelete