Top Batting Averages Over Time
reference:
http://www.baseball-databank.org/
I'm going to use plyr and ggplot2 to look at how top batting averages have changed over time.
First, load the data:
options(width = 100)
library(ggplot2)
## Warning message: package 'ggplot2' was built under R version 2.14.2
library(plyr)
data(baseball)
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA
It looks like we've loaded the data successfully.
Next, we'll add something close to batting average, namely hits divided by at-bats for each row:
baseball$ba = baseball$h/baseball$ab
head(baseball)
##            id year stint team lg  g  ab  r  h X2b X3b hr rbi sb cs bb so ibb hbp sh sf gidp     ba
## 4   ansonca01 1871     1  RC1    25 120 29 39  11   3  0  16  6  2  2  1  NA  NA NA NA   NA 0.3250
## 44  forceda01 1871     1  WS3    32 162 45 45   9   4  0  29  8  0  4  0  NA  NA NA NA   NA 0.2778
## 68  mathebo01 1871     1  FW1    19  89 15 24   3   1  0  10  2  1  2  0  NA  NA NA NA   NA 0.2697
## 99  startjo01 1871     1  NY2    33 161 35 58   5   1  1  34  4  2  3  0  NA  NA NA NA   NA 0.3602
## 102 suttoez01 1871     1  CL1    29 128 35 45   3   7  3  23  3  1  1  0  NA  NA NA NA   NA 0.3516
## 106 whitede01 1871     1  CL1    29 146 40 47   6   5  1  21  2  2  4  1  NA  NA NA NA   NA 0.3219
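One reason the ba column is only "close to" a season batting average is that each row of the data is a single stint, so a player who changed teams mid-year appears on more than one row. If you wanted a true per-season figure, a rough sketch would be to sum hits and at-bats over stints before dividing. The toy data frame and player ids below are made up for illustration and are not taken from the baseball data:

```r
# Hypothetical toy data: playera01 has two stints in the same year
toy <- data.frame(
  id    = c("playera01", "playera01", "playerb01"),
  year  = c(1900, 1900, 1900),
  stint = c(1, 2, 1),
  ab    = c(100, 300, 400),
  h     = c(40, 60, 120)
)

# Per-row (per-stint) average, as computed in the post
toy$ba <- toy$h / toy$ab

# Season-level average: add up hits and at-bats per player and year, then divide
season <- aggregate(cbind(h, ab) ~ id + year, data = toy, FUN = sum)
season$ba <- season$h / season$ab
season
```

Here playera01's two stint averages (0.40 and 0.20) combine into a single season average of 100/400 = 0.25, which is not the same as the plain mean of the two stint values.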
Finally, we can use the plyr package to look at how batting averages have changed over time. We'll only consider players who have more than 100 at-bats in a season.
Note: ddply essentially splits the dataset into groups based on the year variable, and then applies the same function to each of the subsets (here, summarise, which computes topBA as the highest batting average among rows with more than 100 at-bats). With the calculation performed on each of the subsets, ddply then collects all of the output into a new data frame.
BA.dat = ddply(baseball, .(year), summarise, topBA = max(ba[ab > 100], na.rm = TRUE))
head(BA.dat, 10)
## year topBA
## 1 1871 0.3602
## 2 1872 0.4147
## 3 1873 0.3976
## 4 1874 0.3359
## 5 1875 0.3666
## 6 1876 0.3560
## 7 1877 0.3872
## 8 1878 0.3580
## 9 1879 0.3570
## 10 1880 0.3602
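To make the split, apply, and combine steps described in the note more concrete, here is a rough base-R sketch of the same idea. It is not how ddply is implemented, just an equivalent written out by hand; the toy data frame and the names pieces, top_by_year, and BA.toy are invented for this example:

```r
# Hypothetical toy data standing in for the baseball data frame
toy <- data.frame(
  year = c(1900, 1900, 1900, 1901, 1901),
  ab   = c(450, 80, 300, 500, 120),
  h    = c(150, 40, 90, 160, 30)
)
toy$ba <- toy$h / toy$ab

# 1. Split: one data frame per year
pieces <- split(toy, toy$year)

# 2. Apply: run the same calculation on every piece
top_by_year <- sapply(pieces, function(d) max(d$ba[d$ab > 100], na.rm = TRUE))

# 3. Combine: collect the per-year results into a new data frame
BA.toy <- data.frame(year  = as.numeric(names(top_by_year)),
                     topBA = unname(top_by_year))
BA.toy
```

Note that the 1900 row with 40 hits in 80 at-bats has the highest raw average (0.5), but it is dropped by the ab > 100 filter, so the top value for 1900 comes from the 150-for-450 row instead.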
Now, we're ready to use ggplot2 to visually examine the data:
p = ggplot(BA.dat, aes(x = year, y = topBA)) + geom_point()
p
While it's only an eyeball judgment at this point, the plot shows a fairly clear downward trend over time.
If needed, you can easily add a regression line by appending + geom_smooth(method = "lm") to the end of your plotting line. It may not be the most sophisticated modeling, but it's good for a quick eyeball.
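A minimal sketch of that suggestion follows. So that it runs on its own, it uses the ten rows of BA.dat printed earlier (with the rounded values shown there) as a small stand-in data frame; BA.head is a name made up for this example, and with the full BA.dat you would simply pass that instead:

```r
library(ggplot2)

# First ten rows of BA.dat as printed in the post (rounded values)
BA.head <- data.frame(
  year  = 1871:1880,
  topBA = c(0.3602, 0.4147, 0.3976, 0.3359, 0.3666,
            0.3560, 0.3872, 0.3580, 0.3570, 0.3602)
)

# Same scatterplot as before, plus a straight-line fit with its confidence band
p <- ggplot(BA.head, aes(x = year, y = topBA)) +
  geom_point() +
  geom_smooth(method = "lm")
p
```

Ten seasons are far too few to say anything about the long-run trend; the point is only to show where the extra layer goes.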