Sunday, October 26, 2014

Quarterback Completion Heatmap Using dplyr

Several months ago, I found Bryan Povlinkski's (really nicely cleaned) dataset with 2013 NFL play-by-play information, based on data released by Brian Burke at Advanced Football Analytics.

I decided to browse QB completion rates based on Pass Location (Left, Middle, Right), Pass Distance (Short or Deep), and Down. I ended up focusing on the 5 quarterbacks with the most passing attempts.

The plot above (based on code below) shows a heatmap based on completion rate. Darker colors correspond to a better completion percentage.

Because we've only got data from one year, even looking at the really high-volume passers means that the data are pretty sparse for some combinations of these variables. It's a little rough, but in these cases, I deced not to plot anything. This plot could definitely be improved by plotting gray areas instead of white.

There are a few patterns here – first, it's iteresting to look at each player's success with Short compared to Deep passes. Every player, as we would expect, has more success with Short rather than Deep passes, but this difference seems especially pronounced for Drew Brees (who seems to have more success with Short passes compared to the other players). Brees seems to have pretty uniform completion rates across the three pass locations at short distance too – most other players have slightly better completion rates to the outside, espeically at short distance.

As we would expect, we can also see a fairly pronounced difference in completion rates for deep throws on 3rd down vs. 1st and 2nd down. The sample size is small, so the estimates aren't very precise, this pattern is definitely there – probably best exemplified by Tom Brady and Peyton Manning's data.

As a next step, it would be interesting to make the same plot with pass attempts rather than completion rates.

# note: change path to the dataset
df = read.csv("C:/Users/Mark/Desktop/RInvest/nflpbp/2013 NFL Play-by-Play Data.csv",
              stringsAsFactors = F)
passers = df %>% filter(Play.Type == "Pass") %>% group_by(Passer) %>% summarize(n.obs = length(Play.Type)) %>% arrange(desc(n.obs))
top.passers = head(passers$Passer,5)
df %>% filter(Play.Type == "Pass",
              Passer %in% top.passers) %>% 
  mutate(Pass.Distance = factor(Pass.Distance, levels = c("Short","Deep"))) %>%
  group_by(Down,Passer,Pass.Location, Pass.Distance) %>%  summarize(share = (sum(Pass.Result == "Complete") / length(Pass.Result)),
                                                             n.obs = length(Pass.Result)) %>%
  filter(n.obs > 5) %>%
  ggplot(., aes(Pass.Location, Pass.Distance)) + geom_tile(aes(fill = share),
                                                           colour = "white") + 
  facet_wrap(Passer ~ Down, ncol = 3) +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0,1)) + theme_bw() +
  ggtitle("NFL QB completion by Pass Distance, Location, and Down")

Created by Pretty R at

Wednesday, August 27, 2014

The first rule of brainstorming is..

About a year ago I was at a workshop on behavioral economics and public policy, and Mike Norton, who was leading the session, laid out a 'first rule' of brainstorming that has really resonated with me ever since. Mike's first rule was taken from the rules of improv comedy -- no matter how ridiculous / impractical / nonsensical the idea a person comes up with, you respond with "Yes, and..." There's nothing that kills a brainstorming session like someone pointing out that a particular candidate solution won't work, or doesn't quite solve the initial problem. As well, it's often the really off-the-wall ideas that actually lead to solutions.

Tuesday, August 26, 2014

Replication wiki for econ papers

I found this on Andrew Gelman's Blog -- a page for replications of experimental results in economics. This seems like a great idea! Here's the link

Prizes in public health -- the new kaggle?

This morning I saw a short piece by James Surowiecki on the absence of a vaccination for the Ebola virus. Surowiecki points out that the incentive structure for pharmaceutical companies rewards work on drugs that are likely to be taken by i) a large number of people, ii) Westerners, and iii) are likely to be taken over a long period of time. This means, Surowiecki argues, that under the present incentive scheme we are unlikely to develop cures for things like Ebola, which is essentially confined to the developing world, and up until recently did not affect many people.

As a possible solution, Surowiecki offers 'prizes' -- sponsored by governments -- which compensate firms in exchange for the right to manufacture the resulting pharmaceuticals. Put differently, the government can intervene on the market value of development in these areas by paying 'more than the going rate.'

I think Surowiecki is surely right about the potential benefits of this sort of approach -- paying companies more will definitely give them an incentive to change their research priorities. Thinking about all of this, though, reminded me of the kaggle-style data competitions that are growing increasingly popular. Here, a problem is posted (sometimes with a large prize for the best solution), and data scientists of all stripe work on solutions. The competition that really vaulted these into the mainstream was the Netflix Prize -- offering $1 Million for the best improvement in predictive film ratings. I remember going to a talk a few years ago by the 'BellKor' team that ended up winning the competition, where the winners remarked that while the prize seemed big, the tremendous time the team put into the challenge meant that they were actually working at a really low wage-rate.. and they were the one's who actually won!

I bring all this up because it seems that, at least in the case of data competitions, it's not really the money that's driving entry (or work) on these problems. The competition, collaboration, and social value of doing well seem like much more important causes. Now, data science is quite different from pharmaceutical research -- the start up costs are MUCH lower, and it can be a much more individual activity -- but after reading Surowiecki's article, I'm especially curious to see whether some of the non-monitary incentives that we're seeing at work in the data science world might emerge if public health adopts a similar incentive structure around sponsored prizes.

Sunday, May 18, 2014

T-Shirts ... designed with R!

On Friday, I saw David Smith's post on a competition to design this year's useR! conference t-shirt. The goal is to create a design generated using an R script, which will be featured on the back of the shirt.

Having a bit of time this weekend, I decided to try plotting the R logo, using base graphics, represented by a scatter of points – one for each package published on CRAN. Having recently posted on a very similar idea for visualizing twitter followers, I realized I could take advantage of my past code. With a bit of tweaking, I came up with the image above.

The code I used is below – you should feel free to tweak / improve / experiment with it! Before running, you'll need to install the EBImage package available on bioconductor. Roughly, the script works by first downloading a copy of the R logo (this step might make the entry illegal for the purposes of the contest..), as well as the current number of R packages. Next, there are a few functions to simplify the colors presented in the image – this part probably isn't necessary, but I think it makes the final result look a bit better. Finally, the image is actually generated by sampling pixels from the modified image, and replotting.

If you're interested in trying out your own ideas (which you definitely should!) you can submit entires to the contest as pull requests on Github.

# Script by Mark T Patterson
# May 17, 2014
# twitter: @M_T_Patterson
# General Notes:
# This script creates an image of the R logo 
# represented by n points, 
# where n is the current number of packages on CRAN
# note: this script requries the EBImage package
# available from bioconductor:
# approximate run time: 2 mins
#### initialize ####
# clear workspace
rm(list = ls())
# load libraries
# coordinate the version of the program:
#### gather web data: reference image and CRAN package count ####
# load the R logo, save the rgb values:
img = readImage("")
img.2 = img[,,1:3] = ""
lns = readLines(
ref.line = grep(lns, pattern = "CRAN package repository features")
package.count = as.numeric(strsplit(lns[ref.line],split = "\\s")[[1]][7])
#### helper functions ####
# functions for color simplification: = function(x1){
  ref.dat = data.frame(num = 10:15, let = LETTERS[1:6])
  out = as.character(x1)
  if(x1 %in% 10:15){out = as.character(ref.dat$let[which(ref.dat$num == x1)])}
rgb.func = function(vec){
  #note: vec is a triple of color intensities
  r1 = floor(255*vec[1])
  g1 = floor(255*vec[2])
  b1 = floor(255*vec[3])
  x1 = r1 %/% 16
  x2 = r1 %% 16
  x3 = g1 %/% 16
  x4 = g1 %% 16
  x5 = b1 %/% 16
  x6 = b1 %% 16
  x1 =
  x2 =
  x3 =
  x4 =
  x5 =
  x6 =
  out = paste("#",x1,x2,x3,x4,x5,x6, sep = "")
im.func.1 = function(image, k.cols = 5, samp.val = 3000){
  # creating a dataframe:
  test.mat = matrix(image,ncol = 3)
  df = data.frame(test.mat)
  colnames(df) = c("r","g","b")
  df$y = rep(1:dim(image)[1],dim(image)[2])
  df$x = rep(1:dim(image)[2], each = dim(image)[1])
  samp.indx = sample(1:nrow(df),samp.val)
  work.sub = df[samp.indx,]
  # extracting colors:
  k2 = kmeans(work.sub[,1:3],k.cols)
  # adding centers back:
  fit.test = fitted(k2)
  work.sub$r.pred = fit.test[,1]
  work.sub$g.pred = fit.test[,2]
  work.sub$b.pred = fit.test[,3]
add.cols = function(dat){
# general plotting function
plot.func = function(dat){
  # assumes dat has colums x, ym cols
  plot(dat$y,max(dat$x) - dat$x, col = dat$cols,
       main = "A point for each CRAN package",
       xlab = "useR!",
       ylab = "2014",
#### simplify colors; sample n points ###
temp = im.func.1(img.2, samp.val = 25000, k = 12)
temp$cols = add.cols(temp[,6:8])
final = temp[sample(1:nrow(temp), package.count),]
#### generate plot ####

Created by Pretty R at

Friday, May 2, 2014

Function for rounding a group of numbers

Sometimes when I'm creating summary statistics for factor variables (usually demographics), I find I need to round percentages a bit. If I round each number individually, I occasionally (and frustratingly) change the total sum. For example, suppose I've got information on how many individuals are in each of four groups:

group.totals = c(13, 39, 16, 11)

and I'd like to report the distribution as a share of the total number of individuals:

(tab = prop.table(group.totals))
## [1] 0.1646 0.4937 0.2025 0.1392

however, I only want to report 2 significant digits after the decimal:

( = round(tab, 2))
## [1] 0.16 0.49 0.20 0.14

Here, the rounding process (annoyingly) changes the sum:

## [1] 1
## [1] 0.99

To fix this (a bit), here's a quick function which rounds a group of numbers together: = function(vec, digits) {
    r.vec = round(vec, digits)
    total.resid = sum(vec) - sum(r.vec)
    sq.diffs = ((r.vec + total.resid) - vec)^2
    indx = which.min(sq.diffs)
    r.vec.copy = r.vec
    r.vec.copy[indx] = r.vec.copy[indx] + total.resid
    out = r.vec.copy

This solves some of the problems:

( =, 2))
## [1] 0.17 0.49 0.20 0.14
## [1] 1

But has sort of unusual behavior for some inputs:

bug.vec = c(0.4, 0.4, 0.4, 0.4, 9.2, 9.2), 0)
## [1] 2 0 0 0 9 9

Despite being a bit buggy, this function does well enough for my purposes.. if you'd like to find a better version, or are generally interested, here's a link to a nice discussion on group rounding at stackoverflow.

Monday, April 21, 2014

Organizing data-cleaning scripts

About two years ago it finally dawned on me that having a single gigantic R file for a project wasn't all that practical.  Since then, I've been trying out a few systems for breaking the larger project into smaller scripts.  Today, I came across this introduction to data cleaning in R, which nicely divides the project into several steps (in the figure above).  The authors suggest (at a minimum) saving the data at each of these stages, which seems totally reasonable.

Roughly, the stages of the data cleaning process can be broken down into: 1) Raw data: this is the format of the original data source -- it's possible that some sort of conversion is necessary before the data can even be read into R, or that once the data are are loaded, the variable types or column names have problems.  2) Technically correct data is the result of the most basic cleaning process -- at the very least, your data should be the "shape" you expect (the right number of rows and columns if you're expecting rectangular data), numbers should look like numbers rather than strings, etc.  Technically correct data, despite the proper formatting, may have erroneous values -- these may range from 'outlawed' values (like negative durations), to suspicious values (e.g. an individual's height entered as 9').  3) Consistent data is ready for analysis.

Clearly this division won't be sufficient for all file-organization needs, but it seems like a nice thing to keep in the back of the mind..

Saturday, April 19, 2014

Play 2048... using R!

I've lost about 100 hours over the past week to the black hole of 2048. In an attempt to extricate myself, I thought I'd try writing an R script to play for me. While there are already a ton of great algorithms for the game, I haven't seen any implemented in R.

There's a recent package, RSelenium that allows you to drive your browser through R, so we can jump right into playing the game. As an aside, it's definitely worth browsing through the really nice vignette to get a sense of just how cool this package is.

The code below is divided into a few sections. Section I loads RSelenium, and navigates to the 2048 site. Make sure that in the remDr command, you specify the right sort of browser – I'm using firefox, but it's pretty easy to adjust this (see the vignette).

Sections II - VI break down different steps in the development of an algorithm to play the game. The rough steps are (SII) writing a function to predict the next board states depending on which move is selected, (SIII) writing a few functions to report on features of these boards (for example, the sum of the scores on the tiles in a particular column), (SIV) writing functions to score the various potential future positions based on the features of the boards, and (SV, SVI) putting these together to actually play.

The algorithm I've written is mediocre at best – its mean score is around 7000, but it's won once or twice. It's a pretty thrown-together attempt, but I think it's fun to watch. Hopefully this'll be the antidote I've needed… code is below.

#### SECTION I: fire up the game ####
remDr <- remoteDriver(remoteServerAddr = "localhost" 
                      , port = 4444
                      , browserName = "firefox"
# navigate to page
#### SECTION II: functions for predicting board states ####
# functions to determine current board state:
pos.strip = function(string){
  first.cut = strsplit(string,split = " tile-position-")[[1]]
  val.sub = as.numeric(strsplit(first.cut[1],split = "-")[[1]][2])
  pos.sub = first.cut[2]
  second.cut = strsplit(pos.sub,split = " ")[[1]][1]
  third.cut = strsplit(second.cut, split = "-")[[1]] = as.numeric(third.cut)
  rev.order = rev(
  out = c(rev.order,val.sub)
} = function(htmlParsedPage){
n1 = xpathSApply(htmlParsedPage,"//div[@class='tile-container']",xmlValue)
n2 = xpathSApply(htmlParsedPage,"//div[@class='tile-container']//@class")
n2 = n2[-1]
curr.len = length(n2)
n2 = n2[which((1:curr.len %% 2) == 1)]
mat = t(sapply(n2,pos.strip))
rownames(mat) = 1:nrow(mat)
colnames(mat) = c("x","y","val")
mat = data.frame(mat)
empty.frame = matrix(rep(NA,16),nrow = 4)
for(i in 1:nrow(mat)){
  empty.frame[mat$x[i],mat$y[i]] = mat$val[i]
## predicting next board state:
comb.func = function(vec){
  empty.vec = rep(NA,4)
  four.three = as.numeric(sum(vec[4] == vec[3],na.rm = TRUE))
  three.two = as.numeric(sum(vec[3] == vec[2],na.rm = TRUE)) = as.numeric(sum(vec[2] == vec[1],na.rm = TRUE))
  layout.vec = c(,three.two,four.three)
  if(all(layout.vec == c(1,1,1))){
    empty.vec[3] = 2*vec[2]
    empty.vec[4] = 2*vec[4]
  if(all(layout.vec == c(0,0,1))){
    empty.vec[4] = 2*vec[4]
    empty.vec[1:3] = c(NA,vec[1:2])
  if(all(layout.vec == c(0,1,0))){
    empty.vec[3] = 2*vec[3]
    empty.vec[2] = vec[1]
    empty.vec[4] = vec[4]
  if(all(layout.vec == c(1,0,0))){
    empty.vec[2] = 2*vec[2]
    empty.vec[3:4] = vec[3:4]
  if(all(layout.vec == c(0,1,1))){
    empty.vec[4] = 2*vec[4]
    empty.vec[1:3] = c(NA,vec[1:2])
  if(all(layout.vec == c(1,0,1))){
    empty.vec[3] = 2*vec[2]
    empty.vec[4] = 2*vec[4]
  if(all(layout.vec == c(1,1,0))){
    empty.vec[3] = 2*vec[3]
    empty.vec[2] = vec[1]
    empty.vec[4] = vec[4]
  if(all(layout.vec == c(0,0,0))){
    empty.vec = vec
collect.right = function(board){
  first.move = t(apply(board,1,function(x){ = sum(
  stripped = x[!]
  comb = c(rep(NA,,stripped)
  second.move = t(apply(first.move,1,comb.func))
ninety.rot = function(mat){
  empty = matrix(rep(NA,16),nrow = 4)
  empty[1,] = mat[,4]
  empty[2,] = mat[,3]
  empty[3,] = mat[,2]
  empty[4,] = mat[,1]
collect.down = function(board){
  temp.turn = ninety.rot(board)
  collapse = collect.right(temp.turn)
  turn.back = ninety.rot(ninety.rot(ninety.rot(collapse)))
collect.up = function(board){
  temp.turn = ninety.rot(ninety.rot(ninety.rot(board)))
  collapse = collect.right(temp.turn)
  turn.back = ninety.rot(collapse)
collect.left = function(board){
  temp.turn = ninety.rot(ninety.rot(board))
  collapse = collect.right(temp.turn)
  turn.back = ninety.rot(ninety.rot(collapse))
count.tiles = function(board){
preds.lst = function(Parsed){ 
  board.temp =
  preds = list(orig = board.temp,
               left = collect.left(board.temp),
               right = collect.right(board.temp),
               up = collect.up(board.temp),
               down = collect.down(board.temp))
allowed.func = function(lst){
  # note: this is a function of the output from preds.lst
  # returns the directions that are currently allowed.
  vals = unlist(lapply(lst[2:5],function(x){identical(x,lst[[1]])}))
  sub = names(vals)[which(vals == F)]
legal.sub = function(Parsed){
  preds = preds.lst(Parsed)
  moves = allowed.func(preds)
  out = preds[names(preds) %in% moves]
} = function(choice.arrow){
  return(paste(choice.arrow,"_arrow",sep = ""))
send.func = function(prepped.choice){
  remDr$sendKeysToActiveElement(list(key = prepped.choice))
comb.move = function(arrow){
#### Section III: functions to determine properties of boards #### = function(board){
} = function(board){
  sum(board[,4],na.rm = TRUE)
bottom.right.val = function(board){
  return(sum(board[4,4],na.rm = TRUE))
bottom.right.third.val = function(board){
  return(sum(board[3,4],na.rm = TRUE))
bottom.right.sec.val = function(board){
  return(sum(board[2,4],na.rm = TRUE))
bottom.right.first.val = function(board){
  return(sum(board[1,4],na.rm = TRUE))
} = function(board){
  sum(board[,3] == board[,4],na.rm = TRUE)
} = function(board){
  sum(board[,2] == board[,3],na.rm = TRUE)
} = function(board){
  sum(board[,1] == board[,2],na.rm = TRUE)
#### SECTION IV: scoring boards ####
top.val.moves = function(score.vec){
  raw.scores = score.vec
  temp.max = max(raw.scores)
  indx = which(raw.scores == temp.max)
  maxima = names(raw.scores)[indx]
score.em = function(legal.board, FUN){
#### SECTION V: algorithm for a single play ####
play.func = function(parsed){
  legal.boards = legal.sub(parsed)
  bottom.right = score.em(legal.boards, bottom.right.val)
  leftover.moves = top.val.moves(bottom.right)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  bottom.right.third = score.em(leftover.boards, bottom.right.third.val)
  leftover.moves = top.val.moves(bottom.right.third)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  bottom.right.sec = score.em(leftover.boards, bottom.right.sec.val)
  leftover.moves = top.val.moves(bottom.right.sec)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  tot.fourth.scores = score.em(leftover.boards,
  leftover.moves = top.val.moves(tot.fourth.scores)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  prep.scores = score.em(leftover.boards,
  leftover.moves = top.val.moves(prep.scores)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  tile.tots = score.em(leftover.boards, function(x){20 - count.tiles(x)})
  leftover.moves = top.val.moves(tile.tots)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  prep.scores.third = score.em(leftover.boards,
  leftover.moves = top.val.moves(prep.scores.third)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  prep.scores.second = score.em(leftover.boards,
  leftover.moves = top.val.moves(prep.scores.second)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
  rand.choice = leftover.moves[sample(1:length(leftover.boards),1)]
execute = function(){
  temp = htmlParse(remDr$getPageSource()[[1]])
#### SECTION VI Playing the game #### = function(){
temp2 = rep("Continue",2)
while(temp2[2] != "Game over!"){
  temp = htmlParse(remDr$getPageSource()[[1]])
  temp2 = xpathSApply(temp,"//p",xmlValue)
  curr.score = as.numeric(strsplit(xpathSApply(temp,"//div[@class='score-container']",xmlValue),split = "\\+")[[1]][1])
# example:

Created by Pretty R at