Monday, April 21, 2014

Organizing data-cleaning scripts


About two years ago it finally dawned on me that having a single gigantic R file for a project wasn't all that practical.  Since then, I've been trying out a few systems for breaking the larger project into smaller scripts.  Today, I came across this introduction to data cleaning in R, which nicely divides the project into several steps (in the figure above).  The authors suggest (at a minimum) saving the data at each of these stages, which seems totally reasonable.

Roughly, the stages of the data cleaning process can be broken down into: 1) Raw data: this is the format of the original data source -- it's possible that some sort of conversion is necessary before the data can even be read into R, or that once the data are are loaded, the variable types or column names have problems.  2) Technically correct data is the result of the most basic cleaning process -- at the very least, your data should be the "shape" you expect (the right number of rows and columns if you're expecting rectangular data), numbers should look like numbers rather than strings, etc.  Technically correct data, despite the proper formatting, may have erroneous values -- these may range from 'outlawed' values (like negative durations), to suspicious values (e.g. an individual's height entered as 9').  3) Consistent data is ready for analysis.

Clearly this division won't be sufficient for all file-organization needs, but it seems like a nice thing to keep in the back of the mind..


Saturday, April 19, 2014

Play 2048... using R!

I've lost about 100 hours over the past week to the black hole of 2048. In an attempt to extricate myself, I thought I'd try writing an R script to play for me. While there are already a ton of great algorithms for the game, I haven't seen any implemented in R.

There's a recent package, RSelenium that allows you to drive your browser through R, so we can jump right into playing the game. As an aside, it's definitely worth browsing through the really nice vignette to get a sense of just how cool this package is.

The code below is divided into a few sections. Section I loads RSelenium, and navigates to the 2048 site. Make sure that in the remDr command, you specify the right sort of browser – I'm using firefox, but it's pretty easy to adjust this (see the vignette).

Sections II - VI break down different steps in the development of an algorithm to play the game. The rough steps are (SII) writing a function to predict the next board states depending on which move is selected, (SIII) writing a few functions to report on features of these boards (for example, the sum of the scores on the tiles in a particular column), (SIV) writing functions to score the various potential future positions based on the features of the boards, and (SV, SVI) putting these together to actually play.

The algorithm I've written is mediocre at best – its mean score is around 7000, but it's won once or twice. It's a pretty thrown-together attempt, but I think it's fun to watch. Hopefully this'll be the antidote I've needed… code is below.

#### SECTION I: fire up the game ####
 
require(RSelenium)
 
checkForServer()
startServer()
 
remDr <- remoteDriver(remoteServerAddr = "localhost" 
                      , port = 4444
                      , browserName = "firefox"
)
Sys.sleep(10)
 
remDr$open()
 
# navigate to page
remDr$navigate("http://gabrielecirulli.github.io/2048/")
 
 
#### SECTION II: functions for predicting board states ####
# functions to determine current board state:
pos.strip = function(string){
 
  first.cut = strsplit(string,split = " tile-position-")[[1]]
  val.sub = as.numeric(strsplit(first.cut[1],split = "-")[[1]][2])
  pos.sub = first.cut[2]
  second.cut = strsplit(pos.sub,split = " ")[[1]][1]
  third.cut = strsplit(second.cut, split = "-")[[1]]
  conv.to.num = as.numeric(third.cut)
  rev.order = rev(conv.to.num)
  out = c(rev.order,val.sub)
  return(out)
}
 
conv.to.frame = function(htmlParsedPage){
 
n1 = xpathSApply(htmlParsedPage,"//div[@class='tile-container']",xmlValue)
n2 = xpathSApply(htmlParsedPage,"//div[@class='tile-container']//@class")
n2 = n2[-1]
curr.len = length(n2)
n2 = n2[which((1:curr.len %% 2) == 1)]
 
mat = t(sapply(n2,pos.strip))
 
rownames(mat) = 1:nrow(mat)
colnames(mat) = c("x","y","val")
mat = data.frame(mat)
 
empty.frame = matrix(rep(NA,16),nrow = 4)
 
for(i in 1:nrow(mat)){
  empty.frame[mat$x[i],mat$y[i]] = mat$val[i]
}
 
return(empty.frame)
}
 
## predicting next board state:
comb.func = function(vec){
  empty.vec = rep(NA,4)
  four.three = as.numeric(sum(vec[4] == vec[3],na.rm = TRUE))
  three.two = as.numeric(sum(vec[3] == vec[2],na.rm = TRUE))
  two.one = as.numeric(sum(vec[2] == vec[1],na.rm = TRUE))
  layout.vec = c(two.one,three.two,four.three)
 
  if(all(layout.vec == c(1,1,1))){
    empty.vec[3] = 2*vec[2]
    empty.vec[4] = 2*vec[4]
  }
 
  if(all(layout.vec == c(0,0,1))){
    empty.vec[4] = 2*vec[4]
    empty.vec[1:3] = c(NA,vec[1:2])
  }
 
 
  if(all(layout.vec == c(0,1,0))){
    empty.vec[3] = 2*vec[3]
    empty.vec[2] = vec[1]
    empty.vec[4] = vec[4]
  }
  if(all(layout.vec == c(1,0,0))){
    empty.vec[2] = 2*vec[2]
    empty.vec[3:4] = vec[3:4]
  }
 
  if(all(layout.vec == c(0,1,1))){
    empty.vec[4] = 2*vec[4]
    empty.vec[1:3] = c(NA,vec[1:2])
  }
 
  if(all(layout.vec == c(1,0,1))){
    empty.vec[3] = 2*vec[2]
    empty.vec[4] = 2*vec[4]
  }
  if(all(layout.vec == c(1,1,0))){
    empty.vec[3] = 2*vec[3]
    empty.vec[2] = vec[1]
    empty.vec[4] = vec[4]
  }
  if(all(layout.vec == c(0,0,0))){
    empty.vec = vec
  }
  return(empty.vec)
}
 
collect.right = function(board){
  first.move = t(apply(board,1,function(x){
  n.na = sum(is.na(x))
  stripped = x[!is.na(x)]
  comb = c(rep(NA,n.na),stripped)
  return(comb)
  }))
  second.move = t(apply(first.move,1,comb.func))
  return(second.move)
}
 
ninety.rot = function(mat){
  empty = matrix(rep(NA,16),nrow = 4)
  empty[1,] = mat[,4]
  empty[2,] = mat[,3]
  empty[3,] = mat[,2]
  empty[4,] = mat[,1]
  return(empty)
}
 
collect.down = function(board){
  temp.turn = ninety.rot(board)
  collapse = collect.right(temp.turn)
  turn.back = ninety.rot(ninety.rot(ninety.rot(collapse)))
  return(turn.back)
}
 
collect.up = function(board){
  temp.turn = ninety.rot(ninety.rot(ninety.rot(board)))
  collapse = collect.right(temp.turn)
  turn.back = ninety.rot(collapse)
  return(turn.back)
}
 
collect.left = function(board){
  temp.turn = ninety.rot(ninety.rot(board))
  collapse = collect.right(temp.turn)
  turn.back = ninety.rot(ninety.rot(collapse))
  return(turn.back)
}
 
count.tiles = function(board){
  sum(!is.na(board))
}
 
preds.lst = function(Parsed){ 
  board.temp = conv.to.frame(Parsed)
  preds = list(orig = board.temp,
               left = collect.left(board.temp),
               right = collect.right(board.temp),
               up = collect.up(board.temp),
               down = collect.down(board.temp))
  return(preds)
}
 
allowed.func = function(lst){
  # note: this is a function of the output from preds.lst
  # returns the directions that are currently allowed.
  vals = unlist(lapply(lst[2:5],function(x){identical(x,lst[[1]])}))
  sub = names(vals)[which(vals == F)]
  return(sub)
}
 
legal.sub = function(Parsed){
  preds = preds.lst(Parsed)
  moves = allowed.func(preds)
  out = preds[names(preds) %in% moves]
  return(out)
}
 
prep.to.send = function(choice.arrow){
  return(paste(choice.arrow,"_arrow",sep = ""))
}
 
send.func = function(prepped.choice){
  remDr$sendKeysToActiveElement(list(key = prepped.choice))
}
 
comb.move = function(arrow){
  return(send.func(prep.to.send(arrow)))
}
 
 
#### Section III: functions to determine properties of boards ####
 
tiles.in.fourth = function(board){
  sum(!is.na(board[,4]))
}
 
tot.sum.in.fourth = function(board){
  sum(board[,4],na.rm = TRUE)
}
 
 
bottom.right.val = function(board){
  return(sum(board[4,4],na.rm = TRUE))
}
 
bottom.right.third.val = function(board){
  return(sum(board[3,4],na.rm = TRUE))
}
 
bottom.right.sec.val = function(board){
  return(sum(board[2,4],na.rm = TRUE))
}
 
bottom.right.first.val = function(board){
  return(sum(board[1,4],na.rm = TRUE))
}
 
prep.for.next = function(board){
  sum(board[,3] == board[,4],na.rm = TRUE)
}
 
prep.for.next.third = function(board){
  sum(board[,2] == board[,3],na.rm = TRUE)
}
 
prep.for.next.second = function(board){
  sum(board[,1] == board[,2],na.rm = TRUE)
}
 
 
#### SECTION IV: scoring boards ####
top.val.moves = function(score.vec){
  raw.scores = score.vec
  temp.max = max(raw.scores)
  indx = which(raw.scores == temp.max)
  maxima = names(raw.scores)[indx]
  return(maxima)  
}
 
score.em = function(legal.board, FUN){
  return(unlist(lapply(legal.board,FUN)))  
}
 
 
 
#### SECTION V: algorithm for a single play ####
 
play.func = function(parsed){
  legal.boards = legal.sub(parsed)
 
  bottom.right = score.em(legal.boards, bottom.right.val)
  leftover.moves = top.val.moves(bottom.right)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  bottom.right.third = score.em(leftover.boards, bottom.right.third.val)
  leftover.moves = top.val.moves(bottom.right.third)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  bottom.right.sec = score.em(leftover.boards, bottom.right.sec.val)
  leftover.moves = top.val.moves(bottom.right.sec)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}  
 
  tot.fourth.scores = score.em(leftover.boards, tot.sum.in.fourth)
  leftover.moves = top.val.moves(tot.fourth.scores)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  prep.scores = score.em(leftover.boards, prep.for.next)
  leftover.moves = top.val.moves(prep.scores)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  tile.tots = score.em(leftover.boards, function(x){20 - count.tiles(x)})
  leftover.moves = top.val.moves(tile.tots)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  prep.scores.third = score.em(leftover.boards, prep.for.next.third)
  leftover.moves = top.val.moves(prep.scores.third)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  prep.scores.second = score.em(leftover.boards, prep.for.next.second)
  leftover.moves = top.val.moves(prep.scores.second)
  leftover.boards = legal.boards[leftover.moves]
  if(length(leftover.boards) == 1){
    return(comb.move(names(leftover.boards)))}
 
  rand.choice = leftover.moves[sample(1:length(leftover.boards),1)]
  return(comb.move(rand.choice))
}
 
 
execute = function(){
  temp = htmlParse(remDr$getPageSource()[[1]])
  play.func(temp)
}
 
 
 
 
#### SECTION VI Playing the game ####
 
grand.play = function(){
remDr$navigate("http://gabrielecirulli.github.io/2048/")
temp2 = rep("Continue",2)
while(temp2[2] != "Game over!"){
  temp = htmlParse(remDr$getPageSource()[[1]])
  execute()
  temp2 = xpathSApply(temp,"//p",xmlValue)
  curr.score = as.numeric(strsplit(xpathSApply(temp,"//div[@class='score-container']",xmlValue),split = "\\+")[[1]][1])
}
return(curr.score)
}
 
# example:
grand.play()

Created by Pretty R at inside-R.org

Friday, April 11, 2014

Rblogger Posting Patterns Analyzed with R

I've been a big fan of rbloggers for quite some time, but have only recently started contributing myself. After my first post yesterday, I immidiately started wondering how long most other bloggers go between posts.

I decided to gather the list of past posts to rbloggers to investigate a bit. I've posted the data (as of yesterday evening) here – I'm a bit new to github, but the file (RBloggersData.csv) should be there.

I started by using plyr to calculate the average delay between each author's posts. It turns out that this distribution has a ton of right-skew, and looks fairly normal (or at least mound-shaped.. see plot above) when logged. Depending on how 0s are handled, the average (log) delay between posts is around 3.5 to 3.75, meaning most people post around once each month.

Next, still pretty new to blogging, I wondered which day of the week most people are posting. The distrubution we get shows that weekends have markedly fewer posts than weekdays, and there's a fairly strong downward trend over the course of the week. I'm guessing most people (like me) end up experimenting with data over the weekends, and scaping together a post for Monday. (See first figure below)

Finally, even though I've been seeing the feed of rbloggers posts for a while, I'd never really tracked the total number of posts per day. When I collected the data at the day level, I was surprised to find what explosive growth the site had starting around 2009. After fitting a nonparametric line (see second figure below), we can see the average posts per day roughly double from 2009 to 2010, and double again between 2010 and 2012! Below are the figures and code used to generate.

load("C:/Users/Mark/Desktop/RInvest/WebScraping/rblogger.RData")
 
library(ggplot2)
library(plyr)
library(lubridate)
library(np)
 
find.avg = function(post.inputs){
  if(length(post.inputs) == 1){out = NA} else {
    diffs.raw = difftime(post.inputs,c(post.inputs[-1],tail(post.inputs,1)),
                         units = "days")
    diffs = diffs.raw[-length(diffs.raw)]
    out = mean(diffs)}
  return(out)
}
 
 
delay.frame = ddply(base,.(author),summarize,
           avg.delay = round(as.numeric(find.avg(date.format)),2),
           tot.posts = length(date.format))
 
 
p = ggplot(delay.frame,aes(x = log(avg.delay))) + geom_density()
p + xlab("average delay between posts (log days)") + theme_bw()
ggsave("avgDelay.png")
 
 
log.delay = log(delay.frame$avg.delay)
log.delay[which(log.delay == -Inf)] = 0
 
mean(log.delay,na.rm = TRUE)
 
 
 
base$dow = wday(base$date.format, label = T)
base$month = month(base$date.format, label = T)
base$year = year(base$date.format)
 
 
p = ggplot(base, aes(x = dow)) + geom_bar(fill = "blue")
p + theme_bw() + xlab("day of week") + ylab("total posts")
ggsave("dayOfWeek.png")
 
 
## how many posts per day?
 
post.per.day.frame = ddply(base,.(date.format),
                           summarize,
                           tot.posts = length(title))
 
 
post.per.day.frame$time = as.numeric(difftime(post.per.day.frame$date.format, 
                                   rep(min(post.per.day.frame$date.format),nrow(post.per.day.frame)),
                                   units = "days"))
 
np.1 = npreg(tot.posts ~ time, data = post.per.day.frame)
 
post.per.day.frame$pred = predict(np.1, newdata = post.per.day.frame)
 
p = ggplot(post.per.day.frame, aes(x = date.format,
                                   y = tot.posts)) + geom_point() +
  geom_line(aes(x = date.format, y = pred), color = "red", size = 2)
 
p + theme_bw() + xlab("date") + ylab("total posts")
ggsave("postsPerDay.png")

Created by Pretty R at inside-R.org