Thursday, November 7, 2013

College Basketball: Presence in the NBA over Time

Interested in practicing a bit of web-scraping, I decided to make use of a nice dataset provided by Databasebasketball.com in order to examine the representation of various college programs in the NBA/ABA over time. This dataset only includes retired players and ends in 2010, so counts for recent seasons are incomplete; I decided to plot data only through 2000.

Originally, I was excited to try out a googleVis motion chart using this data, but the result turned out less exciting than I expected.
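
For the curious, the motion chart itself only takes a couple of lines. Here's a minimal sketch of the sort of call I tried, assuming the comb data frame (one row per school and year, with a tot.players count) that's built in the code below:

# (sketch) googleVis motion chart of players in the league by school and year,
# assuming the comb data frame constructed in the code below:
library(googleVis)
motion = gvisMotionChart(comb, idvar = "school", timevar = "year")
plot(motion)  # opens the chart in a browser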

Here, I've restricted my attention to teams which (at some point) have at least 11 players in the league simultaneously – this turns out to limit the selection to a handful of programs.

While enthusiasts of NBA history surely will not need this plot to recall these periods of schools' strong presence in the league, I think the plot nicely captures the story behind several programs. It's easy to see the relatively recent emergence of Georgia Tech and Arizona, the slow climb of UNC and Michigan, and the powerhouse years of Kentucky (1950s) and UCLA (1980s).

The code used to generate the plot is below.

# data scrape:
site = "http://www.databasebasketball.com/players/playerbycollege.htm"
 
# turn off warnings:
options(warn = -1)
 
# readlines:
tab = readLines(site)
 
# build the full URL of each college's player page from a line of raw HTML:
trim = function(x){
  temp = substr(x,9,nchar(x)-8)
  temp2 = strsplit(temp,split = ">")[[1]][1]
  paste("http://www.databasebasketball.com",temp2,sep = "")
}
 
sub = tab[81:553]
sites = sapply(sub, trim)
 
# find lines around players:
 
dates.grab = function(s1){
  temp = readLines(s1)
  start = grep("listed separately",temp)
  end = grep("font class=foot",temp)
  sub = temp[(start+3):(end-2)]
  pattern = "[[:digit:]]+-[[:digit:]]+"
  m = gregexpr(pattern, sub)
  unlist(regmatches(sub,m))
}
 
 
df = data.frame(unlist(lapply(sites,dates.grab)))
 
 
names(df) = c("years")
 
 
test = rownames(df)[1]
 
clean.school = function(name){
temp = strsplit(name,split = ">")[[1]][2]
substr(temp,1,nchar(temp)-3)
}
 
df$school = unlist(lapply(rownames(df),clean.school))
rownames(df) = 1:nrow(df)
 
df$year.start = unlist(lapply(df$years, function(x){substr(x,1,4)}))
df$year.end = unlist(lapply(df$years, function(x){substr(x,6,10)}))
 
df = df[,2:4]
 
df$year.start = as.numeric(df$year.start)
df$year.end = as.numeric(df$year.end)
 
min(df$year.start)
max(df$year.end)
 
 
#looking for players in 1946:
 
was.playing.func = function(years,test.year){
  as.numeric(test.year %in% years[1]:years[2])
}
 
 
# 65 years
 
mat = matrix(rep(NA,nrow(df)*65),ncol = 65)
 
for(i in 1:65){
  mat[,i] = apply(df[,2:3],1,function(x){was.playing.func(x,(i + 1945))})
}
 
 
copy = df
 
copy = cbind(copy, mat)
 
names(copy)[4:ncol(copy)] = 1946:2010
 
 
 
library(reshape)
mdata = melt(copy, id = "school")
 
mdata = mdata[-which(mdata$variable %in% c("year.start","year.end")),]
 
names(mdata) = c("school","year","players")
 
mdata$year = as.numeric(as.character(mdata$year))
 
 
library(plyr)
comb = ddply(mdata,.(school,year),summarise,tot.players = sum(players))
 
 
# looking at a subset:
comb2 = comb[comb$year<2001,]
top.sub = unique(comb2$school[which(comb2$tot.players > 10)])
 
df2 = comb2[which(comb2$school %in% top.sub),]
 
library(ggplot2)
 
p = ggplot(df2, aes(x = year, y = tot.players, col = school)) + geom_line(lwd = 2) +
  facet_grid(school~.)
print(p + ylab("players in the NBA/ABA") + theme(strip.text.y = element_blank()))
 
 
ggsave(file = "topCollegeNBA.png",height = 8)


Wednesday, November 6, 2013

Twitter follower counts are log-normal

Continuing my investigation of Jeff Gentry's twitteR package, I decided to take a look at the distribution of twitter users' followers.

As a rough place to start, I examined the distribution of followers for those who follow me – that is, I first gather a dataframe with all my followers, then I look at the number of followers those users have. Fortunately, Jeff's package makes this really easy (see my code below).

Since I was expecting a distribution with a very long right-tail, I decided to plot the logarithm of the number of followers.

The result was an almost perfect normal distribution, which was surprising given my small sample size (I have about 650 followers).
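
If you want a slightly more formal check of the log-normal claim, here's a quick sketch (using the df.f data frame built in the code below); at this sample size the Q-Q plot is probably more informative than the test:

# (sketch) checking normality of log(followers), using the df.f data frame
# constructed in the code below; drop accounts with zero followers first
logf = log(df.f$followers[df.f$followers > 0])
qqnorm(logf); qqline(logf)  # points hugging the line suggest log-normality
shapiro.test(logf)          # formal (and fairly strict) normality test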

For reference, I added the log-follower counts for some famous folks (and plotted my own as well).

# note: you'll need to save the 'credentials' file, and load
# it before you can access twitter data.  
 
# for help with this, see this post: 
#https://sites.google.com/site/dataminingatuoc/home/data-from-twitter/r-oauth-for-twitter
 
load("C:/Users/Mark/Documents/twitteR_credentials")
registerTwitterOAuth(Cred)
 
me = getUser("M_T_Patterson", cainfo = "cacert.pem")
#this works
 
# What can I learn about a user?
 
me$getFavorites(cainfo = "cacert.pem")
 
fl = me$getFollowers(cainfo = "cacert.pem")
 
df = data.frame(name = sapply(fl,function(x) x$screenName),
                id = sapply(fl,function(x) x$id),
                #last.tweet.date = sapply(fl,function(x) x$lastStatus$created),
                followers = sapply(fl,function(x) x$followersCount),
                location = sapply(fl,function(x) x$location))
 
# sorting by number of followers:
df.f = df[order(df$followers,decreasing = TRUE),]
 
head(df.f,50)
 
 
#(Not run)
#library(ggplot2)
#p = ggplot(df.f, aes(x = log(followers))) + geom_density()
#p + geom_text(aes(log(refs$followers), y = 0.3, label = refs$name, fill = "blue", size = 5))
 
 # this is interesting -- it looks like a log-normal distribution.
 
# adding some references:
 
refs = data.frame(
  name = c("Graduate Student:\nMark Patterson",
           "Famous R Statistician:\nHadley Wickham",
           "Famous Journalist:\nThomas Friedman",
           "Famous Heartthrob:\nJustin Bieber"),
  followers = c(656,5446,234686,46602072))
 
# a bit more on the density of the distribution at various points:
dens = density(log(df.f$followers))
 
refs$log.followers = log(refs$followers)
 
# find.closest dens.value:
 
dens.lookup = function(val){
  dens$y[which.min(abs(val - dens$x))]
}
 
refs$dens = sapply(refs$log.followers, function(x){dens.lookup(x)})
 
 
 
library(ggplot2)
p = ggplot(df.f, aes(x = log(followers))) + geom_density()
p + geom_text(data = refs, aes(x = log.followers, y = dens, label = name), color = "blue", size = 5) +
  theme(legend.position = "none") + scale_x_continuous(limits = c(0,20)) +
  labs(title = "Density of log(Followers) on twitter")


Monday, November 4, 2013

Using R to find Obama's most frequent twitter hashtags

I've been exploring Jeff Gentry's twitteR package, which has a ton of great functionality for interacting with twitter data in R. Today, I thought a bit about a problem I've noticed several times on twitter: users' profiles are often only noisy signals of the content they tweet about!

I decided that a table of a user's commonly-used hashtags might give a better sense of the content a user tweets about. My code to extract the hashtags is below (note: you'll need to load the twitteR package and complete the OAuth authentication first; if you're having trouble with this, try visiting this page).

Here's the code I used:

tw = userTimeline("BarackObama", cainfo = "cacert.pem", n = 3200)
tw = twListToDF(tw)
vec1 = tw$text
 
extract.hashes = function(vec){
 
  hash.pattern = "#[[:alpha:]]+"
  have.hash = grep(x = vec, pattern = hash.pattern)
 
  hash.matches = gregexpr(pattern = hash.pattern,
                          text = vec[have.hash])
  extracted.hash = regmatches(x = vec[have.hash], m = hash.matches)
 
  df = data.frame(table(tolower(unlist(extracted.hash))))
  colnames(df) = c("tag","freq")
  df = df[order(df$freq,decreasing = TRUE),]
  return(df)
}
 
dat = head(extract.hashes(vec1),50)
dat2 = transform(dat,tag = reorder(tag,freq))
 
 
library(ggplot2)
 
p = ggplot(dat2, aes(x = tag, y = freq)) + geom_bar(stat = "identity", fill = "blue")
p + coord_flip() + labs(title = "Hashtag frequencies in the tweets of the Obama team (@BarackObama)")


Sunday, November 3, 2013

Simulating Abstract Art with R

Piet Mondrian, Composition with Red, Blue, Black, Yellow, and Gray (1921):

An example draw from my simulation function:

We're in the midst of planning our spring course on Empirical Research Methods, and as a result, I've found myself spending a lot of time thinking about some of the first ideas in statistics – for example, that the data we observe are actually draws from some (usually unobserved) generating function.

On a recent museum trip, I started thinking about how we could apply this idea to abstract art. Here, I thought, we could think of a particular painting as a single manifestation of some set of generating rules.

Piet Mondrian has a ton of work in a style I thought would be interesting to explore – his work involves combinations of a small number of lines and filled rectangles.

In order to experiment, I decided to write a function which would simulate the Composition with Red, Blue, Black, Yellow, and Gray (1921) featured above.

My first idea was to maintain the same set of lines and colors, while varying (slightly) the locations and widths of those lines – we can think of these locations as parameters drawn from specified distributions. To start, I only use uniform distributions, with means near the values in the original.

For the color, I found a site that extracts hexadecimal colors, which R recognizes.

The result, I think, is interesting. While the variation is still quite constrained, this lets us examine the abstract art... abstractly (ha!)

Here are a few more draws from the function:

Here is the code for the simulation (it's a bit of a mess). Feel free to download, experiment with, and improve it!

sim.func = function() {

    # pick a place (near the middle) for central anchor line:
    left.anchor = runif(1, 40, 60)

    width = runif(1, 2, 3)

    right.anchor = left.anchor + width

    # plot:
    plot(0, 0, type = "n", xlim = c(0, 100), ylim = c(0, 10), xaxt = "n", yaxt = "n", 
        ann = FALSE)

    polygon(c(left.anchor, left.anchor, right.anchor, right.anchor), c(-1, 11, 
        11, -1), col = "#1c1b23")

    upper.left = runif(1, 7, 9)
    lower.left = runif(1, 2, 5)

    small.line.height = runif(1, 0.1, 0.2)

    polygon(c(-100, -100, left.anchor, left.anchor), c(lower.left, lower.left + 
        small.line.height, lower.left + small.line.height, lower.left), col = "#1c1b23")

    polygon(c(-100, -100, left.anchor, left.anchor), c(upper.left, upper.left + 
        small.line.height, upper.left + small.line.height, upper.left), col = "#1c1b23")

    polygon(c(-10, -10, left.anchor, left.anchor), c(upper.left + small.line.height, 
        12, 12, upper.left + small.line.height), col = "#cbccd4")

    polygon(c(-10, -10, left.anchor, left.anchor), c(-1, lower.left, lower.left, 
        -1), col = "#cbccd4")



    upper.right = runif(1, 7, 9)

    polygon(c(left.anchor, left.anchor, 200, 200), c(upper.right, upper.right + 
        small.line.height, upper.right + small.line.height, upper.right), col = "#1c1b23")

    polygon(c(left.anchor + width, left.anchor + width, 200, 200), c(upper.right + 
        small.line.height, 20, 20, upper.right + small.line.height), col = "#db5b2c")

    polygon(c(-10, -10, left.anchor, left.anchor), c(lower.left + small.line.height, 
        upper.left, upper.left, lower.left + small.line.height), col = "#273f70")

    lowest.left = runif(1, 0.1, 1.2)

    polygon(c(-100, -100, left.anchor, left.anchor), c(lowest.left, lowest.left + 
        small.line.height, lowest.left + small.line.height, lowest.left), col = "#1c1b23")

    lh.ref = runif(1, 5, 20)

    polygon(c(lh.ref, lh.ref, lh.ref + width, lh.ref + width), c(lowest.left, 
        upper.left, upper.left, lowest.left), col = "#1c1b23")


    lh.mid = lower.left + runif(1, 0.1, 0.4) * (upper.left - lower.left)

    polygon(c(lh.ref, lh.ref, 200, 200), c(lh.mid, lh.mid + small.line.height, 
        lh.mid + small.line.height, lh.mid), col = "#1c1b23")


    lowest.right = runif(1, 0.1, 1.6)

    polygon(c(left.anchor + width, left.anchor + width, 105, 105), c(lowest.right, 
        lowest.right + small.line.height, lowest.right + small.line.height, 
        lowest.right), col = "#1c1b23")

    rh.ref = runif(1, 75, 90)

    polygon(c(left.anchor + width, left.anchor + width, rh.ref, rh.ref), c(-1, 
        lowest.right, lowest.right, -1), col = "#1c1b23")

    polygon(c(rh.ref, rh.ref, 200, 200), c(-1, lowest.right, lowest.right, -1), 
        col = "#dccd1e")

    polygon(c(right.anchor, right.anchor, 200, 200), c(lowest.right + small.line.height, 
        lh.mid, lh.mid, lowest.right + small.line.height), col = colors()[358])

    polygon(c(right.anchor, right.anchor, 200, 200), c(lh.mid + small.line.height, 
        upper.right, upper.right, lh.mid + small.line.height), col = "#cbccd4")

    top.left = runif(1, 0.1, 0.3) * (10 - upper.left) + upper.left + small.line.height

    polygon(c(-100, -100, left.anchor, left.anchor), c(top.left, top.left + 
        small.line.height, top.left + small.line.height, top.left), col = "#1c1b23")

    polygon(c(-100, -100, left.anchor, left.anchor), c(upper.left + small.line.height, 
        top.left, top.left, upper.left + small.line.height), col = colors()[358])

}
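
To generate a panel of draws like the ones above, you can just call the function repeatedly. For example, this sketch produces four new "compositions" in a 2x2 grid:

# draw four simulated compositions in a 2x2 grid:
set.seed(42)  # make the draws reproducible
par(mfrow = c(2, 2), mar = rep(0.5, 4))
for (i in 1:4) sim.func()
par(mfrow = c(1, 1))  # reset the plotting layout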

Friday, November 1, 2013

Quarterback Wonderlic Scores by Institution (Academic) Strength

(Plot: QB Wonderlic score by US News and World Report institution rank, with a loess smooth.)

I remember my dad telling me that when he was at Northwestern in the mid-70s, the team was essentially winless. As a small consolation, he remembered that the football team had actually been full of good students.

Some time ago, I stumbled across nflcombineresults.com. Among other measures, the site reports Wonderlic scores for a bunch of players entering the NFL.

By adding data on a school's institutional strength (from US News and World Report), I can look for an association between a quarterback's Wonderlic score (a measure of cognitive ability) and the (academic) strength of the quarterback's alma mater.

Results: While this is fairly rough, it looks like there is a relationship here: quarterbacks who attend better schools have higher Wonderlic scores, though the relationship seems to hold only among top-50 schools.
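
As a rough check on that last claim, here's a sketch of a simple linear fit restricted to top-50 schools, compared against the fit over all ranked schools. It assumes the merged data frame built in the code at the bottom of the post; I haven't run these exact lines, so treat them as a starting point:

# (sketch) does institution rank predict Wonderlic score among top-50 schools?
# assumes the 'merged' data frame created in the code below
top50 = subset(merged, usnewsrank <= 50)
summary(lm(wonderlic ~ usnewsrank, data = top50))
# compare against the fit across all ranked schools:
summary(lm(wonderlic ~ usnewsrank, data = merged))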

There are a bunch of causal relationships that could give rise to this pattern, and we really don't have the data to separate these stories. It could be that better students choose better schools, or that attending some schools will increase a Wonderlic score. Alternatively, it may be that students at top schools are more likely to attend college for 4 years.

While the assessment is really preliminary, it looks like there might be something here.

I've attached my code below. At some point, I'll add the csv file with institutional strength (this information is freely available at US News as well).


# loading libraries:
library(plyr)
library(ggplot2)


# gathering wonderlic data:
library(XML)
url = "http://nflcombineresults.com/nflcombinedata.php?year=&pos=&college="
test = readHTMLTable(url)

dat = test[[1]]

## cleaning data:
names(dat) = tolower(names(dat))

# replacing spaces in variable names:
names(dat) = gsub(pattern = "\\s", replacement = ".", x = names(dat))

# adjusting vert.leap.(in)
names(dat)[10] = "vert.leap.in"

# cleaning individual columns:
dat$year = as.numeric(as.character(dat$year))
dat$name = as.character(dat$name)
dat$wonderlic = as.numeric(as.character(dat$wonderlic))
dat$bench.press = as.numeric(as.character(dat$bench.press))
dat$vert.leap.in = as.numeric(as.character(dat$vert.leap.in))


# separating out the individuals with wonderlic scores:
dat.sub = dat[!is.na(dat$wonderlic), ]

# reordering
dat.sub = dat.sub[order(dat.sub$wonderlic, decreasing = TRUE), ]

# examining the scores by position:
pos.dat = ddply(dat.sub, .(pos), summarise, mean.wonderlic = mean(wonderlic, 
    na.rm = TRUE), count = length(wonderlic))

# note: not really enough to compare by position.

qb.dat = dat.sub[dat.sub$pos == "QB", ]


# reading in qbschools:
qb.schools = read.csv("qbschools.csv")




# merging:
qb.dat$college = as.character(qb.dat$college)
qb.schools$school = as.character(qb.schools$school)

merged = merge(qb.dat, qb.schools, by.x = "college", by.y = "school")

names(merged) = tolower(names(merged))

merged$usnewsrank = as.numeric(as.character(merged$usnewsrank))


# to generate plot:
# p = ggplot(merged, aes(x = usnewsrank, y = wonderlic)) + geom_point() + geom_smooth()
# p + labs(title = "QB Wonderlic score \n by (academic) strength of undergraduate institution") +
#     xlab("US News and World Report institution rank (as of 2013)") + ylab("Wonderlic score")


# ggsave('wonderlic.jpg')