Thursday, November 7, 2013

College Basketball: Presence in the NBA over Time

Interested in practicing a bit of web-scraping, I decided to make use of a nice dataset provided by Databasebasketball.com in order to examine the representation of various college programs in the NBA/ABA over time. This dataset only includes retired players, and ends in 2010, so I decided to only plot data through 2000.

Originally, I was excited to try out a googleVis motion chart using this data, but the result turned out less exciting than I expected.

Here, I've restricted my attention to programs which (at some point) had at least 11 players in the league simultaneously. This turns out to limit the inclusion to a handful of programs.
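As a rough illustration of that cutoff, here is a minimal sketch on toy data. The school names and counts are invented for the example and do not come from the dataset; the idea is simply to take each school's peak yearly count and keep the schools whose peak reaches 11.

```r
# Toy per-year player counts (invented numbers, not from the dataset)
counts <- data.frame(
  school  = rep(c("School A", "School B", "School C"), each = 3),
  year    = rep(1998:2000, times = 3),
  players = c(9, 11, 12,  4, 5, 6,  10, 10, 10)
)

# Peak simultaneous player count for each school
peak <- tapply(counts$players, counts$school, max)

# Keep only the schools whose peak reaches at least 11
keep <- names(peak)[peak >= 11]
subset_counts <- counts[counts$school %in% keep, ]
```

Note that a school is kept for all of its years once it clears the threshold in any single year, which is why the plotted lines also cover the lean periods.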

While enthusiasts of NBA history surely will not need this plot to recall when each school had a strong presence in the league, I think the plot nicely captures the story behind several programs. It's easy to see the relatively recent emergence of Georgia Tech and Arizona, the slow climb of UNC and Michigan, and the powerhouse years of Kentucky (1950s) and UCLA (1980s).

The code that generates the plot is below.

# data scrape:
site = "http://www.databasebasketball.com/players/playerbycollege.htm"
 
# turn off warnings:
options(warn = -1)
 
# readlines:
tab = readLines(site)
 
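# turn one raw HTML link line from the index page into an absolute school URL:
trim = function(x){
  # drop the fixed-width markup at both ends of the line
  temp = substr(x,9,nchar(x)-8)
  # keep the relative path, i.e. everything before the first ">"
  temp2 = strsplit(temp,split = ">")[[1]][1]
  paste("http://www.databasebasketball.com",temp2,sep = "")
}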
 
sub = tab[81:553]
sites = sapply(sub, trim)
 
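# for each school page, pull the "YYYY-YYYY" career spans from the player rows:
 
dates.grab = function(s1){
  temp = readLines(s1)
  # the player list sits between these two marker lines on each page
  start = grep("listed separately",temp)
  end = grep("font class=foot",temp)
  sub = temp[(start+3):(end-2)]
  # a career span is a run of digits, a hyphen, and another run of digits
  pattern = "[[:digit:]]+-[[:digit:]]+"
  m = gregexpr(pattern, sub)
  unlist(regmatches(sub,m))
}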
 
 
df = data.frame(unlist(lapply(sites,dates.grab)))
 
 
names(df) = c("years")
 
 
test = rownames(df)[1]
 
clean.school = function(name){
  temp = strsplit(name,split = ">")[[1]][2]
  substr(temp,1,nchar(temp)-3)
}
 
df$school = unlist(lapply(rownames(df),clean.school))
rownames(df) = 1:nrow(df)
 
# split "YYYY-YYYY" into its two four-digit years
df$year.start = unlist(lapply(df$years, function(x){substr(x,1,4)}))
df$year.end = unlist(lapply(df$years, function(x){substr(x,6,9)}))
 
df = df[,2:4]
 
df$year.start = as.numeric(df$year.start)
df$year.end = as.numeric(df$year.end)
 
min(df$year.start)
max(df$year.end)
 
 
# indicator: 1 if test.year falls within a player's career (start to end), else 0
 
was.playing.func = function(years,test.year){
  as.numeric(test.year %in% years[1]:years[2])
}
 
 
# one indicator column per year, 1946 through 2010 (65 years)
 
mat = matrix(rep(NA,nrow(df)*65),ncol = 65)
 
for(i in 1:65){
  mat[,i] = apply(df[,2:3],1,function(x){was.playing.func(x,(i + 1945))})
}
 
 
copy = df
 
copy = cbind(copy, mat)
 
names(copy)[4:ncol(copy)] = 1946:2010
 
 
 
library(reshape)
mdata = melt(copy, id = "school")
 
# drop the melted year.start / year.end rows; logical indexing is safe even when nothing matches
mdata = mdata[!(mdata$variable %in% c("year.start","year.end")),]
 
names(mdata) = c("school","year","players")
 
mdata$year = as.numeric(as.character(mdata$year))
 
 
library(plyr)
comb = ddply(mdata,.(school,year),summarise,tot.players = sum(players))
 
 
# looking at a subset:
comb2 = comb[comb$year<2001,]
top.sub = unique(comb2$school[which(comb2$tot.players > 10)])
 
df2 = comb2[which(comb2$school %in% top.sub),]
 
library(ggplot2)
 
p = ggplot(df2, aes(x = year, y = tot.players, col = school)) + geom_line(lwd = 2) +
  facet_grid(school~.) +
  ylab("players in the NBA/ABA") +
  # hide the facet strip labels; the colour legend already names each school
  theme(strip.text.y = element_blank())
print(p)
 
 
ggsave(file = "topCollegeNBA.png",height = 8)
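
The year-by-year loop above can also be written without an explicit loop. The following is only a sketch of an alternative, not the code used for the plot, and it runs on invented toy careers that mimic the `school`, `year.start` and `year.end` columns of `df`. It builds the same kind of 0/1 activity matrix with `outer()` and then sums the rows within each school using `rowsum()`.

```r
# Toy careers (invented), using the same column layout as df above
careers <- data.frame(
  school     = c("School A", "School A", "School B"),
  year.start = c(1946, 1948, 1947),
  year.end   = c(1949, 1950, 1947)
)
years <- 1946:1950

# active[i, j] is TRUE when player i was in the league during years[j]
active <- outer(careers$year.start, years, "<=") &
          outer(careers$year.end,   years, ">=")

# Sum the indicator rows within each school:
# the result has one row per school and one column per year
per.school <- rowsum(active * 1, group = careers$school)
colnames(per.school) <- years
```

The result is in wide form (schools by years), so it would still need reshaping to long form before being passed to ggplot.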

