Wednesday, November 6, 2013

Twitter follower counts are log-normal

Continuing my investigation of Jeff Gentry's twitteR package, I decided to take a look at how follower counts are distributed among Twitter users.

As a rough starting point, I examined the follower counts of the people who follow me: I first gathered a data frame of all my followers, then looked at how many followers each of them has. Fortunately, Jeff's package makes this really easy (see my code below).

Since I was expecting a distribution with a very long right tail, I decided to plot the logarithm of the number of followers.

The log counts turned out to be almost perfectly normally distributed, which means the raw follower counts are roughly log-normal. That was surprising given my small sample size (I have about 650 followers).
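A density plot only gives a visual impression of normality. Here is a minimal sketch of two more direct checks, a Q-Q plot and a Shapiro-Wilk test, both from base R. It does not use my real follower data: the `followers` vector is simulated, and the `meanlog` and `sdlog` values are arbitrary assumptions picked for illustration.

```r
# Hypothetical stand-in for the real follower counts: 650 simulated values
# drawn from a log-normal distribution (parameters are assumptions).
set.seed(42)
followers <- round(rlnorm(650, meanlog = 6, sdlog = 1.5))

# log(0) is -Inf, so accounts with zero followers have to be dropped first
followers <- followers[followers > 0]
log_followers <- log(followers)

# Q-Q plot: points lying close to the line suggest approximate normality
qqnorm(log_followers)
qqline(log_followers)

# Shapiro-Wilk test: a small p-value is evidence against normality
sw <- shapiro.test(log_followers)
print(sw$p.value)
```

With real data, `followers` would be replaced by the follower-count column of the data frame. The Shapiro-Wilk test in R accepts between 3 and 5000 observations, so a sample of about 650 is well within range.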

To give a sense of scale, I added the log follower counts of some famous folks (and plotted my own as well).

# note: you'll need to save the 'credentials' file, and load
# it before you can access twitter data.  
# for help with this, see this post: 
library(twitteR)
library(ggplot2)

me = getUser("M_T_Patterson", cainfo = "cacert.pem")
#this works
# What can I learn about a user?
me$getFavorites(cainfo = "cacert.pem")
fl = me$getFollowers(cainfo = "cacert.pem")
df = data.frame(name = sapply(fl,function(x) x$screenName),
                id = sapply(fl,function(x) x$id),
                last.status = sapply(fl,function(x) x$lastStatus$created),
                followers = sapply(fl,function(x) x$followersCount),
                location = sapply(fl,function(x) x$location))
# sorting by number of followers:
df.f = df[order(df$followers,decreasing = TRUE),]
#(Not run)
#p = ggplot(df.f, aes(x = log(followers))) + geom_density()
#p + geom_text(aes(log(refs$followers), y = 0.3, label = refs$name, fill = "blue", size = 5))
# this is interesting -- it looks like a log-normal distribution.
# adding some references:
refs = data.frame(
  name = c("Graduate Student:\nMark Patterson",
           "Famous R Statistician:\nHadley Wickham",
           "Famous Journalist:\nThomas Friedman",
           "Famous Heartthrob:\nJustin Bieber"),
  followers = c(656,5446,234686,46602072))
# a bit more on the density of the distribution at various points:
dens = density(log(df.f$followers))
refs$log.followers = log(refs$followers)
# look up the density estimate at the grid point closest to val:
dens.lookup = function(val){
  dens$y[which.min(abs(val - dens$x))]
}
refs$dens = sapply(refs$log.followers, dens.lookup)
p = ggplot(df.f, aes(x = log(followers))) + geom_density()
# label each reference account at its position on the density curve
p + geom_text(data = refs, aes(x = log.followers, y = dens, label = name),
              size = 5, color = "blue") +
  theme(legend.position = "none") + scale_x_continuous(limits = c(0,20)) +
  labs(title = "Density of log(Followers) on twitter")
