Friday, April 11, 2014

Rblogger Posting Patterns Analyzed with R

I've been a big fan of rbloggers for quite some time, but have only recently started contributing myself. After my first post yesterday, I immediately started wondering how long most other bloggers go between posts.

I decided to gather the list of past posts to rbloggers to investigate a bit. I've posted the data (as of yesterday evening) here. I'm a bit new to GitHub, but the file (RBloggersData.csv) should be there.

I started by using plyr to calculate the average delay between each author's posts. This distribution is heavily right-skewed, and looks fairly normal (or at least mound-shaped; see plot above) once logged. Depending on how 0s are handled, the average log delay between posts is around 3.5 to 3.75. Back-transformed, that is roughly 33 to 43 days (exp(3.5) ≈ 33, exp(3.75) ≈ 43), so a typical author posts a little less often than once a month.
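To make the effect of the zero handling concrete, here is a minimal sketch using made-up delays rather than the rbloggers data. The vector avg.delay and the names mean.drop and mean.zeroed are purely illustrative. It compares dropping zero-delay authors with setting their log delay to 0, then converts the log means back to days.

```r
# Hypothetical average delays (in days) for six authors; NOT the real data
avg.delay = c(0, 7, 20, 45, 90, 365)

log.delay = log(avg.delay)   # log(0) is -Inf

# Option 1: drop the zero-delay authors entirely
mean.drop = mean(log.delay[is.finite(log.delay)])

# Option 2: set their log delay to 0 (equivalent to a one-day delay)
log.zeroed = replace(log.delay, log.delay == -Inf, 0)
mean.zeroed = mean(log.zeroed)

# Back-transform the log means to days (these are geometric means)
days = exp(c(drop = mean.drop, zeroed = mean.zeroed))
print(days)

# The range quoted in the post, converted to days
post.range = round(exp(c(3.5, 3.75)), 1)
print(post.range)
```

Because the mean is taken on the log scale, the back-transformed value is a geometric mean, which is less sensitive to the long right tail than the plain average of the delays.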

Next, still being pretty new to blogging, I wondered which day of the week most people post on. The distribution shows that weekends have markedly fewer posts than weekdays, and there's a fairly strong downward trend over the course of the week. I'm guessing most people (like me) end up experimenting with data over the weekend and scraping together a post for Monday. (See first figure below.)
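The weekday tally itself needs nothing beyond base R. Here is a minimal sketch with made-up dates; post.dates, dow.counts and weekend.share are illustrative names, not part of the analysis. It uses the "%u" format code (ISO weekday number, 1 = Monday) so the result does not depend on the locale's day names.

```r
# Hypothetical post dates; NOT the rbloggers data
post.dates = as.Date(c("2014-04-07", "2014-04-07", "2014-04-08",
                       "2014-04-10", "2014-04-12", "2014-04-14"))

# ISO weekday number: 1 = Monday, ..., 7 = Sunday
dow.num = as.integer(format(post.dates, "%u"))

# Ordered factor so that days with no posts still show up with a count of 0
dow = factor(dow.num, levels = 1:7,
             labels = c("Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"))

dow.counts = table(dow)
print(dow.counts)

# Share of posts that fall on a weekend
weekend.share = sum(dow.counts[c("Sat", "Sun")]) / sum(dow.counts)
print(weekend.share)
```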

Finally, even though I've been following the feed of rbloggers posts for a while, I'd never really tracked the total number of posts per day. When I aggregated the data to the day level, I was surprised to find what explosive growth the site had starting around 2009. After fitting a nonparametric regression line (see second figure below), we can see the average posts per day roughly double from 2009 to 2010, and double again between 2010 and 2012! Below are the figures and the code used to generate them.

load("C:/Users/Mark/Desktop/RInvest/WebScraping/rblogger.RData")
 
library(ggplot2)
library(plyr)
library(lubridate)
library(np)
 
# Average gap (in days) between consecutive posts by one author
find.avg = function(post.inputs){
  # An author with a single post has no gap to measure
  if(length(post.inputs) < 2){return(NA)}
  # Sort first so the gaps are positive whatever order the rows arrive in
  diffs = diff(sort(post.inputs))
  units(diffs) = "days"
  mean(diffs)
}
 
 
delay.frame = ddply(base,.(author),summarize,
           avg.delay = round(as.numeric(find.avg(date.format)),2),
           tot.posts = length(date.format))
 
 
p = ggplot(delay.frame,aes(x = log(avg.delay))) + geom_density()
p + xlab("average delay between posts (log days)") + theme_bw()
ggsave("avgDelay.png")
 
 
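# Authors whose posts all share one date have avg.delay 0, so log() gives -Inf.
# Here those are set to 0 on the log scale; dropping them instead raises the mean.
log.delay = log(delay.frame$avg.delay)
log.delay[which(log.delay == -Inf)] = 0
 
# Single-post authors have NA delay and are excluded
mean(log.delay,na.rm = TRUE)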
 
 
 
base$dow = wday(base$date.format, label = TRUE)
base$month = month(base$date.format, label = TRUE)
base$year = year(base$date.format)
 
 
p = ggplot(base, aes(x = dow)) + geom_bar(fill = "blue")
p + theme_bw() + xlab("day of week") + ylab("total posts")
ggsave("dayOfWeek.png")
 
 
## how many posts per day?
 
post.per.day.frame = ddply(base,.(date.format),
                           summarize,
                           tot.posts = length(title))
 
 
# days elapsed since the first post in the data
post.per.day.frame$time = as.numeric(difftime(post.per.day.frame$date.format,
                                   min(post.per.day.frame$date.format),
                                   units = "days"))
 
np.1 = npreg(tot.posts ~ time, data = post.per.day.frame)
 
post.per.day.frame$pred = predict(np.1, newdata = post.per.day.frame)
 
p = ggplot(post.per.day.frame, aes(x = date.format,
                                   y = tot.posts)) + geom_point() +
  geom_line(aes(x = date.format, y = pred), color = "red", size = 2)
 
p + theme_bw() + xlab("date") + ylab("total posts")
ggsave("postsPerDay.png")
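
For anyone without the np package, a similar smooth can be sketched with ksmooth from base R's stats package. This is a stand-in for the npreg fit above, not the same estimator: the bandwidth here is fixed by hand, and the data are simulated rather than taken from rbloggers. The names fit, early and late are illustrative.

```r
set.seed(1)

# Simulated daily post counts whose mean rises over time; NOT the real data
time = 0:199
tot.posts = rpois(length(time), lambda = 1 + time / 40)

# Nadaraya-Watson kernel smoother with a hand-picked bandwidth (an assumption)
fit = ksmooth(time, tot.posts, kernel = "normal",
              bandwidth = 30, x.points = time)

# Compare the smoothed level early and late in the series
early = mean(fit$y[time < 50])
late = mean(fit$y[time >= 150])
print(late / early)
```

One caveat carries over to the real analysis: grouping by date only keeps days that had at least one post, so days with zero posts never enter the fit.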
