Sunday, May 18, 2014

T-Shirts ... designed with R!

On Friday, I saw David Smith's post on a competition to design this year's useR! conference t-shirt. The goal is to create a design generated using an R script, which will be featured on the back of the shirt.

Having a bit of time this weekend, I decided to try plotting the R logo, using base graphics, represented by a scatter of points – one for each package published on CRAN. Having recently posted on a very similar idea for visualizing twitter followers, I realized I could take advantage of my past code. With a bit of tweaking, I came up with the image above.

The code I used is below – you should feel free to tweak / improve / experiment with it! Before running, you'll need to install the EBImage package available on bioconductor. Roughly, the script works by first downloading a copy of the R logo (this step might make the entry illegal for the purposes of the contest..), as well as the current number of R packages. Next, there are a few functions to simplify the colors presented in the image – this part probably isn't necessary, but I think it makes the final result look a bit better. Finally, the image is actually generated by sampling pixels from the modified image, and replotting.

If you're interested in trying out your own ideas (which you definitely should!) you can submit entires to the contest as pull requests on Github.

# Script by Mark T Patterson
# May 17, 2014
# twitter: @M_T_Patterson
 
# General Notes:
 
# This script creates an image of the R logo 
# represented by n points, 
# where n is the current number of packages on CRAN
 
 
# note: this script requries the EBImage package
# available from bioconductor:
# http://bioconductor.wustl.edu/bioc/html/EBImage.html
 
# approximate run time: 2 mins
 
#### initialize ####
 
# clear workspace
rm(list = ls())
 
 
# load libraries
library(EBImage)
 
# coordinate the version of the program:
set.seed(2014)
 
#### gather web data: reference image and CRAN package count ####
 
# load the R logo, save the rgb values:
img = readImage("http://www.thinkr.spatialfiltering.com/images/Rlogo.png")
img.2 = img[,,1:3]
 
cran.site = "http://cran.r-project.org/web/packages/"
lns = readLines(cran.site)
ref.line = grep(lns, pattern = "CRAN package repository features")
package.count = as.numeric(strsplit(lns[ref.line],split = "\\s")[[1]][7])
 
 
#### helper functions ####
 
# functions for color simplification:
num.to.let = function(x1){
  ref.dat = data.frame(num = 10:15, let = LETTERS[1:6])
  out = as.character(x1)
  if(x1 %in% 10:15){out = as.character(ref.dat$let[which(ref.dat$num == x1)])}
  return(out)
}
 
rgb.func = function(vec){
  #note: vec is a triple of color intensities
  r1 = floor(255*vec[1])
  g1 = floor(255*vec[2])
  b1 = floor(255*vec[3])
 
  x1 = r1 %/% 16
  x2 = r1 %% 16
  x3 = g1 %/% 16
  x4 = g1 %% 16
  x5 = b1 %/% 16
  x6 = b1 %% 16
 
  x1 = num.to.let(x1)
  x2 = num.to.let(x2)
  x3 = num.to.let(x3)
  x4 = num.to.let(x4)
  x5 = num.to.let(x5)
  x6 = num.to.let(x6)
 
  out = paste("#",x1,x2,x3,x4,x5,x6, sep = "")
  return(out)
 
}
 
 
im.func.1 = function(image, k.cols = 5, samp.val = 3000){
  # creating a dataframe:
  test.mat = matrix(image,ncol = 3)
  df = data.frame(test.mat)
  colnames(df) = c("r","g","b")
  df$y = rep(1:dim(image)[1],dim(image)[2])
  df$x = rep(1:dim(image)[2], each = dim(image)[1])
 
  samp.indx = sample(1:nrow(df),samp.val)
  work.sub = df[samp.indx,]
 
  # extracting colors:
  k2 = kmeans(work.sub[,1:3],k.cols)
 
  # adding centers back:
  fit.test = fitted(k2)
 
  work.sub$r.pred = fit.test[,1]
  work.sub$g.pred = fit.test[,2]
  work.sub$b.pred = fit.test[,3]
 
  return(work.sub)
 
}
 
add.cols = function(dat){
  apply(dat,1,rgb.func)
}
 
# general plotting function
plot.func = function(dat){
  # assumes dat has colums x, ym cols
  plot(dat$y,max(dat$x) - dat$x, col = dat$cols,
       main = "A point for each CRAN package",
       xaxt='n',
       yaxt="n",
       xlab = "useR!",
       ylab = "2014",
       cex.lab=1.5,
       cex.axis=1.5,
       cex.main=1.5,
       cex.sub=1.5)
}
 
#### simplify colors; sample n points ###
 
temp = im.func.1(img.2, samp.val = 25000, k = 12)
temp$cols = add.cols(temp[,6:8])
 
final = temp[sample(1:nrow(temp), package.count),]
 
 
#### generate plot ####
 
plot.func(final)

Created by Pretty R at inside-R.org

Friday, May 2, 2014

Function for rounding a group of numbers

Sometimes when I'm creating summary statistics for factor variables (usually demographics), I find I need to round percentages a bit. If I round each number individually, I occasionally (and frustratingly) change the total sum. For example, suppose I've got information on how many individuals are in each of four groups:

group.totals = c(13, 39, 16, 11)

and I'd like to report the distribution as a share of the total number of individuals:

(tab = prop.table(group.totals))
## [1] 0.1646 0.4937 0.2025 0.1392

however, I only want to report 2 significant digits after the decimal:

(rounded.tab = round(tab, 2))
## [1] 0.16 0.49 0.20 0.14

Here, the rounding process (annoyingly) changes the sum:

sum(tab)
## [1] 1
sum(rounded.tab)
## [1] 0.99

To fix this (a bit), here's a quick function which rounds a group of numbers together:

round.group = function(vec, digits) {
    r.vec = round(vec, digits)
    total.resid = sum(vec) - sum(r.vec)
    sq.diffs = ((r.vec + total.resid) - vec)^2
    indx = which.min(sq.diffs)
    r.vec.copy = r.vec
    r.vec.copy[indx] = r.vec.copy[indx] + total.resid
    out = r.vec.copy
    return(out)
}

This solves some of the problems:

(group.rounded.tab = round.group(tab, 2))
## [1] 0.17 0.49 0.20 0.14
sum(group.rounded.tab)
## [1] 1

But has sort of unusual behavior for some inputs:

bug.vec = c(0.4, 0.4, 0.4, 0.4, 9.2, 9.2)
round.group(bug.vec, 0)
## [1] 2 0 0 0 9 9

Despite being a bit buggy, this function does well enough for my purposes.. if you'd like to find a better version, or are generally interested, here's a link to a nice discussion on group rounding at stackoverflow.

http://stackoverflow.com/questions/792460/how-to-round-floats-to-integers-while-preserving-their-sum