Sunday, February 8, 2015

Morse Code Converter

A few months ago, I finally got a chance to see The Imitation Game (the new Alan Turing movie), which gave me an idea for a Sunday morning R hacking session. The movie features a bunch of scenes with bustling rooms full of workers intercepting (and documenting) encrypted radio transmissions, which are then passed along to Turing's decryption device (the bombe). The process seemed to be:
  1. Listen to Morse code from a live wire
  2. Write down the series of short and long beeps on a piece of paper
  3. Hand the paper to someone who runs the code over to another tent, ultimately to be sent to the decryption device.
All this got me thinking – wouldn’t it be great if the ‘wire listening’ part of this process could be automated too? So here’s the toy problem all this made me think about: could I write an R function which would take a sound file (.wav) with Morse code, and return decrypted text?
To start on all of this, I found a site with a bunch of example Morse code sound files, maintained by The National Association for Amateur Radio (the files live here – http://www.arrl.org/code-practice-files).

To get these into R, I used the readWave function from Uwe Ligges' very cool tuneR package. The only bother here was needing to convert the sound files from .mp3 to .wav first, but ultimately that wasn't too bad.
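By the way, if you'd rather do the mp3-to-wav conversion without leaving R, here's a minimal sketch using tuneR itself; the file names below are just placeholders for wherever you've saved the download:

# a minimal sketch: convert .mp3 to .wav entirely within tuneR
# (file paths are placeholders -- point them at your own copies)
library(tuneR)
mp3.file = readMP3("140625_10WPM.mp3")   # read the mp3 into a Wave object
writeWave(mp3.file, "140625_10WPM.wav")  # write it back out as .wav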
Figuring out how to deal with the converted audio file was actually a lot of fun – after a few hours of tinkering, I arrived at a solution (copied below) which converts audio Morse code to text. It’s still pretty fragile because the example sound files I’ve been using are computer generated, so the function won’t really work on Morse code ‘in the wild’ yet.

If you'd like to give it a spin, try downloading one of the example Morse code files, e.g. http://www.arrl.org/files/file/Morse/Archive/10%20WPM/140625_10WPM.mp3

Next, convert the mp3 file to .wav (and make sure to change the file path for the sf.1 object to wherever the .wav file is stored on your machine), then execute my code. The solution starts with a slightly odd setup step (pulling a Morse reference table down from GitHub), but the rest should be pretty easy to read.

If you've got ideas for how to improve things, definitely let me know on Twitter – I'm @M_T_Patterson.
Here’s the code:
#### initialize ####
 
# clear workspace:
rm(list = ls())
 
## loading libraries:
library(tuneR)
library(RCurl)
 
morseref.url <- getURL("https://raw.githubusercontent.com/MarkTPatterson/Blog/master/Morse/morseref.csv", 
                       ssl.verifypeer = FALSE)
# reading as plain strings so the code lookups below behave predictably:
ref.df <- read.csv(text = morseref.url, stringsAsFactors = FALSE)
 
# helper function: variance of vec in a window of +/- s samples around t
var_find = function(vec, t, s){
  var.out = var(vec[(t-s):(t+s)])
  return(var.out)}
 
 
## loading reference files
 
## note: you'll need to change the file path for the sf.1 file
 
## sound file can be downloaded here:
## http://www.arrl.org/files/file/Morse/Archive/10%20WPM/140625_10WPM.mp3
 
## note: you'll need to convert the file to .wav for the function to work.
 
 
sf.1 = readWave("C:/Users/Mark/Desktop/RInvest/Morse Code/Sound Files/140625_10WPM.wav")
 
 
# defining the morse to text function:
m.to.text.func = function(sound.file){
 
  # read data into a dataframe
  df = data.frame(indx = 1:length(sound.file@left), vec = sound.file@left)
 
 
  # points to sample (stopping 50 samples short of the end so the +/- 50
  # window in var_find never runs past the end of the vector):
  sample.points = seq(from = 100, by = 100, to = length(df$vec) - 50)
 
  # applying the variance finder at the sampled points:
  tiny.df = data.frame(var = sapply(sample.points, 
                                    function(x){var_find(vec = df$vec,
                                                         t = x,
                                                         s = 50)}))
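 
  # (note: the sampled local variance acts as a crude envelope detector --
  # it is large while a tone is sounding and near zero during silence)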
 
  # decide which points are 'on' (tone present); this variance threshold is
  # tuned to the computer-generated practice files and may need adjusting:
  tiny.df$on = as.numeric(tiny.df$var > 100000)
  tiny.df$indx = 1:nrow(tiny.df)
 
 
 
  # create a vector of changes in on:
  raw.vec = diff(tiny.df$on)
 
  # create indices for change instances -- these will be 1 and -1
  beep.start.vals = which(raw.vec == 1)
  beep.stop.vals = which(raw.vec == -1)
 
  # converting indices to durations:
  beep.durs = beep.stop.vals - beep.start.vals
  pause.durs = beep.start.vals[-1] - beep.stop.vals[-length(beep.stop.vals)]
 
 
  ## note: for some files, there seem to be a few
  ## beep durations of only 1; for now, hard-coding these out:
 
 
  beep.durs = beep.durs[beep.durs>1]
  pause.durs = pause.durs[pause.durs>1]
 
 
 
  ## recoding beep durs 
 
  ## note: this step needs to take the beep.durs data and the pause.durs data
  ## and return duration barriers.
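 
  ## (for reference, standard Morse timing: a dot lasts 1 unit, a dash 3;
  ## gaps are 1 unit within a letter, 3 between letters, and 7 between
  ## words -- hence the 3 pause clusters and 2 beep clusters below)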
 
 
  ## first, creating pause barriers:
 
  raw.tab = table(pause.durs)
 
 
  pause.centers.raw = kmeans(as.numeric(names(raw.tab[raw.tab > 5])),3)$centers[,1]
 
  pause.centers = pause.centers.raw[order(pause.centers.raw,decreasing = F)]
  pause.levels = as.vector(pause.centers)
 
 
  # determining separator values:
  pause.sep.1 = mean(pause.levels[1:2])
  pause.sep.2 = mean(pause.levels[2:3])
 
 
  ## similar exercise for beep.durs:
  raw.tab = table(beep.durs)
  beep.centers.raw = kmeans(as.numeric(names(raw.tab[raw.tab > 5])),2)$centers[,1]
  beep.centers = beep.centers.raw[order(beep.centers.raw,decreasing = F)]
  beep.levels = as.vector(beep.centers)
 
  beep.sep = mean(beep.levels[1:2])
 
 
  ## creating the letter and word end vectors:
  letter.ends = which(pause.durs > pause.sep.1)
  word.ends = which(as.numeric(pause.durs[pause.durs > pause.sep.1] > pause.sep.2) == 1)
 
 
 
  # recoding beep durations to long and short:
  beep.durs.let = beep.durs
  beep.durs.let[beep.durs.let > beep.sep] = "l"
  beep.durs.let[beep.durs.let < beep.sep] = "s"
 
 
 
 
  ## grouping the beep duration letters (l's and s's) into letters
  ## based on the letter ends vector
  empty.list = list()
  start.val = 1
  for(i in 1:length(letter.ends)){
    cur.points = beep.durs.let[start.val:letter.ends[i]]
    empty.list[[i]] = paste(cur.points,collapse = "")
    start.val = letter.ends[i] + 1  
  }
 
  letter.vec = unlist(lapply(empty.list, function(x){ref.df$letter[which(ref.df$code == x)]}))
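 
  ## (lookup example: assuming the reference file encodes dots as 's' and
  ## dashes as 'l', a pattern like "sl" -- dot-dash -- maps to the letter A)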
 
 
  ## grouping letters into words based on word.ends vec:
  start.val = 1
  empty.list = list()
  for(i in 1:length(word.ends)){
    cur.points = letter.vec[start.val:word.ends[i]]
    empty.list[[i]] = paste(cur.points,collapse = "")
    start.val = word.ends[i] + 1  
  }
 
 
  ## saving as a new vector, with spacing:
  out = paste(unlist(empty.list),collapse = " ")
 
 
  return(out)
}
 
# examples:
m.to.text.func(sf.1)


Sunday, October 26, 2014

Quarterback Completion Heatmap Using dplyr

Several months ago, I found Bryan Povlinkski's (really nicely cleaned) dataset with 2013 NFL play-by-play information, based on data released by Brian Burke at Advanced Football Analytics.

I decided to browse QB completion rates based on Pass Location (Left, Middle, Right), Pass Distance (Short or Deep), and Down. I ended up focusing on the 5 quarterbacks with the most passing attempts.

The plot above (based on code below) shows a heatmap based on completion rate. Darker colors correspond to a better completion percentage.

Because we've only got data from one year, even looking at the really high-volume passers means the data are pretty sparse for some combinations of these variables. It's a little rough, but in these cases I decided not to plot anything. This plot could definitely be improved by plotting gray areas instead of white – see the sketch after the code below.

There are a few patterns here – first, it's interesting to look at each player's success with Short compared to Deep passes. Every player, as we would expect, has more success with Short rather than Deep passes, but this difference seems especially pronounced for Drew Brees (who seems to have more success with Short passes than the other players do). Brees also has pretty uniform completion rates across the three pass locations at short distance – most other players have slightly better completion rates to the outside, especially at short distance.

As we would expect, we can also see a fairly pronounced difference in completion rates for deep throws on 3rd down vs. 1st and 2nd down. The sample size is small, so the estimates aren't very precise, but the pattern is definitely there – probably best exemplified by Tom Brady's and Peyton Manning's data.

As a next step, it would be interesting to make the same plot with pass attempts rather than completion rates.

library(dplyr)
library(ggplot2)
 
# note: change path to the dataset
df = read.csv("C:/Users/Mark/Desktop/RInvest/nflpbp/2013 NFL Play-by-Play Data.csv",
              stringsAsFactors = F)
 
passers = df %>% 
  filter(Play.Type == "Pass") %>% 
  group_by(Passer) %>% 
  summarize(n.obs = length(Play.Type)) %>% 
  arrange(desc(n.obs))
top.passers = head(passers$Passer,5)
 
 
df %>% filter(Play.Type == "Pass",
              Passer %in% top.passers) %>% 
  mutate(Pass.Distance = factor(Pass.Distance, levels = c("Short","Deep"))) %>%
  group_by(Down, Passer, Pass.Location, Pass.Distance) %>%
  summarize(share = sum(Pass.Result == "Complete") / length(Pass.Result),
            n.obs = length(Pass.Result)) %>%
  filter(n.obs > 5) %>%
  ggplot(., aes(Pass.Location, Pass.Distance)) + geom_tile(aes(fill = share),
                                                           colour = "white") + 
  facet_wrap(Passer ~ Down, ncol = 3) +
  scale_fill_gradient(low = "white", high = "steelblue", limits = c(0,1)) + theme_bw() +
  ggtitle("NFL QB completion by Pass Distance, Location, and Down")


Wednesday, August 27, 2014

The first rule of brainstorming is..

About a year ago I was at a workshop on behavioral economics and public policy, and Mike Norton, who was leading the session, laid out a 'first rule' of brainstorming that has really resonated with me ever since. Mike's first rule was taken from the rules of improv comedy -- no matter how ridiculous / impractical / nonsensical the idea a person comes up with, you respond with "Yes, and..." There's nothing that kills a brainstorming session like someone pointing out that a particular candidate solution won't work, or doesn't quite solve the initial problem. As well, it's often the really off-the-wall ideas that actually lead to solutions.

Tuesday, August 26, 2014

Replication wiki for econ papers

I found this on Andrew Gelman's Blog -- a page for replications of experimental results in economics. This seems like a great idea! Here's the link

Prizes in public health -- the new kaggle?

This morning I saw a short piece by James Surowiecki on the absence of a vaccination for the Ebola virus. Surowiecki points out that the incentive structure for pharmaceutical companies rewards work on drugs that are i) likely to be taken by a large number of people, ii) likely to be taken by Westerners, and iii) likely to be taken over a long period of time. This means, Surowiecki argues, that under the present incentive scheme we are unlikely to develop cures for things like Ebola, which is essentially confined to the developing world, and until recently did not affect many people.

As a possible solution, Surowiecki offers 'prizes' -- sponsored by governments -- which compensate firms in exchange for the right to manufacture the resulting pharmaceuticals. Put differently, the government can intervene on the market value of development in these areas by paying 'more than the going rate.'

I think Surowiecki is surely right about the potential benefits of this sort of approach -- paying companies more will definitely give them an incentive to change their research priorities. Thinking about all of this, though, reminded me of the kaggle-style data competitions that are growing increasingly popular. Here, a problem is posted (sometimes with a large prize for the best solution), and data scientists of all stripes work on solutions. The competition that really vaulted these into the mainstream was the Netflix Prize -- offering $1 million for the best improvement in predicting film ratings. I remember going to a talk a few years ago by the 'BellKor' team that ended up winning the competition, where the winners remarked that while the prize seemed big, the tremendous time the team put into the challenge meant that they were actually working at a really low wage rate -- and they were the ones who actually won!

I bring all this up because it seems that, at least in the case of data competitions, it's not really the money that's driving entry (or work) on these problems. The competition, collaboration, and social value of doing well seem like much more important causes. Now, data science is quite different from pharmaceutical research -- the start-up costs are MUCH lower, and it can be a much more individual activity -- but after reading Surowiecki's article, I'm especially curious to see whether some of the non-monetary incentives we're seeing at work in the data science world might emerge if public health adopts a similar incentive structure around sponsored prizes.

Sunday, May 18, 2014

T-Shirts ... designed with R!

On Friday, I saw David Smith's post on a competition to design this year's useR! conference t-shirt. The goal is to create a design generated using an R script, which will be featured on the back of the shirt.

Having a bit of time this weekend, I decided to try plotting the R logo, using base graphics, represented by a scatter of points – one for each package published on CRAN. Having recently posted on a very similar idea for visualizing twitter followers, I realized I could take advantage of my past code. With a bit of tweaking, I came up with the image above.

The code I used is below – you should feel free to tweak / improve / experiment with it! Before running, you'll need to install the EBImage package, available on Bioconductor. Roughly, the script works by first downloading a copy of the R logo (this step might make the entry illegal for the purposes of the contest...), as well as the current number of R packages. Next, there are a few functions to simplify the colors presented in the image – this part probably isn't necessary, but I think it makes the final result look a bit better. Finally, the image is actually generated by sampling pixels from the modified image, and replotting.

If you're interested in trying out your own ideas (which you definitely should!) you can submit entries to the contest as pull requests on GitHub.

# Script by Mark T Patterson
# May 17, 2014
# twitter: @M_T_Patterson
 
# General Notes:
 
# This script creates an image of the R logo 
# represented by n points, 
# where n is the current number of packages on CRAN
 
 
# note: this script requires the EBImage package
# available from bioconductor:
# http://bioconductor.wustl.edu/bioc/html/EBImage.html
 
# approximate run time: 2 mins
 
#### initialize ####
 
# clear workspace
rm(list = ls())
 
 
# load libraries
library(EBImage)
 
# set the seed so the random sampling (and hence the final image) is reproducible:
set.seed(2014)
 
#### gather web data: reference image and CRAN package count ####
 
# load the R logo, save the rgb values (dropping the alpha channel):
img = readImage("http://www.thinkr.spatialfiltering.com/images/Rlogo.png")
img.2 = img[,,1:3]
 
cran.site = "http://cran.r-project.org/web/packages/"
lns = readLines(cran.site)
ref.line = grep(lns, pattern = "CRAN package repository features")
package.count = as.numeric(strsplit(lns[ref.line],split = "\\s")[[1]][7])
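 
# (note: scraping the count this way is fragile; a sturdier alternative,
# assuming a CRAN mirror is set, would be:
# package.count = nrow(available.packages())
# the scraped version above is kept to match the original approach)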
 
 
#### helper functions ####
 
# functions for color simplification:
num.to.let = function(x1){
  ref.dat = data.frame(num = 10:15, let = LETTERS[1:6])
  out = as.character(x1)
  if(x1 %in% 10:15){out = as.character(ref.dat$let[which(ref.dat$num == x1)])}
  return(out)
}
 
rgb.func = function(vec){
  #note: vec is a triple of color intensities
  r1 = floor(255*vec[1])
  g1 = floor(255*vec[2])
  b1 = floor(255*vec[3])
 
  x1 = r1 %/% 16
  x2 = r1 %% 16
  x3 = g1 %/% 16
  x4 = g1 %% 16
  x5 = b1 %/% 16
  x6 = b1 %% 16
 
  x1 = num.to.let(x1)
  x2 = num.to.let(x2)
  x3 = num.to.let(x3)
  x4 = num.to.let(x4)
  x5 = num.to.let(x5)
  x6 = num.to.let(x6)
 
  out = paste("#",x1,x2,x3,x4,x5,x6, sep = "")
  return(out)
 
}
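 
# (aside: base R's rgb() does this hex conversion directly -- e.g. both
# rgb.func(c(1, 0, 0)) and rgb(1, 0, 0) return "#FF0000"; rgb.func is
# kept here to show the arithmetic explicitly)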
 
 
im.func.1 = function(image, k.cols = 5, samp.val = 3000){
  # creating a dataframe:
  test.mat = matrix(image,ncol = 3)
  df = data.frame(test.mat)
  colnames(df) = c("r","g","b")
  df$y = rep(1:dim(image)[1],dim(image)[2])
  df$x = rep(1:dim(image)[2], each = dim(image)[1])
 
  samp.indx = sample(1:nrow(df),samp.val)
  work.sub = df[samp.indx,]
 
  # extracting colors:
  k2 = kmeans(work.sub[,1:3],k.cols)
 
  # adding centers back:
  fit.test = fitted(k2)
 
  work.sub$r.pred = fit.test[,1]
  work.sub$g.pred = fit.test[,2]
  work.sub$b.pred = fit.test[,3]
 
  return(work.sub)
 
}
 
add.cols = function(dat){
  apply(dat,1,rgb.func)
}
 
# general plotting function
plot.func = function(dat){
  # assumes dat has columns x, y, and cols
  plot(dat$y,max(dat$x) - dat$x, col = dat$cols,
       main = "A point for each CRAN package",
       xaxt='n',
       yaxt="n",
       xlab = "useR!",
       ylab = "2014",
       cex.lab=1.5,
       cex.axis=1.5,
       cex.main=1.5,
       cex.sub=1.5)
}
 
#### simplify colors; sample n points ###
 
temp = im.func.1(img.2, samp.val = 25000, k.cols = 12)
temp$cols = add.cols(temp[,6:8])
 
final = temp[sample(1:nrow(temp), package.count),]
 
 
#### generate plot ####
 
plot.func(final)
