Sunday, February 8, 2015

Morse Code Converter

A few months ago, I finally got a chance to see The Imitation Game (the new Alan Turing movie), which gave me an idea for a Sunday morning R hacking session. The movie features a bunch of scenes with bustling rooms full or workers intercepting (and documenting) encrypted radio transmissions, which are then passed along to Turing’s decryption device (bombe). The process seemed to be:
  1. Listen to Morse code from a live wire
  2. Write down the series of short and long beeps on a piece of paper
  3. Hand the paper to someone who runs the code over to another tent, ultimately to be sent to the decryption device.
All this got me thinking – wouldn’t it be great if the ‘wire listening’ part of this process could be automated too? So here’s the toy problem all this made me think about: could I write an R function which would take a sound file (.wav) with Morse code, and return decrypted text?
To start on all of this, I found a site with a bunch of example Morse code sound files, maintained by The National Association for Amateur Radio (there are a bunch of files here – http://www.arrl.org/code-practice-files).

To get these into R, I used the readWave function from Uwe Ligges’ very cool tuneR package. The only bother here was needing to convert the sound files from .mp3 to .wav first.. ultimately not that bad though.
Figuring out how to deal with the converted audio file was actually a lot of fun – after a few hours of tinkering, I arrived at a solution (copied below) which converts audio Morse code to text. It’s still pretty fragile because the example sound files I’ve been using are computer generated, so the function won’t really work on Morse code ‘in the wild’ yet.

If you’d like to give it a spin, try downloading one of the example morse code files e.g. http://www.arrl.org/files/file/Morse/Archive/10%20WPM/140625_10WPM.mp3

Next, convert the mp3 file to .wav (and make sure to change the file path for the sf.1 object to wherever the .wav file is stored on your machine), then execute my code. The solution starts with a bit of a strange stamp, but the rest should be pretty easy to read.

If you’ve got ideas for how to improve things, definitely let me know on twitter – I’m @M_T_Patterson.
Here’s the code:
#### initialize ####
 
# clear workspace:
rm(list = ls())
 
## loading libraries:
library(tuneR)
library(RCurl)
 
morseref.url <- getURL("https://raw.githubusercontent.com/MarkTPatterson/Blog/master/Morse/morseref.csv", 
                       ssl.verifypeer = FALSE)
ref.df <- read.csv(text = morseref.url)
 
# helper function:
var_find = function(vec, t, s){
  var.out = var(vec[(t-s):(t+s)])
  return(var.out)}
 
 
## loading reference files
 
## note: you'll need to change the file path for the sf.1 file
 
## sound file can be downloaded here:
## http://www.arrl.org/files/file/Morse/Archive/10%20WPM/140625_10WPM.mp3
 
## note: you'll need to convert the file to .wav for the function to work.
 
 
sf.1 = readWave("C:/Users/Mark/Desktop/RInvest/Morse Code/Sound Files/140625_10WPM.wav")
 
 
# defining the morse to text function:
m.to.text.func = function(sound.file){
 
  # read data into a dataframe
  df = data.frame(indx = 1:length(sound.file@left), vec = sound.file@left)
 
 
  # points to sample:
  sample.points = seq(from = 100, by = 100, to = length(df$vec))
 
  # applying the variance finder at the sampled points:
  tiny.df = data.frame(var = sapply(sample.points, 
                                    function(x){var_find(vec = df$vec,
                                                         t = x,
                                                         s = 50)}))
 
  # decide which points are 'on'
  tiny.df$on = as.numeric(tiny.df$var > 100000)
  tiny.df$indx = 1:nrow(tiny.df)
 
 
 
  # create a vector of changes in on:
  raw.vec = diff(tiny.df$on)
 
  # create indices for change instances -- these will be 1 and -1
  beep.start.vals = which(raw.vec == 1)
  beep.stop.vals = which(raw.vec == -1)
 
  # converting indices to durations:
  beep.durs = beep.stop.vals - beep.start.vals
  pause.durs = beep.start.vals[-1] - beep.stop.vals[-length(beep.stop.vals)]
 
 
  ## note: for some files, there seems to be a few 
  ## few beep durs that are only 1; for now, hard coding these out:
 
 
  beep.durs = beep.durs[beep.durs>1]
  pause.durs = pause.durs[pause.durs>1]
 
 
 
  ## recoding beep durs 
 
  ## note: this step needs to  take the beep.durs data and the pause.durs data
  ## and return duration barriers.  
 
 
  ## first, creating pause barriers:
 
  raw.tab = table(pause.durs)
 
 
  pause.centers.raw = kmeans(as.numeric(names(raw.tab[raw.tab > 5])),3)$centers[,1]
 
  pause.centers = pause.centers.raw[order(pause.centers.raw,decreasing = F)]
  pause.levels = as.vector(pause.centers)
 
 
  # determining separator values:
  pause.sep.1 = mean(pause.levels[1:2])
  pause.sep.2 = mean(pause.levels[2:3])
 
 
  ## similar exercise for beep.durs:
  raw.tab = table(beep.durs)
  beep.centers.raw = kmeans(as.numeric(names(raw.tab[raw.tab > 5])),2)$centers[,1]
  beep.centers = beep.centers.raw[order(beep.centers.raw,decreasing = F)]
  beep.levels = as.vector(beep.centers)
 
  beep.sep = mean(beep.levels[1:2])
 
 
  ## creating the letter and word end vectors:
  letter.ends = which(pause.durs > pause.sep.1)
  word.ends = which(as.numeric(pause.durs[pause.durs > pause.sep.1] > pause.sep.2) == 1)
 
 
 
  # recoding beep durations to long and short:
  beep.durs.let = beep.durs
  beep.durs.let[beep.durs.let > beep.sep] = "l"
  beep.durs.let[beep.durs.let < beep.sep] = "s"
 
 
 
 
  ## grouping the beep duration letters (l's and s's) into letters
  ## based on the letter ends vector
  empty.list = list()
  start.val = 1
  for(i in 1:length(letter.ends)){
    cur.points = beep.durs.let[start.val:letter.ends[i]]
    empty.list[[i]] = paste(cur.points,collapse = "")
    start.val = letter.ends[i] + 1  
  }
 
  letter.vec = unlist(lapply(empty.list, function(x){ref.df$letter[which(ref.df$code == x)]}))
 
 
  ## grouping letters into words based on word.ends vec:
  start.val = 1
  empty.list = list()
  for(i in 1:length(word.ends)){
    cur.points = letter.vec[start.val:word.ends[i]]
    empty.list[[i]] = paste(cur.points,collapse = "")
    start.val = word.ends[i] + 1  
  }
 
 
  ## saving as a new vector, with spacing:
  out = paste(unlist(empty.list),collapse = " ")
 
 
  return(out)
}
 
# examples:
m.to.text.func(sf.1)

Created by Pretty R at inside-R.org