I have been asked to create a classifier to identify the speaker of recordings of individual phrases spoken by either Di or Souhaib. Each sound sample consists of 1,000 amplitude values, which have been downsampled from the original recordings. For my final classifier, I have researched and exploited some of the properties of frequency spectra of sound waves, as well as other techniques in audio analysis.

Before we embark on this project, though:

set.seed(7) #For luck!

Data Exploration

First, let’s plot some of our training data.

marrangeGrob(basic_plots, nrow=2, ncol=1, top=NULL)

plot_training_words(tr_long)

A problem is immediately evident. The waveforms of the same word spoken by Di and Souhaib appear to have relatively high covariance, compared with the waveforms of two different words spoken by the same speaker. This pattern is evident in the full training data, plotted as overlaid waveforms:

ggplot(tr_long, aes(time, value, colour=name, alpha=name)) + 
  geom_line() + 
  scale_alpha_discrete(range=c(1, 0.7)) +
  labs(title="All waveforms overlaid", y="amplitude")

We will therefore need to try to extract features of the data that have low variance between different words, but high variance between different speakers.

Before transforming our data, however, let’s see if there are any sections of the raw waveforms that have any predictive power. We will define a function, separation_plot(), that takes a two-column data frame, fits a linear Support Vector Machine, and plots the resulting classifier. This will allow us to rapidly investigate potential features.
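
The definition of separation_plot() is elided here; the following is a minimal sketch of what it might look like, assuming e1071 and ggplot2 are loaded and that tr_answers holds the training labels (the shaded grid is one hypothetical way of drawing the decision regions; the actual implementation may differ):

separation_plot <- function(df, xlab, ylab) {
  names(df) <- c("x", "y")
  training <- data.frame(df, name = tr_answers)
  fit <- svm(name ~ ., training, kernel = "linear")
  # Classify a fine grid of points so the linear decision boundary is visible.
  grid <- expand.grid(x = seq(min(df$x), max(df$x), length.out = 100),
                      y = seq(min(df$y), max(df$y), length.out = 100))
  grid$name <- predict(fit, grid)
  ggplot() +
    geom_tile(data = grid, aes(x, y, fill = name), alpha = 0.3) +
    geom_point(data = training, aes(x, y, colour = name)) +
    labs(x = xlab, y = ylab)
}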

We will define tr as the waveforms of the training data, and ts as the waveforms of the test data.

Classification using the raw waveform

First up: by eyeballing the above waveform plot, the following separation was found on the raw waveforms. Note that we are only considering the absolute values of the waveforms, as a negative value on a waveform represents the same amplitude as a positive value.

When we create each linear SVM classifier, we will also keep a running count of wrong predictions in the training set, wrong_tally.

amplitude_separation <- data.frame(apply(tr, 1, function(x){mean(subset(abs(x[600:825]), abs(x[600:825]) > .1))}), 
                                   apply(tr, 1, function(x){mean(x[-(600:825)])}))
separation_plot(amplitude_separation, 
                "Mean of amplitudes > 0.1 at 599 < time < 826", 
                "Mean of remaining amplitudes")

svm_model <- svm(name~., data.frame(amplitude_separation, name=tr_answers), kernel="linear")
wrong_tally <- as.numeric(predict(svm_model, amplitude_separation) != tr_answers)

Interestingly, it appears that we can achieve a fairly good separation using the mean of some of the amplitudes of the waveforms alone. As expected, however, when these are excluded, the remaining amplitudes plotted on the y-axis are next to useless for classification.

We can incorporate more of this information in the simple 2-variable plot using principal components.

pca <- prcomp(abs(tr))
pca_separation <- data.frame(
  colSums(abs(apply(tr, 1, function(x){as.numeric(pca$rotation[,1]*abs(x))}))),
  colSums(abs(apply(tr, 1, function(x){as.numeric(pca$rotation[,2]*abs(x))})))
  )
separation_plot(pca_separation, "PC1", "PC2")

wrong_tally <- update_tally(pca_separation, wrong_tally)
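
update_tally() is not defined above; a minimal sketch consistent with how it is used, mirroring the manual svm() fit earlier (an assumption about the original helper), is:

update_tally <- function(features, tally) {
  # Fit a linear SVM on the two supplied features and add one to the tally
  # for each training sample that is predicted incorrectly.
  fit <- svm(name ~ ., data.frame(features, name = tr_answers), kernel = "linear")
  tally + as.numeric(predict(fit, features) != tr_answers)
}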

Once again, we see that the absolute values of the waveforms give a fairly good split of the data, so we should include this information in our model.

Transformation and Feature Extraction

The Fourier Transform

We can think of our waveforms as essentially continuous time series. An important mathematical transformation for functions over time is the Fourier Transform, which converts a signal from a function of time into a function of frequency. By taking the Fourier Transform of each signal, we obtain the relative strength of each of the component sine waves that add together to form the signal.
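
Concretely, since our signals are sampled, what we compute is the discrete Fourier Transform: for a signal $x_0, \ldots, x_{N-1}$ (here $N = 1000$),

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-2\pi i k n / N}, \qquad k = 0, 1, \ldots, N-1,$$

where the magnitude $|X_k|$ measures the strength of the component that completes $k$ cycles over the $N$ samples.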

The Fourier Transform returns complex values. The argument (phase) of each value represents the phase shift of the corresponding component frequency, which is of little use in our analysis. We will only consider the modulus (absolute magnitude) of the Fourier Transform, which represents the magnitude of each component sine wave.
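
A toy check of this in R (a synthetic signal, not one of our samples): a pure sine wave completing 5 cycles over 1,000 samples produces a single spike in the magnitude spectrum at the 5-cycle bin, while its phase sits in the argument that we discard.

x <- sin(2 * pi * 5 * (0:999) / 1000)
spec <- fft(x)
which.max(Mod(spec[1:500]))  # 6: R's 1-indexed bin for frequency 5
Arg(spec[6])                 # the phase, which we discard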

This is likely to be a very powerful technique for feature extraction, as Souhaib has a deeper voice than Di, so the distribution of the Fourier Transform of each of Souhaib’s words is likely to be more positively skewed than Di’s.

One thing to note: because of the way the sound data was encoded, the effective sampling rate of each sample varies. An important implication of this is that our Fourier Transforms can only be considered rough estimates of the “true” frequency distribution.

To illustrate this problem, listen to two recordings of Di:

listen(as.numeric(tr[1,]), f=1400)   # Di: analysis
listen(as.numeric(tr[75,]), f=1650)  # Di: time

  1. To achieve the correct pitch for Di saying analysis, we need to set the frequency to around 1400.
  2. To achieve the correct pitch for Di saying time, a shorter word, we need a higher frequency of around 1650.

We will later consider ways to remove this variation between samples.

As before, let’s look at the plots of the Fourier Transform on a sample of each of our words:

fft_wide <- cbind(tr_wide[,1:4], t(apply(tr, 1, function(x){abs(fft(x))}))[,1:500])
fft_long <- melt(fft_wide, id.var=c("id", "name", "phrase", "rep"), variable.name="time")
fft_long$time <- rep(1:500, each=80)
fft_long$value <- sqrt(fft_long$value)

plot_training_words(fft_long)

Our initial plots of the Fourier Transform look promising. There looks to be a good deal of separation between the same word spoken by Di and Souhaib.

For example, compare the following overlays of waveforms and Fourier Transforms:

ggplot(filter(tr_long, phrase == "analysis"), aes(time, sqrt(abs(value)), colour=name, alpha=name)) +
  scale_alpha_discrete(range=c(1, 0.7)) +
  geom_line() +
  labs(title="Di and Souhaib say analysis", y="amplitude (square-root scale)")

ggplot(fft_long, aes(time, value, colour=name, alpha=name)) +
  scale_alpha_discrete(range=c(1, 0.7)) +
  geom_line() +
  labs(title="Magnitude of Fourier Transforms")

ggplot(filter(fft_long, time > 30, time < 75), aes(time, value, colour=name, alpha=name)) +
  scale_alpha_discrete(range=c(1, 0.7)) + 
  geom_point() +
  labs(title="Magnitude of Fourier Transforms")

# Columns 1:4 of fft_wide are metadata, so x[34:79] covers Fourier bins
# 30:75 (the band highlighted above) and x[424:474] covers bins 420:470.
fft_separation <- data.frame(apply(fft_wide, 1, function(x){mean(as.numeric(x[424:474]))}), 
                             apply(fft_wide, 1, function(x){mean(as.numeric(x[34:79]))}))
separation_plot(fft_separation, 
                "Mean of the Fourier magnitude at 420 <= t <= 470", 
                "Mean of the Fourier magnitude at 30 <= t <= 75")

wrong_tally <- update_tally(fft_separation, wrong_tally)

This appears to be the greatest separation we have achieved yet. But we can take our frequency transformations one step further…

The Spectrogram

A spectrogram is essentially a plot over time of local Fourier Transforms (a short-time Fourier Transform). This gives us the best of both worlds: the raw Fourier Transform is time independent, but by moving a local window across the sample, we can track how a sound’s frequency profile evolves over the course of a word.
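
Conceptually, with no window overlap this is only a few lines of R (a bare-bones sketch; seewave::spectro() adds windowing functions, dB scaling and plotting):

# Bare-bones short-time Fourier transform: split the signal into
# consecutive windows of length wl and take the magnitude spectrum of
# each; any samples left over at the end of the signal are dropped.
stft <- function(x, wl = 120) {
  starts <- seq(1, length(x) - wl + 1, by = wl)
  # Each column is the (one-sided) magnitude spectrum of one time window.
  sapply(starts, function(s) Mod(fft(x[s:(s + wl - 1)]))[1:(wl %/% 2)])
}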

The following are the spectrograms of samples 5 and 51, which are good examples of the common features of Di and Souhaib’s spectrograms. With some trial and error, a time window of length 120 was determined to give the best separation of our samples.

spectro(as.numeric(tr[5,]), f=1000, wl=120, ovlp=0, tlab="Time", flab="Frequency", oma=c(0,0,2,0))
title("Di: analytics", outer=TRUE)

spectro(as.numeric(tr[51,]), f=1000, wl=120, ovlp=0, tlab="Time", flab="Frequency", oma=c(0,0,2,0))
title("Souhaib: learn", outer=TRUE)

These plots are quite complicated, but several recurring features stand out. For example, the dominant frequencies of Di’s words tend to vary less over time than Souhaib’s, which tend to trend upwards. This implies that Souhaib’s upward inflection at the end of words is more pronounced than Di’s. Furthermore, the relative intensity of lower frequencies is greater for Souhaib than Di, reflecting Souhaib’s deeper voice.

Interpretation of the frequency axis

I have intentionally suppressed the axis label “kHz” because, as discussed earlier, the sampling rate of our sound samples is not constant, so the label would be inaccurate. Instead, the frequency axis can be interpreted as the reciprocal of the wavelength: the number of samples a component sine wave takes to complete a cycle. To illustrate, consider the repeating waveform 0, 1, 0, -1, which has a wavelength of 4 samples. Its frequency is then 1/4, which can be read directly from the spectrogram:

spectro(rep(c(0, 1, 0, -1), 250), f=1000, wl=499, ovlp=0, tlab="Time", flab="Frequency", oma=c(0,0,2,0), flog=TRUE)

It is interesting to note that an analogue of a fundamental property of waves arises here: the Heisenberg Uncertainty Principle. We must trade off time resolution (small window length wl) against frequency resolution (large wl).

For example, if we wanted to know the approximate frequency distribution at a highly specific point in time, we would have to use the following spectrogram, which is much “fuzzier” along the frequency axis:

spectro(rep(c(0, 1, 0, -1), 250), f=1000, wl=16, ovlp=0, tlab="Time", flab="Frequency", oma=c(0,0,2,0), flog=TRUE)

As our raw Fourier Transform already gives us high-resolution frequency data, we will use a relatively short window length of 120 to extract new information from our sound data: the ability to track frequency changes over time.

Observations about the spectrograms of the training data

From these spectrograms, I noticed an interesting feature. Sibilant sounds, such as s and sh, tend to be difficult to tell apart between Di and Souhaib. Consider the following sound samples:

spectro(as.numeric(tr[72,]), f=1000, wl=120, ovlp=0, tlab="Time", flab="Frequency", oma=c(0,0,2,0), flog=TRUE)
title("Di: spline", outer=TRUE)

spectro(as.numeric(tr[74,]), f=1000, wl=120, ovlp=0, tlab="Time", flab="Frequency", oma=c(0,0,2,0), flog=TRUE)
title("Souhaib: spline", outer=TRUE)

Notice that towards the left of the plot, the frequency distribution is almost uniform. This is because sibilant sounds are spectrally similar to white noise, which is by definition noise with a uniform frequency distribution!
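
To see this directly, compare the spectrogram of simulated Gaussian white noise (a synthetic signal, not one of our samples):

# White noise: the intensity is spread roughly uniformly across all
# frequencies at every point in time.
spectro(rnorm(1000), f=1000, wl=120, ovlp=0, tlab="Time", flab="Frequency")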

Let’s compare four cases, all different words, to see if there are any common patterns.

layout(matrix(c(1,2,3,4), 2, 2))
par(oma = c(3,2.2,0,0) + 0.1, mar = c(0,0.4,2,0) + 0.1)
par(xaxt="n", yaxt="s")
spectro(as.numeric(tr[9,100:900]), f=800, wl=120, ovlp=0, scale=FALSE, flab="Relative Frequency", main="Di: bagging")
par(xaxt="s", yaxt="s")
spectro(as.numeric(tr[60,100:900]), f=800, wl=120, ovlp=0, scale=FALSE, flab="Relative Frequency", main="Di: model")
par(xaxt="n", yaxt="n")
spectro(as.numeric(tr[6,100:900]), f=800, wl=120, ovlp=0, scale=FALSE, flab="Relative Frequency", main="Souhaib: analysis")
par(xaxt="s", yaxt="n")
spectro(as.numeric(tr[76,100:900]), f=800, wl=120, ovlp=0, scale=FALSE, flab="Relative Frequency", main="Souhaib: time")

What a fantastic result! Even over different words, the main features of Di and Souhaib’s spectrograms appear to be relatively constant. Recall that the major problem we first identified with our waveforms was that they were highly variable between words, but not between speakers. We have now flipped this: the variance between speakers is high, but not between words!

To illustrate how helpful the spectrogram will be to our classifier, consider the following basic dimension reduction of a waveform to two variables: the two sharpest peaks of its mean frequency spectrum.

layout(matrix(c(1,2), ncol=2))
par(oma = c(3,0,0,0) + 0.1, mar = c(0,0.4,2,0) + 0.1)
fpeaks(meanspec(as.numeric(tr[1,]), f=1000, wl=120, plot=FALSE), nmax=2, title=FALSE) 
title("Di: analysis")
fpeaks(meanspec(as.numeric(tr[2,]), f=1000, wl=120, plot=FALSE), nmax=2, title=FALSE) #peaks with highest slopes
title("Souhaib: analysis")

spec_peaks <- data.frame(apply(tr, 1, function(x){fpeaks(meanspec(x, f=1000, wl=90, plot=FALSE), nmax=2, plot=FALSE)[2,1]}), 
                         apply(tr, 1, function(x){fpeaks(meanspec(x, f=1000, wl=90, plot=FALSE), nmax=2, plot=FALSE)[1,1]}))

separation_plot(spec_peaks, "High frequency peak", "Low frequency peak")

wrong_tally <- update_tally(spec_peaks, wrong_tally)

By taking a window of this spectrum over time, we tap into a particularly rich source of differentiation between Di and Souhaib’s speech.

The Zero Crossing Rate

In the course of our spectrogram analysis, we noticed that we had some difficulty differentiating between the “s” sounds of Di and Souhaib. This sound is known as an unvoiced sound, as the vocal cords do not vibrate to produce it.

The instantaneous zero crossing rate is a simple (but rough) way of classifying speech as voiced or unvoiced. The total number of zero crossings, then, is a rough proxy for the amount of “aspiration” in a spoken word. Unfortunately, this statistic is likely to vary quite a lot across words; for our purposes, however, it will be better than nothing.

Due to the influence of unvoiced sounds on the spectrogram, it will be useful to quantify roughly how aspirated each phrase in our test data is. As we will see in the final model, by interacting our spectrogram values with the zero crossing rate of each sample, we gain a crucial way of differentiating the more difficult-to-classify sounds. I believe this unusual combination of variables was invaluable in achieving my final Kaggle score.

Consider the following function:

zero_crosses <- function(x) {
  total <- 0
  previous <- x[1] > 0
  for (i in 2:length(x)) {
    current <- x[i] > 0
    # A zero crossing occurs whenever the sign indicator changes.
    total <- total + (previous != current)
    previous <- current
  }
  return(total)
}
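
Equivalently, the whole loop collapses to a single vectorized expression:

# diff() of the logical sign indicator is non-zero exactly at a crossing.
zero_crosses_vec <- function(x) sum(diff(x > 0) != 0)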

Then plotting the zero crossing rate for our training samples:

tr_zc <- data.frame(name=tr_answers, zc=apply(tr, 1, zero_crosses))
tr_zc <- arrange(tr_zc, zc)
ggplot(tr_zc, aes(x=1:80, y=zc, fill=name)) + 
  geom_bar(stat="identity") + 
  labs(x="words", y="zero crossing rate") +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())

wrong_tally <- wrong_tally + as.numeric(factor(ifelse(apply(tr, 1, zero_crosses) > median(tr_zc$zc), "souhaib", "di")) != tr_answers)

Interestingly, the zero crossing rate appears to have some predictive power by itself. There are two possible explanations here:

  1. Souhaib’s pronunciation of words is more aspirated than Di’s.
  2. Souhaib’s word samples have longer silent sections on average. Silence also tends to have a higher zero crossing rate.

Personally, I favour the second explanation, but it is likely that some characteristic of speech also contributes to this effect. However, the zero crossing rate alone did not contribute to the predictive power of our classifier, so it appears only as an interaction term in the final model.
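
A quick sanity check of the premise behind the second explanation (illustrative values from synthetic signals, not our recordings):

zero_crosses(rnorm(1000, sd = 0.01))             # "silence": roughly 500
zero_crosses(sin(2 * pi * 5 * (1:1000) / 1000))  # voiced tone: roughly 10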

Skewness

The final feature we will extract is the skewness of the waveform. As we already have a significant amount of information from the Fourier-transformed data, we will not include the skewness of the Fourier Transform in the final model; it is plotted here for comparison only:

tr_skewness <- data.frame(apply(tr, 1, function(x){skewness(abs(fft(x)))}), apply(tr, 1, skewness))
separation_plot(tr_skewness, "Skewness of the Fourier Transform", "Skewness of the raw waveform")

wrong_tally <- update_tally(tr_skewness, wrong_tally)

Such a simple statistic is a relatively poor metric on its own, but it does give us new information about how the amplitude of each sound changes, so it will be helpful in our final model.
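
For reference, the moment-based sample skewness (packages such as e1071 and moments provide an equivalent skewness() function) is:

# Third central moment scaled by the standard deviation cubed; positive
# values indicate a long right tail.
skew <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^(3/2)
}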

Summary of extracted features

These features will be all we need in our final model. But before we construct the model, let’s investigate the tally of wrong predictions we have kept from these simple two-dimensional measures.

wrong_tally <- data.frame(tr_wide[,1:3], wrong=wrong_tally) %>%
  filter(wrong != 0)

wrong_tally$phrase <- paste0(wrong_tally$phrase, " (", wrong_tally$id, ")")

ggplot(wrong_tally, aes(x=reorder(phrase, wrong), y=wrong, fill=name)) + 
  geom_bar(stat="identity", position="stack") +
  labs(x="words", y="Number of SVMs that predicted incorrectly") + 
  theme(axis.text.x = element_text(angle=90))

This is a good result: of the six simple two-variable classifiers fitted (five linear SVMs plus the zero crossing median split), every word was classified correctly at least once, and only one word was classified correctly only once. With a good non-linear model, we should be able to achieve a good classification score.

The Model

Through the course of my participation in the Kaggle competition, I attempted to fit many different types of models. These included OLS regression, k-Nearest-Neighbours, Support Vector Machines, decision trees, random forests and boosted trees.

Ultimately, the random forest was the best model for two reasons:

  1. It is relatively robust to low-importance predictors. This is important for our classifier, as our extracted feature set is high-dimensional and many features will carry little signal.
  2. It allows for complex non-linearity. The relationship between the spectrogram and the speaker in particular is highly complex, so the ability of random forests to encode complex relationships is particularly appealing.

The best model with our explored features, selected based on the OOB error rate, includes the following variables:

tr_predictors <- cbind(tr, tr_fft, tr_spec, tr_spec * tr_zcr, tr_skew)
names(tr_predictors) <- paste0("X", 1:2221)
ts_predictors <- cbind(ts, ts_fft, ts_spec, ts_spec * ts_zcr, ts_skew)
names(ts_predictors) <- paste0("X", 1:2221)

Now that we have our training and test set data frames with the same variable name structure, we can plug them into our random forest algorithm. The values for ntree, the number of trees grown, and mtry, the number of variables tried at each split, were selected through trial and error. A more robust method would be to tune these parameters using cross-validation; however, due to the computational cost of fitting random forests on 2,000+ variables, this would take a long time!
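
As a middle ground, the randomForest package provides tuneRF(), which searches over mtry using the OOB error; a sketch of how it could be used here (not what was run for the final model):

# Step mtry up and down by stepFactor until the OOB error stops improving
# by at least `improve`; ntreeTry is kept modest to limit run time.
tuned <- tuneRF(tr_predictors, tr_answers, ntreeTry = 500,
                stepFactor = 1.5, improve = 0.01)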

model <- randomForest(x=tr_predictors, y=tr_answers, ntree=1499, mtry=20)
model
## 
## Call:
##  randomForest(x = tr_predictors, y = tr_answers, ntree = 1499,      mtry = 20) 
##                Type of random forest: classification
##                      Number of trees: 1499
## No. of variables tried at each split: 20
## 
##         OOB estimate of  error rate: 6.25%
## Confusion matrix:
##         di souhaib class.error
## di      35       5       0.125
## souhaib  0      40       0.000

Our model has a 6.25% “out of bag” (OOB) error rate. Because each sample’s OOB prediction uses only the trees that were not trained on that sample, the OOB error behaves like a small test-set estimate, so we can be confident that test-set predictions will have a similar error rate, assuming the test set data is similar to the training set data.

Further Analysis

Let’s examine the training samples that our random forest misclassified.

wrong <- which(model$predicted != tr_answers)
tr_wide[wrong, 1:4]
##    id   phrase name rep
## 15 15 business   di   6
## 19 19 business   di   6
## 21 21 classify   di   8
## 23 23 classify   di   8
## 34 34     data   di   6

Interestingly, the model’s misclassifications were all instances of words spoken by Di that the model predicted as words spoken by Souhaib. It also appears that, with the exception of “data”, all misclassified words have a strong “s” sound, which, as we identified earlier, is likely to cause difficulty due to the similarity of this sound to white noise.

Using the seewave::listen() function, it seems that these samples are particularly noisy compared to others.

I also noticed that this recording of the phrase “data” seems relatively short compared with other recordings. In fact, when listening to it with f=1400, which was about right for “analysis”, the pitch sounds far too low for a sample of Di, even to the human ear. This is likely why our model assumed it was a sample of Souhaib.

Let’s now examine the confidence scores of the model:

confidence <- data.frame(id=1:80, phrase=paste0(tr_wide$phrase, " (", 1:80, ")"), name=tr_answers, model$votes, diff=0.5*abs(model$votes[,1] - model$votes[,2])) %>%
  arrange(desc(diff))

ggplot(confidence, aes(reorder(phrase, di), di, fill=name)) + 
  geom_bar(stat="identity", position="dodge") +
  labs(x="Phrase (sound ID)", y="Probability prediction of Di") + 
  theme(axis.text.x = element_text(angle=90)) +
  geom_abline(data=NULL, slope=0, intercept=0.5, linetype="dashed")

subset(confidence, id %in% wrong)
##    id        phrase name        di   souhaib       diff
## 49 21 classify (21)   di 0.4011299 0.5988701 0.09887006
## 61 34     data (34)   di 0.4409006 0.5590994 0.05909944
## 63 23 classify (23)   di 0.4467005 0.5532995 0.05329949
## 66 15 business (15)   di 0.4550360 0.5449640 0.04496403
## 75 19 business (19)   di 0.4664247 0.5335753 0.03357532

The “most wrong” OOB prediction was for id 21, classify spoken by Di, which our random forest predicted as Souhaib with 60% confidence. This was the same observation that was incorrectly predicted by five of our six simple classifiers.

Finally, here is the raw output of the Random Forest model on the test set data.

predict(model, ts_predictors)
##       1       2       3       4       5       6       7       8       9 
## souhaib      di      di      di      di      di souhaib souhaib      di 
##      10      11      12      13      14      15      16      17      18 
## souhaib      di      di souhaib      di      di      di      di souhaib 
##      19      20      21      22      23      24      25      26      27 
##      di souhaib      di      di souhaib souhaib souhaib      di      di 
##      28      29      30      31      32      33      34      35      36 
## souhaib souhaib souhaib souhaib      di      di souhaib souhaib      di 
##      37      38      39      40      41      42 
## souhaib      di souhaib souhaib souhaib souhaib 
## Levels: di souhaib