Speech Recognition Using Deep Learning

Alekh Sinha
Nov 11, 2021 · 4 min read

Speech recognition refers to the ability of a machine to convert spoken words into readable text. For this blog I will be referring to a Kaggle competition, the TensorFlow Speech Recognition Challenge (https://www.kaggle.com/c/tensorflow-speech-recognition-challenge), organized by Google Brain. The data contains 65,000 one-second-long utterances of 30 short words, spoken by thousands of different people.

In this blog I will cover:

  1. Creation of useful Features
  2. Model Architecture

The first task at hand is to convert the voice into a mathematical form so that a machine learning model can be built on top of it. An audio file contains a waveform, i.e. the amplitude of the wave at every instant of time (analog). In order to carry out any computation we need to discretize it; in other words, we need to decide at which instants (intervals) of time we want the value of that waveform (digital).

We need to provide a sampling rate to get this discretization. For example, if an audio file is 1 second long and the sampling rate is 16000 Hz, then 16000 points will be extracted. Number of data points = sampling rate * time.
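As a quick illustration (a minimal numpy sketch of my own, not from the competition data), sampling a 1-second, 440 Hz sine wave at 16000 Hz yields exactly sampling rate * time data points:

import numpy as np

sr = 16000                                        # sampling rate in Hz
duration = 1.0                                    # clip length in seconds
t = np.linspace(0, duration, int(sr * duration), endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)              # a 440 Hz tone
print(signal.shape)                               # (16000,) = sr * duration points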

So with this we have a digital representation of the audio file, but we still want to extract some useful features. Voice has three main attributes: loudness, pitch and timbre. These have to be expressed mathematically so that we can compute over them.

Loudness- Loudness is directly related to amplitude. As discussed above, this information is already present in the file.

Pitch- Pitch is related to frequency. This relation is established by the mel scale.

Relation between frequency (Hz) and pitch (mels)

Further detail on the mel scale can be found at https://www.sfu.ca/sonic-studio-webdav/handbook/Mel.html. We will use the Fourier transform to get the frequencies.
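For reference, one common form of the mel scale (the HTK formula) maps a frequency f in Hz to m = 2595 * log10(1 + f / 700) mels; librosa exposes the same conversion as librosa.hz_to_mel. A small sketch:

import numpy as np
import librosa

f = np.array([440.0, 1000.0, 4000.0])            # frequencies in Hz
mels_formula = 2595 * np.log10(1 + f / 700)      # HTK mel formula
mels_librosa = librosa.hz_to_mel(f, htk=True)    # same conversion via librosa
print(mels_formula, mels_librosa)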

Timbre- Timbre depends on the waveform. Timbre is the quality of sound that helps in distinguishing one sound from another.

Now, as discussed above, we require frequency, but what we have is amplitude at each instant of time, so we need to convert a time-domain signal into a frequency-domain signal. Thankfully this can be done with the Fourier transform.

Fourier Transform

If anybody wants to understand the math behind the Fourier transform, I recommend this YouTube playlist: https://www.youtube.com/watch?v=iCwMQJnKk2c&list=PL-wATfeyAMNqIee7cH3q1bh4QJFAaeNv0
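As a small sketch of the idea (my own numpy example, not the competition data), the Fourier transform of a pure 440 Hz tone shows a single dominant peak at 440 Hz in the frequency domain:

import numpy as np

sr = 16000
t = np.linspace(0, 1.0, sr, endpoint=False)
signal = np.sin(2 * np.pi * 440 * t)             # time domain: a 440 Hz tone

spectrum = np.abs(np.fft.rfft(signal))           # magnitude of the Fourier transform
freqs = np.fft.rfftfreq(len(signal), d=1/sr)     # frequency (Hz) of each bin
print(freqs[np.argmax(spectrum)])                # ~440.0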

But with a single Fourier transform over the whole clip, all information about time is lost. For this a spectrogram is used, which divides the entire duration into multiple windows and performs a Fourier transform on each of them (https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53).

image from https://www.mathworks.com/help/dsp/ref/dsp.stft.html
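Such a windowed (short-time) Fourier transform can be computed directly with librosa; a minimal sketch, where 'path' is a placeholder for any one-second clip and the window parameters are just illustrative:

import librosa

samples, sr = librosa.load('path', sr=16000)
stft = librosa.stft(samples, n_fft=512, hop_length=512)
print(stft.shape)                                # (1 + n_fft/2, number of windows) = (257, n_frames)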

Now we have all the features needed to start preparing a model. Let's see how to implement these things in Python.

import numpy as np
import librosa

samples, sample_rate = librosa.load('path', sr=16000)                       # load audio at 16000 Hz
spectrum = librosa.feature.melspectrogram(y=samples, sr=16000, n_mels=128)  # shape (n_mels, time windows)
log_spectrum = librosa.power_to_db(S=spectrum, ref=np.max)                  # power values -> decibels

n_mels refers to the number of mel bands, i.e. how many mel-spaced frequency bins the spectrogram will contain.

Another parameter is hop_length, the number of data points between successive windows of time. The default value is 512, so if the audio file is 1 second long, n_mels is 64 and the sampling rate is 16000, then the output of the spectrogram will be (n_mels, (time * sampling rate) / hop_length), which is (64, (16000 * 1) / 512) ≈ (64, 32).
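A quick way to verify this shape arithmetic (a sketch on a synthetic 1-second clip; librosa pads the edges, so the frame count comes out to 32 rather than exactly 31.25):

import numpy as np
import librosa

sr = 16000
samples = np.random.randn(sr).astype(np.float32)  # stand-in for a 1-second clip
spectrum = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=64, hop_length=512)
print(spectrum.shape)                             # (64, 32) -> (n_mels, time * sr / hop_length)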

Mel spectrogram example
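A plot like the one above can be reproduced with librosa.display (a minimal sketch; 'path' is again a placeholder for any clip from the dataset):

import numpy as np
import librosa
import librosa.display
import matplotlib.pyplot as plt

samples, sr = librosa.load('path', sr=16000)
spectrum = librosa.feature.melspectrogram(y=samples, sr=sr, n_mels=128)
log_spectrum = librosa.power_to_db(S=spectrum, ref=np.max)

librosa.display.specshow(log_spectrum, sr=sr, x_axis='time', y_axis='mel')
plt.colorbar(format='%+2.0f dB')
plt.title('Mel spectrogram')
plt.show()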

Model Architecture

Now, as discussed above, the output of the mel spectrogram is (n_mels, t), so one important decision is whether to use it as it is or to transpose it to (t, n_mels). I have built different models to check which is better.
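In code the two layouts differ only by a transpose, and a conv2d front end additionally expects a trailing channel axis (a small sketch with a dummy array, not the exact preprocessing from the post):

import numpy as np

spectrum = np.random.randn(64, 32)                # stand-in for a mel spectrogram (n_mels, t)
mel_first = spectrum                              # layout 1: (n_mels, t)
time_first = spectrum.T                           # layout 2: (t, n_mels)

conv_input = np.expand_dims(time_first, axis=-1)  # (t, n_mels, 1) for conv2d
print(conv_input.shape)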

For building the model I started with conv2d, referring to https://www.tensorflow.org/tutorials/audio/simple_audio, but since this is time-series data I was also thinking of using an lstm, and then I came across this blog:

https://towardsdatascience.com/tensorflow-speech-recognition-challenge-solution-outline-9c42dbd219c9, which used both conv2d and lstm. This is a very good idea, as it treats the spectrogram as an image and recognizes patterns in it while also exploiting the time-series nature of the mel spectrogram. It sounds better than using just an lstm or just conv2d. Now all of these options need to be put to the test.

In order to check, I made 4 models:

  1. Mel (n_mels,t) with conv2d and lstm
  2. Mel (n_mels,t) with only lstm
  3. Mel (t,n_mels) with conv2d and lstm
  4. Mel (t,n_mels) with only lstm

I used the following network as the base model for trying out these iterations.
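As a rough illustration (not the exact architecture used in the post), a minimal Keras sketch of the conv2d + lstm variant on (t, n_mels) input, assuming an input shape of (32, 64, 1) and 30 output classes, could look like this:

import tensorflow as tf
from tensorflow.keras import layers, models

inputs = layers.Input(shape=(32, 64, 1))                                   # (t, n_mels, 1)
x = layers.Conv2D(32, (3, 3), padding='same', activation='relu')(inputs)
x = layers.MaxPooling2D((1, 2))(x)                                         # pool only along the mel axis
x = layers.Conv2D(64, (3, 3), padding='same', activation='relu')(x)
x = layers.MaxPooling2D((1, 2))(x)
x = layers.Reshape((32, -1))(x)                                            # (time steps, features) for the LSTM
x = layers.LSTM(128)(x)
outputs = layers.Dense(30, activation='softmax')(x)                        # 30 short words

model = models.Model(inputs, outputs)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy', metrics=['accuracy'])
model.summary()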

Results

This shows that the transposed mel spectrogram, (t, n_mels), together with the conv2d + lstm network works best among the four models.
