Python Deep Learning Cookbook

By: Indra den Bakker

Overview of this book

Deep Learning is revolutionizing a wide range of industries. For many applications, deep learning has proven to outperform humans by making faster and more accurate predictions. This book provides a top-down and bottom-up approach to demonstrate deep learning solutions to real-world problems in different areas. These applications include Computer Vision, Natural Language Processing, Time Series, and Robotics. The Python Deep Learning Cookbook presents technical solutions to the issues presented, along with a detailed explanation of the solutions. Furthermore, it discusses the corresponding pros and cons of implementing the proposed solution using one of the popular frameworks like TensorFlow, PyTorch, Keras and CNTK. The book includes recipes that are related to the basic concepts of neural networks, as well as classical network topologies. The main purpose of this book is to provide Python programmers with a detailed list of recipes to apply deep learning to common and not-so-common scenarios.

Identifying speakers with voice recognition

Next to speech recognition, there is more we can do with sound fragments. While speech recognition focuses on converting speech (spoken words) to digital data, we can also use sound fragments to identify the person who is speaking. This is also known as voice recognition. Every individual has different characteristics when speaking, caused by differences in anatomy and behavioral patterns. Speaker verification and speaker identification are getting more attention in this digital age. For example, a home digital assistant can automatically detect which person is speaking.

In the following recipe, we'll be using the same data as in the previous recipe, where we implemented a speech recognition pipeline. However, this time, we will be classifying the speakers of the spoken numbers. 

How to do it...

  1. In this recipe, we start by importing all libraries:
import glob
import numpy as np
import random
import librosa
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

import keras
from keras.layers import LSTM, Dense, Dropout, Flatten
from keras.models import Sequential
from keras.optimizers import Adam
from keras.callbacks import EarlyStopping, ModelCheckpoint
  1. Let's set SEED and the location of the .wav files:
SEED = 2017
DATA_DIR = 'Data/spoken_numbers_pcm/' 
  1. Let's split the .wav files into a training set and a validation set with scikit-learn's train_test_split function:
files = glob.glob(DATA_DIR + "*.wav")
X_train, X_val = train_test_split(files, test_size=0.2, random_state=SEED)

print('# Training examples: {}'.format(len(X_train)))
print('# Validation examples: {}'.format(len(X_val)))
  1. To extract and print all unique labels, we use the following code:
labels = []
for i in range(len(X_train)):
    label = X_train[i].split('/')[-1].split('_')[1]
    if label not in labels:
        labels.append(label)
print(labels)
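
The label extraction can be tried out without the dataset. The sketch below assumes file names of the form `<number>_<speaker>_<suffix>.wav`, which is inferred from the `split('_')[1]` call; the file names and the helper `speaker_from_path` are hypothetical and only serve as an illustration.

```python
import os


def speaker_from_path(path):
    """Return the second underscore-separated field of the file name."""
    filename = os.path.basename(path)  # drop the directory part
    return filename.split('_')[1]


# Hypothetical file names, for illustration only
example_files = [
    'Data/spoken_numbers_pcm/0_Agnes_100.wav',
    'Data/spoken_numbers_pcm/7_Bruce_240.wav',
    'Data/spoken_numbers_pcm/3_Agnes_180.wav',
]

# A set removes duplicates; sorting gives a reproducible order
example_labels = sorted({speaker_from_path(f) for f in example_files})
print(example_labels)
```

Using `os.path.basename` instead of splitting on `'/'` is a design choice of this sketch, so that it does not depend on a hard-coded path separator.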
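  1. We can now define our one_hot_encode function as follows:
label_binarizer = LabelBinarizer()
# Fit on the known labels so that transform can map them to one-hot vectors
label_binarizer.fit(labels)

def one_hot_encode(x): return label_binarizer.transform(x)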
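
To see what the encoder produces, here is a minimal sketch with three hypothetical speaker names. It assumes the binarizer has been fitted on the list of labels before `transform` is called.

```python
from sklearn.preprocessing import LabelBinarizer

speakers = ['Agnes', 'Bruce', 'Victoria']  # hypothetical labels

binarizer = LabelBinarizer()
binarizer.fit(speakers)  # learns the classes and stores them in sorted order

# Each row is the one-hot vector of one label
encoded = binarizer.transform(['Bruce', 'Agnes'])
print(binarizer.classes_)
print(encoded)

# inverse_transform maps one-hot rows back to the original labels
decoded = binarizer.inverse_transform(encoded)
print(decoded)
```

Note that with exactly two classes, LabelBinarizer returns a single column instead of two, so a softmax output layer with `n_classes` units fits the encoding only when there are three or more speakers.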
  1. Before we can feed the data to our network, some preprocessing needs to be done. We use the following settings:
n_features = 20
max_length = 80
n_classes = len(labels)
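  1. We can now define our batch generator. The generator handles all preprocessing tasks, such as reading a .wav file and transforming it into usable input:
def batch_generator(data, batch_size=16):
    while 1:
        # Reshuffle so that every batch contains different files
        random.shuffle(data)
        X, y = [], []
        for i in range(batch_size):
            wav = data[i]
            wave, sr = librosa.load(wav, mono=True)
            # The label is the second field of the filename
            label = wav.split('/')[-1].split('_')[1]
            y.append(label)
            mfcc = librosa.feature.mfcc(y=wave, sr=sr, n_mfcc=n_features)
            # Truncate or zero-pad the time axis to max_length frames
            mfcc = mfcc[:, :max_length]
            mfcc = np.pad(mfcc, ((0, 0), (0, max_length - mfcc.shape[1])),
                          mode='constant', constant_values=0)
            X.append(mfcc)
        yield np.array(X), one_hot_encode(y)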


Please note the difference in our batch generator compared to the previous recipe: the label taken from each filename now identifies the speaker rather than the spoken number.
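
The padding step of the generator can be checked in isolation, with synthetic arrays standing in for real MFCC matrices. This is a sketch under assumptions: the helper name `pad_or_truncate` is hypothetical, and the truncation of clips longer than `max_length` frames is an addition of this sketch, included because `np.pad` rejects negative pad widths.

```python
import numpy as np


def pad_or_truncate(mfcc, max_length=80):
    """Force the time axis (axis 1) of an MFCC matrix to max_length frames."""
    mfcc = mfcc[:, :max_length]             # cut clips that are too long
    pad_width = max_length - mfcc.shape[1]  # 0 if the clip is long enough
    return np.pad(mfcc, ((0, 0), (0, pad_width)),
                  mode='constant', constant_values=0)


# Synthetic stand-ins for MFCC matrices with 20 coefficients
short_clip = np.ones((20, 50))
long_clip = np.ones((20, 120))

padded = pad_or_truncate(short_clip)
clipped = pad_or_truncate(long_clip)
print(padded.shape, clipped.shape)
```

Both results have the shape `(n_features, max_length)`, which is the fixed `input_shape` the network expects.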

  1. Let's define the hyperparameters before defining our network architecture:
learning_rate = 0.001
batch_size = 64
n_epochs = 50
dropout = 0.5

input_shape = (n_features, max_length)
steps_per_epoch = 50
  1. The network architecture we will use is quite straightforward. We will stack dense layers on top of an LSTM layer, as follows:
model = Sequential()
model.add(LSTM(256, return_sequences=True, input_shape=input_shape,
               dropout=dropout))
# Flatten the per-timestep LSTM outputs before the dense layers
model.add(Flatten())
model.add(Dense(128, activation='relu'))
model.add(Dropout(dropout))
model.add(Dense(n_classes, activation='softmax'))
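
As a rough check on the size of this network, the layer shapes and parameter counts can be worked out by hand. The sketch below assumes a Flatten layer between the LSTM and the first dense layer (Flatten is among the imports of the first step) and a hypothetical `n_classes` of 10; in the recipe, `n_classes` is `len(labels)`.

```python
n_features, max_length = 20, 80  # from the preprocessing settings
lstm_units, dense_units = 256, 128
n_classes = 10  # hypothetical; the recipe uses len(labels)

# Keras reads input_shape=(n_features, max_length) as 20 steps of 80 values
timesteps, input_dim = n_features, max_length

# An LSTM has 4 gates, each with input weights, recurrent weights and a bias
lstm_params = 4 * ((input_dim + lstm_units) * lstm_units + lstm_units)

# return_sequences=True keeps one output vector per timestep
lstm_output_shape = (timesteps, lstm_units)
flattened = timesteps * lstm_units

# Dense layers: one weight per input-output pair plus one bias per output
dense_params = flattened * dense_units + dense_units
output_params = dense_units * n_classes + n_classes

print(lstm_output_shape, flattened)
print(lstm_params, dense_params, output_params)
```

Under these assumptions, the first dense layer holds the largest share of the weights, because flattening the per-timestep outputs yields 5,120 inputs.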
  1. Next, we set the optimizer and loss function, compile the model, and print a summary of our model:
opt = Adam(lr=learning_rate)
model.compile(loss='categorical_crossentropy', optimizer=opt,
              metrics=['accuracy'])
model.summary()
  1. To prevent overfitting, we will use early stopping and automatically store the model that has the highest validation accuracy:
callbacks = [ModelCheckpoint('checkpoints/voice_recognition_best_model_{epoch:02d}.hdf5',
                             monitor='val_acc', save_best_only=True),
             EarlyStopping(monitor='val_acc', patience=2)]
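
The effect of `patience=2` can be illustrated with a simplified imitation of the stopping rule in plain Python. This is not Keras's implementation, and the exact epoch at which Keras stops may differ between versions; the function name and the accuracy values are made up for illustration.

```python
def early_stopping_epoch(val_acc, patience=2):
    """Return the index of the epoch at which training would stop,
    or None if the patience is never exhausted."""
    best = float('-inf')
    wait = 0
    for epoch, acc in enumerate(val_acc):
        if acc > best:
            best, wait = acc, 0  # improvement: reset the counter
        else:
            wait += 1            # no improvement this epoch
            if wait >= patience:
                return epoch
    return None


# Made-up validation accuracies, one per epoch
made_up = [0.60, 0.72, 0.71, 0.70, 0.80]
stop = early_stopping_epoch(made_up)
print(stop)
```

In this made-up run, training stops before the last value is ever reached, which shows the trade-off of a small patience: it saves time, but a noisy validation curve can end training early.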
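  1. We are ready to start training and we will store the results in history:
history = model.fit_generator(
    generator=batch_generator(X_train, batch_size),
    steps_per_epoch=steps_per_epoch, epochs=n_epochs,
    validation_data=batch_generator(X_val, 32),
    validation_steps=5,  # assumed value, not given in the recipe text
    callbacks=callbacks)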

In the following figure, the training accuracy and validation accuracy are plotted against the epochs:

Figure 9.1: Training and validation accuracy
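
The values behind such a plot are stored in `history.history`, a dictionary with one list per metric. As a minimal sketch, the helper below (a hypothetical name) finds the epoch with the highest validation accuracy. It assumes the key `val_acc`, matching the name monitored by the callbacks; the numbers are made up and do not come from the recipe.

```python
def best_epoch(history_dict, key='val_acc'):
    """Return (1-based epoch number, value) of the best entry under key."""
    values = history_dict[key]
    best_index = max(range(len(values)), key=values.__getitem__)
    return best_index + 1, values[best_index]


# Made-up numbers in the shape of history.history
example_history = {
    'acc': [0.55, 0.81, 0.93],
    'val_acc': [0.62, 0.88, 0.86],
}

epoch, value = best_epoch(example_history)
print('Best validation accuracy {:.2f} at epoch {}'.format(value, epoch))
```

With a real run, `best_epoch(history.history)` would point to the epoch whose checkpoint is worth keeping.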