Data Science with Python

By: Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Overview of this book

Data Science with Python begins by introducing you to data science and teaches you to install the packages you need to create a data science coding environment. You will learn three major techniques in machine learning: unsupervised learning, supervised learning, and reinforcement learning. You will also explore basic classification and regression techniques, such as support vector machines, decision trees, and logistic regression. As you make your way through the book, you will understand the basic functions, data structures, and syntax of the Python language that are used to handle large datasets with ease. You will learn about NumPy and pandas libraries for matrix calculations and data manipulation, discover how to use Matplotlib to create highly customizable visualizations, and apply the boosting algorithm XGBoost to make predictions. In the concluding chapters, you will explore convolutional neural networks (CNNs), deep learning algorithms used to predict what is in an image. You will also understand how to feed human sentences to a neural network, make the model process contextual information, and create human language processing systems to predict the outcome. By the end of this book, you will be able to understand and implement any new data science algorithm and have the confidence to experiment with tools or libraries other than those covered in the book.

Chapter 8: Tips and Tricks of the Trade

Activity 21: Classifying Images using InceptionV3

Solution:

  1. Create functions to get the images and their labels. Here, the PATH variable contains the path to the training dataset.

    from PIL import Image

    def get_input(file):

        return Image.open(PATH+file)

    def get_output(file):

        class_label = file.split('.')[0]

        if class_label == 'dog': label_vector = [1,0]

        elif class_label == 'cat': label_vector = [0,1]

        return label_vector
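
The label lookup above leaves `label_vector` unassigned when a file name starts with anything other than `dog` or `cat`, so a stray file in PATH would raise an `UnboundLocalError`. A minimal defensive sketch, assuming file names of the form `dog.123.jpg`; the `LABELS` table and the name `get_output_checked` are illustrative and not from the book:

```python
# Hypothetical lookup table; the one-hot order matches get_output above.
LABELS = {'dog': [1, 0], 'cat': [0, 1]}

def get_output_checked(file):
    """Return the one-hot label for a file named like 'dog.123.jpg'."""
    class_label = file.split('.')[0]
    if class_label not in LABELS:
        raise ValueError('Unexpected class prefix in file name: ' + file)
    # Return a copy so callers cannot mutate the shared table
    return list(LABELS[class_label])
```

Failing early with the offending file name is usually easier to debug than an error raised deep inside the batch generator.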

  2. Set SIZE and CHANNELS. SIZE is the side length, in pixels, of the square input image. CHANNELS is the number of channels in the training images; an RGB image has 3 channels.

    SIZE = 200

    CHANNELS = 3

  3. Create a function to preprocess the images. This version only resizes and normalizes them; any augmentation would also go here:

    def preprocess_input(image):

        

        # Data preprocessing

        image = image.resize((SIZE,SIZE))

        image = np.array(image).reshape(SIZE,SIZE,CHANNELS)

        

        # Normalize image

        image = image/255.0

        

        return image

  4. Finally, develop the generator that will generate the batches:

    import numpy as np

    def custom_image_generator(images, batch_size = 128):

        

        while True:

            # Randomly select images for the batch

            batch_images = np.random.choice(images, size = batch_size)

            batch_input = []

            batch_output = []

            

            # Read image, perform preprocessing and get labels

            for file in batch_images:

                # Function that reads and returns the image

                input_image = get_input(file)

                # Function that gets the label of the image

                label = get_output(file)

                # Function that pre-processes and augments the image

                image = preprocess_input(input_image)

     

                batch_input.append(image)

                batch_output.append(label)

     

            batch_x = np.array(batch_input)

            batch_y = np.array(batch_output)

     

            # Return a tuple of (images,labels) to feed the network

            yield(batch_x, batch_y)
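
Note that `np.random.choice(images, size = batch_size)` samples with replacement by default, so within one epoch some images can appear more than once and others not at all. An alternative, sketched here under the assumption that you want every file visited once per pass, is to shuffle the file list and slice it. The name `epoch_batches` is hypothetical, and the sketch yields file names only; reading and preprocessing would proceed as in the generator above:

```python
import random

def epoch_batches(files, batch_size, rng=random):
    """Yield lists of file names; each pass visits every file exactly once."""
    files = list(files)          # copy so the caller's list is not reordered
    while True:
        rng.shuffle(files)       # new order for every pass over the data
        for start in range(0, len(files), batch_size):
            # The last batch of a pass may be shorter than batch_size
            yield files[start:start + batch_size]

# Usage with made-up file names: three batches cover all ten files
batches = epoch_batches(['dog.%d.jpg' % i for i in range(10)], batch_size=4)
first_pass = [next(batches) for _ in range(3)]
print([len(b) for b in first_pass])   # [4, 4, 2]
```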

  5. Next, we will read the validation data. Create a function to read the images and their labels:

    from tqdm import tqdm

    def get_data(files):

        data_image = []

        labels = []

        for image in tqdm(files):

            label_vector = get_output(image)

            

            img = Image.open(PATH + image)

            img = img.resize((SIZE,SIZE))

            

            labels.append(label_vector)

            img = np.asarray(img).reshape(SIZE,SIZE,CHANNELS)

            img = img/255.0

            data_image.append(img)

            

        data_x = np.array(data_image)

        data_y = np.array(labels)

            

        return (data_x, data_y)

  6. Read the validation files:

    import os

    import random

    files = os.listdir(PATH)

    random.shuffle(files)

    train = files[:7000]

    test = files[7000:]

    validation_data = get_data(test)

    Plot a few images from the dataset to see whether you loaded the files correctly:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(20,10))

    columns = 5

    for i in range(columns):

        plt.subplot(5 // columns + 1, columns, i + 1)

        plt.imshow(validation_data[0][i])

    A random sample of the images is shown here:

    Figure 8.16: Sample images from the loaded dataset
  7. Load the Inception model and pass the shape of the input images:

    from keras.applications.inception_v3 import InceptionV3

    base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(SIZE,SIZE,CHANNELS))

  8. Add the output dense layer according to our problem:

    from keras.layers import GlobalAveragePooling2D, Dense, Dropout

    from keras.models import Model

    x = base_model.output

    x = GlobalAveragePooling2D()(x)

    x = Dense(256, activation='relu')(x)

    x = Dropout(0.5)(x)

    predictions = Dense(2, activation='softmax')(x)

     

    model = Model(inputs=base_model.input, outputs=predictions)

  9. Next, compile the model to make it ready for training:

    model.compile(loss='categorical_crossentropy',

                  optimizer='adam',

                  metrics = ['accuracy'])

    Then train the model:

    EPOCHS = 50

    BATCH_SIZE = 128

     

    model_details = model.fit_generator(custom_image_generator(train, batch_size = BATCH_SIZE),

                        steps_per_epoch = len(train) // BATCH_SIZE,

                        epochs = EPOCHS,

                        validation_data= validation_data,

                        verbose=1)
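
With 7,000 training files and a batch size of 128, the `steps_per_epoch` expression works out as follows. This is a quick arithmetic check that uses only the values from the steps above:

```python
TRAIN_FILES = 7000   # len(train) after files[:7000]
BATCH_SIZE = 128

steps_per_epoch = TRAIN_FILES // BATCH_SIZE        # floor division
images_per_epoch = steps_per_epoch * BATCH_SIZE    # images drawn per epoch

print(steps_per_epoch)    # 54
print(images_per_epoch)   # 6912
```

Because the generator draws each batch at random, these 6,912 draws per epoch are not guaranteed to cover every training image.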

  10. Evaluate the model and get the accuracy:

    score = model.evaluate(validation_data[0], validation_data[1])

    print("Accuracy: {0:.2f}%".format(score[1]*100))

    The accuracy is as follows:

Figure 8.17: Model accuracy

Activity 22: Using Transfer Learning to Predict Images

Solution:

  1. First, set the random number seed so that the results are reproducible:

    from numpy.random import seed

    seed(1)

    from tensorflow import set_random_seed

    set_random_seed(1)
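
The split later in this activity uses `random.shuffle` from Python's standard library, which has its own generator and is not affected by the NumPy or TensorFlow seeds. A minimal sketch of making that shuffle repeatable as well; the helper `shuffled_copy` and the file names are made up for illustration:

```python
import random

def shuffled_copy(names, seed_value):
    """Return a shuffled copy of names using a dedicated, seeded generator."""
    rng = random.Random(seed_value)   # independent of the global random state
    names = list(names)
    rng.shuffle(names)
    return names

example_files = ['cat.0.jpg', 'cat.1.jpg', 'dog.0.jpg', 'dog.1.jpg']
print(shuffled_copy(example_files, 1) == shuffled_copy(example_files, 1))  # True
```

Since `os.listdir` returns names in arbitrary order, sorting the list before shuffling would also be needed for the split to be identical across runs.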

  2. Set SIZE and CHANNELS. SIZE is the side length, in pixels, of the square input image. CHANNELS is the number of channels in the training images; an RGB image has 3 channels.

    SIZE = 200

    CHANNELS = 3

  3. Create functions to get the images and their labels. Here, the PATH variable contains the path to the training dataset.

    from PIL import Image

    def get_input(file):

        return Image.open(PATH+file)

    def get_output(file):

        class_label = file.split('.')[0]

        if class_label == 'dog': label_vector = [1,0]

        elif class_label == 'cat': label_vector = [0,1]

        return label_vector

  4. Create a function to preprocess the images. This version only resizes and normalizes them; any augmentation would also go here:

    def preprocess_input(image):

        

        # Data preprocessing

        image = image.resize((SIZE,SIZE))

        image = np.array(image).reshape(SIZE,SIZE,CHANNELS)

        

        # Normalize image

        image = image/255.0

        

        return image

  5. Finally, create the generator that will generate the batches:

    import numpy as np

    def custom_image_generator(images, batch_size = 128):

        

        while True:

            # Randomly select images for the batch

            batch_images = np.random.choice(images, size = batch_size)

            batch_input = []

            batch_output = []

            

            # Read image, perform preprocessing and get labels

            for file in batch_images:

                # Function that reads and returns the image

                input_image = get_input(file)

                # Function that gets the label of the image

                label = get_output(file)

                # Function that pre-processes and augments the image

                image = preprocess_input(input_image)

     

                batch_input.append(image)

                batch_output.append(label)

     

            batch_x = np.array(batch_input)

            batch_y = np.array(batch_output)

     

            # Return a tuple of (images,labels) to feed the network

            yield(batch_x, batch_y)

  6. Next, we will read the development and test data. Create a function to read the images and their labels:

    from tqdm import tqdm

    def get_data(files):

        data_image = []

        labels = []

        for image in tqdm(files):

            

            label_vector = get_output(image)

            

     

            img = Image.open(PATH + image)

            img = img.resize((SIZE,SIZE))

            

           

            labels.append(label_vector)

            img = np.asarray(img).reshape(SIZE,SIZE,CHANNELS)

            img = img/255.0

            data_image.append(img)

            

        data_x = np.array(data_image)

        data_y = np.array(labels)

            

        return (data_x, data_y)

  7. Now read the development and test files. The split for the train/dev/test set is 70%/15%/15%.

    import os

    import random

    files = os.listdir(PATH)

    random.shuffle(files)

    train = files[:7000]

    development = files[7000:8500]

    test = files[8500:]

    development_data = get_data(development)

    test_data = get_data(test)
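
The hard-coded indices above give a 70%/15%/15% split only when PATH holds exactly 10,000 files. A sketch that derives the boundaries from the percentages instead; `split_files` is a hypothetical helper, and integer percentages are used to avoid floating-point rounding:

```python
def split_files(files, train_pct=70, dev_pct=15):
    """Split a list into train/dev/test; the remainder goes to test."""
    n = len(files)
    train_end = n * train_pct // 100
    dev_end = train_end + n * dev_pct // 100
    return files[:train_end], files[train_end:dev_end], files[dev_end:]

# With 10,000 names this reproduces the 7000 / 1500 / 1500 split
train_part, dev_part, test_part = split_files(list(range(10000)))
print(len(train_part), len(dev_part), len(test_part))   # 7000 1500 1500
```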

  8. Plot a few images from the dataset to see whether you loaded the files correctly:

    import matplotlib.pyplot as plt

    plt.figure(figsize=(20,10))

    columns = 5

    for i in range(columns):

        plt.subplot(5 // columns + 1, columns, i + 1)

        plt.imshow(development_data[0][i])

    Check the output in the following screenshot:

    Figure 8.18: Sample images from the loaded dataset
  9. Load the Inception model and pass the shape of the input images:

    from keras.applications.inception_v3 import InceptionV3

    base_model = InceptionV3(weights='imagenet', include_top=False, input_shape=(SIZE,SIZE,CHANNELS))

    Add the output dense layer according to our problem:

    from keras.models import Model

    from keras.layers import GlobalAveragePooling2D, Dense, Dropout

    x = base_model.output

    x = GlobalAveragePooling2D()(x)

    x = Dense(256, activation='relu')(x)

    keep_prob = 0.5

    x = Dropout(rate = 1 - keep_prob)(x)

    predictions = Dense(2, activation='softmax')(x)

     

    model = Model(inputs=base_model.input, outputs=predictions)

  10. This time around, we will freeze the first five layers of the model to help with the training time:

    for layer in base_model.layers[:5]:

        layer.trainable = False

  11. Compile the model to make it ready for training:

    model.compile(loss='categorical_crossentropy',

                  optimizer='adam',

                  metrics = ['accuracy'])

  12. Create callbacks for Keras:

    from keras.callbacks import ModelCheckpoint, ReduceLROnPlateau, EarlyStopping, TensorBoard

    callbacks = [

        TensorBoard(log_dir='./logs',

                    update_freq='epoch'),

        EarlyStopping(monitor = "val_loss",

                     patience = 18,

                     verbose = 1,

                     min_delta = 0.001,

                     mode = "min"),

        ReduceLROnPlateau(monitor = "val_loss",

                         factor = 0.2,

                         patience = 8,

                         verbose = 1,

                         mode = "min"),

        ModelCheckpoint(monitor = "val_loss",

                       filepath = "Dogs-vs-Cats-InceptionV3-{epoch:02d}-{val_loss:.2f}.hdf5",

                       save_best_only=True,

                       period = 1)]

    Note

    Here, we are making use of four callbacks: TensorBoard, EarlyStopping, ReduceLROnPlateau, and ModelCheckpoint.
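
The `filepath` passed to `ModelCheckpoint` is a format template that Keras fills in with the epoch number and the logged metrics each time it saves. The same template can be expanded with plain `str.format` to preview a file name; the epoch and loss values below are made up:

```python
template = "Dogs-vs-Cats-InceptionV3-{epoch:02d}-{val_loss:.2f}.hdf5"

# Illustrative values only, not results from the book's training run
name = template.format(epoch=7, val_loss=0.1834)
print(name)   # Dogs-vs-Cats-InceptionV3-07-0.18.hdf5
```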

    Train the model for at most 50 epochs with a batch size of 128; the EarlyStopping callback can end training sooner:

    EPOCHS = 50

    BATCH_SIZE = 128

    model_details = model.fit_generator(custom_image_generator(train, batch_size = BATCH_SIZE),

                       steps_per_epoch = len(train) // BATCH_SIZE,

                       epochs = EPOCHS,

                       callbacks = callbacks,

                       validation_data= development_data,

                       verbose=1)

    The training logs on TensorBoard are shown here:

    Figure 8.19: Training set logs from TensorBoard
  13. You can now fine-tune the hyperparameters, using the accuracy on the development set as the metric.

    The logs of the development set from the TensorBoard tool are shown here:

    Figure 8.20: Validation set logs from TensorBoard

    The learning rate decrease can be observed from the following plot:

    Figure 8.21: Learning rate log from TensorBoard
  14. Evaluate the model on the test set and get the accuracy:

    score = model.evaluate(test_data[0], test_data[1])

    print("Accuracy: {0:.2f}%".format(score[1]*100))

    The output is shown in the following screenshot:

Figure 8.22: The final accuracy of the model on the test set

As you can see, the model reaches an accuracy of 93.6% on the test set, slightly different from the 93.3% on the development set reported in the TensorBoard training logs. The early stopping callback stopped training once the loss on the development set was no longer improving significantly, which saved training time. The learning rate was reduced after nine epochs, which helped training, as can be seen here:

Figure 8.23: A snippet of the training logs of the model
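
The interplay of the two patience settings can be sketched in plain Python. The function below is a simplified stand-in for the callbacks' bookkeeping, not Keras's implementation: it ignores cooldown and the minimum learning rate, applies `min_delta` only to early stopping, and the loss values in the usage example are invented. The starting learning rate of 0.001 is the Keras default for Adam.

```python
def simulate_callbacks(val_losses, lr=0.001, factor=0.2, lr_patience=8,
                       stop_patience=18, min_delta=0.001):
    """Return (epochs_run, final_lr, epochs_where_lr_dropped)."""
    best_stop = best_lr = float('inf')
    stop_wait = lr_wait = 0
    drops = []
    for epoch, loss in enumerate(val_losses, start=1):
        # Early stopping: only a gain larger than min_delta resets the counter
        if loss < best_stop - min_delta:
            best_stop, stop_wait = loss, 0
        else:
            stop_wait += 1
        # Learning-rate schedule: any new best resets its counter
        if loss < best_lr:
            best_lr, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr, lr_wait = lr * factor, 0
                drops.append(epoch)
        if stop_wait >= stop_patience:
            return epoch, lr, drops
    return len(val_losses), lr, drops

# Invented losses: two improving epochs, then a long plateau
epochs_run, final_lr, drops = simulate_callbacks([0.5, 0.4] + [0.45] * 30)
print(epochs_run, drops)   # 20 [10, 18]
```

In this toy run the learning rate is cut twice during the plateau before early stopping ends training, which is the ordering the patience values of 8 and 18 are meant to produce.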