Data Science with Python

By: Rohan Chopra, Aaron England, Mohamed Noordeen Alaudeen

Overview of this book

Data Science with Python begins by introducing you to data science and teaches you to install the packages you need to create a data science coding environment. You will learn three major techniques in machine learning: unsupervised learning, supervised learning, and reinforcement learning. You will also explore basic classification and regression techniques, such as support vector machines, decision trees, and logistic regression. As you make your way through the book, you will understand the basic functions, data structures, and syntax of the Python language that are used to handle large datasets with ease. You will learn about NumPy and pandas libraries for matrix calculations and data manipulation, discover how to use Matplotlib to create highly customizable visualizations, and apply the boosting algorithm XGBoost to make predictions. In the concluding chapters, you will explore convolutional neural networks (CNNs), deep learning algorithms used to predict what is in an image. You will also understand how to feed human sentences to a neural network, make the model process contextual information, and create human language processing systems to predict the outcome. By the end of this book, you will be able to understand and implement any new data science algorithm and have the confidence to experiment with tools or libraries other than those covered in the book.

Chapter 7: Processing Human Language

Activity 19: Predicting Sentiments of Movie Reviews

Solution:

  1. Read the IMDB movie review dataset using pandas in Python:

    import pandas as pd

    data = pd.read_csv('../../chapter 7/data/movie_reviews.csv', encoding='latin-1')

  2. Convert the reviews to lowercase to reduce the number of unique words:

    data.text = data.text.str.lower()

    Note

    Keep in mind that "Hello" and "hello" are not the same to a computer.

  3. Clean the reviews using RegEx with the clean_str function:

    import re

    def clean_str(string):

        

        string = re.sub(r"https?\://\S+", '', string)

        string = re.sub(r'\<a href', ' ', string)

        string = re.sub(r'&amp;', '', string)

        string = re.sub(r'<br />', ' ', string)

        string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)

        string = re.sub('\d','', string)

        string = re.sub(r"can\'t", "cannot", string)

        string = re.sub(r"it\'s", "it is", string)

        return string

    data.SentimentText = data.SentimentText.apply(lambda x: clean_str(str(x)))
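    To see what the cleaning does, you can run clean_str on a small made-up review (the string below is purely illustrative and not from the dataset):

    # Illustrative string only: URLs, HTML tags, punctuation, and digits are stripped,
    # and "it's" is expanded to "it is"
    sample = "check out https://example.com <br /> it's great, 10/10!"
    print(clean_str(sample))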

  4. Next, remove stop words and other frequently occurring unnecessary words from the reviews:

    Note

    To see how we found these words, refer to Exercise 51.

  5. This step converts the review strings into tokens, dropping stop words along the way (the tokens will be used in the next step):

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize,sent_tokenize

    stop_words = stopwords.words('english') + ['movie', 'film', 'time']

    stop_words = set(stop_words)

    remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

    data['SentimentText'] = data['SentimentText'].apply(remove_stop_words)
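    The NLTK tokenizer models and stop word list need to be available locally; if you have not used NLTK before, you may need to download them once:

    import nltk
    nltk.download('punkt')      # tokenizer models used by word_tokenize and sent_tokenize
    nltk.download('stopwords')  # the English stop word list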

  6. Create the word embedding of the reviews with the tokens created in the previous step. Here, we will use gensim's Word2Vec to create these embedding vectors:

    from gensim.models import Word2Vec

    model = Word2Vec(

            data['SentimentText'].apply(lambda x: x[0]),

            iter=10,

            size=16,

            window=5,

            min_count=5,

            workers=10)

    model.wv.save_word2vec_format('movie_embedding.txt', binary=False)
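    As a quick sanity check of the learned embedding, you can look up the nearest neighbours of a common word ('good' is just an example; any word frequent enough to pass the min_count filter will do). Also note that these parameter names are for gensim 3.x; from gensim 4.0 onwards, size and iter are called vector_size and epochs.

    # Words whose vectors are closest to 'good' (example query; assumes 'good' is in the vocabulary)
    print(model.wv.most_similar('good', topn=5))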

  7. Combine the tokens to get a string and then drop any review that does not have anything in it after stop word removal:

    import numpy as np

    def combine_text(text):

        try:

            return ' '.join(text[0])

        except:

            return np.nan

     

    data.SentimentText = data.SentimentText.apply(lambda x: combine_text(x))

    data = data.dropna(how='any')

  8. Tokenize the reviews using the Keras Tokenizer and convert them into numbers:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=5000)

    tokenizer.fit_on_texts(list(data['SentimentText']))

    sequences = tokenizer.texts_to_sequences(data['SentimentText'])

    word_index = tokenizer.word_index

  9. Finally, pad the reviews so that each one has exactly 100 tokens. By default, pad_sequences trims words from the beginning of longer reviews and adds 0s at the beginning of shorter ones:

    from keras.preprocessing.sequence import pad_sequences

    reviews = pad_sequences(sequences, maxlen=100)
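    A tiny illustration of what pad_sequences does with its default settings, padding and truncating at the start of each sequence:

    # Shorter sequences are left-padded with 0s; longer ones lose tokens from the front
    print(pad_sequences([[3, 7, 9]], maxlen=5))            # [[0 0 3 7 9]]
    print(pad_sequences([[1, 2, 3, 4, 5, 6]], maxlen=5))   # [[2 3 4 5 6]]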

  10. Load the created embedding to get the embedding matrix using the load_embedding function discussed in the Text Processing section:

    import numpy as np

     

    def load_embedding(filename, word_index , num_words, embedding_dim):

        embeddings_index = {}

        file = open(filename, encoding="utf-8")

        for line in file:

            values = line.split()

            word = values[0]

            coef = np.asarray(values[1:])

            embeddings_index[word] = coef

        file.close()

        

        embedding_matrix = np.zeros((num_words, embedding_dim))

        for word, pos in word_index.items():

            if pos >= num_words:

                continue

            embedding_vector = embeddings_index.get(word)

            if embedding_vector is not None:

                embedding_matrix[pos] = embedding_vector

        return embedding_matrix

     

    embedding_matrix = load_embedding('movie_embedding.txt', word_index, len(word_index), 16)
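    It is worth confirming that the matrix has the expected shape, one 16-dimensional row per word in the tokenizer's vocabulary (rows for words that Word2Vec dropped because of min_count stay at zero):

    # Should print (len(word_index), 16)
    print(embedding_matrix.shape)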

  11. Convert the labels into one-hot vectors using pandas' get_dummies function and split the dataset into training and testing sets with an 80:20 split:

    from sklearn.model_selection import train_test_split

    labels = pd.get_dummies(data.Sentiment)

    X_train, X_test, y_train, y_test = train_test_split(reviews,labels, test_size=0.2, random_state=9)
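    For reference, get_dummies simply turns the single sentiment column into one column per class; a minimal illustration on a made-up series:

    # A 0/1 sentiment series becomes two one-hot columns
    print(pd.get_dummies(pd.Series([0, 1, 1, 0])))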

  12. Create the neural network model, starting with the input and embedding layers. The embedding layer converts the input words into their embedding vectors:

    from keras.layers import Input, Dense, Dropout, BatchNormalization, Embedding, Flatten

    from keras.models import Model

    inp = Input((100,))

    embedding_layer = Embedding(len(word_index),

                        16,

                        weights=[embedding_matrix],

                        input_length=100,

                        trainable=False)(inp)

  13. Create the rest of the fully connected neural network using Keras:

    model = Flatten()(embedding_layer)

    model = BatchNormalization()(model)

    model = Dropout(0.10)(model)

    model = Dense(units=1024, activation='relu')(model)

    model = Dense(units=256, activation='relu')(model)

    model = Dropout(0.5)(model)

    predictions = Dense(units=2, activation='softmax')(model)

    model = Model(inputs = inp, outputs = predictions)
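    Before compiling, you can print a summary to verify the layer shapes and confirm that the embedding weights are frozen (they appear under non-trainable parameters):

    # Prints each layer with its output shape and parameter count
    model.summary()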

  14. Compile and train the model for 10 epochs. You can modify the model and the hyperparameters to try and get a better accuracy:

    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])

    model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)

  15. Calculate the accuracy of the model on the test set to see how well our model performs on previously unseen data by using the following:

    from sklearn.metrics import accuracy_score

    preds = model.predict(X_test)

    accuracy_score(np.argmax(preds, 1), np.argmax(y_test.values, 1))

    The accuracy of the model is:

    Figure 7.39: Model accuracy
  16. Plot the confusion matrix of the model to get a proper sense of the model's prediction:

    y_actual = pd.Series(np.argmax(y_test.values, axis=1), name='Actual')

    y_pred = pd.Series(np.argmax(preds, axis=1), name='Predicted')

    pd.crosstab(y_actual, y_pred, margins=True)

    Check the following output:

    Figure 7.40: Confusion matrix of the model (0 = negative sentiment, 1 = positive sentiment)
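    If you prefer scikit-learn, the same table can be produced with confusion_matrix; this is just an equivalent alternative to the crosstab above:

    from sklearn.metrics import confusion_matrix
    # Rows are actual classes, columns are predicted classes
    print(confusion_matrix(y_actual, y_pred))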
  17. Check the performance of the model by looking at the sentiment prediction for a random review, using the following code:

    review_num = 111

    print("Review: \n"+tokenizer.sequences_to_texts([X_test[review_num]])[0])

    sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"

    print("\nPredicted sentiment = "+ sentiment)

    sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"

    print("\nActual sentiment = "+ sentiment)

    Check that you receive the following output:

Figure 7.41: A review from the IMDB dataset

Activity 20: Predicting Sentiments from Tweets

Solution:

  1. Read the tweet dataset using pandas and rename the columns as shown in the following code:

    import pandas as pd

    data = pd.read_csv('tweet-data.csv', encoding='latin-1', header=None)

    data.columns = ['sentiment', 'id', 'date', 'q', 'user', 'text']

  2. Drop the following columns, as we won't be using them here. You can analyze and use them later if you want to try to improve the accuracy:

    data = data.drop(['id', 'date', 'q', 'user'], axis=1)

  3. We perform this activity only on a subset (400,000 tweets) of the data to save time. If you want, you can work on the whole dataset:

    data = data.sample(400000).reset_index(drop=True)

  4. Convert the tweets to lowercase to reduce the number of unique words. Keep in mind that "Hello" and "hello" are not the same to a computer:

    data.text = data.text.str.lower()

  5. Clean the tweets using the clean_str function:

    import re

    def clean_str(string):

        string = re.sub(r"https?\://\S+", '', string)

        string = re.sub(r"@\w*\s", '', string)

        string = re.sub(r'\<a href', ' ', string)

        string = re.sub(r'&amp;', '', string)

        string = re.sub(r'<br />', ' ', string)

        string = re.sub(r'[_"\-;%()|+&=*%.,!?:#$@\[\]/]', ' ', string)

        string = re.sub('\d','', string)

        return string

     

    data.text = data.text.apply(lambda x: clean_str(str(x)))
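    The only addition compared to the clean_str function from Activity 19 is the pattern that strips Twitter handles; a quick illustration on a made-up tweet:

    # '@\w*\s' removes a handle together with the space that follows it
    print(re.sub(r"@\w*\s", '', "@some_user thanks for the follow"))  # thanks for the follow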

  6. Remove all the stop words from the tweets, as was done in the Text Preprocessing section:

    from nltk.corpus import stopwords

    from nltk.tokenize import word_tokenize,sent_tokenize

    stop_words = stopwords.words('english')

    stop_words = set(stop_words)

    remove_stop_words = lambda r: [[word for word in word_tokenize(sente) if word not in stop_words] for sente in sent_tokenize(r)]

    data['text'] = data['text'].apply(remove_stop_words)

     

    import numpy as np

    def combine_text(text):

        try:

            return ' '.join(text[0])

        except:

            return np.nan

     

    data.text = data.text.apply(lambda x: combine_text(x))

     

    data = data.dropna(how='any')

  7. Tokenize the tweets and convert them to numbers using the Keras Tokenizer:

    from keras.preprocessing.text import Tokenizer

    tokenizer = Tokenizer(num_words=5000)

    tokenizer.fit_on_texts(list(data['text']))

    sequences = tokenizer.texts_to_sequences(data['text'])

    word_index = tokenizer.word_index

  8. Finally, pad the tweets so that each one has exactly 50 tokens. By default, pad_sequences trims words from the beginning of longer tweets and adds 0s at the beginning of shorter ones:

    from keras.preprocessing.sequence import pad_sequences

    tweets = pad_sequences(sequences, maxlen=50)

  9. Create the embedding matrix from the GloVe embedding file that we downloaded, using the load_embedding function:

    import numpy as np

    def load_embedding(filename, word_index , num_words, embedding_dim):

        embeddings_index = {}

        file = open(filename, encoding="utf-8")

        for line in file:

            values = line.split()

            word = values[0]

            coef = np.asarray(values[1:])

            embeddings_index[word] = coef

        file.close()

        

        embedding_matrix = np.zeros((num_words, embedding_dim))

        for word, pos in word_index.items():

            if pos >= num_words:

                continue

            embedding_vector = embeddings_index.get(word)

            if embedding_vector is not None:

                embedding_matrix[pos] = embedding_vector

        return embedding_matrix

     

    embedding_matrix = load_embedding('../../embedding/glove.twitter.27B.50d.txt', word_index, len(word_index), 50)
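    If you are unsure which GloVe file you have, you can confirm that its vectors really are 50-dimensional by inspecting the first line of the file:

    # Each line is a word followed by its vector; this should print 50
    with open('../../embedding/glove.twitter.27B.50d.txt', encoding='utf-8') as f:
        first_line = f.readline().split()
    print(len(first_line) - 1)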

  10. Split the dataset into training and testing sets with an 80:20 split. You can experiment with different splits:

    from sklearn.model_selection import train_test_split

    X_train, X_test, y_train, y_test = train_test_split(tweets, pd.get_dummies(data.sentiment), test_size=0.2, random_state=9)

  11. Create the LSTM model that will predict the sentiment. You can modify this to create your own neural network:

    from keras.models import Sequential

    from keras.layers import Dense, Dropout, BatchNormalization, Embedding, Flatten, LSTM

    embedding_layer = Embedding(len(word_index),

                               50,

                               weights=[embedding_matrix],

                               input_length=50,

                                trainable=False)

    model = Sequential()

    model.add(embedding_layer)

    model.add(Dropout(0.5))

    model.add(LSTM(100, dropout=0.2))

    model.add(Dense(2, activation='softmax'))

     

    model.compile(loss='binary_crossentropy', optimizer='sgd', metrics = ['acc'])

  12. Train the model. Here, we train it only for 10 epochs. You can increase the number of epochs to try and get a better accuracy:

    model.fit(X_train, y_train, validation_data = (X_test, y_test), epochs=10, batch_size=256)
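    If you also want a single accuracy number on the test set, as computed in Activity 19, a short sketch:

    from sklearn.metrics import accuracy_score
    import numpy as np
    preds = model.predict(X_test)
    print(accuracy_score(np.argmax(preds, 1), np.argmax(y_test.values, 1)))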

  13. Check how well the model is performing by predicting the sentiment of a few tweets in the test set:

    preds = model.predict(X_test)

    review_num = 1

    print("Tweet: \n"+tokenizer.sequences_to_texts([X_test[review_num]])[0])

    sentiment = "Positive" if np.argmax(preds[review_num]) else "Negative"

    print("\nPredicted sentiment = "+ sentiment)

    sentiment = "Positive" if np.argmax(y_test.values[review_num]) else "Negative"

    print("\nActual sentiment = "+ sentiment)

    The output is as follows:

Figure 7.42: Positive (left) and negative (right) tweets and their predictions