Book Image

Machine Learning Techniques for Text

By : Nikos Tsourakis
Book Image

Machine Learning Techniques for Text

By: Nikos Tsourakis

Overview of this book

With the ever-increasing demand for machine learning and programming professionals, it's prime time to invest in the field. This book will help you in this endeavor, focusing specifically on text data and human language by steering a middle path among the various textbooks that present complicated theoretical concepts or focus disproportionately on Python code. A good metaphor this work builds upon is the relationship between an experienced craftsperson and their trainee. Based on the current problem, the former picks a tool from the toolbox, explains its utility, and puts it into action. This approach will help you to identify at least one practical use for each method or technique presented. The content unfolds in ten chapters, each discussing one specific case study. For this reason, the book is solution-oriented. It's accompanied by Python code in the form of Jupyter notebooks to help you obtain hands-on experience. A recurring pattern in the chapters of this book is helping you get some intuition on the data and then implement and contrast various solutions. By the end of this book, you'll be able to understand and apply various techniques with Python for text preprocessing, text representation, dimensionality reduction, machine learning, language modeling, visualization, and evaluation.
Table of Contents (13 chapters)

Performing classification

Up until this point, we have learned how to represent and preprocess text data. It’s time to make use of this knowledge and create the spam classifier. First, we put all the pieces together using a publicly available corpus. Before we proceed to the training of the classifier, we need to follow a series of typical steps that include the following:

  1. Getting the data
  2. Splitting it into a training and test set
  3. Preprocessing its content
  4. Extracting the features

Let’s examine each step one by one.

Getting the data

The SpamAssassin public mail corpus ( is a selection of email messages suitable for developing spam filtering systems. It offers two variants for the messages, either in plain text or HTML formatting. For simplicity, we will use only the first type in this exercise. Parsing HTML text requires special handling – for example, implementing your own regexps! The term coined to describe the opposite of spam emails is ham since the two words are related to meat products (spam refers to canned ham). The dataset contains various examples divided into different folders according to their complexity. This exercise uses the files within these two folders: spam_2 ( for spam and hard_ham ( for ham.

Creating the train and test sets

Initially, we read the messages for the two categories (ham and spam) and split them into training and testing groups. As a rule of thumb, we can choose a 75:25 ( split between the two sets, attributing a more significant proportion to the training data. Note that other ratios might be preferable depending on the size of the dataset. Especially for massive corpora (with millions of labeled instances), we can create test sets with just 1% of the data, which is still a significant number. To clarify this process, we divide the code into the following steps:

  1. First, we load the ham and spam datasets using the train_test_split method. This method controls the size of the training and test sets for each case and the samples that they include:
    import email
    import glob
    import numpy as np
    from operator import is_not
    from functools import partial
    from sklearn.model_selection import train_test_split
    # Load the path for each email file for both categories.
    ham_files = train_test_split(glob.glob('./data/20030228_hard_ham/hard_ham/*'), random_state=123)
    spam_files = train_test_split(glob.glob('./data/20050311_spam_2/spam_2/*'), random_state=123)
  2. Next, we read the content of each email and keep the ones without HTML formatting:
    # Method for getting the content of an email.
    def get_content(filepath):
        file = open(filepath, encoding='latin1')
        message = email.message_from_file(file)
        for msg_part in message.walk():
            # Keep only messages with text/plain content.
            if msg_part.get_content_type() == 'text/plain':
                return msg_part.get_payload()
    # Get the training and testing data.
    ham_train_data = [get_content(i) for i in ham_files[0]]
    ham_test_data = [get_content(i) for i in ham_files[1]]
    spam_train_data = [get_content(i) for i in spam_files[0]]
    spam_test_data = [get_content(i) for i in spam_files[1]]
  3. For our analysis, we exclude emails with empty content. The filter method with None as the first argument removes any element that includes an empty string. Then, the filtered output is used to construct a new list using list:
    # Keep emails with non-empty content.
    ham_train_data = list(filter(None, ham_train_data))
    ham_test_data = list(filter(None, ham_test_data))
    spam_train_data = list(filter(None, spam_train_data))
    spam_test_data = list(filter(None, spam_test_data))
  4. Now, let’s merge the spam and ham training sets into one (do the same for their test sets):
    # Merge the train/test files for both categories.
    train_data = np.concatenate((ham_train_data, spam_train_data))
    test_data = np.concatenate((ham_test_data, spam_test_data))
  5. Finally, we assign a class label for each of the two categories (ham and spam) and merge them into common training and test sets:
    # Assign a class for each email (ham = 0, spam = 1).
    ham_train_class = [0]*len(ham_train_data)
    ham_test_class = [0]*len(ham_test_data)
    spam_train_class = [1]*len(spam_train_data)
    spam_test_class = [1]*len(spam_test_data)
    # Merge the train/test classes for both categories.
    train_class = np.concatenate((ham_train_class, spam_train_class))
    test_class = np.concatenate((ham_test_class, spam_test_class))

Notice that in step 1, we also pass random_state in the train_test_split method to make all subsequent results reproducible. Otherwise, the method performs a different data shuffling in each run and produces random splits for the sets.

In this section, we have learned how to read text data from a set of files and keep the information that makes sense for the problem under study. After this point, the datasets are suitable for the next processing phase.

Preprocessing the data

It’s about time to use the typical data preprocessing techniques that we learned earlier. These include tokenization, stop word removal, and lemmatization. Let’s examine the steps one by one:

  1. First, let’s tokenize the train or test data:
    from nltk.stem import WordNetLemmatizer
    from nltk.tokenize import word_tokenize
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS
    # Tokenize the train/test data.
    train_data = [word_tokenize(i) for i in train_data]
    test_data = [word_tokenize(i) for i in test_data]
  2. Next, we remove the stop words by iterating over the input examples:
    # Method for removing the stop words.
    def remove_stop_words(input):
        result = [i for i in input if i not in ENGLISH_STOP_WORDS]
        return result
    # Remove the stop words.
    train_data = [remove_stop_words(i) for i in train_data]
    test_data = [remove_stop_words(i) for i in test_data]
  3. Now, we create the lemmatizer and apply it to the words:
    # Create the lemmatizer.
    lemmatizer = WordNetLemmatizer()
    # Method for lemmatizing the text.
    def lemmatize_text(input):
        return [lemmatizer.lemmatize(i) for i in input]
    # Lemmatize the text.
    train_data = [lemmatize_text(i) for i in train_data]
    test_data = [lemmatize_text(i) for i in test_data]
  4. Finally, we reconstruct the data in the two sets by joining the words separated by a space and return the concatenated string:
    # Reconstruct the data.
    train_data = [" ".join(i) for i in train_data]
    test_data = [" ".join(i) for i in test_data]

As a result, we have at our disposal two Python lists, namely train_data and test_data, containing the initial text data in a processed form suitable for proceeding to the next phase.

Extracting the features

We continue with the extraction of the features of each sentence in the previously created datasets. This step uses tf-idf vectorization after training the vectorizer with the training data. There is a problem though, as the vocabulary in the training and test sets might differ. In this case, the vectorizer ignores unknown words, and depending on the mismatch level, we might get suboptimal representations for the test set. Hopefully, as more data is added to any corpus, the mismatch becomes smaller, so ignoring a few words has a negligible practical impact. An obvious question is – why not train the vectorizer with the whole corpus before the split? However, this engenders the risk of getting performance measures that are too optimistic later in the pipeline, as the model has seen the test data at some point. As a rule of thumb, always keep the test set separate and only use it to evaluate the model.

In the code that follows, we vectorize the data in both the training and test sets using tf-idf:

from sklearn.feature_extraction.text import TfidfVectorizer
# Create the vectorizer.
vectorizer = TfidfVectorizer()
# Fit with the train data.
# Transform the test/train data into features.
train_data_features = vectorizer.transform(train_data)
test_data_features = vectorizer.transform(test_data)

Now, the training and test sets are transformed from sequences of words to numerical vectors. From this point on, we can apply any sophisticated algorithm we wish, and guess what? This is what we are going to do in the following section!

Introducing the Support Vector Machines algorithm

It’s about time that we train the first classification model. One of the most well-known supervised learning algorithms is the Support Vector Machines (SVM) algorithm. We could dedicate a whole book to demystifying this family of methods, so we will visit a few key elements in this section. First, recall from Chapter 1, Introducing Machine Learning for Text, that any ML algorithm creates decision boundaries to classify new data correctly. As we cannot sketch high-dimensional spaces, we will consider a two-dimensional example. Hopefully, this provides some of the intuition behind the algorithm.

Figure 2.9 shows the data points for the spam and ham instances along with two features, namely and :

Figure 2.9 – Classification of spam and ham emails using the SVM

Figure 2.9 – Classification of spam and ham emails using the SVM

The line in the middle separates the two classes and the dotted lines represent the borders of the margin. The SVM has a twofold aim to find the optimal separation line (a one-dimensional hyperplane) while maximizing the margin. Generally, for an n-dimensional space, the hyperplane has (n-1) dimensions. Let’s examine our space size in the code that follows:

>> (670, 28337)

Each of the 670 emails in the training set is represented by a feature vector with a size of 28337 (the number of unique words in the corpus). In this sparse vector, the non-zero values signify the tf-idf weights for the words. For the SVM, the feature vector is a point in a 28,337-dimensional space, and the problem is to find a 28,336-dimensional hyperplane to separate those points. One crucial consideration within the SVM is that not all the data points contribute equally to finding the optimal hyperplane, but mostly those close to the margin boundaries (depicted with a dotted circle in Figure 2.9). These are called support vectors, and if they are removed, the position of the dividing hyperplane alters. For this reason, we consider them the critical part of the dataset.

The general equation of a line in two-dimensional space is expressed with the formula . Three examples are shown in Figure 2.10:

Figure 2.10 – Examples of line equations

Figure 2.10 – Examples of line equations

In the same sense, the middle line in Figure 2.9 and the two margin boundaries have the following equations, respectively:

, and

Defining the vectors and , we can rewrite the previous equations using the dot product that we encountered in the Calculating vector similarity section:

, , and

To find the best hyperplane, we need to estimate w and b, referred to as the weight and bias in ML terminology. Among the infinite number of lines that separate the two classes, we need to find the one that maximizes the margin, the distance between the closest data points of the two classes. In general, the distance between two parallel lines and is equal to the following:

So, the distance between the two margin boundaries is the following:

To maximize the previous equation, we need to minimize the denominator, namely the quantity that represents the Euclidean norm of vector w. The technique for finding w and b is beyond the scope of this book, but we need to define and solve a function that penalizes any misclassified examples within the margin. After this point, we can classify a new example, , based on the following model (the sign function takes any number as input and returns +1 if it is positive or -1 if it is negative):

The example we are considering in Figure 2.9 represents an ideal situation. The data points are arranged in the two-dimensional space in such a way that makes them linearly separable. However, this is often not the case, and the SVM incorporates kernel functions to cope with nonlinear classification. Describing the mechanics of these functions further is beyond the scope of the current book. Notice that different kernel functions are available, and as in all ML problems, we have to experiment to find the most efficient option in any case. But before using the algorithm, we have to consider two important issues to understand the SVM algorithm better.

Adjusting the hyperparameters

Suppose two decision boundaries (straight lines) can separate the data in Figure 2.11.

Figure 2.11 – Two possible decision boundaries for the data points using the SVM

Figure 2.11 – Two possible decision boundaries for the data points using the SVM

Which one of those boundaries do you think works better? The answer, as in most similar questions, is that it depends! For example, the line in the left plot has a higher classification error, as one opaque dot resides on the wrong side. On the other hand, the margin in the right plot (distance between the dotted lines) is small, and therefore the model lacks generalization. In most cases, we can tolerate a small number of misclassifications during SVM training in favor of a hyperplane with a significant margin. This technique is called soft margin classification and allows some samples to be on the wrong side of the margin.

The SVM algorithm permits the adjustment of the training accuracy versus generalization tradeoff using the hyperparameter C. A frequent point of confusion is that hyperparameters and model parameters are the same things, but this is not true. Hyperparameters are parameters whose values are used to control the learning process. On the other hand, model parameters update their value in every training iteration until we obtain a good classification model. We can direct the SVM to create the most efficient model for each problem by adjusting the hyperparameter C. The left plot of Figure 2.11 is related to a lower C value compared to the one on the right.

Let’s look at another dataset for which two possible decision boundaries exist (Figure 2.12). Which one seems to work better this time?

Figure 2.12 – Generalization (left) versus overfitting (right)

Figure 2.12 – Generalization (left) versus overfitting (right)

At first glance, the curved line in the plot on the right perfectly separates the data into two classes. But there is a problem. Getting too specific boundaries entails the risk of overfitting, where your model learns the training data perfectly but fails to classify a slightly different example correctly.

Consider the following real-world analogy of overfitting. Most of us have grown in a certain cultural context, trained (overfitted) to interpret social signals such as body posture, facial expressions, or voice tone in a certain way. As a result, when socializing with people of diverse backgrounds, we might fail to interpret similar social signals correctly (generalization).

Besides the hyperparameter C that can prevent overfitting, we can combine it with gamma, which defines how far the influence of a single training example reaches. In this way, the curvature of the decision boundary can also be affected by points that are pretty far from it. Low gamma values signify far reach of the influence, while high values cause the opposite effect. For example, in Figure 2.12 (right), the points inside a dotted circle have more weight, causing the line’s intense curvature around them. In this case, the hyperparameter gamma has a higher value than the left plot.

The takeaway here is that both C and gamma hyperparameters help us create more efficient models but identifying their best values demands experimentation. Equipped with the basic theoretical foundation, we are ready to incorporate the algorithm!

Putting the SVM into action

In the following Python code, we use a specific implementation of the SVM algorithm, the C-Support Vector Classification. By default, it uses the Radial Basis Function (RBF) kernel:

from sklearn import svm
# Create the classifier.
svm_classifier = svm.SVC(kernel="rbf", C=1.0, gamma=1.0, probability=True)
# Fit the classifier with the train data., train_class)
# Get the classification score of the train data.
svm_classifier.score(train_data_features.toarray(), train_class)
>> 0.9970149253731343

Now, use the test set:

# Get the classification score of the test data.
svm_classifier.score(test_data_features.toarray(), test_class)
>> 0.8755760368663594

Observe the classifier’s argument list, including the kernel and the two hyperparameters, gamma and C. Then, we evaluate its performance for both the training and test sets. We are primarily interested in the second result, as it quantifies the accuracy of our model on unseen data – essentially, how well it generalizes. On the other hand, the performance on the training set indicates how well our model has learned from the training data. In the first case, the accuracy is almost 100%, whereas, for unseen data, it is around 88%.

Equipped with the necessary understanding, you can rerun the preceding code and experiment with different values for the hyperparameters; typically, 0.1 < C < 100 and 0.0001 < gamma < 10. In the following section, we present another classification algorithm based on a fundamental theorem in ML.

Understanding Bayes’ theorem

Imagine a pool of 100 shapes (including squares, triangles, circles, and rhombuses). These can have either an opaque or a transparent fill. If you pick a circle shape from the pool, what is the probability of it having a transparent fill? Looking at Figure 2.13, we are interested to know which shapes in set A (circles) are also members of set B (all transparent shapes):

Figure 2.13 – The intersection of the set with circles with the set of all transparent shapes

Figure 2.13 – The intersection of the set with circles with the set of all transparent shapes

The intersection of the two sets is highlighted with the grid texture and mathematically written as AB or BA. Then, we construct Table 2.6, which helps us perform some interesting calculations:

Table 2.6 – The number of shapes by type and fill

Table 2.6 – The number of shapes by type and fill

First, notice that the total number of items in the table equals 100 (the number of shapes). We can then calculate the following quantities:

  • The probability of getting a circle is
  • The probability of getting a transparent fill is
  • The probability of getting a transparent fill when the shape is a circle is
  • The probability of getting a circle when the fill is transparent is

The symbol signifies conditional probability. Based on these numbers, we can identify a relationship between the probabilities:

The previous equation suggests that if the probability P(transparent|circle) is unknown, we can use the others to calculate it. We can also generalize the equation as follows:

Exploring the different elements, we reach the famous formula known as Bayes’ theorem, which is fundamental in information theory and ML:

The exercise in Table 2.6 was just an example to introduce the fundamental reasoning behind the theorem; all quantities are available to calculate the corresponding probabilities. However, this is not the case in most practical problems, and this is where Bayes’ theorem comes in handy. For example, consider the following situation: you are concerned that you have a severe illness and decide to go to the hospital for a test. Sadly, the test is positive. According to your doctor, the test has 99% reliability; for 99 out of 100 sick people, the test is positive, and for 99 out of 100 healthy people, the test is negative. So, what is the probability of you having the disease? Most people logically answer 99%, but this is not true. The Bayesian reasoning tells us why.

We have a population of 10,000 people, and according to statistics, 1% (so, 100 people) of this population has the illness. Therefore, based on a 99% reliability, this is what we know:

  • Of the 100 sick subjects, 99 times (99%), the test is positive, and 1 is negative (1% and therefore wrong).
  • Of the 9,900 healthy subjects, 99 times (1%), the test is positive, and 9,801 is negative (99% and therefore correct).

As before, we construct Table 2.7, which helps us with the calculations:

Table 2.7 – The number of people by health condition and test outcome

Table 2.7 – The number of people by health condition and test outcome

At first glance, the numbers suggest that there is a non-negligible likelihood that you are healthy and that the test is wrong (99/10000≈1%). Next, we are looking for the probability of being sick given a positive test. So, let’s see the following:

This percentage is much smaller than the 99% initially suggested by most people. What is the catch here? The probability P(sick) in the equation is not something we know exactly, as in the case of the shapes in the previous example. It’s merely our best estimate on the problem, which is called prior probability. The knowledge we have before delving into the problem is the hardest part of the equation to figure out.

Conversely, the probability P(sick|positive test) represents our knowledge after solving the problem, which is called posterior probability. If, for example, you retook the test and this showed to be positive, the previous posterior probability becomes your new prior one – the new posterior increases, which makes sense. Specifically, you did two tests, and both were positive.


Bayesian reasoning tells us how to update our prior beliefs in light of new evidence. As new evidence surfaces, your predictions become better and better.

Remember this discussion the next time you read an article on the internet about an illness! Of course, you always need to interpret the percentages in the proper context. But let’s return to the spam filtering problem and see how to apply the theorem in this case.

Introducing the Naïve Bayes algorithm

Naïve Bayes is a classification algorithm based on Bayes’ theorem. We already know that the features in the emails are the actual words, so we are interested in calculating each posterior probability P(spam|word) with the help of the following theorem:

where .

Flipping a coin once gives a ½ probability of getting tails. The probability of getting tails two consecutive times is ¼, as P(tail first time)P(tail second time)=(½)(½)=¼. Thus, repeating the previous calculation for each word in the email (N words in total), we just need to multiply their individual probabilities:

As with the SVM, it is straightforward to incorporate Naïve Bayes using sklearn. In the following code, we use the algorithm’s MultinomialNB implementation to suit the discrete values (word counts) used as features better:

from sklearn import naive_bayes
# Create the classifier.
nb_classifier = naive_bayes.MultinomialNB(alpha=1.0)
# Fit the classifier with the train data., train_class)
# Get the classification score of the train data.
nb_classifier.score(train_data_features.toarray(), train_class)
>> 0.8641791044776119

Next, we incorporate the test set:

# Get the classification score of the test data.
nb_classifier.score(test_data_features.toarray(), test_class)
>> 0.8571428571428571

The outcome suggests that the performance of this classifier is inferior. Also, notice the result on the actual training set, which is low and very close to the performance on the test set. These numbers are another indication that the created model is not working very well.

Clarifying important key points

Before concluding this section, we must clarify some points concerning the Naïve Bayes algorithm. First of all, it assumes that a particular feature in the data is unrelated to the presence of any other feature. In our case, the assumption is that the words in an email are conditionally independent of each other, given that the type of the email is known (either spam or ham). For example, encountering the word deep does not suggest the presence or the absence of the word learning in the same email. Of course, we know that this is not the case, and many words tend to appear in groups (remember the discussion about n-grams). Most of the time, the assumption of independence of words is false and naive, and this is what the algorithm’s name stems from. In reality, of course, the assumption of independence allows us to solve many practical problems.

Another issue is when a word appears in the ham emails but is not present in the spam ones (say, covid). Then, according to the algorithm, its conditional probability is zero, P(“covid”|spam) = 0, which is rather inconvenient since we are going to multiply it with the other probabilities (making the outcome equal to zero). This situation is often known as the zero-frequency problem. The solution is to apply smoothing techniques such as Laplace smoothing, where the word count starts at 1 instead of 0.

Let’s see an example of this problem. In a corpus of 10,000 emails, 6,000 are ham and 4,000 are spam. The word heritage appears in 37 emails of the first category and 453 of the second one. Its conditional probabilities are the following:


For an email that contains both words (heritage and covid), we need to multiply their individual probabilities (the symbol “…” signifies the other factors in the multiplication):

To overcome this problem, we apply Laplace smoothing, adding 1 in the numerator and 2 in the denominator. As a result, the smoothed probabilities now become the following:

Notice that Laplace smoothing is a hyperparameter that you can specify before running the classification algorithm. For example, in the Python code used in the previous section, we constructed the MultinomialNB classifier using the alpha=1.0 smoothing parameter in the argument list.

This section incorporated two well-known classification algorithms, the SVM and Naïve Bayes, and implemented two versions of a spam detector. We saw how to acquire and prepare the text data for training models, and we got a better insight into the trade-offs while adjusting the hyperparameters of the classifiers. Finally, this section provided some preliminary performance scores, but we still lack adequate knowledge to assess the two models. This discussion is the topic of the next section.